University of Pisa

Department of Information Engineering

Master Degree in

Robotics and Automation Engineering

### Development of Algorithms for the Analysis and Classification of Electric Energy Consumption

Author: Luca Semeraro

Supervisors:

Prof. Eng. A. Landi, Prof. Eng. A. Franco

Examiner: Eng. E. Crisostomi

To the desk n 002344, for the support it gave me.

## Abstract

In this work we use clustering algorithms to compute the typical Italian load profile on different days of the week and in different seasons. This result can be exploited by energy providers to tailor more attractive time-varying tariffs for their customers. In addition, we propose a day-ahead load trend forecasting method based on the clustering output and historical data.

We find that better results are obtained if the clustering is performed not directly on the data, but on some features extracted from the data. Accordingly, the most conventional load features are compared for the Italian case to identify the most informative ones. This work is also described in the article "Electrical Load Clustering: the Italian case", which we submitted to the 5th IEEE PES Innovative Smart Grid Technologies (ISGT) European 2014 Conference. Finally, the proposed forecasting method, together with a classification routine based on the calendar and the Mahalanobis distance, allows us to predict the day-ahead load trend with a performance similar to that obtained by the main electricity transmission grid operator in Italy and, in general, better than that obtained with linear regressions.

The document is structured in six chapters. Chapter 1 introduces the data set used in this work. Chapter 2 and Chapter 3 describe the analysis of the typical load profile and the clustering results considering different days of the week and different seasons, respectively. In Chapter 4 we analyze the load according to different typologies of customers, and Chapter 5 illustrates the proposed forecasting algorithm and compares its results with those of other methods. Finally, in Chapter 6 we present our conclusions and proposals for future work.

## Preface

One of the envisaged features of the upcoming smart grid is the ability to seamlessly integrate power generated from renewable sources, coping with the intrinsically fluctuating nature of such intermittently available power. While this is a general objective of many countries along the road map outlined by the Kyoto protocol, some countries are already characterized by a large share of power generated from renewable sources: among others, wind plants alone provided more than 30% of electricity production in Denmark in 2012, and are planned to supply 50% of the overall demand by the year 2020 [13]. Also, Denmark is currently trying to become a system based only on renewable energy by 2050 [15]. Alongside the many activities that are planned to achieve the previous objectives from the perspective of power generation, there is also a lot of research on designing efficient demand response functionalities to adapt the power load to the power generation profile as much as possible. In particular, there is a growing interest in analyzing the characteristics and the patterns of the power consumption load, in order to develop accurate algorithms to predict future load profiles in advance. In fact, if the load could be exactly predicted some hours in advance, when accurate weather forecasts also become available, then energy providers could schedule the optimal switching on and off of conventional power plants to support the power generation from renewable sources, when needed.

Energy suppliers also have a further motivation to analyze load consumption: as the energy markets are getting more competitive and dynamic, energy suppliers have started to offer more diversified energy tariffs, following the trend of diversified bills in other more established fields (e.g., telecommunications). From this perspective, knowledge of the typical load consumption patterns of different classes of customers can be used by energy suppliers to offer tailored, and in principle more attractive, energy tariffs.

Following the previous discussion, the objective of this work is to analyze electrical energy consumption profiles in Italy, to infer the characteristic weekly and seasonal patterns of such load profiles and to develop a forecasting algorithm for the day-ahead load profile. Although the discussion here might not be extended tout-court to countries other than Italy, we still believe that some of the properties found here do have a general interpretation, and that the algorithms illustrated here can be useful for the power community in other countries as well.

Classification and clustering of time series signals is an important area of research in several fields, such as economics, engineering, finance, medicine, biology, physics, geology, and many others. Clustering refers to the ability to aggregate similar objects together, and the basic clustering operation corresponds to taking a set of N objects and grouping them into K clusters. There are three main motivations for doing so [14]:

• First, a good clustering has predictive power; in this case, we perform clustering because we believe the underlying cluster labels are meaningful, will lead to a more efficient description of our data, and will help us choose better actions. This type of clustering is sometimes called mixture density modelling, and the objective function that measures how well the predictive model is working is the information content of the data. In the load consumption example, this property suggests that clusters can be used to predict future energy consumption as well;

• Secondly, clusters allow people to compress the information into a single item, corresponding to the center of the cluster, i.e., the centroid. For instance, by classifying load consumption into two categories (working days and holidays), it is possible to identify two clusters and their two corresponding centroids. Each centroid can be used to identify the typical consumption of that type of day. Thus, it summarizes in only one 24-hour profile the information content of the load profiles during, say, the whole year. This type of clustering is sometimes called vector quantization;

• A third reason to make clusters is to identify the outliers, i.e., the cases in which clusters fail to accurately represent particular data. An example of this, in our load consumption case, is represented by working days that for some reason present a load consumption that is very close to that of holidays (e.g., working days in the middle of two holidays). Clearly, such anomalous days should be identified and not considered when building the profile of typical working days.

The adoption of clustering techniques to analyze load data is not entirely novel: for instance, reference [6] analyses the load profiles of a representative sample of Spanish residential users, by means of dynamic clustering (i.e., dynamic in the sense that the load profiles are interpreted as a time-series database). Specifically, they use the K-means clustering algorithm. Interestingly, they conclude that seasonality effects can be clearly observed, with generalized higher consumption in winter, lower energy consumption in summer, and the lowest values in autumn and spring. Also, they identify three main types of energy consumption users. The first type (the majority of users) is characterized by three ascending peaks of energy consumption (one in the morning around 8.00, another one at lunch time around 15.00, and the highest one after dinner at 22.00). The second type of consumers is characterized by a quasi-flat energy consumption profile during the day. Finally, the third type of clients involves people who consume a significant amount of energy also at night time.

A similar attempt to classify electricity customers had previously been performed, among others, in [4], in the case of 234 non-residential customers in Italy connected to the MV distribution system. Reference [4] compares several clustering algorithms, but does not try to interpret the clustering results (i.e., the habits of customers). Instead, the optimal number of clusters, and the best clustering algorithm, are evaluated on the basis of some empirical indexes. Similarly, [7] performs a clustering analysis for the specific case of a building in a university campus in Greece.

Other authors in the recent literature have tried to combine clustering techniques to predict future load, and have generally found that forecasts become more accurate. Among others, we recall references [5]-[8]. In [17], the authors find that aggregations of several consumers can also be convenient to improve forecasting accuracy.

Another interesting example comes from [2], where the authors apply clustering methods to identify groups of DMUs (Decision Making Units), in the specific case of the electricity distribution network in Finland. In this case, information regarding network lengths, number of customers and energy transmission is used to identify four classes of clients, which were later recognized to belong to rural, suburban, urban and industrial network firms.

Finally, we mention reference [19], where a sequential cluster weighted modeling (SCWM) approach was used to recognize specific electric load transients, and later used to improve energy consumption predictions.

Note that each of the previous examples addresses a single case of clustering (i.e., clustering of residential households, where clusters correspond to different families; clustering of energy transmission profiles, where clusters correspond to geographical areas; and clustering of electric load transients, where clusters correspond to single appliances). Also, clusters are built over data that are known to correspond to a given class of users, or a given class of objects. The objective of this paper is rather to investigate the overall aggregate data, and still try to infer information that can be used to distinguish different operating conditions. Also, our objective is to identify the load features that are maximally informative in separating different load clusters.

## Contents

1 Data Description

1.1 Actual load data

1.2 Day ahead forecast data

1.3 Data pre-processing

2 Weekday Pattern Clustering

2.1 Performance Index

2.2 Data sets used in clustering

2.2.1 Features definition

2.2.2 Features determination and grouping

2.2.3 Features extraction and clustering

2.3 Clustering policies

2.4 Results

2.4.1 Aggregated national data

2.4.2 Other areas

3 Seasonal Pattern Clustering

3.1 Performance index

3.2 Results

3.2.1 Aggregated national data

3.2.2 group a results

3.2.3 group b results

3.2.4 One step back

3.2.5 Mondays cluster

4 Electrical Consumption Analysis by Customers

4.1 Results

5 Electric Load Forecasting

5.1 Database Structure

5.2 Classification step

5.2.1 Daily pattern classification

5.2.2 Seasonal pattern classification

5.3 Forecasting step

5.4 Database update

5.5 Results

6 Conclusions

A Clustering Algorithms

A.1 Fuzzy c-Means

## Chapter 1

## Data Description

In this chapter we give a brief description of the data set we used. All data were taken from the database of Terna, which is the main electricity transmission grid operator in Italy, and is also the first independent operator in Europe, and the sixth in the world, for kilometers of electricity lines managed [21]. Most data of interest are contained in the transparency report section of the website. The transparency report section contains the data that, according to EC Regulation no. 1228/2003, the national transmission system operators must communicate in relation to the physical flows in transmission system operators' networks. In particular, it is possible to find information related to load, transmission and interconnection, generation, and balancing.

For the purpose of this work, we considered two different types of data sets: the actual load data set, containing the final recorded load consumption, and the day-ahead load forecast data set, containing the load forecast provided by Terna.

### 1.1 Actual load data

Actual load represents the injections of power into the grid (vertical load), including grid losses, for 7 geographical reference areas of Italy: Northern Italy, Central-Northern Italy, Central-Southern Italy, Southern Italy, Sardinia, Sicily, and the aggregated national data (Figure 1.1). The online database is updated every 24 hours and data are packed in Excel sheets, one per day, with an hourly resolution. We considered data from January 1st, 2011 to December 31st, 2013.

| Area | Regions |
|---|---|
| ITALY | All regions |
| NORTH | Aosta Valley, Piedmont, Liguria, Lombardy, Emilia-Romagna, Veneto, Trentino-Alto Adige, Friuli-Venezia Giulia |
| CENTRAL NORTH | Tuscany, Umbria, Marche |
| CENTRAL SOUTH | Lazio, Abruzzo, Campania |
| SOUTH | Molise, Apulia, Basilicata, Calabria |
| SICILY | Sicily |
| SARDINIA | Sardinia |

Figure 1.1: Italian national territory divided in areas as suggested by Terna.

In order to easily access the data, we packed them into Matlab matrices. Data were first packed in three subsets according to year, and then the data of each subset were further divided according to the reference area:

$$
T = \{T^{11}, T^{12}, T^{13}\},
$$

with

$$
T^{y} = \{T^{y,N}, T^{y,CN}, T^{y,CS}, T^{y,S}, T^{y,SI}, T^{y,SA}, T^{y,IT}\}, \quad y = 11, 12, 13,
$$

where the superscript N stands for Northern Italy, CN stands for Central-Northern Italy, CS stands for Central-Southern Italy, S stands for Southern Italy, SI stands for Sicily, SA stands for Sardinia and IT stands for the whole national territory. Since we are interested in the daily load profile, each subset $T^{y}$ was organized as a 24 × 365 × 7 3D matrix with hours of the day corresponding to rows, days of the year corresponding to columns and areas corresponding to layers. Let us take the subset of year 2013 associated with the whole national territory as an example. We can write it as:

$$
X^{13,IT} =
\begin{bmatrix}
x^{13,IT}_{1} & x^{13,IT}_{2} & \dots & x^{13,IT}_{365}
\end{bmatrix}
=
\begin{bmatrix}
t_{1,1} & t_{2,1} & \dots & t_{i,1} & \dots & t_{365,1} \\
t_{1,2} & t_{2,2} & \dots & t_{i,2} & \dots & t_{365,2} \\
\vdots & \vdots & & \vdots & & \vdots \\
t_{1,j} & t_{2,j} & \dots & t_{i,j} & \dots & t_{365,j} \\
\vdots & \vdots & & \vdots & & \vdots \\
t_{1,24} & t_{2,24} & \dots & t_{i,24} & \dots & t_{365,24}
\end{bmatrix}
\in \mathbb{R}^{24 \times 365},
$$

$$
t_{i,j} \in T^{13,IT} \quad \forall\, i = 1, \dots, 365;\ j = 1, \dots, 24,
$$

while the whole 3D matrix associated to year 2013 is given by:

$$
X^{13} = \operatorname{stack}(X^{13,N}, X^{13,CN}, X^{13,CS}, X^{13,S}, X^{13,SI}, X^{13,SA}, X^{13,IT}) \in \mathbb{R}^{24 \times 365 \times 7}.
$$

Note that we treated 2012 as a non-leap year, so that it is composed of 365 days (the load values of February 29th were discarded). Figure 1.2 shows the 3D representation of the aggregated Italian daily load trends for year 2013.
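The organization described above can be illustrated with a minimal Python sketch (the thesis itself used Matlab matrices). The names `year_matrix`, `stack` and `hourly_by_day` are hypothetical, and the input is assumed to be a dictionary mapping each calendar day to its 24 hourly load values; this is only one possible way to build the 24 × 365 × 7 structure.

```python
from datetime import date, timedelta

# Layer order used in the text: N, CN, CS, S, SI, SA, IT.
AREAS = ["N", "CN", "CS", "S", "SI", "SA", "IT"]


def year_matrix(hourly_by_day, year):
    """Return a 24 x 365 matrix (a list of 24 rows) for one area and one year.

    hourly_by_day is assumed to map a datetime.date to a list of 24 hourly
    load values. February 29th is skipped, so every year has 365 columns.
    """
    days = []
    d = date(year, 1, 1)
    while d.year == year:
        if not (d.month == 2 and d.day == 29):
            days.append(d)
        d += timedelta(days=1)
    # Rows are hours of the day, columns are days of the year.
    return [[hourly_by_day[day][h] for day in days] for h in range(24)]


def stack(matrices_by_area):
    """Stack the per-area 24 x 365 matrices into a 24 x 365 x 7 nested list."""
    return [
        [[matrices_by_area[area][h][c] for area in AREAS] for c in range(365)]
        for h in range(24)
    ]
```

With this layout, `x[h][c][k]` is the load at hour `h` of day `c` in area `AREAS[k]`, using zero-based indices.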


### 1.2 Day ahead forecast data

Terna provides a forecast of the electric energy consumption with the same granularity as the actual load data set. The algorithm used by the provider to forecast the load consumption is described in the Grid Code [20], and it takes into account several factors such as the actual load of a number of days taken as models, the meteorological conditions (i.e. cloud cover and maximum daily temperature) and some specific socio-economic events that may significantly affect the load consumption (e.g. particular holidays, industrial actions of some categories, and so on). We will refer to such a database as:

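$$
\hat{X}_{T}^{y} = \operatorname{stack}(\hat{X}_{T}^{y,N}, \hat{X}_{T}^{y,CN}, \hat{X}_{T}^{y,CS}, \hat{X}_{T}^{y,S}, \hat{X}_{T}^{y,SI}, \hat{X}_{T}^{y,SA}, \hat{X}_{T}^{y,IT}) \in \mathbb{R}^{24 \times 365 \times 7}, \quad y = 11, 12, 13.
$$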

It is important to say that Terna does not provide the day-ahead load forecast for the whole national territory, so we computed it by summing the predicted load values of all the other areas.

### 1.3 Data pre-processing

By visual inspection, we noticed that the actual load data set has some irregularities with respect to the general load trend. We classified them into three groups.

One-hour. It comprises peaks limited to one or at most two consecutive load values. Most of the time, this kind of irregularity is located in the first part of the daily load trend, where the load value is higher than the one we expected to see (Figure 1.3). One-hour peaks are equally distributed in all years and in all areas.

Half-day. In this case the irregularity is limited to a number of consecutive hours between 4 and 12. There are two typical trends associated with half-day irregularities: in the first case, the load drops and assumes a flat trend while, in the second one, it seems to be amplified. Generally this irregularity is characterized by large discontinuities at the borders of the time interval. It may also happen that the variation of the load begins in the evening of a day and lasts for the whole night, ending during the next day (Figure 1.4).

Figure 1.3: Example of one-hour irregularity.

Figure 1.4: Example of half-day irregularity. The load values highlighted in red on the left of the figure belong to the same day, while the irregularity on the right of the figure starts on the night of August 27th and ends on August 28th.

All-day. This is the least frequent irregularity and involves the load values of a whole day. When it happens, the daily trend does not have any big discontinuity, but the value of the load is always much greater than in any other daily trend. An example of such an irregularity is shown in Figure 1.5.

Figure 1.5: Example of all-day irregularity. The daily trend associated with November 8th, 2012 (red) is much higher than all the other ones.

The load trend can be affected by a number of different events. For example, one-hour irregularities could be associated with an acquisition fault, half-day ones could be the consequence of a strike or a particular social event, while all-day ones could be associated with a particularly hot or cold day. Unfortunately, we were not able to understand whether an irregularity corresponds to the real value of the load, due to a particular and unpredictable event, or whether it is just a representation fault. Also, we noticed that the $\hat{X}_{T}$ data set does not have any irregularity. We chose to process only one-hour irregularities, because smoothing one-hour peaks according to the load values in their neighborhood does not lead to a large variation of the information contained in the data themselves. On the contrary, the lack of information about the nature of half-day and all-day irregularities, and the large difference between them and the rest of the load trend, led us not to process them and to use such data as-is.
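As an illustration of the one-hour smoothing, the following Python sketch replaces an isolated peak with the mean of its two neighbours. The detection rule (a relative threshold `rel_threshold`, and the requirement that the two neighbours agree with each other) is an assumption made for this example, since the exact criterion is not specified here; the sketch also handles only single-value peaks, not pairs of consecutive values.

```python
def smooth_one_hour_peaks(load, rel_threshold=0.15):
    """Replace isolated one-hour spikes with the mean of their two neighbours.

    load is a sequence of consecutive hourly values. A value is treated as a
    spike (assumed rule) when its two neighbours are close to each other but
    the value itself is far from their mean.
    """
    smoothed = list(load)
    for h in range(1, len(load) - 1):
        left, right = load[h - 1], load[h + 1]
        neighbour_mean = (left + right) / 2.0
        tolerance = rel_threshold * abs(neighbour_mean)
        neighbours_agree = abs(left - right) <= tolerance
        is_spike = abs(load[h] - neighbour_mean) > tolerance
        if neighbours_agree and is_spike:
            smoothed[h] = neighbour_mean
    return smoothed
```

Requiring the neighbours to agree keeps the values adjacent to a spike, and regular ramps of the load, from being modified.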


Finally, the data in both sets, $X$ and $\hat{X}_{T}$, are expressed in kW. In order to have a numerically more convenient set to work with, we scaled all values by a factor equal to $10^{-3}$, so that the load is expressed in MW.

## Chapter 2

## Weekday Pattern Clustering

The search for characteristic patterns in electric consumption starts from a visual inspection of the aggregated national actual load data. Let us consider as an example the electric consumption from May 9th to May 22nd, 2011 (Figure 2.1). It is easy to note how the daily load pattern has a similar trend for all weekdays from Monday to Friday, while it decreases on Saturdays and is even smaller on Sundays.

Figure 2.1: Aggregated Italian load values for the days from May 9th to May 22nd, 2011. The load trend is similar from Monday to Friday (blue), it decreases on Saturday (violet) and, finally, it is very low on Sunday (orange).

This suggests that there exists a correspondence between the daily load profile and the specific weekday it refers to. Accordingly, the load can be split into three clusters, each corresponding to one of the weekday categories just described. In order to generalize this result, we searched for the weekly pattern, from Monday to Sunday, over the whole data set, finding that it keeps occurring in all years, from January to December, with some exceptions (Figure 2.2).

Figure 2.2: Examples of exceptions in the weekly pattern repetition. (a) Load trend during the summer holidays from August 1st to August 28th, (b) load trend during the Christmas holidays from December 19th to January 15th, (c) load trend around the Republic Day on June 2nd.

The characteristic week partition becomes less evident during the conventional summer holidays (Figure 2.2.(a)), during the conventional Christmas holidays (Figure 2.2.(b)) and, in general, in proximity to conventional Italian holidays (in Figure 2.2.(c) the days around the Republic Day are shown). We noticed that, when this happens, the load clearly decreases, and even on working days it seems to be more similar to the one associated with Saturdays or Sundays. This can be explained considering that most industries and commercial activities are closed during holidays, so that the electric consumption drops.

Starting from the previous considerations, it appears that a first partitioning of the load consumption relies on the correspondence between the daily load profile and the typology of weekday, as introduced above. On one hand, the information required for such a clustering is all available in one easy-to-use tool: the calendar. On the other hand, it is necessary to establish some rules in order to assign a class to irregular days. In order to solve this issue, we created a model that describes which day belongs to which cluster. The model is a collection of rules designed considering all data of year 2011 together with the calendar, and it describes the following three clusters:

Working days. It includes all days from Monday to Friday, from January to December. We assigned to this cluster the label 1.

Pre-holidays. It includes

• all Saturdays,

• long weekends (days within two holidays close together),

• all working days in the weeks before and containing August 15th.

We assigned to this cluster the label 2.

Holidays. It includes

• all Sundays,

• all conventional Italian holidays.

We assigned to this cluster the label 3.

The model was validated by assigning each day of years 2012 and 2013 to a cluster, considering only the information given by the calendar, that is to say without looking at the actual load trend. Then, we compared the actual load of each day with the one it is assumed to have, according to the definition of the clusters. We found a perfect correspondence except in summer where, for both 2012 and 2013, we had to slightly modify the rules:

• all working days in the weeks before and after the one containing August 15th are considered as pre-holidays in 2012,

• all working days in the week after the one containing August 15th are considered as pre-holidays in 2013.
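The calendar rules designed on the 2011 data can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, only the fixed-date Italian public holidays are listed (movable ones such as Easter Monday must be passed through `extra_holidays`), a "long weekend" day is interpreted as a Monday-to-Friday day whose previous and next days are both non-working, and "the weeks before and containing August 15th" is interpreted as the Monday-to-Sunday week containing August 15th plus the preceding one. The year-specific adjustments listed above for 2012 and 2013 are not included.

```python
from datetime import date, timedelta

# Fixed-date Italian public holidays as (month, day) pairs.
FIXED_HOLIDAYS = {
    (1, 1), (1, 6), (4, 25), (5, 1), (6, 2),
    (8, 15), (11, 1), (12, 8), (12, 25), (12, 26),
}


def is_holiday(d, extra_holidays=frozenset()):
    """Sundays, fixed-date holidays and any extra (e.g. movable) holidays."""
    return (
        d.weekday() == 6
        or (d.month, d.day) in FIXED_HOLIDAYS
        or d in extra_holidays
    )


def is_non_working(d, extra_holidays=frozenset()):
    """Saturdays and holidays (assumed meaning of 'non-working' here)."""
    return d.weekday() == 5 or is_holiday(d, extra_holidays)


def in_mid_august_weeks(d):
    """True in the week containing August 15th and in the week before it."""
    aug15 = date(d.year, 8, 15)
    monday = aug15 - timedelta(days=aug15.weekday())
    return monday - timedelta(days=7) <= d <= monday + timedelta(days=6)


def label_day(d, extra_holidays=frozenset()):
    """Return 1 (working day), 2 (pre-holiday) or 3 (holiday)."""
    if is_holiday(d, extra_holidays):
        return 3
    if d.weekday() == 5:
        return 2
    one_day = timedelta(days=1)
    if is_non_working(d - one_day, extra_holidays) and is_non_working(
        d + one_day, extra_holidays
    ):
        return 2  # bridge day of a long weekend (assumed interpretation)
    if in_mid_august_weeks(d):
        return 2
    return 1
```

For example, under these assumptions `label_day(date(2011, 6, 3))` returns 2, because that Friday falls between the Republic Day and a Saturday.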


We refer to such clusters as the target clusters, and we define the gold set as the vector containing the ordered sequence of labels associated with each day of the data set:

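$$
\begin{aligned}
G^{11} &= [\,3\ \ 2\ \ 1\ \dots\ 3\,] \in \mathbb{R}^{1 \times 365} \\
G^{12} &= [\,3\ \ 1\ \ 1\ \dots\ 3\,] \in \mathbb{R}^{1 \times 365} \\
G^{13} &= [\,3\ \ 1\ \ 2\ \dots\ 3\,] \in \mathbb{R}^{1 \times 365}
\end{aligned}
\quad \rightarrow \quad
G = [\,G^{11}\ \ G^{12}\ \ G^{13}\,]. \tag{2.1}
$$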

Each cluster in the model can be represented by the mean of all the daily trends within it, the center of the cluster, which summarizes the main features of all the daily trends within the cluster. Figure 2.3 shows the centers of the working days, pre-holidays and holidays clusters, computed on the whole data set $X$. All points of $X$ associated with the label 1 in $G$ form the working days cluster, and its center is obtained by computing the mean of all load values within the cluster, hour by hour. In the same way we obtain the pre-holidays and holidays centers. It is clear that, even if the individual peculiarities of each weekday trend cannot be represented by the center (e.g. the center of the working days cluster has just two peaks, while each working day pattern in Figure 2.1 has three peaks), by looking at the centers we are still able to identify the typical trend of each cluster.
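The hour-by-hour mean that defines each center can be written as a short Python sketch. The function name `cluster_centers` is hypothetical, and the input is assumed to be a list of 24-element daily profiles together with the list of their labels, in the same order.

```python
def cluster_centers(profiles, labels):
    """Return {label: center}, where each center is the hour-by-hour mean
    of all the daily profiles that carry that label."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for h, value in enumerate(profile):
            acc[h] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in acc]
        for label, acc in sums.items()
    }
```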

Figure 2.3: Centers of clusters working days (blue), pre-holidays (violet) and holidays (orange) obtained considering all data from 2011 to 2013.


Figure 2.4: Gold set projection on the actual load data for years 2011 (top), 2012 (middle) and 2013 (bottom). Differences in modeling the load trend are limited to the summer holidays, in the middle of all three plots.

In Figure 2.4 we projected the gold set G on the aggregated national data set obtaining a visual division of weekday trends.

The goal of this chapter is to test the ability of two clustering algorithms, Fuzzy c-Means (FCM) and Linkage-Ward (LW) (see Appendix A), to partition the aggregate actual load data set into three clusters corresponding to working days, pre-holidays and holidays. Besides this, we want to extend a similar procedure to each area specified by Terna, with the same gold set G. In the following sections we detail each step of the test and present some results.

### 2.1 Performance Index

The clusters obtained from the clustering process should be equal to the target ones or, equivalently, the clustering labels vector L should correspond to the gold set G. In order to measure how much the clustering output resembles the model, we defined a performance index, called matching percentage, that expresses how many days are labeled in the same way both in L and in G. Recalling the definition of the gold set G given in equation 2.1, the clustering labels vector L is defined in the same way:

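$$
\begin{aligned}
L^{11} &= [\,3\ \ 2\ \ 1\ \dots\ 3\,] \in \mathbb{R}^{1 \times 365} \\
L^{12} &= [\,3\ \ 1\ \ 1\ \dots\ 3\,] \in \mathbb{R}^{1 \times 365} \\
L^{13} &= [\,3\ \ 1\ \ 2\ \dots\ 3\,] \in \mathbb{R}^{1 \times 365}
\end{aligned}
\quad \rightarrow \quad
L = [\,L^{11}\ \ L^{12}\ \ L^{13}\,]. \tag{2.2}
$$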

We call mismatch points those days belonging to different clusters in the gold set and in the clustering label vector, and we store them in the vector $p_{mm}$:

$$
p_{mm} = \{\, i \in \mathbb{N} \mid G(1, i) \neq L(1, i),\ i = 1, \dots, \#(G) \,\},
$$

where $i$ represents the index associated with each day in the gold set, i.e. its position in the vector.

Once all mismatch points are known, we can compute the matching percentage index as

$$
M_{\%} = \left(1 - \frac{\#(p_{mm})}{\#(G)}\right) \cdot 100, \tag{2.3}
$$

where $\#(\cdot)$ represents the cardinality of the vector. The higher $M_{\%}$ is, the more the gold set clusters and the clustering result resemble each other.
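A minimal Python sketch of the mismatch points and of the matching percentage of equation 2.3 is given below. The function names are hypothetical, and the two label vectors are assumed to be plain lists of equal length.

```python
def mismatch_points(gold, labels):
    """Return the 1-based positions of the days labeled differently in G and L."""
    if len(gold) != len(labels):
        raise ValueError("gold set and label vector must have the same length")
    return [
        i for i, (g, l) in enumerate(zip(gold, labels), start=1) if g != l
    ]


def matching_percentage(gold, labels):
    """Matching percentage M% = (1 - #(p_mm) / #(G)) * 100."""
    p_mm = mismatch_points(gold, labels)
    return (1 - len(p_mm) / len(gold)) * 100
```

For instance, with `gold = [3, 2, 1, 1, 3]` and `labels = [3, 1, 1, 1, 3]` only the second day is a mismatch point, so four days out of five match.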

Note that the labels obtained from clustering algorithms are randomly assigned to each cluster. As a consequence, a point could be labeled differently in G and in L even if it belongs to the same cluster in both sets. If this happens, the index $M_{\%}$ is no longer meaningful. This is a point of utmost importance for the rest of the work: we have to ensure that the cluster $l_i \in L$ that most resembles the cluster $g_j \in G$ is labeled in the same way as the latter. So, every time we partition a data set using one of the clustering algorithms, we modify L in order to re-assign the labels and to associate with each cluster in L the label of the cluster in G that it most resembles. This is done by considering all the possible label assignments in L and assigning to each of them the corresponding Hubert-Arabie adjusted Rand index, according to the well-known formulation [11]:

$$
C = (N_{tp} + N_{fn})(N_{tp} + N_{fp}) + (N_{fp} + N_{tn})(N_{fn} + N_{tn}),
$$

$$
ARI = \frac{N_{pairs}(N_{tp} + N_{tn}) - C}{N_{pairs}^{2} - C},
$$

where $N_{tp}$ is the number of pairs of true positives, $N_{tn}$ is the number of pairs of true negatives, $N_{fp}$ is the number of pairs of false positives, $N_{fn}$ is the number of pairs of false negatives and $N_{pairs}$ is the number of pairs in the set. The higher the ARI is, the closer the clustering labels vector is to the gold set, so we choose as the best re-labeling the one corresponding to the highest value of the ARI index.
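The pair counts and the adjusted Rand index can be computed with the following Python sketch. It uses the standard Hubert-Arabie form, with $C = (N_{tp}+N_{fn})(N_{tp}+N_{fp}) + (N_{fp}+N_{tn})(N_{fn}+N_{tn})$. The function names are hypothetical, and the convention for false positives and false negatives stated in the comments is an assumption (the index is symmetric with respect to it).

```python
from itertools import combinations


def pair_counts(gold, labels):
    """Count the pairs of days by agreement between gold set and clustering.

    Assumed convention: a true positive is a pair grouped together in both
    vectors, a true negative is a pair separated in both, a false positive
    is together only in the clustering, a false negative only in the gold set.
    """
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = labels[i] == labels[j]
        if same_gold and same_pred:
            tp += 1
        elif not same_gold and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def adjusted_rand_index(gold, labels):
    """Hubert-Arabie adjusted Rand index computed from the pair counts."""
    tp, tn, fp, fn = pair_counts(gold, labels)
    n_pairs = tp + tn + fp + fn
    c = (tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)
    denominator = n_pairs ** 2 - c
    if denominator == 0:
        return 1.0  # degenerate case, e.g. a single cluster in both vectors
    return (n_pairs * (tp + tn) - c) / denominator
```

The loop over all pairs is quadratic in the number of days, which is acceptable for a data set of about a thousand days.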

The relabeling process could take a long time if the number of clusters to be extracted, and hence the number of labels to be re-assigned, is high. In this work we search for a maximum of three clusters, which gives six possible label assignments, so the method is fast enough. We did not test it on a higher number of labels, so we cannot guarantee its validity and efficiency in any other case.
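The exhaustive search over label assignments can be sketched in Python as follows. Because an index computed from pair counts does not depend on the names of the labels, this sketch scores each candidate assignment by the number of days whose label agrees with the gold set; this scoring rule is an assumption of the example and not necessarily the criterion used in this work. The function name `relabel` is hypothetical.

```python
from itertools import permutations


def relabel(gold, labels):
    """Rename the clustering labels so that they best agree with the gold set.

    Every injective mapping from clustering labels to gold labels is tried;
    with three clusters this means 3! = 6 candidate assignments. If the
    clustering has more distinct labels than the gold set, no mapping exists
    and the labels are returned unchanged.
    """
    gold_ids = sorted(set(gold))
    pred_ids = sorted(set(labels))
    best, best_score = list(labels), -1
    for perm in permutations(gold_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        candidate = [mapping[l] for l in labels]
        # Assumed score: number of days labeled as in the gold set.
        score = sum(g == c for g, c in zip(gold, candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best
```

The number of candidates grows factorially with the number of clusters, which is why the approach is only practical for a small number of labels.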


### 2.2 Data sets used in clustering

Finding a partition of the data set as close to the gold set as possible is not an easy task, also because of the high dispersion of the daily patterns. In Figure 2.5 the daily pattern of each day of year 2013 is colored according to the cluster to which it belongs in G.

Figure 2.5: Within-cluster dispersion of the gold set clusters. All patterns of the same color belong to the same cluster (blue for working days, violet for pre-holidays and orange for holidays), while the black curves represent the center of each cluster.

It is easy to note how each cluster contains a number of patterns far away from the center, in both directions and, in particular, in the direction of the other centers. As a consequence, the clusters are not well separated; in fact, they overlap each other. This is not a good starting point, because it is plausible that the clustering algorithms will partition the data into clusters that are well separated according to the clustering criterion but that differ from the gold set. In light of this consideration, we decided to apply both clustering techniques to three different kinds of data sets.

Raw data. It represents the actual load as-is, it is the data set X , as described in Chapter 1. We will refer to such a data set with the abbreviation R.


have a zero mean and a unitary standard deviation. We will refer to such a data set with the abbreviation RD.

Pre-processed data. This data set is designed to highlight some peculiarities of the gold set clusters. Each day is represented as a vector of features extracted from the raw data. It is important to say that in the pre-processed data set each point is no longer a 24-element vector, but is represented by a certain number of features that we consider meaningful for the representation of the cluster. We will refer to such a data set with the abbreviation F.

Using a pre-processed data set, we want to project the data into a smaller space (the total number of features is smaller than the 24 hours of a day), trying to obtain a well-separated representation of the gold set clusters. In this way we want to steer the clustering process towards the specified partition. In the following sections we explain which features we use, how to compute them and how to select the most significant ones.

### 2.2.1 Features definition

The feature set F is composed of a mixture of standard features in data analysis (such as mean, standard deviation, etc.) and features created ad hoc for the problem we are dealing with.

F = [fi] , i = 1, . . . , 19.

Some features are computed considering the entire 24-element long daily trend while some other ones are dened on a portion of it. Table 2.1 summarizes all features we used, with a brief description of each.

### 2.2.2 Features determination and grouping

Features are extracted from each point of the raw data set, then they are grouped according to the gold set cluster the point belongs to. The information contained in each feature is compressed using two values: the mean of the feature and its standard deviation within each cluster of G. The mean and the dispersion of the features are easy to represent, as shown in Figure 2.6, in which each subplot is associated with a feature $f_i$, each colored circle represents the mean of the feature in each gold set cluster, $\bar{f}_i^{c_i}$, and each colored segment represents the dispersion around the mean of the feature in the same cluster, $\sigma_{f_i}^{c_i}$.

| Feature | Definition |
|---|---|
| Daily Load | Sum of daily load values |
| Daily Mean | Mean of daily load values |
| Daily Variance | Variance of daily load values |
| Min-max | Difference between the maximum and the minimum value of the daily load |
| Max Peak | Maximum value of the daily load |
| Peak Hour | Hour of the day at which we have the maximum value of the daily load |
| Early Morning Mean | Mean of load values between 12.00am and 05.00am |
| Morning Slope | Difference between the load value at 09.00am and at 05.00am |
| Night Slope | Difference between the load value at 11.00pm and at 08.00pm |
| Partial Daily Mean | Mean of daily load values between 10.00am and 07.00pm |
| Partial Daily Variance | Variance of the daily load values between 10.00am and 07.00pm |
| Partial Min-max | Difference between the maximum and the minimum value of the load between 10.00am and 07.00pm |
| Partial Min-diff | Difference between the average load and the minimum load between 10.00am and 07.00pm |
| Partial Max-diff | Difference between the maximum load and the average load between 10.00am and 07.00pm |
| Late Morning | Difference between the load value at 10.00am and at 02.00pm |
| Late Afternoon | Difference between the load value at 08.00pm and at 05.00pm |
| Kurtosis | Kurtosis of daily load values |
| FFT Peak | Maximum of the absolute values of the Fast Fourier Transform of the daily load values |
| FFT | Sum of the absolute values of the Fast Fourier Transform of the daily load values |

Table 2.1: Features list

It is important to note that the features are not homogeneous. In order to compare their values, we normalized them to vary between 0 and 1.
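As an illustration, the following minimal sketch computes a handful of the features of Table 2.1 from one daily load vector and rescales a feature to the range [0, 1]. It rests on assumptions that are not stated in the text: the function names `extract_features` and `normalize` are hypothetical, `load[h]` is taken to be the load at hour h (0 = 12.00am), the variance is the population variance, and the normalization is a plain min-max rescaling across days.

```python
from statistics import mean, pvariance
from typing import Dict, List


def extract_features(load: List[float]) -> Dict[str, float]:
    """Compute a few Table 2.1 features from one 24-value daily load.

    Assumption: load[h] is the load at hour h (0 = 12.00am, 23 = 11.00pm).
    """
    if len(load) != 24:
        raise ValueError("expected 24 hourly values")
    return {
        "daily_mean": mean(load),
        "daily_variance": pvariance(load),   # assumption: population variance
        "min_max": max(load) - min(load),
        "peak_hour": float(load.index(max(load))),
        "morning_slope": load[9] - load[5],  # 09.00am minus 05.00am
        "night_slope": load[23] - load[20],  # 11.00pm minus 08.00pm
    }


def normalize(values: List[float]) -> List[float]:
    """Rescale the values of one feature across all days to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]
```

In this sketch `normalize` would be applied column by column, that is, to the list of values that one feature takes over all the days of the data set.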


Figure 2.6: Features comparison for the aggregated national data set. Each subplot represents one feature of the features set F and each color is associated to one cluster of G (blue for working days, violet for pre-holidays and orange for holidays).

### 2.2.3 Features extraction and clustering

Looking at Figure 2.6 we can highlight two aspects. First, the less the dispersion of a feature within a cluster overlaps with the dispersion of the same feature in the other clusters, the better that feature separates the clusters. Second, the smaller the dispersion of a feature within a cluster, the more we can consider that feature a peculiarity of that particular cluster. We are therefore searching for features whose values are well separated between clusters and have a small dispersion within each cluster. For example, the feature Morning Slope is a good marker to separate clusters, while Partial Min-max carries no useful information.
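The two criteria above can be turned into a ranking in several ways; the text does not give a formula for the overlap. The sketch below is one possible reading, under the assumption that the dispersion of a feature in a cluster is the interval [mean − std, mean + std] and that the overlap is the total length shared by those intervals over all pairs of clusters. The names `dispersion_overlap` and `rank_features` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# (mean, std) of one feature in each gold set cluster
ClusterStats = List[Tuple[float, float]]


def dispersion_overlap(stats: ClusterStats) -> float:
    """Total pairwise overlap of the intervals [mean - std, mean + std]."""
    total = 0.0
    for (m1, s1), (m2, s2) in combinations(stats, 2):
        low = max(m1 - s1, m2 - s2)
        high = min(m1 + s1, m2 + s2)
        total += max(0.0, high - low)  # disjoint intervals contribute 0
    return total


def rank_features(features: Dict[str, ClusterStats]) -> List[str]:
    """Sort feature names from most to least separating.

    Primary key: dispersion overlap (ascending).
    Tie-break: mean within-cluster standard deviation (ascending).
    """
    def key(name: str) -> Tuple[float, float]:
        stats = features[name]
        mean_std = sum(s for _, s in stats) / len(stats)
        return (dispersion_overlap(stats), mean_std)

    return sorted(features, key=key)
```

With normalized features, a feature whose three cluster intervals are disjoint gets overlap 0 and is ranked first; among such features, the one with the tighter clusters wins.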

We therefore want to extract a group of features that we think could help the clustering. The extraction is done automatically using a greedy forward approach [9], described in the following procedure:

1. Set $n_{max}$. It is a tuning parameter and corresponds both to the maximum number of features processed in each step and to the maximum length of the feature set used in clustering, i.e. the maximum number of features used to represent each load trend. Experimental results show that choosing $n_{max}$ higher than 5 does not appreciably increase the clustering performance, so in this work we set $n_{max} = 5$.

2. Compute each feature $f_i \in F$ for each point $x_i \in X$. For each feature and for each target cluster compute $\bar{f}$ and $\sigma_f$.

3. Sort the feature set F in ascending order, according to how much the dispersion of each feature overlaps among all clusters:

   $$\tilde{F}^{I} = \mathrm{sort}(F),$$

   where I stands for step 1. The first element of $\tilde{F}^{I}$ corresponds to the most meaningful feature, that is to say the feature that assumes well-separated values from one cluster to another. If two or more features have completely separate dispersions, they are further sorted according to the mean within-cluster dispersion, so that the closer the feature values are to the mean of each cluster, the more relevant the feature is.

4. Select the first $n_{max}$ features from $\tilde{F}^{I}$:

   $$\tilde{F}^{I}_{n_{max}} = \tilde{F}^{I}(1 : n_{max}).$$

5. For each feature $f_i^{I} \in \tilde{F}^{I}_{n_{max}}$ build a data set in which each point is represented by a one-dimensional vector containing the corresponding value of $f_i^{I}$:

   $$X_F^{I}(i) = \begin{bmatrix} f_i^{I}(1) \\ f_i^{I}(2) \\ \vdots \\ f_i^{I}(365) \end{bmatrix}, \quad i = 1, \dots, n_{max},$$

   where $(\cdot)$ indicates the day of the year.

6. For each data set $X_F^{I}(i)$ perform the clustering and compute the matching percentage index $M_\%^{I}(i)$, as defined in Equation 2.3. Select as the best feature of the first step, $f_*^{I}$, the feature corresponding to the best value of the performance index and delete it from $\tilde{F}^{I}$, obtaining $\tilde{F}^{II}$.

7. Select again the best $n_{max}$ features:

   $$\tilde{F}^{II}_{n_{max}} = \tilde{F}^{II}(1 : n_{max}).$$

   For each of them, create a data set in which each point is represented as a two-element vector

   $$X_F^{II}(i) = \begin{bmatrix} f_*^{I}(1) & f_i^{II}(1) \\ f_*^{I}(2) & f_i^{II}(2) \\ \vdots & \vdots \\ f_*^{I}(365) & f_i^{II}(365) \end{bmatrix}, \quad i = 1, \dots, n_{max},$$

   and repeat the clustering process, extracting the best couple $(f_*^{I}, f_*^{II})$.

8. Repeat the procedure until the number of features considered is equal to $n_{max}$ and the data set to be clustered is

   $$X_F^{n_{max}}(i) = \begin{bmatrix} f_*^{I}(1) & f_*^{II}(1) & \dots & f_i^{n_{max}}(1) \\ f_*^{I}(2) & f_*^{II}(2) & \dots & f_i^{n_{max}}(2) \\ \vdots & \vdots & \ddots & \vdots \\ f_*^{I}(365) & f_*^{II}(365) & \dots & f_i^{n_{max}}(365) \end{bmatrix}, \quad i = 1, \dots, n_{max}.$$

When the search stops we have $n_{max}$ vectors of best features, with dimension increasing from 1 to $n_{max}$, and a matching percentage for each of them. The highest matching percentage defines the best set of features $X_F^{*}$ to consider in order to obtain a partition of T as similar as possible to the partition defined in G. A greedy strategy does not in general produce an optimal solution, since it only makes the locally optimal choice at each step. Nonetheless, the chosen greedy heuristic, together with the prior sorting of the features vector, allows us to always consider the best feature set with a fixed maximum dimension.
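The procedure can be sketched in a few lines of Python. This is only an outline under stated assumptions, not the code used in this work: the function name `greedy_forward_selection` is hypothetical, the features are assumed to be already sorted as in step 3, and the callback `score` is a stand-in for "cluster the data set built from these features and return its matching percentage".

```python
from typing import Callable, List, Sequence, Tuple


def greedy_forward_selection(
    ranked_features: Sequence[str],
    score: Callable[[List[str]], float],
    n_max: int = 5,
) -> Tuple[List[str], float]:
    """Return the best-scoring nested feature subset of size <= n_max.

    ranked_features: feature names already sorted by dispersion overlap.
    score: hypothetical callback, higher is better (e.g. matching percentage).
    """
    remaining = list(ranked_features)
    selected: List[str] = []          # f*_I, f*_II, ... chosen so far
    best_subset: List[str] = []
    best_score = float("-inf")

    while remaining and len(selected) < n_max:
        candidates = remaining[:n_max]  # first n_max features still available
        # Try each candidate appended to the features chosen in earlier steps.
        scored = [(score(selected + [f]), f) for f in candidates]
        step_score, step_best = max(scored, key=lambda pair: pair[0])
        selected.append(step_best)
        remaining.remove(step_best)
        # Keep the subset with the highest score over all steps.
        if step_score > best_score:
            best_score = step_score
            best_subset = list(selected)

    return best_subset, best_score
```

At most $n_{max}$ candidates are scored in each of at most $n_{max}$ steps, so the number of clustering runs is bounded by $n_{max}^2$, far fewer than an exhaustive search over all subsets of the 19 features.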

### 2.3 Clustering policies

In addition to the three different data sets to work with, we chose to follow two different clustering policies. We will refer to them as Direct clustering (denoted as DC) and Nested clustering (denoted as NC).


Figure 2.7: Pre-processed data set clustering pipeline.

In the first case we partition the data set directly into the three clusters, searching for the working days, pre-holidays and holidays clusters. In nested clustering, instead, we use a two-step procedure. In the first step we search for two clusters corresponding to working days and non-working days. In the second step the cluster denoted as non-working days is further clustered in order to search for pre-holidays and holidays. There are some remarks


both steps so we repeat the two-step procedure for both clustering algorithms. Secondly, the performance evaluation is done after the second step, considering a clustering label vector obtained by merging the ones produced in each step. Related to this last point, the second-step clustering is performed on the subset corresponding to the non-working days cluster obtained from the first step, not on the corresponding one expressed in the gold set. In this way we take into account all mismatch points. Finally, when applying nested clustering to a pre-processed data set, the feature selection is reset after the first step. This means that the search for the best feature set in the second step starts again from all the available features in F. Figure 2.8 summarizes the procedure just described.
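The merging of the two label vectors can be illustrated with a short sketch. It is independent of the clustering algorithm: `cluster` is a hypothetical callback that takes a list of points and a number of clusters and returns one integer label per point. The sketch also assumes, for illustration only, that the larger first-step cluster is the working days one; the text does not say how the two first-step clusters are identified.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
# Hypothetical clustering routine: (points, k) -> one label in range(k) per point.
Clusterer = Callable[[List[Point], int], List[int]]


def nested_clustering(points: List[Point], cluster: Clusterer) -> List[int]:
    """Two-step clustering with merged labels.

    Returns 0 for the first-step majority cluster and 1 or 2 for the two
    sub-clusters found inside the first-step minority cluster.
    """
    step1 = cluster(points, 2)
    # Assumption: the larger first-step cluster is the working days one.
    majority = 0 if step1.count(0) >= step1.count(1) else 1
    minority_idx = [i for i, lab in enumerate(step1) if lab != majority]

    merged = [0] * len(points)
    if minority_idx:
        # Step 2 runs on the cluster found in step 1, not on the gold set,
        # so first-step mismatches propagate to the final evaluation.
        step2 = cluster([points[i] for i in minority_idx], 2)
        for i, lab in zip(minority_idx, step2):
            merged[i] = 1 + lab
    return merged
```

The merged vector is the one that would be compared with the gold set when computing the matching percentage.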

Figure 2.8: Direct clustering pipeline (left) vs nested clustering pipeline (right). Both approaches have to be run twice: first using Fuzzy c-Means and then using Linkage-Ward.

### 2.4 Results

For each area defined by Terna, clustering is performed using two different policies (direct and nested clustering), on three different kinds of data set (raw data, demeaned data and pre-processed data), using two different clustering algorithms (Fuzzy c-Means and Linkage-Ward). So, for each electric load data set, we perform 2 × 3 × 2 = 12 tests. In the following sections we report the results obtained for the weekday pattern clustering. First, results on aggregated load data are presented, and then results on the other areas are summarized.

### 2.4.1 Aggregated national data

Table 2.2 shows the matching percentage for each test performed on the aggregated national data.

| ITALY | FCM (DC) | LW (DC) | FCM (NC) | LW (NC) |
|---|---|---|---|---|
| R | 59.54% | 66.67% | 88.40% | 90.23% |
| RD | 59.27% | 72.24% | 75.80% | 71.78% |
| F | 94.43% | 95.25% | 95.53% | 96.07% |
| Features (F) | morning slope | morning slope | step 1 → morning slope, min-max; step 2 → morning slope | step 1 → morning slope, daily variance; step 2 → morning slope |

Table 2.2: Weekday pattern clustering results on aggregated load data.

As we expected, performance is better when we consider the pre-processed data set and the nested clustering. In particular, both clustering algorithms reach a matching percentage higher than 95%, but Linkage-Ward turns out to be the best one.

Figure 2.9: Comparison between the label vector obtained with Linkage-Ward applied to the 2013 data set and the gold set G. Light blue circles represent the mismatch points between the two sets.

gold set G. In particular, blue triangles represent days clustered as working days, violet circles represent days clustered as pre-holidays, red squares represent days clustered as holidays and, finally, light blue circles represent mismatch points between $L^{13}$ and G. Most of the mismatch points are located around the Summer and Christmas holidays, that is to say in those periods in which the accuracy of the model is lower. Another important consideration concerns the features used in clustering. It is easy to note that the morning slope plays a central role in clustering. This agrees with what was said in Section 2.2 about the dispersion overlap: it turns out that the morning slope is the only feature that assumes well-separated values among clusters in both steps of nested clustering, as Figure 2.10 shows. Let us now compare Figure 2.10 and Figure 2.6, where the former is associated with the nested clustering and the latter with the direct policy, and let us also refer to Table 2.2. We note that, for both clustering algorithms, in the direct approach the only significant feature is the morning slope. This is not true in the nested approach, where features like the min-max and the daily variance are selected as part of the best feature set for step 1. This difference underlines how a two-step procedure increases the capability of capturing the key points of the typical load trend associated with a cluster.

### 2.4.2 Other areas

| Area | $M_\%$ | Features (step 1) | Features (step 2) |
|---|---|---|---|
| NORTH | 97.35% | morning slope | morning slope |
| CENTRAL-NORTH | 94.06% | morning slope, daily mean | morning slope |
| CENTRAL-SOUTH | 87.67% | morning slope, daily variance | morning slope, daily mean |
| SOUTH | 63.65% | min-max | morning slope |
| SICILY | 66.12% | night slope | morning slope |
| SARDINIA | 59.00% | morning slope | morning slope, daily variance, partial min-diff |

Table 2.3: Weekday pattern nested clustering results on all areas of Italy using the Linkage-Ward algorithm on pre-processed data.

According to the result obtained for the national aggregated data, we performed the nested clustering with the Linkage-Ward algorithm on the pre-processed data set on all


• The more we move towards the South, the more the matching percentage decreases. This behavior can be explained by looking at the centers of the clusters in G when evaluated on different areas. Let us take as an example the centers evaluated on data from Northern Italy, Central-Southern Italy and Sicily, and let us compare them with the clustering result for year 2013 (Figure 2.11). The more we move from North to South, the more the center of the working days cluster moves towards the pre-holidays one, until they overlap in Sicily. This behavior is probably due to the large difference in industrialization between Northern and Southern Italy. In those areas we are no longer able to separate working days from pre-holidays, whichever data set, clustering technique or clustering policy we use. In addition, when the centers overlap, the extracted clusters seem to identify a sort of seasonal pattern.

• Morning slope is once again the most informative feature. It appears in nearly every step of the nested clustering (Table 2.3), both when we succeed in finding the target clusters and when we do not.

Finally, we cannot generalize the results obtained in clustering the aggregated national data to every area defined by Terna. In particular we can divide the areas into group a, composed of the data sets for Northern Italy, Central-Northern Italy and Central-Southern Italy, in which we are able to extract three partitions corresponding to the gold set, and group b, composed of the data sets for Southern Italy, Sicily and Sardinia, in which the clustering does not succeed. In addition, we can consider the morning slope feature the most informative for weekday pattern recognition.

Figure 2.11: Comparison of the centers of the gold set clusters (left) and the results of Linkage-Ward clustering on the 2013 data set (right) when evaluated on data from Northern Italy (a), Central-Southern Italy (b) and Sicily (c). The pre-holidays center (violet) gets increasingly closer to the working days center (blue) as we move from North to South. The holidays cluster center is colored in orange.

## Chapter 3

## Seasonal Pattern Clustering

Besides the daily pattern discussed in the previous chapter, daily load profiles also differ from each other according to a seasonal pattern. Seasonal patterns are less evident than the weekday one and they become well distinguishable if we focus on one weekday cluster at a time. So, let us consider the working days cluster extracted from the aggregated national data for year 2013. By visual inspection of the daily trends we found three possible clusters: the cold season, corresponding to Winter, the mid-season, corresponding to Spring and Autumn together, and the hot season, corresponding to Summer. In this case too we defined a gold set, $G_S$, based on the conventional starting dates of the so-called meteorological seasons. So, the cold season starts on December 1st and ends on February 28th, the hot season starts on June 1st and ends on August 31st, while the mid-season comprises all days from March 1st to May 31st and from September 1st to November 30th.

$$G_S^{i} = \left[ g_S^{i}(j) \right]_{j = 1, \dots, 365}, \qquad g_S^{i}(j) = \begin{cases} 1 & 1 \le j \le 59 \ \lor\ 335 \le j \le 365 \\ 2 & 60 \le j \le 151 \ \lor\ 244 \le j \le 334 \\ 3 & 152 \le j \le 243 \end{cases}, \qquad i = 11, 12, 13,$$

$$G_S = \left[ G_S^{11} \;\; G_S^{12} \;\; G_S^{13} \right],$$

where 1 is the label associated to the cold season, 2 is the label associated to the mid-season and 3 is the label associated to the hot season. Figure 3.1 shows the centers of gold set clusters for working days.
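The definition of $g_S^{i}(j)$ translates directly into code. The sketch below builds the label vector for one 365-day year; the names `seasonal_gold_labels`, `COLD`, `MID` and `HOT` are hypothetical and chosen only for this illustration.

```python
from typing import List

COLD, MID, HOT = 1, 2, 3  # labels used in the gold set G_S


def seasonal_gold_labels(days_in_year: int = 365) -> List[int]:
    """Label each day j = 1..days_in_year with its meteorological season."""
    labels = []
    for j in range(1, days_in_year + 1):
        if j <= 59 or j >= 335:      # 1 Jan to 28 Feb, 1 Dec to 31 Dec
            labels.append(COLD)
        elif 152 <= j <= 243:        # 1 Jun to 31 Aug
            labels.append(HOT)
        else:                        # 1 Mar to 31 May, 1 Sep to 30 Nov
            labels.append(MID)
    return labels


# Gold set over the three years considered, concatenated as in G_S.
gold_set = seasonal_gold_labels() * 3
```

With these boundaries a year contains 90 cold season days, 183 mid-season days and 92 hot season days.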


Figure 3.1: Centers of the clusters defined in $G_S$ for the working days cluster extracted from the national aggregated data in 2013. The cold season center is colored in blue, the mid-season center in violet and the hot season center in orange.

The centers clearly represent three different typical daily trends. The cold season center has two peaks, one in the morning and one in the evening, due to the use of heaters. The hot season center also has two peaks, but the second one is brought forward to the afternoon. Hot season peaks can be related to the use of air conditioning. In addition, the center drops in the evening, while it assumes values higher than the other centers during the night. Finally, the mid-season center is characterized by one more peak: the first one is in the morning, the second one in the afternoon and the last one in the evening, later than the cold season one. It is the center with the smallest load values, and this agrees with the hypothesis we made about what causes consumption peaks: during the mid-season neither heaters nor air conditioning are used as much as in Winter and Summer.

Such an initial analysis encourages the search for a seasonal pattern in each cluster obtained from the daily pattern analysis. However, we have to make a remark about the gold set we defined. Obviously the accuracy of $G_S$ is much lower than that of G. The accuracy loss is due to the relationship between the alternation of the seasons and the weather. The typical profile of a season evolves smoothly into the one of the following season, according to the weather and hence to the particular area. It is not possible to define a priori a gold set that is valid globally in time and in space. Taking this into account, we will use the gold set as a guideline in searching for features for the pre-processed data set, but we cannot take it as a yardstick against which to compute the performance of a clustering process.

### 3.1 Performance index

The previous consideration underlines the need to define a new performance index. Since we do not have enough information on where to locate the boundary between seasons, the only measure of goodness of the clustering result relies on the compactness of the clusters. Data points belonging to the same season have to be close together until a season change happens. In other words, if points labeled as cold season appear in the middle of the hot season cluster, the clustering result is not acceptable. The index we propose considers as the best clustering result the one that minimizes the number of switches in the clustering label vector $L_S$ (defined in the same way as in Equation 2.2).

Let us consider the gold set $G_S$. The total number of switches between seasons is equal to 12, that is to say that we find 4 switches within a year. Let us call the minimum number of switches $d_s$, default switches,

$$d_s = 4\, y_n = 12,$$

where $y_n$ stands for the number of years considered in the gold set. Let us now consider the clustering label vector $L_S$ and let $S(L_S)$ be the number of switches between labels in it. Note that $S(L_S)$ can vary between 0 and $\#(L_S) - 1$, where $\#(L_S)$ is the number of elements of $L_S$.

We define the performance index $N_S$ as

$$N_S = 1 - \frac{S(L_S) - d_s}{\#(L_S) - 1}. \tag{3.1}$$

If $N_S$ is close to 0, two consecutive days are almost always associated with different clusters: in the limit case $S(L_S) = \#(L_S) - 1$ the index equals $d_s / (\#(L_S) - 1)$. If $N_S$ is 1 the number of switches in the clustering label vector is equal to the default switches number $d_s$. It may happen that $N_S$ is higher than one. When this happens the number of switches in $L_S$ is less than $d_s$, so a season disappeared, e.g. the daily trend of the hot season jumps


Index $N_S$ gives a good and extensive measure of the goodness of the clustering result: the closer it is to 1, the more the clustering succeeds in partitioning the data into seasons. However, in this type of analysis the visual inspection of results is fundamental.
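As a worked illustration of Equation 3.1, the sketch below counts the switches in a label vector and computes $N_S$. The function names `count_switches` and `performance_index` are hypothetical; the only assumption beyond the text is that the labels are given in chronological order.

```python
from typing import Sequence


def count_switches(labels: Sequence[int]) -> int:
    """S(L_S): number of consecutive pairs of days with different labels."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def performance_index(labels: Sequence[int], years: int) -> float:
    """N_S = 1 - (S(L_S) - d_s) / (#(L_S) - 1), with d_s = 4 * years."""
    if len(labels) < 2:
        raise ValueError("need at least two labeled days")
    d_s = 4 * years  # default switches: 4 season changes per year
    return 1.0 - (count_switches(labels) - d_s) / (len(labels) - 1)
```

For example, a one-year label vector with exactly four season changes gives $N_S = 1$, while a vector of 200 days that jumps once from cold to hot season gives $N_S = 1 + 3/199 > 1$, signalling that a season is missing.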

### 3.2 Results

Tests were performed on each cluster extracted from the daily pattern clustering, for each area. In addition, we performed clustering on all the kinds of data set described in Section 2.2, but using only a direct approach. In the following sections we detail the results obtained for the aggregated national data and we summarize the results for both group a and group b data sets.

### 3.2.1 Aggregated national data

Table 3.1 summarizes the results obtained for the aggregated national data set.

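| ITALY | working days, FCM (D) | working days, LW (D) | pre-holidays, FCM (D) | pre-holidays, LW (D) | holidays, FCM (D) | holidays, LW (D) |
|---|---|---|---|---|---|---|
| R | 0.82 | 0.90 | 0.71 | 0.83 | 0.71 | 0.78 |
| RD | 0.94 | 0.96 | 0.92 | 0.96 | 0.88 | 0.90 |
| F | 0.96 | 0.98 | 0.94 | 0.97 | 0.98 | 1 |

Features selected for F, listed row by row as they appear in the table: night slope, partial min-max, partial daily variance, late afternoon, night slope, partial min-max, late morning, late afternoon, partial max-diff.

Table 3.1: Daily pattern clustering results on aggregated load data.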

Whether we consider the results for working days, pre-holidays or holidays, the clustering performed on pre-processed data with the Linkage-Ward algorithm again leads to the best results. It should be noted that the most important features all refer to the load in the afternoon and evening, that is to say the portion of the daily trend in which the centers differ the most. In addition, pre-holidays and holidays need more features to be correctly partitioned. This behavior can be explained by considering that the dispersion of such clusters is higher than that of working days, so we need more information about the shape of the daily trend. Figures 3.2 to 3.4 show the results over the three years for the three types of cluster.


Figure 3.4: Results of seasonal pattern clustering on holidays pre-processed data with Linkage-Ward on year 2011 (top), 2012 (middle) and 2013 (bottom).


We can make some remarks. First, let us consider the clustering output for pre-holidays (Figure 3.3). It is easy to note that in years 2011 and 2013 the load profile switches from the hot season directly to the cold season, and the mid-season between them is completely absent. We would expect the corresponding performance index $N_S$ to be higher than 1, but it stops at 0.97, because there are other switches during the year that make the overall number of switches higher than 12. This example justifies the need for a visual analysis of the results. On the other hand, we can compare the distribution of the seasons between working days, pre-holidays and holidays. It appears that the alternation of the seasons is similar to the meteorological one for working days (except that the cold season starts in November) and changes in the other two clusters. In particular, the second part of the mid-season is absent in pre-holidays, while the hot season is very extended in holidays. Such a difference can be explained by considering the scheduled use of heaters and air conditioning in offices and commercial activities: this leads to a constant alternation of the seasons during working days and a less constrained behavior during pre-holidays and holidays.

### 3.2.2 group a results

Group a includes those areas in which we succeeded in clustering the data set according to the weekday pattern. Table 3.2 summarizes the results obtained for the Northern Italy, Central-Northern Italy and Central-Southern Italy pre-processed data sets, using the Linkage-Ward algorithm.

Results are similar to those obtained for the aggregated national data, with the performance index always higher than 0.9 except in the Central-North data set when we consider the pre-holidays cluster. In this case, the division into compact seasonal clusters is not so clear, as Figure 3.5 shows for year 2013. Despite this, we consider such a result acceptable because the intruder points are isolated. Indeed, we cannot find a group of days, consecutive or close together, with the wrong label. On particular days the weather may be typical of another season, and so a day may assume characteristics typical of another cluster.

| Area | working days $N_S$ | pre-holidays $N_S$ | holidays $N_S$ |
|---|---|---|---|
| NORTH | 0.95 | 0.99 | 0.95 |
| CENTRAL-NORTH | 0.91 | 0.83 | 0.93 |
| CENTRAL-SOUTH | 0.94 | 0.97 | 0.96 |

Features selected for each area, listed row by row as they appear in the table:

- NORTH: night slope, night slope, partial daily variance, late afternoon, partial max-diff, night slope, partial daily variance, partial min-max, partial min-diff, late afternoon, late morning
- CENTRAL-NORTH: night slope, partial daily variance, peak position, late afternoon, partial min-max, partial max-diff, night slope, late afternoon
- CENTRAL-SOUTH: night slope, partial daily variance, partial min-max, late afternoon, night slope, night slope, partial max-diff, partial max-diff, late afternoon, partial min-max, partial min-max, daily variance, early morning mean, early morning mean, partial daily mean

Table 3.2: Seasonal pattern clustering results on group a pre-processed data sets using Linkage-Ward algorithm.

Figure 3.5: Results of seasonal pattern clustering in Central-North data set when considering pre-holidays cluster.


### 3.2.3 group b results

Group b includes those areas in which we did not succeed in clustering the data set according to the weekday pattern. In this case we try to extract a seasonal pattern from the whole data set, including working days, pre-holidays and holidays. Figure 3.6 illustrates the centers of the gold set $G_S$ for the Southern Italy, Sicily and Sardinia data sets. The centers of the cold season and the mid-season are similar except in the final part of the pattern, in which the former is characterized by a higher peak and a steeper night slope. The hot season center, instead, is very different in shape and values, and it differs the most from the other ones during the night, in the late morning and in the position of the evening peak, which appears later.

As in the previous case, the analysis of the centers leads us to continue searching for a seasonal pattern considering the whole data set of each area. Table 3.3 summarizes the results obtained for group b data sets, considering only a direct approach.

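| Area | Data set | FCM (D) | LW (D) |
|---|---|---|---|
| SOUTH | R | 0.68 | 0.76 |
| SOUTH | RD | 0.71 | 0.73 |
| SOUTH | F | 0.88 | 0.92 |
| SICILY | R | 0.78 | 0.82 |
| SICILY | RD | 0.83 | 0.75 |
| SICILY | F | 0.82 | 0.88 |
| SARDINIA | R | 0.77 | 0.80 |
| SARDINIA | RD | 0.69 | 0.68 |
| SARDINIA | F | 0.84 | 0.87 |

Features selected for F, listed as they appear in the table:

- SOUTH: early morning mean, late afternoon, partial min-max
- SICILY: early morning mean, min-max, partial max-diff, night slope
- SARDINIA: night slope, partial max-diff, partial daily variance, late afternoon, partial min-diff

Table 3.3: Seasonal pattern clustering results on group b data sets.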

Once again the Linkage-Ward algorithm applied to a pre-processed data set gives the best results (Figure 3.7 shows the partitions obtained in year 2013 for all areas of group b). The features used in clustering refer to the final portion of the daily trend, from the afternoon on, and to the night, with the early morning mean feature. This agrees with the initial analysis of the centers of the gold set clusters. Most of the $N_S$ indexes are smaller than 0.9, so we have a global degradation of the performance. However, if we look at the results we can notice that, in general, the clustering succeeds in partitioning the data according to a seasonal pattern, even if there are some differences between the three data sets. In particular, in Southern Italy the hot season is limited to the month of July, while in Sardinia it ends before the standard meteorological summer season, that is to say that August belongs to

Figure 3.6: Results of seasonal pattern clustering on data sets belonging to group b in 2013. In particular the figure at the top refers to Southern Italy, the figure in the middle refers to Sicily and the figure at the bottom refers to Sardinia.

Figure 3.7: Results of seasonal pattern clustering on data sets belonging to group b in 2013. In particular the figure at the top refers to Southern Italy, the figure in the middle refers to Sicily and the

data set in which the mid-season includes most of holidays belonging to the cold season. Taking into account previous considerations, we can conclude that even if the clustering is performed on the entire data set for areas belonging to group b, it is possible to extract a seasonal pattern in each of them using the Linkage-Ward algorithm on the pre-processed data set.

### 3.2.4 One step back

Figure 2.11 shows how in the Southern Italy data set the centers for working days and pre-holidays overlap. This is the reason why we were not able to cluster the load trends according to a weekday pattern. In the last section, instead, we illustrated how the same data set is suitable for a seasonal pattern partitioning. The same holds for all data sets of group b. Let us now consider one of the seasonal clusters extracted from the Southern Italy data set. It still holds that we cannot separate working days from pre-holidays, but we can isolate the holidays cluster. This hypothesis is confirmed by two considerations. First, the visual analysis of Figure 3.8, in which we compare the centers of two clusters for Southern Italy (non-holidays, which comprises both working days and pre-holidays, and holidays, as they were defined in the weekday pattern model), shows how the holidays cluster effectively differs from the non-holidays one. Second, the results obtained for the Sicily data sets strengthen the idea that the holidays pattern can be considered an independent trend.

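| Area (F) | cold season, FCM (D) | cold season, LW (D) | mid season, FCM (D) | mid season, LW (D) | hot season, FCM (D) | hot season, LW (D) | mean, FCM | mean, LW |
|---|---|---|---|---|---|---|---|---|
| SOUTH | 84.00% | 87.14% | 75.84% | 84.16% | 90.00% | 90.42% | 83.28% | 87.24% |
| SICILY | - | - | 84.70% | 88.89% | 94.89% | 98.18% | 89.80% | 93.53% |
| SARDINIA | 79.76% | 83.54% | 78.80% | 83.54% | 79.04% | 90.93% | 79.20% | 86.00% |

Features selected for each area, listed row by row as they appear in the table:

- SOUTH: night slope, max peak, morning slope, morning slope, night slope, night slope
- SICILY: morning slope, morning slope, night slope, night slope, partial min-diff
- SARDINIA: morning slope, kurtosis, morning slope, night slope, morning slope

Table 3.4: Matching percentages of clustering results in searching the Mondays cluster.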

We performed again a clustering process considering all available data sets, both clustering algorithms and a calendar-based gold set. Accordingly, performance was computed using


Figure 3.8: Comparison between centers of non-holidays cluster (blue) and holidays cluster (violet) in cold season (top), mid season (middle) and hot season (bottom) clusters obtained from the Southern Italy data


the matching percentage index $M_\%$. We found that the Linkage-Ward method applied to the pre-processed data set gives the best results in all areas. It should be noted that for the Sicily data set we searched only for the mid-season and the hot season clusters, considering the cold season cluster formed only by non-holidays patterns.

### 3.2.5 Mondays cluster

Let us focus on the working days clusters in group a data sets together with the aggregated national one. In all seasonal clusters we find that Mondays have a peculiar trend during the night. In particular, even if the shape of the load trend is similar to that of other days, the load values are smaller during the night and then the morning slope rises more steeply. This behavior comes from the fact that the holidays trend is lower than the working days one, so on Mondays the load starts from a lower value and rises until it follows the typical working days trend (Figure 3.9). We want to isolate this peculiarity in a new cluster made of all Mondays and all days following a holiday, for each seasonal cluster.
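A calendar-based rule for this new cluster can be sketched as follows. The function name `monday_like_labels` is hypothetical, and the set of holiday dates is assumed to be supplied by the caller; how holidays are listed in the calendar is not specified here.

```python
from datetime import date, timedelta
from typing import Iterable, List, Set


def monday_like_labels(days: Iterable[date], holidays: Set[date]) -> List[bool]:
    """Flag each day that is a Monday or that follows a holiday.

    days: the working days of one seasonal cluster.
    holidays: hypothetical, caller-supplied set of holiday dates.
    """
    one_day = timedelta(days=1)
    return [
        d.weekday() == 0 or (d - one_day) in holidays  # weekday() 0 = Monday
        for d in days
    ]
```

The resulting flags play the role of a gold set for the Mondays cluster: a clustering result on the working days of a season can be compared against them with the matching percentage.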

Figure 3.9: Zoom of the weekly trend shown in Figure 2.1. June 16th is a Monday and the initial part of the daily trend assumes values smaller than those of the following day.


in each seasonal cluster of group a areas considering all kinds of data set, both algorithms, a direct approach and a calendar-based gold set. We found that the results obtained with Linkage-Ward on the pre-processed data are again the best ones (Table 3.5). The best performance evaluation is done considering the mean of the matching percentages over the three seasonal clusters. We note that performance decreases moving towards the South: this agrees with what was previously said about changes in daily patterns. The more we move from North to South, the more the peculiarities of some typical daily trends disappear, including those of Mondays.

| Area | cold season FCM (D) | cold season LW (D) | mid season FCM (D) | mid season LW (D) | hot season FCM (D) | hot season LW (D) | mean FCM | mean LW |
|---|---|---|---|---|---|---|---|---|
| ITALY (F) | 93.26% | 92.23% | 89.46% | 88.65% | 85.62% | 93.15% | 89.45% | 91.34% |
| NORTH (F) | 93.64% | 93.64% | 98.32% | 97.31% | 94.85% | 94.85% | 95.60% | 95.26% |
| CENTRAL NORTH (F) | 76.35% | 86.30% | 79.07% | 93.36% | 70.00% | 75.50% | 75.14% | 85.05% |
| CENTRAL SOUTH (F) | 86.82% | 85.45% | 80.38% | 82.80% | 63.28% | 72.88% | 76.82% | 80.37% |

Features reported in each F row, in reading order:

| Area | Features |
|---|---|
| ITALY | early morning mean, early morning mean, kurtosis, min-max, daily variance, daily variance, daily variance, late morning |
| NORTH | min-max, early morning mean, kurtosis, kurtosis, min-max, early morning mean, kurtosis, min-max |
| CENTRAL NORTH | early morning mean, early morning mean, kurtosis, daily variance, daily variance, night slope, min-max, min-max |
| CENTRAL SOUTH | early morning mean, early morning mean, min-max, daily variance, daily variance, morning slope |

Table 3.5: Matching percentages of clustering results in searching the Mondays cluster.
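As a quick check of how the mean columns are obtained, the sketch below averages the three seasonal matching percentages for each algorithm. The percentages are transcribed from Table 3.5; the dictionary layout and the names `seasonal_mean` and `best_algorithm` are illustrative assumptions.

```python
# Matching percentages (cold, mid, hot season) transcribed from Table 3.5.
TABLE_3_5 = {
    "ITALY": {"FCM": (93.26, 89.46, 85.62), "LW": (92.23, 88.65, 93.15)},
    "NORTH": {"FCM": (93.64, 98.32, 94.85), "LW": (93.64, 97.31, 94.85)},
    "CENTRAL NORTH": {"FCM": (76.35, 79.07, 70.00), "LW": (86.30, 93.36, 75.50)},
    "CENTRAL SOUTH": {"FCM": (86.82, 80.38, 63.28), "LW": (85.45, 82.80, 72.88)},
}


def seasonal_mean(percentages):
    """Mean of the matching percentages over the seasonal clusters."""
    return sum(percentages) / len(percentages)


def best_algorithm(area):
    """Return the algorithm with the highest seasonal mean and all means."""
    means = {alg: seasonal_mean(vals) for alg, vals in TABLE_3_5[area].items()}
    return max(means, key=means.get), means


for area in TABLE_3_5:
    winner, means = best_algorithm(area)
    print(area, winner, {alg: round(m, 2) for alg, m in means.items()})
```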

### 3.2.6 Summary

In summary, we found:

• 12 clusters for the aggregated national data set and for the data sets belonging to group a. Each daily trend can be classified according to the weekday in four clusters (Mondays, working days, pre-holidays and holidays) and according to the season in three clusters (cold season, mid season and hot season).

• 6 clusters for the data sets belonging to group b. Each daily trend can be classified according to the weekday in two clusters (non-holidays and holidays) and according to the season in three clusters (cold season, mid season and hot season). In particular, for the Sicily data set we found that the cold season cluster is composed only of non-holiday trends and, consequently, the holidays cluster can be divided into only two seasons: mid season and hot season.

Figure 3.10 gives a visual summary of the clustering hierarchy.
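The cluster counts listed above follow from combining weekday types with seasons. A minimal sketch of this enumeration is given below; the helper name `clusters` and the tuple representation are illustrative assumptions, and the Sicily case simply drops the combination that was not found.

```python
from itertools import product

SEASONS = ("cold season", "mid season", "hot season")
WEEKDAYS_GROUP_A = ("Mondays", "working days", "pre-holidays", "holidays")
WEEKDAYS_GROUP_B = ("non-holidays", "holidays")


def clusters(weekday_types, seasons=SEASONS, excluded=()):
    """All (weekday type, season) pairs, minus the excluded ones."""
    return [pair for pair in product(weekday_types, seasons) if pair not in excluded]


group_a = clusters(WEEKDAYS_GROUP_A)  # national data set and group a areas
group_b = clusters(WEEKDAYS_GROUP_B)  # group b areas
# Sicily: no holidays cluster in the cold season.
sicily = clusters(WEEKDAYS_GROUP_B, excluded={("holidays", "cold season")})

print(len(group_a), len(group_b), len(sicily))
```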

## Chapter 4

## Electrical Consumption Analysis by Customers

In the previous chapters we extracted typical load trends according to the weekday and the season. Besides this, electric energy providers are interested in extracting typical load profiles according to the category of customer. This analysis could lead to a more detailed partitioning of electric consumption and thus to more diversified tariffs for customers. We can broadly divide the load consumption according to the canonical economic sectors: primary, related to agriculture; secondary, related to industry; and tertiary, related to services and households. Each of these categories includes a multitude of different types of customers, each with a different daily consumption profile.

Let us take household consumption as an example. Several authors have tried to design a model able to represent the energy consumption demand [18] [22] [23] [24], and all agree that it depends on a large number of variables of different nature. In particular, Yu et al. [24] summarize the meaningful variables in five categories: the population, the living standard of residents, the urban construction and development level, the social development level and the natural conditions. Particular attention may be paid to the building technology [22] and to the typology of residents: the load consumption of a hall of residence is very different from that of a single man in a two-room flat.

It is easy to understand how difficult it is to extract a typical load profile for each macro-area from aggregated data. However, in this work we try to relate each weekdays cluster to a group of customer categories so that, given the daily load profile and the weekdays cluster to which it belongs, we can deduce how much of the daily load is used by each main customer category.

In particular, we refer to the Grid Code, published by Terna, in which we find the annual consumption of electricity by Italian region and by category of customer. Terna establishes four main categories, Agriculture, Industry, Tertiary and Households, and for each of them provides the annual consumption of some principal sub-categories, as listed in Table 4.1.

| Category | Sub-categories |
|---|---|
| Agriculture | |
| Industry | iron and steel, nonferrous metals, chemical, building material, paper-making, food, textile, mechanics, transportation, plastics, wood, fuel extraction, refinery and coke ovens, gas and electricity, water main, others |
| Tertiary | transportation, communications, commercial, food services, assurances, public administration, public illumination, others |
| Households | |

Table 4.1: Terna customers categories.

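The idea is to partition these data into three groups that can be associated with the consumption of the weekdays clusters, and then to compute, for each group, called $\mathrm{cust}_i$,

$$
\begin{aligned}
c_1 &= \frac{\mathrm{cust}_1}{\mathrm{cust}_1 + \mathrm{cust}_2 + \mathrm{cust}_3} \cdot 100 \\
c_2 &= \frac{\mathrm{cust}_2}{\mathrm{cust}_1 + \mathrm{cust}_2 + \mathrm{cust}_3} \cdot 100 \\
c_3 &= \frac{\mathrm{cust}_3}{\mathrm{cust}_1 + \mathrm{cust}_2 + \mathrm{cust}_3} \cdot 100
\end{aligned}
\tag{4.1}
$$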

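searching for the grouping that best fits the same quantities computed on the centers of the weekdays clusters:

$$
\begin{aligned}
wd &= \frac{\sum_{i=1}^{24} \mu_{wd}(i)}{\sum_{i=1}^{24} \mu_{wd}(i) + \sum_{i=1}^{24} \mu_{ph}(i) + \sum_{i=1}^{24} \mu_{h}(i)} \cdot 100 \\
ph &= \frac{\sum_{i=1}^{24} \mu_{ph}(i)}{\sum_{i=1}^{24} \mu_{wd}(i) + \sum_{i=1}^{24} \mu_{ph}(i) + \sum_{i=1}^{24} \mu_{h}(i)} \cdot 100 \\
h &= \frac{\sum_{i=1}^{24} \mu_{h}(i)}{\sum_{i=1}^{24} \mu_{wd}(i) + \sum_{i=1}^{24} \mu_{ph}(i) + \sum_{i=1}^{24} \mu_{h}(i)} \cdot 100
\end{aligned}
\tag{4.2}
$$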
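Equations (4.1) and (4.2) and the search for the best grouping can be sketched as below. This is a minimal illustration, not the procedure actually used in this work: it assumes that groups 1, 2 and 3 are compared with wd, ph and h in that order, it assumes a sum of squared differences as the fit criterion, and all function names and the toy consumption values are hypothetical.

```python
from itertools import product


def percentage_shares(totals):
    """Each total as a percentage of the overall sum, as in (4.1) and (4.2)."""
    grand_total = sum(totals)
    return [100.0 * t / grand_total for t in totals]


def cluster_shares(mu_wd, mu_ph, mu_h):
    """Equation (4.2): shares from the 24 hourly values of each cluster center."""
    return percentage_shares([sum(mu_wd), sum(mu_ph), sum(mu_h)])


def grouping_shares(consumption, assignment):
    """Equation (4.1) for one assignment of categories to groups 0, 1, 2."""
    totals = [0.0, 0.0, 0.0]
    for category, group in assignment.items():
        totals[group] += consumption[category]
    return percentage_shares(totals)


def best_grouping(consumption, target):
    """Exhaustive search over assignments (assumed squared-error criterion)."""
    names = list(consumption)
    best, best_err = None, float("inf")
    for groups in product(range(3), repeat=len(names)):
        if len(set(groups)) < 3:  # every group must contain a category
            continue
        assignment = dict(zip(names, groups))
        shares = grouping_shares(consumption, assignment)
        err = sum((s - t) ** 2 for s, t in zip(shares, target))
        if err < best_err:
            best, best_err = assignment, err
    return best, best_err


# Toy values, for illustration only (not Terna data).
toy_consumption = {"A": 50.0, "B": 30.0, "C": 20.0}
toy_target = cluster_shares([5.0] * 24, [3.0] * 24, [2.0] * 24)
toy_best, toy_err = best_grouping(toy_consumption, toy_target)
```

The exhaustive search visits 3^n assignments for n categories, so it is practical only for a small number of categories; a larger set would need a heuristic or a pre-selection of candidate groupings.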