
Analysis, design and implementation of a parking recommendation system applying machine learning techniques



Engineering and Architecture Faculty

Master’s Degree in Computer Science and Engineering

ANALYSIS, DESIGN AND IMPLEMENTATION OF A PARKING RECOMMENDATION SYSTEM APPLYING MACHINE LEARNING TECHNIQUES

Subject

MACHINE LEARNING AND DATA SCIENCE

Thesis supervisor

Prof. MARCO ANTONIO BOSCHETTI

Thesis co-supervisor

Eng. DAVIDE ALESSANDRINI

Presented by

GABRIELE GUERRINI

Single graduation session

Academic Year 2019 – 2020


Urban mobility

Parking recommendation system

Machine learning

Data mining

myCicero


Introduction

1 State of the art
  1.1 Related works revision
  1.2 myCicero
  1.3 Competitors
    1.3.1 EasyPark
    1.3.2 APParked

2 Data analysis
  2.1 Data exploration
    2.1.1 myCicero
    2.1.2 Parking meters
    2.1.3 Verifiers dataset and seasonal tickets
  2.2 Time series model
  2.3 Preprocessing
    2.3.1 Timestamp approximation
    2.3.2 Aggregation of near parking meters
    2.3.3 Drop of outliers
    2.3.4 Filling of missing values
  2.4 Time analysis
    2.4.1 myCicero
    2.4.2 Parking meters
    2.4.3 Seasonal tickets
  2.5 Spatial analysis
    2.5.1 University area (north wing)
    2.5.2 Area #2
    2.5.3 Conclusions
  2.6 Miscellaneous
    2.6.1 Arrivals vs Stops
    2.6.2 Stop length and late afternoon factor

3 Problem formalization
  3.1 Occupation status estimation
    3.1.1 OPV estimation
    3.1.2 Forecasting model: Prophet
    3.1.3 Capacity estimation
    3.1.4 Occupation percentage to occupation status mapping
  3.2 Geometric models
    3.2.1 Polygon-oriented models
    3.2.2 Graph-oriented models

4 Solution development
  4.1 System architecture
  4.2 SystemSetup
    4.2.1 Geometric model
    4.2.2 Further setup tasks
  4.3 RunningSystem
  4.4 ForecastingService
  4.5 StorageService
  4.6 RoadCutterService

5 Performance evaluation
  5.1 Experimental setup
  5.2 Real-world gap
  5.3 Generalization error
  5.4 Forecasting error
  5.5 Behavior goodness

6 Conclusions
  6.1 Summary
  6.2 Future developments


Nowadays, smart city and urban mobility topics are attracting growing interest for both business and R&D purposes, and one trending application is the study and realization of parking recommendation systems. As a matter of fact, understanding parking behavior and making effective predictions play a key role in multiple optimization aspects. Besides facilitating the parking operation itself when looking for a free parking lot, understanding the existing dynamics also makes it possible to better frame traffic conditions and, consequently, to allocate resources such as traffic police. Furthermore, vehicles looking for free parking spaces negatively impact traffic speed, and long queues can often be observed during peak hours. Accurate feedback to users about the current availability of parking lots can reduce traffic congestion by limiting the “cruising for parking” phenomenon that increasingly affects city centers.

In our case, a parking recommendation system is going to be developed for the city of Bologna. After a brief review of similar proposed solutions and their approaches and findings, a first data analysis phase is going to be executed to explore the data and to mine any useful information about the parking behavior. Then, once the adopted forecasting model has been presented, the design and development of the system are going to be described. In the end, the performance of the system is going to be evaluated from multiple points of view, also considering some project-specific constraints.


1 State of the art

First, Section 1.1 reviews the related works on the parking forecasting topic. Then, Section 1.2 examines the application into which the parking recommendation system is going to be integrated, and Section 1.3 the competitors in the Italian market that have already introduced a similar functionality.

1.1 Related works revision

Many different projects have already been developed to model and predict the occupation of parking lots, and very different ideas and approaches have been explored. Such a great variance is mainly due to the fact that the domain is very complex, and so each project deals with a different scenario depending on multiple factors, among them the features of the available data.

Available data

Besides the occupation of parking lots, information about exogenous factors is often exploited too, including weather data, traffic speed and temporary outlier conditions. This latter aspect includes car crashes, city market days, closed roads and holidays.

Also, we can identify further aspects that make it possible to provide different capabilities to the solution, namely:

• Usage of real-time data versus usage of historical data only: real-time data provides accurate information that otherwise must be approximated (e.g., the exact arrival and/or departure time).
• Historical depth of the datasets: the period covered by the data influences the types of seasonality that can be found.
• Sampling period of the data: it determines the minimum interval at which a new prediction can be generated (every 5 minutes, 15 minutes, etc.).

Forecasting model

Besides the commonplace approach based on neural networks, statistical predictors (e.g., ARIMA [2]) and other well-known machine learning techniques (e.g., random forest [3]) have been applied too. However, no acknowledged best-performing standard has emerged among the different forecasting models, nor is there agreement on the type of problem to be solved (regression versus classification).

Also, some models have been enriched using fuzzy logic principles to estimate the uncertainty of parking availability during the peak parking demand period [6].

When a regression problem is faced, the time series concept has been used to model the stops trend in most of the cases. However, the way the time series are built is not always the same and depends upon the forecasting goal of each project (find out an occupation percentage, find out a delta in the number of arrivals and departures, find out the time a parking lot is going to remain occupied/free, etc.).

Concerning the neural network domain, different methodologies straddling several architectures have been adopted. Indeed, besides classical MLP networks that are still used, Recurrent Neural Networks (RNN) are widely employed to catch time dependences and, in some cases, Graph Convolutional Neural Networks (GCNN) are employed to catch spatial dependences too [4]. Further, more complex techniques have also been tried by:

• Redesigning the architecture of the LSTM unit ad hoc [5]: the input gate is modified to merge the regular input with information about weather conditions and recurrent patterns at three different granularity levels (day, week, month).

• Training the neural network using genetic algorithms [1].

• Combining more neural networks and merging their partial results [4] (similarly to the bagging principle).

Geometric model

When dealing with the geometric aspect, the common solution is to split the territory up into different regions so that situated forecasts can be executed. The number of regions a city can be divided in highly varies case by case, and the methodology adopted to split the territory is generally handmade, considering the districts on the map or the road network.

The correct detection of regions is a crucial aspect, since it has been found that different areas may assume completely different behaviors depending on the time of the day, the day of the week and the weather conditions. This latter factor significantly influences performance in general, and it has been discovered that different areas suffer occupation changes with different types of weather. Indeed, business areas expose higher values on adverse weather (storm, snow, etc.), while recreational areas follow an opposite behaviour.

Moreover, areas containing more parking lots seem to perform better because the occupation rate variance is lowered, and so more stable and accurate forecasts can be made. On the contrary, areas containing just a bunch of parking lots are susceptible to fluctuations.

Furthermore, a time correlation between adjacent areas has been found in some cases [4]. Such areas reach a high occupation at similar times during the day, except for a little shift. This is probably due to the filling, at first, of a certain block containing a major place of interest. Then, once such a block is full, people begin to look for more distant parking lots, and so nearby blocks start to grow their occupation too.

In the end, notice that a few projects worked on simplified scenarios where the incoming data covers just one or a few restricted areas. Hence, the domain was considered atomic, and no particular geometric model was adopted.

1.2 myCicero

“myCicero” [7] is a digital platform developed by the “Pluservice” [8] company, whose goal is to offer different services, including info-mobility, mobile ticketing, shopping promos and points of interest (POI).

The application is based on a MaaS architecture (Mobility-as-a-Service), and already integrates more than 200 mobility operators.

When talking about the ticketing area, the application handles different types of transport, such as railways, taxis, bus routes and car pooling. The “on street parking” functionality handles the management of paid parking lots, and Figure 1.1 reports an example of its usage.


Figure 1.1: The “on street parking” functionality of the “myCicero” application. Different colors identify different hourly rates.

This latter functionality enables the payment of the parking through the mobile application, so that users can get rid of parking meters just by using their mobile phone. The main advantages are:

• Faster parking: the user does not need to reach a parking meter by foot anymore.
• Easier payment: there is no need for the user to use coins to pay.
• Increased flexibility: the application charges only the effective duration of the stop, which can be managed without coming back to the car.

The parking recommendation system to be developed will be integrated into this parking functionality, and there is no previous version of it. Ideally, the goal is to provide a visual feedback for each street on the map.

1.3 Competitors

1.3.1 EasyPark

“EasyPark” [9] is the first competitor that introduced a parking recommendation system in the Italian market, and the company has already launched the functionality in several European cities. In Italy, Verona was the first one in 2018, but other cities were later included too (such as Milan and Rome). The “Find & Park” functionality allows users to check the probability of finding parking for each street, and such probability is expressed using different colors on the map (Figure 1.2). The goal is to save time when looking for a parking lot, and it has been confirmed that the time spent by users is cut down, from a couple of minutes up to half the search time, depending on the situation.

It has been declared that the recommendation system considers the following factors:

• “EasyPark” data: stops executed by the users that pay through the application.

• Data provided by the local mobility operator, also including exogenous factors such as temporarily closed streets and city market days.

In the end, as well as displaying the map containing the status of each street, users can also activate a navigator whose calculated journeys are optimized for parking lot availability rather than for travel time.


Figure 1.2: Example of the “Find & Park” functionality.

1.3.2 APParked

“APParked” [10] is another application providing a similar functionality. However, even though the goal is very similar, the application uses a completely different approach. Indeed, it is based upon a social idea and the collaboration of the users, who report free parking lots.

A user can ask for a free parking lot, and the platform will return all the nearby availabilities. Then, the user notifies the platform when the parking lot is effectively occupied (park-in), and a suggested duration of the stop can be provided so that an expected release time of the parking lot is known. Obviously, the user can also notify the platform when leaving (park-out).

Besides the common benefits, users are enticed to use the application through various rewards: the more information they provide, the more rewards they can obtain.


2 Data analysis

Firstly, Section 2.1 describes the available data. Then, once the model adopted to shape the stops has been explained (Section 2.2) and the datasets have been preprocessed (Section 2.3), a data analysis phase is executed to mine any useful information that may be considered during the development of the system itself1.

Premise: notice that there is no information about night hours, since parking is free during such period and, consequently, no checks are made by police officers either. Hence, for the time being, no analysis can be executed nor any feedback given to users for such hours using the provided data.

Moreover, a similar reasoning applies to holidays, when parking is free too.

2.1 Data exploration

There are three admissible ways to occupy a paid parking lot, that is:

• Use a nearby parking meter to pay the required amount.
• Use an application that allows the payment using a digital wallet (or any similar feature). We focus on information coming from the “myCicero” application since data of competitors is obviously unknown.
• Use a seasonal ticket (or any similar special convention).

A dataset is available for each payment method, grouping the transactions that happened in the period 01-01-2019 / 31-12-2019.

1 Analyses that did not lead to any interesting result are not reported for conciseness.


2.1.1 myCicero

The dataset contains an overall of 310k records with a size of 20MB. Each record represents a payment made at a certain moment and at a certain point in space. Hence, each entry contains a few useful pieces of information, such as:

• Longitude and latitude coordinates
• Transaction timestamp representing the moment the stop began
• Stop length (minutes)

Records are provided in the format:

(longitude: float, latitude: float, stop length: int, timestamp: string)

Timestamps, already rounded to quarters of an hour, are provided in the following format:

Y-M-D h:m:s

Table 2.1 reports a few example records.

Longitude   Latitude   Stop length  Timestamp
11.347791   44.508027  27           2019-01-01 11:15:00
11.311599   44.491037  60           2019-01-01 11:15:00
11.354347   44.487044  197          2019-01-01 13:00:00

Table 2.1: Sample of the “myCicero” transactions dataset.
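As an illustration, a hypothetical loader for this dataset could look as follows; the file name and column names are assumptions, only the record and timestamp formats come from the description above.

```python
import pandas as pd

def load_mycicero(path="mycicero_transactions.csv"):
    # Hypothetical file layout: one record per line, no header row.
    df = pd.read_csv(
        path,
        header=None,
        names=["longitude", "latitude", "stop_length", "timestamp"],
    )
    # Timestamps follow the "Y-M-D h:m:s" format, already rounded to quarters.
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")
    return df
```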

2.1.2 Parking meters

We need information about location, time and length for each stop but, unfortunately, such information is split among multiple datasets. Indeed, besides the transactions dataset storing the arrival time, we must use two more datasets to calculate the stop location and length, namely the parking meters’ registry (storing locations) and the hourly rates dataset (storing the information required to correctly calculate the stop length from the paid amount).


Transactions dataset

The dataset contains an overall of 3800k records with a size of 380MB. Each record represents a payment made at a certain moment to a certain parking meter. Hence, each entry contains a few useful pieces of information, such as:

• Parking meter’s id
• Transaction timestamp
• Paid amount

A first preprocessing operation2 is required to correctly shape the data, since the column separator and the amount values both use the comma character. Therefore, amounts containing decimal digits are split over two columns, and records do not all have the same shape.
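A minimal sketch of this reshaping step follows; the raw line layout in the comment is an inference from the description above (an amount with decimal digits spills over two fields), so the real files may differ.

```python
def parse_raw_line(line):
    """Re-glue an amount split over two fields by the comma decimal separator."""
    fields = line.strip().split(",")
    if len(fields) == 4:
        # e.g. "10001,2,0,2019-01-01 23:04:00.000" -> amount 2.0
        pm_id, units, decimals, timestamp = fields
        amount = float(f"{units}.{decimals}")
    else:
        # Amount without decimal digits occupies a single field.
        pm_id, units, timestamp = fields
        amount = float(units)
    return int(pm_id), amount, timestamp
```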

After such operation, the records are provided in the format:

(parking meter id: int, paid amount: float, timestamp: string)

Timestamps are provided in the following format:

Y-M-D h:m:s.ms

Table 2.2 reports a few example records.

Parking meter id Amount (€) Timestamp

10001 2.0 2019-01-01 23:04:00.000

10001 1.0 2019-01-02 08:23:00.000

5516 0.9 2019-01-26 18:31:00.000

Table 2.2: Sample of the parking meters’ transactions dataset.

Hourly rates dataset

A parking area is a region of the city where the same hourly rate is applied to all parking lots within it. Each record of the dataset includes the name of an area and the information associated with such area. The rate values hold just for regular days, since free parking is applied during holidays (the rate is “0”).

Table 2.3 reports the complete hourly rate dataset (just 4 areas).

2 Such operation is not exposed in the next preprocessing section since it is required to actually read the data.

Area                       Rate (€)  Validity
Cerchia del Mille          2.4       8:00 – 20:00
Centro Storico             1.8       8:00 – 20:00
Corona Semicentrale        1.5       8:00 – 18:00
Zona Bolognina-Arcoveggio  1.2       8:00 – 18:00

Table 2.3: Hourly rate of each area.

Registry dataset

The dataset stores the registry of parking meters (819 records in total) with information about:

• Parking meter id

• Longitude and latitude coordinates

• Parking area where a parking meter is located

• Other info of no interest (e.g., the human-readable location where the parking meter is placed, in terms of street name and house number)

Table 2.4 reports a few example records.

Id     Longitude  Latitude   Area
10001  11.348345  44.508008  Corona Semicentrale
3032   11.348821  44.485008  Cerchia del Mille
3023   11.346152  44.8913    Centro Storico

Table 2.4: Sample of the parking meters’ registry. Only columns of interest are shown.

2.1.3 Verifiers dataset and seasonal tickets

The verifiers dataset includes all the checks made by police officers during the selected period3, and it contains an overall of 1800k records with a size greater than 500MB.

Each record represents a check made in a certain location at a certain time, and so it includes the following useful info, that is:


• Longitude and latitude coordinates representing the point of space where the check has been registered (and so where the vehicle is approximately parked)
• Timestamp representing the time the check has been registered
• Type of ticket (regular parking meters’ ticket, seasonal tickets, etc.)
• Id of the street (useful to execute aggregated analysis)

The remaining columns report information of no interest, and so they are omitted.

A few ETL transformations are required to generate the dataset from the raw input files. Indeed, even though the data is already organized in columns and separated by commas, it is split among 12 text files (one for each month) that need to be converted to CSV format and then merged. During these operations, a preliminary projection is executed to drop the useless columns.

Besides the complete verifiers dataset, which may be useful during the later performance evaluation stages, the target is also to create a dataset representing the slice of occupied parking lots that cannot be framed using the two aforementioned datasets, that is, stops that did not require any payment thanks to seasonal tickets, hotel conventions, school conventions, etc. Hence, the verifiers dataset is filtered to select only the records of interest, and the resulting dataset’s size is about 100MB, reduced by 80% from the starting one.

When talking about the seasonal tickets data, records are provided in the format:

(longitude: float, latitude: float, timestamp: string, street id: string, ticket type: string)

Timestamps are provided in the following format:

D-M-Y 00:00:00 h:m

Table 2.5 reports a few example records.

Longitude  Latitude  Timestamp                  Ticket type  Street id
11.3412    44.5153   01/10/2019 00:00:00 08:08  ABBONAMENTO  32760
11.316     44.4932   26/11/2019 00:00:00 09:21  ABBONAMENTO  10750
11.316     44.4932   26/11/2019 00:00:00 09:21  INVALIDI     10750

Table 2.5: Sample of the seasonal tickets dataset.


2.2 Time series model

Time Series (TS) with different time intervals and sampling periods are employed to describe the stop trend at different granularity levels.

Table 2.6 reports the adopted taxonomy for different types of TS.

Note: TS spread over multiple years may be used to find high-level behaviors that live beyond the single year (such as an expected lower occupation during the summer period), but this would require a dataset with a historical depth of a few years.

Type              Time interval  Sampling period       Number of values on time axis
Day by quarters   1 day          quarter (of an hour)  96¹
Year by quarters  1 year         quarter               35040–35136²
Month by days     1 month        day                   31
Year by days      1 year         day                   365–366
Year by months    1 year         month                 12

Table 2.6: All types of time series along with respective defining info.

¹ 24 hours per day × 4 quarters per hour.
² 365–366 days per year × 96 quarters per day.

Any value of a given TS represents the number of occupied parking lots at that time. Let us consider, for example, the following TS:

[1, 5, 4, 3, 2, 10, 7, 8, 3, 2]

We have 1 occupied parking lot at time 0, 5 at time 1, 4 at time 2, and so on. A detailed explanation of the day by quarters TS follows. A similar reasoning, omitted for conciseness, can be made for the other types of TS too.

Day by quarters TS

Quarters are indexed in range [0; 95] using the following criteria:

00:00 -> 0, 00:15 -> 1, 00:30 -> 2, ..., 23:45 -> 95

Each record of the dataset contributes as “1” to any quarter the stop covers, and the number of quarters it contributes to is given by the stop length using the formula:

n_quarters = floor(stop_length / 15) + 1


Once the covered quarters for each transaction are calculated, we generate the TS just by counting how many times each quarter appears.

A clarifying example follows: Table 2.7 reports a few records and the respective information required to generate the TS.

              Input info  |        Calculated values           |  Output
Arrival time  Stop length |  Arrival quarter index  # quarters |  Covered quarters
08:00         48          |  32                     4          |  [32, 33, 34, 35]
08:15         63          |  33                     5          |  [33, 34, 35, 36, 37]
23:30         68          |  94                     5          |  [94, 95, 0, 1, 2]

Table 2.7: Examples of covered quarters calculation.

Once we have the stop lengths in terms of covered quarters, we can merge them all and generate the following TS by counting the number of occurrences of each quarter:

[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 1, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

The first element of the TS (i.e., time 00:00 having index “0”) has value “1” since we have only one “0” in the covered quarters array.

The second element of the TS (i.e., time 00:15 having index “1”) has value “1” since we have only one “1” in the covered quarters array.

The element at index “34” has value “2” since we have two values “34” in the covered quarters arrays.

Similarly for all other items of the TS.
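To make the construction concrete, a minimal sketch follows (the function names are illustrative, not the project’s actual code); it rebuilds the TS from the three example stops of Table 2.7.

```python
from math import floor

def covered_quarters(arrival_quarter, stop_length):
    """Indices of the quarters of an hour covered by a stop (wrapping past midnight)."""
    n_quarters = floor(stop_length / 15) + 1
    return [(arrival_quarter + i) % 96 for i in range(n_quarters)]

def day_by_quarters_ts(stops):
    """Build the 96-value day by quarters TS from (arrival_quarter, stop_length) pairs."""
    ts = [0] * 96
    for arrival_quarter, stop_length in stops:
        for q in covered_quarters(arrival_quarter, stop_length):
            ts[q] += 1
    return ts

# The three example stops of Table 2.7:
ts = day_by_quarters_ts([(32, 48), (33, 63), (94, 68)])
assert ts[0] == 1 and ts[34] == 2 and ts[95] == 1
```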

2.3 Preprocessing

2.3.1 Timestamp approximation

An approximation of timestamps is required to execute further analysis, for two main reasons:

• All datasets should use the same time detail on timestamps to correctly merge/compare them. However, “myCicero” timestamps are already rounded to quarters, and so such rounding must be executed on the other datasets too.
• Even if we had minute-level detail for all datasets, such fine-grained information would be overkill, and it should be reduced since there is no point in distinguishing between transactions occurring one minute apart. On the other hand, just dropping the minutes information may be excessive since we may lose dynamics happening within hours. Thus, the quarter approximation is a reasonable tradeoff.

Therefore, we approximate timestamps to the quarter of an hour. There are several possibilities for such an approximation (to the nearest quarter, to the previous one, etc.). Notice that there is no wrong option, and each one would lead to a different modeling of the data. The approximation to the nearest quarter has been chosen4 (Figure 2.1 reports a graphical explanation).

Figure 2.1: Approximation of each minute to the nearest quarter.
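As a sketch of this step, pandas provides exactly this approximation through Series.dt.round; the sample values below come from Table 2.2.

```python
import pandas as pd

timestamps = pd.Series(pd.to_datetime([
    "2019-01-01 23:04:00", "2019-01-02 08:23:00", "2019-01-26 18:31:00",
]))
# Nearest-quarter approximation: 23:04 -> 23:00, 08:23 -> 08:30, 18:31 -> 18:30.
rounded = timestamps.dt.round("15min")
```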

2.3.2 Aggregation of near parking meters

An aggregation of parking meters is required for both data mining and solution development. The problem arises from the distribution of parking meters over the territory, and it is due to the excessive proximity of a few couples of them. This problematic situation occurs in two main cases, that is:

• Two parking meters located at the two sides of a street.

4 Actually, since “myCicero” data is already rounded to quarters, the same approximation should be used to keep the data consistent. Unfortunately, the type of that approximation is not known for the time being.


• Two parking meters located at the opposite corners of a crossroad.

Indeed, the parking meters of a given couple likely share a similar behaviour (intended as the trend of the stops along daylight hours), and so they can be aggregated and reduced to one. Figure 2.2 reports a clarifying example.

Figure 2.2: Trend of stops along the day for three different parking meters (1016, 1045, 10007). 1016 and 1045 are located at the two sides of the same street (“Via Irnerio”) and expose an identical trend. Meanwhile, 10007, exposing a totally different behaviour, is a randomly chosen parking meter far from the above ones.

By aggregating close parking meters, we obtain the following advantages:

• Reduction of the amount of computation to be done.

• Collapse of different items that actually represent the same underlying concept.

This operation is executed iteratively by aggregating couples one by one. When a couple is aggregated, the new parking meter is located in the middle of the segment that connects the two starting points (Figure 2.3), and the resulting parking meter owns one of the two starting ids (it does not matter which one).

The metric used when calculating the distance between any couple of parking meters is the Haversine distance.
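A minimal sketch of the procedure just described follows, assuming the meters are given as an id → (longitude, latitude) mapping; the function names are illustrative.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lon1, lat1, lon2, lat2):
    """Haversine distance between two (longitude, latitude) points, in meters."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def aggregate_close_meters(meters, min_dist=20.0):
    """Iteratively collapse couples closer than min_dist meters, one by one.

    meters: dict mapping parking meter id -> (longitude, latitude).
    The surviving meter keeps one of the two ids and moves to the midpoint
    of the segment connecting the two starting points.
    """
    meters = dict(meters)
    merged = True
    while merged:
        merged = False
        ids = list(meters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if haversine_m(*meters[a], *meters[b]) < min_dist:
                    (lon_a, lat_a), (lon_b, lat_b) = meters[a], meters[b]
                    meters[a] = ((lon_a + lon_b) / 2, (lat_a + lat_b) / 2)
                    del meters[b]
                    merged = True
                    break
            if merged:
                break  # restart the scan after each aggregation
    return meters
```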


Figure 2.3: Two points (blue) and their middle point (red). (x_AB, y_AB) = ((x_A + x_B)/2, (y_A + y_B)/2) = ((2 + 8)/2, (2 + 5)/2).

The starting cardinality of parking meters is 816 (3 already dropped from the starting 819 due to missing values). The minimum allowed distance between two parking meters that prevents them from being aggregated is a parameter. Table 2.8 reports a few attempts at tuning such parameter so that only parking meters with the same behaviour are collapsed.

Minimum allowed distance (m) Number of dropped parking meters

50 122

20 38

10 23

Table 2.8: Number of aggregated parking meters when varying the minimum allowed distance.

By analyzing the behaviour of a sample of the aggregated parking meters, the usage of a 20m distance seems to be a good trade-off. The aforementioned assumption (i.e., merge only parking meters with similar behaviors) cannot be considered true any more when using greater distances (such as 50m), while a few couples may be missed and not aggregated when going under 20m.

Thus, the number of aggregated couples is 38, resulting in 778 final parking meters.

2.3.3 Drop of outliers

The considered territory is included in the ranges [44.47; 44.55] and [11.30; 11.38] for latitude and longitude respectively, and records detected to be outside the domain are dropped.
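A sketch of the corresponding filter, assuming the records sit in a pandas DataFrame with latitude and longitude columns:

```python
import pandas as pd

def drop_spatial_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the records falling inside the considered territory."""
    in_domain = (
        df["latitude"].between(44.47, 44.55)
        & df["longitude"].between(11.30, 11.38)
    )
    return df[in_domain]
```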

By analyzing the “myCicero” dataset, 37 records are detected to be located in different Italian cities, namely:


• Riva del Garda (TN): area 10.83/45.88, 22 records
• Rimini (RN): area 11.71/44.36, 10 records

• Imola (BO): area 12.56/44.06, 5 records

Such outliers are dropped even though it is not clear why they are present in the datasets.

By analyzing the parking meters registry, the ones with ids 3017, 14109 and 15025 are dropped. Such outliers are clearly typing errors (wrong values, missing digits, etc.), but the involved parking meters end up being outside of the domain, and so they are dropped.

In the end, the seasonal tickets dataset is cleaned too. In this case, a bunch of records are dropped since they have latitude or longitude set to “0”. Such records are dropped now because they are located outside the domain, but they obviously are not real outliers (probably, the “0” value has been used as a placeholder to fill missing information).

2.3.4 Filling of missing values

There are four fields to be filled or calculated, namely:

• Stop length for parking meters transactions
• Stop location for parking meters transactions
• Stop length for seasonal tickets
• Stop location for a few seasonal tickets records

Parking meters: stop length and location

The location in terms of longitude and latitude coordinates is not carried with each transaction, but it can be easily retrieved from the registry dataset. Conversely, there is no explicit stop length for parking meters transactions, but it can be indirectly calculated from the paid amount and the hourly rate using the following formula:

length = (amount / hourly rate) * 60    (2.1)

The stop length is expressed in minutes. The hourly rate varies with the parking area where the parking meter is located.

In order to retrieve the missing values, the parking meters registry is firstly joined with the hourly rates dataset on the “area” column, to retrieve the hourly rate associated with each parking meter. Next, the transactions dataset is joined with the registry on the “id” column, in order to retrieve both location and hourly rate for each transaction, and then Formula (2.1) is applied to find out the stop length.

In the end, the transactions dataset may contain a few missing values if some parking meter id in the transactions dataset is not also present in the registry (failed join operation). Such records are simply dropped.
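The chain of joins and Formula (2.1) can be sketched as follows with pandas; all column names are assumptions.

```python
import pandas as pd

def enrich_transactions(transactions: pd.DataFrame,
                        registry: pd.DataFrame,
                        hourly_rates: pd.DataFrame) -> pd.DataFrame:
    # Attach the hourly rate of each parking meter through its parking area...
    registry = registry.merge(hourly_rates, on="area", how="left")
    # ...then attach location and rate to every transaction through the meter id;
    # the inner join silently drops transactions whose id is missing from the registry.
    df = transactions.merge(registry, on="id", how="inner")
    # Formula (2.1): stop length in minutes from the paid amount.
    df["stop_length"] = (df["amount"] / df["hourly_rate"]) * 60
    return df
```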

Seasonal tickets: stop length and location

The main issue is that there is no information about the stop length for seasonal tickets data. However, this value is mandatory to create a TS, and so it is set to 1 minute. Notice that, since the same value is used for all records, the chosen length is not a flaw or an error: any greater value would just modify the TS by widening the “bells” created by the peaks, and so the overall trend would just be blurred while keeping the same overall shape of the time series (Figure 2.4).

Figure 2.4: Trend of stops along the day with stop length set to 1 (left) and 80 (right) minutes, respectively.

All the records contained in the period April 7th – June 11th have missing information about location. This is not a problem when talking about aggregated analysis, since they can be dropped and the remaining dataset is still significant for most of the techniques. However, it may become an issue of primary importance when executing situated analysis or forecasting tasks. Hence, such an aspect must be kept in mind and fixed if required (e.g., by interpolating missing values of the TS).


2.4 Time analysis

2.4.1 myCicero

Figure 2.5 reports three different time series along with respective Pearson autocorrelation values and normality tests (using Q-Q plots).

Figure 2.5: Three different TS along with information about Pearson autocorrelation and normality test using Q-Q plots for daylight hours.

Day by quarters

It is quite easy to identify a neat division between day and night hours, with very low values in the latter. From now on, daylight hours are estimated to be the ones in the interval 8 a.m. – 6 p.m., and most of the analysis will be executed on such period since it is the one of interest.

During daylight hours, the time series is shaped as a bimodal one, since the almost constant occupation trend is interrupted in the middle. This is probably due to the lunch break, and so it is not a novel fact of particular interest.

In the end, the distribution of the data can be assumed to be normal if a few outliers are not considered.

Month by days

The time series does not show any major regularity among different periods of the month, and this is formally confirmed by the Pearson autocorrelation.

Anyhow, we can detect a slight 7-lag autocorrelation. This may point out a similar trend for the days of the week, independently of the month. The factor that may mask such a correlation is the shift of the days as the months proceed (that is, the 1st of a month is not always a Sunday or a Monday, etc.).

Hence, we use a different time series modelling each day of the year (year by days), and we look at the Pearson autocorrelation index (Figure 2.6). As we can notice, the seasonality is now clear, and the period’s length is seven (one week). Moreover, we can conclude there is no sub-seasonality on midweek days.

Figure 2.6: Year by days TS along with associated Pearson autocorrelation index for “myCicero” data.
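As a sketch, the lag-7 Pearson autocorrelation used in this check can be computed directly with pandas; daily_counts is a hypothetical Series holding the year by days TS.

```python
import pandas as pd

def weekly_autocorrelation(daily_counts: pd.Series) -> float:
    """Pearson correlation of the year by days series with its 7-day lagged copy."""
    return daily_counts.autocorr(lag=7)
```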

In the end, the distribution of data can be associated with a normal one except for two outliers with lower values, that is:

• 1st day of the month: a lower number of stops seems to occur on such a day. This is probably due to the combination of holidays and Sundays that occurred in that year, so we just have 8 values out of 12 (January 1st and November 1st are holidays, September 1st and December 1st are Sundays).


• 31st day of the month: obviously, only 7 months out of 12 have such a day, and so it is normal to find a lower value. Hence, it cannot be considered a real outlier.

Year by months

The time series cannot be analyzed to find any recurring pattern since only one year of data is available (e.g., the higher values and growing trend after summer till the end of the year can be either a general trend among multiple years or a recurrent seasonality within the year). However, as we expected, we find a fall in the number of stops in August.

Moreover, we do not have enough values to correctly apply the normality test (or better, to trust its results).

Thus, nothing can be said about autocorrelation and normality, but the respective plots are reported for completeness.

2.4.2 Parking meters

Figure 2.7 reports three different time series along with respective Pearson autocorrelation values and normality tests (using Q-Q plots).


Figure 2.7: Three different TS along with information about Pearson autocorrelation and normality test using Q-Q plots for daylight hours.

Day by quarters

The TS trend is very similar to the one of “myCicero” data, and so similar considerations can be made. The only outstanding element is the presence of a few periodic little spikes that break the “linear trend” of the TS (such aspect is deepened in Section 2.6).

Month by days

Similar considerations to the ones stated for “myCicero” data can be made here too, since similar trends and patterns are observed. Figure 2.8 reports the year by days time series.


Figure 2.8: Year by days TS along with associated Pearson autocorrelation index for parking meters.

Year by months

Despite similar patterns to “myCicero” data, such as the fall in stops in August, there is one main difference, which resides in the number of stops. Indeed, the growth of stops after the summer period does not occur here any more (as may be seen clearly by comparing Figure 2.6 to Figure 2.8). There are three main possibilities that may produce such a gap, namely:

• Parking meters and “myCicero” are used by different classes of people, and so it is normal that they do not share the same trend since they represent different underlying concepts, even though all the other time series create a match between parking meters and “myCicero” with very similar trends.

• The difference is given by a spread of the “myCicero” company and more people using the application, while the parking meters are of public domain and so its usage remains constant. This would imply that there is no real difference in term of habits of the customers.

• The increase is simply given by an outlier period.

None of the three hypotheses can be discarded with the provided data.

2.4.3 Seasonal tickets

Figure 2.9 reports three different time series along with respective Pearson autocorrelation values and normality tests (using Q-Q plots).


Figure 2.9: Three different TS along with information about Pearson autocorrelation and normality test using Q-Q plots for daylight hours.

Day by quarters

The time series shows a well-defined recurrent pattern: high and low values are interleaved during the daylight hours. This is probably due to the checking activity following a regular frequency along the day, and so the trend of stops cannot be trusted to be coherent with the real one. Indeed, while for parking meters and “myCicero” we can exactly model the arrival of a customer and the associated stop by using the timestamp and the stop length, here we know neither the real arrival time nor the length of the stop (the length is mocked as exposed in Section 2.3.4, and the arrival timestamp is the one at which the check was made).

The TS actually takes snapshots of the occupation of parking lots at a given time, while not considering the stop over the passing of time. To better understand this issue, we may just look at night hours, where the recorded number of stops is always exactly zero since checks are not executed during such period. This is not a problem since the occupation during night hours would be low anyway, but such a concept becomes a big issue when considering daylight hours.

Hence, low peaks in daylight hours cannot be assumed to be true since they do not represent a real decrease in the number of occupied parking lots. As exposed in Section 2.3.4, a possible solution is to create a “blurred” TS by setting the length to any value (e.g., 60 minutes) so that low peaks are filled. However, this way we do not obtain any useful information anymore, since the trend becomes almost constant except for a little peak in mid-morning.

Month by days

Similar considerations to the ones already stated in previous analyses can be made. The year by days TS is exploited to assess that the seasonality is seven. The gap in the spring months is not unexpected, since those records have been removed during the outliers dropping stage.

Figure 2.10: Year by days TS along with associated Pearson autocorrelation index.

Year by months

Similar considerations to the ones already stated in previous analyses can be made about seasonality, normality test, etc. There are only two remarkable things that may be underlined, that is:

• As already discovered, the fall during the spring period is due to the missing data in such period.
• There is no clear fall during August, and this implies that the slice of the population that makes use of seasonal tickets is not greatly affected by summer variations.


2.5 Spatial analysis

Parking meters and “myCicero” data are compared on different areas. The area sampling is handmade and, in order to minimize the sampling bias, different areas are chosen considering a few characteristics, such as:

• Downtown areas vs suburbs areas.

• Different shapes: stretched areas, rounded areas, etc.

• Completely random areas vs areas with known potential peculiarities (e.g., university).

It follows an analysis of two of the sampled areas5 over multiple abstraction levels (analysis on the whole area and on sub-areas both). In some cases, a robustness check has been executed too.

2.5.1 University area (north wing)

The considered area, whose schema is reported in Figure 2.11, is the north wing of the university district.

5 Just a few examples are shown since the description of all the analyzed areas would be too long.

Figure 2.11: Schema of the selected area. Blue markers identify parking meters (each one with its own id).

We firstly consider parking meters to understand the trend of stops along the day in Figure 2.12.

We notice different behaviours among different areas. Indeed, as we can see for parking meters 2012, 2013 and 2014, the trend is strongly influenced by the territory peculiarities: they are all located on the same street (S.Giacomo) within the university context, and they share the same monotonically increasing trend. This is not true for other parking meters that are nevertheless located nearby the university.

Moreover, we understand that distance is a tricky dimension. Indeed, d(2012, 2014) > d(2012, 1041)6, but parking meters 2012 and 1041 do not share the same behaviour (while 2012 and 2014 do). Thus, we can conclude that “the smaller the distance, the more similar the behaviour” (and vice versa) is not a fair assumption in general.


Figure 2.12: Trend of stops (expressed using the day by quarters TS) of a few parking meters within the considered area.

Now, we look for any evidence of similar behaviors in “myCicero” data. In particular, we consider “S.Giacomo” street (Figure 2.13) and then we look at the corresponding time series (Figure 2.14).

As we can notice, there is not any correspondence between parking meters and “myCicero” in this area, with the two data sources modelling completely different behaviours.

Moreover, there is no correspondence between the two halves of the street within the “myCicero” data alone either. Indeed, by splitting the street into two halves, the first one exposes a peak in the middle of the day and low levels in the other hours, while the second one has an almost constant trend.


Figure 2.13: S.Giacomo street on the map. The street is split into two halves so that it can be best approximated by the two red rectangles.

Figure 2.14: Day by quarters TS for S.Giacomo street (part 1 on the left, part 2 on the right) using “myCicero” data.

Thus, we can conclude that the two data sources are not interchangeable, nor can an accurate forecast be made by using just one of the two, since different behaviours may appear for the same considered area. The meaning of such difference is still unknown, and it may rely on two different classes of people using parking meters and “myCicero”. However, with the provided information it is impossible to deepen such a speculation.

Moreover, it is quite evident how territory peculiarities influence the trend of stops along daylight hours, and so they should be taken into consideration when developing the solution. Finally, attention must be paid when considering a wide portion of territory, since the behaviour may vary quite suddenly for nearby points of the space.

2.5.2 Area #2

The selected area is a sampling of the downtown having a circular-like shape (Figure 2.15).

Figure 2.15: Area made up by the intersections of a few different streets: Via delle Lame, Via Riva del Reno, Via Marconi, Via Minzoni, SS9. Parking meters (along with their ids) are highlighted by the blue markers.

First of all, the stops trend along the day is reported in Figure 2.16. While for “myCicero” data the behavior is not informative at all (an almost constant time series), the parking meters one provides some novel information about the area, with a higher occupation in the afternoon hours at the expense of the morning ones.


Figure 2.16: Time series day by quarters for “myCicero” and parking meters on the sampled area.

However, the area is still too wide to be considered atomic, since different sub-areas may show completely different patterns, resulting in misleading results. We split the whole area in five sub-areas, namely:

• “Via Gardino” and surrounding area
• “Via Caduti del Lavoro” and surrounding area
• “Via Castellaccio” and surrounding area
• “Via Riva del Reno” and surrounding area
• “Via delle Lame” and surrounding area

It is confirmed that different areas expose completely different behaviors, and such behavior is less informative as the number of parking lots inside the area increases.

Riva del Reno

The “Riva del Reno” area, whose behavior is reported in Figure 2.17, exposes the least informative behaviour among the five areas, and it is the only one containing a big parking facility (more than 100 parking lots, map in Figure 2.18). This may underline a correlation between the number of parking lots and the information a time series can provide (a deepening in such direction for the whole territory is executed in Chapter 5).


Figure 2.17: “Riva del Reno” area, day by quarters TS for both “myCicero” and parking meters.

Figure 2.18: Street view of “Riva del Reno” area.

Via delle Lame

Boundary and behavior of the area are reported in Figures 2.19 and 2.20, respectively. The parking meters within such area are {5017, 5018, 5030, 5031}. As we can notice, “myCicero” and parking meters follow a compliant trend, with an increase of stops during the afternoon hours.


Figure 2.19: “Via delle Lame” area.

Figure 2.20: “Via delle Lame” area, day by quarters TS for both “myCicero” and parking meters.

Now, we execute a robustness check by adding some jitter to the area positioning. The goal is to detect if the behavior substantially changes and how much the positioning may affect the results.

The initial area is identified by the rectangle with vertices: bottom-left = (44.5005, 11.3337), top-right = (44.5022, 11.3345). The jitter moves the area out of phase along each axis by a quantity equal to half of its length, that is, ±0.0008 on latitude and ±0.0004 on longitude. Results reported in Figure 2.21 show that:

• LEFT: the behavior changes in a non-negligible way, since novel parking lots, now included in the area, influence the results.
• RIGHT: similar behavior with little variations. By moving the cell to the right, the introduced portion of the area does not contain paid parking lots at all, and so it is irrelevant for the analysis; meanwhile, a few parking lots are lost on the left, but such a loss does not compromise the behavior, which is still captured by the remaining ones.
• UP: almost the same behavior, with similar considerations to the RIGHT case.
• DOWN: behavior changed in a non-negligible way, with similar considerations to the LEFT case.

Thus, we can conclude that the identified behavior highly depends upon a correct positioning of the area. Obviously, areas containing no paid parking lots will not falsify the results, and so they can be put in one region rather than another. Hence, it is confirmed that a random division of the territory, without considering the road network and/or the districts’ boundaries, is not the best solution at all.


Figure 2.21: “Via delle Lame” behavior (center) and the four out-of-phase ones using “myCicero” data.

Via Castellaccio

In the end, the “Via Castellaccio” area (map in Figure 2.22, behavior in Figure 2.23) is reported to underline two concepts, namely:

• There is no compliance at all between the trends, with the two data sources exposing opposite behaviors (peak of stops in the morning for “myCicero” against peak in the afternoon for parking meters).
• The area has been selected since there is a small number of stops and, at the same time, it is one of those having the most informative behavior. As stated before, a correlation may exist between the number of parking lots and the information provided by the trend of stops.


Figure 2.22: Map of “Via Castellaccio” area.

Figure 2.23: “Via Castellaccio” area, day by quarters TS for both “myCicero” and parking meters.

2.5.3 Conclusions

We can summarize the following findings from the previous analysis, namely:

• Usage of a random division of the territory might be impossible if the solution must be focused on effectiveness, since it does not consider local territory peculiarities, which emerged to be of primary importance. Indeed, the correct positioning of regions is essential, and little variations such that regions with different behaviors are intersected and/or mixed make the solution lose effectiveness.

• Euclidean distance is a tricky dimension when dealing with this domain: “the smaller the distance, the more similar the behavior” (and vice versa) is not a fair assumption in general, since local peculiarities create behavioral patterns in clusters of arbitrary shapes. Hence, equal regions having trivial shapes (circle, rectangle, etc.) for all the territory will not fit the problem well. Moreover, the regions’ size must be accurately tuned and kept very small (edges over 150m might already mix different behaviors) but, on the other hand, creating a division using too short edges would generate a huge number of regions, and this might affect feasibility in terms of computational complexity and of the quantity of data required to correctly generalize assumptions (i.e., too many regions correspond to too meager a quantity of data for each region).

• Given an area, parking meters and “myCicero” data are not compliant in terms of behavior during daylight hours in most of the cases. Hence, the usage of “myCicero” data alone might be impossible if the solution must be focused on effectiveness. Notice that this is not due to a wrong territory split, since the analysis in this direction has been executed starting from parking meters (which partially reflect the street network that, in turn, reflects territory features), and then considering “myCicero” data in such portion of the territory.

• Parking meters and “myCicero” data become more and more similar as the size of the considered area grows and as the number of parking lots in such area increases.

2.6 Miscellaneous

Very interesting results emerged from further analyses. However, the utility of such results may live aside from the current project itself, or they have no precise classification, and so they are reported in this dedicated section.

2.6.1 Arrivals vs Stops

The datasets have also been analyzed by counting the arrivals instead of modeling the stop along time (merely setting the stop length to 1 minute so that each transaction is associated with exactly one quarter of an hour).


The main goal is to understand if such a simplified model still correctly shapes the behavior of the stops (and so whether the TS model exposed in Section 2.2 is an overkill).

The comparison of the two models is shown in Figure 2.24 using the day by quarters time series for both parking meters and “myCicero”. As expected, the time series trends for arrivals and stops are quite different.

Rather, the most interesting aspect emerges from the arrivals time series of parking meters. Indeed, it is surprising how an exact seasonality of length four comes out (the Pearson autocorrelation diagram is not reported for conciseness), with the values arranged in a “lightning bolt” shape. We have a local minimum and maximum associated with the first and fourth quarters, respectively. The maximum-minimum difference in terms of number of arrivals is too high to be associated with fluctuations (almost a 3x factor), and this pattern is probably explained by the fact that most planned activities (meetings, lessons, openings, etc.) start at the stroke of the hour, and so people tend to arrive a few minutes before. Consequently, the minimum is reached just after, since the aforementioned activities have just begun and few people are expected to arrive.

Moreover, we can suppose that people using parking meters and “myCicero” aim for different goals, and such aspect partially emerged in Section 2.5, when it was found that the time series of parking meters and “myCicero” often do not share the same behavior along the day when executing situated analysis on restricted areas. Such hypothesis is now strengthened, since there is no strong seasonality in “myCicero” data, which means that the above motivations do not hold for “myCicero” customers. On the contrary, they probably are usual customers that use the application for various purposes (against occasional customers for parking meters).


Figure 2.24: Comparison between arrivals (first column) and stops (second column) for both parking meters (first row) and “myCicero” (second row).

2.6.2 Stop length and late afternoon factor

The subject of the analysis is the stop length, and the distribution of its values is reported in Figure 2.25 for both “myCicero” and parking meters.

Firstly, we notice that “myCicero” data is almost entirely concentrated in a few values corresponding to multiples of hours (60 minutes, 120 minutes, etc.), while this does not happen for parking meters values. This is indirectly due to the “myCicero” application logic, which hints a default length of one hour to users, and such length seems to be kept by most users since the exceeding amount is refunded if they leave before the validity of the stop expires (for example, a user activates a stop with length 60 minutes but leaves after 48 minutes: the cost of the remaining 12 minutes will be paid back). Keeping such principle in mind, a few users may select two or more hours if they are going to stay that long, just by tapping on the hours knob, and so the minutes knob is not used much.

When talking about parking meters, the main subject is the user and the coins at their disposal. Hence, there is a different type of reasoning, no longer focused on the stop length expressed using a time measure (minutes and hours).


Figure 2.25: Distribution of stop lengths for “myCicero” and parking meters. Just stops with length less than or equal to 180 minutes are considered (a few tail values are dropped so that a nice visualization can be provided).

Given the previous considerations, the goal is to discover whether different types of user exist. In particular, the main question is about which type of user is going to set a stop length detailed to the exact number of minutes, and why. We divide the stops in two groups by their length: the ones whose length is a multiple of hours and the ones whose length is not. The stops trend is reported in Figure 2.26.

As we can notice, there is a peak in the late afternoon for the stops whose length is not a multiple of hours, and such peak dramatically increases when considering only stops shorter than one hour. Moreover, the major peak before 6 p.m. is followed by a minor one just before 8 p.m.

On the contrary, the peaks for stops with length multiple of hours are mainly distributed in the morning hours.

Such a phenomenon, which actually can be slightly found in parking meters data too, where the peak exists even though it is not so high (figure not reported for conciseness), can be addressed to the fact that the paid period ends at 6 p.m. or 8 p.m. depending on the area, and so people tend to do shorter stops during the previous hour to avoid a waste of money. However, while this is justifiable for parking meters, it should not happen for “myCicero”, since any minute in excess of the paid period is not going to be charged.

The cause of the abnormal peak probably resides in the logic of the application. Indeed, the default suggested length is one hour even in the cases when the paid period expires earlier (e.g., the paid period expires at 6 p.m., the user arrives at 5:20 p.m., and a one-hour stop would straddle the paid and free parking periods). In these cases, even though the application is not going to charge the customer for the exceeding minutes, users normally set the stop length so that an exact length till 6 p.m. (or 8 p.m.) is obtained. In order to do that, the minutes are fine-tuned using the proper knob in the application, and so stops with the most disparate minutes emerge (e.g., a user arrives at 5:24 p.m. and sets a 36-minute stop to avoid a possible extra charge of 24 minutes).

This means that most users do not trust the application to charge the correct amount when the stop straddles the paid and free parking periods. Hence, in order to solve such a problem, the hinted length may be the one till the end of the paid period instead of the regular value of 60 minutes (or at least, without changing the application logic, a textual feedback may be displayed to reassure users about the safety of the operation).

Figure 2.26: Different day by quarters time series selecting only particular stop lengths.

3 Problem formalization

We should find a high-level indication of the parking occupation in a given area A at a given time t. Obviously, such a value cannot be estimated once for all the territory. Therefore, the domain must be split into several regions so that a different occupation value can be estimated for each one.

Section 3.1 describes the way the occupation status is calculated, while Section 3.2 presents different approaches that can be exploited to obtain a division of the territory.

3.1 Occupation status estimation

Given a time t and an area A, the goal is to obtain a categorical occupation status for the area A at time t. In order to find such an occupation status, we first calculate the Occupation Percentage Value (OPV), that is, a numerical value in the range [0, 1] representing the fraction of occupied parking lots, and then we map the OPV to a categorical value.
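For illustration purposes, a minimal sketch of such a mapping is shown below; the thresholds and the status labels are assumptions made here for the example, not the ones actually adopted by the system.

def occupation_status(opv: float) -> str:
    # Hypothetical thresholds; the real mapping is defined later in this chapter.
    if opv < 0.5:
        return "FREE"
    if opv < 0.8:
        return "BUSY"
    return "FULL"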

3.1.1 OPV estimation

For the time being, the number of occupied parking lots can be modeled by the sum of the following three components: “myCicero”, parking meters, and seasonal tickets. Hence, the optimal solution would be calculated using the following formula:

\[
OPV(t) = \min\left\{1,\; \frac{MC(t) + PM(t) + ST(t)}{C}\right\} \tag{3.1}
\]

where:


MC(t) = estimated number of occupied parking lots for “myCicero” at time t;
PM(t) = estimated number of occupied parking lots for parking meters at time t;
ST(t) = estimated number of occupied parking lots for seasonal tickets at time t;
C = capacity of the area in terms of number of parking lots¹.

The number of occupied parking lots is not a ground-truth value, but it can be forecasted with any suitable method. In our case, it will be formalized as a regression problem, and the TS model already explained in Section 2.2 will be used to shape the data.

The capacity of the area is not known either, and it must be calculated. For safety, the “min” operator is used to bring the final OPV back into [0, 1].

As the project constraints impose, it must be possible to easily add further data sources (or update the employed ones) to an already existing system, and so Formula (3.1) cannot be directly applied, since all data would be mixed together² and the capacity would be calculated once and for all. Hence, each data source should be considered separately, and this does not represent a problem when forecasting the number of occupied parking lots (it suffices to create three different time series and execute a forecast on each one).

The question arises when considering the capacity, because it cannot be estimated as a unique value, and so partial capacities are going to be calculated for each data source. This way, each data source has its own estimated capacity and is completely independent from the others.

Given a region, we use the following formula to merge the partial results into the final OPV:

\[
OPV = \min\left\{1,\; \frac{\sum_{i=1}^{n} \mathit{forecast}_i}{\sum_{i=1}^{n} \mathit{capacity}_i}\right\}
\]

where:

n = number of employed data sources;
forecast_i = estimated number of occupied parking lots for the i-th data source;
capacity_i = partial capacity of the i-th data source.

¹ Obviously, it is intended as the number of paid parking lots only.

² Indeed, all datasets would be merged, and a unique time series representing the overall occupation would be built.


Simply put, the idea is to calculate the total number of occupied parking lots by summing the partial forecasts of each data source, and then to divide the result by the sum of the partial capacities.

Notice that this is equivalent to a weighted arithmetic mean of the individual OPVs, with weights imposed by the partial capacities.

The different weighting is required since the amount of data can differ greatly from data source to data source.

For example, let us consider two data sources where we have 3 out of 10 occupied parking lots and 80 out of 100, respectively. A simple mean would lead to a resulting OPV of 0.55 ((0.3 + 0.8) / 2), while the weighted one returns an OPV of 0.75 ((80 + 3) / (100 + 10)), implicitly giving more importance to the prominent data source; this intuitively better represents the real occupation.
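A minimal sketch of the merge, assuming hypothetical names (merge_opv, the input lists), is the following; it reproduces the numbers of the example above.

from typing import Sequence

def merge_opv(forecasts: Sequence[float], capacities: Sequence[float]) -> float:
    # Capacity-weighted merge of per-source forecasts into a single OPV.
    total_forecast = sum(forecasts)
    total_capacity = sum(capacities)
    # The "min" brings the value back into [0, 1] for safety, as in Formula (3.1).
    return min(1.0, total_forecast / total_capacity)

# Two data sources: 3/10 and 80/100 occupied parking lots.
print(merge_opv([3, 80], [10, 100]))  # ~0.75, instead of the plain mean 0.55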

3.1.2 Forecasting model: Prophet

The “Prophet” forecasting model is the one adopted to estimate the number of occupied parking lots. The model, developed by Sean J. Taylor and Benjamin Letham in 2018 [11] and based on the previous work of A. C. Harvey and S. Peters [12], has been specifically designed to capture recurrent business factors that other common approaches may not consider without the development of ad hoc solutions.

The main distinguishing feature of the model is that it straddles the statistical forecasting approach and the judgemental one, thus attempting to inherit the advantages of both, namely:

• Flexibility: the model can be easily tuned to handle multiple seasonalities and trend changes. This is a key factor that emerged during the previous data analysis stage (see Chapter 2).

• Computational complexity: the computation time of the training stage is limited. This feature is of primary importance because hundreds of models may potentially be created (one for each different region).

• Explainability: the model allows each additive component to be easily tuned, thus producing interpretable changes in the solution.

• Sampling time: a regular sampling over time is not required, and so sparse missing values do not need to be interpolated. Taking into account that full observations are available for the training stage, this peculiarity may be relevant in the condition where the data of a specific region is temporarily unavailable.


The main idea is to use an additive model to decompose a time series into four components:

\[
y(t) = g(t) + s(t) + h(t) + \epsilon_t \tag{3.2}
\]

where:

y(t) = value assumed by the time series at time t;
g(t) = function that models the non-periodic changes, i.e., the trend;
s(t) = function that represents the periodic changes, i.e., the seasonality;
h(t) = bias function that represents the effect of holidays;
ε_t = error term (any idiosyncratic changes not accommodated by the model).

Trend component

Two different designs of the trend are proposed: a saturating growth model (non-linear) and a piecewise linear trend.

The piecewise linear trend was adopted in order to simplify the model and alleviate potential overfitting.

The trend changes its growth rate periodically depending on various factors (especially when dealing with business aspects), and so the change-point concept must be defined. A change-point is a point in time where the high-level trend changes significantly.

Let us consider S change-points at times s_j, j = 1, ..., S, and a base growth rate k. The linear model considers a constant growth rate between any two change-points, and the variations of the growth rate are stored in a vector δ, where δ_j is the rate change that occurs at change-point s_j³. Hence, the trend at time t_i is calculated by starting from the base rate k and summing all the changes occurred up to t_i.

The total trend component is calculated using Formula (3.3).

\[
g(t) = \left(k + \mathbf{a}(t)^T \boldsymbol{\delta}\right) t + \left(m + \mathbf{a}(t)^T \boldsymbol{\gamma}\right) \tag{3.3}
\]

where:

m = offset parameter used to connect the endpoints of the adjacent segments identified by the change-points;
γ = vector used to make the function continuous (γ_j = −s_j δ_j).

³ Notice that δ_j is the change relative to the previous rate, not the absolute rate value.


A proper vector a(t) ∈ {0, 1}^S is defined so that the rate at time t is simply k + a(t)^T δ, where:

\[
a_j(t) =
\begin{cases}
1, & \text{if } t \geq s_j \\
0, & \text{otherwise}
\end{cases}
\]

Change-points can be selected either manually or automatically. In the latter case, a sparse prior Laplace(0, τ) is added to δ so that few change-points are selected from a large number of candidates (conceptually equivalent to L1 regularization). The parameter τ controls the flexibility of the model in altering its rate (the higher the τ, the larger the trend variations the model can impose).
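To make the trend component concrete, the following NumPy sketch evaluates Formula (3.3) on made-up values; the change-point times s and rate changes delta are illustrative assumptions, not fitted parameters.

import numpy as np

def piecewise_trend(t, k, m, s, delta):
    # g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma)
    a = (t[:, None] >= s[None, :]).astype(float)  # a_j(t) = 1 if t >= s_j
    gamma = -s * delta                            # gamma_j = -s_j * delta_j keeps g continuous
    return (k + a @ delta) * t + (m + a @ gamma)

t = np.linspace(0, 10, 101)
g = piecewise_trend(t, k=1.0, m=0.0,
                    s=np.array([3.0, 7.0]),
                    delta=np.array([0.5, -1.2]))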

Seasonality component

The seasonality component summarizes information about seasonalities at different granularity levels (daily, weekly, etc.).

A periodic function of time s(t) is required to capture recurrent patterns, and a Fourier series is adopted to approximate it (Formula (3.4)). Indeed, the terms of the series may be seen as refining patterns at different time granularities, so increasing the number of terms allows the model to consider shorter seasonalities. On the other hand, a higher number of terms obviously increases the risk of overfitting.

\[
s(t) = \sum_{n=1}^{N} \left[ a_n \cos\!\left(\frac{2\pi n t}{P}\right) + b_n \sin\!\left(\frac{2\pi n t}{P}\right) \right] \tag{3.4}
\]

where:

N = number of terms;
P = seasonality period in days (e.g., 7 for weekly patterns);
a_n, b_n = parameters to be tuned by the model.
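The following sketch evaluates the truncated Fourier series of Formula (3.4) for a weekly seasonality; the coefficients a_n, b_n are illustrative values, while in practice they are tuned by the model.

import numpy as np

def fourier_seasonality(t, P, a, b):
    # s(t) as a truncated Fourier series with N = len(a) terms.
    n = np.arange(1, len(a) + 1)             # term indices 1..N
    angles = 2 * np.pi * np.outer(t, n) / P  # shape (len(t), N)
    return np.cos(angles) @ a + np.sin(angles) @ b

t = np.linspace(0, 14, 200)  # two weeks, time expressed in days
s = fourier_seasonality(t, P=7,
                        a=np.array([0.4, 0.1]),
                        b=np.array([-0.2, 0.05]))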

Holiday component

The holiday component simply biases the values in correspondence with any holiday. A hyper-parameter k_i (holidays prior scale), which represents the effect on the predicted value, is to be defined for each detected holiday. Such an effect can be either an increase or a decrease depending on the domain (in our case, it will be a decrease, since parking is free during holidays).
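As a usage sketch, fitting such a model on an occupation time series could look as follows. The input file name and the holiday dates are hypothetical; only the Prophet API calls are real (older installations expose the same API under the fbprophet package name).

import pandas as pd
from prophet import Prophet

# Hypothetical training data: Prophet expects a dataframe with columns
# "ds" (timestamp) and "y" (observed number of occupied parking lots).
df = pd.read_csv("occupied_lots.csv", parse_dates=["ds"])

# Holidays to be modeled; in our domain their effect is a decrease,
# since parking is free during holidays.
holidays = pd.DataFrame({
    "holiday": "public_holiday",
    "ds": pd.to_datetime(["2019-12-25", "2020-01-01"]),
})

model = Prophet(
    changepoint_prior_scale=0.05,  # the tau controlling trend flexibility
    holidays=holidays,
    daily_seasonality=True,
    weekly_seasonality=True,
)
model.fit(df)

future = model.make_future_dataframe(periods=24, freq="H")
forecast = model.predict(future)  # includes yhat, trend, and holiday effects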

