
Analyzing Customer Journey with Process Mining: from Discovery to Recommendations

Master Thesis

Alessandro Terragni

879112

Master of Science Degree in Computer Science and Engineering

Supervisors: dr. Monica Vitali, dr. ing. Marwan Hassani


Nowadays, the customer journey is considered one of the hot topics in the marketing world. Indeed, understanding how customers behave within a website is crucial and is considered one of the most important factors for the success of a web business. In the literature, a data-driven approach to analyze the customer journey has not been developed yet. For instance, web analytics tools such as Google Analytics provide an oversimplified version of the user behaviour and are not able to extract a process model that describes the customer journey in its entirety. This thesis therefore contributes a novel approach for applying process mining techniques to customer journey analysis. Moreover, this thesis provides a second contribution by bridging the gap between the process mining and recommender systems worlds. Indeed, recommender systems techniques provide personalized recommendations to users based on their behaviour, thus tailoring the journey to their preferences. This concept is tightly linked to process mining techniques, which allow the customer journey to be visualized, analyzed and optimized. Consequently, this thesis proposes a novel framework that joins these two worlds, formalizing the use of recommendation algorithms in the context of a process mining analysis. First, through process mining techniques, it is possible to identify particular customer journey paths that can be enforced to optimize one or more KPIs (Key Performance Indicators). Then, considering the customer journey as implicit feedback provided by the user, recommender systems algorithms can recommend to users particular actions that optimize the previously selected KPI. Finally, the results can be analyzed again with process mining techniques, checking that the provided recommendations have actually changed the customer journey in the desired way. To demonstrate the validity of the proposed concepts, the framework has been tested on a real case study: starting from a real web log, several process models have been extracted and evaluated; subsequently, the obtained information has been used to select and optimize a KPI through personalized recommendations. In particular, by including the concept of sequence in the recommendation algorithms, we were able to improve the CTR (click-through rate) by 10% compared to state-of-the-art algorithms.

A position paper, containing the preliminary results of this thesis, has been published in the IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud-2018), 6-8 August 2018, Barcelona, Spain. Moreover, another paper, containing the complete results, has been submitted to the 34th ACM/SIGAPP Symposium On Applied Computing: SAC 2019, 8-12 April 2019, Cyprus.


Customer journey analysis is a hot topic in marketing. Understanding how customers behave is crucial and is considered one of the key drivers of business success. To the best of our knowledge, a data-driven approach to analyze the customer journey is still missing. For instance, web analytics tools like Google Analytics provide an oversimplified version of the user behavior, focusing more on the frequency of page visits rather than discovering the process of the visit itself. On the other hand, customer journey maps have shown their usefulness, but they need to be created manually by domain experts. This thesis contributes a novel approach for applying process mining techniques to web log customer journey analysis. Through process mining it is possible to (i) discover the process that best describes the user behavior, (ii) find useful insights, and (iii) discover and compare the processes of different behavioural clusters of users. Moreover, this thesis makes a second contribution by bridging the gap between the process mining and recommender systems worlds. Indeed, recommender systems techniques provide personalized recommendations to users based on their behaviour and tailor the customer journey to their preferences. This concept is tightly linked with the process mining world, which makes it possible to visualize, analyze and improve the customer journey. Consequently, this thesis proposes a novel formalization to combine these two worlds, in order to use recommender system algorithms in the process mining scenario. Firstly, through process mining, it is possible to identify particular customer journey paths that can be enforced to optimize some KPIs (Key Performance Indicators). Then, with recommender systems algorithms, it is possible to recommend to users particular actions that will optimize the selected KPIs, using the customer journey as implicit user feedback. Finally, the results can be analyzed to discover the new customer journey model and check whether the recommendations were actually able to divert the journey in the desired way. The correctness of the introduced concepts is demonstrated through a real-life case study, by showing and evaluating the process models discovered from a real web log, and then using the information extracted from the process models to select and optimize a KPI via personalized recommendations. In particular, by including the event sequence as additional information, we were able to increase the recommendation click-through rate by 10% compared to state-of-the-art collaborative filtering algorithms.

A position paper [61], containing the preliminary results of this thesis, has been accepted and published in the IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud-2018), 6-8 August 2018, Barcelona, Spain. Moreover, another paper, containing the complete results of this thesis, has been submitted to the 34th ACM/SIGAPP Symposium On Applied Computing: SAC 2019, 8-12 April 2019, Cyprus.


This thesis represents the last milestone of the wonderful journey of the EIT Digital Double Degree Master in Data Science (Politecnico di Milano / Eindhoven University of Technology). The project was undertaken at the request of S.P, a company based in Eindhoven, where I was engaged in researching and writing this dissertation from February to August 2018.

I would like to thank all the people that contributed to this achievement, starting from my company supervisors Alessio and Eugenio to my university supervisors Marwan and Monica, who guided and supported me over these six months.

My parents and my family deserve a particular note of thanks: your wise counsel and kind words gave me the motivation to never give up. Without your inspiration, drive, and support I might not be the person I am today.


Contents

List of Figures
List of Tables

1 Introduction
   1.1 Thesis Structure
   1.2 Academic Publications Related to this Thesis

2 Literature Analysis
   2.1 Introduction to Customer Journey
      2.1.1 Customer Journey Maps Overview
      2.1.2 Overview of Web Usage Mining Techniques in Customer Journeys
   2.2 Introduction to Process Mining
      2.2.1 Event Log
      2.2.2 Process Discovery
      2.2.3 Conformance Checking
      2.2.4 Process Enhancement
      2.2.5 Trace Clustering
      2.2.6 Process Mining and Customer Journey Maps
      2.2.7 Process Mining Applied to Web Logs
      2.2.8 Process Mining in the Recommendation Environment
   2.3 Introduction to Recommender Systems
      2.3.1 Collaborative Filtering
      2.3.2 Evaluation Metrics
      2.3.3 Collaborative Filtering for Implicit Feedback Data Sets
      2.3.4 BPR: Bayesian Personalized Ranking from Implicit Feedback
      2.3.5 Scalable and Interpretable Product Recommendations via Overlapping Co-Clustering
      2.3.6 Capturing Browsing Interests of Users into Web Usage Profiles

3 Problem Formulation

4 Customer Journey Analysis with Process Mining
   4.1 Preliminaries
      4.1.2 Web Logs Characteristics
   4.2 Web Log - Event Log Mapping
   4.3 Data Filtering
   4.4 Discovering the Process Model
   4.5 Discovering Behavioural Patterns
      4.5.1 Bag of Activities Methods
      4.5.2 Sequence-based Distances Methods
      4.5.3 Creating Personalized Sub-logs
   4.6 Decision Point Mining
   4.7 Bottlenecks Analysis

5 Generating Personalized Recommendations
   5.1 Notation
   5.2 Using Process Mining to Find a KPI to Optimize
   5.3 Recommending Actions with Implicit Feedback Recommender Systems Algorithms
   5.4 Integrating Sequences in Implicit Feedback Recommender Systems Algorithms
      5.4.1 Memory-Based Sequence-Aware Methods
      5.4.2 Model-Based Sequence-Aware Methods
   5.5 Analyzing Recommender Systems Results with Process Mining

6 Application Scenario and Data Set
   6.1 Application Scenario
   6.2 Data Set Used for the Experiments

7 Results
   7.1 Results of Customer Journey Analysis with Process Mining
      7.1.1 Web Log - Event Log Mapping
      7.1.2 Web Log Data Filtering
      7.1.3 Discovered Customer Journey Process Models
      7.1.4 Discovered Behavioral Patterns
   7.2 Recommender Systems Results
      7.2.1 Offline Results of Implicit Feedback Recommender Systems Algorithms to Optimize a Process KPI
      7.2.2 Offline Results of Sequence-Aware Recommender Systems
      7.2.3 Online Results of Recommender Systems

8 Conclusions
   8.1 Future Directions

Bibliography

Appendix

A Process Mining Tools
   A.1 ProM
   A.2 RapidProM
   A.4 Other Process Mining Tools

B Process Models Images

C Further Recommendation Results
   C.1 Recommendation Results with Non Registered Users Data Set
   C.2 Recommendation Results with Open Source Data Sets

D Published Papers
   D.1 FiCloud 2018: The IEEE 6th International Conference on Future Internet of Things and Cloud

List of Figures

2.1 Customer behavior model graph
2.2 Google Analytics behavior flow example
2.3 Overview of process mining spectrum
2.4 Petri net describing the main behavior present in the event log of Table 2.2
2.5 Conformance checking criteria
2.6 Conformance checking analysis example
2.7 Trace clustering example
2.8 XML structure of a CJM
2.9 Recommendation service overview
2.10 Possible abstraction of cases
2.11 User rating matrices example: explicit rating (left), implicit rating (right)
2.12 ROC curve example
2.13 Alternating least square overview
2.14 Overlapping user-item co-clusters example
2.15 Website hierarchical structure example
3.1 Methodology overview
4.1 Methodology overview: web log - event log mapping
4.2 Methodology overview: discovering the process model
4.3 Methodology overview: discovering behavioural patterns
4.4 Overview of trace clustering
4.5 Graphical representation of the edit distance between traces tr1 and tr2
4.6 Methodology overview: decision point mining
4.7 Decision point example
4.8 Decision tree explaining the choice of the Petri net in Figure 4.7, learned from Table 4.6
4.9 Petri net of Figure 4.7 with guards extracted from the decision tree of Figure 4.8
4.10 Methodology overview: bottlenecks analysis
5.1 Methodology overview: identifying KPIs to optimize
5.2 Process model example for recommendations
5.3 Methodology overview: generating personalized recommendations
5.4 User rating matrix hold out example
7.1 Spaghetti-like model generated from the whole filtered event log
7.3 Fuzzy model of cluster 1
7.4 Fuzzy model of cluster 2
7.5 Fuzzy model of users navigating with a PC
7.6 Fuzzy model of users navigating with a smart-phone
7.7 ALS and BPR AUC variations (registered users data set)
7.8 ALS and BPR MAP@5 variations (registered users data set)
A.1 Process mining tools map
A.2 ProM logo
A.3 RapidProM logo
A.4 Disco logo
B.1 Petri net generated by heuristic miner
B.2 Petri net generated by inductive miner
B.3 Petri net generated by inductive miner from the edit distance clustering (cluster 1)
B.4 Petri net generated by inductive miner from the edit distance clustering (cluster 2)
B.5 Petri net generated by inductive miner from the sub-log of PC users

List of Tables

2.1 Event log example
2.2 Car fines event log example
2.3 CJM - XES correspondence
2.4 Web log example
2.5 Recommender systems confusion matrix
4.1 Web log example
4.2 Event log example
4.3 Url action pairing example
4.4 Web log - event log mapping
4.5 Process discovery algorithms comparison
4.6 Decision mining table example
5.1 User rating matrix filled with page visit frequencies
5.2 Event log example for recommendations
5.3 User rating matrix example built on the model of Figure 5.2 and the event log of Table 5.2
5.4 Event log example for memory-based sequence-aware methods
5.5 User rating matrix example for memory-based sequence-aware methods
5.6 Users traces from the event log of Table 5.4
5.7 Cosine similarity and edit similarity of user 1
5.8 Event log example for model-based sequence-aware methods
6.1 Extract of web log
7.1 Extracted actions
7.2 Conformance checking results of Petri nets generated from the whole filtered event log
7.3 Conformance checking results of the Petri nets generated from the clusters
7.4 Conformance checking results of inductive miner on the personalized sub-logs
7.5 Offline results of state-of-the-art recommendation algorithms (registered users data set)
7.6 Results of kNN algorithm using a combination of cosine and edit similarity as user similarity (registered users data set)
7.7 ALS results with $r_{c,p}$ as input (registered users data set)
7.8 BPR results with $r^{*}_{c,p}$ as input (registered users data set)
C.1 Offline results of state-of-the-art recommendation algorithms (non registered users data set)
C.2 ALS results with $r_{c,p}$ as input and $r_{c,p}$ = absolute page visit frequency (non registered users data set)
C.3 ALS results with $r^{*}_{c,p}$ as input and $r_{c,p}$ = relative page visit frequency (non registered users data set)
C.4 BPR results with $r_{c,p}$ as input and $r_{c,p}$ = absolute page visit frequency (non registered users data set)
C.5 BPR results with $r_{c,p}$ as input and $r_{c,p}$ = relative page visit frequency (non registered users data set)
C.6 ALS results with $r^{*}_{c,p}$ as input and boolean $r_{c,p}$ (online retail data set)
C.7 BPR results with $r_{c,p}$ as input and boolean $r_{c,p}$ (online retail data set)
C.8 ALS results with $r_{c,p}$ as input and boolean $r_{c,p}$ (Last.fm data set)

Abbreviations

CJM    Customer Journey Map
CBMG   Customer Behavior Model Graphs
XML    eXtensible Markup Language
XES    eXtensible Event Stream
IEEE   Institute of Electrical and Electronics Engineers
BPMN   Business Process Model and Notation
EPC    Event-driven Process Chain
YAWL   Yet Another Workflow Language
URL    Uniform Resource Locator
IP     Internet Protocol
PCA    Principal Component Analysis
SVD    Singular Value Decomposition
MAE    Mean Absolute Error
RMSE   Root Mean Squared Error
P@K    Precision at K
R@K    Recall at K
AUC    Area Under Curve
ROC    Receiver Operating Characteristic
kNN    k Nearest Neighbor
BPR    Bayesian Personalized Ranking
OCuLaR Overlapping co-CLuster Recommendation algorithm
CTR    Click Through Rate


1 Introduction

Customer journey analysis is an extremely useful exercise for companies aiming at understanding and improving the customer experience for their users. Analyzing how customers navigate their journey is critical to assessing and meeting their needs, because users demand an increasingly personalized experience, tailored to their exact preferences. Nowadays, thanks to the data-driven transformation that most companies are facing, analyzing the customer journey has become a central concern. For this reason, customer journey maps (CJMs) have become very popular. These diagrams graphically represent the story of the customer, showing all the touch-points (s)he comes into contact with. In particular, they allow companies to get an overall idea of how the customer behaves, to compare this behavior with what they expect, and to make the user experience more personalized and unique. Nevertheless, despite the popularity of CJMs, a scientific, repeatable formalization has not been deployed yet, resulting in many different models that are neither consistent nor mutually compatible [5]. Indeed, in the literature, a data-driven method to build customer journey maps and analyze the customer journey is still missing. In this thesis, we make, to the best of our knowledge, the first effort to fill this gap, by creating a repeatable framework to analyze the online customer journey on a website.

Websites are the perfect environment to explore the concept of customer journey tracking, because every action the users take can be recorded in web logs, a tabular format that can be used for a wide range of analyses. Many companies rely on web analytics tools to gain useful insights from the data; however, these programs provide an oversimplified overview of the customer journey and thus lead to an incorrect understanding of the user behavior on the website [41]. In particular, for each page the users visited, they only identify the previous and subsequent pages in the journey, presenting an abstracted view of the relation between pages, exit points, and critical paths taken by customers [41]. Consequently, these flows are useful to get an overall idea of the journey on the website, showing the main paths the users follow, but they present the following limitations:

• They lack the capability of reflecting which characteristics of cases (users) determine a particular choice in the journey.

• They are unable to provide end-to-end process maps that can show and explain choices and loops between activities.

• They do not directly allow the analysis of the processes from different perspectives (e.g., time constraints, bottlenecks, or relations between resources).

On the other hand, process mining techniques are capable of effectively addressing all of the above limitations [64]. This thesis explores the new possibility of using process mining on web logs to explore the customer journey. Process mining looks particularly suitable for this problem because, as highlighted by Bernard and Andritsos [5], it works with event logs, a sequential format ideal for representing customer journeys. Moreover, working with expected and actual models is at the core of the process mining framework, which is able to capture the causality and paths of user interactions that lead to certain outcomes of interest, such as buying a product [41]. In particular, through process mining it is possible to (i) discover the process that best describes the user behavior, (ii) find useful insights, and (iii) discover and compare the processes of different behavioural clusters of users.

In addition, after analyzing the customer journey, we exploit the obtained insights to optimize some KPIs through personalized recommendations based on the user journey. In fact, one of the main aims of customer journey analysis is to improve it, providing a better personalized experience to users and, at the same time, optimizing some company KPIs. Collaborative filtering recommender systems algorithms, which provide recommendations to users based on their behaviour, suit this scenario perfectly, but they have never been used in association with process mining. Consequently, this thesis aims to bridge the gap between the process mining and recommender systems worlds by using them as an ensemble to improve the customer journey analysis. More specifically, the goal of combining these two techniques is two-fold. First, collaborative filtering algorithms can be used to tailor an ad-hoc experience to the user, exploiting the customer journey insights discovered with process mining to increase the recommendation accuracy. Second, process mining can be used as a support to select which paths of the process can be enforced through personalized recommendations, in order to optimize a specific company KPI.

To prove this concept, we experimented with the proposed methodology on a real data set. In particular, the data has been collected from a popular advertising web portal: one month of event data in which every action of the users was recorded. The evaluation has been performed both offline and online. For the offline evaluation, we masked 30% of the user-item interactions and then checked whether the masked interactions were actually recommended, computing offline evaluation measures such as AUC and MAP@5. Then, after selecting the best algorithms, we tested them online with emails containing personalized recommendations. More specifically, using A/B testing, we sent different groups of emails containing recommendations shaped on the user journey, where each group of emails contained recommendations generated by a particular algorithm. In this way, we computed the click-through rate of the various recommendation algorithms, checking whether the users actually clicked on the provided recommendations.

To conclude, this thesis addresses three main research questions:

1. How can process mining techniques be exploited in the context of customer journey analysis?

2. How can recommender system techniques be used in collaboration with process mining, in order to improve the customer journey?

3. Is it possible to improve the state-of-the-art collaborative filtering algorithms by exploiting insights discovered with process mining?

Answering these questions, this thesis aims to make two main contributions:

1. Showing the power of process mining in customer journey analysis, designing a repeatable, data-driven framework.

2. Bridging the gap between the process mining and recommender systems worlds, using them as an ensemble to improve the customer journey analysis.


1.1 Thesis Structure

This thesis is structured as follows:

• Chapter 2 gives an overview of customer journey analysis, process mining and recommender systems.

• Chapter 3 formalizes the problem more precisely.

• Chapter 4 describes the proposed framework to run a customer journey analysis with process mining techniques.

• Chapter 5 illustrates how process mining and recommender systems worlds can be combined.

• Chapter 6 describes the application scenario and the data set in which the experiments were run.

• Chapter 7 shows the results of the aforementioned methodologies, applied to the case study.

• Chapter 8 draws the conclusions and outlines future directions.

Moreover, Appendix A briefly describes the process mining tools used to perform the analysis, while Appendixes B and C show additional results that were not included in Chapter 7.

1.2 Academic Publications Related to this Thesis

A position paper [61], containing the preliminary results of this thesis, has been accepted and published in the IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud-2018), 6-8 August 2018, Barcelona, Spain. Moreover, another paper, containing the complete results of this thesis, has been submitted to the 34th ACM/SIGAPP Symposium On Applied Computing: SAC 2019, 8-12 April 2019, Cyprus. The reader can find both papers attached in Appendix D, respectively in Sections D.1 and D.2.


2 Literature Analysis

To understand the methodology proposed in the following chapters, this chapter introduces the most important related theoretical concepts and literature background. In particular, Section 2.1 gives an overview of the customer journey world, while Sections 2.2 and 2.3 draw an outline of process mining and recommender systems.

2.1 Introduction to Customer Journey

This section introduces the customer journey scenario, showing its characteristics and the techniques that have been developed to analyze it. The customer journey [57] is the total set of events that customers experience when interacting with an organization. Nenonen et al. [40] refer to the customer journey as a visual, process-oriented procedure to analyze people's experiences. Rather than focusing on just a specific part of the journey, the customer journey describes the whole experience of being a customer. Usually, the customer life cycle begins when the customer desires or needs a product or service, and it lasts until the product is bought or the service is provided [40]. The company's goal is to design this journey in a way that maximizes the customer's and the organization's value [40], trying to personalize the customer experience as much as possible. In fact, through customer journey analysis, organizations can gain knowledge on how potential and current customers experience and feel about the various channels and touch points, highlighting how much this experience differs compared to their expectations [40]. Indeed, this knowledge can be used to design an optimal experience that meets the customers' desires and assists the fulfillment of the company's objectives, achieving competitive advantage [40].

Notably, a research survey by Forbes [24] shows that customers expect highly personalized experiences. In the survey, most of the people interviewed expressed disappointment about the lack of personalization in their shopping experiences, describing them as "too impersonal". This survey shows the importance of a personalized experience: it drives impulse purchases, increases revenues and leads to more customer loyalty.

2.1.1 Customer Journey Maps Overview

This subsection illustrates a popular way to represent customer journeys: CJMs (customer journey maps). Although many variants of customer journey maps are present in the literature, the overall idea is common to all of them.


Definition 1 (Customer Journey Map) “The Customer Journey Map (CJM) is a linear, time-based representation of the main stages that a customer goes through in interacting with a company or a service” [35].

As Wolny and Charoensuksai point out [70], CJMs are characterized by a non-linear structure, reflecting cognitive, emotional, and behavioral drives. The main goal of tracking the customer experience is to comprehend how this experience can be improved [29]. Usually, in a customer journey analysis, companies focus on how customers interact with multiple touch points, moving through consideration, search, purchase, post-purchase, consumption, and future engagement or repurchase. The targets of the analysis are to describe this journey and understand the customers' options and choices for touch points in multiple purchase phases [29]. As Halvorsrud et al. [43] point out, despite the extensive diffusion of customer journey analysis, few publications have tried to formalize a repeatable methodology [31]. In fact, these existing approaches appear quite diverse and are focused more on the insights that the analysis brings rather than on the methodology itself [51]. As already stated before, many variants of CJMs exist in the literature. Here follow the main CJM components that Andritsos et al. [5] found in their literature analysis:

• Customer: a customer is the stakeholder that benefits from a service.

• Journey: a CJM contains at least one journey, which is a typical path pursued by a customer. It is possible to distinguish between two main types of CJMs: (i) expected journey maps, aiming to represent journeys in advance, and (ii) actual journey maps, whose target is to describe how the customers really experienced the journey.

• Goal: the target that the company has in mind when it maps the customer journey, e.g., reducing the churn rate.

• Touch-point: the moment in which customers interact with companies through a product or service, e.g., when a user searches for a flight or contacts the customer service.

• Timeline: it describes the length of the journey in a time span that goes from the first touch-point until the last one.

• Channel: it is the approach chosen by the customer to interact with the touch-point.

Sometimes, other elements like Stage, Experience, Lens and Multimedia are also present. Even though these types of maps can be very useful, offering an overall view of what happens, they take a lot of time to be manually created. Indeed, generating these maps requires a lot of domain knowledge, and it is very difficult to support them with data. Moreover, because of the time they take to be created, their usefulness is limited to long-term decision-making problems. For this reason, this thesis explores the idea of using process mining to analyze the customer journey. Process mining makes it possible to speed up and scale up the entire creation of customer journey maps, being a completely automated and data-driven method. Indeed, it supports the creation of customer journey maps in real time, removing the need for manual intervention and, thus, removing all the consequent biases. Before diving into the process mining world, the next section (2.1.2) gives an overview of the current web analytics techniques to analyze the customer journey.


2.1.2 Overview of Web Usage Mining Techniques in Customer Journeys

This subsection gives an overview of the web usage mining techniques developed to analyze the customer journey. As pointed out by Poggi et al. [41], web usage mining refers to the collection, measurement, and analysis of web server data. Data mining techniques such as classification, clustering and association rules have been used successfully in the last years [2]. These techniques work by summarizing the sequences of data into session characterizations: a set of high-level data summarizing what happened during the user's navigation [48]. These structures are based on different metrics, including the number of unique and returning visits, URL access frequency, geo-location, client web browser and version, and other statistics around these metrics. Indeed, the goal of web analytics techniques is to provide feedback on user behavior, in order to improve site navigation and conversion goals. For example, customer behavior model graphs (CBMG) can be used to provide an abstracted view on web navigation [39]. Figure 2.1 shows an example of a CBMG. In this graph, each possible state corresponds to a node with probability transitions, e.g., home page, browse and search.

Figure 2.1: Customer behavior model graph [48]

In particular, a CBMG is built using the k-means clustering algorithm [20], which creates a probability matrix for the possible path transitions from a state. Currently, a lot of commercial tools for analyzing web logs are also available on the market. One of the most popular is Google Analytics. Following the description of Kent et al. [30], the data collected by Google Analytics is used to discover which are the most popular or most accessed pages in a company's website, what type of information users are most interested in accessing, what paths they take as they navigate, how much time they spend on each page, etc. Figure 2.2 shows an example of a behavior flow from Google Analytics. Unfortunately, these tools do not extract the user behavior at an abstraction level that is appropriate to understand the actual critical paths taken by consumers. Consequently, the next section introduces the concept of process mining, whose techniques are capable of effectively addressing this limitation [64].
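To make the transition-probability idea behind a CBMG concrete, the following toy sketch counts page-to-page transitions in a handful of sessions and normalizes them into probabilities; the sessions and page names are invented, and the session-clustering step (e.g., with k-means) that precedes this in a real CBMG construction is omitted for brevity.

```python
# Toy sketch: estimate a CBMG-style transition-probability matrix from
# invented navigation sessions; page names are hypothetical.
import numpy as np

pages = ["home", "browse", "search", "exit"]
idx = {p: i for i, p in enumerate(pages)}
sessions = [["home", "browse", "exit"],
            ["home", "search", "browse", "exit"],
            ["home", "search", "exit"]]

counts = np.zeros((len(pages), len(pages)))
for s in sessions:
    for a, b in zip(s, s[1:]):          # consecutive page pairs
        counts[idx[a], idx[b]] += 1

P = counts / counts.sum(axis=1, keepdims=True).clip(min=1)  # row-normalize
print(P[idx["home"]])   # approx. [0, 0.33, 0.67, 0]: transitions out of "home"
```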


Figure 2.2: Google Analytics behavior flow example

2.2 Introduction to Process Mining

This section gives an outline of the process mining world, briefly describing the various aspects of this topic: process discovery (Section 2.2.2), conformance checking (Section 2.2.3), process enhancement (Section 2.2.4) and trace clustering (Section 2.2.5). Moreover, the connection between process mining and customer journey maps (Section 2.2.6), web logs (Section 2.2.7) and recommender systems (Section 2.2.8) is explained. Process mining [63] adds the process perspective to machine learning and data mining, trying to fill the gap between traditional process analysis and data analysis techniques such as machine learning and data mining [63]. The goal of process mining is to use event data to extract process-related information, e.g., to automatically discover a process model by observing events from an enterprise system. While process science approaches are process-centric, focusing on modeling rather than learning from event data, process mining focuses on the comparison between event data (i.e., observed behavior) and process models (hand-made or discovered automatically) [63]. Consequently, thanks to the widespread collection and availability of data, process mining can be exploited to improve end-to-end processes [63]. In fact, it allows process owners to gain knowledge and insights into (business) processes by analyzing the event data stored during the execution of the process. As Weber et al. noticed [50], in real processes there is often a significant gap between what is supposed to happen and what really happens. Indeed, the idea of process mining is to discover, analyze and improve these real processes. The more freedom the users have in deviating from the standard paths, the more appealing it is to examine and evaluate processes as they are executed [50]. In particular, there are three main sub-fields of process mining [63]:

• Process discovery: producing a process model from an event log without using any a-priori information.

• Conformance checking: comparing an existing process model with an event log that refers to the same process.


• Process enhancement: extending or improving a process model with information from the actual process.

Figure 2.3 shows how process mining is linked to real data, actual processes and process models.

Figure 2.3: Overview of process mining spectrum [63]

2.2.1 Event Log

Process mining algorithms need a specific input format to run: event logs, described in Definition 2.

Definition 2 (Event Log) In an event log $L = \langle e_1, e_2, \ldots, e_n \rangle$, each event $e_i = (c_j, a_w, t)$ must refer to an activity $a_w$ from the activity set $A = \{a_1, a_2, \ldots, a_{|A|}\}$, to a timestamp $t$, and to a case $c_j$, which uniquely identifies the user who performs the action [63]. Thus, an event log can be seen as a collection of cases, and a case can be seen as a trace of events.

Consequently, data extraction is an integral part of any process mining work: most information systems store information in unstructured form and, therefore, some effort is needed to extract it. Table 2.1 shows an example of an event log from a loan process.

Table 2.1: Event log example

Timestamp         Case Identifier  Activity                     Resource
2018-01-01 00:00  123456789        Enter loan application       Pete
2018-01-01 00:10  123456789        Retrieve applicant data      Pete
2018-01-01 00:20  123456789        Notify client                John
2018-01-01 00:30  987654321        Create new loan application  John
2018-01-01 00:40  987654321        Ask client for data          Mary

From the event log we can distinguish two cases:

• The case 123456789, in which Pete entered the loan application and then retrieved the applicant data, while John notified the client.


• The case 987654321, in which John created a new loan application and then Mary asked the client for the data necessary to proceed further.

In particular, event logs are stored in the XES format, an XML-based format used as input by all the popular process mining tools, such as ProM [65] and Disco [19]. The XES format has been officially standardized by the IEEE [25] because of its simplicity, flexibility, extensibility and expressivity. Following the definition of van der Aalst [63]:

Definition 3 (XES Document) “A XES document contains one log consisting of any num-ber of traces. Each trace describes a sequential list of events corresponding to a particular case. The log, its traces, and its events may have any number of attributes”.
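As a concrete illustration of Definitions 2 and 3, the sketch below builds the event log of Table 2.1 and serializes it to XES. It is a minimal sketch assuming the open-source pm4py library and its standard XES attribute names; the tools actually used in this thesis are ProM, RapidProM and Disco (see Appendix A).

```python
# Minimal sketch: build the event log of Table 2.1 and export it to XES,
# assuming the pm4py library (an illustration, not the thesis tooling).
import pandas as pd
import pm4py

df = pd.DataFrame({
    "case:concept:name": ["123456789"] * 3 + ["987654321"] * 2,  # case id
    "concept:name": ["Enter loan application", "Retrieve applicant data",
                     "Notify client", "Create new loan application",
                     "Ask client for data"],                      # activity
    "time:timestamp": pd.to_datetime([
        "2018-01-01 00:00", "2018-01-01 00:10", "2018-01-01 00:20",
        "2018-01-01 00:30", "2018-01-01 00:40"]),
    "org:resource": ["Pete", "Pete", "John", "John", "Mary"],
})

log = pm4py.convert_to_event_log(df)   # group events into traces by case
pm4py.write_xes(log, "loan_process.xes")
```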

2.2.2 Process Discovery

Process discovery algorithms [63] create, from an event log, a process model that captures the behaviour seen in the log. More formally, referring to the definition given by van der Aalst [63]:

Definition 4 (Process Discovery Algorithm) “Let L be an event log. A process discovery algorithm is a function that maps L onto a process model such that the model is ‘representative’ for the behavior seen in the event log. The challenge is to find such an algorithm”.

There are some aspects of Definition 4 that need to be discussed further:

• The process model generated by a discovery algorithm does not belong to a predefined type. BPMN, EPC, YAWL and Petri net models are the most used models in the literature [63].

• The concept of a model being “representative” of the behaviour will be explained in Section 2.2.3.

In the last years, various process discovery approaches have been developed and, consequently, different algorithms have been created. Here follows a non-comprehensive list of process discovery algorithms:

• Alpha Algorithm [66]
• Heuristic Miner [68]
• Inductive Miner [34]
• Fuzzy Miner [18]

An example of a discovered Petri net model is shown in Figure 2.4. This process was discovered from the event log of an enterprise system handling car fines. Table 2.2 shows the first rows of this log. The process starts with the activity Create Fine; then there is a parallelism between the upper block of activities (Send Fine, Insert Fine Notification and Add Penalty) and the lower branch, made of the Payment activity. It is interesting to notice that both blocks can be skipped thanks to the silent activities. Consequently, from this model it is easy to derive that the activity Create Fine must always be performed. Then, the sequence of Send Fine, Insert Fine Notification, and Add Penalty can be executed or skipped. Moreover, the Payment activity can be executed at any point in time during the execution of the sequence of activities of the upper branch.


Table 2.2: Car fines event log example

Case id  Activity                  Timestamp            Vehicle type  Amount
AAAAA    Create Fine               01-05-2018 18:00:00  Ford Galaxy   35
AAAAA    Send Fine                 01-05-2018 18:30:00  Ford Galaxy   35
BBBBB    Create Fine               01-05-2018 19:30:00  Ferrari       150
BBBBB    Payment                   02-05-2018 12:30:00  Ferrari       150
AAAAA    Insert Fine Notification  02-05-2018 13:00:00  Ford Galaxy   35
AAAAA    Add Penalty               05-05-2018 13:00:00  Ford Galaxy   35

Figure 2.4: Petri net describing the main behavior present in the event log of Table 2.2 [69]
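For illustration, a discovery run on an event log like the one of Table 2.2 could look as follows. This is a minimal sketch assuming pm4py and a hypothetical XES file name; the thesis relies on the ProM/RapidProM implementations of the miners listed above.

```python
# Minimal process discovery sketch with pm4py; "car_fines.xes" is a
# hypothetical file holding an event log like Table 2.2.
import pm4py

log = pm4py.read_xes("car_fines.xes")

# Inductive Miner: returns a Petri net plus initial and final markings.
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
pm4py.view_petri_net(net, initial_marking, final_marking)
```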

2.2.3 Conformance Checking

Following the definition of van der Aalst [63], conformance checking “relates events in the event log to activities in the process model and compares both”. This concept is very complex, because conformance is a trade-off between the following four quality criteria [8], shown in Figure 2.5:

• Replay Fitness: the discovered model should be able to reproduce the behavior seen in the event log.

• Precision: the discovered model should not allow for behavior completely unrelated to what was seen in the event log. In fact, a flower model is able to produce any arbitrary finite sequence of events [47], but it is a perfect example of a completely under-fitting model.

• Generalization: the discovered model should generalize the example behavior seen in the event log, in order to be able to reproduce future behaviour of the process.

• Simplicity: the discovered model should be as simple as possible. Usually, process discovery algorithms produce process models that are very hard to read, commonly called spaghetti-like process models.

Figure 2.5: Conformance checking criteria [63]

As is clear from the definitions above, precision is related to the notion of under-fitting, while generalization is strictly related to the concept of over-fitting. A model with low precision under-fits, allowing for behaviors that are very different from what was seen in the event log. On the other hand, a model with poor generalization over-fits, being unable to reproduce cases that differ slightly from the examples seen in the event log. Figure 2.6 shows the output of a conformance checking analysis. In this case, the trace {Create Fine, Send Fine, Insert Fine Notification, Add Penalty, Send for Credit Collection} is compared with the discovered model depicted in Figure 2.4. The trace contains the activity Send for Credit Collection, which is not present in the model, and consequently it cannot fit the model perfectly. In Figure 2.6, the green blocks indicate a match between an event in the trace and an activity in the model. The gray blocks correspond to the invisible activities, while the yellow blocks refer to a non-conforming event: an activity that did occur in the trace but is not present in the model [69].

Figure 2.6: Conformance checking analysis example [69]
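Continuing the pm4py sketch above, replay fitness and precision for the discovered net could be estimated as follows; this is indicative only, since the exact keys of the returned dictionary may vary across pm4py versions, and the thesis performs this step with ProM plugins instead.

```python
# Hedged sketch: token-based replay fitness and precision with pm4py,
# reusing log, net, initial_marking, final_marking from the discovery sketch.
import pm4py

fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
precision = pm4py.precision_token_based_replay(log, net, initial_marking, final_marking)

print(fitness["log_fitness"])  # 1.0 means the log replays perfectly on the model
print(precision)               # low values indicate an under-fitting model
```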

2.2.4 Process Enhancement

Process enhancement, also referred to as performance analysis, replays the event log on the discovered model [69]. For example, by exploiting the timestamps, performance analysis can be used to identify the problematic parts of the process, like bottlenecks.
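As a minimal illustration of this idea, the sketch below replays timestamps from a pandas DataFrame shaped like the event log of Table 2.1 and ranks hand-overs between consecutive activities by mean waiting time; the data is invented, and real performance analysis (e.g., in ProM or Disco) replays the log on the discovered model instead of on raw pairs.

```python
# Toy bottleneck indicator: mean waiting time between consecutive activities.
import pandas as pd

df = pd.DataFrame({
    "case": ["A", "A", "A", "B", "B"],
    "activity": ["Create Fine", "Send Fine", "Payment",
                 "Create Fine", "Payment"],
    "timestamp": pd.to_datetime(["2018-05-01 18:00", "2018-05-01 18:30",
                                 "2018-05-03 10:00", "2018-05-01 19:30",
                                 "2018-05-02 12:30"]),
}).sort_values(["case", "timestamp"])

df["next_activity"] = df.groupby("case")["activity"].shift(-1)
df["waiting_time"] = df.groupby("case")["timestamp"].shift(-1) - df["timestamp"]

# The largest mean waiting times hint at bottlenecks between activity pairs.
print(df.groupby(["activity", "next_activity"])["waiting_time"].mean())
```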

2.2.5 Trace Clustering

Usually, process mining algorithms have problems when the underlying process is unstructured, generating really complex models. In fact, when the behaviour captured by the event log is very heterogeneous, considering the whole set of traces in the event log can mislead the mining algorithms, resulting in spaghetti-like models, which are too specific, or flower models, which are too general. Unfortunately, process mining is most appealing precisely in these domains, characterized by flexibility, while for an already structured process the discovery aspect of process mining is less interesting [56]. To solve this issue, trace clustering can be applied, in order to create different sets of clusters that correspond to coherent sets of process instances [7]. Figure 2.7 shows an overview of this concept. It goes without saying that clustering makes the process clearer and more comprehensible.

Figure 2.7: Trace clustering example [7]

Note that, in the process mining scenario, the goodness of a trace clustering algorithm is measured through the same concepts introduced in Section 2.2.3. Usually, a good set of clusters should produce a set of models with (i) good fitness and (ii) a low degree of structural complexity [7]. In particular, different trace clustering approaches have been proposed in the literature:

• Bag of activities approach: each trace is transformed into a vector, where each dimension of the vector corresponds to an activity and each value corresponds to the frequency count of that activity in the trace [7]. Consequently, similarity between traces is computed using classic vector distance metrics (a sketch follows this list). Unfortunately, this method has some drawbacks, because it does not take into consideration the order of activities or the context in which the process is executed.

• K-gram model: each trace is embedded into the vector space model using the set of its sub-sequences of k activities, also called k-grams. This method is able to integrate the sequence of activities and the context in which the process lives but, on the other hand, it is very computationally expensive. In fact, with a set of $|A|$ activities, the number of potential combinations is exponential in k, being $|A|^k$ [7].

• Sequence-based distances: trace similarity is computed using sequence distances like the Hamming distance and the edit distance [45], described below.

   – Hamming distance defines the difference between two sequences of the same length as the number of positions at which the activities differ.

   – Edit distance defines the difference between two sequences as the “minimum number of edit operations needed to transform one sequence into the other” [7]. An edit operation is an insertion, deletion or substitution of an element. The goodness of the clustering results highly depends on the cost function, which defines a different cost for each edit operation.
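The sketch announced above illustrates the bag-of-activities embedding (clustered with k-means) and a plain dynamic-programming edit distance with unit costs; the traces and activity names are invented for illustration.

```python
# Toy sketch: bag-of-activities clustering and unit-cost edit distance.
import numpy as np
from sklearn.cluster import KMeans

activities = ["home", "search", "product", "checkout"]
traces = [["home", "search", "product"],
          ["home", "search", "search", "product"],
          ["home", "checkout"]]

# Bag of activities: one frequency count per activity, order is discarded.
bags = np.array([[t.count(a) for a in activities] for t in traces])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(bags)

# Edit (Levenshtein) distance between two traces via dynamic programming.
def edit_distance(s, t):
    d = np.arange(len(t) + 1)
    for i, a in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return int(d[-1])

print(labels)                                   # e.g. [0 0 1]: t2 stands apart
print(edit_distance(traces[0], traces[1]))      # 1: one "search" inserted
```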


2.2.6 Process Mining and Customer Journey Maps

Process mining emerged only recently, but it has already been applied to many different fields: from analyzing treatment processes in hospitals [36] to improving the auditing process [26]. In particular, a few interesting approaches to using process mining in the customer journey analysis scenario have already been proposed. Bernard and Andritsos [5] were the first to highlight the prospective value of process mining for customer journey analysis, illustrating how it can be used to analyze the customer journey. They point out that expected journey maps and actual journey maps perfectly correspond to two of the most important concepts of the process mining world: (i) process discovery and (ii) conformance checking. In particular, they demonstrate a perfect correspondence between the components of a customer journey map and the XES format, in which process mining event logs are stored. Figure 2.8 shows the XML structure they propose to store the CJM. Table 2.3 displays the correspondence between the main components of a CJM model and the XES format, described in Section 2.2.1.

Figure 2.8: XML structure of a CJM [5]

Table 2.3: CJM - XES correspondence

CJM                   XES
cjm                   log
journey               trace
touchpoint            event
touchpoint:timestamp  timestamp:date

Demonstrating that process mining and customer journey mapping are complementary, Bernard and Andritsos have consequently opened the way to process mining in this new scenario. In their next work [4], they describe a possible approach to using process discovery techniques to output the journeys onto a CJM: CJM-ex. Like a classic process discovery algorithm, CJM-ex takes an event log as input but, instead of outputting a BPMN model or a Petri net, it outputs a CJM. The authors use CJMs as output for two main reasons: (i) they are able to incorporate customers' personal information such as emotions; (ii) they are able to represent exceptional behaviours, which would have been removed from other representational process models in order to improve readability. In fact, one of the main limitations of BPMN is the small number of journeys it can show at the same time. CJM-ex aims at providing a solution to deal with a large number of customer journeys at the same time, using a hierarchical clustering approach to generate groups of similar journeys. In fact, a hierarchical clustering algorithm segments the original data into different areas of interest, allowing a top-down navigation of automatically generated groups of similar journeys. In this way, a tree is built bottom-up, merging at each iteration the activities that are most similar. In particular, the hierarchical algorithm uses a similarity measure based on the Jaccard similarity, but which takes into account the order of activities. Then, the obtained clusters are cut to form layers, a set of a predetermined number of journeys that will be grouped together. Finally, a frequent sequence mining algorithm finds, for each layer, a representative journey of the cluster.

2.2.7 Process Mining Applied to Web Logs

This subsection describes, to our knowledge, the only attempt to apply process mining techniques to web server logs, which are described in Definition 5.

Definition 5 (Web Logs) Web logs contain every action the users perform on a particular website. Every action is recorded using a tracking tool and then stored in a database, which can be queried to perform analyses.

Table 2.4 shows an example of a web log. Besides the timestamp, the user identifier, and the URL of the page visited, other information like the IP address, the browser and the device used can be recorded.

Table 2.4: Web log example

Timestamp         User Id    Url             IP            Browser  Device
2018-01-01 00:00  123456789  /search/param   172.16.254.1  Chrome   IPhoneX
2018-01-01 02:00  123456789  /product/param  172.16.254.1  Chrome   PC

In their work [41], Poggi et al. show how to map a web log into an event log. Starting from a web log like the one depicted in Table 2.4, a timestamp and a unique identifier for each user are already set by the software used to track the data. The only preprocessing needed is to build a mapping function that maps each URL to a specific action from the activity set. To do that, Poggi et al. propose two approaches: hand-written rules (a toy sketch is given at the end of this subsection) and unsupervised clustering. In the first case, domain knowledge about the website and the structure of the URLs is required. In fact, starting from the URL visited by the user, some rules are built to extract an action. The same result can be achieved by first using a clustering algorithm on the set of URLs, and then representing all the elements of the same cluster with an action, according to the type of URLs that the cluster contains. In the second part of the paper [41], the authors point out a common problem of dealing with web log data. Usually, only a small percentage of the users reaches the last step of the journey and, consequently, the low-frequency traces will be filtered out by the majority of mining algorithms. Poggi et al. describe two possible approaches to solve this issue:

• Saturating the data set: producing a new data set by removing customer sessions that did not come to an end.

• Clustering sessions: trace clustering is applied to remove noise and mine specific customer clusters or target groups without the need to saturate the data set.
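As the toy illustration of the hand-written-rules mapping mentioned earlier in this subsection, the sketch below maps raw URLs to actions with ordered regular-expression rules; the URL patterns and action names are invented.

```python
# Toy sketch of the hand-written-rules URL-to-action mapping in the spirit
# of Poggi et al. [41]; patterns and action names are hypothetical.
import re

URL_RULES = [                         # (pattern, action), checked in order
    (re.compile(r"^/search/"),  "Search"),
    (re.compile(r"^/product/"), "View Product"),
    (re.compile(r"^/checkout"), "Checkout"),
]

def url_to_action(url: str) -> str:
    """Map a raw URL from the web log to an activity of the event log."""
    for pattern, action in URL_RULES:
        if pattern.search(url):
            return action
    return "Other"                    # unmapped URLs fall into a catch-all

print(url_to_action("/search/param"))   # -> Search
print(url_to_action("/product/param"))  # -> View Product
```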


2.2.8 Process Mining in the Recommendation Environment

Usually, process mining has been seen as a stand-alone application to discover processes and assess them, but it can also be used to support recommendations. In particular, Weber et al. [50] explore the possibility of using process mining to recommend to a user what the next step in the process should be. When a user requires a recommendation, his or her partial journey is sent to the recommendation service, which compares it to the log and generates a recommendation that optimizes a target function. This concept is particularly useful in the customer journey analysis environment, where personalizing the customer experience is essential. Figure 2.9 shows an overview of the proposed recommendation service. A user can request a recommendation for a specific case by sending a partial case to the recommendation service, which “applies process mining on-the-fly, i.e. by looking at a log (set of completed executions) and a partial execution, and predicts the future of the (partial) case” [50]. Finally, it sends back to the user a list of recommended next steps to assist the user's decision. In order to transform the predictions into recommendations, a user's goal is necessary, e.g., optimizing the time spent using the service or minimizing the churn rate.

Figure 2.9: Recommendation service overview [50]

In particular, Weber et al. distinguish between different types of target functions that can be optimized in a recommendation scenario:

• The duration of a case.
• The business value of a case.
• The learning curve of a person.
• The value for money of a case.

As already stated before, in order to produce recommendations, the recommendation service needs to compare the partial instance a user sends to the service with the cases already present in the process log. To do this, several comparison mechanisms can be used, but first the event log needs to be abstracted, in order to reduce its complexity and focus only on the relevant information. Figure 2.10 shows the abstraction mechanisms proposed by Weber et al., listed below.


• Finite vs. Infinite Horizon: “in some scenarios the entire case has to be considered in order to make meaningful recommendations (hence the horizon is infinite), while in other scenarios it is sufficient to consider the last events of a case” [50] or the events that happened within a certain window of time.

• Filtering of events: “only a selected set of events may be considered” [50], i.e., considering only the cases which started with activity A and finished with activity B.

• Removing order and/or frequency: “this abstraction mechanism removes the order and/or frequency of activities from a case” [50]. For example, in certain cases it is interesting to know which activities happened, but not their order or multiplicity. In particular, there are three possible ways of representing cases:

   – Sequence: “the order of events is considered for a particular case” [50].

   – Multi-set of events: “the number of times each event is executed is considered, order is ignored” [50].

   – Set of events: “the mere presence of events is taken into consideration” [50].

Figure 2.10: Possible abstraction of cases [50]
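The three case representations can be made concrete in a few lines of Python; the partial journey below is invented for illustration.

```python
# Small sketch of the three case representations: sequence, multi-set, set.
from collections import Counter

partial_case = ["home", "search", "search", "product"]

sequence  = tuple(partial_case)    # order and frequency kept
multi_set = Counter(partial_case)  # frequency kept, order ignored
event_set = set(partial_case)      # mere presence of events

print(sequence)   # ('home', 'search', 'search', 'product')
print(multi_set)  # Counter({'search': 2, 'home': 1, 'product': 1})
print(event_set)  # {'home', 'search', 'product'} (order may vary)
```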

2.3 Introduction to Recommender Systems

In the last years, with the spread of data collection and user profiling, a new category of algorithms, aimed at making the user experience unique, has emerged: recommender systems. This section briefly illustrates this concept. In particular, Section 2.3.1 describes the most popular family of techniques in the literature, while Section 2.3.2 gives an overview of the recommender systems evaluation metrics. Moreover, Sections 2.3.3, 2.3.4, 2.3.5 and 2.3.6 describe the state-of-the-art recommender systems that are most suitable for the customer journey scenario.

Definition 6 A recommender system or a recommendation system (sometimes replacing “system” with a synonym such as platform or engine) “is a subclass of information filtering system that seeks to predict the ‘rating’ or ‘preference’ a user would give to an item” [55].


In particular, recommender systems use users' feedback as input to provide personalized recommendations, which can be related to various decision-making processes, such as what items to buy, what music to listen to, or what online news to read. These recommendation algorithms are designed depending on the domain and the peculiar features of the available data [38], which usually record the quality of the interactions between users and items. These interactions are usually called feedback and can be distinguished into explicit and implicit.

Explicit feedback is usually given through ratings: users specify their degree of interest in various items using a specific evaluation system, e.g., a five-star rating system. Unfortunately, these types of feedback are not easy to get, because they require the users to actively rate the items; thus, they can reflect the reluctance of users to rate products, becoming not completely reliable. On the other hand, implicit feedback is easier to retrieve, because its collection is completely effortless in terms of the work required by the customer. Indeed, in this case, customer preferences are derived from their activities rather than from their explicitly specified ratings. For example, when a customer buys or browses an item, we can infer that the user likes the item, considering these acts as endorsements for that item. However, with implicit ratings, there is no information available on whether a user dislikes an item. Indeed, “the act of not buying or not browsing an item from a large catalogue of products does not always indicate a dislike” [1].

Moreover, recommender systems also differ in the way they analyze the users' feedback, developing different ways to compute the affinity between users and items and to identify well-matched pairs [38]. In particular, it is possible to distinguish between four main types of recommender systems:

• Collaborative filtering methods: they take user-item historical interactions as input.

• Content-based methods: they take attribute information about the users and items as input.

• Knowledge-based methods: they provide recommendations based on the user requirements.

• Hybrid systems: they combine the three other methods.

In this thesis, the journey of the user will be considered as an implicit user feedback. Consequently, only collaborative filtering techniques, described in the next subsection, will be explained in depth. Indeed, collaborative filtering techniques are particularly interesting because they are domain free, and so, “they can address aspects of the data that are often elusive and very difficult to profile using content based techniques” [23].

2.3.1 Collaborative Filtering

Collaborative filtering approaches are the most popular methods in recommender systems and have been extensively studied [17]. These techniques make recommendations based on users’ past behaviour, usually expressed with a user rating matrix. This matrix has users as rows, items as columns and is filled with ratings. Unfortunately, collaborative filtering techniques have to deal with the high sparsity of this matrix, because users interact only with a small percentage of all the items available. For example, a user on Netflix will watch only a small percentage of all the movies available in the whole catalogue. Figure 2.11 shows two examples of user rating matrices for users of a movie streaming service: the first one is filled with explicit feedback in the form of a five-star rating system, while the other one is filled with unary values, showing whether the user has watched the film or not.

Figure 2.11: User rating matrices example: explicit rating (left), implicit rating (right) [1]

In particular, it is possible to distinguish between different types of collaborative filtering techniques [1]:

• Memory-based methods (also called neighborhood-based): they predict ratings using a similarity measure. Depending on whether the similarity is computed between users or between items, it is possible to distinguish between:

– User-based collaborative filtering: starting from a particular user, it finds the most similar users based on a similarity of ratings, then it recommends items that those similar users liked. In particular, these are the general steps that summarize user-based collaborative filtering algorithms [38] (a minimal sketch of these steps is given after this list):

1. Assign a weight to all users by computing their similarity with the target user.
2. Select the k most similar users, also called neighbors.
3. Compute a prediction by combining the selected neighbors’ ratings.

– Item-based collaborative filtering: starting from an item, it finds the users who liked that item, then it looks for other items that those users (or similar users) also liked, and recommends them.

Memory-based methods have advantages and drawbacks: they are easy to implement and they provide recommendations that are easy to explain, but, on the other hand, they have some problems in dealing with sparse matrices.

• Model-based methods: in these methods, “machine learning and data mining methods are used in the context of predictive models” [1], e.g., PCA, SVD and matrix factorization, in order to deal with big sparse matrices. Consequently, these methods are easily scalable but, on the other hand, their results are difficult to explain because of the latent factors.
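
As a concrete reference, here is a minimal Python sketch of the three user-based steps listed above, using cosine similarity on a small dense rating matrix (all names and values are illustrative; treating unrated entries as 0 is a simplification):

```python
import numpy as np

def user_based_scores(R, target, k=2):
    """Predict item scores for one target user via user-based CF.

    R: (n_users, n_items) rating matrix, 0 = unrated.
    """
    # Step 1: weight every user by cosine similarity with the target user.
    norms = np.linalg.norm(R, axis=1) + 1e-9
    sims = (R @ R[target]) / (norms * norms[target])
    sims[target] = -np.inf  # exclude the target itself

    # Step 2: keep the k most similar users (the neighbors).
    neighbors = np.argsort(sims)[-k:]

    # Step 3: combine the neighbors' ratings, weighted by similarity.
    w = sims[neighbors]
    return (w @ R[neighbors]) / (np.abs(w).sum() + 1e-9)

# Toy example: 4 users x 5 items, ratings 1-5 (0 = unrated).
R = np.array([[5, 4, 0, 1, 0],
              [4, 5, 1, 0, 0],
              [0, 1, 5, 4, 5],
              [1, 0, 4, 5, 4]], dtype=float)
print(user_based_scores(R, target=0, k=2))
```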

2.3.2 Evaluation Metrics

This section describes the main metrics used to evaluate a recommender system. In particular, recommender systems can be evaluated using two main approaches: online methods or offline methods. “In an online system, the user reactions are measured with respect to the presented recommendations” [1]. For example, in a movie recommendation website, the conversion rate of users clicking the links of recommended movies can be used as an evaluation metric. Online evaluation is very useful to compare two algorithms, for example using A/B testing, but it does not provide a stand-alone technique to measure the quality of a recommendation algorithm with different settings, like offline measures do. In fact, the quality of a recommendation system can be computed offline by comparing the predicted recommendations with a test set of already known user ratings. In this sense, the recommendation problem can be considered as a generalization of the classification problem [1] and, consequently, many of the popular classification metrics can be used with some modifications. Table 2.5 shows a confusion matrix in the recommender system environment. For example, considering a top-N recommender system, the N recommended items are the positive items, while the unrecommended items are negative. Consequently, True Positive and False Positive items are, respectively, the recommended items that match (TP) or do not match (FP) what the users preferred. On the other hand, True Negative items are those which are not included in the recommendation list and which the users do not have among their preferred items, while False Negative items are the ones which are not recommended but do match what the users preferred in their hold-out test set [1].

Table 2.5: Recommender systems confusion matrix

              Recommended    Not recommended
Used          TP             FN
Not Used      FP             TN

Here follows a non-comprehensive list of the most used evaluation metrics (a short code sketch computing some of them is given at the end of this section):

• Mean Absolute Error (MAE)

$$MAE = \frac{\sum_{u,i} |p_{u,i} - r_{u,i}|}{N} \qquad (2.1)$$

Where $p_{u,i}$ is the predicted rating for user u on item i, $r_{u,i}$ is the actual rating, and N is the total number of ratings in the test set.

• Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{\frac{\sum_{u,i} (p_{u,i} - r_{u,i})^2}{N}} \qquad (2.2)$$

• Precision: the proportion of recommended items that are relevant.

$$Precision = \frac{TP}{TP + FP} \qquad (2.3)$$

A possible alternative is Precision at K (P@K), equal to the proportion of the top-K recommended items that are relevant.

• Recall: the proportion of relevant items that appear in the list of recommended items.

$$Recall = \frac{TP}{TP + FN} \qquad (2.4)$$

A possible alternative is Recall at K (R@K), equal to the proportion of relevant items that appear in the top-K recommended items.

• F1 Metric:

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (2.5)$$

• Area Under the ROC Curve (AUC)

A ROC curve is a graphical plot that shows the ability of a binary “classifier to rank the positive instances relative to the negative instances” [16] when its discrimination threshold is varied. The area under the curve (often referred to as simply the AUC) “is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one” [16] (assuming “positive” ranks higher than “negative”). In Figure 2.12, various ROC curves are shown. In particular, in the recommendation system scenario, a ROC curve plots recall (true positive rate) against fallout (false positive rate) for increasing recommendation set size [15].

Figure 2.12: ROC curve example

Other measures to take into consideration:

• Coverage: it expresses the percentage of items in the catalogue that the recommender system is able to recommend.

• Novelty: “it evaluates the likelihood of a recommender system to provide recommendations that the user have not seen before” [1].

• Serendipity: it measures “the level of surprise in successful recommendations” [1]. The word “serendipity” literally means “lucky discovery” and thus, it measures how unexpected the recommendations are to the user.

• Diversity: it measures the diversity within the list of recommended items. For example, when recommending three films, if they all belong to the same genre, the diversity will be low.


• Scalability: the recommender system must be efficient while dealing with large amounts of data. Scalability can be expressed in terms of training time, prediction time and memory requirements [1].
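
As a concrete reference, here is a minimal Python sketch of some of the metrics above (function and variable names are illustrative):

```python
import numpy as np

def mae(pred, actual):
    # Equation 2.1: mean absolute error over the N test ratings.
    return np.abs(pred - actual).mean()

def rmse(pred, actual):
    # Equation 2.2: root mean squared error over the N test ratings.
    return np.sqrt(((pred - actual) ** 2).mean())

def precision_recall_at_k(recommended, relevant, k):
    # P@K: fraction of the top-k recommendations that are relevant.
    # R@K: fraction of the relevant items found in the top-k.
    top_k = set(recommended[:k])
    hits = len(top_k & relevant)
    return hits / k, hits / len(relevant)

def auc(pos_scores, neg_scores):
    # Probability that a randomly chosen positive item is ranked
    # above a randomly chosen negative one.
    wins = sum(p > n for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example with hypothetical predictions and a hold-out set.
pred, actual = np.array([4.1, 3.0, 2.5]), np.array([5.0, 3.0, 2.0])
print(mae(pred, actual), rmse(pred, actual))
print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "d"}, k=3))
print(auc([0.9, 0.7], [0.4, 0.8]))
```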

2.3.3 Collaborative Filtering for Implicit Feedback Data Sets

In their work [23], Hu et al. identify the unique properties of implicit feedback data sets and address them by tailoring a collaborative filtering algorithm specifically for implicit data sets. Here follows a list of the main characteristics of implicit data sets to take into account:

• No negative feedback: it is possible to infer which items the users may like by analyzing their behaviour, but it is difficult to say which are the ones the users did not like. For example, a user may have not bought a product only because (s)he was not aware of that product.

• Presence of noise: noise is a fundamental part of implicit feedback. For example, a user can fall asleep while watching a movie (s)he does not like. From the system’s point of view, the user has watched the movie completely, so (s)he should like it.

• Implicit feedback expresses confidence, not preference: explicit feedback systems force the users to express their preference using a numerical value, for example from 0 to 10. In contrast, numerical values in implicit feedback describe the frequency of actions, for example how frequently a user visited a page, or how long (s)he watched a certain video.

The main idea behind the algorithm developed by Hu et al. [23] is to use matrix factorization. In order to figure out how the users and the items are related to each other, matrix factorization reduces the number of features while still keeping the relevant information. Starting from a big user rating matrix, user and item latent (or hidden) features are computed. Let $r_{u,i}$ represent the implicit rating measure of user u on item i; a confidence variable $c_{u,i}$ is defined in Equation 2.6, where the $\alpha$ term represents a linear scaling of the rating preferences.

$$c_{u,i} = 1 + \alpha r_{u,i} \qquad (2.6)$$

In this way, a minimal confidence for every item i is set, and it increases when more evidence is shown. Preferences are assumed to be the inner products $p_{u,i} = x_u^T y_i$, where $x_u \in \mathbb{R}^f$ and $y_i \in \mathbb{R}^f$ are, respectively, the user-factors and the item-factors (hidden representations of each user u and item i with f latent factors). These factors are computed by minimizing the following cost function:

$$\min_{x_*, y_*} \sum_{u,i} c_{u,i} \left( p_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right) \qquad (2.7)$$

The term $\lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$ is the regularization term, necessary to avoid overfitting the training data, where $\lambda$ is determined using cross-validation. The cost function contains $m \cdot n$ terms, where m is the number of users and n is the number of items. Thus, with a big number of users and items, classic optimization techniques like stochastic gradient descent do not work. To solve this problem, Hu et al. use a different approach that transforms the problem into an alternating-least-squares optimization process. In fact, “when the user-factors or the item-factors are fixed, the cost function becomes quadratic so its global minimum can be readily computed” [23]. Moreover, with alternating least squares, it is possible to compute the user-factors and the item-factors by solving one feature vector at a time, making the whole process parallel. Figure 2.13 shows an overview of the idea.

Figure 2.13: Alternating least squares overview

Using differentiation, the authors find an analytic expression for $x_u$ that minimizes the cost function:

$$x_u = \left( Y^T C^u Y + \lambda I \right)^{-1} Y^T C^u p(u) \qquad (2.8)$$

Y is an $n \times f$ matrix that gathers all item-factors. $C^u$ is a diagonal $n \times n$ matrix where, for each user u, $C^u_{ii} = c_{u,i}$. The vector $p(u) \in \mathbb{R}^n$ contains all the preferences of u. Then, the recomputation of all item-factors is performed in a similar way:

$$y_i = \left( X^T C^i X + \lambda I \right)^{-1} X^T C^i p(i) \qquad (2.9)$$

X is an $m \times f$ matrix that gathers all user-factors. $C^i$ is a diagonal $m \times m$ matrix where, for each item i, $C^i_{uu} = c_{u,i}$. The vector $p(i) \in \mathbb{R}^m$ contains all the preferences for i. After training the algorithm, “recommendations for user u are the K available items with the largest value of $\hat{p}_{u,i} = x_u^T y_i$, where $\hat{p}_{u,i}$ expresses the predicted preference of user u for item i” [23].
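
As a concrete reference, here is a minimal dense numpy sketch of the alternating updates in Equations 2.8 and 2.9 (function and variable names are illustrative; a real implementation would exploit the sparsity of the rating matrix, as discussed in [23]):

```python
import numpy as np

def implicit_als(R, f=10, alpha=40.0, lam=0.1, iters=10):
    """Implicit-feedback ALS in the spirit of Hu et al. [23], dense sketch.

    R: (m, n) matrix of implicit counts r_ui.
    Returns user-factors X (m, f) and item-factors Y (n, f).
    """
    m, n = R.shape
    C = 1.0 + alpha * R              # confidence c_ui (Eq. 2.6)
    P = (R > 0).astype(float)        # binary preference p_ui
    rng = np.random.default_rng(0)
    X = rng.normal(scale=0.01, size=(m, f))
    Y = rng.normal(scale=0.01, size=(n, f))
    I = lam * np.eye(f)
    for _ in range(iters):
        # Eq. 2.8: solve each user's factors with the item-factors fixed.
        for u in range(m):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + I, Y.T @ Cu @ P[u])
        # Eq. 2.9: solve each item's factors with the user-factors fixed.
        for i in range(n):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + I, X.T @ Ci @ P[:, i])
    return X, Y

# Predicted preference of user u for item i: X[u] @ Y[i]
```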

2.3.4 BPR: Bayesian Personalized Ranking from Implicit Feedback

In [59], Rendle et al. design a new method to recommend items in an implicit feedback scenario, directly optimizing the estimated ranking of items from the user perspective. They point out that, even though popular algorithms like SVD [3] and matrix factorization [23] are evaluated on personalized ranking data sets, they are optimized to predict whether an item is selected by a user or not. Consequently, the authors try to directly optimize the model parameters for ranking. In particular, they use item pairs as training data and optimize for correctly ranking item pairs. Let U be the set of all users, I the set of all items, and $S \subseteq U \times I$ the user rating matrix. The goal is to find, for each user u, a total ranking $>_u \subset I^2$ of all items that meets the properties of a total order (totality, anti-symmetry, transitivity). The authors make an important assumption: if an item has been viewed by a user, then (s)he favors this item compared to the non-observed ones [59]. On the other hand, if a user has viewed two items, no preference between them can be inferred. Thus, a training data set $D_S : U \times I \times I$ is defined as follows:

$$D_S := \{ (u, i, j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+ \} \qquad (2.10)$$

Where $I_u^+ := \{ i \in I : (u, i) \in S \}$. Consequently, $(u, i, j) \in D_S$ means that “user u is assumed to prefer item i over item j”. Then, the Bayesian Personalized Ranking (BPR) optimization is applied to find a ranking for all items. Specifically, it maximizes the posterior probability defined in Equation 2.11, “where Θ represents the parameter vector of an arbitrary model class (e.g. matrix factorization)” [59].

$$p(\Theta \mid >_u) \propto p(>_u \mid \Theta) \, p(\Theta) \qquad (2.11)$$

Assuming that all users act independently of each other, the likelihood over all users can be rewritten as follows:

$$\prod_{u \in U} p(>_u \mid \Theta) = \prod_{(u,i,j) \in D_S} p(i >_u j \mid \Theta) \qquad (2.12)$$

Consequently, Equation 2.13 defines the maximum posterior estimator.

$$BPR\text{-}OPT = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_\Theta \|\Theta\|^2 \qquad (2.13)$$

Where:

• $\hat{x}_{uij}(\Theta)$ “is an arbitrary real-valued function of the model parameter vector Θ which captures the special relationship between user u, item i and item j” [59], e.g., $\hat{x}_{uij} = x_u^T y_i - x_u^T y_j$ for a matrix factorization model;

• $\sigma$ is the logistic sigmoid $\sigma(x) = 1 / (1 + e^{-x})$;

• $\lambda_\Theta$ are model-specific regularization parameters.

It is interesting to notice that there is an analogy between this Bayesian formulation and the area under the ROC curve described in Section 2.3.2 and redefined below.

$$AUC(u) = \frac{1}{|I_u^+| \, |I \setminus I_u^+|} \sum_{i \in I_u^+} \sum_{j \in I \setminus I_u^+} \delta(\hat{x}_{uij} > 0) \qquad (2.14)$$

In fact, rewriting the average AUC over all users with the notation of Equation 2.10:

$$AUC = \sum_{(u,i,j) \in D_S} z_u \, \delta(\hat{x}_{uij} > 0) \qquad (2.15)$$

where

$$z_u = \frac{1}{|U| \, |I_u^+| \, |I \setminus I_u^+|}$$

It is clear that, besides $z_u$, the two formulas only differ in the loss function: the AUC uses the non-differentiable Heaviside function $\delta$, while BPR-OPT uses $\ln \sigma$, which can be differentiated. Consequently, the optimization criterion defined in Equation 2.13 can be maximized with classic gradient-based learning techniques [59].
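
As a concrete reference, here is a minimal sketch of such a gradient-based learning procedure for BPR-OPT, using a matrix factorization model with $\hat{x}_{uij} = x_u^T (y_i - y_j)$ and bootstrap sampling of triples from $D_S$ (all names are illustrative):

```python
import numpy as np

def bpr_train(S, m, n, f=10, lr=0.05, lam=0.01, steps=100_000, seed=0):
    """SGD sketch for BPR-OPT with a matrix factorization model.

    S: iterable of observed (u, i) pairs; m users, n items, f factors.
    Maximizes ln(sigmoid(x_uij)) - lam * ||params||^2 per sampled triple.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.01, size=(m, f))  # user-factors x_u
    Y = rng.normal(scale=0.01, size=(n, f))  # item-factors y_i
    pos = {}                                  # I_u^+ per user
    for u, i in S:
        pos.setdefault(u, set()).add(i)
    users = list(pos)
    for _ in range(steps):
        # Bootstrap-sample a triple (u, i, j) from D_S (Eq. 2.10).
        u = users[rng.integers(len(users))]
        i = list(pos[u])[rng.integers(len(pos[u]))]
        j = rng.integers(n)
        while j in pos[u]:
            j = rng.integers(n)
        x_uij = X[u] @ (Y[i] - Y[j])
        g = 1.0 / (1.0 + np.exp(x_uij))  # d ln(sigmoid(x)) / dx at x_uij
        # Gradient ascent steps on the regularized objective.
        xu = X[u].copy()
        X[u] += lr * (g * (Y[i] - Y[j]) - lam * X[u])
        Y[i] += lr * (g * xu - lam * Y[i])
        Y[j] += lr * (-g * xu - lam * Y[j])
    return X, Y
```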
