Politecnico di Milano

School of Industrial and Information Engineering

Master of Science in Computer Science and Engineering

Dipartimento di Elettronica, Informazione e Bioingegneria

Automated Feature Engineering for

Recommender Systems

Supervisor: Prof. Paolo Cremonesi

Co-supervisor: Dott. Maurizio Ferrari Dacrema

Master Thesis by:

Giacomo LOCCI Student ID: 877258

Ilyas INAJJAR Student ID: 863151


Abstract

In the last decade, the sudden development of the world wide web and of the computational power of computers has granted big companies like Amazon, Netflix or Spotify the ability to offer more extensive catalogs. These companies have based their business model on the capability to offer the user the chance of finding each product easily on their platforms. The catalog size increment, on the other hand, has highlighted the intense need for a tool that can guide the user in the platform exploration considering his/her tastes and needs. The need for this tool gave birth to the research in the Recommender Systems field. The primary objective of these new systems is to create a personalized experience for each user through the offered service. The idea behind this thesis was born from the ACM RecSys Competition of 2017 offered by XING, a social network (similar to LinkedIn) designed to build job-related networks. The competition required the use of some of the most modern recommendation techniques in the context of "cold-start" items, which are items that have been inserted recently into the catalog, so there is not enough data to build a reliable model of the interconnections between items and users. The lack of this data has brought to light the need to create ad-hoc algorithms able to solve the problem satisfactorily.


Estratto in lingua italiana

In the last ten years, the rapid development of web connections and of the computational power of machines has allowed large companies such as Amazon, Netflix and Spotify to offer ever larger product catalogs. These companies have based their business model precisely on the ability to offer the customer the possibility of finding every product quickly through their platforms. The growth of the catalog size, on the other hand, has created a strong need for a tool that can guide the user in its exploration, taking into account his or her tastes and wishes. From this need the research in the field of Recommender Systems was born. The objective of these new systems lies precisely in the creation of a personalized experience for each user within the offered service. The idea behind our thesis was born from the ACM RecSys 2017 competition offered by XING, a social network (similar to LinkedIn) used to build a network of professional contacts. The competition required exploiting modern recommendation techniques in the context of "cold-start" items, that is, items that, having just been inserted into the platform, do not yet have enough interactions to build a model of their relations with the users. The lack of such information has brought to light the need to create ad-hoc algorithms to solve the problem satisfactorily.


Ringraziamenti

We thank Professor Paolo Cremonesi for having directed us towards this innovative research field, where much has been done but even more remains to be done. We thank Dr. Maurizio Ferrari Dacrema for having supported the writing of this work with patience and professionalism. A special thanks goes to our group of university friends #Owowiwowa, which is, after all, a large part of our journey at the Politecnico: Algebro, Il Leader, Comodino, Dani*6(Oh), 3d0, Davidone, Paulo, Gianni and Fra. A little more seriously, we thank our families, who financed and supported our cultural growth, and all the friends who lightened these years, making them wonderful. Last, but not least, we would like to thank the Politecnico and all its professors, who with their expertise inspired our future.

Grazie,


Contents

Abstract

Estratto in lingua italiana

Ringraziamenti

1 Introduction
1.1 Brief History of Recommender Systems
1.2 Cold-Start problem
1.3 Thesis motivation
1.3.1 Content-Based utility
1.3.2 ACM RecSys Challenge 2017
1.4 Thesis structure description

2 State of the Art
2.1 Overview on Recommender Systems
2.2 Data collection
2.3 Content Based Filtering
2.3.1 Item representation
2.3.2 Retrieval Model
2.4 Collaborative Filtering
2.5 Memory and Model Based Techniques
2.5.1 Memory-Based techniques
2.5.2 Model-based techniques
2.6 Optimization Criterion
2.6.1 Bayesian Optimization
2.6.2 Sequential Model-Based Optimization
2.6.3 Probabilistic Regression Models
2.6.4 Acquisition Function
2.7 Feature Engineering
2.7.1 Feature
2.7.2 Feature Importance
2.7.3 Feature Extraction
2.7.4 Feature selection

3 Datasets
3.1 XING Dataset
3.1.1 Items
3.1.2 Users
3.1.3 Interactions
3.2 MovieLens
3.2.1 Items
3.2.2 Tags
3.2.3 Users
3.2.4 Interactions
3.3 Evaluation Metrics
3.3.1 Precision
3.3.2 Recall
3.3.3 F1 Score
3.3.4 Mean Average Precision

4 Implementation
4.1 Content-based filtering implementation
4.2 Automated feature engineering
4.2.1 Feature Concatenation
4.2.2 Bayesian Optimized Feature Selection
4.2.3 Bayesian Optimization Weighting

5 Results
5.1 Metric choices
5.2 Data preparation
5.3 XING Dataset
5.3.1 Concatenation
5.3.2 Selection
5.3.3 Weighting
5.3.4 Comparison
5.4 MovieLens Dataset
5.4.1 Concatenation
5.4.2 Selection
5.4.3 Weighting
5.4.4 Comparison

6 Conclusions


List of Figures

1.1 Amazon's recommendations related to Recommender Systems: The Textbook by Charu C. Aggarwal
1.2 New items suggested on Netflix
1.3 "Because you watched X" items on Netflix
1.4 Items related to friendship on Netflix
1.5 Lunatic Goats multi-layer ensemble

2.1 Recommendation Workflow - The sample contains both offline training phase and online querying phase
2.2 CBF Model WorkFlow
2.3 Classic feature representations: vector and spatial

3.1 Plot of the available interactions over time of the XING dataset
3.2 Interactions over time for the MovieLens Dataset

4.1 Feature concatenation
4.2 Feature concatenation example
4.3 Feature selection
4.4 Feature selection example
4.5 Feature weighting
4.6 Feature weighting example

5.1 Plot of subset of XING interactions dataset
5.2 Xing Metric Optimization Chart
5.3 Xing MAP@10 Optimization Chart for Weighting Phase
5.4 Plot of subset of MovieLens interactions dataset
5.5 Movielens MAP@10 Optimization Chart


List of Tables

2.1 Different open source Bayesian optimizers

3.1 Xing Dataset statistics
3.2 Item's Features
3.3 User's features
3.4 Interaction's features
3.5 MovieLens Dataset composition
3.6 Items Features
3.7 Tags Features
3.8 Merged Items Features
3.9 Interaction's features

5.1 Interaction types and values in the XING Dataset
5.2 Temporal cross-validation timespans
5.3 Xing Selected Item's Features
5.4 Xing Excluded Item's Features
5.5 Xing features selection
5.6 Xing features weighting
5.7 Xing Results Score comparison
5.8 Movielens Items Features
5.9 Movielens features selection
5.10 Movielens features weighting


Chapter 1

Introduction

“Development of recommender systems is a multi-disciplinary effort which involves experts from various fields such as Artificial intelligence, Human Computer Interaction, Information Technology, Data Mining, Statistics, Adaptive User Interfaces, Decision Support Systems, Marketing, or Consumer Behavior.” [19]

Recommender Systems Handbook

The exponential growth of available content on the web has created the need for a fast and reliable system to make catalog exploration easy and accessible to the user. During the last ten years, the biggest B2C (Business to Consumer) companies started investing a considerable amount of resources to develop and improve effective Recommender Systems (RSs). Recommender Systems are defined as a subclass of information filtering techniques. Their primary task is to provide contextual and suitable recommendations considering different data sources, such as the user behavior and the characteristics of the items belonging to the catalog. RSs are usually created to personalize the user experience (UX). Non-personalized RSs also exist; they are generally based on popularity criteria.

1.1 Brief History of Recommender Systems

Recommender systems were described for the first time by Jussi Karlgren in 1990 at Columbia University [11]. Since then, the problem of recommending items coherently with the tastes of users has become one of the most researched topics in computer science departments all over the world. Amazon is for sure one of the most significant contributors to the development of the RS field. Since the end of the 90s, Amazon has invested millions of dollars in research to increase its revenues by improving its recommendations; a sample of these recommendations is reported in Figure 1.1. In 2003, three Amazon employees (Greg Linden, Brent Smith, and Jeremy York) published one of the most famous papers in the history of RSs: "Item-to-Item Collaborative Filtering" [14]. The technique presented in this paper is still widely used in many RSs. Currently, Amazon's implementation is estimated to drive around 30% of the total income of the company. One of the most notable moments in RS history is, without any doubt, the Netflix Prize. From 2006 to 2009, the well-known movie streaming company sponsored a competition with a grand prize of US$1,000,000 for the team able to improve the original Netflix recommendations by 10% over roughly 100 million movie ratings. On 21 September 2009, the grand prize of US$1,000,000 was awarded to the BellKor's Pragmatic Chaos team using tiebreaking rules [12]. Since 2007 the Association for Computing Machinery (ACM), an international learned society for computing founded in 1947, has held one of the most influential events on RSs, also known as ACM RecSys. Among the events held by ACM there is the RecSys Challenge, which allows big companies to create competitions regarding their specific problems. Politecnico di Milano has taken part in this series of events since 2016, reaching great results in every edition.

Figure 1.1: Amazon's recommendations related to Recommender Systems: The Textbook by Charu C. Aggarwal

1.2 Cold-Start problem

Cold start is the problem of providing good recommendations also to users that have started using the platform only recently, or of suggesting items that have just been inserted in the catalog. Classic literature refers to this problem as two separate cases:

• Cold-items problem: cold-items are items that do not have enough interactions with the system, or that have been inserted too recently, to ensure the quality of the representation of the item.

• Cold-users problem: cold-users are users that have not generated enough valid information interacting with the system to guarantee a coherent recommendation.

Providing good quality recommendations in a cold-start scenario is one of the main challenges for an RS. The most popular techniques suffer from this issue due to their natural inclination to rely on interactions between users and items. The most common approach to face the cold-start problem is to implement feature-based techniques, which use only item-feature relations and do not need interactions.

1.3 Thesis motivation

This thesis aims to improve the effectiveness of content-based approaches by enhancing the available feature set of a given dataset in a cold-items scenario. The idea behind this work was born during the RecSys challenge offered by XING in 2017 [1]: with the technique described in the paper "Content-Based approaches for Cold-Start Job Recommendations" [3], the team from Politecnico di Milano, alias Lunatic Goats, was able to reach the first placement in the offline phase of the competition and the second place in the whole challenge. At the end of the challenge, we continued the research in the feature engineering field, discovering that the potential of the data can be significantly improved by exploiting the hidden cross-correlations between features. Standard CBFs implement linear models that cannot handle some dependencies between features. During the competition, the team explored the feature combination technique by manually aggregating item features, relying only on human foreknowledge and directly testing the result of each experiment on the evaluation platform. Despite the rudimentary use of this advanced technique, the results were improved by ∼30%. These achievements spurred us to keep researching in this direction, to validate the results with globally recognized metrics and datasets, and to build an automated and engineered feature enhancing system.

1.3.1 Content-Based utility

In this subsection we briefly review the reasons why we decided to improve the Content-Based filtering technique, even though it is commonly known that the best performing recommendation algorithms are the ones based on collaborative methods. We will review the differences between the two approaches in depth in Chapter 2, which concerns the state of the art of recommender systems. Given these premises, there are different situations in which a Content-Based system can widely outperform other approaches; we will exemplify those particular use cases referring to the video streaming service Netflix.

Push new contents

Services like Netflix, whose business models are fully based on subscriptions, need to continuously re-engage users by presenting new appealing content every day. New items obviously do not have enough interactions to be used efficiently in systems based on collaborative methods, hence the need for a method that relies only on the properties of the item itself. Figure 1.2 reports some recommendations of new items on Netflix.

Figure 1.2: New items suggested on Netflix

Directly linked suggestion

In this case the platform which is providing the content suggests an item based on an action that the user has done before. In Figure 1.3 Netflix is suggesting some items because the user has previously watched another TV series. In this case the Content-Based approach is necessary because the system needs to compare items directly.


Figure 1.3: "Because you watched X" items on Netflix

Catalog discovery and serendipity

Often content providers try to retain users by providing interesting and surprising content linked to a particular topic that is predicted to be interesting for a particular user. In Figure 1.4 Netflix is suggesting items correlated with the concept of friendship; clearly, to generate such a list of items a Content-Based approach must be used.

Figure 1.4: Items related to friendship on Netflix

1.3.2 ACM RecSys Challenge 2017

This subsection aims to give an overview of the approach adopted by the team Lunatic Goats for the ACM RecSys Challenge 2017. As anticipated, the competition focused on the problem of cold-start items: jobs without any interaction on the platform. The goal was to develop a recommendation algorithm able to predict, during the offline phase, the past user activities and, during the online phase, to generate suitable recommendations to send to the users through an automated pushing system implemented by XING.

The Lunatic Goats strategy was to combine different methods in a so-called "multi-layer ensemble." In this way, the team reached first place in the offline part and the second step of the podium on the final leaderboard. Due to the cold-start (see 1.2) nature of the challenge scenario, all the implemented algorithms are based on content-based approaches.


The goal of the competition was to balance a candidate's disposition toward a job posting (item) with the job recruiter's interest in the candidate (user), avoiding possible adverse reactions. The objective of this competition was dual: the teams had to find the right users, starting from a new job advertisement, who could be interested in receiving notifications about that job. The top 25 teams of the offline challenge joined the online phase. During the online evaluation, all the participants received each day the newly posted jobs to recommend and the new interactions that occurred on the day before. The final score of each team is made up of its two best performing weeks out of five. The score is calculated with a custom evaluation metric that takes into account the duality of the problem:

1. user success: predict the interest of a given user in a job position.

2. item success: predict the interest of a given recruiter in a user profile.

The Lunatic Goats team started by implementing several state-of-the-art algorithms and similarity methods:

• Content-Based Filtering - Item-Item (CBF): a resort to a classic approach, building two models for predicting positive and negative interactions.

• User-Item Profile Matching: a model that exploits the shared features between users and items, based on the hypothesis that if a user has features in common with an item, we are confident that he/she is going to like it.

• Collaborative Filtering with Content-based micro-clustering: to exploit the collaborative information contained in the dataset, for each item to recommend we associate the interactions of the top 5 most similar items according to the content-based similarity. On this augmented dataset, standard collaborative filtering is feasible.

• Matrix Factorization with Content-based micro-clustering: a matrix factorization algorithm designed to tackle the problem of implicit feedback. This technique was also feasible on the augmented dataset.

Each ensemble level received as input the scores of the top-n users for each item and processed them in three steps: it normalizes the predicted scores, weights them based on the relevance of each algorithm and, finally, sums the weighted scores to output the prediction of the level.
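As an illustration of how such a level can be assembled, the sketch below normalizes each algorithm's score matrix, applies a relevance weight and sums the contributions. The function and weight names are hypothetical and the snippet is only a simplified reading of the description above, not the team's actual implementation.

```python
import numpy as np

def ensemble_level(score_matrices, weights):
    """Combine the (items x users) score matrices produced by the algorithms
    feeding one ensemble level: normalize, weight, and sum."""
    combined = np.zeros_like(score_matrices[0], dtype=float)
    for scores, weight in zip(score_matrices, weights):
        # Normalize the scores of each algorithm to [0, 1] so that the
        # relevance weights, not the raw score ranges, drive the mix.
        span = scores.max() - scores.min()
        normalized = (scores - scores.min()) / span if span > 0 else np.zeros_like(scores, dtype=float)
        combined += weight * normalized
    return combined  # prediction emitted by this ensemble level
```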

Figure 1.5: Lunatic Goats multi-layer ensemble

1.4 Thesis structure description

After this introduction to the Recommender System world, to the competition that gave birth to the idea behind our work, and to the "cold-start" problem, we will face a chapter on the current state of the art. In this part of the thesis, we will review the most critical progress and achievements reached in the last decade. The focus will be notably shifted to the techniques we applied to build our tool, starting from modern data collection methods and arriving at the description of feature engineering. The third chapter will be dedicated to the description of the datasets we used to achieve our results. After that, the chapter on the implementation will go into the details of our tool and how it was built. We will analyze some of the techniques we decided to use and how we implemented them. Firstly we will review our content-based filtering implementation, then shift the attention to feature engineering, giving particular care to Bayesian Optimization. The last chapter will be about the results we reached and the comparisons with other techniques that do not rely on a feature engineering process in their pipeline.


Chapter 2

State of the Art

In this section, we present the current state of the art for recommender systems. The overview describes how this subclass of information filtering systems is used. One focus of this section is the description of the two main strategies for RSs: content-based filtering and collaborative filtering. The first technique tries to recommend items similar to what the user already liked in the past, while the second one exploits the past behavior of the users to create a model able to recommend items to a given user, exploiting the idea that users who have interacted with the same items have the same tastes. In the next paragraphs, Bayesian optimization [23] and feature engineering techniques [4] are presented, as they are the principal methodologies used in this thesis to improve the quality of a recommendation system in a cold-start scenario.


2.1 Overview on Recommender Systems

Figure 2.1: Recommendation Workflow - The sample contains both offline training phase and online querying phase.

A recommender system (Figure 2.1) is a tool made to help a user find the most suitable item according to various external parameters. "Item" is by definition the object that the user may search for on a particular platform; it can be a song, an item to buy or news to read. RSs are directed to users that may not have the capability to autonomously find the subject they are searching for inside platforms that aggregate multiple alternatives for similar objects. One of the most famous examples is the recommender system of Amazon.com, which suggests to users the items they could be interested in and personalizes the whole website to guide the user through the best and most profitable shopping experience. Recommendations are usually personalized, so every user receives different suggestions. Non-personalized recommenders [21] also exist; they are much more straightforward and are usually featured in magazines and newspapers. Other examples of this kind of RS are the top popular selection of songs on Spotify or the most viewed videos on YouTube; because of their fundamental simplicity, research does not usually address these systems. Personalized recommendations are usually offered as a list of ranked items; performing this task, the RS tries to predict the most suitable items based on the preferences and constraints of the user who is currently using the platform. To complete this computation, the system needs to collect some data on the relationships between users and items. Those relationships might be extracted directly from the explicit opinions of the users, like ratings or reviews, or they can be derived by interpreting the behavior of the user on the platform. For example, if a user keeps playing the same songs on a music streaming platform, the action can probably be interpreted as a sign of the music taste of the user. RS development, as for many other software artifacts, started from the observation of reality: people often rely on the suggestions of friends and family, for example reading movie reviews before going to the cinema or asking a friend before eating in a restaurant near his place. RSs try to emulate this natural behavior by creating an artificial knowledge capable of substituting the real-life interactions with a robust model created from the massive amount of data available in the digital era. The first implemented system was a so-called collaborative filtering approach, which recommends items that have interactions with users that are supposed to have a similar taste to the target user. As digital commerce started to develop, it became clear that there was a problem to address in the difficulties that users were encountering in finding the item they were searching for at the right time. This overwhelming variety of possible choices often drove the user to poor decisions, ruining the perception of the platform. Data analysts understood that user satisfaction is usually inversely related to the complexity of content search and discovery. RSs have proved in recent years to be one of the most valuable methods for solving the problem of user disorientation. In fact, an RS tries to work as a compass to guide the user to discover fresh items that are relevant to the current user's context. Once the recommendations have been computed, they are presented to the user in the UI, and he/she might accept them immediately or later. As already anticipated, the study of RSs is very young compared to other classic topics in information theory.

One way to define the objective of an RS is as a rating prediction problem [22]: to guess, for each user, the rating of an unobserved item. More formally, for each user u ∈ U we want to recommend an item i* ∈ I that the user has never seen before, maximizing the utility function φ : U × I → R:

i^* = \operatorname{argmax}_{i \in I} \varphi(u, i) \qquad (2.1)

where φ is a utility function that measures the rating given by a user u to item i. In RSs the utility is often measured as a real number called rating.

This function is defined only on a subspace of U × I, and the task of a recommender system is to extrapolate it to the whole space. The utility function is generally represented as the User Rating Matrix (URM), R ∈ R^{|U| × |I|}. The rating or preference score given by a user u to item i is thus represented by r_{ui} = φ(u, i). Another kind of problem that RSs may be asked to solve is the ranking problem. In this kind of task, we are not interested in the absolute value of a specific rating but rather in the relative placement of the suggested items. Different optimization criteria can be used for the two different kinds of problem, and some are perfectly suited for prediction while others are better suited for ranking.

2.2 Data collection

RSs are information processing systems that gather multiple types of data to build their recommendations. Data is about the items to recommend and the users who will receive the recommendations. The information can be exploited or not depending on the recommendation technique. Some techniques use primary data, such as user ratings/evaluations of items; other methods are more knowledge dependent (e.g., descriptions of the users or the items, constraints, social relations of the users). In general, we can classify data into three categories: items, users, and interactions (relations between users and items). Items are the objects that are recommendable; RSs can use a range of properties and features of the items. For example, in a movie recommender system, the genre, the directors, and the actors can be used to describe a movie and to learn how the utility of an item depends on its features. To personalize recommendations, the algorithm can exploit a range of information related to the users, such as age, gender, profession, and education. Interactions are data that describe user behavior regarding different items on the platform. For instance, in a movie RS, an interaction may be viewing, recording or starting over. If available, interactions may also include explicit feedback the user has provided, such as the rating for the selected item. Ratings are the most popular form of transaction data that a recommender system collects. Ratings can be explicit or implicit.


With explicit ratings, the user provides his/her opinion about an item on a rating scale (e.g., numerical ratings such as 1-5 stars, binary ratings such as like and dislike, or ordinal ratings such as "agree, neutral, disagree"). Implicit ratings, on the other hand, are ratings that have not been explicitly expressed by the user, like usage metrics, clicks and views.

2.3 Content Based Filtering

Content-Based RSs try to recommend items similar to what the user already liked in the past [24].

The size and the variety of the item catalogs available today in e-commerce and content provider systems have made it considerably difficult to find the right item for a given user. Users need a personalized model to access information efficiently according to their interests and taste. Content-Based RSs use the item metadata to build the similarity matrix between items, while Collaborative ones leverage user interactions as item metadata. In this chapter, we review the classic approach to Content-Based RSs, which is the base from which our thesis has been developed.

Content-Based information filtering systems need some specific techniques to represent the items and to create implicit user preferences. We can describe the recommendation process in three main steps:

• Content analyzer: Unstructured raw information needs some preprocessing to extract relevant, structured data. The primary task of this component is to manipulate the data to make it more digestible for the next steps of the system architecture. For example, songs may be represented as vectors of tags. The precise representation of data is then forwarded to the next step.

• Profile learner: This component has the role of interpreting user preferences to generalize over data and create the user profile. Usually, the generalization process is made upon machine learning techniques which can infer the user’s interest model based on the ratings that he has given to other items in the past.

• Filtering component: This module tries to suggest relevant items to the user by matching the user profile with the profile of all the eligible items. The result is a judgment on the possible relevance of different items, obtained with ranking-based techniques or rating-based techniques. Ranking-based techniques do not try to predict the user's ratings but recommend the N best items. Rating-based techniques, on the other hand, try to predict the rating of the user and recommend items that are likely to be well rated by the user.

To build the profile of an active user, his/her reactions are collected as feedback; typically two kinds of information exist, positive and negative. To record this feedback we can choose between two different techniques, and both kinds of feedback can be used at the same time to improve the personalized recommendation [9]:

• Explicit feedback: when the system requires an explicit evaluation of the item by the user. This kind of feedback is the simplest, but the cognitive load on the user is significant, and it is not guaranteed to be the best way to represent the user rating. Some examples of explicit feedback are ratings, stars, reviews, likes and wishlists.

• Implicit feedback: when the system is capable of understanding the user opinion by inferring it from their behavior. Since it is directly derived from data analysis, this kind of feedback has the significant advantage of not actively involving the user, but on the other hand, it can be biased by unexpected situations during the usage of the system (e.g., misclicks). Some examples of implicit feedback are views, clicks, navigation events and downloads.

To build the profile of an active user U_a, a training set TR_a must be defined as a set of pairs ⟨I_k, r_k⟩, where r_k is the rating provided by U_a on the item representation I_k. Given the set of users and ratings, the Profile learner usually applies a machine learning algorithm to generate the predictive model. The Filtering component then implements some strategy to rank the list of items by comparing the features of the item itself and the user profile model. Information created during the usage of the system is used to keep the system up to date, and this allows the RS to adapt dynamically to the changing behavior of the user.

Content filtering brings different advantages when compared to collaborative methods [19]:

• User independence: Content filtering computes recommendations for a particular user using only the items that the user has interacted with, while the collaborative filtering approach exploits the idea that users that have interacted with similar items have similar taste.


• Transparency: The system can explain to the user why those recommendations have been selected by explicitly listing the features that the RS has used to build the list of recommended items.

• New items: Content-based recommender systems are capable of recommending items that have not yet been rated by anyone, since they rely only on item features to build the list of suggestions.

On the other hand, Content-based RSs have different shortcomings [19]:

• Limited content analysis: There exists a natural limit in the abstraction of an item using only a defined set of features, and providing a description of the features that are relevant from the user's point of view is hard and expensive. Features can only represent a small portion of the item complexity; for example, BPM (beats per minute) cannot be the only feature to discriminate between songs on a music streaming platform.

• Over-specialization: Content-based filtering is not able to help users discover new contents: while the model gets more precise on a user's taste, there is the possibility that the recommended items start to become more similar and "boring." This issue is the so-called "serendipity" problem, and it happens when the user cannot find anything engagingly surprising through the RS.

• New user: the RS cannot model users that have not given enough ratings, and so recommendations cannot be reliable in the first phase of the system usage.

2.3.1 Item representation

An item eligible to be recommended is represented with a set of features. For example, in music recommendation, features might be the genre, the duration or the artist name. The RS usually represents items using structured data, but in some cases the features must be derived from unstructured data like raw text. Textual features create a lot of problems when computing the item representation; in fact, the classic keyword-based approach does not capture the whole spectrum of meanings that a word might assume in particular circumstances. In detail, the keyword-based approach suffers from two main problems:

• Polysemy: when a word has different meanings.

• Synonymy: when different words share the same meaning.

Polysemy might drive a misinterpretation of the document where a word is considered a feature without taking into account the context. Synonymy can lead to information leaking when the system ignores a potential feature just because the word is not contained in the string keyword set.

2.3.2 Retrieval Model

Content-based recommender systems adopt some simple retrieval model, for example keyword matching in the Vector Space Model (VSM) with TF-IDF weighting. In this model, a vector represents each document in a space where each dimension corresponds to a term of the vocabulary. Every document is formally represented as a vector of weights which indicate the association degree between documents and terms. Let D = {d_1, d_2, ..., d_N} be a set of documents and T = {t_1, t_2, ..., t_n} be the dictionary, that is to say, the set of words in the corpus. The dictionary is obtained through standard natural language processing operations. Every document d_j is represented as a vector in an n-dimensional vector space, so d_j = (w_{1j}, w_{2j}, ..., w_{nj}), where w_{kj} is the weight of term t_k in document d_j. Obtaining this representation presents two main issues: weighting the terms correctly and measuring the feature vector similarity. The most common weighting scheme is, as anticipated, TF-IDF (Term Frequency-Inverse Document Frequency) [18], which relies on some assumptions:

• Rare terms are not less relevant than common ones.
• Multiple occurrences are not less relevant than single ones.
• Long documents have the same importance as short ones.

In other words, terms that occur frequently in one document but rarely in the rest of the corpus are more likely to be relevant for that document. Additionally, normalization of the weights prevents shorter documents from being penalized for their smaller number of words. The following formula exemplifies these assumptions:

w_{ic} = tf(i, c) \cdot idf(c) \qquad (2.2)
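As a minimal sketch of this weighting scheme (illustrative only: the matrix layout and the length normalization are assumptions, not necessarily the exact formulation used in the rest of the thesis), the TF-IDF weights of a term-count matrix can be computed as follows:

```python
import numpy as np

def tfidf_weights(counts):
    """counts: (n_documents x n_terms) array of raw term occurrences.
    Returns the length-normalized TF-IDF weight matrix w_ic = tf(i, c) * idf(c)."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)  # term frequency
    df = (counts > 0).sum(axis=0)                                   # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))               # inverse document frequency
    weights = tf * idf
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.maximum(norms, 1e-12)                       # normalize document length away
```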

A vector of features now represents the content information; the preference score for a user u regarding an item i can be measured as follows:

\hat{r}_{ui} = \frac{\sum_{j \in R_u^+} r_{uj} \, \mathrm{sim}(i, j)}{\sum_{j \in R_u^+} \mathrm{sim}(i, j)} \qquad (2.3)


where sim(i, j) is a function that measures the similarity of two items in the feature vector space. The similarity is computed for each unobserved item i against all the items the user has interacted with. One of the most common similarity measures is the cosine similarity, calculated as follows:

\mathrm{sim}(i, j) = \frac{\sum_{c=1}^{n_F} w_c \, f_{ic} f_{jc}}{\sqrt{\sum_{c=1}^{n_F} f_{ic}^2} \; \sqrt{\sum_{c=1}^{n_F} f_{jc}^2}}

where f_{ic} and f_{jc} are boolean indicators of the presence of a feature c in items i and j, and w_c is the weight attributed to the specific feature. Another way to formally represent the item-based CBF model is using matrix notation:

\widehat{URM} = URM \times S \qquad (2.4)

where S, in the case of content-based filtering, is:

S = ICM \times ICM^T \qquad (2.5)

with the product computed according to the chosen similarity measure. Here the URM represents the User Rating Matrix, and the ICM represents the Item Content Matrix, a Vector Space Model of the items in the catalog.

The approach explained to this point refers to the classic item-based CBF. This approach can also be extended to support user-based CBF, where the features used to build the model belong to the user characteristics; this may be very useful to address the issue of new users, given that we can access some basic information like age, city, and education level. You can see a CBF workflow example in Figure 2.2.
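The matrix formulation above can be sketched with sparse matrices. Assuming a URM and a (possibly TF-IDF weighted) ICM in SciPy sparse format, with optional per-feature weights w_c, a minimal item-based CBF scoring could look like the following; this is an illustrative sketch, not the implementation described later in this thesis.

```python
import numpy as np
import scipy.sparse as sps

def item_item_cbf_scores(urm, icm, feature_weights=None):
    """urm: (n_users x n_items) sparse User Rating Matrix.
    icm: (n_items x n_features) sparse Item Content Matrix.
    feature_weights: optional array of per-feature weights w_c."""
    if feature_weights is not None:
        icm = icm @ sps.diags(feature_weights)          # scale each feature column by w_c
    # L2-normalize the item rows so that ICM * ICM^T is the cosine similarity
    norms = np.sqrt(icm.multiply(icm).sum(axis=1)).A.ravel()
    norms[norms == 0.0] = 1.0
    icm_normalized = sps.diags(1.0 / norms) @ icm
    similarity = icm_normalized @ icm_normalized.T      # S = ICM * ICM^T (Equation 2.5)
    return urm @ similarity                             # predicted URM = URM * S (Equation 2.4)
```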


Figure 2.2: CBF Model WorkFlow

2.4 Collaborative Filtering

One of the most popular and most effective approaches to designing a recommender system is the so-called collaborative filtering method [20]. Differently from the content-based technique presented before, this approach does not rely on any attribute or characteristic of items and users. Collaborative filters are based on the collection of vast amounts of data about user behavior and preferences, using them to predict which kind of item the user will like based on similarities with the behavior of other users. In this kind of approach, the system does not need any information on users and items; only the User Rating Matrix is necessary.

One of the most significant advantages of collaborative filters is the independence between the algorithm and the item's characteristics, which gives the technique the natural capability of abstracting over complex non-linear correlations that cannot be reconstructed by other feature-dependent methods. Collaborative filtering was built upon the supposed effectiveness of the "word-of-mouth" social phenomenon, where two users with similar interaction history will probably agree on future item ratings. This concept has been strongly enhanced in the last years, when the rise of computational power and the capillary spread of the internet connection made possible the analysis of the opinions of huge communities.

Some of the most famous examples of collaborative item-item (people who buy x also buy y) recommender systems are:

• Amazon.com, when the user purchases an item, suggests purchasing also correlated items.

• Last.fm recommends music based on a comparison of the listening habits of similar users, while Readgeek compares books ratings for recommendations.

• Facebook, LinkedIn, MySpace and many other social networks use collaborative filtering to recommend contents, friends, and groups.

This kind of algorithm has the following advantages:

• Serendipity: As anticipated, the independence from content features lets the algorithm create more varied recommendations; the similarity of the items is not based on attributes directly linked to the content itself but on customer behaviour.

• Dynamicity: The model is based on the users' preferences, which are continuously changing; this property is inherited by the system, which naturally gains the ability to follow the current trends of the user base.

• Item representation: The representation of the item does not depend on the item description, so the model is immune to issues like editorial errors, which are usually committed while tagging the items, localization issues or other kinds of human data-entry issues.

Collaborative filtering usually suffers from four kinds of problems:

• Cold-start: the system cannot predict correctly until the user has reached an acceptable amount of historical data.

• Scalability: the effectiveness of the collaborative filter is directly linked to the amount of available data; this means that the more information we have, the better the algorithm performs. Having a massive amount of data can, on the other side, raise some scalability issues.


• Sparsity: the number of items on websites like Amazon or the number of songs on platforms like Spotify is exponentially larger than the number of items seen by a single user; this makes the historical data extraordinarily sparse and creates relevant difficulties in finding similarities between different users.

• Popularity Bias: Popular contents often beat other less popular or old contents whose interactions are too old to be considered by the model; this can lead to a partial usage of the catalog.

2.5 Memory and Model Based Techniques

With a more technical perspective, all the algorithms in the recommender system field can follow two implementation strategies:

• Memory-based
• Model-based

2.5.1 Memory-Based techniques

This approach was the first to be implemented in many commercial systems thanks to its simplicity and effectiveness. Some of the most common examples of these techniques are KNN filtering and item/user-based implementations.

Nearest Neighbour Implementations (KNN)

The aim of this method, which can be applied to both users and items, is to reconstruct the similarity between elements starting from a complete knowledge built over user-item interactions. After the computation that maps similar items or users together, the prediction for a user u is calculated as a weighted average. For the item-based KNN, the objective is to suggest to the user u items which are somehow similar to the items he/she has liked in the past. Formally:

\hat{r}_{ui} = \sum_{j \in KNN(i)} r_{uj} \, S(i, j) \qquad (2.6)

This formula is identical to the one used by the content-based filtering method, with the difference that the similarity S is calculated only over the User Rating Matrix, freeing the algorithm from the necessity of using item-related information.


S(i, j) = \frac{r_i^T r_j}{\|r_i\|_2 \, \|r_j\|_2} \qquad (2.7)

where, as anticipated, S is the similarity measure and r_i, r_j are the rating vectors of the items i and j. KNN(i) is the set of the N most similar items to i.
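A dense, minimal sketch of item-based KNN scoring built only from the URM is shown below: cosine similarities are computed between item rating vectors, only the K most similar neighbours of each item are kept, and scores follow Equation 2.6. Names and the dense representation are illustrative assumptions, not the implementation used in this thesis.

```python
import numpy as np

def item_knn_scores(urm, k=50):
    """urm: dense (n_users x n_items) rating matrix. Returns predicted scores."""
    norms = np.linalg.norm(urm, axis=0)
    norms[norms == 0.0] = 1.0
    item_vectors = urm / norms                   # L2-normalized item columns
    similarity = item_vectors.T @ item_vectors   # S(i, j): cosine on rating vectors (Equation 2.7)
    np.fill_diagonal(similarity, 0.0)            # an item is not its own neighbour
    for i in range(similarity.shape[1]):         # keep only the top-k neighbours of each item
        column = similarity[:, i]
        if k < column.size:
            cutoff = np.partition(column, -k)[-k]
            column[column < cutoff] = 0.0
    return urm @ similarity                      # r_hat[u, i] = sum_j r_uj * S(j, i) (Equation 2.6)
```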

User-based KNN recommends instead items that have been liked by users that are similar to the target user u. The recommendation scores are computed as a summation of the preferences of the similar users. Formally:

\hat{r}_{ui} = \sum_{v \in KNN(u)} r_{vi} \, S(u, v) \qquad (2.8)

with the new similarity computed as:

S(u, v) = \frac{r_u \, r_v^T}{\|r_u\|_2 \, \|r_v\|_2} \qquad (2.9)

where KNN(u), differently from the item-based KNN, is the set of the top N most similar users to the selected user.

Both formulas are not normalized because we are facing a top-N recommendation problem, where we can sort the list of items and pick the first N elements. If we need to predict the ratings of a user, we may need to normalize the formulas, obtaining:

• Item-Based KNN:

\hat{r}_{ui} = \frac{\sum_{j \in KNN(i)} r_{uj} \, \mathrm{sim}(i, j)}{\sum_{j \in KNN(i)} \mathrm{sim}(i, j)} \qquad (2.10)

• User-Based KNN:

\hat{r}_{ui} = \frac{\sum_{v \in KNN(u)} r_{vi} \, \mathrm{sim}(u, v)}{\sum_{v \in KNN(u)} \mathrm{sim}(u, v)} \qquad (2.11)

2.5.2 Model-based techniques

In this approach, many complex machine learning algorithms are used to predict the ratings of unrated items. Those methods include clustering and latent semantic models as well as Bayesian networks. The most famous of these techniques is for sure Singular Value Decomposition, which owes its fame to the Netflix challenge, as it was the algorithm that gave the win to the BellKor's Pragmatic Chaos team. This particular kind of algorithm aims to understand user preferences relying on some latent factors that are extracted from the user's rating pattern. Those factors are entirely independent of human intuition: they can be comprehensible as well as altogether abstract and with no meaning for a reader.

2.6 Optimization Criterion

Recommender systems need some hyperparameter optimization to improve the recommendation effectiveness; to reach this goal, different optimization techniques are introduced into the pipeline. This section focuses on Bayesian Optimization; some examples of alternative optimization methods to perform hyperparameter fine-tuning are Grid Search, Random Search (Bergstra & Bengio, 2012) [2], and Particle Swarm Optimization (Kennedy & Eberhart, 1995) [5].

2.6.1 Bayesian Optimization

Bayesian Optimization is defined as a sequential design strategy for the global optimization of black-box functions [15]. The use of this strategy in the machine learning field usually concerns the automation of hyperparameter optimization. Improving the hyperparameter tuning with Bayesian Optimization drastically reduces the number of iterations required to reach similar results using the same model and dataset. Such reductions may mean millions of dollars of profit for big companies like Amazon; this is one of the reasons why optimizers are attracting more and more attention. Bayesian Optimizers (BO) offer better performance than most other state-of-the-art optimizers. They aim to find the minimum of a function f(x) on a bounded set X by building a probabilistic model of the function and using it to decide where to evaluate f(x) in the next steps. The two main components of a Bayesian Optimizer are:

• Regression Model: the model that has to identify relationships among the variables.

• Acquisition Function: the function that decides which point will be evaluated next.

(40)

Software Regression Model Acq. Function

Spearmint Gaussian Process Exp improv.

Moe Gaussian Process Exp improv.

Hyperopt Tree parzen est. Exp improv.

Smac Random forest Exp improv.

Table 2.1: Different open source Bayesian optimizers

Table 2.1 reports some of the most popular optimizers with their respective Regression Model and Acquisition Function, which will be explained in the next subsections.

2.6.2 Sequential Model-Based Optimization

Sequential Model-Based Optimization (SMBO) is a formalism of Bayesian Optimization [8]. As the very first step, a probabilistic regression model is initialized. After this initialization, new locations within the domain are sequentially selected using the acquisition function S, which uses the current model as a cheap substitute of the expensive objective function f. Each evaluation of the suggested point produces a result y_i = f(x_i). This result, which can be random due to the randomness of f or because of noise, is added to a historical stack that will be used to update the regression model and generate more accurate suggestions at the next step. The initialization strategy is considered a valuable move in the construction of a valid SMBO; some of those approaches are random, quasi-random, and Latin hypercube sampling of the domain (Hoffman, 2014) [7].
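A generic sketch of the SMBO loop described above is given below. The objective, domain sampler, surrogate fitting and acquisition functions are placeholders to be supplied by the caller; the snippet only illustrates the alternation between fitting the cheap surrogate and evaluating the expensive objective.

```python
def smbo(objective, sample_domain, fit_model, acquisition, n_init=5, n_iter=30, n_candidates=1000):
    """Sequential Model-Based Optimization loop (illustrative sketch, minimization).

    objective:     the expensive black-box function f(x).
    sample_domain: returns a random point of the search domain.
    fit_model:     fits a probabilistic regression model on the history.
    acquisition:   scores a candidate given the current model (higher = evaluate next).
    """
    # Initialization: evaluate a few points of the domain (here, randomly sampled)
    history = [(x, objective(x)) for x in (sample_domain() for _ in range(n_init))]

    for _ in range(n_iter):
        model = fit_model(history)                    # cheap surrogate of f
        candidates = [sample_domain() for _ in range(n_candidates)]
        x_next = max(candidates, key=lambda x: acquisition(model, x, history))
        y_next = objective(x_next)                    # expensive evaluation of the suggestion
        history.append((x_next, y_next))              # update the history for the next fit

    return min(history, key=lambda pair: pair[1])     # best configuration found
```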

2.6.3 Probabilistic Regression Models

Many different regression models can be used in the context of Bayesian Optimization. All those models have in common the definition of a predictive distribution P(y|x, D) that has to capture the uncertainty of the reconstruction process of the surrogate objective function.

Gaussian Processes

Gaussian Processes (Rasmussen & Williams, 2006) [17] are considered the standard regression model for Bayesian Optimization (Snoek et al., 2012; Martinez-Cantin, 2014). In this setting, the function f is considered to be a realization of a GP with mean function µ and covariance kernel K, f ∼ GP(µ, K). The choice of the kernel function K can have a drastic effect on the quality of the optimization result. The kernel defines the fundamental values needed to compute the predictive distribution:

k(x) = \left( K(x, x_1), \ldots, K(x, x_i) \right)^T, \qquad K_{j,k} = K(x_j, x_k) \qquad (2.12)

Predictions will follow a normal distribution.

Random Forest

Another valid choice of regression model for the Bayesian Optimizer is an ensemble of regression trees [13]. One standard approach in the construction of the predictive distribution is to assume a Gaussian as the base function. The parameters µ and σ of the Gaussian may be chosen as the empirical mean and variance of the regression values. Defining B as the set of regression values r(x) produced by the trees, we have:

\hat{\mu} = \frac{1}{|B|} \sum_{r \in B} r(x), \qquad \hat{\sigma}^2 = \frac{1}{|B| - 1} \sum_{r \in B} \left( r(x) - \hat{\mu} \right)^2 \qquad (2.13)
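Equation 2.13 can be evaluated directly from the per-tree predictions; the sketch below assumes trees exposing a scikit-learn-like predict method and is purely illustrative.

```python
import numpy as np

def forest_predictive_distribution(trees, x):
    """Empirical mean and (unbiased) variance of the per-tree regression
    values r(x), as in Equation (2.13)."""
    values = np.array([tree.predict(np.asarray(x).reshape(1, -1))[0] for tree in trees])
    mu = values.mean()
    sigma2 = values.var(ddof=1)
    return mu, sigma2
```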

An interesting property of Random Forests is that they naturally support conditional variables, which are variables that appear only under certain conditions defined by other variables of the problem. For example, a variable that controls the choice among some complex submethods may enable different other variables.

Tree Parzen Estimator

Differently from the other regression models, the Tree Parzen Estimator [16] does not define a predictive distribution over the objective function. It identifies two hierarchical processes, l(x) and g(x), that behave as generative models for all the domain variables. These processes model the domain variables when the objective function is below or above a specified value y*:

p(x \mid y, D) = \begin{cases} l(x), & \text{if } y < y^*, \\ g(x), & \text{if } y \geq y^*. \end{cases} \qquad (2.14)

Univariate Parzen estimators are organized in a tree structure that preserves any conditional dependencies, resulting in a fit for each process.


The tree structure makes it possible to conserve the specified conditional dependencies, but hidden interdependencies between variables may be lost. The application domain of this regression model is therefore restricted to problems with defined conditional dependencies.

2.6.4 Acquisition Function

The Acquisition Function has the role of defining a balance between exploring new areas of the function space and exploiting areas that are already known to give favorable values. Intuitively, it defines an expected non-negative improvement over the previously observed objective function at x:

\mathrm{EI}(x \mid D) = \int_{f_{best}}^{\infty} (y - f_{best}) \, p(y \mid x, D) \, dy \qquad (2.15)

When the predictive distribution p(y|x, D) is Gaussian, EI(x) has a convenient closed form [10].
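For reference, that Gaussian closed form can be written as EI = (µ − f_best)Φ(z) + σφ(z) with z = (µ − f_best)/σ, where Φ and φ are the standard normal CDF and PDF; a minimal sketch follows (improvement above f_best, consistent with Equation 2.15).

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, f_best):
    """Closed form of Equation (2.15) when p(y | x, D) is Gaussian with
    mean mu and standard deviation sigma."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))    # standard normal CDF
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)  # standard normal PDF
    return (mu - f_best) * cdf + sigma * pdf
```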

2.7 Feature Engineering

Feature engineering is the process of using domain knowledge to create features that algorithms can rely on to build a functional model of the data. Feature engineering is one of the most expensive, time consuming and fundamental steps in the whole machine learning process. It can be done manually or automatically based on the needs of the problem. The features in the data directly influence the predictive model and the results that can be achieved [4].

2.7.1 Feature

In machine learning, a feature is a quantifiable property or characteristic of a phenomenon that we are observing. Choosing independent, representative, and discriminating features is a crucial step for the effectiveness of algorithms in all fields of computer science, including recommender systems. Features are usually numeric, but they can also be strings or graphs.

A feature is an attribute that is meaningful to the problem. It is an essential part of an observation for learning about the structure of the problem that is being modeled. In computer vision, an image is an observation, but a feature could be a line in the picture. In natural language processing, a document or a tweet could be an observation, and a phrase or word count could be a feature. In speech recognition, an utterance could be an observation, but a feature might be a single word or phoneme.


Features are usually represented as a vector, see Figure 2.3.

Figure 2.3: Classic feature representations: vector and spatial

2.7.2 Feature Importance

The value that a particular feature adds to the effectiveness of the model can be objectively evaluated. Once you have assigned a score to each feature, they can be easily ranked and removed if the score does not satisfy the given condition. The feature importance may be profoundly influenced by the correlation degree between the analyzed feature and the target variable that must be predicted.

2.7.3 Feature Extraction

In many fields of computer science, such as machine learning, image processing or pattern recognition, feature extraction is defined as the process of building derived values from raw data, intended to be informative and non-redundant, to facilitate the learning and generalization steps and also to enhance human interpretation. Feature extraction is strictly correlated with the concept of dimensionality reduction. If the input data of the learning algorithm are too expensive in terms of memory, or if they are suspected to be redundant (for example, if the same measurement is expressed in different measurement units), the original data can be reduced to a vector of features, usually called a feature vector. The process of determining the components of this feature vector is called feature selection, and it is usually based both on human intuition and on experimental results. The set of selected features is expected to provide a meaningful intrinsic representation of the item specifically for the task that our algorithm is built to achieve. For example, if the algorithm has to predict which kind of film the user may like, this decision will probably not be based on the poster image of the movie.

2.7.4 Feature selection

As anticipated in the previous paragraph, the feature selection process addresses the problem of determining the usefulness of a particular kind of feature in the learning process. Feature selection techniques are mainly used for four reasons:

• Model simplification, to make models easier to read by human beings
• Shorter training times

• Avoid dimensionality issues

• Improving generalization ability reducing overfitting

The central premise of the use of a feature selection technique is that the data contains some redundant or irrelevant elements that can be removed without incurring information loss. Redundant means that the feature is represented multiple times in distinct information instances. Irrelevant features are those which are not considered useful for the learning purpose of the algorithm. Feature selection must be distinguished from feature extraction, as the former does not create any new feature but defines a subset of the original feature set. A feature selection algorithm is usually composed of a search technique that proposes new subsets and an evaluation metric that scores each feature subset. The choice of the evaluation metric heavily influences the algorithm behavior; we can distinguish three main categories of possible feature selection algorithms:

• Wrapper methods: This method uses a predictive model to score each feature subset. Each new subset is used to train a different model, which is then tested on a hold-out set. Wrapper methods usually provide the best feature subset but, since they have to train a model from scratch for every feature subset that has to be tested, they are the most computationally expensive methods. A minimal sketch of this approach is given at the end of this subsection.


• Filter: Filtering methods use a proxy measurement instead of a standard error rate to evaluate the efficiency of a feature subset. Proxies are chosen to be low-cost operations. Standard measures include mutual information, pointwise mutual information, the Pearson product-moment correlation coefficient, Relief-based algorithms, inter/intra-class distance, and the scores of significance tests for each class/feature combination.

• Embedded: Embedded methods perform the feature selection as part of the model construction. The most common example of this approach is the well-known LASSO method for linear model construction, which penalizes regression coefficients with an L1 penalty, shrinking many of them to zero.

The choice of an optimality criterion is difficult, as there are multiple objectives in the feature selection task that could be optimized. The most common criteria incorporate measures of accuracy penalized by the number of selected features.
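To make the wrapper approach above concrete, the following sketch performs a greedy forward selection; the evaluate callable (training a model on a subset and returning a validation score) is a placeholder, and this generic loop is not the Bayesian-optimized selection developed later in this thesis.

```python
def greedy_wrapper_selection(features, evaluate, max_features=None):
    """Greedy forward wrapper selection (illustrative sketch).

    features: list of candidate feature names.
    evaluate: trains a model on a feature subset and returns its validation
              score (higher is better) -- the expensive step of the wrapper.
    """
    selected, best_score = [], float("-inf")
    limit = max_features or len(features)

    while len(selected) < limit:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Try adding each remaining feature and keep the best-scoring extension
        score, feature = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break                     # no remaining feature improves the model
        selected.append(feature)
        best_score = score

    return selected, best_score
```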


Chapter 3

Datasets

In this chapter, we present the main characteristics of the datasets we used to evaluate the proposed approach. To obtain a more general perspective on the effectiveness of the method proposed in this thesis, we decided to use the dataset from the ACM RecSys Challenge 2017 provided by XING and the MovieLens dataset [6] created by the GroupLens research group, which was collected and made available through the MovieLens website.

Figure 3.1: Plot of the available interactions over time in the XING dataset

3.1 XING Dataset

As already stated in the introduction, XING is a career-oriented social networking website that offers a platform linking companies with potential employees. This dataset was created to be used in the ACM RecSys challenge, where teams have high computational power to analyze the data. Since we did not have such resources, we decided to split the dataset into portions that kept the properties of the original one. During the competition, XING chose to face the recommendation problem by reversing the classic paradigm: for each item, the top 100 users that could likely be interested in it had to be found.

In this thesis instead, we decided to use a classical approach to the problem: for each user, we recommended the top N items.

Figure 3.1 represents the interaction distribution over time, and Table 3.1 describes the modifications that we applied to the dataset cardinalities with a random sampling strategy (a minimal sketch of the sampling is shown after the table).

              Original Cardinality   Reduced Cardinality
U             1497020                63466
I             1306054                70402
interactions  8274901                100000
sparsity      4.2322 × 10^-5         2.238 × 10^-4

Table 3.1: XING dataset statistics
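The snippet below is a minimal sketch of the random sampling strategy mentioned above; the file name and the column names (user_id, item_id) are assumptions made for illustration, and the exact subsampling procedure in our pipeline may differ.

import pandas as pd

# Load the full interaction log (file name and separator are assumptions).
interactions = pd.read_csv("interactions.csv", sep="\t")

# Keep a random sample of 100,000 interactions, then restrict users and items
# to those that still appear in the sample.
sample = interactions.sample(n=100_000, random_state=42)
kept_users = sample["user_id"].unique()
kept_items = sample["item_id"].unique()
print(len(kept_users), len(kept_items), len(sample))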

3.1.1 Items

The client company can create a job offer containing some useful details that represent the vacant position efficiently. A detailed description can be found in Table 3.2.


Feature         Description
item id         Anonymized item ID
career level    Career level of the job
discipline id   ID of the role the job is offering
industry id     ID of the field of the job
country         Country where the job is located
region          Region where the job is located
latitude        Latitude where the job is located
longitude       Longitude where the job is located
employment      Type of employment
title           Anonymized keywords related to the title of the job
tags            Anonymized keywords describing the content of the job
created at      Timestamp that corresponds to the insertion of the job offer

Table 3.2: Item's Features

The metadata of the dataset are composed of various kinds of data, either lists or single values. We decided to represent the feature space by adopting a one-hot encoding strategy. One-hot encoding is a data preprocessing procedure that transforms categorical values into a binary form that can be used more effectively by machine learning algorithms.
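As an illustration, the snippet below is a minimal sketch of how the single-valued categorical attributes of Table 3.2 can be one-hot encoded into a sparse Item Content Matrix (ICM); the file name and column names are assumptions, and the list-valued attributes (title, tags) would require a separate encoding step.

import pandas as pd
from scipy import sparse

# Load the item metadata (file name and separator are assumptions).
items = pd.read_csv("items.csv", sep="\t")

categorical_cols = ["career_level", "discipline_id", "industry_id",
                    "country", "region", "employment"]

# pd.get_dummies creates one binary column per (feature, value) pair.
one_hot = pd.get_dummies(items[categorical_cols].astype("category"),
                         prefix=categorical_cols)

# Store the resulting ICM as a sparse matrix: one row per item.
icm = sparse.csr_matrix(one_hot.values)
print(icm.shape)  # (n_items, n_one_hot_features)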

As we will present in Chapter 4, the core of our thesis consists of an automated feature engineering process that increases the number of features of each item exponentially.

3.1.2 Users

Users on the XING platform are customers who have registered on the platform and are possibly interested in a new job. A detailed description can be found in Table 3.3.

The method presented in this thesis may also be applied to algorithms that generate models based on user features. However, since the content-based item-item algorithm that we implemented in this work does not consider this information, we did not use this part of the dataset.


Feature                       Description
user id                       ID of the user
jobroles                      IDs of the job roles extracted from the user's profile
career level                  Career level of the user
discipline id                 ID of the role of the user
industry id                   ID of the field where the user works
country                       The country where the user is currently working
region                        The region where the user is currently working
experience n entry class      Number of CV entries that the user has listed as work experience
experience years experience   Estimated number of years of the user's work experience
experience years in current   Estimated number of years that the user has been working in his current job
edu degree                    Estimated university degree of the user
edu fieldofstudies            List of fields that the user studied
wtcj                          User's willingness to change job

Table 3.3: User's features

3.1.3 Interactions

XING provided implicit feedback in the form of an interactions dataset, whose features are described in Table 3.4. An interaction is any transaction between a user and an item. The dataset contains 6 different types of interaction:

1. Impression: the platform has shown a specific recommendation to the user.

2. Click: the user clicked on the item.

3. Bookmark: the user bookmarked the item on XING.

4. Reply: the user clicked on the reply button or application form button that is shown on some job postings.

5. Delete (negative): the user deleted a recommendation from his/her list of recommendations (by clicking on ”x”), which means that the recommendation will no longer be shown to the user and that a new recommendation item will be loaded and displayed in its place.

6. Shown interest: a recruiter from the item's company showed interest in the user (e.g., clicked on the user's profile).

Feature            Description
user id            ID of the user
item id            ID of the job offer
interaction type   Type of interaction between user and item
created at         Timestamp of the interaction event

Table 3.4: Interaction's features

During the competition, the massive amount of impressions (9 GB) and the biased nature of a model built with data coming directly from the already existing recommender system convinced us to discard the impressions (interaction type 1). Even though interactions of type 5 were giving a boost to the score because of XING's custom metric, we decided to exclude them from the model.

The dataset distributed by XING during the competition has some missing data when plotted on the time axis. After further investigation, we discovered that there was a gap of approximately 30 days. This issue was degrading the results of the k-fold temporal cross-validation that we implemented in the optimization pipeline. We avoided the problem by limiting the dataset to the timespan that was fully covered by interactions: we chose to take only events from 01/01/2017 to 07/02/2017.
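A minimal sketch of this filtering step is shown below; the file name and the created_at column (a Unix timestamp, as in Table 3.4) are assumptions made for illustration.

import pandas as pd

# Load the interaction log (file name and separator are assumptions).
interactions = pd.read_csv("interactions.csv", sep="\t")

# Keep only the events inside the fully covered timespan.
start = pd.Timestamp("2017-01-01").timestamp()
end = pd.Timestamp("2017-02-07").timestamp()
mask = (interactions["created_at"] >= start) & (interactions["created_at"] <= end)
interactions = interactions[mask]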

3.2 MovieLens

This dataset was created by the GroupLens research lab in the computer science department of the University of Minnesota. MovieLens is primarily a recommender system and a community website, created in 1997 specifically to gather useful information for research on personalized recommendations. Its stability and its extensive use in many papers and articles guided our choice of this dataset. The GroupLens lab offers many different versions of this dataset; we chose the Stable Benchmark 10M version.


Figure 3.2: Interactions over time for the MovieLens dataset

            Cardinality
ratings     10M
tags        100k
I           10k
U           72k
sparsity    0.0138

Table 3.5: MovieLens dataset composition

3.2.1 Items

Movies are characterized by 3 metadata fields: MovieID, Title and Genres.

Feature    Description
MovieID    The real MovieLens id
Title      Identical to those found in IMDB, including year of release
Genres     Pipe-separated list of genres of the movie

Table 3.6: Items Features

The genres are selected from the following list: Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War and Western. Item features are described in Table 3.6. Dataset cardinality is described in Table 3.5.


3.2.2 Tags

All tags are contained in a separate document. Each line of this document represents one tag applied to one movie by one user, and has the following format:

Feature      Description
UserID       The real user id
MovieID      The real MovieLens id
tag          User-generated metadata about movies
Timestamp    Seconds since midnight UTC of January 1, 1970

Table 3.7: Tags Features

The lines within this file are ordered first by UserID, then, within user, by MovieID.

We decided to use the tags features to build the ICM, merging the two tables by MovieID:

Feature    Description
MovieID    The real MovieLens id
Title      Identical to those found in IMDB, including year of release
Genres     Pipe-separated list of genres of the movie
UserIDs    The real user ids
tags       User-generated metadata about movies

Table 3.8: Merged Items Features
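The snippet below is a minimal sketch of how this merge can be performed; the file names, separators and column names follow the MovieLens 10M distribution as we understand it and should be taken as assumptions.

import pandas as pd

# Load movies and tags (format assumed from the MovieLens 10M distribution).
movies = pd.read_csv("movies.dat", sep="::", engine="python",
                     names=["MovieID", "Title", "Genres"])
tags = pd.read_csv("tags.dat", sep="::", engine="python",
                   names=["UserID", "MovieID", "Tag", "Timestamp"])

# Collect all tags (and the users who applied them) for each movie.
tags_per_movie = tags.groupby("MovieID").agg(
    UserIDs=("UserID", list),
    Tags=("Tag", list),
).reset_index()

# Attach them to the movie metadata, obtaining the structure of Table 3.8.
merged = movies.merge(tags_per_movie, on="MovieID", how="left")
print(merged.head())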

3.2.3 Users

The Movielens users were selected at random for inclusion. Their ids have been anonymized.

Users were selected separately for inclusion in the ratings and tags data sets, which implies that user ids may appear in one set but not the other.

The anonymized values are consistent between the ratings and tags data files. No other features are present except the UserID.

3.2.4 Interactions

Each line of this document represents one rating of one movie by one user, and has the following format:


Feature       Description
UserID        ID of the user
MovieID       ID of the movie
Rating        Made on a 5-star scale, with half-star increments
TimeStamps    Seconds since midnight UTC of January 1, 1970

Table 3.9: Interaction's features

3.3 Evaluation Metrics

With the accuracy metrics presented in this chapter, we try to measure the quality of the decisions taken by our RSs. The purpose of these metrics is to measure precisely how many items have been classified correctly [19].

There are two kinds of metrics:

• Online metrics: they focus their measurements on the reaction of the users in a real-world environment, assessing the quality of the system with business KPIs such as ROI, retention, CTR, and CR.
• Offline metrics: usually applied in the academic world, they use error measurements such as RMSE and MAE, or measures such as recall and catalog coverage, to match the user preferences derived from the ground truth test set.

We use only offline metrics to measure the quality of our RSs. In the following subsections we use TP, TN, FP and FN; their meanings are:

• True Positives (TP): items that have been suggested by the system and the user has interacted with that item.

• False Positives (FP): items that have been suggested by the system but the user hasn’t interacted with that item.

• True Negatives (TN): items that haven't been suggested by the system and the user would not have interacted with that item.

• False Negative (FN): items that haven’t been suggested by the system but the user would have interacted with that item.

3.3.1 Precision

Precision is defined as the ratio between the relevant recommended items and the total number of items recommended by the system. Given a user u, a set of recommended items and a set of relevant items, precision is measured as:

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (3.1) \]

Usually, this metric is applied to the first N recommended items; in this case, the metric is called Precision@N.

\[ \text{Precision@}N = \frac{\text{n. relevant items predicted @}N}{N} \qquad (3.2) \]

3.3.2 Recall

Recall measures the ratio between the relevant recommended items and the total number of relevant items in the ground truth. Note that when the cardinality of the predicted item list matches the cardinality of the ground truth, precision and recall are equal.

\[ \text{Recall} = \frac{TP}{TP + FN} \qquad (3.3) \]

As for precision, recall is usually measured on the first N items.

\[ \text{Recall@}N = \frac{\text{n. relevant items predicted @}N}{\text{n. relevant items}} \qquad (3.4) \]

This measure is best suited to cases where we have to classify all items for a user, which makes it less useful in RSs, where we usually want to rank only a subset of K elements instead of all of them.
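To make Equations 3.2 and 3.4 concrete, the snippet below is a minimal sketch of Precision@N and Recall@N for a single user; the item identifiers in the toy usage are arbitrary.

def precision_at_n(recommended, relevant, n):
    """Fraction of the top-N recommended items that are relevant."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n

def recall_at_n(recommended, relevant, n):
    """Fraction of the relevant items that appear in the top-N recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / len(relevant)

# Toy usage: items 10 and 42 are the relevant ones for this user.
recommended = [10, 7, 42, 3, 99]
relevant = {10, 42}
print(precision_at_n(recommended, relevant, 5))  # 2/5 = 0.4
print(recall_at_n(recommended, relevant, 5))     # 2/2 = 1.0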

3.3.3 F1 Score

This metric is a combination of Precision and Recall. Formally, it can be defined as the harmonic mean of the two.

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3.5) \]

3.3.4 Mean Average Precision

Mean Average Precision, also known as MAP, is one of the more elaborate measures derived from the basic ones we have presented before. It is computed as the mean of the Average Precision of each user:


\[ AP = \frac{1}{m} \sum_{k=1}^{N} \big( P(k) \text{ if the } k\text{-th item was relevant} \big) \qquad (3.6) \]

\[ MAP = \frac{1}{|U|} \sum_{u=1}^{|U|} AP(u) \]
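A minimal sketch of these two quantities is shown below; we assume here that m in Equation 3.6 is the number of relevant items of the user (capped at N), which is a common convention but is not stated explicitly above.

def average_precision(recommended, relevant, n):
    """AP for one user: mean of P(k) over the positions k holding a relevant item."""
    hits, precision_sum = 0, 0.0
    for k, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k   # P(k), accumulated only at relevant positions
    m = min(len(relevant), n)
    return precision_sum / m if m > 0 else 0.0

def mean_average_precision(recommendations, ground_truth, n):
    """MAP: recommendations and ground_truth are dicts indexed by user id."""
    aps = [average_precision(recommendations[u], ground_truth[u], n)
           for u in ground_truth]
    return sum(aps) / len(aps)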


Chapter 4

Implementation

In this chapter, we review the implementation of the various state-of-the-art algorithms and the functionalities that we introduced into the pipeline to magnify the effectiveness of the features. All the code supporting this thesis was written in Python.

4.1 Content-based filtering implementation

In this algorithm, we calculate the similarity between items and then we suggest an item i to a user u if u has enough similar items in his/her interaction history. The formula we coded to calculate the score of an item i for a user u is:

\[ r_{ui} = \sum_{j \in URM(u)} r_{uj} \cdot s_{ij} \qquad (4.1) \]

where $s_{ij}$ is:

\[ s_{ij} = \frac{\sum_{k \in F} f_{ik} \cdot f_{jk} \cdot w_k}{\sqrt{\sum_{k \in F} (f_{ik} \cdot w_k)^2} \cdot \sqrt{\sum_{k \in F} (f_{jk} \cdot w_k)^2} + \text{shrink}} \qquad (4.2) \]

We used the tf-idf scoring, implemented following this formula:

\[ f_{ik} = c_{ik} \cdot \log_{10}\left(\frac{\#\,\text{items}}{\#\,\text{items with feature } k}\right) \qquad (4.3) \]

The shrink addend, also known as the shrinking factor, is a constant parameter whose optimal value depends on the dataset. It corrects the item-to-item similarity measure where little information is available, thus avoiding overfitting.
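The snippet below is a minimal sketch of Equations 4.2 and 4.3 on a toy sparse ICM; the ICM, the feature weights w_k and the shrink value are placeholders, not the actual matrices and parameters used in our experiments.

import numpy as np
import scipy.sparse as sps

def tfidf_weight(icm):
    # f_ik = c_ik * log10(#items / #items with feature k), as in Equation 4.3.
    n_items = icm.shape[0]
    items_per_feature = np.asarray((icm > 0).sum(axis=0)).ravel()
    idf = np.log10(n_items / np.maximum(items_per_feature, 1))
    return icm.multiply(idf.reshape(1, -1)).tocsr()

def similarity_with_shrink(icm, feature_weights, shrink=10.0):
    # Weighted cosine similarity with a shrinkage term, as in Equation 4.2.
    weighted = icm.multiply(feature_weights.reshape(1, -1)).tocsr()  # f_ik * w_k
    numerator = weighted.dot(icm.T).toarray()                        # sum_k f_ik * f_jk * w_k
    norms = np.sqrt(np.asarray(weighted.multiply(weighted).sum(axis=1)).ravel())
    denominator = np.outer(norms, norms) + shrink
    sim = numerator / denominator
    np.fill_diagonal(sim, 0.0)  # an item should not recommend itself
    return sim

# Toy usage: 3 items described by 4 binary features, uniform feature weights.
icm = sps.csr_matrix(np.array([[1., 0., 1., 0.],
                               [1., 1., 0., 0.],
                               [0., 0., 1., 1.]]))
sim = similarity_with_shrink(tfidf_weight(icm), np.ones(4), shrink=5.0)
print(np.round(sim, 3))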

