Bayesian optimization for ensemble hyperparameters in job recommendation

Academic year: 2021


POLITECNICO DI MILANO

Master of Science Degree in Computer Science and Engineering Department of Electronics, Information and Bioengineering

Bayesian Optimization for Ensemble

Hyperparameters in Job Recommendation

Supervisor: Prof. Paolo Cremonesi

Co-supervisor: Dott. Maurizio Ferrari Dacrema

Authors: Federico Cesaro (852410), Mattia Dagrada (852875)


Abstract

Recommender Systems are a subclass of information filtering systems that try to predict the preferences or the ratings that a user would give to an item. They are applicable in a variety of domains, from movie recommendation to shopping recommendation. Based on the available data, there is a rich set of techniques that can be employed to perform recommendations. These techniques exploit specific information contained in the data, and each category of algorithms focuses on a different type of data. Beyond these basic approaches, ensemble methods are employed: these techniques merge the recommendations of two or more algorithms in order to obtain new ones. As with the basic algorithms, several different types of ensemble techniques exist.

This thesis is about applying optimization techniques to an ensemble method called average weighting, in order to find the best possible values for the weights. Automated techniques can be extremely useful compared to manual parameter tuning, especially as the number of parameters grows. This work applies three different optimization techniques, Genetic Algorithms, Powell's Conjugate Direction method and Bayesian Optimization, with the intent to show that Bayesian Optimization ultimately outperforms the other two methods in terms of computational speed and raw performance.

We resorted to data from the 2017 RecSys Challenge to support our work. We took part in the challenge along with other students from Politecnico di Milano and finished 2nd, placing first among academic teams while being the only team composed exclusively of Master's degree students.


Extended Summary (translated from the Italian)

Recommender Systems are systems that predict users' preferences, or the ratings they would give to one or more items. These systems are applicable in many settings: from sites offering movies and TV series in streaming, to online shopping platforms, and even to platforms dealing with job offers. They rely on a wide range of techniques, chosen partly according to the type of data available. Some techniques focus on the attributes of the items to be recommended, others on the ratings or, more generally, on the interactions that users have had with items.

This diversity created the need for techniques able to exploit all the available data at the same time. They are called ensemble techniques and come in several varieties. Their purpose is to merge the recommendations proposed by several models in order to create new ones, thereby improving accuracy and/or diversity.

The thesis work focuses on applying optimization techniques to a particular ensemble technique called Average Weighting: each algorithm is assigned a value that weights its recommendations, and the weighted scores are then averaged for every recommended item. In cases like this it is essential to find the best weights to assign to each algorithm: what we propose is therefore to use optimization techniques to automate the process, so as to reach the optimal weight configuration as efficiently as possible. To this end we use three optimization techniques: Genetic Algorithms, Powell's Conjugate Direction method and Bayesian Optimization. The goal is to show that the latter is superior to the others in terms of both performance and speed.

Our work is proposed as a follow-up to the 2017 RecSys Challenge, an international competition held every year alongside the ACM Conference on Recommender Systems. We took part in the competition with the team Lunatic Goats, composed of 8 students from the Master of Science in Computer Science and Engineering at Politecnico di Milano. We finished second, placing first among all student teams and also outperforming many companies with expertise in Recommender Systems. Ours was, moreover, the only team composed entirely of Master's degree students. The data was provided by XING, a Germany-based professional networking platform. The task was therefore a case of recommending job offers that had just been posted on the platform and for which, consequently, no user interactions were available.

State of the Art

The literature offers several techniques for building ensembles. The most common and easiest to implement, such as Weighted Averaging or Voting techniques, provide good performance gains over the individual algorithms. There are also techniques from the Machine Learning and Data Mining worlds that use more complex approaches and cannot always be implemented, but which in many cases, particularly in online competitions, give their users excellent performance.

Approaches like the one we propose, namely combining optimization methods with ensemble techniques, are almost absent from the literature, while they are more commonly used in other fields.

Thesis structure

The contents of the thesis are organized in 6 chapters. Chapter 1 covers the motivations and goals of the thesis. Chapter 2 focuses on the challenges posed by the competition, on how we tackled them, and on how we decided to continue the work in this thesis. Chapter 3 reviews what the literature proposes in terms of ensemble techniques and optimization techniques, with, where possible, an example of application to Recommender Systems; it also briefly introduces the base algorithms that are usually employed. Chapter 4 describes the domain in which the thesis work is carried out and explains in detail which data was used. Chapter 5 presents the evaluation metrics, all the implementation details of the algorithms used in the ensemble, which ensemble techniques are used, and finally how the optimization techniques were implemented and used. Chapter 6 presents the experimental results and the conclusions that can be drawn from them.


Contents

1 Introduction
  1.1 Thesis structure
2 Problem Description
  2.1 RecSys Challenge 2017
  2.2 Problem approach
3 State of the Art
  3.1 Recommender Systems
  3.2 Ensemble techniques
    3.2.1 Voting
    3.2.2 Averaging
    3.2.3 Stacking
    3.2.4 Bagging
    3.2.5 Boosting
    3.2.6 Multi-layer ensemble
  3.3 Optimization techniques
    3.3.1 Bayesian Optimization
    3.3.2 Genetic Algorithms
    3.3.3 Powell's Conjugate Direction Method
4 Problem domain and dataset description
  4.1 Domain description
  4.2 Dataset
    4.2.1 Challenge Dataset
    4.2.2 Our dataset
5 Implementation
  5.1 Evaluation metrics
    5.1.1 Confusion Matrix
    5.1.2 Precision
    5.1.3 Recall
    5.1.4 F1 score
    5.1.5 Mean Average Precision (MAP)
    5.1.6 Normalized Discounted Cumulative Gain
  5.2 Similarities
    5.2.1 Cosine
    5.2.2 Jaccard
    5.2.3 Tanimoto
  5.3 Base algorithms
    5.3.1 Content-Based Filtering algorithms
    5.3.2 Collaborative Filtering algorithms
    5.3.3 Negative Interactions
    5.3.4 Matrix Factorization algorithms
  5.4 Ensemble description
    5.4.1 Ensemble type
    5.4.2 Ensemble structure
    5.4.3 Score normalization
  5.5 Ensemble Optimization
    5.5.1 Bayesian Optimization
    5.5.2 Genetic Algorithms
    5.5.3 Powell's Conjugate Direction Method
6 Results and considerations
  6.1 Metric choice
  6.2 Genetic Algorithms
  6.3 Powell's Conjugate Direction
  6.4 Bayesian Optimization
  6.5 Method comparison and conclusions
  6.6 Conclusions


List of Figures

2.1 Lunatic Goats multi-layer ensemble [1]
3.1 STREAM model performances comparison [2]
3.2 PumpkinPie multi-layer ensemble
5.1 Mono-layer ensemble
5.2 Multi-layer ensemble
6.1 10 runs of Genetic Algorithms on mono-layer ensemble, results on MAP@10
6.2 10 runs of Genetic Algorithms on mono-layer ensemble, results on Recall@10
6.3 10 runs of Genetic Algorithms on multi-layer ensemble, results on MAP@10
6.4 10 runs of Genetic Algorithms on multi-layer ensemble, results on Recall@10
6.5 10 runs of Powell's Conjugate Direction method on mono-layer ensemble using MAP@10
6.6 10 runs of Powell's Conjugate Direction method on mono-layer ensemble using Recall@10
6.7 10 runs of Powell's Conjugate Direction method on multi-layer ensemble using MAP@10
6.8 10 runs of Powell's Conjugate Direction method on multi-layer ensemble using Recall@10
6.9 Comparison of three drivers of the Spearmint implementation
6.10 10 runs of Bayesian Optimization with different seeds on mono-layer ensemble using MAP@10
6.11 10 runs of Bayesian Optimization with different seeds on mono-layer ensemble using Recall@10
6.12 10 runs of Bayesian Optimization with different seeds on multi-layer ensemble using MAP@10
6.13 10 runs of Bayesian Optimization with different seeds on multi-layer ensemble using Recall@10
6.14 Comparison of the three methods using mono-layer structure on MAP@10
6.15 Comparison of the three methods using mono-layer structure on Recall@10
6.16 Comparison of the three methods using multi-layer structure on MAP@10
6.17 Comparison of the three methods using multi-layer structure on Recall@10
6.18 Weights distribution after 50 runs on mono-layer structure with MAP@10
6.19 Weights distribution after 50 runs on mono-layer structure with Recall@10
6.20 Weights distribution after 50 runs on multi-layer structure with MAP@10
6.21 Weights distribution after 50 runs on multi-layer structure with Recall@10
6.22 Weights distribution after 75 runs on mono-layer structure with MAP@10
6.23 Weights distribution after 75 runs on mono-layer structure with Recall@10
6.24 Weights distribution after 75 runs on multi-layer structure with MAP@10
6.25 Weights distribution after 75 runs on multi-layer structure with Recall@10
6.26 Weights distribution after 100 runs on mono-layer structure with MAP@10
6.27 Weights distribution after 100 runs on mono-layer structure with Recall@10
6.28 Weights distribution after 100 runs on multi-layer structure with MAP@10
6.29 Weights distribution after 100 runs on multi-layer structure with Recall@10


List of Tables

3.1 Voting example
4.1 Cardinalities of challenge dataset
4.2 Item features
4.3 User features
4.4 Interaction features
4.5 Our dataset
5.1 Confusion Matrix
5.2 CBFuu parameters
5.3 CBFuu performances
5.4 CBFii parameters
5.5 CBFii performances
5.6 Profile Matching parameters
5.7 Profile Matching performances
5.8 CFuu parameters
5.9 CFuu performances
5.10 CFii parameters
5.11 CFii performances
5.12 CBFiiNegative parameters
5.13 CBFiiNegative performances
5.14 CFiiNegative parameters
5.15 CFiiNegative performances
5.16 MF BPR model parameters
5.17 MF BPR performances
5.18 iALS model parameters
5.19 iALS performances
6.1 Scores on test set at iteration 50


Chapter 1

Introduction

Recommender Systems, or RecSys, are systems that, given past interactions of users with items, try to predict their future connections. They combine machine learning, data mining and information retrieval techniques, while also considering psychological and social aspects.

With the massive growth of e-commerce and social networks in the last decade, Recommender Systems have also grown to play a crucial role for a large number of companies in many different fields. In a world where more and more purchases are performed online, taking away time and interactions between people, recognizing and predicting a customer's preferences becomes incredibly important in order to increase the quality of the service and to satisfy, or even surprise, them.

The simplest example one could think of is Amazon, which best represents the growth of online shopping in our lives. However, many other fields involve RecSys technologies, such as movie and music streaming platforms as well as social networks. A fact that proves the importance of recommendations for companies in recent years is the spread of RecSys competitions: many companies invest money in prizes for teams that succeed in improving their current Recommender System. One of the first and most well-known RecSys competitions is the Netflix Prize, held in 2006, when Netflix offered a million-dollar prize to the team that could improve their RecSys performance by at least 10%. The competition took almost 3 years to complete and was won by a joint team of several researchers called BellKor's Pragmatic Chaos.

The core of this thesis work derives from a Recommender Systems competition, the RecSys Challenge 2017, hosted by XING AG for this year's edition of the ACM International Conference on Recommender Systems that took place in Como. We participated in the competition as a team named Lunatic Goats, composed of eight students from Politecnico di Milano. We placed first in the offline phase and second in the online phase (first among academic teams). We were therefore invited to the aforementioned conference to present our work and to receive a prize. Our paper "Content-Based Approaches for Cold-Start Job Recommendations" (Mattia Bianchi, Federico Cesaro, Filippo Ciceri, Mattia Dagrada, Alberto Gasparin, Daniele Grattarola, Ilyas Inajjar, Alberto Maria Metelli and Leonardo Cella) was accepted and published in the conference proceedings.

Since the competition was held by XING, a business social network oriented towards employment, the challenge was in the job-recommendation domain. This type of Recommender System has some peculiar characteristics that are not present in other scenarios like e-commerce or music recommendation. First of all, it becomes crucial to predict interactions as precisely as possible, since choosing a job is very different from buying something online. The user will have much more interest in finding a good job offer, because it is something that will most likely affect an important part of their life for a significant period of time. For the same reason, companies have an interest in finding motivated and compatible candidates for their positions, since they probably can't afford to hire a person that doesn't match their needs.

In short, it is easy to see that looking for a new job is far different from buying a new book online or choosing which new movie to watch. For this reason the need for an efficient Recommender System arises and leads to a stronger drive in research and development.

Our thesis work starts from the results obtained by Lunatic Goats and explores the branch of the algorithm dedicated to the ensemble. In Recommender Systems, and in Machine Learning in general, we define as an ensemble technique a method that takes into consideration the results of more than one algorithm and combines them in some way, in order to build a new learner that exploits the best features of each approach and obtains a final outcome that outperforms every algorithm individually.

Ensemble techniques have hyperparameters that have to be tuned in order to optimize the result. In this work we discuss a technique called Bayesian Optimization applied to ensemble hyperparameter tuning. Bayesian Optimization is a method aimed at finding the minimum of a function over a given set, using a Gaussian Process as a prior over that function and an acquisition function to decide how to explore the space. It is a technique designed for situations in which each evaluation takes a long computational time, so that the problem has to be solved in as few iterations as possible. We will compare this method with the ones we used in the challenge with Lunatic Goats and show how Bayesian Optimization outperforms the other techniques in terms of both speed and quality of the result.
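The loop just described can be sketched in a few lines. The sketch below is a minimal, illustrative one-dimensional version in plain NumPy (function names, kernel and settings are our own, not the Spearmint implementation used later in this work): a Gaussian Process posterior is fitted to the points evaluated so far, and an Expected Improvement acquisition function picks the next point to evaluate.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(X, y, X_new, noise=1e-6):
    """GP posterior mean and std at X_new, given observations (X, y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_new)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y
    # Prior variance of the RBF kernel is 1 at every point.
    var = 1.0 - np.einsum("ij,ji->i", K_s.T @ K_inv, K_s)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    """EI for minimization: expected amount by which we beat best_y."""
    z = (best_y - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best_y - mu) * cdf + sigma * pdf

def bayesian_optimize(f, low, high, n_init=3, n_iter=12, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=n_init)       # random initial design
    y = np.array([f(x) for x in X])
    grid = np.linspace(low, high, 200)            # acquisition candidates
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()
```

Each iteration spends some computation on the surrogate model in exchange for far fewer evaluations of the expensive objective, which is exactly the trade-off that makes the method attractive for ensemble-weight tuning.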

Since the challenge was oriented toward a cold-start scenario, we used a subsample of the competition dataset in order to emulate a warm scenario, which allows us to use a greater number of algorithms.



1.1 Thesis structure

This is the structure of this thesis work:

In Chapter 2 we first describe in detail the RecSys Challenge 2017, our approach to it and the direction we took to expand that work in our thesis.

Chapter 3 contains the State of the Art of the technologies that are used in our work, such as ensemble techniques and optimization methods.

In Chapter 4 we describe the dataset used in the RecSys Challenge and the subsample we used for our work.

Chapter 5 details the implementation of our thesis work, starting from the base algorithms, then presenting the ensemble and finally the optimization methods we tried.

In Chapter 6 we present the final result of our work and our considerations on what we obtained.


Chapter 2

Problem Description

This chapter presents the problem we tackle, the competition we attended and a brief description of our approach and of our thesis work. Details about the challenge are given in the first section of this chapter, while the second section presents the approach we pursued and the development of our thesis.

2.1 RecSys Challenge 2017

Since 2010, the RecSys Challenge has been a competition hosted, for each edition of the annual ACM International Conference on Recommender Systems, by a company interested in the field.

This year's RecSys Challenge [3] was held by XING AG, the owner of the business social network XING. The challenge goal revolved around job recommendations, similarly to the 2016 edition, which was held by the same company. This year's task was dual: starting from a new, cold-start job advertisement, we had to find users that could be interested in receiving notifications about that job, while also searching for good candidates matching the requirements of that specific job. This duality, which was not present in the previous competition, forced us to focus both on the interests of users and on the requirements of each job. The second main difference from the previous year was that the challenge was structured in two phases: offline and online. During each phase of the competition we were given varying sets of users and items to target with recommendations (from now on referred to as target users and target items). In the first phase the set of users and items was fixed (further details about the dataset provided can be found in Chapter 4). In this part the focus was on developing different kinds of algorithms aiming to achieve the best possible performance on the custom evaluation metric provided by the organizers, presented in Algorithm 1. The top 25 teams of the offline challenge were invited to the online part and most of them actively participated in it. In this part, each day, each team would receive a new set of cold-start target items and the new interactions that occurred the day before. The set of target users assigned to each team instead stayed the same for the entire course of this part. Being an online environment, the main and most relevant difference from the offline part is that the recommendations were effectively pushed to real users of the XING platform. Therefore the interactions we received each day also contained the interactions relative to the push recommendations made by the teams taking part in the competition. The recommendations remained active for an entire week, during which the users could interact with them.

Algorithm 1 Evaluation metric pseudo-code (from [3]).

    score(item, users) =
        users.map(u => userSuccess(item, u)).sum + itemSuccess(item, users)

    userSuccess(item, user) = (
          if (clicked) 1 else 0
        + if (bookmarked || replied) 5 else 0
        + if (recruiter interest) 20 else 0
        - if (delete only) 10 else 0
    ) * premiumBoost(user)

    premiumBoost(user) = if (user.isPremium) 2 else 1

    itemSuccess(item, users) =
        if (users.filter(u => success(item, u) > 0).size > 0) {
            if (item.isPaid) 50 else 25
        } else 0
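For reference, Algorithm 1 translates almost line for line into Python. This is our own transcription, with hypothetical dictionary field names (`clicked`, `is_premium`, ...) standing in for the challenge's actual data format:

```python
def premium_boost(user):
    return 2 if user.get("is_premium") else 1

def user_success(interaction, user):
    points = 0
    if interaction.get("clicked"):
        points += 1
    if interaction.get("bookmarked") or interaction.get("replied"):
        points += 5
    if interaction.get("recruiter_interest"):
        points += 20
    if interaction.get("delete_only"):
        points -= 10  # the penalty for recommendations the user deleted
    return points * premium_boost(user)

def item_success(item, user_scores):
    # Bonus granted once per item if at least one user reacted positively.
    if any(s > 0 for s in user_scores):
        return 50 if item.get("is_paid") else 25
    return 0

def score(item, users_with_interactions):
    """users_with_interactions: list of (user, interaction) pairs."""
    user_scores = [user_success(i, u) for u, i in users_with_interactions]
    return sum(user_scores) + item_success(item, user_scores)
```

For example, a click by a premium user (1 x 2) plus a bookmark by a regular user (5), on a paid item (+50), gives a score of 57; a lone deletion gives -10.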

The teams were evaluated based on the score achieved over a time window of one week. The online part lasted 5 weeks and the final score obtained by each team was based on its two best-performing weeks.

The competition was open to teams from companies with interests in recommender systems and from academic institutions that wanted to prove their skills. Over 200 teams applied, of which 103 actively competed. We applied as Lunatic Goats, a team of eight students from Politecnico di Milano.

We placed first in the offline part and second in the online one, being the best academic team by far, surpassing many companies specialized in the field and placing behind only a Canadian company with expertise in cold-start recommendation problems. We were actually the only team composed of Master's degree students taking part in the challenge, making our achievements more noteworthy.

2.2 Problem approach

Since our team was composed of eight people, it was necessary to create smaller groups, each focused on a specific task.

We started by implementing different state-of-the-art algorithms (presented briefly in Chapter 3, with some insights on the implementation in Chapter 5), along with the creation of a small local validation set in order to quickly benchmark our models and, later on, to tune the models' hyperparameters to avoid overfitting.

Collaborative filtering techniques usually yield better results than Content-Based ones, but the cold-start scenario prevented us from taking advantage of them due to the lack of interactions for the target items. Therefore we focused mainly on Content-Based methods, resorting later on to a content-based micro-clustering approach [1] to be able to run collaborative filtering algorithms. The micro-clustering technique consisted in enriching the user-rating matrix with fictitious interactions for the cold-start items. This also allowed us to implement matrix factorization algorithms that, due to their collaborative nature, could not make recommendations for cold-start items. We also tested approaches that implement some of the newest algorithms focused on cold-start recommendation problems.

The dataset contained explicit negative interactions, which consist in users deleting the suggestions they received: we tried to exploit these interactions by building models that used only them. The idea behind this approach resides in the fact that, according to the RecSys Challenge evaluation metric presented in Algorithm 1, recommending an item that a user then deletes yields a large penalty in the final score. For this reason we aimed to create algorithms that tried to predict the highest number of negative interactions, in order to penalize those items and reduce the probability of recommending them.

All these algorithms were ensembled with an averaging multi-layer ensemble technique intended to take into consideration the characteristics of each model, in order to maximize performance. The ensemble hyperparameters, namely the weights of the single algorithms, were optimized with multiple optimization techniques that were trained on the test set. Figure 2.1 represents the final structure of our architecture. Refer to our paper [1] for the details on each algorithm involved in the ensemble.

The ensemble combined six different models which led, along with the layers, to nine different weights to tune. We could build the ensemble in a fairly short time since we saved the predictions of each ensembled model to file. This allowed us to avoid rerunning all the algorithms each time we performed an ensemble.
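The averaging step itself is simple once each model's predictions are saved. A minimal sketch (function names and the per-item dictionary representation are our own) could look like this:

```python
def ensemble_scores(model_scores, weights):
    """Weighted average of per-item scores from several models.

    model_scores: list of {item_id: score} dicts, one per model (e.g.
    loaded from the saved prediction files); weights: one per model.
    """
    combined = {}
    for scores, w in zip(model_scores, weights):
        for item, s in scores.items():
            combined[item] = combined.get(item, 0.0) + w * s
    total = sum(weights)
    return {item: s / total for item, s in combined.items()}

def recommend(model_scores, weights, k=10):
    """Rank items by their weighted-average score and keep the top k."""
    combined = ensemble_scores(model_scores, weights)
    return sorted(combined, key=combined.get, reverse=True)[:k]
```

With this layout, trying a new weight configuration only costs a pass over the saved score dictionaries, which is what makes automated weight optimization affordable.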



Figure 2.1: Lunatic Goats multi-layer ensemble [1]

Manually tuning the parameters is not trivial, especially when tuning weight values that are close to the optimum. We decided to resort to an automated parameter optimization method in order to find the optimum: for this task we first employed genetic algorithms and then Powell's Conjugate Direction Method, a gradient-free optimization algorithm. While genetic algorithms yielded good performance in a relatively small time window and did not require many runs, the gradient-free optimization showed slightly better results while, on the other hand, taking more time and more runs before reaching its optimum. We were able to observe that while the ensemble per se is a very good and efficient strategy to obtain good recommendations, optimizing the weights of the ensembled algorithms yields another relevant improvement to the quality of the recommendations.
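As an illustration of how such a gradient-free optimizer is driven, SciPy exposes Powell's method through a single call. Here a toy quadratic stands in for the expensive ensemble evaluation; in practice the objective would build the weighted ensemble and return the negated metric (the objective and the "optimal" weights below are entirely made up):

```python
from scipy.optimize import minimize

def objective(w):
    # Toy stand-in: in the real setting this would build the weighted
    # ensemble with weights w and return e.g. -MAP@10 on the validation set.
    return (w[0] - 0.6) ** 2 + (w[1] - 0.2) ** 2

# Powell's method needs no gradients, only function evaluations.
result = minimize(objective, x0=[1.0, 1.0], method="Powell")
best_weights = result.x  # converges near (0.6, 0.2) for this toy objective
```

Because Powell's method only ever calls the objective, it can wrap any ensemble evaluation, however non-differentiable, at the cost of the many evaluations discussed above.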

This is particularly true because a higher number of ensembled algorithms increases not only the dimensionality of the search but also the time required to compute one ensemble. Being able to efficiently optimize the weights becomes important not just in an offline scenario like a competition, but also in an online scenario where the speed of a recommendation is as important as its quality.

In this thesis we decided to focus our attention on ensemble weight optimization, not only to show the effectiveness of these methods, but also to find an optimization strategy able to quickly find good values for the weights. This will make ensemble optimization scalable and applicable in scenarios in which the speed of the computation counts greatly.

We compare the two optimization methods used during the competition, genetic algorithms and Powell's Conjugate Direction Method, with a third one, Bayesian optimization, a well-known optimization technique used mainly in machine learning.

We also shift our focus from the cold-start scenario of the competition to a more standard warm scenario. This allows us to employ collaborative filtering algorithms without the need for the augmented user-rating matrix, thereby bringing back the overall diversity in the recommendations.

We want to employ more algorithms than the ones used during the competition, along with different ensemble methods, in order to explore more possibilities. The different scenario may favor a different ensemble configuration rather than the multi-layer one, hence the need to explore other options.


Chapter 3

State of the Art

This chapter covers the state-of-the-art ensemble techniques in Recommender Systems, along with optimization techniques.

We start with an introduction to Recommender Systems in general and to the basic techniques. The second section covers what an ensemble is and why it is often used to dramatically improve performance with respect to standalone recommendation algorithms, along with the most common ensemble techniques.

In the last section we discuss what optimization techniques are used for, describing in depth the approaches we use in our work.

3.1 Recommender Systems

In Recommender Systems we aim at estimating user ratings on new, unseen items in order to recommend to each user the items they should like the most. To do this we need two fundamental matrices: a User Rating Matrix (URM) and a Similarity matrix (S). The estimated User Rating Matrix is generally computed as:

\widehat{URM} = URM \times S    (3.1)

This formula has small variations depending on the approach.

We can roughly divide the possible approaches into Content-Based Filtering [4, 5] and Collaborative Filtering [6, 7]. Both present advantages and disadvantages, which depend on the type of data available and on the problem scenario.

Content-Based Filtering exploits the features of items or users to perform recommendations and is not affected by the cold-start problem, which arises from a lack of interactions. With this approach, matrix S in formula 3.1 is computed from the Item Content Matrix (ICM), or the User Content Matrix (UCM), by applying a similarity measure to calculate the similarity between two items or users. We go over different types of similarity measures in the implementation Chapter 5.
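Formula 3.1 is simply a (typically very sparse) matrix product. A toy dict-of-dicts version makes the mechanics explicit; the representation is our own illustration, assuming an item-item similarity matrix S:

```python
def estimate_urm(urm, sim):
    """Sparse product URM x S from formula 3.1.

    urm: {user: {item: rating}}
    sim: {item: {similar_item: similarity}} (item-item S)
    Returns the estimated URM as {user: {item: estimated score}}.
    """
    estimated = {}
    for user, ratings in urm.items():
        row = {}
        for item, rating in ratings.items():
            # Each rated item spreads its rating to the items similar to it.
            for other, s in sim.get(item, {}).items():
                row[other] = row.get(other, 0.0) + rating * s
        estimated[user] = row
    return estimated
```

In practice one would use a sparse-matrix library rather than dictionaries, but the propagation of ratings through similarities is the same.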

Collaborative Filtering, on the other hand, exploits precisely the interactions between users and items to perform recommendations, but can end up performing badly if interactions are scarce. With this approach, matrix S in formula 3.1 is computed from the URM, or the URM transposed, to obtain the similarity between two users or two items, again applying a similarity measure.

In recent years Matrix Factorization [8] methods have gained a lot of interest. They are Collaborative Filtering approaches that exploit latent factor models. Latent factor models try to explain the ratings by characterizing both items and users with a number of factors inferred from the rating patterns. A good correspondence between item and user factors leads to a recommendation. In order to learn the factors, the system needs to minimize an error function, which can be, for example, the regularized squared error:

\min_{q^*, p^*} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2)    (3.2)

or the root mean squared error (RMSE):

RMSE = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2}    (3.3)

Two possible approaches to minimizing one of these error functions are stochastic gradient descent and alternating least squares (ALS).
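A minimal stochastic-gradient-descent sketch for the regularized squared error (3.2) is shown below; the toy data layout and hyperparameter values are our own, chosen only for illustration. Each observed rating r_ui pulls q_i and p_u together along the error, while lambda shrinks them:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_factorize(ratings, n_users, n_items, k=2, lr=0.05, lam=0.02,
                  epochs=300, seed=0):
    """SGD on the regularized squared error (3.2).

    ratings: list of (user, item, rating) triples.
    Returns user factors P (n_users x k) and item factors Q (n_items x k).
    """
    rnd = random.Random(seed)
    P = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(Q[i], P[u])
            for f in range(k):
                q, p = Q[i][f], P[u][f]
                # Gradient step on (3.2): error term plus lambda shrinkage.
                Q[i][f] += lr * (err * p - lam * q)
                P[u][f] += lr * (err * q - lam * p)
    return P, Q
```

ALS instead fixes one side (say all p_u) and solves exactly for the other in closed form, alternating until convergence; the objective being minimized is the same.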

None of these approaches can exploit all the available data to its full extent: for example, as we have seen, Collaborative approaches ignore the features of items and users. Therefore hybrid techniques are required. Hybrid techniques combine the results of two or more recommendation algorithms to create new, enhanced recommendations.

In general, hybrid techniques try to combine different approaches to reduce the error, but it is also possible to combine the results of different implementations of algorithms belonging to the same family [9, 10].

3.2 Ensemble techniques

Ensembling is a crucial task of machine learning. It basically consist in creating a com-plex learner using multiple simpler learners. This method usually allows reaching better

(25)

3.2. Ensemble techniques 13

predictive performance than what could be obtained from each single algorithm alone. It combines all the predictions, extracting the best characteristics of each one. In the RecSys field, ensemble techniques are widely used in all of their methods and variations. A classical example of how ensembling can be used is combining the results of a Content-Based algorithm and a Collaborative algorithm. The nature of the recommendations of these two algorithms is quite different, and so are their results. By using a method that takes both of them into consideration, the final result will contain the best suggestions of the two algorithms and will therefore be better than each one alone.

3.2.1 Voting

The voting method [11] is one of the oldest and simplest methods for combining different classifiers. It does not require any training and can be employed, for example, in situations in which classifiers have binary output. There are also scenarios in which we have thousands of weak learners, where implementing a non-trivial ensemble method can be difficult and inefficient: in these scenarios a simple method such as voting can be employed with ease and still produce good results.

Majority voting [12] works in the following way: for each user-item pair that appears in the recommendations of one of the ensembled classifiers a vote is held. If more than half of the ensembled classifiers have the same user-item pair in their recommendation then the pair will move on to the final recommendation.

Majority voting often improves the accuracy in the final result with respect to the one of the standalone classifiers.
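The rule just described can be sketched directly on top-N recommendation lists; the toy user-item pairs below are purely illustrative:

```python
# Sketch of majority voting over top-N recommendation lists: a user-item
# pair reaches the final list if more than half of the recommenders agree.
from collections import Counter

def majority_vote(recommendations):
    """recommendations: list of sets of (user, item) pairs, one per recommender."""
    votes = Counter(pair for rec in recommendations for pair in rec)
    threshold = len(recommendations) / 2
    return {pair for pair, count in votes.items() if count > threshold}

recs = [
    {(1, "a"), (1, "b"), (1, "c")},
    {(1, "a"), (1, "c"), (1, "d")},
    {(1, "b"), (1, "c"), (1, "e")},
]
final = majority_vote(recs)  # pairs voted by at least 2 of the 3 recommenders
```

With these inputs, only the pairs appearing in at least two of the three lists survive.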

We can show this with an example, presented in Table 3.1, in which we consider the case of recommending 5 items to a user. We represent each recommendation with a binary vector, where a 1 means that the item was correctly recommended, 0 otherwise.

             Recommendation    Accuracy
Input        1 0 1 1 0         60%
             1 1 1 0 0         60%
             0 1 0 1 0         40%
Ensemble     1 1 1 1 0         80%

Table 3.1: Voting example

Other than simple majority voting, we can also implement a weighted version, taking into account the performances of each single algorithm to apply a weight to its recommendations; the ensemble still works in the same way. In [13] we can see an application of voting in cluster ensembles for collaborative filtering recommendation. Once clustering is performed, different ensemble sizes are taken into consideration. For each configuration the idea is to


14 Chapter 3. State of the Art

permute cluster labels in order to obtain the largest agreement between the labels of two partitions. This means combining the binary outputs of k clusters: the final label is the one that receives the largest amount of votes. The voting approach is employed alongside two cluster based ensemble methods. Even though the voting approach ends up being the worst performing technique, it still obtains results that are close to the performances of the two cluster ensemble methods.

RecSys Application: In [12] we see an application of majority voting and weighted majority voting in a more general scenario, rather than a specific recommender-system application. The two techniques are benchmarked alongside a Naive Bayes combiner and a Recall combiner. The results are based on a varying number of ensembled classifiers and a varying number of target classes. Majority voting and weighted majority voting are the worst performing techniques, as expected, being also the simplest to implement, even though weighted majority voting can perform similarly to the Recall combiner. Overall we can say that the voting technique, considering how simple it is, still yields good performances, particularly in terms of accuracy. It therefore represents a good baseline to start from.

3.2.2 Averaging

Averaging is one of the most common and intuitive ensembling methods in machine learning [14]. It is widely used for neural networks [14, 15], but also in many other fields. The main idea is to take each algorithm's suggestion and weight it, then sum up all the results to produce the final ensembled output. This method performs quite well, since by ensembling in this way one can consider the recommendations of many different algorithms while deciding the importance of each result. This lowers the individual errors while preventing overfitting. The reason behind this result lies in some basic machine learning concepts about the bias-variance dilemma [16, 17]: if we combine multiple learning algorithms, each one with low bias and high variance, the result is likely to be an algorithm with both low bias and low variance.

In the RecSys field, averaging is applied by simply taking the recommendations of each single algorithm and performing a weighted average on the expected rankings [1]. Mathematically this means that:

r_ens(u, i) = Σ_{j=1}^{k} w_j · r_j(u, i),

where k is the number of algorithms involved, w_j is the weight of the j-th algorithm, r_j(u, i) is that algorithm's estimated ranking for User u on Item i, and r_ens(u, i) is the ranking of the ensembled output for that User-Item couple. Basically this can be seen as a variation of voting, but the two main differences are the presence of weights that give an importance to each algorithm and the fact that the numerical value of the expected


ranking is considered. Often this method comes with some normalization technique that remaps every algorithm's rankings to the same scale in order to make them comparable.
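A minimal sketch of this weighted-average ensemble, using min-max normalization as the rescaling step (the score matrices and weights below are illustrative):

```python
import numpy as np

def normalize(scores):
    """Min-max normalization, remapping scores to the [0, 1] scale."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def ensemble(score_matrices, weights):
    """Weighted average of (users x items) score arrays, one per algorithm."""
    return sum(w * normalize(s) for w, s in zip(weights, score_matrices))

r1 = np.array([[3.0, 1.0, 2.0]])   # algorithm 1 scores for one user
r2 = np.array([[0.2, 0.9, 0.1]])   # algorithm 2 scores, on a different scale
r_ens = ensemble([r1, r2], weights=[0.7, 0.3])
top_item = int(np.argmax(r_ens))   # item to recommend first
```

The normalization step matters because the two algorithms output scores on different scales; without it, the algorithm with larger raw values would dominate regardless of its weight.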

3.2.3 Stacking

Stacking was introduced by Wolpert in 1992 [18]. The key idea behind this method is to use the outputs of a set of classifiers as input for another classifier, with the goal of reducing the prediction error.

We start from a dataset D = {(f_n, y_n), n = 1, . . . , N}, where f_n are the features of the n-th sample and y_n is its class label. We then split the dataset into K folds in order to perform K-fold cross-validation. We run L learning algorithms on the K-folded dataset to produce models M_l, with l = 1, . . . , L. At the end of the training we obtain L outputs from these models, which are merged into a new dataset D_CV = {(o_1n, . . . , o_Ln, y_n), n = 1, . . . , N}, where o_ln is the output of the l-th model for sample n. We then need another learning algorithm that we run on this new dataset, from which we obtain another model M̂.

Now, for each new point to be classified, we first need to obtain the outputs of the L models and then use those as input for M̂ to obtain the final output. This technique can be applied as it is also in Recommender Systems.
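A compact sketch of this data flow, with two deliberately simple least-squares level-0 learners standing in for real recommenders and ordinary least squares as the level-1 model (all data and model choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

def model_a(X_tr, y_tr, X_te):   # level-0 learner 1: least squares, all features
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ w

def model_b(X_tr, y_tr, X_te):   # level-0 learner 2: least squares, one feature
    w, *_ = np.linalg.lstsq(X_tr[:, :1], y_tr, rcond=None)
    return X_te[:, :1] @ w

K = 5
folds = np.array_split(np.arange(len(X)), K)
oof = np.zeros((len(X), 2))      # out-of-fold predictions become D_CV
for te in folds:
    tr = np.setdiff1d(np.arange(len(X)), te)
    oof[te, 0] = model_a(X[tr], y[tr], X[te])
    oof[te, 1] = model_b(X[tr], y[tr], X[te])

w_meta, *_ = np.linalg.lstsq(oof, y, rcond=None)   # train the level-1 model

# To score a new point: get level-0 outputs, then feed them to the meta-model.
x_new = rng.standard_normal((1, 3))
level0 = np.array([model_a(X, y, x_new)[0], model_b(X, y, x_new)[0]])
y_hat = level0 @ w_meta
```

Using out-of-fold predictions to build D_CV is the crucial detail: training the meta-model on in-sample outputs would let it learn to trust an overfit level-0 model.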

RecSys Application: We can see an application in [2]. The implementation proposed in the paper, called STREAM, is not a pure implementation of stacking, because stacking alone fails to achieve the proposed goal, which is a system able to adjust itself based on the input values. For example, consider two users A and B: user A rates no more than 10 items, while B rates 100 items. A collaborative filtering model should work better for B than for A, so the weight given to the collaborative model should be higher for B than for A. In order to overcome this problem, new meta-features are added that indicate the expected quality of the predictions. This approach shows the strength of the method. Three different learning algorithms are employed: linear regression, model trees and bagged model trees. For each algorithm a stacked ensemble model is built.

The stacking ensemble outperforms by a large margin each learning algorithm and also a simple weighted average ensemble approach in which each algorithm has the same weight, as shown in Figure 3.1.

3.2.4 Bagging

The bagging ensemble method [19] consists in training the same predictor multiple times over slightly different instances of a given dataset. The outputs of the trainings are then aggregated and, using the average, we obtain the final result.


Figure 3.1: STREAM model performances comparison [2]

The different instances of the dataset are obtained through bootstrapping, which consists in sampling with replacement from the original dataset. In this way we obtain different datasets that may contain a sample repeated multiple times, or may not contain a sample that was present in the original dataset.

Another prerequisite for bagging is that the predictor is sensitive to the changes in the dataset, resulting in actually different outputs.

If the changes in the output of the single predictor are significant, bagging can improve the accuracy of the final prediction. Bagging is often used in machine learning but can find application also in Recommender Systems.
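The procedure can be sketched with a deliberately unstable weak predictor, a one-split regression stump, trained on bootstrap replicas and averaged (the data, bag count and stump learner are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(40)

def stump_predict(X_tr, y_tr, x):
    """A 1-split regression stump: threshold minimizing the squared error."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for t in X_tr:
        left, right = y_tr[X_tr <= t], y_tr[X_tr > t]
        if len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lmean, rmean = best
    return np.where(x <= t, lmean, rmean)

def bagged_predict(x, n_bags=25):
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X), len(X))   # bootstrap: sample with replacement
        preds.append(stump_predict(X[idx], y[idx], x))
    return np.mean(preds, axis=0)               # aggregate by averaging

grid = np.linspace(0, 1, 9)
smooth = bagged_predict(grid)   # averaged output is smoother than any single stump
```

Because each bootstrap replica shifts the stump's split point slightly, the averaged prediction varies gradually across the input range instead of jumping at a single threshold.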

RecSys Application: In [20] we can see an example of bagging decision trees in a blending-predictions problem. Blending is a regression problem and therefore benefits from smooth output functions, but simple decision trees have discretized outputs, which is a huge limit in this scenario. Bagging decision trees improves the overall accuracy of the predictions and, in combination with another technique, provided very good results when minimizing the Root Mean Squared Error (RMSE).

3.2.5 Boosting

Boosting is based on the idea that multiple weak learners can create a single strong learner [21, 22]. The algorithm consists in repeatedly training a set of weak classifiers and adding them to a stronger one. Each time a new algorithm joins the ensemble, the data are re-weighted: the importance of the misclassified samples is increased, while the correctly classified ones are penalized. This brings future learners to focus more on data that are not yet well classified, making them "experts" on that particular


set of instances [22]. The main difference between boosting algorithms is the way they weight the training data. Some of the best known and most used are:

• AdaBoost [23]: short for Adaptive Boosting, it was formulated by Yoav Freund and Robert Schapire and was the first algorithm that could adapt to the weak learners. It uses decision trees as weak classifiers. It is quite sensitive to noisy data and outliers and needs weak learners that perform better than random guessing.

• LPBoost [24]: stands for Linear Programming Boosting. Belonging to the class of margin-maximizing classification algorithms, it maximizes the margin between training samples of different classes. It uses linear programming to optimize the weight vector. Differently from AdaBoost, it has the totally-corrective property, meaning that it adjusts all the weights at every iteration.

• XGBoost [25]: short for eXtreme Gradient Boosting and first introduced by Tianqi Chen, it gained a lot of fame especially on Kaggle, a popular site for predictive-modeling and analytics competitions. It uses decision trees and performs Gradient Boosting [26].

• BrownBoost [27]: introduced by Yoav Freund as the counterpart of AdaBoost for learning on noisy datasets. BrownBoost is able to recognize noisy samples and to give up on classifying them. In contrast with AdaBoost and other boosting algorithms, it uses a non-convex potential loss function and solves a system of two equations in two unknowns.

RecSys Application: Boosting is widely used in the RecSys field [28]. XGBoost is almost a standard in competitions, especially those focused on content algorithms like the Recommender Systems Challenge 2017, to the point that it is used as a baseline [1]. Other variants of boosting are used to improve a single algorithm's recommendations in several works, such as [29], where boosting is applied to a collaborative filtering algorithm on implicit feedback, or [30], where it is performed over a matrix factorization recommender.
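The re-weighting idea can be sketched in AdaBoost style, with one-feature threshold stumps as weak learners; the toy dataset and number of rounds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # toy labels, diagonal boundary

def best_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) with lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for t in X[:, f]:
            for s in (1, -1):
                pred = s * np.where(X[:, f] > t, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, t, s)
    return best

w = np.ones(len(X)) / len(X)                     # uniform initial weights
learners = []
for _ in range(10):
    err, f, t, s = best_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    pred = s * np.where(X[:, f] > t, 1, -1)
    w *= np.exp(-alpha * y * pred)               # up-weight the misclassified
    w /= w.sum()
    learners.append((alpha, f, t, s))

def ensemble_predict(X):
    score = sum(a * s * np.where(X[:, f] > t, 1, -1) for a, f, t, s in learners)
    return np.where(score > 0, 1, -1)

train_acc = (ensemble_predict(X) == y).mean()
```

A single axis-aligned stump cannot represent the diagonal boundary, but the weighted combination of ten stumps, each focused on the samples the previous ones got wrong, approximates it well.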

3.2.6 Multi-layer ensemble

Multi-layer ensemble can be considered a variation of ensembles like Voting or Averaging. It consists, as its name suggests, in applying ensemble techniques on different levels instead of just one. The idea is to build a sort of tree, where the leaves are the single algorithms and the root is the final output. Each algorithm is ensembled with its siblings to create partial results, which will be ensembled in the next layer, and so on. The advantage of this method is that, by grouping the algorithms to be ensembled, we can fully extract their characteristics. We also have the possibility to use different


Figure 3.2: PumpkinPie multi-layer ensemble.

types of ensemble techniques, so that one could decide to use, for instance, Averaging on the first layer and then Voting on the last one. This essentially introduces a whole new degree of freedom when creating the ensemble, because we now have the possibility to try many different structures and methods to increase the final performance. So, talking about Recommender Systems, one could for instance ensemble some Collaborative Filtering results together and, on another branch, some Content-Based algorithms; then in the next layer the two partial outputs are ensembled to create the final result. Obviously the tree can be more complex and have more than two layers. The main problem with this method is that it is not guaranteed to be better than a regular ensemble: there could be datasets where using a multi-layer ensemble leads to better performances and others where the results will be poorer. One example where this method turned out to be useful is the RecSys Challenge 2017, where our team Lunatic Goats used it in its best submission, ensembling separately Content-Based, Collaborative Filtering and negative Content-Based algorithms. Also PumpkinPie, the PoliMi team of the 2016 Challenge, used this type of ensemble structure [31].
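A minimal sketch of such a two-layer tree, averaging within each branch and again at the root (all scores and weights are invented for illustration):

```python
import numpy as np

def average(branch_scores, weights):
    """Weighted average, usable as the ensembling step at any layer of the tree."""
    return sum(w * s for w, s in zip(weights, branch_scores))

cf1 = np.array([0.9, 0.1, 0.4])    # two collaborative algorithms (leaves)
cf2 = np.array([0.7, 0.3, 0.2])
cb1 = np.array([0.2, 0.8, 0.5])    # two content-based algorithms (leaves)
cb2 = np.array([0.1, 0.9, 0.3])

cf_branch = average([cf1, cf2], [0.5, 0.5])          # first layer, CF branch
cb_branch = average([cb1, cb2], [0.5, 0.5])          # first layer, CB branch
root = average([cf_branch, cb_branch], [0.6, 0.4])   # second layer (root)
```

Nothing forces both layers to use the same technique: the root step could equally be replaced by voting over the two branch outputs.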

3.3 Optimization techniques

Optimization [32] is a mathematical discipline that revolves around finding the maximum or the minimum of a function.


In Recommender Systems, optimization is mostly used in algorithms that apply Machine Learning techniques, where it is needed to find the best values for the hyperparameters of the model. It can also be used in standard models to find, for example, the best value of the shrink parameter.

3.3.1 Bayesian Optimization

In Machine Learning algorithms there are often many hyperparameters that require tuning in order to reach good performances. Having an automated approach that is able to optimize the performances is crucial to minimize the number of training iterations needed to reach the best result. Bayesian Optimization [33, 34] offers very good performances compared to other state-of-the-art optimization techniques.

Bayesian Optimization tries to find the minimum of a function f(x) on some bounded set X. It constructs a probabilistic model for f(x) and exploits this model to decide where in X to evaluate the function next.

In Bayesian Optimization we need to make two important choices. The first is the prior over functions, for which a common choice is a Gaussian Process prior. The second is the acquisition function, which is needed to select the next point to evaluate.

Bayesian Optimization is particularly efficient because it is able to effectively balance exploration and exploitation.

A major limitation of the most commonly used form of Gaussian process regression is the assumption of stationarity, which means that the covariance between two outputs is invariant to translations in input space [35]. Although this simplifies the regression task, it may hurt the ability of the GP to efficiently model non-stationary functions, and this limitation carries over to Bayesian Optimization. For example, when optimizing the hyperparameters of a machine learning algorithm, the objective function may have a short length scale close to the optimum, while having a long length scale far away from it: we can indeed expect that bad hyperparameters yield bad performances everywhere, while small tweaks of good hyperparameters have notable effects on the generalization performance.

This problem can be effectively overcome by learning a bijective warping of the inputs that removes major non-stationary effects. This is possible by projecting each dimension of the input through a Beta distribution, while marginalizing over the shape of the warping. Bayesian Optimization can also be used in optimization problems in which the constraints are not known a priori [36]. Let us consider a generic constrained optimization problem:

min_x c(x)  s.t.  ρ(x) ≥ 1 − ϵ.    (3.4)

In Bayesian Optimization the constraint functions are in general unknown because they have not been observed anywhere and the observations may be noisy.


Due to this uncertainty, it is impossible to be certain that the constraint ρ(x) ≥ 1 − ϵ of Equation 3.4 is satisfied for any x. The problem therefore becomes a stochastic programming one. Usually the goal is to minimize the objective function in expectation, while satisfying the constraints with high probability; this last condition is called a probabilistic constraint. Let f(x) represent the objective function and C(x) represent the constraint condition. The probabilistic constraint assumes the form:

Pr(C_k(x)) ≥ 1 − δ_k,    (3.5)

where 1 − δ is the minimum user confidence. The k notation is needed in case there are K constraints and not only one. All constraints must be satisfied.

At this point the general constrained Bayesian Optimization problem can be formalized as:

min_x E[f(x)]  s.t.  ∀k  Pr(C_k(x)) ≥ 1 − δ_k.    (3.6)

We now briefly go over Gaussian Processes and Acquisition functions needed by the Bayesian Optimization.

Gaussian process

A Gaussian process is defined as a probability distribution over functions f(x) such that the set of values of f(x) evaluated at an arbitrary set of points x_1, . . . , x_n jointly has a Gaussian distribution.

A more detailed overview of Gaussian processes is available at [37].
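A minimal sketch of GP regression with a squared-exponential kernel, computing the predictive mean and variance that the acquisition functions rely on (the kernel length scale, noise level and data points are illustrative):

```python
import numpy as np

def sq_exp_kernel(A, B, length=0.5):
    """Squared-exponential (stationary) covariance between 1-D point sets."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X_tr, y_tr, X_te, noise=1e-4):
    K = sq_exp_kernel(X_tr, X_tr) + noise * np.eye(len(X_tr))
    K_s = sq_exp_kernel(X_tr, X_te)
    K_ss = sq_exp_kernel(X_te, X_te)
    alpha = np.linalg.solve(K, y_tr)
    mu = K_s.T @ alpha                       # predictive mean
    v = np.linalg.solve(K, K_s)
    var = np.diag(K_ss - K_s.T @ v)          # predictive variance
    return mu, var

X_tr = np.array([0.0, 0.5, 1.0])
y_tr = np.sin(X_tr)
mu, var = gp_posterior(X_tr, y_tr, np.array([0.5, 0.25]))
```

At an observed point the posterior mean interpolates the data and the variance collapses toward the noise level, while between observations the variance grows, which is exactly the uncertainty the acquisition functions exploit.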

Acquisition Function

We can denote an acquisition function as a : X → R⁺. It determines which point in X should be evaluated next via a proxy maximization x_next = argmax_x a(x). Under a Gaussian process (GP) prior, these functions depend on the model through its predictive mean function µ(x; {x_n, y_n}, θ) and its predictive variance function σ²(x; {x_n, y_n}, θ).

We now show three different acquisition functions that exploit different strategies.

Probability of Improvement Maximize the probability of improving over the current best value. Under the GP this can be computed as:

a_PI(x; {x_n, y_n}, θ) = Φ(γ(x)),    γ(x) = (f(x_best) − µ(x; {x_n, y_n}, θ)) / σ(x; {x_n, y_n}, θ).    (3.7)

Expected Improvement Maximize the expected improvement (EI) over the current best. Under the GP it has the closed form:

a_EI(x; {x_n, y_n}, θ) = σ(x; {x_n, y_n}, θ) (γ(x) Φ(γ(x)) + φ(γ(x))),    (3.8)

where φ is the standard normal density and γ(x) is defined as in Equation 3.7.


GP Upper Confidence Bound Exploit lower confidence bounds (upper in case of maximization) in order to minimize regret over the course of the optimization:

a_LCB(x; {x_n, y_n}, θ) = µ(x; {x_n, y_n}, θ) − κ σ(x; {x_n, y_n}, θ),    (3.9)

where κ is tunable to balance exploitation against exploration.
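As an illustration, the Expected Improvement acquisition for minimization can be written directly from the predictive mean and standard deviation of the surrogate; Phi and phi below are the standard normal CDF and PDF, and the numeric inputs are invented:

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi(z):
    """Standard normal probability density function."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def expected_improvement(mu, sigma, f_best):
    """EI for minimization, given the surrogate's mean/std at a candidate."""
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * Phi(gamma) + phi(gamma))

# A candidate with lower predicted mean (and the same uncertainty) scores higher:
ei_promising = expected_improvement(mu=0.2, sigma=0.3, f_best=0.5)
ei_poor = expected_improvement(mu=0.9, sigma=0.3, f_best=0.5)
```

Note that EI never drops to zero while sigma is positive, which is how the criterion keeps some exploration pressure even on unpromising regions.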

3.3.2 Genetic Algorithms

Genetic algorithms are a technique used for optimization and search problems that is inspired by biological processes such as natural selection, gene mutation and crossover [38, 39, 40, 41, 42]. They were theorized by John Holland in the '70s and then implemented and developed in the following ten years. They clearly refer to Charles Darwin's concept of evolution and Gregor Mendel's studies in biology. The goal is to simulate the evolution process of a given number of individuals. Using a concept called survival of the fittest, they analyze a population of individuals that reproduces and evolves, transmitting to the descendants only the best traits while discarding the worst ones.

They were used in a lot of different fields such as medicine [43, 44], cryptography [45], climatology [46], anti-terrorism [47] and economics [48].

In the Recommender Systems field, GAs have been used for several tasks, such as improving Collaborative Filtering [49] and Content-Based Filtering [50], where Genetic Algorithms are used to extract weights to average the values calculated by the original algorithm; it is demonstrated that this leads to a significant improvement in the algorithm's performance. Also, as we will see later, they are used in ensemble-weight optimization [1], which is a kind of problem that perfectly fits their characteristics. The main idea behind Genetic Algorithms is the concept of chromosome. In biology, the chromosome is composed of genes that contain all the information about the living being they belong to. When an individual reproduces, its children inherit some traits from it and some from the other parent, mixed randomly and possibly mutated. As the theory of evolution explains, the new individuals that inherit the best traits will be more likely to survive, giving birth to other children and conserving those best genes.

This is the process that is reproduced in Genetic Algorithms: chromosomes contain a set of values (genes) that represent a possible solution for the given problem. A fitness function, the equivalent of Nature's probability of survival, quantifies the goodness of an individual; the parent chromosomes then give birth to another population that inherits the best traits of the old one. The basic algorithm is the following [39, 41, 42]:

1. Start: Generate a random population of n chromosomes

2. Fitness: Evaluate the fitness of each chromosome


3. Test: If present, test the End Condition on the evaluated population. If satisfied return the current population or the best chromosome and terminate, else go to Step 4

4. New population: Create a new population by repeating the following steps until the population is complete:

• Selection: Select two parents from the previous population according to a selection method

• Crossover: With a crossover probability perform crossover on the two parents to create two new chromosomes

• Mutation: With a mutation probability perform mutation on each new chromosome

• Accepting: Place the new chromosomes in the new population

5. Replacing: Replace the old population with the new one

6. Loop: Go to Step 2
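The loop above can be sketched on a toy fitness function; the tournament used here is the deterministic best-of-k variant, and the population size, probabilities and target vector are illustrative choices, not values from this work:

```python
import random

random.seed(0)

def fitness(chromosome):
    # Toy fitness: negative squared distance from a target vector (maximized).
    target = [0.3, 0.7, 0.5]
    return -sum((g - t) ** 2 for g, t in zip(chromosome, target))

def tournament(population, k=3):
    """Deterministic best-of-k tournament selection."""
    return max(random.sample(population, k), key=fitness)

def crossover(a, b, p_cross=0.8):
    if random.random() < p_cross:
        point = random.randrange(1, len(a))      # single-point crossover
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]

def mutate(c, p_mut=0.1):
    # Direct-value mutation: slightly perturb one random gene.
    if random.random() < p_mut:
        i = random.randrange(len(c))
        c[i] += random.uniform(-0.1, 0.1)
    return c

def evolve(pop_size=30, genes=3, generations=50, elite=2):
    population = [[random.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        new_pop = [c[:] for c in population[:elite]]     # elitism
        while len(new_pop) < pop_size:
            c1, c2 = crossover(tournament(population), tournament(population))
            new_pop += [mutate(c1), mutate(c2)]
        population = new_pop[:pop_size]                  # replacing
    return max(population, key=fitness)

best = evolve()
```

Elitism guarantees the best fitness never decreases between generations, while crossover and mutation keep exploring new combinations of genes.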

The creation and implementation of a GA for optimization basically consists in choosing the best options to represent the problem. We have to decide the encoding of the chromosome, namely how our parameters will be represented; then we pick a way to choose the parent chromosomes (selection) and the types of crossover and mutation with their respective probabilities.

Encoding

The encoding is the way the parameters are represented in a chromosome. There are three standard types of encoding [51, 42]:

• Binary: the simplest encoding, each gene is a 0 or a 1. It is not natural for many problems

• Permutation: used in ordering problems. Each gene is a number that represents its position in the solution

• Direct value: each gene is the real value it represents. It’s the most complex but also the most accurate method


Selection

Selection is the way the two parent chromosomes are chosen to give birth to the next generation. All the methods are probability-based, according to each chromosome’s fitness. The most common selection methods are [51, 42]:

• Roulette: parents are chosen with probability proportional to their fitness. The higher the fitness, the higher the probability

• Ranking: parents are ranked according to their fitness, and the probability is proportional to the rank

• Tournament: k chromosomes are chosen to take part in the tournament. The chromosome with the highest fitness is chosen with probability p, the second with p(1 − p), the third with p(1 − p)², and so on, down to the last one, which is chosen with probability (1 − p)^(k−1)

Elitism

The elitism concept consists in carrying the best chromosomes over to the next population, keeping them unaltered. This allows preserving possibly outperforming solutions without the risk of having them contaminated by mutation or crossover. Using this technique improves and accelerates the search of most Genetic Algorithms [52].

Crossover

When the two parents have been chosen, they give birth to two children. Each couple of chromosomes always generates two new individuals, and there is a probability that crossover is performed during the process. If no crossover takes place, the new chromosomes are identical to their parents. Otherwise, the children inherit characteristics from both chromosomes of the previous generation, according to the chosen crossover method. These are the most used types [51]:

• Single point: a crossover point is randomly chosen on the chromosome. The two children will be the copy of a parent until that point and the copy of the other beyond the point.

• Two point: similar to the single point, but in this case two points are selected. Beyond the second point the children are copied from the first parent again


Mutation

When a new chromosome is created, there is a given probability that a mutation will take place. A mutation is a random variation on one or two genes of a chromosome. Mutation types are usually related to the chosen encoding of the algorithm [51]:

• Binary: mutation on binary encoding. Two random genes are swapped

• Permutation: similar to binary mutation, two genes are swapped to change the ordering

• Direct value: a gene is slightly increased or decreased randomly

3.3.3 Powell’s Conjugate Direction Method

Powell's conjugate direction method aims at minimizing a function over several dimensions without losing the result achieved along one direction when moving to the next one. Before exploring this method in depth, we first need to introduce how multi-dimensional minimization problems are usually solved. In a generic minimization problem we start at a point P in an N-dimensional space and proceed in a vector direction n; any function of N variables f(P) can be minimized along n. Different methods differ in how the next direction is chosen. Often we resort to computing the gradient of the function in order to proceed with the minimization, but there may be cases in which it is not possible to compute the gradient, so we need an alternative solution. It is also important to underline that, when selecting the next direction to minimize along, the minimization should not interfere with the already completed minimizations.

The concept of non-interfering directions is also called conjugate directions. We cover the gradient formulation and then move on to the gradient-free solution proposed by M.J.D. Powell. It is important to note that if we minimize a function along some direction v, then the gradient of the function must be perpendicular to v; if this is not the case, there is still a nonzero directional derivative along v. Let us now consider a point P as the origin of the coordinate system with coordinates x. Any function f can be approximated by its Taylor series:

f(x) = f(P) + Σ_i (∂f/∂x_i) x_i + (1/2) Σ_{i,j} (∂²f/∂x_i ∂x_j) x_i x_j + · · · ≈ c − b · x + (1/2) x · A · x    (3.10)

where:

c ≡ f(P),    b ≡ −∇f|_P,    [A]_ij ≡ ∂²f/∂x_i ∂x_j |_P.    (3.11)

The matrix A, whose components are the second partial derivatives of the function, is called the Hessian matrix.


We can compute the gradient of f in Equation 3.10 as:

∇f = A · x − b.    (3.12)

The gradient ∇f changes when we move along a given direction:

δ(∇f) = A · (δx).    (3.13)

If we now consider that we moved along a direction u to a minimum, we should now move along a new direction v. We have to be sure that the gradient stays perpendicular to u and we do not lose the minimization done along u. This resolves to:

0 = u · δ(∇f) = u · A · v.    (3.14)

When this holds for two vectors u and v, they are said to be conjugate.

Powell in [53] devised a direction-set method that produces N mutually conjugate directions. It requires initializing the set of directions u_i to the basis vectors, u_i = e_i, i = 1, . . . , N. The algorithm consists in repeating the following sequence of steps until the function stops decreasing:

• Save the starting position as P_0.

• For i = 1, . . . , N, move P_{i−1} to the minimum along direction u_i and call this point P_i.

• For i = 1, . . . , N − 1, set u_i ← u_{i+1}.

• Set u_N ← P_N − P_0.

• Move P_N to the minimum along direction u_N and call this point P_0.

As we can see, this method does not use any gradient information, which means that it can be applied in all those cases in which the function is non-differentiable. This will indeed come in handy when we apply the method to our optimization problem.
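The steps above can be sketched with a simple golden-section line search standing in for the more careful bracketing of full implementations (the search bounds, iteration counts and the test function are illustrative):

```python
def line_min(f, p, u, lo=-10.0, hi=10.0, tol=1e-8):
    """Minimize g(a) = f(p + a*u) over a in [lo, hi] by golden-section search."""
    phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f([pi + c * ui for pi, ui in zip(p, u)]) < f([pi + d * ui for pi, ui in zip(p, u)]):
            b, d = d, c
            c = b - phi * (b - a)
        else:
            a, c = c, d
            d = a + phi * (b - a)
    alpha = (a + b) / 2
    return [pi + alpha * ui for pi, ui in zip(p, u)]

def powell(f, x0, iters=20):
    n = len(x0)
    dirs = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]  # u_i = e_i
    p0 = list(x0)
    for _ in range(iters):
        p = list(p0)
        for u in dirs:
            p = line_min(f, p, u)                           # minimize along each u_i
        new_dir = [pi - p0i for pi, p0i in zip(p, p0)]      # u_N <- P_N - P_0
        dirs = dirs[1:] + [new_dir]                         # discard u_1, append u_N
        p0 = line_min(f, p, new_dir)                        # move along the new direction
    return p0

# A quadratic with correlated variables, which defeats naive coordinate descent:
f = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2 + 0.5 * (x[0] - 1) * (x[1] - 2)
xmin = powell(f, [0.0, 0.0])
```

No gradient of f is ever computed: the method only needs function evaluations, which is what makes it applicable to non-differentiable objectives.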


Chapter 4

Problem domain and dataset description

This chapter contains a description of the scenario we worked in. Our work is developed starting from the algorithms we produced for the RecSys Challenge 2017 with the Lunatic Goats team. The competition was organized by the business social network XING, so it was focused on job recommendations. Our thesis deals with the ensemble part of the Lunatic Goats' recommender system and proposes a way to optimize it.

4.1 Domain description

The domain of our work is job recommendation. This is one of the most common subjects in the recommender systems field and consists of linking Users, the people who are registered to the social network, to Items, the job offers published by the companies.

In recent years, job-oriented social networks have gained increasing importance for recruiting and job seeking. This brought new developments to the field, with social networks trying to provide the best possible system to their customers. Recommender systems are one of the technologies employed to accomplish this goal: people are suggested jobs that are considered good for them, in order to increase the probability of finding the right employment for every person. Obviously, the better the recommender system, the more people will like the suggested jobs and the more the social network will increase customer satisfaction. For this reason it is easy to understand the importance of applying RecSys in this field, and why several social networks are increasing their efforts in developing a good system.


4.2 Dataset

In this section we explain the two datasets related to our work. The first is the dataset provided by the organizers of the competition. It was obviously designed for a long-term challenge where the competitors had time and resources to improve their solutions, so the amount of data was quite large. For this reason, and because the competition was focused on cold-start items, we decided to create a subset to work on, reducing the dimension of the dataset and removing the cold-start scenario.

4.2.1 Challenge Dataset

The goal of the competition was to find 100 users to whom each of the target items should be recommended. This is quite an unusual choice, since most job recommendation challenges focus on finding new items for the users, not vice versa. The focus of this year's challenge was dealing with cold-start jobs, meaning that we were given items that had no interactions in the train set. This obviously changes the problem approach, directing most of the effort towards Content-Based algorithms. The dataset provided by XING was a standard dataset for recommender systems, composed of Items, Users and Interactions, whose details are explained in the following paragraphs. These were the cardinalities of XING's dataset:

                         Cardinality
Items                    1306054
Users                    1497020
Train set Interactions   8274901
Target Items             46559
Target Users             74840
URM sparsity             0.0000042322


Item profile

The items are the job offers provided by the companies to the social network.

Feature        Description
item id        Anonymized Item ID
career level   Career level of the job
discipline id  ID that identifies the role the job is offering
industry id    ID that identifies the field of the job
country        Country where the job is located
region         Region where the job is located
latitude       Latitude where the job is located
longitude      Longitude where the job is located
employment     Type of employment
title          Concepts that have been extracted from the title of the job
tags           Concepts that have been extracted from the tags, skills or company name
created at     Timestamp that identifies the moment the job was created at


User profile

The users are the people registered on XING who are looking for job offers.

Feature                      Description
user id                      ID of the User
jobroles                     IDs of the job roles extracted from the User's profile
career level                 Career level of the User
discipline id                ID that identifies the role of the User
industry id                  ID that identifies the field where the User works
country                      The country where the User is currently working
region                       The region where the User is currently working
experience n entry class     Number of CV entries that the user has listed as work experience
experience years experience  Estimated number of years of the User's work experience
experience years in current  Estimated number of years the User has been working in his current job
edu degree                   Estimated university degree of the User
edu fieldofstudies           List of fields that the User studied
wtcj                         User's willingness to change job


Interactions

The last table shows the features of the interactions, which essentially form the train set. It describes the links between items and users and the type of each connection. The feature interaction type specifies the action the user performed on the item. It can be one of these five types:

1. The user clicked on the item
2. The user bookmarked the item
3. The user replied to the job offer
4. The user deleted a recommendation (negative interaction)
5. A recruiter from the item's company showed interest in the user

Feature           Description
user id           ID of the user
item id           ID of the item
interaction type  A number that identifies the type of the interaction
created at        Timestamp that identifies the moment the interaction was created

Table 4.4: Interaction features

4.2.2 Our dataset

The dataset we used is a subsample of the RecSys Challenge 2017 original dataset provided by XING. While creating our subset we had two goals: first, we needed a warm-start scenario, meaning that each target item had to have at least one interaction in the train set; second, we wanted to reduce the sparsity of the URM. To achieve these results we created our subset by selecting only those items that had at least 15 clicks; we then took the interactions with those items and the users who clicked them. Subsequently, we created the set of target items by selecting only those items that had at least 10 positive clicks and 4 negative clicks, taking their interactions as test set and the users they interacted with as target users. To complete the split process we created a validation set from the train set, to avoid overfitting on the test set. Notice that in both test sets the items have both positive and negative interactions. This choice is motivated by the intention to create a dataset as similar as possible to the competition dataset. Also, as we will see later, the presence of negative interactions is crucial to some of the algorithms we implemented. We discarded type 5 interactions, which had almost no influence, to reduce complexity. The following table shows the cardinalities of our dataset:

                        Test split   Validation split
Items                        70402              70402
Users                       164765             164765
Train set Interactions     5294113            5256970
Test set Interactions        73365              37143
Target Items                 10000               5000
Target Users                 32936              19660
URM sparsity              0.000271           0.000268

Table 4.5: Our dataset
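The filtering steps described above can be sketched as follows. This is an illustrative sketch on toy data, not the actual pipeline: the interaction tuples, the lowered click threshold and all variable names are assumptions.

```python
from collections import Counter

# Toy interactions as (user_id, item_id, interaction_type) tuples.
# Types 1-3 are positive, 4 is negative, 5 (recruiter interest) is discarded.
interactions = [
    (1, 10, 1), (2, 10, 1), (3, 10, 4), (1, 11, 1),
    (2, 10, 5),  # type-5 interaction, dropped below
]

MIN_CLICKS = 2  # the thesis uses 15; lowered here to fit the toy data

# Discard type-5 interactions, then keep only warm items, i.e. items
# that received at least MIN_CLICKS interactions.
kept = [(u, i, t) for (u, i, t) in interactions if t != 5]
clicks_per_item = Counter(i for (_, i, _) in kept)
warm_items = {i for i, n in clicks_per_item.items() if n >= MIN_CLICKS}

# The subset keeps the interactions with warm items and the users involved.
subset = [(u, i, t) for (u, i, t) in kept if i in warm_items]
subset_users = {u for (u, _, _) in subset}
```

The target-item selection (at least 10 positive and 4 negative clicks) and the train/validation split follow the same counting pattern with different thresholds.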

Feature aggregation

One of the main achievements we obtained as Lunatic Goats in XING's competition was the performance of our Content-Based algorithms. Feature aggregation played an important role in the algorithms' efficiency, so we decided to use it again in our thesis work. We extended the item and user profiles with new features that were simply the logical conjunction of the original ones. This is how we applied it:

Items: We added 20 new features created with feature aggregation. These features were simply the conjunction of 2 or 3 original features. We used almost all of them, excluding tags and title.

Users: We created 8 new features combining 2 or 3 original features together. In this case we used only geographical features (such as Country, Region, Latitude and Longitude) and job-related features (Discipline ID and Industry ID), because we needed better localization information about people.
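As a minimal sketch of this kind of feature aggregation, the snippet below conjoins pairs and triples of categorical features. The profile layout, the feature names and the naming scheme for the new features are illustrative assumptions, not the actual implementation.

```python
from itertools import combinations

def aggregate_features(profile, sizes=(2, 3)):
    """Create new features as the conjunction of 2 or 3 original ones."""
    aggregated = {}
    for size in sizes:
        for combo in combinations(sorted(profile), size):
            # The new feature's value is the tuple of original values,
            # encoded here as a joined string.
            aggregated["_x_".join(combo)] = "_".join(profile[f] for f in combo)
    return aggregated

# Toy item profile with categorical features only (tags and title excluded).
item = {"country": "de", "region": "7", "discipline_id": "12"}
new_features = aggregate_features(item)
# new_features["country_x_region"] == "de_7"
```

A combined feature such as country + region is only matched when both original values coincide, which is what gives the aggregated features their better localization power.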


Chapter 5

Implementation

This chapter contains the explanation of all the algorithms, methods and techniques we used in our thesis work.

Since the task we focus on is the optimization of the ensemble's parameters, first of all we have to build the ensemble itself. This means that we have to implement different kinds of algorithms, which are then unified following the ensemble policy.

As already mentioned, a significant part of the work derives from what Lunatic Goats did for RecSys Challenge 2017, especially most of the basic algorithms that were involved in the ensemble, part of the structure of the ensemble itself and some of the hyperparameter optimization techniques.

The first section covers the implementation of the evaluation metrics, followed by a section that goes over the different similarity metrics employed in the algorithms. The last section covers the implementation of each base algorithm, the ensemble technique, and how the optimization techniques are applied.

5.1 Evaluation metrics

In this section we discuss the metrics we use to evaluate the performance of our work. The selection of the metrics is biased toward how many items present in the ground truth we are able to correctly predict.

5.1.1 Confusion Matrix

In order to fully understand the metrics we present, we first need to introduce the concept of Confusion Matrix (Table 5.1), along with True Positive, True Negative, False Positive and False Negative.


                                    Ground truth
                           Actual Positive   Actual Negative
Prediction
  Predicted Positive             TP                FP
  Predicted Negative             FN                TN

Table 5.1: Confusion Matrix.

In information retrieval these indices refer to:

• True Positive: the prediction classifies an item as positive and it is positive inside the ground truth.

• True Negative: the prediction classifies an item as negative and it is negative inside the ground truth.

• False Positive: the prediction classifies an item as positive but it is negative inside the ground truth.

• False Negative: the prediction classifies an item as negative but it is positive inside the ground truth.

In Recommender Systems we lose the concept of Predicted Negative, since the prediction of a given algorithm suggests items that the user should like, which means only Predicted Positive items. This results in different formulations of the metrics with respect to the standard formulas used in information retrieval.
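For a recommendation list the remaining counts reduce to simple set operations. The toy recommendation list and ground truth below are illustrative.

```python
recommended = [3, 1, 7, 9]   # predicted positives: the recommended items
ground_truth = {1, 2, 9}     # items the user actually interacted with

tp = len(set(recommended) & ground_truth)  # recommended and relevant
fp = len(set(recommended) - ground_truth)  # recommended but not relevant
fn = len(ground_truth - set(recommended))  # relevant but never recommended
# tp == 2, fp == 2, fn == 1; TN is undefined, since only Predicted
# Positive items exist in a recommendation list.
```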

5.1.2 Precision

In information retrieval, precision corresponds to how many retrieved documents are relevant with respect to all the retrieved documents.

Precision = TP / (TP + FP)    (5.1)

In Recommender Systems this formula can be seen as the number of relevant items recommended divided by the total number of recommended items.

Precision = #relevant items / |prediction|    (5.2)

There is also the possibility of evaluating precision at a cut-off k. This means that, instead of taking into account the entire recommendation list, we consider only how many relevant items are present in the top-k predicted items, and we divide this value by k.

Precision@k = #relevant items@k / k
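The cut-off formula translates directly into code. This is a sketch assuming the recommendation is an ordered list and relevance means membership in a ground-truth set; the function name and toy data are illustrative.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Two of the top-3 recommended items appear in the ground truth.
p = precision_at_k([5, 2, 8, 1], {2, 5, 9}, k=3)  # 2/3
```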

