Politecnico di Milano
School of Industrial and Information Engineering
Master of Science in Computer Science and Engineering Dipartimento di Elettronica, Informazione e BioingegneriaA novel graph-based model for hybrid
recommendations in cold-start scenarios
Supervisor: Prof. Paolo Cremonesi
Co-Supervisor: Dott. Maurizio Ferrari Dacrema
Master Thesis by: Cesare Bernardis, 852509
Abstract
Recommender Systems are tools and techniques aimed to recommend items to users. With the constant growth of Web based applications, in recent years they have become increasingly popular, since they can be applied in a wide variety of domains, from movies to books and much more. In order to make recommendations, they usually try to predict preferences or ratings that users would give to items by exploit-ing available information, that can be of two main types: content or collaborative.
The first refers to the information about the items or the users them-selves and their characteristics, called features, like the genre of a film or the age of a user, and forms a set of recommendation techniques defined as content-based filtering approaches.
The second refers to past interactions of users and forms another set of techniques defined as collaborative filtering approaches. It is well known in recommender system literature that collaborative approaches perform better in most situations.
However, they can not be used when previous interactions are not avail-able, like, for example, when new items are added to the catalogue and they have to be recommended. This type of situation is also called cold-start scenario and in similar cases only content-based approaches can be adopted.
For this reason, some techniques to optimize performance of this type of approaches have been studied in recent past. One of them is called feature weighting, which assigns to every feature a real value, called weight, that estimates its importance.
Some statistical techniques for feature weighting commonly used in In-formation Retrieval, like TF-IDF, have been adapted for Recommender Systems, but they did not provide relevant performance improvements in many situations. Late approaches perform weights estimation by
ex-tracting collaborative information using machine learning techniques, which means that they evaluate the importance of a feature on the base of other users opinions.
This type of models have promising results compared to classic statis-tical analyzes. For this reason, in this thesis we propose a novel graph, feature-based machine learning model to face the cold-start item sce-nario, learning the relevance of features from probabilities of item-based collaborative filtering algorithms.
We will describe the new mathematical model and some implementa-tion details that have been adopted. Then we will test it in different cold-start situations and we will compare its performance with other state-of-the-art approaches.
Sommario
I Sistemi di Raccomandazione sono un insieme di strumenti e tecniche utilizzati per raccomandare oggetti agli utenti. Grazie alla diffusione in costante crescita delle applicazioni Web, nel recente passato sono diventati sempre pi`u popolari. Essi infatti possono essere applicati in una vasta variet`a di situazioni, dalla raccomandazione di film e serie TV, ai libri, a molto altro ancora.
Per proporre delle raccomandazioni, questi sistemi solitamente predi-cono le preferenze o le valutazioni che gli utenti darebbero agli oggetti utilizzando le informazioni che hanno a disposizione. Queste infor-mazioni possono essere principalmente di due tipi: contenutistico e collaborativo.
Il primo si riferisce alle informazioni riguardanti gli oggetti o gli utenti stessi e le loro caratteristiche, come, ad esempio, il genere di un film o l’et`a di una persona, ed `e alla base di un intero insieme di tecniche per la raccomandazione definite approcci basati sul filtraggio del contenuto. Il secondo si riferisce alle passate interazioni degli utenti ed `e alla base di un altro insieme di tecniche definite approcci basati sul filtraggio col-laborativo.
Nella letteratura dei Sistemi di Raccomandazione `e ben noto che gli ap-procci collaborativi ottengono delle performance migliori nella maggior parte delle situazioni. Tuttavia, a causa della loro struttura, esistono dei casi in cui non possono essere utilizzati, come, ad esempio, quando si vogliono raccomandare dei nuovi oggetti che vengono aggiunti al catalogo. In questi casi, chiamati scenari cold-start, possono essere uti-lizzati solamente approcci basati sul contenuto.
Per questo motivo, nel recente passato sono state studiate alcune tec-niche per ottimizzare le performance di questo tipo di approcci. Una di esse viene chiamata ponderazione delle caratteristiche e consiste nell’assegnare ad ogni caratteristica un valore reale, chiamato peso,
che stimi la sua importanza.
Alcune tecniche statistiche comunemente utilizzate nel campo del Re-cupero di Informazioni, come TF-IDF, sono state adattate per i Sistemi di Raccomandazione, ma, in molti casi, non hanno portato a migliora-menti delle prestazioni rilevanti.
I pi`u recenti metodi sfruttano l’informazione collaborativa per stimare questi pesi, estraendola attraverso tecniche di machine learning. In al-tre parole, valutano l’importanza delle caratteristiche sulla base delle opinioni date dagli utenti.
Modelli di questo tipo hanno mostrato dei risultati molto promettenti, soprattutto se confrontati con i classici metodi statistici. Per questa ragione, in questa tesi proponiamo un nuovo modello di machine learn-ing basato su grafi che apprende la rilevanza delle caratteristiche dalle probabilit`a ottenute tramite algoritmi di filtraggio collaborativo basati sugli oggetti.
Descriveremo il nostro nuovo modello matematico e alcuni dettagli riguardanti la sua implementazione, quindi lo testeremo in diverse situ-azioni cold-start e confronteremo le sue performance con altri approcci appartenenti all’attuale stato dell’arte dei Sistemi di Raccomandazione.
Contents
Abstract 3
Sommario 5
1 Introduction 19
2 State of the art 23
2.1 Structure of a Recommender System . . . 23
2.2 Collaborative filtering . . . 25 2.2.1 Memory-based . . . 25 2.2.2 Model-based . . . 26 2.3 Content based . . . 30 2.4 Similarities . . . 31 2.4.1 Cosine similarity . . . 31 2.4.2 Pearson correlation . . . 31 2.4.3 Jaccard coefficient . . . 32 2.4.4 Tanimoto coefficient . . . 32
2.5 Comparing collaborative filtering and content-based . . 32
2.6 Graph-based methods . . . 33
2.6.1 P3α . . . 36
2.6.2 RP3 β . . . 39
2.7 The cold-start problem . . . 39
2.8 Feature weighting . . . 40
2.8.1 Approaches from Information retrieval literature 41 2.8.2 Approaches from Recommender Systems litera-ture without machine learning . . . 43
2.8.3 Approaches from Recommender Systems litera-ture with machine learning . . . 46
3 Model 51
3.1 Hybrid P3 . . . . 51
3.1.1 From collaborative to Hybrid P3 . . . . 51
3.2 The new model . . . 54
3.3 Learning the solution . . . 56
4 Evaluation 61 4.1 Accuracy metrics . . . 61
4.1.1 Classification metrics . . . 62
4.1.2 Rank accuracy metrics . . . 64
4.1.3 A different concept of performance . . . 66
4.2 Datasets . . . 68
4.2.1 Cleaning the datasets . . . 68
4.2.2 Split for training and testing . . . 69
4.2.3 Training and validation . . . 71
4.2.4 Enriched Netflix Dataset . . . 72
4.2.5 Movielens 20M . . . 74
4.2.6 The Movies Dataset . . . 79
5 Results 85 5.1 Implementation details . . . 85
5.1.1 Gradient simplification . . . 85
5.1.2 Building the trainset . . . 87
5.1.3 Changing the target . . . 90
5.1.4 Adapting the re-ranking in RP3 β . . . 91
5.1.5 Using mini-batches . . . 92
5.1.6 Selecting Nearest Neighbours . . . 94
5.1.7 Early stopping . . . 96
5.2 Testing over datasets . . . 97
5.2.1 Algorithms . . . 97
5.2.2 Complete datasets . . . 98
5.2.3 Removing popular items . . . 108
5.2.4 Learning without top popular . . . 112
6 Conclusions 115
A Performance over complete datasets 123 A.1 Enriched Netflix Dataset . . . 124 A.2 Movielens 20M . . . 128 A.3 The Movies Dataset . . . 132
B Performance over datasets without 5% of top popular
items 137
B.1 Enriched Netflix Dataset . . . 138 B.2 Movielens 20M . . . 142 B.3 The Movies Dataset . . . 146
C Performance over datasets without 10% of top popular
items 151
C.1 Enriched Netflix Dataset . . . 152 C.2 Movielens 20M . . . 156 C.3 The Movies Dataset . . . 160
List of Figures
2.1 An example of a graph . . . 34
2.2 An example of an adjacency matrix . . . 35
2.3 P transition probability matrix . . . 38
2.4 P3 α matrix multiplication . . . 38
3.1 Example of 3-step paths on a complete graph . . . 52
3.2 P3 hybrid matrix example . . . . 52
3.3 P3 hybrid matrix multiplication . . . . 53
3.4 Merge processes of content-based and collaborative paths 55 4.1 URM split schema for train and test procedures. Sub-matrix A is used for training, while B is used for testing. 70 4.2 URM split schema for train, validation and test proce-dures. Submatrix A is used for training, while B is used for testing. Dots in submatrix A represent the ratings extracted to form the validation set. . . 71
4.3 Distribution of interactions among items before the ran-domization process of item IDs . . . 78
4.4 Distribution of interactions among items after the ran-domization process of item IDs . . . 79
5.1 Performance on validation set of Movielens 20M with different gradients . . . 86
5.2 Performance on test set of Movielens 20M with different gradients . . . 87
5.3 Performance on validation set of Movielens 20M using different numbers of nearest neighbours in target matrix 89 5.4 Performance on test set of Movielens 20M using different numbers of nearest neighbours in target matrix . . . . 89 5.5 Performance on validation set of Movielens 20M using
5.6 Performance on test set of Movielens 20M using 100 nearest neighbours with and without negative samples 91 5.7 Performance on validation set of Movielens 20M with
different batch sizes . . . 93 5.8 Performance on test set of Movielens 20M with different
batch sizes . . . 94 5.9 Performance on validation set of Movielens 20M using
different numbers of nearest neighbours in recommen-dations . . . 95 5.10 Performance on test set of Movielens 20M using different
numbers of nearest neighbours in recommendations . . 95 5.11 Comparing performance trends on validation and train
sets of Movielens 20M. Best scores are highlighted with a vertical line. . . 96 5.12 Distribution of features among items over different
com-plete datasets. On the x-axis the items are sorted by descending popularity, so that for every point, the y in-dicates the number of features of the x-th most popular item. . . 111
List of Tables
4.1 Confusion matrix for a recommender system[18] . . . . 63 4.2 Number of interactions per rating in Netflix Dataset . . 72 4.3 Number of interactions per rating after filtering in
Net-flix Dataset . . . 72 4.4 Distribution of features among items in Netflix Dataset 73 4.5 Details of matrix A for Netflix Dataset . . . 73 4.6 Details of matrix B for Netflix Dataset . . . 74 4.7 Details of train-validation split for Netflix Dataset . . . 74 4.8 Number of interactions per rating in Movielens 20M . . 75 4.9 Number of interactions per item in Movielens 20M . . . 75 4.10 Number of interactions per rating after filtering in
Movie-lens 20M . . . 76 4.11 Distribution of features among items in Movielens 20M 77 4.12 Details of matrix A for Movielens Dataset . . . 78 4.13 Details of matrix B for Movielens Dataset . . . 79 4.14 Details of train-validation split for Movielens Dataset . 79 4.15 Number of interactions per rating in The Movies Dataset 80 4.16 Number of interactions per item in The Movies Dataset 81 4.17 Number of interactions per rating after filtering in The
Movies Dataset . . . 82 4.18 Distribution of features among items in The Movies Dataset 82 4.19 Details of matrix A for The Movies Dataset . . . 82 4.20 Details of matrix B for The Movies Dataset . . . 83 4.21 Details of train-validation split for The Movies Dataset 83
5.1 Density of trainset URMs and targets in different datasets 88 5.2 Density of trainset ICMs and common features matrix
in different datasets . . . 88 5.3 Performance @5 of different algorithms on test set of
5.4 Performance @10 of different algorithms on test set of Enriched Netflix dataset . . . 101 5.5 Performance @5 of different algorithms on test set of
Movielens 20M dataset . . . 103 5.6 Performance @10 of different algorithms on test set of
Movielens 20M dataset . . . 104 5.7 Performance @5 of different algorithms on test set of
The Movies dataset . . . 106 5.8 Performance @10 of different algorithms on test set of
The Movies dataset . . . 107 5.9 Percentage of most popular items zeroed in target
ma-trix for each dataset . . . 112 5.10 Performance @5 difference of hybrid algorithms learning
from target matrices with top popular rows zeroed out compared to complete ones . . . 113 5.11 Performance @10 difference of hybrid algorithms
learn-ing from target matrices with top popular rows zeroed out compared to complete ones . . . 114
A.1 Performance @5 of different algorithms on validation set of Enriched Netflix dataset . . . 124 A.2 Performance @10 of different algorithms on validation
set of Enriched Netflix dataset . . . 125 A.3 Performance @5 of different algorithms on test set of
Enriched Netflix dataset . . . 126 A.4 Performance @10 of different algorithms on test set of
Enriched Netflix dataset . . . 127 A.5 Performance @5 of different algorithms on validation set
of Movielens 20M dataset . . . 128 A.6 Performance @10 of different algorithms on validation
set of Movielens 20M dataset . . . 129 A.7 Performance @5 of different algorithms on test set of
Movielens 20M dataset . . . 130 A.8 Performance @10 of different algorithms on test set of
Movielens 20M dataset . . . 131 A.9 Performance @5 of different algorithms on validation set
of The Movies dataset . . . 132 A.10 Performance @10 of different algorithms on validation
A.11 Performance @5 of different algorithms on test set of The Movies dataset . . . 134 A.12 Performance @10 of different algorithms on test set of
The Movies dataset . . . 135
B.1 Performance @5 of different algorithms on validation set of Enriched Netflix dataset without 5% of top popular items . . . 138 B.2 Performance @10 of different algorithms on validation
set of Enriched Netflix dataset without 5% of top popu-lar items . . . 139 B.3 Performance @5 of different algorithms on test set of
Enriched Netflix dataset without 5% of top popular items140 B.4 Performance @10 of different algorithms on test set of
Enriched Netflix dataset without 5% of top popular items141 B.5 Performance @5 of different algorithms on validation set
of Movielens 20M dataset without 5% of top popular items142 B.6 Performance @10 of different algorithms on validation
set of Movielens 20M dataset without 5% of top popular items . . . 143 B.7 Performance @5 of different algorithms on test set of
Movielens 20M dataset without 5% of top popular items 144 B.8 Performance @10 of different algorithms on test set of
Movielens 20M dataset without 5% of top popular items 145 B.9 Performance @5 of different algorithms on validation set
of The Movies dataset without 5% of top popular items 146 B.10 Performance @10 of different algorithms on validation
set of The Movies dataset without 5% of top popular items . . . 147 B.11 Performance @5 of different algorithms on test set of
The Movies dataset without 5% of top popular items . 148 B.12 Performance @10 of different algorithms on test set of
The Movies dataset without 5% of top popular items . 149
C.1 Performance @5 of different algorithms on validation set of Enriched Netflix dataset without 10% of top popular items . . . 152
C.2 Performance @10 of different algorithms on validation set of Enriched Netflix dataset without 10% of top pop-ular items . . . 153 C.3 Performance @5 of different algorithms on test set of
Enriched Netflix dataset without 10% of top popular items . . . 154 C.4 Performance @10 of different algorithms on test set of
Enriched Netflix dataset without 10% of top popular items155 C.5 Performance @5 of different algorithms on validation set
of Movielens 20M dataset without 10% of top popular items . . . 156 C.6 Performance @10 of different algorithms on validation
set of Movielens 20M dataset without 10% of top popular items . . . 157 C.7 Performance @5 of different algorithms on test set of
Movielens 20M dataset without 10% of top popular items 158 C.8 Performance @10 of different algorithms on test set of
Movielens 20M dataset without 10% of top popular items 159 C.9 Performance @5 of different algorithms on validation set
of The Movies dataset without 10% of top popular items 160 C.10 Performance @10 of different algorithms on validation
set of The Movies dataset without 10% of top popular items . . . 161 C.11 Performance @5 of different algorithms on test set of
The Movies dataset without 10% of top popular items . 162 C.12 Performance @10 of different algorithms on test set of
Chapter 1
Introduction
Recommender Systems are systems that try to predict, with the avail-able information, preferences or ratings that users would give to items. E-commerce and Social Media are only two examples of fields that have grown consistently in last years and that can get a lot of advantages from their application. For this reason, Recommender Systems have attracted much interest in the recent past and nowadays they play an important role in many major companies. It was October 2006 when Netflix organized one of the first and most popular competitions about Recommender Systems, that took about three years to complete and had a grand prize of 1 million US dollars. Today, many different com-panies profit from recommenders: just think about how important they are for a colossus like Amazon, to reduce the steps between the user and what he is going to buy, or maybe to tempt him showing something he would have never bought otherwise.
To make recommendation, Recommender Systems take advantage of all the information they have, that can be categorized in two main classes: content and collaborative.
Content information is everything we know about how an item or a user is. For each of them a profile containing their characteristics, called fea-tures, is built and is used to understand which items, or users, have similar profiles. For example the profile of the book ”And Then There Were None”, ”Dieci piccoli indiani” in italian, would contain infor-mation like that it has been written by Agatha Christie, that it is a mystery novel, and so on.
Collaborative information, instead, tries to exploit the tastes of users from their past interactions, so two items are similar if they have been
rated by a high number of common users, and users are similar if they rated a high number of common items.
These two categories of information form two main classes of approaches for Recommender Systems: Content-based filtering and Collaborative filtering approaches respectively. Both have advantages and disadvan-tages, but from literature is well known that Collaborative filtering al-gorithms perform better in most situations. However, they are usually strongly influenced by popularity of items and there are cases where they can not be applied.
One of this cases is called the cold-start scenario, that is when a new item is added to the catalogue, so it has no previous interactions, and it has to be recommended. In this situation, a collaborative approach could not recommend the interested item, since it has no previous in-teractions, so a content-based one has to be adopted, with a consequent loss of performance. In fact, the quality of content-based approached is influenced by the availability of significant and structured features, that, instead, are very often user generated and noisy.
Many different techniques have been studied in the past to optimize the performance of content-based algorithms, from simple approaches to complex machine learning models. One of them is called feature weighting: it consists in estimating the importance of every content feature and to formalize it with a real value, called weight. Higher is the weight of a feature, higher is its relevance.
In information retrieval, one of the most famous methods for feature weighting is called TF-IDF, Term Frequency - Inverse Document Fre-quency, and it has been adapted to be used also in content-based recom-menders. However, more recently have been proposed some techniques that extract collaborative information from collaborative approaches and try to embed it in feature weights, estimating feature importance from past interactions of users. Most of them take advantage of ma-chine learning techniques, but, since it is a quite novel field, only a few approaches have been presented, and many of them are hard to tune or require a long computation time. Anyway, they have shown very promising results, so that recently they have attracted a particular in-terest.
In this thesis we propose a novel, graph-based, feature-based machine learning model for recommendation with feature weighting and collab-orative information embedding, extracted from item probability
matri-ces of collaborative filtering approaches.
After defining the mathematical model and describing some techniques adopted for the implementation, we will compare it with other state-of-the-art approaches in cold-start scenarios over three different datasets, including the well known Movielens 20M.
Chapter 2
State of the art
Recommender systems are a subset of information filtering systems that try to predict the ratings or the preferences that a user would give to an item[5]. They have become very popular in recent years and are used in various web applications. We are going to see how a recommender system is structured and how it makes recommendations. Then we will present the feature weighting problem and the reasons why it has become an interesting topic in information retrieval and, particularly, in the recommender field.
2.1
Structure of a Recommender System
Recommender systems can have different structures, depending on the reality they are representing, but the problem that a recommendation system is modeled to solve is always the same: recommend items to users. However, it can sometime happen that the problem is inverted, and is asked to recommend users to items [1], or also to recommend users each other, like, for example, in social networking systems. For simplicity we will focus on the most common problem mentioned above and we will try to predict user ratings or preferences for items.
To make predictions, many different types of information can be used. In literature, the most common algorithms use past user interactions with items or information about the content of the items or the users, also called profiles. For example, imagine that we have to write a rec-ommender system for Amazon. A user has just bought a new laptop and the algorithm should recommend another item to him. One algo-rithm could recommend an item which profile is similar to the laptop he
has just bought (probably another laptop with similar characteristics), another could recommend an item that other users bought after that laptop, like a cover or a backup battery. The first algorithm may seem to provide a bad recommendation, since it is unusual, for a common user, to buy two laptops in a short amount of time. But think about books. That algorithm would probably suggest another book similar to the bought one, which seems to be a more reasonable choice for a recommendation.
However, independently on how the information is manipulated to make predictions, there are some data structures that are widely used in recommendation systems to represent the information we have about the reality to model:
• URM: the User Rating Matrix is a matrix used to represent the ratings or the preferences already expressed by users for items. Consequently, calling U the set of all the users and I the set of all the items, the shape of the URM is |U | × |I|. Depending on the reality we are modeling, the values of each cell are, usually, booleans (0 or 1) if we are interested in representing only the presence or the absence of the interaction between user and item, or real numbers if we want to represent the rating that a user gave to an item. From now on, we will use the notation U RM [u, i] to identify the interaction between user u and item i, that is the value of the URM at row u and column i.
• ICM: the Item Content Matrix is a matrix used to represent the profiles of the items. Calling F the set of all the features, the shape of the ICM is |I| × |F |. A feature, or attribute, represents an information about the content of the item. For instance, the features of a laptop could include the CPU model, the quantity of RAM, or the size of the screen. The values of a feature can be of different types: real valued, string, boolean and so on. This diver-sity in data representation can create difficulties in calculating the similarity between item profiles, so, usually, the different types of feature values are encoded in a binary profile. This means that every cell of the ICM indicates if the item identified by the row index has the feature identified by the column index. From now on, we will use the notation ICM [i, f ] to identify the value of the ICM at row i and column f , which is if item i has feature f .
• UCM: the User Content Matrix is a matrix used to represent the profiles of the users. The shape of the UCM is |U | × |F |. Concep-tually it has the same meaning of the ICM, that is it indicates if the user identified by the row index has the feature identified by the column index.
Basing on informations used to make recommendations, standard mod-els for recommender systems can be split in two macro-categories: col-laborative filtering and content-based [2].
2.2
Collaborative filtering
The basic idea of collaborative filtering methods is that the ratings that a user has not given explicitly to the items can be inferred, because the observed ratings are often highly correlated across various users and items[2, 36, 16]. For example, consider two users who have very sim-ilar tastes. If the ratings, specified by both of them, are very simsim-ilar, then their similarity can be exploited by the underlying algorithm. In such cases, it is very likely that the ratings in which only one of them has specified a value, are also probably going to be similar. But this similarity can also be used to infer the ratings for items that none of the two users has specified yet.
Collaborative filtering is considered to be the most popular and widely implemented technique in RS[36].
There are two types of methods that are commonly used in collabora-tive filtering, which are referred to as memory-based (or neighborhood-based) methods and model-based methods[36, 2]:
2.2.1 Memory-based
Memory-based techniques were among the earliest collaborative fil-tering algorithms, in which the ratings are predicted on the basis of user neighborhoods. Contrary to model-based techniques, they use the stored ratings directly in the prediction[36] and they are usually easier to implement[2]. The neighborhoods can be defined in one of two ways:
User-based
User-based recommendation methods predict the rating of a user u for a new item i using the ratings given to i by users most similar
to u[36]. So we have to define a preference similarity sim(u, v), that should summarize in a singular value how much two users u and v have similar tastes. This similarity value, then, will be used to weight the contribution of the ratings of user v to the prediction of the ratings of user u. The predicted rating U RM0[u, i] for user u and item i can be estimated as U RM0[u, i] = P v∈UU RM [v, i] · sim(u, v) P v∈U|sim(u, v)| Item-based
In item-based approaches, in order to make the rating predictions for a target item by a user, we have to determine a set of items that are most similar to the target item. The idea is that two items are similar if many users have rated them both. Then we use ratings specified by the user in that item set to predict if he will like the target item[2]. Therefore, in this case we have to define a similarity between two items, let’s call them i and j, and the predicted rating U RM0[u, i] for user u and item i becomes
U RM0[u, i] = P
j∈UU RM [u, j] · sim(i, j)
P
j∈I|sim(i, j)|
2.2.2 Model-based
The main difference between memory-based and model-based tech-niques is that the latest do not rely on the whole dataset every time recommendations are necessary, but they extract information from the dataset to build a model and they use it to make recommendations. Model-based methods include machine learning and data mining tech-niques in the context of predictive models[2]. Parameters of the model, if necessary, are learned within the context of an optimization frame-work. Some examples of such methods include decision trees, rule-based models, Bayesian methods and latent factor models.
SLIM
Sparse LInear Method[32, 7] is an algorithm proposed by Ning and Karypis for top-N recommendations. The recommendation score on
an un-rated item i of a user u is calculated as a sparse aggregation of items that have been rated by u:
U RM0[u, i] =X
j∈I
U RM [u, j] · W [j, i]
that in vector form becomes
U RM0[u, i] = U RM [u] WT[j]
Thus, the whole model can be presented as
U RM0 = U RM W
where W is a |I| × |I| sparse matrix of unknown aggregation coeffi-cients. For this reason, matrix W has to be learnt using the original U RM as ground truth and is defined as the minimizer of the following regularized optimization problem:
argmin W 1 2||U RM − U RM W || 2 F + β 2||W || 2 f + λ||W ||1
where every value of W is positive (W ≥ 0), to avoid negative coef-ficients that can cause problems when making recommendations, and the main diagonal of W is null (diag(W ) = 0), in order to avoid the trivial, and meaningless, solution where W is the identity matrix.
Matrix Factorization
Matrix factorization is at the base of some of the most successful im-plementations of latent factor models[24]. Latent factor models try to explain the ratings by characterizing both items and users on an arbitrary number of latent factors inferred from the ratings patterns. A high correspondence between item and user factors leads to a high predicted rating and, consequently, to a recommendation.
Matrix factorization models have gained popularity in the recent past, because they are able to combine a good scalability with an interesting accuracy[36, 24]. In the basic matrix factorization model, calling L the set of the latent factors, the User Ratings Matrix is approximately factorized into a |U | × |L| matrix U and a |I| × |L| matrix V as follows:
Each column of U and V is referred to as a latent vector or latent component, and each row of U and V is referred to as a latent factor[2]. The difference is that U is referred to users, while V is referred to items. It follows that every row of matrix U defines a representation of a user and every row of V defines a new representation of an item.
The major challenge is computing the mapping of each item and user to latent vectors[24]. After the model completes this mapping, the rating a user u will give to any item i can be easily estimated as:
U RM0[u, i] = U [u] V [i]T
Various matrix factorization methods have been proposed in recom-mender system literature. Two of the most famous are Singular Value Decomposition and Alternating Least Square.
SVD Singular Value Decomposition is a well-established technique for identifying latent semantic factors in information retrieval[24]. In the collaborative filtering domain, SVD requires factoring the URM. This process often raises difficulties due to the high sparsity of the URM, since conventional SVD is undefined when knowledge about the matrix is incomplete. Earlier systems relied on imputation to fill in missing ratings and make the rating matrix dense, but it can require a significant amount of time and available memory space. Moreover, inaccurate imputation might distort the data and produce inefficient recommendations[24]. Several versions of SVD have been proposed in literature to overcome these drawbacks[34]. One of the most famous is the one used for the Netflix competition[14] that uses only the observed ratings and avoids overfitting through a regularized model[24]. It is an approximate solution, since the exact decomposition obtained through basic SVD would put a 0 in every cell of the URM that coincides with an item not rated by a user. In a similar situation, the recommender system would not be able to distinguish between good and bad items for recommendations, since all the estimated ratings of a user for un-rated items would be 0. Parameters of U and V are then estimated by minimizing the sum of squared residuals, one factor at a time, using stochastic gradient descent with regularization and early stopping[34]. We can define the update rules for user u, item i and factor k with learning rate η as:
U [u, k] = U [u, k] + η · (eu,iV [i, k] − λU [u, k])
V [u, k] = V [u, k] + η · (eu,iU [u, k] − λV [i, k])
ALS The basic idea behind the Alternating Least Squares optimiza-tion method, is to, starting with an initial set of matrices U and V , iterate the following two steps until convergence[2]:
1. Keeping U fixed, solve for each of the |I| rows of V by treating the problem as a least-squares regression problem, using only the observed ratings. In order to determine the optimal vector V [i], which is the ith row of matrix V , we wish to minimize
X u:U RM [u,i]>0 U RM [u, i] − |L| X s U [u, s] · V [i, s] 2
The terms U [u, 1] . . . U [u, |L|] are treated as constant values, whereas V [i, 1] . . . V [i, |L|] are treated as optimization variables. A total of |I| such least-squares problems need to be executed, and each least-squares problem has |L| variables, but since the least-squares problem for each item is independent, this step can be parallelized easily[2].
2. Keeping V fixed, solve for each of the |U | rows of U by treating the problem as a least squares regression problem, using only the observed ratings. In order to determine the optimal vector U [u], which is the uth row of matrix U , we wish to minimize
X i:U RM [u,i]>0 U RM [u, i] − |L| X s U [u, s] · V [i, s] 2
The terms V [i, 1] . . . V [i, |L|] are treated as constant values, whereas U [u, 1] . . . U [u, |L|] are treated as optimization variables. A total of |U | such least-squares problems need to be executed, and each least-squares problem has |L| variables, but since the least-squares problem for each user is independent, this step can be parallelized easily[2].
2.3
Content based
Content-based systems are designed to exploit scenarios in which items can be represented with descriptive sets of attributes. In such cases, a user’s own ratings on other items are sufficient to discover meaningful new recommendations[2]. Content-based recommender systems try to recommend to users, items that are similar to what they have liked in the past. This similarity is not based on rating correlations across users, but on the attributes of the objects liked by the users. This means that, unlike collaborative systems, which explicitly leverage the ratings of other users in addition to that of a target user, content-based systems focus only on a target user’s own ratings and the attributes of the items liked by that user. In other words, the content-based methodology leverages a different source of data for the recommenda-tion process[2].
Let’s make an example and consider a movie recommendation system. We know that a user has rated highly the movie Terminator and now we want to recommend him a new movie. The item description of Ter-minator contains similar genre keywords as other science fiction movies, like Alien and Predator. Consequently, one these movies can be chosen for the recommendation.
The first phase in all content-based models is to extract discriminative features for representing the items. Discriminative features are those which are highly predictive of user interests or, in other words, those that are more meaningful to describe the interests of a user. Items usu-ally have one or more fields, often represented by unstructured texts, that describe various aspects of an item. If we think about a movie, it could have a field for the list of actors, one for the director, one for the genre, and so on. The most common approach is to extract keywords or bag of keywords from the available data[2]. These keywords, or fea-tures, can be used to build the Item Content Matrix that we presented in section 2.1.
Once the feature extraction is completed, the recommendation phase can be addressed. One of the most diffuse implementations of Content-based recommender systems are the nearest neighbour algorithms[36], as seen in collaborative filtering. They require only to define a similar-ity function sim(i, j), given any couple of items i and j, which has to describe the closeness between the descriptions of the two items. Then, the predicted rating U RM0[u, i] for user u and item i can be estimated
as
U RM0[u, i] = P
j∈UU RM [u, j] · sim(i, j)
P
j∈I|sim(i, j)|
2.4
Similarities
As presented in Sections 2.2 and 2.3, both content-based and collab-orative filtering recommender systems often require the definition of an appropriate similarity, or distance, measure[36]. For this purpose, usually users and items are represented by vectors or sets, so that math-ematical or statistical approaches studied for these type of structures can be adopted.
2.4.1 Cosine similarity
One of the most common similarity measure is the cosine similarity, that is the cosine of the angle that two vectors form. Given two vectors x and y of an n-dimensional space, the cosine similarity is defined as
cos(x, y) = x y ||x||||y||
where indicates the common dot product and ||x|| indicates the norm of the vector x. Note that, considering the possibility to have negative values in vectors x and y, the cosine similarity ranges from −1, not similar at all, to +1, identical.
2.4.2 Pearson correlation
Pearson correlation is the most commonly used correlation coefficient, which measures the linear relationship between objects[36]. The orig-inal formula for the Pearson correlation ρ of two random variables X and Y is written as
ρX,Y =
cov(X, Y ) σXσY
where cov is the covariance and σX and σY are the standard deviations
of X and Y respectively. This formula can be applied to samples and becomes r = Pn i=1(xi − x)(yi− y) pPn i=1(xi− x)2 pPn i=1(yi− y)2
where n is the size of the sample and x and y are the sample means of x and y respectively. Note that the Pearson coefficient ranges from −1, total negative linear correlation, to 1, total positive linear correlation, while 0 means no linear correlation.
2.4.3 Jaccard coefficient
Jaccard similarity coefficient is a statistic used to measure the similarity of sample sets. Considering two sets X and Y , the Jaccard coefficient is defined as
J (X, Y ) = |X ∩ Y | |X ∪ Y | =
|X ∩ Y | |X| + |Y | − |X ∩ Y |
Note that 0 ≤ J (X, Y ) ≤ 1 since the cardinality of a set is always non-negative and |X| + |Y | ≥ 2|X ∩ Y |. If both X and Y are empty, usually, the Jaccard coefficient is assumed to be 1.
2.4.4 Tanimoto coefficient
Tanimoto similarity coefficient[47, 37], also known as Extended Jac-card, has its origin in the comparison of sample sets, like Jaccard coef-ficient, but has been adapted for real valued vectors[23]. Considering two vectors x and y, the Tanimoto coefficient is defined as
T (x, x) = x y
||x||2+ ||y||2− x y
Note that if x and y are binary vectors with values ∈ {0, 1}, the Tani-moto coefficient reduces to Jaccard[46].
2.5
Comparing collaborative filtering and
content-based
Collaborative approaches overcome some of the limitations of content-based ones[36]. Sometimes the content for some items is not available, difficult to obtain or noisy, but through the feedbacks of other users they can still be recommended. Sometimes if two items have a similar content it does not mean that they have also a similar quality, but collaborative recommendations are based on the quality of items as evaluated by peers. Sometimes users show interest in items with very
different content, collaborative filtering can recommend such items, as long as other users have already shown interest for these different items. However, also collaborative filtering approaches have some drawbacks. An example: recommendation lists generated by collaborative filter-ing algorithms contain frequently some well known blockbusters. This happens because the ranking of items is influenced by items popularity that leads to a low coverage, a metric described in detail in Chapter 4. Moreover, both collaborative filtering and content-based algorithms are prone to generate, in different forms, a phenomenon commonly called filter bubble. It is a concept introduced by Eli Pariser[33], who argued that the ability to adapt and adopt new information is at the base of hu-man intelligence, while recommender systems trap users into a static environment, the filter bubble, which reduces creativity and learning ability. Collaborative filtering systems tend to create a bubble around very popular items, since, as stated before, they are strongly influenced by them. Content-based ones, instead, usually create a bubble around items similar to those users interacted with in the past: if they clicked on a laptop, the recommender system will recommend other laptops, if they clicked on a book written by Agatha Christie, it will recom-mend other books of Agatha Christie, and so on. An interesting study about how and in which measure recommender systems contribute to the creation of filter bubbles has been conduced by Nguyen et al.[31]
2.6
Graph-based methods
In graph-based approaches, data is represented in the form of a graph, where users and items are nodes, and edges encode the interactions among users and items.[36]. In this document, we will represent these models with directed, weighted graphs. Directed means that every edge has a specified direction, while weighted means that every edge has a value that represents its importance. Three different types of nodes can be defined:
• User node: each node of this type represents a single user • Item node: each node of this type represents a single item • Feature node: each node of this type represents a single feature
of an item
• Edges from a user node u to an item node i (and another from i to u) if u has rated i
• Edges from an item node i to a feature node f (and another from f to i) if i has f
The result is a particular form of a tripartite graph, as shown in figure 2.1. The peculiarity is that there are not edges among user nodes and feature nodes, that in the described graph can not exist since features refer to items, not to users, while a generic tripartite graph would allow their existence.
Note that every edge has its ”inverse” with tail and head exchanged. This means that, to define the graph, both undirected edges and cou-ples of directed edges can be used. The difference is the possibility to give weights to edges. Let’s take, for example, two nodes A and B and the edge, or edges, between them. A single undirected edge would have a singular weight, so there is no difference in going from A to B and from B to A. A couple of directed edges, instead, would allow to give different weights to the two arcs, so the edge that goes from A to B could have a different weight respect the one that goes from B to A.
Graph-based approaches do not allow directly connected nodes to
in-Figure 2.1: An example of a graph
fluence each other by propagating information along the edges of the graph[36]. The weights of the edges influence the quantity of informa-tion that is allowed to pass through them: greater the weight, more information is ”transmitted”.
Before presenting some known algorithms based on graphs, we intro-duce some common concepts that help to represent mathematically the model we are studying. The first is the adjacency matrix, that from now on we will call A, that is, conceptually, a matricial representation of the graph. A has a row and a column for every node, independently from the type, and is defined as:
A[i, j] = (
1 if exists the edge from i to j 0 otherwise
Can be noted that, if the graph is undirected, then A is symmetric. If we consider a directed graph, instead, A can not be assumed sym-metric. Moreover, in case the edges in the model are weighted, such information has to be included in matrix A. For this purpose, we can simply modify A by replacing the binary representation of the edges with their respective weights:
A0[i, j] = (
wi,j if exists the edge from i to j
0 otherwise
where wi,j is the weight of the edge (i, j).
As we can see in figure 2.2, A0 can be built using the matrices we have
Figure 2.2: An example of an adjacency matrix
the URM can be used to represent the edges between user and item nodes and the ICM can be used to represent edges between items and features. Another concept we have to introduce, which is the basis of many graph-based algorithms, is the random walk. A random walk process on a graph can be seen as a discrete Markov chain, where an hypothetical walker starts on a node and at each time step chooses randomly one of its neighbours and moves to it[35]. A Markov chain is a stochastic process that satisfies the Markov property, also called ”memorylessness”. In other words it means that the next action, that in our case is the edge we are going to follow, is stochastically chosen among a set of options which probabilities do not depend on the past actions and states, nodes in our case, but only on the state we are in. A Markov chain is characterized by a particular squared matrix, called Markov matrix or transition matrix, which is used to describe the transitions of the process. Every entry is a non-negative real value that represents the probability of reaching the node represented by the column index from the node represented by the row index in a single step. The transition matrix, that we will call P , can be obtained by normalizing row by row the adjacency matrix A0:
P [i, j] = A
0[i, j]
P
k∈V A0[i, k]
where V is the set of all the nodes and i, j ∈ V . There are many algorithms in recommender system literature based on random walk models [13, 25, 8, 17], we will present some implementations.
2.6.1 P3α
P3
α is an algorithm based on the definition of P . Due to the
collab-orative nature of this algorithm, we will not consider feature nodes, so adjacency and transition matrices will cover a set V of nodes that contains only nodes of type user and item. From theory we know that elevating P at the power of n, we get the probability to reach the node indicated by the column index starting from the node indicated by the row index after a random walk of n steps. This is also the origin of the name of the algorithm: P3 stands for the third power of the transition
matrix P of a Markov chain[8]. What P3does, is ranking the items, for
every user u, according to the distribution of the third vertex on the random walk starting at u. Other values of n have been tested, but it
has been shown that 3 usually outperforms higher parameter values[8] (and values lower than 3 are meaningless). To improve performances, a real valued parameter α has been introduced into the original P3
model. We define Pα as derived from the original transition
probabil-ity matrix P where every positive entry is elevated at the power of α, which value has to be found in validation phase.
Pα[i, j] = A0[i, j] P k∈V A0[i, k] α
It follows that, calling V the set of all the nodes, u, i ∈ V are respec-tively a generic user u and a generic item i
Pα3[u, i] = X
v∈V
X
j∈V
Pα[u, j] · Pα[j, v] · Pα[v, i] (2.1)
The equation (2.1) is easy to implement, but has a main drowback. P has a number of rows and columns equal to the number of users plus the number of items, that corresponds, in real cases, to hundreds of thousands of rows and columns, that corresponds to a very high amount of memory required for computation. Moreover, even though P derives directly from the U RM and inherits its sparsity property, the multiplication cancels that property leading to a very dense ma-trix. However, with a brief analysis of matrix P , we can simplify the calculation of matrix P3
α. We can note from figure 2.3 that, due to the
absence of edges among nodes of the same type, we have two blocks in Pα that contain only zeros and two more blocks that can be derived
from the U RM :
• Pui: the top-right block of non-zero elements, with shape |U |×|I|,
that represents the edges from user nodes to item nodes. It can be obtained by normalizing row by row the U RM matrix and elevating it at the power of α:
Pui[u, i] =
U RM [u, i] P
j∈IU RM [u, j]
!α
• Piu: the bottom-left block of non-zero elements, with shape |I| ×
|U |, that represents the edges from item nodes to user nodes. It can be obtained by normalizing row by row the U RM matrix transposed and elevating it at the power of α:
Piu[i, u] = U RMT[i, u] P v∈UU RMT[i, v] α
Figure 2.3: P transition probability matrix
This property can help in reducing the calculation of Pα3 to a matri-cial multiplication among non-zero blocks, as shown in figure 2.4. The
Figure 2.4: P3α matrix multiplication
same figure shows also that the final matrix Pα3 has the same block structure of P . As stated before, what we are interested in to make recommendations is the probability distribution among the third ver-tices, starting from user nodes. These probabilities are contained in the top-right block of the final matrix, which means that what we need to make recommendations can be obtained by the following block mul-tiplication:
2.6.2 RP3 β
Paudel et al.[35] noted that the ranking of items, according to the transition probability matrix P used in P3
α, is strongly influenced by
item popularity. The result is that recommendation lists are dominated by the presence of blockbusters, or, in other words, item nodes with a high degree. To improve recommendations with less common items and to compensate for the influence of popularity, they introduced a simple re-ranking procedure dependent on item-popularity. Given the predicted score U RM0[u, i] for user u and item i, which is the result of P3
α algorithm for user u and item i:
U RM00[u, i] = U RM
0[u, i]
D[i]β
where D[i] is the indegree, the number of incoming edges of a node in a graph, of item i and β is a real valued, strictly positive pa-rameter. Take, as an example, two items i and j and suppose that U RM0[u, i] = U RM0[u, j], that means that we have the same proba-bility of reaching nodes i and j after 3 steps, starting from node u. If i has a lower indegree than j (D[i] < D[j]) the effect of re-ranking will be to promote the recommendation of the less popular item j, ranking it in a higher position, since U RM00[u, i] > U RM00[u, j].
The idea is that parameter β should regulate the influence of item in-degrees in re-ranking procedure: a low value means a low popularity penalty, a high value means a high popularity penalty. Note that, if β = 0, the results of RP3
β will coincide with Pα3 ones, since D[i]0 = 1.
This approach has been proved to increase, compared to P3
α, not only
accuracy performance, but also to improve diversity in recommendations[35]. Accuracy and diversity are described in detail in Chapter 4.
2.7
The cold-start problem
When a new item is added to a ratings matrix, it has no ratings from the various users. Consequently, none of the memory-based and model-based collaborative filtering methods, that are known as the best per-forming approaches in most cases, would recommend such an item, be-cause ratings are not available for recommendation purposes[2]. This is known as the cold-start problem for a new item.
recommender system literature[27, 41]. Usually, content information is used to bridge the gap from rated items to new items[41]. However, as stated before, it is known that content-based approaches perform worse in most situations compared to collaborative ones. The challenge, then, is to optimize content-based algorithms, possibly by exploiting collab-orative information that can be reused in cold-start scenarios. One of the most common techniques to perform this task is feature weighting.
2.8
Feature weighting
In content-based recommendation every item is represented by a fea-ture vector or an attribute profile[10]. The feafea-tures, depending on the aspect of the item they represent (for example title, actors, ...), can hold numeric or nominal values.
To compute the similarity of two items, various distance measures be-tween the feature vectors have been proposed in recommender system literature. These similarity values are then used in content-based al-gorithms to obtain a ranked list of recommended items. The most simple similarities we can consider, like cosine or Pearson, implicitly assert equal importance of all features. However, when a user esti-mates the similarity between two items often gives different weights to different attributes. For example, while choosing a movie, the genre may be more important than the name of the music director or the producer. It may be stated that users base their judgements on some latent criteria which is a weighted linear combination of the differences in individual attribute. So we can define a similarity S between two items I1 and I2 as
S(I1, I2) = w f (I1, I2) =
X
c∈F
wc· fc(I1, I2)
where w is a vector of feature weights, f (I1, I2) is the vector of
sim-ilarities of values between I1 and I2, that depend on the type of the
attributes (numeric, nominal, boolean, ...), and F is the set of features. Usually, the values of f are normalized, for simplicity, to have values in range [0, 1]. The vector w of feature weights, instead, is unknown. The target of feature weighting is to try to predict these weights to improve recommendation performance.
2.8.1 Approaches from Information retrieval literature
In information retrieval, we have documents composed by terms instead of items described by features, but the two concepts are closely related. The term weighting problem has been widely discussed in Informa-tion Retrieval literature. Many of the proposed soluInforma-tions are based on the study of the statistical distribution of terms among the docu-ments. These methods usually increase the performance of standard, non-weighted algorithms, but their potential is limited and dataset de-pendent.
TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and it is a weight often used in information retrieval and text mining. This weight is a statistical measure that represents how important a term, or word, is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document and to the inverse of the frequency of the word in the collection. It can be simply applied also to recommendation tasks by interpreting the word in a document as a feature in an item and the collection as our whole ICM. Typically, the TF-IDF weight is composed by two terms multiplied by each other. It’s value for item i and feature f can be obtained as:
T F -IDF = T Ff,i· IDFi
• TF: Term Frequency. To better understand the idea behind this term, let’s make a simple example. Suppose we have a set of doc-uments and wish to determine which one is most relevant to the query ”the red sock”. The first thing we can do is to eliminate documents that do not contain all three words. However, this filter still leaves too many documents. The idea that allows to further distinguish them, is that we might count the number of times each term occurs in each document. This number is called its term frequency. However, since the case where the length of documents varies greatly is quite common, TF is usually normal-ized dividing it by the document length. The first form of term weighting is due to Hans Peter Luhn[28] and is based on his as-sumption: The weight of a term that occurs in a document is simply proportional to the term frequency.
As done before, we can draw a parallel with recommendation sys-tems. If we interpret a word as a feature and the document as an item, we can apply the concept expressed before: TF gives more importance to a feature depending on its value (if it has one) and the total value of the features that belong to the same item. TF for feature f and item i can be calculated as
T Ff,i=
nf,i
P
c∈F nc,i
where nf,iis the frequency of feature f in item i (usually a feature
is present only once in an item or it is not present at all, so nf,iis
a boolean value). We have presented the simplest normalization term, while in literature have been proposed different types of normalization that can adapt better to a situation or other.
• IDF: Inverse Document Frequency. Think about the example presented for TF. The term ”the” is very common in English doc-uments, so term frequency will tend to give more importance to documents which happen to use the word ”the” more frequently, while the more meaningful terms ”red” and sock” will have a limited relevance. Still, ”the” is not a good term to distinguish between relevant and non-relevant documents and terms, on the contrary, less common words like ”red” and ”sock” may help more in our task. Therefore an inverse document frequency factor is in-corporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. Karen Sp¨arck Jones[44] gave a statistical inter-pretation to this idea, which became a milestone of term weight-ing, calling it Inverse Document Frequency (IDF): the specificity of a term is the number of documents to which it pertains. We should treat matches on non-frequent terms as more valuable than ones on frequent terms.
Once again, using the same parallelism proposed for TF, we can state that a feature that belongs to many items tends to have a high TF value for many items, while it could be less representative than other features. IDF for feature f can be calculated as
IDFf = log N P i∈Inf,i where N =P i∈I P
our whole ICM, and nf,i is the frequency of feature f in item i.
We have proposed the simplest argument for the logarithm, while in literature have been presented various arguments that can fit better in different situations.
TFC
In contrast with the probabilistic models of Information Retrieval, the TF-IDF model is not able to exhibit the relevance properties of the documents and it does not derive a theoretically valid term relevance weight[40]. However, this weight is not immediately computable with-out knowledge of the presence of terms in relevant and non relevant documents of the collection, and a number of methods that have been proposed for the estimation of the weight, have shown that under well defined conditions it can be reduced to an inverse document frequency factor.
Another weighting factor that standard TF-IDF does not exploit ex-plicitly is the length of the document. A short document will most likely be represented by a small number of terms, while a long docu-ment will most likely be represented by a large number of terms. In the second case, the chance to find matches between the query and the document is high, so large documents have a better chance of being retrieved. A normalization factor can help to solve this problem by equalizing the lengths of the documents.
Salton and Buckley[40] proposed an extension of common TF-IDF, called TFC, that tries to take into account all these issues by combin-ing different term weights. The final formula becomes
T F C = tf · log N n q P i∈d(tfi· log N ni) 2
where tf is the common term frequency, N is the total number of terms in the document collection, n is the number of documents to which the term is assigned and d is the document.
2.8.2 Approaches from Recommender Systems literature with-out machine learning
Many implementations of information retrieval literature have been adapted for recommendation systems, and some new approaches to
feature weighting have been studied appositely. Also, in the recent past, some machine learning techniques have been applied to find an alternative solution. The results are promising, but the learning algo-rithms are very delicate and need a fine tuning of parameters to get good performances.
FWUM
Feature-weighted User Model is a feature weighting approach proposed by Symeonidis, Nanopoulos, and Manolopoulos[45]. The algorithm is splitted in four different steps and tries to construct a feature weighted user model, which discloses the duality between users and features, based on Information Retrieval, including Term Frequency and Inverse Document Frequency weighting schemes in CF. They propose a new top-N generation list algorithm based on features frequency.
Content-based user profile The first step requires to construct a feature profile for a user from both user ratings and item features, exploiting the favourite ones of each user. For a user u, his profile is constructed as
P (u, f ) = X
∀i∈R(u)
ICM [i, f ]
where R(u) is the set of items rated positively by u and ICM is boolean. Therefore, P (u, f ) denotes the correlation between user u and feature f .
Feature weighting This step weights the features, in order to find those features which better describe user u and those which better dis-tinguish him from the others. The first set of features provides quantifi-cation of intra-user similarity, while the second set of features provides quantification of inter-user dissimilarity. This is done by adapting the concepts expressed by TF-IDF.
Intra-user similarity, referred as Feature Frequency, is quantified by measuring the frequency of each feature f for a user u, that, in this model, is simply represented by the user profile P (u, f ).
Inter-user dissimilarity, called Inverse User Frequency, is quantified by measuring the inverse of the frequency of a feature f among all users.
It can be calculated similarly to IDF as
IU F (F ) = log |U | U F (f )
where |U | is the cardinality of the users set (simply the number of users) and U F (f ) is the number of users that have a positive value for feature f in their user profile. The weighted value for feature f and user u is calculated as
W (u, f ) = F F (u, f ) · IU F (f )
User’s neighbourhood formation This model uses a User-User sim-ilarity approach based on user profiles. Cosine simsim-ilarity between cou-ples of users has been selected for this task, to take into account only the set of features correlated with both users. The similarity value between users u and v is calculated as
sim(u, v) = P ∀f ∈F W (u, f ) · W (v, f ) q P ∀f ∈FW (u, f )2 q P ∀f ∈F W (v, f )2
Top-N list generation Usually, the technique used for the generation of the top-N list counts the frequency of each positively rated item inside the found neighborhood, and recommends the N with the highest values. In FWUM the approach is different, because tries to take into account the fact that each user has his own reasons for rating an item, so the relevance of each item is calculated summing the frequency in the top N user neighborhood of each feature it is composed by. To better understand how the algorithm works, let’s make an example and assume that we have to recommend a top-1 list for user u2:
1. Get the nearest neighbours of u2: {u1, u4}
2. Get the set of items rated positively by the neighbourhood: {i1, i3, i5}
3. Get the features of each item in the set: i1 : {f2}, i3 : {f2, f3}, i5 :
{f1, f2, f3}
4. Find the frequency of each feature in the neighbourhood: f r(f1) =
1, f r(f2) = 3, f r(f3) = 2
5. For each item, find the weight in the neighbourhood by summing its features frequency: w(i1) = 3, w(i3) = 5, w(i5) = 6
Automated selection of informative content-based features
Ronen et al. proposed a solution to perform feature selection[38]. The idea is to compute a relevance score for each feature and then select the highest-scoring features, which are also the most informative. They distinguish between two types of features: attributes and labels. The attributes include meta-data features associated to items, like price or brand. The labels, or tags, are n-grams or descriptions that consumers, experts or text mining algorithms associate to items. They propose two different approaches to compute relevance scores for attributes and la-bels, but they are both based on the ratio between two variables b1
and b2. Considering a feature f , a similarity function between two
attributes or labels and defining Rk as the set of k relevant items
pro-duced by an implicit or explicit feedback recommender, we can state that:
• b1 is proportional to the similarity of f with respect to relevant
items according to Rk.
• b2 is a normalizer which is proportional to the similarity of f with
respect to random items.
The ratio b1b2 measures the normalized relevance of the feature with regard to recommended items.
2.8.3 Approaches from Recommender Systems literature with machine learning
LFW
Least-squares Feature Weighing is a weighting schema proposed by Leonardo Cella and Stefano Cereda[6]. It is a solution based on least squares (LSQ) optimization to automatic feature weighing. Automatic feature weighting aims at inferring the weights in w from a collaborative model so that, given a item-to-item similarity matrix S(CF ) ∈ R|I|×|I| computed with collaborative filtering, the optimal weight vector w∗ can be determined by solving a simple LSQ problem
argmin
w∗ ||S
(CF )− S(w)||2 F
where S(w) ∈ R|I|×|I|is the similarity matrix calculated using a weighted
weight vector w∗ in order to reconstruct the CF similarity matrix using weighted CBF similarity, and this is exactly the goal of LFW algorithm. Using a linear weighing scheme, we have a linear regression problem as
argmin w∗ X ∀i∈I X ∀j∈I/{i}
(s(CF )ij − hw, ICM [i] ICM [j]i)2
which can be efficiently solved analytically. The array of optimal weights can also be used to compute similarity of new items that have never been rated. The algorithm is able to weight features according to their collaborative relevance and generally outperforms fully content-based weighting schemes in both warm and cold item scenarios. The collaborative similarity matrix S(CF ) can be obtained with any CF algorithm. In the paper, SLIM has been chosen, since it has been ex-tensively proven to achieve state-of-the-art performance in many col-laborative tasks[6].
FBSM
Feature-based factorized Bilinear Similarity Model is a method pro-posed by Sharma, Zhou, Hu and Karypis[42]. It is an improvement of UFSM (User-specific Feature-based item-Similarity Models)[12], a sparse, high dimensional factor model that includes user preferences in a latent representation. The main difference is that FBSM tries to exploit the correlation among item features, task that UFSM was not able to accomplish. The rating for an item i for user u is given by
ru,i =
X
j∈R+u
s(i, j)
where R is the rating matrix, R+u is the set of items that user u liked and s(i, j) is the similarity function between items i and j calculated as
s(i, j) = fiTW fi
where W is the weight matrix that expresses the correlation among item features and fi and fj are the feature column vectors of items i
and j respectively. Every cell of W represents the importance of the correlation between the feature identified by row index and the fea-ture identified by column index. Note that if W is the identity matrix, then sim(i, j) is simply the cosine similarity. Note that the dimen-sion of W matrix grows quadratically with the number of features. It
follows that its estimation can become rapidly computationally infea-sible. Moreover, in high-dimensional datasets, sufficient training data is often not present for reliable estimation, leading to poor generaliza-tion by overfitting the training data. To overcome this problem, FBSM proposes to represent W as
W = D + VTV
where D is a diagonal matrix and V ∈ Rh×|F | is a latent factors matrix where each column represent the latent space of a feature and F is the set of features. Replacing this notation in the estimation of similarity we get s(i, j) = fiTW fi = s(i, j) = fiT(D + V TV )f i = d(fi fj)T + |F | X k=1 |F | X p=1 fikfjp(vTkvp) (2.2)
where d is the main diagonal of matrix D and vk is the latent
repre-sentation of feature k in matrix V . This model has two advantages with respect to the original one: the number of parameters is linear in the number of features, instead of quadratic, and allows to regular-ize diagonal feature weights and feature latent factors separately. To estimate the parameters, Stochastic Gradient Descent with Bayesian Personalized Ranking loss function has been chosen, since it is designed to better fit ranking problems. Note that, differently from LFW, this model uses ratings instead of item similarities for the learning proce-dure.
2.9
Introducing our work
We have seen different researches of recommender system and informa-tion retrieval literature, that try to deal with the item cold-start prob-lem by optimizing performances of content-based approaches, using feature weighting or its derived and more complex techniques. However any approach has its own points of strength and points of weakness. Well known algorithms, probabilistic or not, coming from information retrieval experience, like TF-IDF or TFC, are very simple to implement and compute, and can easily be adapted to fit recommender systems,