
POLITECNICO DI MILANO

DIPARTIMENTO DI ELETTRONICA, INFORMATICA E BIOINGEGNERIA
MASTER OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

A USER STUDY ON NOVELTY AND DIVERSITY IN RECOMMENDER SYSTEMS

Master Thesis of:

Filippo Pellizzari

Mat. 852700

Supervisor:

Prof. Paolo Cremonesi

Co-supervisor:

Dott. Maurizio Ferrari Dacrema


To my family and to all those who believed in me.


Abstract

Recommender Systems are software applications that support users in finding items of interest within larger collections of objects. Evaluation of Recommender Systems has traditionally focused on the performance of algorithms and the accuracy of predictions. However, many researchers have recently started investigating other evaluation criteria from users' perspective: such criteria, called beyond-accuracy metrics, may have a significant impact on the overall quality of recommendations. A major problem in beyond-accuracy research is the gap between the computed-offline metrics and the real user's perception. A powerful approach to properly solve this problem is conducting user studies, online experiments used to collect user feedback on the perceived qualities of recommendations.

In this work, we survey the state of the art of beyond-accuracy metrics and user studies. We design and implement a new webapp to set up user studies of interest. We conduct online experiments on the perceived novelty and diversity of recommendations, by testing simple content-based algorithms in the movie domain. We analyse the impact of the algorithms on users' perception. Finally, we compare the results of perceived novelty and diversity with some computed-offline metrics, in order to evaluate which one of them best correlates with the real users' perception.


Sommario

Recommender Systems are software applications that support users in finding items of interest within larger collections. The evaluation of Recommender Systems has traditionally focused on the performance of algorithms and the accuracy of predictions. However, many researchers have recently started investigating other evaluation criteria from the users' point of view: such criteria, also called "beyond-accuracy metrics", may have a significant impact on the overall quality of recommendations. One of the main problems in beyond-accuracy research is the gap between the metrics computed offline and the real perception of the user. A powerful approach to properly address this problem is the realization of user studies, online experiments used to collect users' feedback on the perceived qualities of recommendations.

In this work, we survey the state of the art of "beyond-accuracy metrics" and user studies. We design and implement a new web application to set up user studies of interest. We conduct online experiments to measure the perceived novelty and diversity of recommendations, testing simple content-based algorithms in the movie domain. We analyse the impact of the algorithms on users' perception. Finally, we compare the results of perceived novelty and diversity with some metrics computed offline, in order to evaluate which of them correlates best with the real perception of users.


Ringraziamenti

First of all, I would like to thank Prof. Paolo Cremonesi for believing in my abilities and for giving me the opportunity to test myself with this thesis work. Thanks to his valuable teachings and his experience, I was able to reach this goal with satisfaction. I also thank Dott. Maurizio Ferrari Dacrema for his help and support in writing this thesis, as well as for his energy and motivation, which allowed me to overcome every difficulty.

A special thanks goes to my family, in particular to my parents: without their support and encouragement, none of this would have been possible. I thank my friends, who have always believed in me. A special dedication goes to Andrea, Giorgio and Nicola, who shared this long university journey with me, with its everyday joys and sacrifices. Their company and the support they showed me make this achievement even more precious.

Filippo
Milan, 20 December 2018


Contents

Abstract
Sommario
Ringraziamenti
1 Introduction
  1.1 Mission and Contributions
  1.2 Contents Outline
2 State of the Art
  2.1 Accuracy metrics
    2.1.1 Predictive Accuracy Metrics
    2.1.2 Classification Accuracy Metrics
    2.1.3 Rank Accuracy Metrics
  2.2 Beyond-accuracy metrics
    2.2.1 Diversity
    2.2.2 Novelty
    2.2.3 Serendipity
    2.2.4 Coverage
    2.2.5 Measuring user's perception
  2.3 User Studies
    2.3.1 Survey Design
    2.3.2 Participants
    2.3.3 Questionnaire
    2.3.4 Statistical Evaluation
  2.4 Related Work
    2.4.1 Targeted user studies
    2.4.2 Multicriteria user studies
    2.4.3 Some observations
3 Web application
  3.1 Functionalities
    3.1.1 Authentication
    3.1.2 Catalogue
    3.1.3 Recommendations
    3.1.4 Survey
    3.1.5 Administration
  3.2 System Architecture and Technologies
    3.2.1 Third-party services
4 Experiments and Results
  4.1 Novelty Study
    4.1.1 Experiment
    4.1.2 Results
  4.2 Diversity Study
    4.2.1 Experiment
    4.2.2 Results
5 Conclusions
Bibliography

List of Figures

2.1 Screenshot of the music recommendation survey
2.2 Screenshot of Ekstrand's experiment interface
3.1 Novelty Survey authentication
3.2 Novelty Survey demographics
3.3 Novelty Survey catalogue
3.4 Novelty Survey search
3.5 TMDb official page
3.6 Novelty Survey survey
3.7 Novelty Survey final page
3.8 Novelty Survey admin page
3.9 Novelty Survey system architecture
4.1 Age of the subjects - Novelty Study
4.2 Countries of the subjects - Novelty Study
4.3 Perceived FON by age
4.4 Perceived SON by gender
4.5 Top-popular - OFF-NOV-1 vs OFF-NOV-2
4.6 Top-popular - OFF-NOV-1 vs OFF-NOV-3
4.7 Top-popular - OFF-NOV-1
4.8 Top-popular - OFF-NOV-2
4.9 Top-popular - OFF-NOV-3
4.10 Random - OFF-NOV-1
4.11 Random - OFF-NOV-2
4.12 Random - OFF-NOV-3
4.13 Age of the subjects - Diversity Study
4.14 Countries of the subjects - Diversity Study
4.15 Perceived ILD by age
4.16 Perceived ILD by gender
4.17 Top-popular - DIV-1
4.18 Top-popular - DIV-2
4.19 Top-popular - DIV-3
4.20 Random - DIV-1
4.21 Random - DIV-2
4.22 Random - DIV-3


Chapter 1

Introduction

The focus of recommender systems research has traditionally been the accuracy of predictions, in particular how closely the predicted ratings match the users' actual ratings. However, it has been recognized that other evaluation metrics (such as diversity, novelty and serendipity) may have a significant impact on the overall quality of a recommender system [21]: in the literature, they are called beyond-accuracy metrics.

A major problem in the beyond-accuracy research area is the gap between offline evaluation metrics and the user's perception of those metrics. In order to properly evaluate recommender systems in terms of beyond-accuracy qualities, real user interactions with the system must be collected: in these cases, user studies are essential. A user study is conducted by recruiting a certain number of participants and asking them to perform some tasks requiring an interaction with the recommender system. In many cases, participants can also provide qualitative feedback (e.g., by completing questionnaires). Online experiments are very powerful tools. However, few research works in recommender systems include experiments on user feedback, because there are many challenges, such as the recruitment of a sufficiently large user base, the choice of the application domain and the formulation of survey questions.

1.1 Mission and Contributions

The mission of this thesis is to enlarge the body of user studies on beyond-accuracy metrics, in particular the novelty and diversity of recommendations. In this thesis, we design and implement a new webapp to set up user studies, make online recommendations and collect user feedback. After analysing the impact of different recommendation algorithms on perceived qualities, we focus on the novelty and diversity of recommendations. The final goal is to compare our results with some novelty and diversity metrics that can be computed offline and observe which one of them best correlates with the real perception of users.


1.2 Contents Outline

This thesis is structured as follows:

• Chapter 2 provides an overview of accuracy and beyond-accuracy metrics and the state-of-the-art user studies in recommender systems research.
• Chapter 3 aims to present the webapp Novelty Survey, which we implemented in order to set up our user studies.
• Chapter 4 contains the description of our experiments and the obtained results.
• Chapter 5 draws the conclusions of this work.


Chapter 2

State of the Art

Recommender systems (RS) are software applications that support users in finding items of interest within larger collections of objects, often in a personalized way. Today, such systems are used in a variety of application domains, including for example movies, music, books, jobs and e-commerce. Receiving automated recommendations of different forms has become a part of our daily online user experience.

Evaluation of RS performance is inherently difficult for several reasons [15]. First, different algorithms may be better or worse on different datasets. Some algorithms have been designed specifically for datasets where there are many more users than items (e.g., the MovieLens dataset has 65000 users and 5000 movies). Another reason is that the goals for which an algorithm has been developed may vary. The most common task for RS is to find good items. This means that a good algorithm should be able to predict the rating that a user would give to an item, in order to recommend only the n best items for each user. For this reason, most of the evaluations focus on RS accuracy. An accuracy metric empirically measures how closely a recommender's predicted ranking of items for a user matches the user's true ranking of preference.

There is an emerging understanding that good recommendation accuracy, alone, does not give users of RS an effective and satisfying experience [15]. RS must provide not just accuracy, but also usefulness. For instance, a recommender might achieve high accuracy by only computing predictions for easy-to-predict items, but those are the very items for which users are least likely to need predictions (e.g., top popular items). For this reason, beyond-accuracy metrics are also needed. Since users' perception of these metrics is important, user studies (i.e., online experiments involving users) are essential to collect this type of data.

In this chapter we focus on some well-known accuracy metrics. Then, we present the definitions of the most important beyond-accuracy metrics in RS research and briefly describe the general aspects of user studies. Finally, we provide an overview of the most significant user studies investigating beyond-accuracy metrics in the literature.


2.1 Accuracy metrics

Prediction accuracy is the most discussed and most widely used metric in the RS literature [27, 15]. The idea comes from the assumption that a user, examining all items available, could place them in a preference order. An accuracy metric empirically measures the difference between the RS predicted ranking of items and the user's true order of preference. High accuracy means a low difference, low accuracy means a high difference. Using various notions of difference, three distinct classes can be defined [15]:

• Predictive Accuracy Metrics: they measure how close the RS predicted ratings are to the true user ratings.

• Classification Accuracy Metrics: they measure the frequency with which a recommender makes correct or incorrect decisions about whether an item will be rated or will have an interaction with a user.

• Rank Accuracy Metrics: they measure the ability of a recommendation algorithm to produce an ordered list of items to recommend that matches the order of preference that a user would have given to the same items.

2.1.1 Predictive Accuracy Metrics

Predictive accuracy metrics measure how close the RS predicted ratings are to the true user ratings. For example, the MovieLens movie recommender predicts the number of stars that a user will give each movie and displays that prediction to the user.

A popular metric used in evaluating the accuracy of predicted ratings is Mean Absolute Error (MAE). The system generates predicted ratings p_{ui} for a test set T of user-item pairs (u, i) for which the true ratings r_{ui} are known. MAE measures the average absolute deviation between a predicted rating and the user's true rating:

MAE = \frac{\sum_{(u,i) \in T} |p_{ui} - r_{ui}|}{|T|}    (2.1)

A popular alternative is Root Mean Squared Error (RMSE), which penalizes large errors:

RMSE = \sqrt{\frac{\sum_{(u,i) \in T} (p_{ui} - r_{ui})^2}{|T|}}    (2.2)

Predictive accuracy metrics may be less appropriate when a ranked result is returned to the user, who then only views items at the top of the ranking. For these tasks, users may only care about errors in items that are ranked high, or that should be ranked high. Furthermore, these types of metrics are not well suited when the granularity of the rating scale is small, since errors will only affect the task if they result in erroneously classifying a good item as a bad one or vice versa; for example, if 3.5 stars is the cut-off between good and bad, then a one-star error that predicts a 4 as 5 (or a 3 as 2) makes no difference to the user. Predictive accuracy metrics are particularly important if the predicted rating has to be shown to the user, which is not a very common case in recommender systems. In practice, they are no longer used to measure the accuracy of recommendations.
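For illustration only (this snippet is not part of the thesis), Equations 2.1 and 2.2 can be computed in a few lines of Python from a list of (predicted, true) rating pairs:

import math

def mae(pairs):
    # Mean Absolute Error over (predicted, true) rating pairs (Eq. 2.1).
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def rmse(pairs):
    # Root Mean Squared Error over (predicted, true) rating pairs (Eq. 2.2).
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

# Toy test set T of (p_ui, r_ui) values, purely illustrative.
T = [(4.5, 5.0), (3.0, 2.0), (4.0, 4.0)]
print(mae(T))   # 0.5
print(rmse(T))  # ~0.645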

2.1.2 Classification Accuracy Metrics

The role of a classification metric is to express how well a system is able to classify items correctly. The performance of an algorithm is usually visualized in a 2 x 2 table called confusion matrix [27].

Table 2.1: Confusion matrix

                        Recommended        Not Recommended
Actually Used           True-Positive      False-Negative
Actually Not Used       False-Positive     True-Negative

As shown in Table 2.1, we have four possible outcomes for the recommended items.

• True-Positive (tp): the system predicted correctly, judging a used item as "relevant".
• False-Positive (fp): the system predicted incorrectly, judging an unused item as "relevant".
• True-Negative (tn): the system predicted correctly, judging an unused item as "not relevant".
• False-Negative (fn): the system predicted incorrectly, judging a used item as "not relevant".

Combining the values contained in the confusion matrix we can summarize the performance of our algorithm in single values. Two of the most popular classification accuracy metrics are Precision and Recall.

Precision is the ratio between the number of items correctly classified as positive by the system and the number of all items classified as positive, whether correctly or not:

Precision = \frac{tp}{tp + fp}    (2.3)

It is the probability that a recommended item is really relevant. Its value will always be in the range [0, 1].

Recall is the ratio between the number of items correctly classified as positive by the system and the number of all items that are relevant:

Recall = \frac{tp}{tp + fn}    (2.4)

It represents the probability that a relevant item is recommended. Its value will always be in the range [0, 1].

Precision and Recall can be combined in a single quantity, called F-measure:

F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (2.5)
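As a small, non-thesis illustration, Precision, Recall and the F-measure can be computed directly from a recommendation list and the set of relevant items:

def precision_recall_f(recommended, relevant):
    # Precision, Recall and F-measure (Eqs. 2.3-2.5) for one user.
    # recommended: list of recommended item ids; relevant: set of items the user actually used/liked.
    rec = set(recommended)
    tp = len(rec & relevant)              # recommended and relevant
    fp = len(rec - relevant)              # recommended but not relevant
    fn = len(relevant - rec)              # relevant but not recommended
    precision = tp / (tp + fp) if rec else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f

print(precision_recall_f([1, 2, 3, 4], {2, 4, 7}))  # (0.5, 0.667, 0.571)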


2.1.3 Rank Accuracy Metrics

Rank accuracy metrics measure the ability of a recommendation algorithm to produce an ordered list of items that matches how the user would have ordered the same items [15]. This kind of metric is more appropriate to evaluate algorithms in domains where the system has to provide a ranked list of recommendations. The idea is to measure the ability of the algorithm to distinguish between the "best" alternatives and just "good" alternatives, in domains where such a distinction is important.

Mean Average Precision

Mean Average Precision (MAP) is considered one of the most important metrics in the recommender systems literature. It is the mean, over all users, of the Average Precision scores of their recommendation lists, truncated at a specified length n [22]. Average Precision is the average of the precision values obtained at the positions of the relevant items retrieved within the top n. The formula to calculate the Average Precision at n is:

AP@n = \frac{\sum_{k=1}^{n} Precision@k \cdot Rel(k)}{M}    (2.6)

where Rel(k) is 1 if the item at rank k is relevant, 0 otherwise, and M is the number of relevant items in the list.

Finally, MAP is defined as follows:

MAP@n = \frac{\sum_{u \in U} AP_u@n}{|U|}    (2.7)

where U denotes the users in the test set.

Mean Reciprocal Rank

Mean Reciprocal Rank (MRR) is a well-known metric in information retrieval [34]. This metric evaluates the results of a recommender ordered by probability of correctness. It is defined as follows:

MRR = \frac{\sum_{u \in U} \frac{1}{rank_u}}{|U|}    (2.8)

where rank_u is the position of the first relevant retrieved answer for user u. If no relevant answers are retrieved, the value of 1/rank_u is taken to be 0.
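The following sketch (not from the thesis) shows one possible way to compute AP@n, MAP@n and MRR for binary relevance judgments; as in the definitions above, M is taken as the number of relevant items:

def average_precision_at_n(ranked, relevant, n):
    # AP@n (Eq. 2.6): ranked is an ordered list of item ids, relevant a set.
    hits, precision_sum = 0, 0.0
    for k, item in enumerate(ranked[:n], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k     # Precision@k at each relevant rank
    m = len(relevant)
    return precision_sum / m if m > 0 else 0.0

def map_at_n(rankings, relevants, n):
    # MAP@n (Eq. 2.7): mean of AP@n over all users.
    return sum(average_precision_at_n(rankings[u], relevants[u], n)
               for u in rankings) / len(rankings)

def mrr(rankings, relevants):
    # MRR (Eq. 2.8): mean reciprocal rank of the first relevant item (0 if none).
    total = 0.0
    for u, ranked in rankings.items():
        for k, item in enumerate(ranked, start=1):
            if item in relevants[u]:
                total += 1.0 / k
                break
    return total / len(rankings)

rankings = {"u1": ["A", "B", "C"], "u2": ["D", "A", "E"]}
relevants = {"u1": {"B"}, "u2": {"E", "F"}}
print(map_at_n(rankings, relevants, 3), mrr(rankings, relevants))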

2.2 Beyond-accuracy metrics

In this section, we provide the definition of the main beyond-accuracy metrics and the approaches proposed in RS literature to measure them in offline experiments.


2.2.1 Diversity

The notion of diversity originates from ideas in information retrieval (IR) research. In IR literature, it has been discovered that the value of a list of retrieved documents is influenced not only by the documents similarity to the query (relevance), but also by the redundancy of such documents [4]. Ensuring that the list of retrieved documents covers a broad area of information increases user’s satisfaction [6].

In RS research, diversity is generally related to how different the items in a recommendation list are for the user. One of the most common ways to measure diversity was suggested by Smyth and McClave [30]. Given a recommendation list R, diversity is computed as the average pairwise distance between items in the list:

diversity(R) = \frac{\sum_{i \in R} \sum_{j \in R \setminus \{i\}} dist(i, j)}{|R| (|R| - 1)}    (2.9)

where dist(i, j) is some distance function defined between items i and j. For instance, the distance has been measured using the Hamming distance [17], the complement of Pearson correlation [33], or the complement of cosine similarity [25]. Similarly, Ziegler et al. [37] defined the intra-list similarity metric as the aggregate pairwise similarity of items in the list, with higher scores denoting lower diversity of the list:

ILS(R) = \frac{\sum_{i \in R} \sum_{j \in R \setminus \{i\}} sim(i, j)}{2}    (2.10)
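As a minimal sketch of Equation 2.9 (not thesis code), intra-list diversity can be computed from any pairwise distance; here the complement of cosine similarity over item feature vectors, an assumed item representation:

import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def intra_list_diversity(features):
    # Average pairwise distance over a recommendation list (Eq. 2.9).
    n = len(features)
    if n < 2:
        return 0.0
    total = sum(cosine_distance(features[i], features[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Illustrative genre-indicator vectors for a list of three movies.
print(intra_list_diversity([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))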

Diversity metrics based on similarity were criticized by Vargas et al. [32]. They argued that the user's perception of diversity is strongly influenced by item genres. Moreover, they claimed that the optimal distribution of genres (in terms of diversity) is achieved when sampling items randomly. Therefore, they proposed a "binomial diversity" metric that captures how closely the genre distribution in the item list matches the distribution that would be obtained by randomly sampling items from the dataset. Such a metric should also take into account coverage, redundancy and the size of recommendations. In other words, the diversity metric should reflect how well a list of items covers the genres a user is interested in and how well genre redundancies are avoided. Moreover, it should be sensitive to the size of the recommendation list, since coverage and redundancy need to be treated differently for lists of different length.

Other diversity metrics can be found in the RS literature. For instance, Lathia et al. [20] defined temporal diversity as the normalized set-theoretic difference between lists of size N received by a user at two different time points (t1 < t2):

diversity(R_1, R_2, N) = \frac{|R_2 \setminus R_1|}{N}    (2.11)

where R_1 is the recommendation list at time t1 and R_2 is the recommendation list at time t2. They found that users with a large profile size (i.e., number of ratings per user) suffer from lower diversity, while those who rate a lot of content in one session are likely to see very diverse results the next time.


Adomavicius and Kwon [2] used three other, more sophisticated metrics for diversity: Shannon entropy [28], the Gini coefficient (a commonly used measure of wealth distribution inequality) [13] and the Herfindahl index [14]. In particular, these metrics provide different ways of measuring the distributional dispersion of recommended items across all users, by showing the degree to which recommendations are concentrated on a few popular items (i.e., low diversity) or are more equally spread out across all candidate items (i.e., high diversity). They are calculated as follows:

Entropy-diversity = - \sum_{i=1}^{N} \left( \frac{rec(i)}{total} \right) \cdot \ln \left( \frac{rec(i)}{total} \right)    (2.12)

Gini-diversity = 2 \sum_{i=1}^{N} \left( \frac{N + 1 - i}{N + 1} \right) \cdot \frac{rec(i)}{total}    (2.13)

Herfindahl-diversity = 1 - \sum_{i=1}^{N} \left( \frac{rec(i)}{total} \right)^2    (2.14)

where rec(i) is the number of users who got recommended item i, N is the total number of available items and total is the total number of recommendations made across all users.
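A small illustrative script (not from the thesis) for Equations 2.12-2.14; for the Gini formulation it assumes items are indexed in increasing order of recommendation frequency, an ordering the text does not state explicitly:

import math

def aggregate_diversity_metrics(rec_counts):
    # Entropy-, Gini- and Herfindahl-diversity (Eqs. 2.12-2.14).
    # rec_counts: list with rec(i) for every catalogue item i (0 if never recommended).
    total = sum(rec_counts)
    n = len(rec_counts)
    shares = [c / total for c in rec_counts]
    entropy = -sum(s * math.log(s) for s in shares if s > 0)
    ranked = sorted(shares)  # least recommended first (assumed ordering)
    gini = 2 * sum(((n + 1 - (i + 1)) / (n + 1)) * s for i, s in enumerate(ranked))
    herfindahl = 1 - sum(s * s for s in shares)
    return entropy, gini, herfindahl

print(aggregate_diversity_metrics([50, 30, 15, 5, 0]))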

Finally, diversity has also been defined as the average pairwise distance between the recommendation lists generated for different users (inter-user diversity) [36]. Given two users u and v, the inter-list distance between their recommendation lists of size N is measured as follows:

d(u, v) = 1 - \frac{|R_u \cap R_v|}{N}    (2.15)

Averaging over all pairs of users (M), we obtain the inter-user diversity metric:

IUD = \frac{\sum_{(u,v) \in U} d(u, v)}{M}    (2.16)

Inter-user diversity (IUD) is often compared with intra-list similarity (ILS, see 2.10). High IUD in fact does not guarantee low ILS, and vice versa. For instance, the ILS of a top-popular recommendation list is typically low (i.e., high intra-list diversity), but its IUD can be very low.
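A brief, non-thesis sketch of Equations 2.15-2.16, given a dictionary mapping each user to a recommendation list of fixed size N:

from itertools import combinations

def inter_user_diversity(rec_lists):
    # Average pairwise inter-list distance (Eqs. 2.15-2.16).
    pairs = list(combinations(rec_lists, 2))
    def d(u, v):
        ru, rv = set(rec_lists[u]), set(rec_lists[v])
        return 1.0 - len(ru & rv) / len(rec_lists[u])
    return sum(d(u, v) for u, v in pairs) / len(pairs)

print(inter_user_diversity({"u1": [1, 2, 3], "u2": [1, 4, 5], "u3": [6, 7, 8]}))  # ~0.89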

2.2.2 Novelty

Novelty can be seen as the ability of a recommender to introduce users to items that they have never experienced before. Definitions of novelty in the RS literature typically focus on two aspects:

• Unknown: the item is unknown to the user.


• Dissimilarity: the item is dissimilar to the items in the user's profile (i.e., previously seen items).

The quality of an item being unknown is not trivial to define formally. The fact that an item is unrated does not necessarily imply that it is unknown, because a user rarely provides ratings for all the known items of the catalogue. Therefore, without the user's feedback, it is impossible to know whether an unrated item is really novel. A common solution is to approximate an item's novelty with its global unpopularity among users: in this way, novelty can be measured offline, without conducting costly online experiments. Formally, an item's novelty is defined through a self-information formulation [36], in order to give more importance to very rare items:

novelty(i) = - \log_2 pop(i)    (2.17)

where pop(i) is the popularity of item i, measured as the percentage of users who rated i. The novelty of a recommendation list R is then typically computed as follows:

novelty(R) = \frac{\sum_{i \in R} - \log_2 pop(i)}{|R|}    (2.18)

Given the previous definition, novel items are identified with the long-tail items: the part of the catalogue seen by a small part of the user base [3].
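An illustrative snippet (not from the thesis) for Equations 2.17-2.18, with item popularity estimated from an interaction log; the eps guard for items with zero observed popularity is an added assumption:

import math
from collections import defaultdict

def item_popularity(interactions, n_users):
    # pop(i): fraction of users who interacted with item i.
    users_per_item = defaultdict(set)
    for user, item in interactions:
        users_per_item[item].add(user)
    return {item: len(users) / n_users for item, users in users_per_item.items()}

def list_novelty(rec_list, pop, eps=1e-6):
    # Average self-information of a recommendation list (Eq. 2.18).
    return sum(-math.log2(max(pop.get(i, 0.0), eps)) for i in rec_list) / len(rec_list)

interactions = [("u1", "A"), ("u2", "A"), ("u1", "B"), ("u3", "C")]
pop = item_popularity(interactions, n_users=3)
print(list_novelty(["A", "C"], pop))  # ~1.08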

Other works also focused on the dissimilarity aspect of novelty. For instance, Vargas and Castells [33] considered a similarity-based model where item novelty is defined by a distance function between the item and a context θ of experience. They found two useful instantiations of θ:

• The set of items the user has interacted with (i.e., the user profile).
• The set R of recommended items.

Given a set of items θ, the novelty of item i is formulated as the expected distance between the item and the set:

novelty(i|θ) = \sum_{j \in θ} p(j|choose, θ, i) \, dist(i, j)    (2.19)

where p(j|choose, θ, i) is the probability that the user chooses item j in the context θ, having already chosen i. The distance function dist can be defined as the complement of some similarity measure (Pearson correlation, cosine, etc.).

2.2.3 Serendipity

Serendipity is an RS quality that is closely related to novelty. In the RS literature, Herlocker et al. [15] informally defined a serendipitous recommendation as one that "helps the user find a surprisingly interesting item he might not have otherwise discovered". Based on this definition, the two important aspects related to serendipity are:

• Unexpectedness: the item is surprising, i.e., not expected by the user.
• Usefulness: the item is interesting, relevant and useful to the user.

Experimental studies of serendipity are very rare, because this objective is difficult to explain and to measure. A common practice in offline experiments is the one first proposed by Murakami et al. [23]. This approach consists in comparing the list of items generated by the actual recommender with the recommendations produced by a primitive prediction model (i.e., any model that is not optimized for serendipity, for example a popularity-based model), in order to compute the unexpected items. Given a set of recommendations R, the set of unexpected recommendations is obtained by subtracting from R the items that are recommended by the primitive prediction model PM:

R_{unexp} = R \setminus PM    (2.20)

Following this idea, Ge et al. [11] proposed a formulation that combines the unexpectedness and usefulness aspects of serendipity:

serendipity(R) = \frac{|R_{unexp} \cap R_{useful}|}{|R|}    (2.21)

where R is the set of recommendations generated by the actual recommender, R_{unexp} is obtained by 2.20 and R_{useful} is the subset of items in R that are useful for the user. The usefulness of recommendations may be judged by the user in online experiments or approximated by the user's ratings for the items in an offline setting: for example, Adamopoulos et al. consider an item to be useful if its average rating is greater than 3.0 [1]. A limitation of this comparative approach could be the choice of the primitive prediction model.
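A hedged, non-thesis sketch of Equations 2.20-2.21, using average ratings as the usefulness proxy with the 3.0 threshold from the Adamopoulos et al. example cited above:

def serendipity(recommended, primitive, avg_rating, threshold=3.0):
    # Serendipity of a recommendation list (Eqs. 2.20-2.21).
    # recommended: items from the actual recommender (R); primitive: items from PM;
    # avg_rating: dict item -> average rating, used here as a usefulness proxy.
    R = set(recommended)
    unexpected = R - set(primitive)                       # Eq. 2.20
    useful = {i for i in R if avg_rating.get(i, 0.0) > threshold}
    return len(unexpected & useful) / len(R)              # Eq. 2.21

ratings = {"A": 4.2, "B": 2.5, "C": 3.8, "D": 4.9}
print(serendipity(["A", "B", "C", "D"], {"A", "B"}, ratings))  # 0.5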

2.2.4 Coverage

The coverage of a recommender is a measure of the domain of items over which the system can make recommendations [15]. In the RS literature, the coverage was mainly associated with two concepts:

• Prediction coverage: the percentage of items for which the system is able to generate a recommendation:

coverage = \frac{|I_p|}{|I|}    (2.22)

where I is the set of all available items and I_p is the set of items for which a prediction can be made.

• Catalogue coverage: the percentage of the available items which are effectively ever recommended to a user:

coverage = \frac{|\bigcup_{i=1}^{N} I_R^i|}{|I|}    (2.23)

where I is the set of all available items and I_R^i is the set of items contained in the i-th recommendation list presented to a user.
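A short illustrative computation of catalogue coverage (Equation 2.23), not taken from the thesis:

def catalogue_coverage(rec_lists, catalogue):
    # Fraction of the catalogue ever recommended to some user (Eq. 2.23).
    recommended = set()
    for rec in rec_lists:
        recommended.update(rec)
    return len(recommended & catalogue) / len(catalogue)

catalogue = {"A", "B", "C", "D", "E"}
print(catalogue_coverage([["A", "B"], ["B", "C"]], catalogue))  # 0.6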


Coverage is related to the previously seen metrics:

• Diversity: Adomavicius and Kwon [2] discussed the difference between diversity (which they call individual diversity, i.e., the diversity of a recommendation list presented to a single user) and coverage (which they call aggregate diversity, i.e., the range of items recommended across all system users). They argued that high diversity does not imply high coverage. For instance, if different users are recommended the same diverse set of items, the average diversity of the system will be high, but the coverage will remain low.

• Novelty: intuitively, a high coverage of the item catalogue requires recommending the long-tail items to users, which corresponds to a high average novelty (see 2.18).

• Serendipity: Ge et al. [11] briefly discussed the relation between coverage and serendipity. They argued that high serendipity implies high coverage, but an increase in coverage will not necessarily improve serendipity. The authors, however, offered neither experiments nor mathematical proofs to support this hypothesis.

2.2.5 Measuring user’s perception

Any evaluation of RS not involving user feedback is limited in terms of reliability. For instance, without asking the end-user of the system, it is not evident that an item that was shown to be serendipitous by some metric will be perceived as such by the user.

Of the beyond-accuracy metrics seen before, diversity and novelty are the ones most frequently investigated in user studies. Serendipity has been reported to be difficult to explain to users [26] or has been left out of studies as being too similar to novelty [24]. Coverage is not measured in user studies since it is defined at the level of the system and is not directly related to individual user experiences. Besides perceived accuracy, diversity and novelty, user studies often also measure user satisfaction with the system. Satisfaction is a concept that is easy to understand and can be considered a higher-level quality influenced by many other perceived qualities, such as relevance, diversity and novelty.

2.3 User Studies

In this section, we present the general aspects to consider while conducting online experiments involving users.

2.3.1 Survey Design

There are essentially two ways in which participants can be assigned to experimental conditions [18, 27] (in the RS field, an experimental condition corresponds to a particular setting of the recommender system that a group of participants has to test):


• In a between-subjects experiment, participants are randomly assigned to one of the experimental conditions. Such experiments are more realistic, because users of real systems usually see a single version of the system. On the other hand, they are more expensive, because they need more people to test all the experimental conditions.

• In a within-subjects experiment, participants experience all conditions at the same time. The advantage of this method is that it can detect differences between conditions more precisely. However, in these types of studies users are more conscious of the experiment and are far from a realistic usage scenario, in particular when there are more than two conditions.

2.3.2 Participants

Finding participants to take part in the experiment is the most time-consuming aspect. Participant recruitment involves a tradeoff between gathering a large enough sample for statistical evaluation and gathering a sample that accurately reflects the characteristics of the target population. Increasing the number of participants increases the statistical power (i.e., the likelihood of detecting an effect of a certain size) of the experiment. To determine the required sample size, researchers should perform a power analysis [7, 10] using an estimate of the expected effect size of the hypothesized effects and an adequate power level (usually 85%). In recommender systems research, manipulations typically have small effects (causing differences of about 0.2 - 0.3 standard deviations in the dependent variables) and occasionally medium-sized effects (differences of around 0.5 standard deviations) [18]. To detect a small effect with a power of 85% in a between-subjects experiment, 201 participants are needed per experimental condition. To detect a medium-sized effect, 73 participants are needed per condition. Within-subjects experiments need far fewer participants: 102 to detect small effects, and 38 to test medium-sized effects [18].
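For instance (not part of the thesis), the sample sizes quoted above can be reproduced approximately with a standard power analysis in statsmodels, using an effect size of d = 0.3, the upper end of the "small" range mentioned above:

from statsmodels.stats.power import TTestIndPower, TTestPower

# Between-subjects: independent-samples t-test, power 85%, alpha 0.05.
n_between = TTestIndPower().solve_power(effect_size=0.3, power=0.85, alpha=0.05)
# Within-subjects: paired t-test on the differences, same settings.
n_within = TTestPower().solve_power(effect_size=0.3, power=0.85, alpha=0.05)

print(round(n_between))  # about 200 per condition (the text quotes 201)
print(round(n_within))   # about 102 participants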

The quality of the user base is also important: the sample should be as unbiased as possible in order to simulate the real-world situation. For example, it is better to choose participants who have no particular connection to the researcher or the experiment. Moreover, it is generally better to give a generic description of the study to avoid bias: the description should focus on the task rather than the purpose of the study [18].

Finally, exploiting crowdsourcing platforms could be very convenient to motivate a larger group of people to take part in the experiment, although it is sometimes difficult to obtain a geographically balanced sample (e.g., 90% of Amazon Mechanical Turk workers are from the USA and India). Moreover, there is no guarantee that the results of crowdsourced work will be of sufficient quality. It sometimes happens that crowdsourced users try to cheat: quality checks and feedback can help to prevent this kind of problem.


2.3.3 Questionnaire

A fundamental part of the user study is the questionnaire. A set of questions can provide valuable information about user experience and system properties that are difficult to measure. The rating scale of the answers (e.g., a 5-point scale from "strongly disagree" to "strongly agree") and the order of the questions are important variables to take into account. General tips for designing good questions are provided by [18, 27]:

• Ask neutral questions, that do not suggest a "correct" answer.
• Use simple words and short sentences. Avoid technical terms.

• Avoid double-barreled questions. Each question should measure only one thing at a time: for example, if a participant found the system fun but not very useful, she would find it hard to answer the question "The system was useful and fun".

2.3.4 Statistical Evaluation

Another important step of user studies is to formulate some questions of interest (e.g., "Is Algorithm A better than Algorithm B?") and, starting from them, formulate some hypotheses (e.g., "Algorithm A is better than Algorithm B"). Once the data have been collected, it is important to statistically test the formulated hypotheses and verify that the results are statistically significant (i.e., not due to chance). A standard tool for significance testing is the p-value, which is the probability of an observed result assuming that the null hypothesis (e.g., "Algorithm A is not better than Algorithm B") is true. In statistics, the null hypothesis is the default hypothesis, which researchers try to reject. A smaller p-value signifies more evidence against the null hypothesis: when the p-value is lower than a predefined threshold (the significance level, commonly set to 0.05), the null hypothesis can be rejected and the obtained results are statistically significant.

Most researchers perform piecewise tests of their hypotheses, which means that they perform a separate test of each dependent variable [18]. Some of the most used solutions are the following:

• The difference between two experimental conditions can be tested with a t-test (the test statistic follows a Student's t-distribution under the null hypothesis). For between-subjects experiments, an independent (2-sample) t-test should be used; for within-subjects experiments, a paired (1-sample) t-test should be used. The main outcome of a t-test is the t-statistic and its p-value; a smaller p-value signifies more evidence against the null hypothesis.

• The difference between more than two conditions can be tested with an ANOVA (Analysis of Variance) F-test. The ANOVA test produces an F-statistic; its p-value signifies evidence against the null hypothesis that the dependent variable has the same value in all conditions.


• The non-parametric Cochran's Q test is used for testing 2 or more experimental conditions, where a binary response (e.g., 0 or 1) is recorded from each condition within each subject (i.e., participant). Cochran's Q tests the null hypothesis that the proportion of "successes" is the same in all conditions versus the alternative hypothesis that the proportion is different in at least one of the conditions. When Cochran's Q test is computed with only 2 conditions, the results are equivalent to those obtained from the McNemar test [29].

Another widely used statistical test is Structural Equation Modelling (SEM) [24, 19]. SEM is an integrative statistical procedure, because it tests all hypotheses (known as the structural model, or path model) at the same time.
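As a non-thesis illustration, the t-test and the ANOVA F-test map directly onto scipy; all numbers below are made up:

import numpy as np
from scipy import stats

# Ratings of the same dependent variable under two conditions (illustrative).
cond_a = np.array([3.1, 4.0, 3.5, 4.2, 3.8])
cond_b = np.array([3.9, 4.4, 4.1, 4.6, 4.0])

print(stats.ttest_ind(cond_a, cond_b))  # independent t-test (between-subjects)
print(stats.ttest_rel(cond_a, cond_b))  # paired t-test (within-subjects)

# One-way ANOVA F-test for more than two conditions.
cond_c = np.array([3.0, 3.4, 3.2, 3.6, 3.1])
print(stats.f_oneway(cond_a, cond_b, cond_c))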

2.4 Related Work

In this section, we provide an overview of the most significant user studies in RS literature. Research works on user studies can be split into two categories:

• Targeted user studies: used to analyse the impact of the algorithm or user interface modifications on a specific recommendation quality (e.g. novelty, diversity) perceived by users.

• Multicriteria user studies: used to measure relationships between different qualities.

2.4.1 Targeted user studies

Ziegler et al. [37] evaluated their diversification algorithm on the Book Crossing dataset. Given a list of ten book recommendations, each user was asked to complete a survey related to the perceived diversity and satisfaction. The users were randomly assigned to a user-based or item-based collaborative filtering (CF) recommender. The results revealed that small changes in the recommendation list positively influenced user satisfaction with the item-based CF recommender. In the case of the user-based CF recommender, the results showed no particular correlations.

Celma [5] evaluated the users' perception of accuracy and novelty in a music recommender (288 Last.fm users, Figure 2.1). The study was based on three algorithms: content-based, item-based CF and a hybrid combination of the two. The results of the experiment showed that the perceived accuracy was higher using the item-based CF algorithm, while the content-based approach produced the most novel recommendations according to the users' perception.

Figure 2.1: Screenshot of the music recommendation survey

Hu and Pu [16] conducted a within-subjects user study with 20 participants to compare a standard list interface with an organization-based interface (i.e., an interface in which recommendations are grouped into categories). The users provided feedback regarding the perceived categorical diversity (i.e., items being of different kinds) and item-to-item diversity (i.e., items being dissimilar to each other), as well as the perceived ease of use and usefulness of the system. The results showed no significant difference in perceived item-to-item diversity, while the organization-based interface had a positive influence on the perceived categorical diversity, ease of use and usefulness of the system.

Ge et al. [12] analysed the relationship between the placement of the items within a recommendation list and the perceived diversity. The authors conducted a user study with 52 participants. The users were asked to evaluate the diversity of precomputed movie lists; the same lists were displayed to all the users. Each list contained movies from one genre, with a small number of diverse items (movies with different genre) in a particular placement. The study showed that inserting all the diverse items at the end of the list led to higher perceived diversity than placing them in the middle of the list.

Zhang et al. [35] performed a small-scale user study (21 participants) to evaluate a serendipity-enhancing recommender for Last.fm music. Participants were asked to evaluate two recommendation lists generated by a baseline recommender and the serendipity-enhancing version of the system. The perceived accuracy and serendipity of the recommendations were measured on a 5-point scale. The study results showed that the serendipity-enhancing system was perceived as more serendipitous, but less accurate. However, users’ satisfaction was lower with the baseline recommender, therefore users were happy to sacrifice accuracy for serendipity.


2.4.2 Multicriteria user studies

Pu et al. [24] conducted a user study to determine the criteria that influence a recommender system's perceived usefulness. They launched a large-scale survey among 239 participants. Questions were related to the quality of online recommendation services such as Amazon, Youtube and IMDb. Of the beyond-accuracy metrics, serendipity was discarded, as it was considered too similar to novelty. Analysing the correlations between answers, the authors validated a model consisting of 32 criteria (e.g., "The recommender system helped me discover new products") grouped into 15 categories (e.g., Novelty, Diversity, etc.). The result of the user study showed that the perceived usefulness of a recommender system was strongly influenced by the perceived accuracy and novelty, and less by the perceived diversity.

Knijnenburg et al. [19] proposed a framework for evaluating users' experience of recommender systems. The model was composed of a set of concepts including objective system aspects (e.g., algorithms, user interface features) and personal characteristics (e.g., demographics, expertise) that influence the user experience (e.g., the perceived diversity). In order to investigate the relationships between these framework components, the authors conducted a series of experiments. One of them, the diversification experiment, was conducted online, using Amazon's Mechanical Turk as a crowdsourcing platform, involving 137 participants, mainly from the US and India. Participants were asked to rate at least ten items from the MovieLens 1M dataset, after which they would receive a list of ten recommendations. The composition of the list was computed by independently varying the algorithm (most popular, kNN or Matrix Factorization) and the level of diversification between subjects. Diversification was based on movie genre and implemented using the greedy reranking approach [37]. Participants were asked to choose one movie from the list of recommendations and to complete the user experience survey. The result showed that diversification was perceived differently for different algorithms; for example, non-diversified kNN and MF recommendations were perceived as more accurate than the non-diversified most popular recommendations. In general, the authors observed that the perceived diversity was positively correlated with the perceived accuracy and the overall satisfaction with the system.

Ekstrand et al. [9] adapted the questions used by Knijnenburg for a comparative user study, involving 1052 users of the MovieLens recommender system (Figure 2.2). Participants were asked to compare pairs of movie recommendation lists and answer questions regarding perceived qualities (accuracy, diversity, novelty and so on). The authors were interested in detecting differences between algorithms, therefore they chose the within-subjects design for the survey. Three collaborative filtering algorithms were used: item-based CF, user-based CF and SVD. For each user, two out of three algorithms were randomly selected. To address the problem of unfamiliar items, the authors limited the set of recommendable movies to the 2500 most popular ones. The result of the study showed that the users were less satisfied with the user-based CF algorithm. Furthermore, the overall satisfaction was positively influenced by the perceived diversity and negatively by the perceived novelty.

Figure 2.2: Screenshot of Ekstrand's experiment interface

2.4.3 Some observations

Table 2.2 contains a summary of the results of user studies described above. The negative influence of novelty on users’ satisfaction observed by Ekstrand et al. [9] seems to contradict the findings of Pu et al. [24], who found novelty to positively influence the perceived system usefulness (and consequently users’ satisfaction). A possible explanation of this contradiction could be related to the application domains: Ekstrand et al. focused on movie recommendations, while Pu et al. conducted the experiment using different recommender services, including Amazon, Youtube and IMDb. Another reason could be the different formulations of the novelty-related questions: Ekstrand et al. used a negative tone (e.g., "Which list has more movies you do not expect?"), while Pu et al. used positively phrased questions (e.g., "The recommender system helped me discover new products").

From the previous example, it follows that the impact of novelty is still an open question and a larger number of user studies are needed.


Table 2.2: User studies

Author | # Users | Metrics | Domains | Results
Ziegler et al. [37] | 2125 | diversity | books | Using item-based CF, diversity of items positively influences user satisfaction.
Celma [5] | 288 | accuracy, novelty | music | The content-based approach produces the most novel recommendations.
Hu and Pu [16] | 20 | diversity | perfumes | The organization-based interface increases perceived categorical diversity.
Ge et al. [12] | 52 | diversity | movies | Diverse items inserted at the end of the list increase perceived diversity.
Zhang et al. [35] | 21 | accuracy, serendipity | music | Serendipity positively influences user satisfaction, at the cost of lower accuracy.
Pu et al. [24] | 239 | accuracy, diversity, novelty | e-commerce, video, movies | Novelty positively influences user satisfaction.
Knijnenburg et al. [19] | 137 | accuracy, diversity | movies | Diversity positively influences perceived accuracy and user satisfaction.
Ekstrand et al. [9] | 1052 | accuracy, diversity, novelty | movies | Novelty negatively influences user satisfaction, while diversity positively influences it.


Chapter 3

Web application

This chapter aims to present the Novelty Survey web application, an online tool that helps researchers to set up user studies on movie recommendations and collect user feedback.

3.1 Functionalities

In this section, we describe the main functionalities of the Novelty Survey web application. Two actors interact with the system:

• Participants: they are the final users that interact with the recommender system and provide feedback by answering a set of questions.

• Researchers: they create and monitor the user studies.

3.1.1 Authentication

In order to access the user study, participants need to authenticate. They can quickly register and log in with social credentials (Facebook or Google); otherwise, they need to use email and password (Figure 3.1). In the latter case, account confirmation is requested and there is the possibility to reset forgotten passwords. In both types of registration, demographic data (gender and age) are provided by the users (Figure 3.2), while the country of the participants is automatically identified from the IP address of the client (since these data are not always available in social accounts).

3.1.2 Catalogue

Once logged in, participants can select a certain number of items from a catalogue of movies (Figure 3.3). We focus on the movie domain, because it is well known from the user perspective and it can easily be used to explore various recommender qualities. In particular, we chose TMDb (https://www.themoviedb.org/) as the movie catalogue for the following reasons:

• Size: the catalogue is very large, containing more than 450000 items.


Figure 3.1: Novelty Survey authentication


Figure 3.3: Novelty Survey catalogue

• Up-to-date: the catalogue is always up-to-date with the current movies (also those coming soon).

• Reliability: the database is crowd-sourced; a large community contributes to correcting data and inserting missing values.

Participants can search movies by title or part of the title (Figure 3.4). The results of the query are displayed as posters in pages (eight items per page). When the search bar is blank, the up-to-date list of the most popular movies is provided by default. For detailed information (cast, crew, trailer, etc.), each item is linked to its TMDb official page (Figure 3.5), so users can be sure to have selected the right movie. All the selected items are displayed on the right side of the page as a reminder and compose the user profile, which is used as input to the recommender system.

3.1.3 Recommendations

The recommender system constitutes the core of the web application. We decided to implement the following recommendation algorithms:

• Top-popular: given a recommendation list of length N, provide the N most popular TMDb movies.
• Random: given a recommendation list of length N, provide N movies randomly selected from the TMDb catalogue.
• Top-rated: given a recommendation list of length N, provide the N top-rated TMDb movies.


Figure 3.4: Novelty Survey search


Figure 3.6: Novelty Survey survey

Top-popular and Random algorithms can be pure (with no filters) or content-based (with filters). The filters are the genres, the crew and the cast of the input movies (the user profile). Filters are used in different combinations to produce different algorithms (see Algorithm 1 and Algorithm 2). The Top-rated algorithm is only pure, without filters (see Algorithm 3).

We decided not to implement collaborative filtering algorithms, in order to avoid the cold-start problem: there are not enough movie ratings in TMDb.

For each user study, one or two algorithms are chosen (it depends on the survey design). Each algorithm produces a list of recommended movies to be displayed on the page.

3.1.4 Survey

Participants can complete a questionnaire related to the perceived qualities of recommendations (Figure 3.6). For each movie in the list, participants can explore related information (link to the TMDb official page). They can answer only one question per page. Once they click the "Next" button, they cannot go back and change the response. We decided to apply this "no back" policy in order to detect people who respond randomly: they cannot check their previous answers, so it is probable that they will provide different answers to the same questions. Participants can log out and resume the survey at any point. Once the survey is completed (Figure 3.7), they cannot repeat it with the same account (each account is associated with only one survey).

3.1.5 Administration

The admin page of the Novelty Survey web application is entirely dedicated to the administration of user studies (Figure 3.8).


Algorithm 1 Top-popular algorithm

function GETGENRES(userProfile)
    for each movie ∈ userProfile do
        for each g ∈ movie.genres do
            append g to genres
    return genres

function GETCREW(userProfile)
    for each movie ∈ userProfile do
        for each c ∈ movie.crew do
            append c to crew
    return crew

function GETCAST(userProfile)
    for each movie ∈ userProfile do
        for each c ∈ movie.cast do
            append c to cast
    return cast

function GETCONTENT(userProfile, flag, (Ngenres, Ncrew, Ncast))
    if flag.genre = true then
        genres ← GETGENRES(userProfile)
        content.genres ← mostCommon(genres, Ngenres)
    if flag.crew = true then
        crew ← GETCREW(userProfile)
        content.crew ← crew[0, Ncrew]
    if flag.cast = true then
        cast ← GETCAST(userProfile)
        content.cast ← cast[0, Ncast]
    return content

procedure TOPPOPULAR(userProfile, flag, (Ngenres, Ncrew, Ncast), recListLength)
    content ← GETCONTENT(userProfile, flag, (Ngenres, Ncrew, Ncast))
    movies ← getTMDbTopPop(content)            ▷ get the most popular movies from the TMDb catalogue, using the filters (API provided by TMDb)
    movies ← excludeSeen(movies, userProfile)  ▷ exclude movies already present in userProfile
    recs ← movies[0, recListLength]
    return recs


Algorithm 2 Random algorithm

(The helper functions GETGENRES, GETCREW, GETCAST and GETCONTENT are the same as in Algorithm 1.)

procedure RANDOM(userProfile, flag, (Ngenres, Ncrew, Ncast), recListLength, (p1, p2))
    content ← GETCONTENT(userProfile, flag, (Ngenres, Ncrew, Ncast))
    randomPage ← getRandomBetweenValues(p1, p2)         ▷ pick a random value in the range [p1, p2]
    movies ← getTMDbCataloguePage(content, randomPage)  ▷ get the movies on page number randomPage of the TMDb catalogue, using the filters (API provided by TMDb)
    movies ← excludeSeen(movies, userProfile)           ▷ exclude movies already present in userProfile
    recs ← movies[0, recListLength]
    return recs


Algorithm 3 Top-rated algorithm

procedure TOPRATED(userProfile, recListLength)
    movies ← getTMDbTopRated()                 ▷ get the top-rated movies from the TMDb catalogue (API provided by TMDb)
    movies ← excludeSeen(movies, userProfile)  ▷ exclude movies already present in userProfile
    recs ← movies[0, recListLength]
    return recs
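For concreteness, a hedged sketch of how the content-based Top-popular query could be issued against the public TMDb discover endpoint. This is not the application's actual code (which is not reproduced here); the with_genres, with_cast, with_crew and sort_by parameters belong to the TMDb API, while the shape of the content argument mirrors the GETCONTENT step above and is an assumption:

import requests

TMDB_DISCOVER = "https://api.themoviedb.org/3/discover/movie"

def top_popular_content_based(content, api_key, rec_list_length=5):
    # Ask TMDb for the most popular movies matching the content filters.
    # content: dict with optional keys 'genres', 'crew', 'cast' holding TMDb ids.
    params = {"api_key": api_key, "sort_by": "popularity.desc"}
    if content.get("genres"):
        params["with_genres"] = ",".join(str(g) for g in content["genres"])
    if content.get("crew"):
        params["with_crew"] = ",".join(str(c) for c in content["crew"])
    if content.get("cast"):
        params["with_cast"] = ",".join(str(c) for c in content["cast"])
    results = requests.get(TMDB_DISCOVER, params=params).json().get("results", [])
    # The excludeSeen step and pagination are omitted for brevity.
    return results[:rec_list_length]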


Researchers can access the page with the admin account. The admin page is used to manage the following objects:

• Users: Researchers can visualize all the details about the users registered in the system (both survey participants and administrators). Details include email address, demographic data (gender, age, country) and date of registration. Researchers can delete all the selected accounts, except the administrators: this means that deleted users can register again and repeat the survey. Researchers can also export the data relative to selected users in different formats (csv, xls, json and many others).

• Surveys: Researchers can create and edit surveys. Each survey is identified by a unique survey id and corresponds to a different user study. Researchers can decide the name of the survey and the type of survey design (between-subjects or within-subjects). For each survey, they can set the questions (those available in the database) and the order of questions. Finally, they can select the algorithms to use in the study and the length of the recommendation lists.
• Questions: Researchers can create and edit questions to be used in the different surveys. For each question, they can set the response options (those available in the database) and the order of options.

• Options: Researchers can create and edit response options.

• Responses: Researchers can visualize the database of all the responses submitted by the participants. Each line of the database contains a question and the respective answer, including the email of the participant, the survey id, the completion date, the user profile, the recommendation lists and the algorithms that have been used. The validity of each survey is also recorded: the survey is not valid if the participant provides different answers to the same questions. Researchers can also export the data relative to selected responses in different formats (csv, xls, json and many others).

3.2 System Architecture and Technologies

The system is based on a common three-tier client-server architecture (Figure 3.9). The frontend layer is implemented using HTML5 and Javascript; in particular, we used the ReactJS (https://reactjs.org/) and Semantic UI React (https://react.semantic-ui.com/) libraries to build the user interface. The application layer is implemented in Python, using the Django Web Framework (https://www.djangoproject.com/). The Django application is hosted on the Heroku cloud platform (https://dashboard.heroku.com/). We used PostgreSQL (https://www.postgresql.org/) as the database system.


Figure 3.8: Novelty Survey admin page


3.2.1 Third-party services

The third-party services of the Novelty Survey web application include:

• TMDb APIs (https://developers.themoviedb.org/3/getting-started/introduction) to query the TMDb movie catalogue.
• Third-party authentication providers: Facebook (https://developers.facebook.com/) and Google (https://developers.google.com/).
• The Sendgrid email backend (https://sendgrid.com/) to send emails.


Chapter 4

Experiments and Results

THIS chapter presents our experiments and the results obtained. We used the Novelty Survey webapp to set up user studies on beyond-accuracy metrics, in particular novelty and diversity.

4.1 Novelty Study

The objective of this study was to compare users' perceived novelty with several novelty metrics that can be computed offline, in order to identify the one that best correlates with user perception. Our approach was to measure the perceived qualities of two recommendation algorithms, using a within-subjects survey design, and then to focus on the results for perceived novelty.

4.1.1 Experiment

Users and Context

We conducted our experiment on users of Figure Eight (https://www.figure-eight.com/), a crowdsourcing platform. In 9 hours, 169 users attempted the survey and 102 completed it. Among the complete surveys, 89 were valid: 13% were discarded for inconsistency (they provided different answers to the same questions). 75% of the subjects were male and 25% were female; most of the subjects were in the age range between 18 and 30 (Figure 4.1); they came from 21 different countries (Figure 4.2), mainly from Venezuela (40%), Ukraine (9%) and Egypt (8%). On average, each subject took about 6 minutes to finish the whole experiment.

User Profile

After registration, each user was initially requested to select 8 movies of interest (implicit ratings) from the TMDb catalogue. This input data formed the user profile.



Figure 4.1: Age of the subjects - Novelty Study


Recommendation Algorithms

For this experiment, we tested two recommendation algorithms: Top-popular content-based and Random content-based (see Algorithm 1 and Algorithm 2). We set up the configuration of the algorithms on the admin page of the Novelty Survey web application as follows:

• Top-popular content-based: based on genre (Ngenres = 1), crew (Ncrew = 5) and cast (Ncast = 5).

• Random content-based: based on crew (Ncrew = 8) and cast (Ncast = 7), with random pages between 5 and 8 (p1 = 5, p2 = 8).

For each algorithm, we computed a recommendation list containing 5 movies. We presented these lists as "List A" and "List B" (to avoid possible biases, the ordering of the algorithms was randomized for each user).
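The random assignment of the two algorithms to the labels "List A" and "List B" can be sketched as follows; the function and the algorithm identifiers are our own illustration of the idea, not the actual code of the webapp.

import random

def assign_lists(algorithms):
    # map the two algorithms to the labels "List A" and "List B"
    # in a fresh random order for every user
    shuffled = list(algorithms)
    random.shuffle(shuffled)
    return {"List A": shuffled[0], "List B": shuffled[1]}

# example: assign_lists(["top_popular_content_based", "random_content_based"])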

Survey

For this experiment, we investigated 4 perceived qualities of recommendations:

• Accuracy: measures how much the recommendations match the users' interests.

• First Order Novelty (FON) [8]: measures the extent to which users receive new recommended movies (never watched).

• Second Order Novelty (SON) [8]: measures the extent to which users receive new recommended movies (never heard about).

• Satisfaction: measures the global user satisfaction with the recommender system.

The questions used to evaluate user feedback were the following:

• Accuracy: Which list contains more movies that you like?

• First Order Novelty (FON): Which list contains more movies that you have already watched?

• Second Order Novelty (SON): Which list contains more movies that you already knew about?

• Satisfaction: Which list is more satisfactory for you?

We also posed equivalent questions expressed in a different way, in order to verify the consistency of the answers. To detect people who respond randomly, we inserted duplicated questions. The possible answers to each question were: "List A", "List B", "Both".
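The consistency check described above (a survey is discarded if a participant answers duplicated or equivalent questions differently) can be sketched as follows; the representation of responses as (question id, answer) pairs is an assumption made for illustration.

from collections import defaultdict

def is_consistent(responses):
    # responses: iterable of (question_id, answer) pairs, where duplicated or
    # equivalent questions share the same question_id
    answers_per_question = defaultdict(set)
    for question_id, answer in responses:
        answers_per_question[question_id].add(answer)
    # the survey is valid only if every repeated question received a single answer
    return all(len(answers) == 1 for answers in answers_per_question.values())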


4.1.2 Results

The final results of the online evaluation can be found in Table 4.1. In order to test the significance of the results, we performed multiple pair-wise McNemar tests on the responses from the users, as this test fits well the characteristics of the gathered data (binary responses under two experimental conditions, see 2.3.4). For the purpose of the statistical tests, we considered only exclusive answers ("List A" or "List B"). All tests were run with significance level α = 0.05, and all results are significantly different (p-value < 0.05). The results indicate that the Random content-based algorithm provides more novel recommendations than the Top-popular content-based algorithm. Furthermore, the perceived accuracy and the overall satisfaction of the Top-popular content-based recommendations are higher.

Table 4.1: Results of the user study

Research Variables      Top-popular    Random
Accuracy                73%            18%
First Order Novelty     16%            67%
Second Order Novelty    11%            74%
Satisfaction            68%            18%
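The pair-wise McNemar tests mentioned above can be reproduced, for example, with statsmodels; the sketch below assumes that the exclusive answers of the paired conditions have already been arranged in a 2x2 contingency table (the counts shown are placeholders, not our data).

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 contingency table of paired binary responses (placeholder counts)
table = np.array([[30, 9],
                  [25, 25]])

result = mcnemar(table, exact=False, correction=True)   # chi-square approximation
print(result.statistic, result.pvalue)                  # the difference is significant if pvalue < 0.05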

For both algorithms, First Order Novelty (FON) and Second Order Novelty (SON) are negatively correlated with accuracy (ACC) and satisfaction (SAT). In particular, the negative correlation between FON and SAT for the Random recommended lists is the strongest (see Table 4.2). We observe that users are generally more satisfied when they perceive SON (never heard of) than when they perceive FON (never watched).

Table 4.2: Pearson correlation of survey variables (most significant correlation in bold)

                     FON - Top-popular    SON - Top-popular
ACC - Top-popular    -0.43                -0.27
SAT - Top-popular    -0.44                -0.29

                     FON - Random         SON - Random
ACC - Random         -0.48                -0.40
SAT - Random         -0.49                -0.39

We observe that, proportionally, users in the age range between 40 and 50 perceive top-popular lists of movies as more novel (never watched) than the younger participants do, while the trend is reversed for random lists of movies (Figure 4.3).

We also observe that, proportionally, female participants perceive top-popular lists of movies as more novel (never heard of) than male participants do, while the trend is reversed for random lists of movies (Figure 4.4).


Figure 4.3: Perceived FON by age


Novelty metrics analysis

We compared our survey results on novelty with 3 novelty metrics that can be computed offline on the recommendation lists (R) of both algorithms (Top-popular and Random). In particular, we chose the following 3 metrics (a computational sketch is given after the list):

• OFF-NOV-1: the well-known self-information novelty formulation, based on item popularity (see Equation 2.18).

novelty(R) = \frac{\sum_{i \in R} -\log_2 pop(i)}{|R|}    (4.1)

• OFF-NOV-2: a variation of Equation 2.18 with the inverse of popularity.

novelty(R) = \frac{\sum_{i \in R} 1/pop(i)}{|R|}    (4.2)

• OFF-NOV-3: a variation of Equation 2.18 with the inverse of the vote count (i.e., the number of users who rated the item).

novelty(R) = \frac{\sum_{i \in R} 1/voteCount(i)}{|R|}    (4.3)

Values of OFF-NOV-1 and OFF-NOV-2 novelty computed on the Top-popular recommended lists are highly correlated (Pearson's r = 0.9, see Figure 4.5), whereas OFF-NOV-1 and OFF-NOV-3 are less correlated (Pearson's r = 0.45, see Figure 4.6). Therefore, in the Top-popular case, we expected significantly different results between the OFF-NOV-1 and OFF-NOV-3 metrics.

We compared the offline metrics with FON (never watched) and SON (never heard of). As the offline metrics are continuous variables, whereas FON and SON are dichotomous variables (the responses are binary), we used the point-biserial correlation: this is a special case of Pearson's correlation used to measure the association between one continuous variable and one dichotomous variable [31]. The results of the measurements are reported in Table 4.3 (Top-popular) and Table 4.4 (Random).
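The point-biserial coefficients reported below can be obtained directly with SciPy; the arrays in the sketch are placeholders standing for the per-user offline novelty value of the displayed list and the corresponding binary FON (or SON) answer.

from scipy import stats

off_nov = [2.1, 3.4, 1.8, 4.0, 2.7]   # offline novelty value of the list shown to each user (continuous)
fon = [0, 1, 0, 1, 1]                 # whether the same user perceived the list as novel (dichotomous)

r, p_value = stats.pointbiserialr(fon, off_nov)
print(r, p_value)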

Table 4.3: Top-popular - Point-biserial correlation of novelty metrics (best positive correlation in bold)

Top-popular - Novelty    FON      SON
OFF-NOV-1                0.024    0.094
OFF-NOV-2                0.163    0.150
OFF-NOV-3                0.324    0.157

The correlation coefficients are all positive, but most of them are close to zero (no relationship). In the case of the Top-popular algorithm, the best positive correlation is between OFF-NOV-3 and FON (see Figure 4.9).


Figure 4.5: Top-popular - OFF-NOV-1 vs OFF-NOV-2


Table 4.4: Random - Point-biserial correlation of novelty metrics (best positive correlation in bold)

Random - Novelty    FON      SON
OFF-NOV-1           0.047    0.120
OFF-NOV-2           0.035    0.105
OFF-NOV-3           0.002    0.024

Figure 4.7: Top-popular - OFF-NOV-1

On average, values of OFF-NOV-3 novelty are indeed higher when a user perceives the top-popular list of movies as novel (FON = 1, i.e., never watched). No significant results can be found in the case of the Random algorithm. For completeness, we report all relationships between offline and perceived novelty metrics (boxplots in Figures 4.7, 4.8, 4.9, 4.10, 4.11 and 4.12).

We can conclude that OFF-NOV-3 (i.e., a variation of the self-information novelty formulation with the inverse of the vote count) is a good metric and reflects the real perceived novelty of the Top-popular recommender.

4.2 Diversity Study

The objective of this study was to compare users' perceived diversity with several diversity metrics that can be computed offline, in order to identify the one that best correlates with user perception. Our approach was to measure the perceived qualities of two recommendation algorithms, using a within-subjects survey design, and then to focus on the results for perceived diversity.


Figure 4.8: Top-popular - OFF-NOV-2

Figure 4.9: Top-popular - OFF-NOV-3


Figure 4.11: Random - OFF-NOV-2


Figure 4.13: Age of the subjects - Diversity Study

4.2.1 Experiment

Users and Context

We conducted our experiment on users of Figure Eight. In 11 hours, 212 users attempted the survey and 124 completed it. Among the complete surveys, 99 were valid: 20% were discarded for inconsistency (they provided different answers to the same questions). 74% of the subjects were male and 26% were female; most of the subjects were in the age range between 18 and 30 (Figure 4.13); they came from 24 different countries (Figure 4.14), mainly from Venezuela (47%), Turkey (8%), Serbia (6%) and Ukraine (6%). On average, each subject took about 5 minutes to finish the whole experiment.

User Profile

After registration, each user was initially requested to select 8 movies of interest (implicit ratings) from the TMDb catalogue. This input data formed the user profile.

Recommendation Algorithms

For this experiment, we tested two recommendation algorithms: Top-popular content-based and Random content-based (see Algorithm 1 and Algorithm 2). We set up the configuration of the algorithms on the admin page of the Novelty Survey web application as follows:

• Top-popular content-based: based on genre (Ngenres = 1), crew (Ncrew = 8) and cast (Ncast = 8).


Figure 4.14: Countries of the subjects - Diversity Study

• Random content-based: based on crew (Ncrew = 8) and cast (Ncast = 8), with random pages between 5 and 8 (p1 = 5, p2 = 8).

For each algorithm, we computed a recommendation list containing 5 movies. We presented these lists as "List A" and "List B" (to avoid possible biases, the ordering of the algorithms was randomized for each user).

Survey

For this experiment, we investigated 3 perceived qualities of recommendations:

• Accuracy: measures how much the recommendations match the users' interests.

• Intra-list Diversity: measures how much users perceive the recommended movies within a list as different from each other.

• Satisfaction: measures the global user satisfaction with the recommender system.

The questions used to evaluate user feedback were the following:

• Accuracy: Which list contains more movies that you like?

• Intra-list Diversity: Which list contains more movies that are similar to each other?

