
(1)

LCBM: Statistics-based Parallel Collaborative Filtering

Fabio Petroni 1, Leonardo Querzoni 1, Roberto Beraldi 1, Mario Paolucci 2

Cyber Intelligence and Information Security

CIS Sapienza

1 Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome - petroni|querzoni|[email protected]

2 Institute of Cognitive Sciences and Technologies, CNR - [email protected]

Larnaca, Cyprus 22-23 May, 2014

(2)

Personalization

▶ Most of today's internet businesses deeply root their success in the ability to provide users with strongly personalized experiences.

▶ Recommender Systems are a subclass of information filtering systems that seek to predict the rating or preference that a user would give to an item.

(3)

Collaborative Filtering

▶ Collaborative Filtering (CF) is today a widely adopted strategy to build recommendation engines.

▶ CF analyzes the known preferences of a group of users to make predictions of the unknown preferences of other users.

Filtering

filter the user data stream, keeping the content that satisfies the user's personal interests

Discovery

recommend new content to the user that matches his interests

(4)

Challenges

▶ Key factors for recommendation effectiveness:

amount of data available

larger data-sets improve quality more than better algorithms

timeliness

timely delivery of output to users constitutes a business advantage

✗ Most CF algorithms need to go through a very lengthy training stage, incurring prohibitive computational costs and large time-to-prediction intervals when applied to large data sets.

✗ This lack of efficiency will quickly limit the applicability of these solutions at the current rates of data production growth.

(5)

Taxonomy Of CF Algorithms

[Diagram: taxonomy of CF algorithms]

MEMORY-BASED
  neighborhood models (user-oriented, item-oriented)
    Pearson Correlation Similarity
    Cosine Similarity

MODEL-BASED
  cluster models
  Bayesian networks
  linear regression
  LATENT FACTOR MODELS
    pLSA
    neural networks
    Latent Dirichlet Allocation
    MATRIX FACTORIZATION (ALS, SGD, SVD)


(6)

Memory-Based CF Algorithms

▶ Operate on the entire set of ratings to generate predictions:

▷ compute a similarity score, which reflects the distance between two users or two items:
  - Pearson Correlation Similarity
  - Cosine Similarity

▷ predictions are computed by looking at the ratings of the k most similar users or items (k-nearest neighbors).

✓ Simple design and implementation;
✓ used in a lot of real-world systems.

✗ Impose several scalability limitations;
✗ their use is impractical when dealing with large amounts of data.
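As a concrete illustration of the memory-based approach (a minimal sketch, not code from the paper), a user-oriented k-nearest-neighbor predictor with cosine similarity:

```python
# Illustrative user-oriented k-NN predictor (not the paper's code).
# 0 entries in the ratings matrix mean "no rating".
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity computed over co-rated items only."""
    mask = (u != 0) & (v != 0)
    if not mask.any():
        return 0.0
    a, b = u[mask], v[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def knn_predict(R, user, item, k=2):
    """Similarity-weighted average of the k most similar users who rated item."""
    sims = sorted(((cosine_sim(R[user], R[other]), other)
                   for other in range(R.shape[0])
                   if other != user and R[other, item] != 0),
                  reverse=True)[:k]
    den = sum(abs(s) for s, _ in sims)
    return sum(s * R[o, item] for s, o in sims) / den if den else 0.0

# toy matrix: rows = users, columns = items
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 5, 4],
              [0, 1, 5, 4]], dtype=float)
print(round(knn_predict(R, user=1, item=1), 2))  # → 2.0
```

Note how every prediction scans the whole ratings matrix for similar users: this is exactly the scalability limitation the slide points out.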

(7)

Model-Based CF Algorithms

▶ Use the set of available ratings to estimate or learn a model, and then apply this model to make rating predictions.

▶ The most successful are based on matrix factorization models.


(8)

Matrix Factorization 1/2

LOSS(P, Q) = Σ_(i,j) (R_ij − P_i · Q_j)²

▶ The most popular techniques to minimize LOSS are Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD).

ALS

ALS alternates between keeping P and Q fixed, each time solving a least-squares problem.

SGD

SGD works by taking steps proportional to the negative of the gradient of the LOSS function.

▶ Both algorithms need several passes through the set of ratings.
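The SGD variant can be sketched as follows (an illustrative toy implementation, not the parallel implementations discussed later):

```python
# Toy SGD matrix factorization (illustrative, not the paper's code):
# minimize LOSS(P, Q) = Σ (R_ij − P_i·Q_j)² one observed rating at a time.
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.02, epochs=2000, seed=0):
    """ratings: iterable of (user, item, value). Returns factor matrices P, Q."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):            # E passes over the X observed ratings
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]      # residual of the current prediction
            P[u] += lr * err * Q[i]    # step against the gradient of LOSS
            Q[i] += lr * err * P[u]
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = sgd_mf(ratings, n_users=3, n_items=3)
# after training, P_i · Q_j reproduces the observed ratings closely
worst = max(abs(r - P[u] @ Q[i]) for u, i, r in ratings)
```

The nested loop makes the cost structure visible: E epochs over X ratings, each touching K factors, i.e. the O(X·K·E) training time reported in the comparison table.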

(9)

Matrix Factorization 2/2

✓ Provide high quality results;
✓ SGD was chosen by the top three solutions of KDD-Cup 2011;
✓ there exist parallel and distributed implementations.

✗ High computational costs;
✗ large time-to-prediction intervals;
✗ especially when applied to large data sets.

                  SGD           ALS                   LCBM
training time     O(X·K·E)      Ω(K³(N+M) + K²X)      O(X)
prediction time   O(K)          O(K)                  O(1)
memory usage      O(K·(N+M))    O(M² + X)             O(N+M)

In the table X is the number of collected ratings, K the number of hidden features, E the number of iterations, and N and M the number of users and items respectively.


(10)

Contributions

In this paper we present:

LCBM: Linear Classifier of Beta distributions Means

a novel collaborative filtering algorithm for binary ratings that:

✓ is inherently parallelizable;
✓ provides results whose quality is on par with state-of-the-art solutions;
✓ in shorter time;
✓ using less computational resources.

(11)

Overview

[Diagram: LCBM. In the training phase, incoming ratings feed two blocks: an item rating PDF inference block (ratings grouped per item, yielding a mean and a standard error per item) and a user profiler (ratings grouped per user, yielding a quality threshold per user). In the working phase, the mean is compared against the threshold to emit predictions.]

▶ We consider:

▷ items as elements whose tendency to be rated positively/negatively can be statistically characterized using a probability density function;

▷ users as entities with different tastes that rate the same items using different criteria and that must thus be profiled.

▶ The algorithm needs two passes over the set of available ratings to train the model (build user and item profiles).


(12)

Item Rating PDF Inference

▶ Beta distribution:

▷ X = the relative frequency of positive votes that the item will receive.
▷ α = OK + 1, β = KO + 1

Item profile

MEAN = α/(α+β) = (OK+1)/(OK+KO+2)

SE = (1/(OK+KO+2)) · √[(OK+1)(KO+1) / ((OK+KO)(OK+KO+3))]

▶ Chebyshev's inequality:

Pr(MEAN − λ·SE ≤ X ≤ MEAN + λ·SE) ≥ 1 − 1/λ²

[Plot: Beta PDF of X over [0, 1]]

▶ Example:

▷ OK = 8, KO = 3
▷ MEAN ≈ 0.7
▷ SE ≈ 0.04

With λ = 2: Pr(0.62 ≤ X ≤ 0.78) ≥ 0.75
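The item profile can be computed in a few lines (a sketch assuming the formulas on this slide; the function and argument names are ours):

```python
# Sketch of LCBM's item profile: Beta(α, β) with α = OK+1, β = KO+1,
# plus a Chebyshev confidence interval on the positive-vote frequency X.
import math

def item_profile(ok, ko):
    """Return (MEAN, SE) of the item's positive-vote frequency."""
    mean = (ok + 1) / (ok + ko + 2)                 # Beta mean α/(α+β)
    se = (1.0 / (ok + ko + 2)) * math.sqrt(
        (ok + 1) * (ko + 1) / ((ok + ko) * (ok + ko + 3)))
    return mean, se

def chebyshev_interval(mean, se, lam=2.0):
    """Interval containing X with probability ≥ 1 − 1/λ²."""
    return mean - lam * se, mean + lam * se

mean, se = item_profile(ok=8, ko=3)     # the slide's example
lo, hi = chebyshev_interval(mean, se)   # λ = 2 → probability ≥ 0.75
print(round(mean, 2), round(se, 2))     # → 0.69 0.04
print(round(lo, 2), round(hi, 2))       # → 0.62 0.77
```

Running it on the example (OK = 8, KO = 3) reproduces the slide's MEAN ≈ 0.7 and SE ≈ 0.04.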

(13)

User Profiler

[Figure: one user's rating keys on the [0, 1] axis; the discriminant function splits them, with OK errors on one side and KO errors on the other]

OK errors KO errors

▶ Every rating is represented by a point with two attributes:

▷ a boolean value containing the rating (OK or KO);
▷ a key ∈ [0, 1] that gives the point's rank in the order:

  - if value = OK → key = MEAN + λ·SE
  - if value = KO → key = MEAN − λ·SE

User profile

Quality threshold (QT) = the discriminant function of a 1D linear classifier that separates the points with a minimal number of errors.


(14)

Working Phase

[Diagram: LCBM architecture, as in the Overview slide.]

▶ In order to produce a prediction, the algorithm simply compares the QT value from the user profiler with the MEAN value from the item rating PDF inference block:

▷ if MEAN > QT → prediction = OK

▷ if MEAN ≤ QT → prediction = KO
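The comparison above translates directly to code (function name is ours); this single comparison per (user, item) pair is why LCBM's prediction time is O(1):

```python
# Sketch of LCBM's working phase: one comparison between the item's MEAN
# and the user's quality threshold QT per prediction.
def lcbm_predict(item_mean, user_qt):
    """Binary prediction: OK iff the item's MEAN exceeds the user's QT."""
    return "OK" if item_mean > user_qt else "KO"

print(lcbm_predict(item_mean=0.69, user_qt=0.30))  # → OK
print(lcbm_predict(item_mean=0.25, user_qt=0.30))  # → KO
```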

(15)

Experiments: Setting and Data Sets

Dataset          users    items    ratings
MovieLens 100k   943      1682     100000
MovieLens 10M    69878    10677    10000054
Netflix          480189   17770    100480507

▶ 5-fold cross-validation approach:

▷ this technique divides the dataset in 5 folds and then uses in turn one of the folds as test set and the remaining ones as training set.

▶ We compared LCBM against CF implementations in Apache Mahout:

▷ ParallelSGDFactorizer: lock-free, parallel implementation of SGD

▷ ALSWRFactorizer: ALS with Weighted-λ-Regularization
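The 5-fold protocol can be sketched as follows (a round-robin partition for illustration; the actual fold assignment used in the experiments may differ):

```python
# Minimal 5-fold cross-validation split: each fold is the test set exactly
# once, while the other four folds form the training set.
def five_fold_splits(records, k=5):
    """records: any list of rating records. Yields (train, test) pairs."""
    folds = [records[i::k] for i in range(k)]      # round-robin partition
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

data = list(range(10))                             # stand-in for rating records
splits = list(five_fold_splits(data))
print(len(splits))                                 # → 5
print(sorted(splits[0][0] + splits[0][1]) == data) # → True
```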


(16)

Experiments: Performance Metrics

▶ Confusion matrix:

                   predicted positive      predicted negative
actual positive    true positives (TP)     false negatives (FN)
actual negative    false positives (FP)    true negatives (TN)

▶ Matthews correlation coefficient (MCC) ∈ [−1, 1]:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

▷ 1 → perfect prediction;
▷ 0 → no better than random prediction;
▷ −1 → total disagreement between prediction and observation.

▶ time needed to run the test;
▶ the peak memory load during the test.
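The MCC formula translates directly to code:

```python
# Direct implementation of the MCC formula from the confusion matrix.
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(tp=90, fn=10, fp=10, tn=90))  # → 0.8
print(mcc(tp=50, fn=50, fp=50, tn=50))  # → 0.0
```

Unlike plain accuracy, MCC stays near 0 for a random predictor even on imbalanced binary data, which is why it is used here.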

(17)

Results: Accuracy

[Plot: MCC (0 to 0.6) versus number of ratings on MovieLens (10⁵), MovieLens (10⁷) and Netflix (10⁸) for LCBM, SGD (K = 256, 32, 8) and ALS]

(18)

Results: Computational Resources

[Plot: peak memory in GB (0 to 20) versus number of ratings on MovieLens (10⁵), MovieLens (10⁷) and Netflix (10⁸) for LCBM, SGD (K = 256, 32, 8) and ALS, with a zoom on the 0–0.2 GB range]

(19)

Results: Timeliness

[Plot: test time in seconds (0 to 4500) versus number of ratings on MovieLens (10⁵), MovieLens (10⁷) and Netflix (10⁸) for LCBM, SGD (K = 256, 32, 8) and ALS, with a zoom on the 0–10 s range]

(20)

Conclusions

▶ LCBM is a novel CF algorithm for binary ratings.

▶ It works by analyzing collected ratings to:

▷ infer a probability density function of the relative frequency of positive votes that each item will receive;

▷ compute a personalized quality threshold for each user.

▶ These two pieces of information are used to predict missing ratings.

▶ LCBM is inherently parallelizable.

Future work:

▶ Handle multidimensional ratings.

▶ Provide recommendations: a list of items the user will like the most.

(21)

Thank you!

Questions?

Fabio Petroni

Rome, Italy

Current position:

PhD Student in Computer Engineering, Sapienza University of Rome

Research Areas:

Recommendation Systems, Collaborative Filtering, Distributed Systems

[email protected]

