Academic year: 2021



 

Project financed with POR FESR 2014/2020 funds - PRIORITY AXIS I

"SCIENTIFIC RESEARCH, TECHNOLOGICAL DEVELOPMENT AND INNOVATION".

 

R.2.4. - ANNEX: FINAL TECHNICAL-SCIENTIFIC REPORT (IN ENGLISH)

Scientific report on the methodologies adopted for the enrichment of user profiles

"Top-down" cluster project DoUtDes - Scientific coordinator: Prof. Salvatore M. Carta

In this report, we experiment with and validate some user profile enrichment techniques that can be applied to the companies of the RAS cluster project "DoUtDes". To do so, we focus on a toy case study, simulating a platform for runners managed by a sport company of the cluster.

The scenario requires providing an online coach with a ranked list of users, according to the support they need. To do so, we first model their performance and running behavior, and then present an enrichment and ranking methodology to recommend users to coaches according to their performance in the last running session. We validate this approach in order to define a sound methodology that can be extended for use in DoUtDes.

Introduction 

In this report we propose a user profile enrichment, ranking, and recommendation system that suggests to a coach the sportspeople she follows, after they have performed a workout. Our approach first models users according to their workout performance, then ranks them in ascending order of workout quality, thus suggesting first those with the worst performances.

The choice of introducing a user profile enrichment system between the end of a workout and the support offered by the coach is motivated not only by the large number of users that a coach follows, but also by the complexity of workout results (a workout is usually composed of different activities, such as running, walking, and resting, and each activity returns several results, like the speed and covered distance), which need to be contextualized with the characteristics of the users (e.g., gender, age, and workout objective). With our proposal, we offer coaches a tool that provides an initial filtering of the workout results, to facilitate their work.

In this work, we model the problem as an enrichment and ranking problem, since our goal is not to predict the quality of a workout with a score (rating), but to provide the coach with an effective ranking of the users to support.

Our contributions can be summarized as follows:


- We simulate an approach to model the performances of the users;

- We introduce a novel algorithm to enrich and rank the users according to the support they need and recommend them to the coach;

- We validate our proposal on a real-world dataset collected from an existing platform, using standard metrics to assess ranking quality, and compare it against state-of-the-art ranking algorithms;

- We make the results of this report available, in order to allow the community to advance the research on this topic, and to be used for defining the effective enrichment approach of DoUtDes.

The case study platform 

The case study platform we are validating is a tool for simulating online personal training that supports and guides people towards an active lifestyle by motivating them to exercise.

The platform is made up of a web application and a mobile client that communicate with the same business logic, so it is similar, from a structural point of view, to the DoUtDes architecture.

The mobile application uses the devices’ sensors to record training statistics, while the web application provides users with an area where they can manage their workout settings and find workout session statistics; it also serves as a dashboard for the coaches, so that they can find all the tools needed to handle requests for tailored workout plans. Figure 1 depicts the typical interaction between a user and a coach.

After the user chooses a coach and specifies her objectives and current physical skills, the coach receives the user’s data, creates a tailored workout plan, and sends it to the sportsperson’s app. (See points 1 and 2 in the figure)

When the user receives the workout plan, a virtual personal trainer guides her to correctly complete the workout and the mobile app records training data. (See points 3 and 4 in the figure).

At the end of the workout, the coach receives training statistics and remotely monitors the user’s performance, modifies the workout (if needed), and motivates her by means of the internal messaging system. (See point 5 in the figure)

In this flow, our solution is meant to enrich step 4, by filtering the workout results, so that they are sent to the coach in ranked order.

Table 1: Samples count for each rating

Rating  Count
1       434
2       1356
3       2145
4       2681
5       1565

 

Dataset used for the simulation 

This section describes the data and the features used in this study.

Our research is based on a real-world dataset containing 47555 activities composing 8669 workouts, performed by 412 users (i.e., each workout is composed of several activities).

The coaches in the platform evaluated these workouts by assigning a rating ranging between 1 (poorly performed) and 5 (well performed). Ratings were distributed as described in Table 1, where “count” indicates the number of samples having the corresponding rating.

As we are dealing with real-world data, we encountered the problem of class imbalance. We deal with this phenomenon before the classification process, as described in Section 5.2.

For each user, the platform collects the following features: (u1) user ID, (u2) user birth date, (u3) user gender, (u4) user height, (u5) user weight. Trivially, all these features (except u1 and u2) can be updated over time.

The features collected by the platform at workout level are: (w1) workout ID and (w2) burnt calories. Each activity that forms a workout, instead, is characterized by the following features: (a1) activity ID, (a2) distance objective (in meters, indicating the distance goal given by the coach to the sportsperson for that activity), (a3) covered distance (in meters), (a4) speed objective (in km/h, indicating the speed goal given by the coach to the sportsperson for that activity), (a5) average speed (in km/h), (a6) time objective (in seconds, indicating the time goal given by the coach to the sportsperson for that activity), (a7) time elapsed (in seconds), (a8) pace objective (in min/km, indicating the pace goal given by the coach to the sportsperson for that activity), (a9) average pace (in min/km), (a10) activity type (either walking, running, or resting), (a11) activity label (either pace, distance, time, or unknown, indicating the type of objective the activity has; the unknown label is taken by those activities that do not have an objective). Note that, according to the activity label (feature a11), features a2, a4, a6, and a8 can be equal to 0.

From all the workouts in the dataset, we removed all those that met at least one of the following constraints, since they are not reliable: (i) covered distance > 420000 meters, (ii) workout duration > 10800 seconds, (iii) rest time > 3600 seconds, (iv) average speed > 16 km/h, (v) maximum speed > 60 km/h, (vi) minimum length < 0, (vii) burnt calories > 3000. The final dataset contains 8181 workouts.
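The filtering step above can be sketched as a boolean mask over a tabular dataset. The column names below are illustrative (the report does not specify the schema), but the seven thresholds are the ones listed in constraints (i)-(vii):

```python
import pandas as pd

# Toy workout records; column names are assumptions, thresholds are from the report.
workouts = pd.DataFrame({
    "covered_distance_m": [5000, 500000, 8000],
    "duration_s":         [3600, 5000, 20000],
    "rest_time_s":        [600, 200, 100],
    "avg_speed_kmh":      [9.5, 12.0, 10.0],
    "max_speed_kmh":      [15.0, 50.0, 14.0],
    "min_length_m":       [0.0, 1.0, 2.0],
    "burnt_calories":     [450, 900, 700],
})

# A workout is unreliable if it meets at least one of constraints (i)-(vii).
unreliable = (
    (workouts["covered_distance_m"] > 420000)   # (i)
    | (workouts["duration_s"] > 10800)          # (ii)
    | (workouts["rest_time_s"] > 3600)          # (iii)
    | (workouts["avg_speed_kmh"] > 16)          # (iv)
    | (workouts["max_speed_kmh"] > 60)          # (v)
    | (workouts["min_length_m"] < 0)            # (vi)
    | (workouts["burnt_calories"] > 3000)       # (vii)
)
clean = workouts[~unreliable]
```

In this toy example the second and third rows violate constraints (i) and (ii) respectively, so only the first row survives.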



The model 

Given the raw features available in our dataset and presented in the previous section, the next goal is to model each workout through some feature engineering. In order to model workouts, we grouped all the activities belonging to each workout and excluded the activities that have resting as activity type (feature a10) since, according to coaches, they are not considered when evaluating workout quality; for this reason, they should not be part of our user modeling and user profile enrichment techniques.

In the following, we describe the features we created, and how they are derived from the original ones. We start with the workout ID (feature f 1, directly derived from w1), and continue with the following categories:

Distance-based features: (f 2) distance objective, built as the sum of the distance objectives of the activities of the considered workout (feature a2); (f 3) covered distance, built as the sum of all the covered distances of the activities of the considered workout (feature a3); (f 4) distance gap, built by first creating the difference between the distance objective (feature a2) and covered distance (feature a3) for each activity in the workout, and then averaging the obtained values (this feature indicates how well the user respected her distance objective).

Speed-based feature: since speed and pace are highly correlated measures (pace is the inverse of speed), in this study we focus on pace, which better reflects the user performance. For this reason, we introduce only one feature related to speed: (f 5) average speed, built as the average of all the speeds of the activities of the considered workout (feature a5).

Temporal features: (f 6) time objective, built as the sum of the time objectives of the activities of the considered workout (feature a6); (f 7) time elapsed, built as the sum of all the time the user has taken to complete the activities of the considered workout (feature a7); (f 8) temporal gap, built by first creating the difference between the time objective (feature a6) and elapsed time (feature a7) for each activity in the workout, and then averaging the obtained values (this feature indicates how well the user respected her time objective).

Pace-based features: (f 9) pace objective, built as the average of the pace objectives of the activities of the considered workout (feature a8); (f 10) average pace, built as the average of the paces of the activities of the considered workout (feature a9); (f 11) pace gap, built by first creating the difference between the pace objective (feature a8) and average pace (feature a9) for each activity in the workout, and then averaging the obtained values (this feature indicates how well the user respected her pace objective); (f 12) pace gap variance, measured as the variance of the pace gaps in each activity considered to compute feature f 11 (this feature indicates how much the individual values are far from the average), (f 13) pace gap standard deviation, measured as the standard deviation of the pace gaps in each activity considered to compute feature f 11 (this feature also indicates how much the individual values are far from the average, but it is expressed in the same units as the data).
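The aggregation pattern shared by the distance, temporal, and pace features above (sum the objectives, sum the achieved values, and average the per-activity gaps) can be sketched for the distance-based case. Field names are illustrative; only the construction of f 2, f 3, and f 4 follows the report:

```python
import pandas as pd

# Toy activity records for two workouts; column names are assumptions.
acts = pd.DataFrame({
    "workout_id":     [1, 1, 2],
    "type":           ["running", "walking", "running"],
    "distance_obj_m": [1000.0, 2000.0, 3000.0],   # feature a2
    "covered_m":      [900.0, 2000.0, 3100.0],    # feature a3
})

# Resting activities would be excluded here, as described in the text.
run = acts[acts["type"] != "resting"].copy()
run["gap"] = run["distance_obj_m"] - run["covered_m"]  # per-activity gap

feats = run.groupby("workout_id").agg(
    f2_distance_objective=("distance_obj_m", "sum"),
    f3_covered_distance=("covered_m", "sum"),
    f4_distance_gap=("gap", "mean"),   # mean of per-activity gaps
)
```

For workout 1 this yields f 2 = 3000, f 3 = 2900, and f 4 = 50 (the mean of the gaps 100 and 0); the temporal (f 6-f 8) and pace (f 9-f 11) features follow the same groupby-and-aggregate shape.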

Workout characteristics: (f 14) walking activities’ percentage, built by considering the percentage of activities in a workout where feature a10 is equal to walking; (f 15) running activities’ percentage, built by considering the percentage of activities in a workout where feature a10 is equal to running; (f 16) percentage of activities with an objective, measured as the percentage of activities in a workout where feature a11 is not equal to unknown; (f 17) percentage of well-performed activities, measured as the percentage of activities in a workout that have any type of gap equal to 0; (f 18) burnt calories, which is directly derived from feature w2.

User characteristics: (f 19) user age, created using feature u2, in order to contextualize the workout performance with the age of the user; (f 20) user gender, directly computed from feature u3; (f 21) user height, directly computed from feature u4; (f 22) user weight, directly computed from feature u5; (f 23) user BMI, computed using features f 21 and f 22; (f 24) user weight condition, which takes the following values: 8, to indicate high-risk obesity (if f 23 ≥ 40); 7, to indicate moderate-risk obesity (if 35 < f 23 ≤ 39.99); 6, to indicate low-risk obesity (if 30 < f 23 ≤ 34.99); 5, to indicate overweight (if 25 < f 23 ≤ 29.99); 4, to indicate regular weight (if 18.50 < f 23 ≤ 24.99); 3, to indicate mild thinness (if 17 < f 23 ≤ 18.49); 2, to indicate moderate thinness (if 16 < f 23 ≤ 16.99); 1, to indicate severe thinness (if f 23 < 16).
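Features f 23 and f 24 can be sketched directly from their definitions. The function names are illustrative; the ordinal codes and BMI bands are those listed above (the thresholds are applied as lower bounds, which matches the report's bands up to rounding):

```python
def bmi(height_m: float, weight_kg: float) -> float:
    """f23: body mass index computed from f21 (height) and f22 (weight)."""
    return weight_kg / (height_m ** 2)

def weight_condition(bmi_value: float) -> int:
    """f24: ordinal weight-condition code, following the bands in the report."""
    if bmi_value >= 40:    return 8  # high-risk obesity
    if bmi_value >= 35:    return 7  # moderate-risk obesity
    if bmi_value >= 30:    return 6  # low-risk obesity
    if bmi_value >= 25:    return 5  # overweight
    if bmi_value >= 18.5:  return 4  # regular weight
    if bmi_value >= 17:    return 3  # mild thinness
    if bmi_value >= 16:    return 2  # moderate thinness
    return 1                         # severe thinness
```

For example, a user 1.75 m tall weighing 70 kg has a BMI of about 22.9, which maps to code 4 (regular weight).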

Normalization 

The aim of normalization is to transform the numeric values in a dataset to a standard scale, without distorting differences in the ranges of values.

Most classifiers perform better when the features are normalized. To obtain a normal distribution of the numerical data, we used the Yeo-Johnson power transformation, which is given by:

ψ(y, λ) = ((y + 1)^λ − 1) / λ                   if λ ≠ 0, y ≥ 0
          log(y + 1)                            if λ = 0, y ≥ 0
          −[(−y + 1)^(2−λ) − 1] / (2 − λ)       if λ ≠ 2, y < 0
          −log(−y + 1)                          if λ = 2, y < 0

Table 2: Samples count for each rating in the training set after sampling

Rating  Count
1       2123
2       2089
3       2068
4       2046
5       2071


An advantage of this method is that it works with both positive and negative values. So, we applied the Yeo-Johnson transformation to all non-categorical features (i.e., to all features except f 1, f 20, and f 24).
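A minimal sketch of this normalization step with scikit-learn's PowerTransformer, which implements the Yeo-Johnson transformation and, by default, also standardizes the output to zero mean and unit variance (the input matrix is illustrative; note the negative value is handled):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Illustrative skewed feature column; Yeo-Johnson accepts negative values.
X = np.array([[1.0], [2.0], [3.0], [50.0], [-4.0]])

pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_t = pt.fit_transform(X)  # lambda is estimated per feature by maximum likelihood
```

In a real pipeline this would be fit on the training set only and applied to every non-categorical feature column.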

Sampling 

The learning phase, and consequently the predictions, of most machine learning classifiers may be biased towards the classes most frequently present in the dataset.

Researchers have suggested two main approaches to deal with data imbalance: the first consists of re-balancing the data by sampling, and the other is to tweak the learning algorithm. Due to its effectiveness on our data, we employed the first approach.

More specifically, we considered the oversampling approach, since it has proved more effective for small datasets. We opted for the Synthetic Minority Over-sampling Technique with Tomek links (SMOTETomek), a combination of over- and under-sampling using SMOTE and Tomek links: it creates completely new samples instead of replicating existing ones, and eliminates only examples belonging to the majority class, which offers the classifier more examples to learn from.

This means that the minority-class examples are over-sampled by introducing synthetic examples built from the k nearest minority-class neighbors of each minority-class sample, while majority-class examples are under-sampled.

After applying SMOTETomek, we obtained a more balanced rating distribution in the training set, as described in Table 2, where “count” indicates the number of samples having the corresponding rating. As the table reports, the rating distribution in the training set is now well balanced.

Simulation framework 

This section describes the experiments performed to validate our methodology.

Experimental Setup and Strategy 


The experimental framework exploits the Python scikit-learn 0.19.1 library. The experiments were executed on a computer equipped with a 3.1 GHz Intel Core i7 processor and 16 GB of RAM. To balance the data we applied SMOTETomek, using imbalanced-learn, a package that provides a set of re-sampling approaches for datasets showing high class imbalance. To normalise the features in our dataset, we applied the Yeo-Johnson transformation provided by the PowerTransformer class in the scikit-learn library.

The input w with the coach-in-the-loop insights is structured as follows (note that these insights start from f 2, since the workout ID, f 1, is only an index): w = [5, 5, 10, 5, 5, 5, 10, 5, 5, 10, 10, 10, 5, 5, 9, 6, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5]. In order to implement the fit function in line 2 of our algorithm, we use the function svm.LinearSVC (for the sake of simplicity, in the results we denote this algorithm as RankSVM, specifying whether it is the version with or without coach-in-the-loop insights). We validate this choice by plugging different fit functions implementing other classifiers into our algorithm and comparing the results. More specifically, we considered the following:

● linear_model.LogisticRegressionCV (named RankLR);

● linear_model.RidgeClassifierCV (named RankRC);

● ensemble.GradientBoostingClassifier (named RankGBC);

● ensemble.ExtraTreesClassifier (named RankETC);

● ensemble.RandomForestClassifier (named RankRFC).

All the classifiers were run with the default parameters.
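A minimal sketch of this comparison: the candidate fit functions listed above, each with default parameters as in the report. The synthetic dataset and cross-validated accuracy below are illustrative only; the report's actual evaluation plugs these classifiers into Algorithm 1 and scores the resulting rankings with nDCG:

```python
from sklearn import svm, linear_model, ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Illustrative stand-in for the enriched workout feature matrix.
X, y = make_classification(n_samples=200, random_state=0)

# The alternative fit functions compared in the report, default parameters.
candidates = {
    "RankSVM": svm.LinearSVC(),
    "RankLR":  linear_model.LogisticRegressionCV(),
    "RankRC":  linear_model.RidgeClassifierCV(),
    "RankGBC": ensemble.GradientBoostingClassifier(),
    "RankETC": ensemble.ExtraTreesClassifier(),
    "RankRFC": ensemble.RandomForestClassifier(),
}

scores = {name: cross_val_score(clf, X, y, cv=3).mean()
          for name, clf in candidates.items()}
```

Swapping the fit function while keeping the rest of the algorithm fixed is what allows the per-classifier comparison reported in Table 4.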

As mentioned in Section 6, in order to validate our choice of modeling the problem as a binary classification, we also evaluate Algorithm 1 in its multiclass version, thus removing line 1 and training the fit function with the original X matrix and y vector.
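The binary formulation collapses the 1-5 ratings into two classes before training. The report does not state the cut-off explicitly, so the threshold below (ratings 1-2 mean "needs support") is an assumption for illustration:

```python
import numpy as np

# Assumed cut-off: ratings 1-2 = needs support, 3-5 = performed well.
NEEDS_SUPPORT_MAX_RATING = 2

def binarize(ratings):
    """Map 1-5 workout ratings to binary labels (1 = needs support)."""
    r = np.asarray(ratings)
    return (r <= NEEDS_SUPPORT_MAX_RATING).astype(int)
```

The multiclass variant evaluated here simply skips this step and trains the fit function on the original five-valued y vector.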

In addition to this, we also consider an alternative of Algorithm 1 that is not fed with the coach-in-the-loop insights (hence, line 4 is not part of the algorithm, and the ranking is based on v, instead of c).

To summarize, our experimental strategy consists of the following four sets of experiments:

1. Algorithm with or without coach in the loop. We evaluate Algorithm 1 in its original version and without the coach-in-the-loop insights, in order to evaluate their effectiveness in the classification process.

2. Algorithm against other binary classification based algorithms. We compared the versions of the algorithm using alternative binary classifiers, in order to understand which one performs better in terms of nDCG.


3. Algorithm against its multiclass version. We compared the versions of the algorithm using binary classification with the one using multiclass classification, in order to know how the type of classification affects the ranking quality.

4. Feature importance evaluation. After choosing the most effective algorithm, we took away the least important features one by one, and evaluated the ranking nDCG, to check how the less relevant features affected the effectiveness of the ranking.

Metric 

The effectiveness of a ranking model is evaluated by comparing the ranking lists output by the model with the ranking lists given as ground truth, using measures such as Mean Average Precision (MAP), Discounted Cumulative Gain (DCG), and normalized Discounted Cumulative Gain (nDCG). We measure nDCG using an exponential gain and a logarithmic decay based on the graded relevance judgments. In our case, nDCG at position k is defined as:

nDCG@k = (1/N) Σ_{j=1..k} (2^rel(uj) − 1) / log2(j + 1)

where N is the maximum possible DCG given the known relevant users, uj is the jth-ranked user returned by R, and rel(uj) is the binarized relevance assessment of this user.

nDCG values range between 0 and 1.

Table 3 shows the nDCG@5 and nDCG@10 values for our proposed algorithm and its version without coach-in-the-loop insights. Results show that the version that considers the insights coming from the coach always outperforms the version that only uses the automatically-detected feature relevance.

Table 3: RankSVM with and without expert in the loop

Algorithm  With coach  Without coach
nDCG@5     0.87        0.86
nDCG@10    0.69        0.67

Algorithm against other binary classification based algorithms. 

Given the results of the previous experiment, in what follows we take into account the algorithms including the coach-in-the-loop insights.
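The metric itself can be implemented directly from the definition above. This is a sketch assuming the standard formulation with exponential gain 2^rel − 1 and log2(j + 1) position discount; relevance lists here are the binarized judgments described in the text:

```python
import math

def dcg_at_k(relevances, k):
    """DCG with exponential gain and logarithmic position decay.

    relevances: graded (here binarized) relevance of the ranked users,
    in the order produced by the ranking algorithm.
    """
    return sum((2 ** rel - 1) / math.log2(j + 2)      # j is 0-based here
               for j, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the produced ranking over the ideal (maximum) DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places all relevant users first scores 1.0, and any relevant user pushed down the list lowers the score, which is why nDCG@k suits the "support the worst performers first" objective.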


Table 4: Binary classifiers comparison

Algorithm  nDCG@5  nDCG@10
RankSVM    0.87    0.69
RankRC     0.77    0.65
RankLR     0.87    0.69
RankGBC    0.72    0.60
RankETC    0.76    0.64
RankRFC    0.78    0.65

Table 4 shows that RankSVM and RankLR are the algorithms giving the most relevant ranking, with an nDCG@5 of 87% and an nDCG@10 of 69%. This means that the coach would be able to properly support the sportspeople she follows, since we correctly rank the users that need timely support in 87% or more of the cases when providing her with a list of five users, and in 69% or more of the cases when providing her with a list of ten users.

Algorithm against its multiclass version. 

Since RankSVM is the algorithm that performs best, we compared the performance of RankSVM using binary classification and using multiclass classification.

Table 5: Binary SVM vs Multiclass SVM

Algorithm   nDCG@5  nDCG@10
Binary      0.87    0.69
Multiclass  0.15    0.31

As Table 5 shows, the version based on binary classification outperforms the one based on multiclass classification in terms of both nDCG@5 and nDCG@10.

Feature importance evaluation. 

At each iteration of this experiment, we removed the least important feature (as reported in the w vector), to see how it affected the effectiveness of our algorithm. Table 6 reports the nDCG@5 and nDCG@10 after removing the feature indicated in the third column.

Results show that removing features does not improve the effectiveness of the algorithm, since the nDCG values did not get any better when features were removed.
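The ablation loop described above can be sketched as follows. The importance vector w, the synthetic dataset, and the cross-validated accuracy used as the score are all illustrative stand-ins; the report's actual experiment re-ranks with Algorithm 1 and scores with nDCG after each removal:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Illustrative data and an assumed importance vector w (one weight per feature).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
w = np.array([0.9, 0.1, 0.5, 0.3, 0.8, 0.2])

remaining = list(range(X.shape[1]))
history = []
while len(remaining) > 1:
    # Drop the least important feature still in play, then re-evaluate.
    least = min(remaining, key=lambda i: w[i])
    remaining.remove(least)
    score = cross_val_score(LinearSVC(), X[:, remaining], y, cv=3).mean()
    history.append((least, score))
```

Each (removed feature, score) pair in `history` corresponds to one row of Table 6, which lets one check whether any removal ever improves the ranking quality.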

Table 6: Results returned by removing the least important features

nDCG@5  nDCG@10  Removed feature
0.84    0.66     f 2
0.84    0.67     f 14
0.80    0.66     f 15
0.86    0.69     f 7
0.84    0.66     f 5
0.82    0.67     f 3
0.83    0.65     f 10
0.83    0.66     f 9
0.84    0.68     f 6
0.84    0.67     f 17
0.84    0.66     f 24
0.84    0.65     f 19
0.84    0.65     f 21
0.85    0.66     f 22
0.83    0.65     f 20
0.87    0.66     f 23
0.62    0.57     f 18
0.78    0.65     f 16
0.79    0.65     f 11
0.79    0.66     f 12
0.79    0.65     f 13
0.78    0.70     f 8
0.36    0.43     f 4
