
Computer Science Department Master Degree in Computer Science

ICT Solution Architect Curricula

Topic Based News Recommendation

SUPERVISOR:

Prof. Paolo Ferragina

CANDIDATE: Lorenzo Bellomo

Autumn Session Academic Year 2019/2020


ABSTRACT

Content-based recommendation algorithms for news articles fail to take human knowledge into account. This is due to two facts: topic annotators are still a somewhat recent research field, and more research effort has gone into the development of methodologies for user-based approaches. The work of this thesis is focused on topic-based recommendation of news, where topics are real world things or concepts, as extracted from a public knowledge base (Wikipedia). The proposed methodology applies salient topic detection, knowledge graphs, graph embeddings, and ranking methodologies to recommend items according to the human knowledge carried by the input news. We evaluate this approach on real world data coming from the European Broadcasting Union (EBU), and on a well-known dataset for this task, and show that it improves results over state-of-the-art solutions by up to 10% in F1 score.


CONTENTS

1 Introduction
  1.1 Outline of this Thesis
2 State of the Art
  2.1 Bag of Words
    2.1.1 Text Pre-processing
    2.1.2 Slight TF-IDF Variations
    2.1.3 Other Metrics
    2.1.4 Keyword Extraction
    2.1.5 Named Entity Recognition
    2.1.6 PoS Tagging
    2.1.7 Summarization
    2.1.8 Latent Semantic Analysis
  2.2 Embeddings
    2.2.1 Word2Vec
    2.2.2 Universal Sentence Encoder and Doc2Vec
    2.2.3 Wikipedia2Vec
    2.2.4 DeepWalk
  2.3 Knowledge Graphs
  2.4 Graph-based Recommendation
    2.4.1 Graph Edges
    2.4.2 Document Relatedness Measures
    2.4.3 SED Recommendation
  2.5 Topic Extraction and Manipulation
    2.5.1 TagMe
    2.5.2 SWAT
  2.6 PageRank
    2.6.1 Personalized PageRank
  2.7 Hierarchical Density-Based Clustering
    2.7.1 DBSCAN
    2.7.2 HDBSCAN
3 Infrastructure
  3.1 The EBU
  3.2 Eurovox Project
  3.3 News Project
  3.4 Peach Team
    3.4.1 Main Products
    3.4.2 Data Science Platform
    3.4.3 Codex and Spectrum
    3.4.4 Sprints
4 Recommendation Process
  4.1 Problem Statement and Assumptions
  4.2 Preprocessing Phase
    4.2.1 Topic Extraction
    4.2.2 Topic Filtering and Outlier Removal
    4.2.3 Build the Entity-Article Graph
  4.3 Query Time
    4.3.1 Candidate Selection
    4.3.2 Subgraph Creation
    4.3.3 Article Scoring
  4.4 Graph Design and Implementation
    4.4.1 Graph Design
    4.4.2 Graph Representation Choice
    4.4.3 Implementation Details
5 Evaluation
  5.1 Manual Evaluation
    5.1.1 News Project Dataset
    5.1.2 Swedish Radio Dataset
  5.2 CNRec Dataset
    5.2.1 Evaluation Process
    5.2.2 Results Analysis
  5.3 Time and Space Performances
6 Conclusion
  6.1 Future Works


LIST OF FIGURES

Figure 1: Screenshot of the Spectrum UI (with German BR data)
Figure 2: The Data Science Platform Components
Figure 3: Task and Endpoint Workflow representation
Figure 4: Topic Based Recommendation: Example 1
Figure 5: Topic Based Recommendation: Example 2

LIST OF TABLES

Table 1: Other TF metrics
Table 2: Other IDF metrics
Table 3: Number of Good Recommendations according to the algorithms
Table 4: Number of Good Recommendations in the ground truth
Table 5: F1 measure over the CNRec dataset
Table 6: Precision measure over the CNRec dataset
Table 7: Recall measure over the CNRec dataset
Table 8: Dataset information and data
Table 9: Time and Memory occupancy for the different implementations

LISTINGS

Listing 1: Entity Similarity Matrix
Listing 2: Indexing the matrix
Listing 3: Normalization by row


1 INTRODUCTION

Content-to-content recommendation has always been one of the most studied problems in the context of Information Retrieval. Designing good recommendation algorithms has been key, as many widespread applications rely increasingly on recommended content rather than on content retrieved in response to user queries.

YouTube replaced its welcome page, which used to show "your subscriptions", with the recommended content page in 2013. Spotify produces daily, weekly, and monthly discovery playlists to improve the user experience. Facebook changed its timeline from being sorted chronologically to being sorted by relevance, and shortly after Twitter followed the same philosophy. Instagram promotes an Explore tab which allows users to consume recommended content. The Netflix home page greets you with a "you might like" category, and actively recommends content based on users' watch lists.

But this does not mean that the problem is a new one, as content recommendation predates even computers. Newspapers have always had to deal with this problem: they provide several pages of content, but have to fit "the best" information on the front page in order to capture the reader's taste. Of course, newspapers have the problem of dealing with a static front page, with no ability to "personalize" it. The best they could do was to create a single profile of their average reader, and tailor the front page to attract their attention as much as possible. For example, a newspaper with a Democratic average audience would push different topics than a Republican one.

Online solutions solve the "low personalization" problem by using dynamic pages, which allow content providers to guarantee that every individual user receives the best recommendation possible for their taste.

Historically, three approaches have been proposed for the problem of content recommendation in Computer Science:

1. Based on User Data: Content is recommended using user data, user ratings (explicit or implicit) and trending algorithms to decide which items to prioritize in the recommendation. This approach generally yields better results (as measured by click-through rate), but suffers from the cold start problem, as the recommendation quality depends on the amount of data available (which is small and sparse at the start).

2. Fully Content-Based: In this solution, no user data is used to select the recommendation. Content is recommended starting either from a single piece of content the user is consuming or from a profile of the given user. This solution does not suffer from the cold start problem.

3. Hybrid Approaches: These solutions combine concepts from the previously described approaches.

Classic solutions for content-based recommendation fail to integrate human knowledge into the equation. Pure content-based recommendation has historically been tackled with a bag-of-words approach. In this solution, text is modeled as a set of textual tokens, and text similarity is computed using various metrics which exploit this representation. This effectively means that text similarity is syntactic: the more tokens are shared between two texts, the higher the likelihood that they are similar. In the last 10 years, Natural Language Processing (NLP) tasks regarding topic recognition and detection have come a long way, and many industrial applications integrate public Knowledge Bases (such as Wikipedia or WikiData) and NLP tools in their text mining algorithms.

Recent knowledge extraction algorithms exploit available Knowledge Bases in order to capture the real world topics present in text. This kind of task is hard for a machine because of the problem of disambiguation, which is made difficult by the intrinsic ambiguity of human language.

Humans use various words to express the same concept, and rely on the listeners' common sense to apply this disambiguation process correctly. The word "Leonardo", for example, refers to the famous scientist "Leonardo da Vinci" in the phrase "Leonardo painted the Mona Lisa", but refers to "Leonardo DiCaprio" in the phrase "Leonardo's performance was Oscar worthy". A solution for disambiguation has to rely on the textual context of the input query text, and maximise the probability that each one of the entities located in the text actually refers to that specific concept.

Advancements in the field of topic recognition, both on structured and unstructured texts, have made it possible to expand the classic approach for content-based recommendation to handle real world entities drawn from Knowledge Bases.

The work of this thesis proceeds in this research flow and is focused on topic-based recommendation of news. The work fits in a collaboration with the EBU (European Broadcasting Union), which is an alliance of over 70 European broadcasters (including RAI from Italy and the BBC from the United Kingdom). The specific team we collaborated with for the purpose of this thesis is the Peach Team, which handles a personalization and recommendation ecosystem.

The specific EBU project that inspired the idea of the thesis is the News Project. The aim of this project consists in building a unified news portal for content coming from various European broadcasters, with a specific focus on content accessibility (in various European languages), GDPR compliance, and content-based algorithmic paradigms (in opposition to other solutions coming from big tech companies, usually based on Collaborative Filtering).

For the purpose of a news portal, the most important algorithms to develop are clustering, recommendation, and search. Clustering approaches based on topics have already been proposed in the literature [1], so the aim of the thesis is to propose a recommendation algorithm inspired by previously proposed topic-based approaches like [1] and [2], but specifically built to be fully content-based.

The algorithm we propose performs topic annotation on the news dataset (both salient [3] and not [4]), filters out topic outliers using HDBSCAN [5], and then builds a graph with two types of nodes: entities and news articles. Articles are linked to the entities found in them during the topic annotation phase, and entities are linked together and weighted via their similarity, estimated with entity embeddings [6, 7] and standard relatedness measures [8]. Whenever the system has to provide recommendations for a news article (or a query), we perform a candidate selection process based both on textual similarity between articles and on the relevant entities in the queried news. We then extract a vertex-induced subgraph from the one built before, and rank news articles by means of a special instantiation of the Personalized PageRank algorithm [9], a Random Walk-based algorithm used to assign weights to nodes in a graph.

We deploy this algorithm on several datasets and implement the overall algorithmic pipeline on different IT infrastructures.

We first deploy this algorithm on the EBU News Project dataset (the one related to the news portal), which has over 17 000 (and growing) news articles coming from various European broadcasters, translated into English with another tool developed within the EBU: the Eurovox Toolbox. This dataset is the testing ground for the one that will power the news portal. We tested the algorithm and tweaked the parameters on this specific dataset, and the results we obtained on it are the ones we showed to the broadcasters during the thesis period.

We also deployed this algorithm on a standard dataset for the content recommendation problem, namely the CNRec (Content News Recommendation) dataset [10]. This dataset consists of 300 news articles from a fixed time range, specifically built to benchmark content-based news recommendation algorithms. We benchmark our algorithm on this dataset and measure Precision, Recall and F1 score with respect to the "good recommendation" ratings of six human annotators. We scored an improvement of up to 10% in F1 measure over standard bag-of-words approaches.

The last step was a deployment of our proposed algorithm on the EBU Peach infrastructure, specifically tailored for the Swedish Radio dataset. This is by far the biggest dataset we worked on to date (more than 100 000 news articles). The dataset has been tagged using TagMe in two ways:

• Directly in Swedish, using a private local installation of TagMe specifically modified to work in Swedish.

• Having translated the whole dataset into English (using the previously mentioned Eurovox Translation Toolbox), and then applying the topic annotation phase on the English version of the Swedish news.

Having ported this algorithm to this specific dataset allowed us to use the Peach Team recommendation evaluation tool (Spectrum) for testing the algorithm and its results. This is the platform that broadcasters use to manually evaluate and test the quality of the results of their algorithms, allowing them to directly (visually) compare the output of different methods.

1.1 Outline of this Thesis

The thesis is organized in the following parts:

• Chapter 2 (State of the Art) shows the state of the art for the problem of content-based recommendation, along with some algorithms and tools that will be used for the purpose of this thesis. Section 2.1 shows a set of techniques for standard bag-of-words modeling of text. Section 2.2 shows a set of powerful Machine Learning based embedding vectors that are widely used in the context of Natural Language Processing. Sections 2.3 and 2.4 introduce the concept of knowledge graphs, along with a proposed algorithm for content-based recommendation based on them. Section 2.6 introduces the PageRank algorithm as a tool to apply weights to nodes in a network. Section 2.7 explains the HDBSCAN algorithm, a novel clustering algorithm that applies ideas from DBSCAN and some from hierarchical clustering.

• Chapter 3 (Infrastructure) is focused on explaining what the EBU is, as well as its most relevant projects with respect to this thesis. Section 3.1 shows what the EBU is and its goals. Section 3.2 explains the Eurovox project and its toolbox, which we used extensively for translation purposes. Section 3.3 explains the News Project, which aims at building a unified European news portal and also happens to be the one that sparked the idea of this project. Section 3.4 shows the Peach Team, the one we mostly collaborated with, and their infrastructure.


• Chapter 4 (Recommendation Process) explains in detail our algorithm for topic-based recommendation. Section 4.1 explains the main problem setting and the assumptions we are allowed to make on our datasets. Section 4.2 explains the preprocessing phase, that is, the actions we perform at system startup. Section 4.3 instead explains what happens at query time. Finally, section 4.4 explains the main implementation choices.

• Chapter 5 (Evaluation) explains the evaluation process we performed on our algorithm. We performed three kinds of evaluations. The first one, where we manually check the results, is shown in section 5.1. Section 5.2 shows instead the evaluation we did on a standard dataset for the task of content-based recommendation. Section 5.3 shows the performance evaluation, mainly focusing on space occupation and time measurement.

• Chapter 6 (Conclusion) recaps the work done in this thesis, and proposes some potential research directions as future works in section 6.1.


2 STATE OF THE ART

In this chapter we will present the state of the art for content-based recommendation. We will also, at the end of the chapter, introduce a set of algorithms and tools which have been extensively used throughout the design of the proposed solutions to the problem.

Before diving into the specifics of the state of the art for news recommendation, we must state that the approach proposed by this thesis is purely content-based, so this section only covers the literature regarding content-based approaches. Any approach which relies on user data (click-through rate, explicit/implicit feedback or any other user-related metrics) will not be considered. Approaches based on trending articles or collaborative filtering are state of the art for recommendation but will not be considered for the purpose of this thesis. Nevertheless, they constitute orthogonal approaches and thus could be combined with the ones described in the thesis.

We are going to show approaches that handle text as a bag of words, and ones that handle semantic relations based on Named Entity Recognition (NER) and NLP tools hinging on Neural Networks.

2.1 Bag of Words

The first approach we are going to dive into is the standard NLP approach that handles text as a bag of words. Any text mining operation performed on content is done by exploiting purely syntactic features. The most famous metric for bag-of-words mining is TF-IDF [11]. This metric leverages two terms: the term frequency (TF), which takes into account the presence of a term in a text, and the inverse document frequency (IDF), which down-scales the importance of common words. The TF-IDF metric of a term t in a document D is defined as:

$\text{TF-IDF}(t, D) = TF(t, D) \cdot \log \frac{n}{n_t}$

Where TF(t, D) is the number of times in which term t appears in document D, n is the total number of documents in the collection, and nt is the number of documents that contain the term t.

This metric was proposed based on the observation that, if a very uncommon word is shared between two documents, then the likelihood that those two documents are similar is much higher than if they shared a very common one. For example, two documents sharing the word Netflix does not necessarily imply that they are similar, while sharing House of Cards is a much stronger signal.

(13)

2.1 bag of words 7

This metric depends on the term, so it is perfect for word-document similarity. In the case of document-document similarity, the solution is to model documents as vectors. The vector length equals the total size of the collection's word dictionary, and entry i of the vector is the TF-IDF score of term i in the given document. Given that, and by modeling each document as a vector in the same n-dimensional space (with n the size of the dictionary), we can easily compute their relatedness using well known metrics such as cosine similarity. The cosine similarity between two vectors $\vec{q}$ and $\vec{p}$ is:

$\cos(\vec{q}, \vec{p}) = \frac{\vec{q} \cdot \vec{p}}{\|\vec{q}\|\,\|\vec{p}\|} = \frac{\sum_{i=1}^{n} q_i p_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} p_i^2}}$
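To make the two formulas concrete, the following minimal Python sketch builds TF-IDF vectors over a shared dictionary for a toy three-document corpus (not taken from the thesis datasets) and compares documents with cosine similarity.

```python
import math
from collections import Counter

# Toy corpus; in the thesis setting these would be news articles.
docs = [
    "netflix released a new season of house of cards",
    "house of cards is a political drama on netflix",
    "the local council approved a new housing plan",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency n_t for every term t.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def tfidf_vector(doc):
    """TF-IDF(t, D) = TF(t, D) * log(n / n_t), as defined above."""
    counts = Counter(doc)
    return [counts[t] * math.log(n_docs / df[t]) for t in vocab]

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

vectors = [tfidf_vector(d) for d in tokenized]
print(cosine(vectors[0], vectors[1]))  # higher: shared rare terms ("netflix", "cards")
print(cosine(vectors[0], vectors[2]))  # lower: little meaningful lexical overlap
```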

Standard content-based solutions inspired by the TF-IDF approach have been evolving since their first proposal.

Several improvements come from the following developments.

2.1.1 Text Pre-processing

The first technique proposed for improving the bag-of-words approach is to apply text pre-processing. The key observation is that words of similar groups represent the same concept. Words like "car" and "cars" could very easily be handled as one sole concept. Standard TF-IDF would use two different vector entries for the two words, which does not leverage the fact that, functionally, they express the same thing. The following techniques can be applied to improve the text mining quality, by abstracting the concept of words into tokens.

TOKENIZATION. The goal of this task is to output a set of valuable tokens (sequences of characters) from an input text. Tokens have to be valuable enough on their own. For example, if the input text is "The Woman lived in New York", then the output tokens are "Woman" and "New York". The main algorithmic problems to solve with tokenization are multi-word tokens and number handling.

STOPWORDS REMOVAL. The goal of this process is the removal of stopwords, which are, in Information Retrieval, a set of very common words which do not add meaning to their context. Their TF-IDF score is usually very low because of their extremely low IDF factor (given by their pervasive presence across documents), but their net semantic value is completely negligible, as no one would rely on words like "the", "a", "an" to label two texts as similar.

NORMALIZATION. The main idea of this step is to capture words like U.S.A. and USA under the same category.


CASE FOLDING. Another very simple but effective step is case folding. Systems handle upper case and lower case letters differently. Reducing all the text to lowercase allows capturing equal words that otherwise would get two different entries in the TF-IDF vector. Exceptions can be applied to specific cases (uppercase mid-sentence), to avoid mixing names and objects (Bush and bush).

THESAURI. Thesauri are also applied to text mining tasks in order to capture word synonyms. For example, with TF-IDF, we might use one single entry for each equivalence class between words: cars and automobiles would fall on the same entry in the vector.

STEMMING. Stemming is the process that reduces words to their root before indexing. Words like "fish", "fishing" and "fisher" are all reduced to a common stem (or root), which is fish. The most well known algorithm for stemming is Porter's algorithm [12]. The goal of this step is to group even more functionally similar words into their roots, in order not to lose similarity hits due to slight changes in the word.

LEMMATIZATION. The process of lemmatization consists in transforming words into their respective lemma. For example, in English, a verb like talk can be conjugated in many ways, like talking, talked, or talks. Verb conjugations, plural nouns, the English possessive and many other grammatical constructs are taken into account during this process.

This kind of task is much harder than the ones presented before. The main problem comes from the fact that different words behave in very different ways. For instance, the word "better" has "good" as its lemma, other words' lemmas coincide with their stem (such as "walking", for which the lemma is "walk"), while other words have different lemmas depending on their context (e.g. "meeting" can refer either to the verb or to the noun). Lemmatization is a language-dependent process, and it requires resources (such as, for example, a dictionary) that might not be easy to find for all languages. It also requires algorithms to be context aware.
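The sketch below strings together several of the steps above (case folding, stop-word removal, stemming, lemmatization) using NLTK and SpaCy. It is not the thesis's own pipeline, just an illustration; it assumes both packages are installed and that the small English SpaCy model "en_core_web_sm" has been downloaded.

```python
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")   # assumed already downloaded
stemmer = PorterStemmer()            # Porter's stemming algorithm

text = "The fishers were fishing near New York while the cars drove by."

tokens = []
for tok in nlp(text):
    if tok.is_stop or tok.is_punct:       # stop-word and punctuation removal
        continue
    lowered = tok.text.lower()            # case folding
    tokens.append({
        "token": lowered,
        "lemma": tok.lemma_.lower(),      # lemmatization (SpaCy, context aware)
        "stem": stemmer.stem(lowered),    # stemming (context independent)
    })

for t in tokens:
    print(t)   # e.g. "cars" -> lemma "car", stem "car"; "fishing" -> stem "fish"
```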

2.1.2 Slight TF-IDF variations

The TF-IDF metric we proposed before is not the only one that was developed. Many other small variations were proposed to apply the same concept. We show in tables 1 and 2 some examples of TF-IDF metric variations, presenting different options for its two factors: TF and IDF.


Table 1: Other TF metrics.

Name                 | TF weight
Binary               | 0, 1
Count                | f_{t,d}
Term Frequency       | f_{t,d} / Σ_{t'∈d} f_{t',d}
Log Normalization    | log(1 + f_{t,d})
Double Normalization | k + (1 − k) · f_{t,d} / max_{t'∈d} f_{t',d}

Table 2: Other IDF metrics.

Name              | IDF weight
Unary             | 1
IDF               | log(N / n_t)
IDF Smooth        | log(N / (1 + n_t)) + 1
IDF Max           | log(max_{t'∈d} n_{t'} / (1 + n_t))
Probabilistic IDF | log((N − n_t) / n_t)

2.1.3 Other Metrics

Another solution is to deploy metrics other than TF-IDF. They usually tackle the bag-of-words problem from another point of view with respect to TF-IDF.

An example of this, which will be used for the purpose of this thesis, is BM25 [13] (where BM stands for Best Matching). The metric is the following:

$BM25(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$

Where:

• f(q_i, D) is the term frequency of term q_i in D.

• IDF(q_i) is one of the IDF versions proposed above in table 2.

• k_1 and b are free parameters, usually picked, when not optimized, as k_1 ∈ [1.2, 2.0] and b = 0.75.

• avgdl is the average document length in the indexed collection.

This metric is considered state of the art for bag-of-words information retrieval. It is based on the probabilistic relevance model [14], which was developed in order to rank texts by estimating the probability that a document d_i is relevant with respect to a query q.
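As an illustration of the formula, here is a self-contained Python sketch of BM25 scoring over a toy corpus. It uses one common smoothed IDF variant (the thesis leaves the IDF choice open, see table 2), and the whitespace tokenization is illustrative rather than a real indexing pipeline.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)            # document frequency of q
        if n_q == 0:
            continue
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed IDF variant
        f = doc.count(q)                                   # term frequency f(q, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (f * (k1 + 1)) / denom
    return score

corpus = [d.split() for d in [
    "netflix released house of cards",
    "the town council met on tuesday",
    "house of cards returns to netflix",
]]
query = "house of cards".split()
for doc in corpus:
    print(bm25_score(query, doc, corpus))
```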

2.1.4 Keyword Extraction

Keyword Extraction is the task of extracting the set of terms (intended as one or more words) that best describe the input text. The most well known algorithm for this task (and also an astonishingly simple one) is the RAKE algorithm [15]. It is based on the following observation:

keywords frequently contain multiple words but rarely contain punctuation or stop words. This algorithm is very fast and unsupervised, which makes it an easy choice when such a task is needed. The algorithm proceeds in the following steps:


1. Candidate Extraction: Candidate keywords are extracted by splitting the document by word delimiters (like whitespaces, tabs, ...). Words are grouped together when they are close and surrounded by the same stopwords.

2. Scoring: Scores for candidate keywords are computed using two metrics: freq(w), the frequency of the word w, and deg(w), the sum over the co-occurrence table row. This row, for word w, is effectively a vector of the size of the dictionary, and entry i is equal to the number of co-occurrences of word w with word i in the set of candidate keywords. The total score for word w is simply deg(w)/freq(w).

3. Adjoining keywords: The candidate selection process is good, but it may split keywords that contain interior stopwords (e.g. "hand of God"). Whenever a pair of keywords adjoin one another at least twice in the same order, they are added as a single keyword. The score of the resulting keyword is the sum of the scores.

4. Result: The top 1/3 of the keywords is returned to the user.
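A simplified sketch of steps 1 and 2 is shown below (the adjoining-keywords step is omitted); the tiny stop-word list and the input sentence are illustrative, and this is not the reference RAKE implementation.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "for", "that", "best"}

def rake_keywords(text):
    # 1. Candidate extraction: split the text on stopwords and punctuation.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    candidates, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                candidates.append(current)
            current = []
        else:
            current.append(w)
    if current:
        candidates.append(current)

    # 2. Scoring: deg(w)/freq(w) per word, summed over the candidate phrase.
    freq, deg = defaultdict(int), defaultdict(int)
    for cand in candidates:
        for w in cand:
            freq[w] += 1
            deg[w] += len(cand)   # co-occurrences inside the candidate, incl. the word itself
    scores = {" ".join(c): sum(deg[w] / freq[w] for w in c) for c in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rake_keywords("Keyword extraction is the task of extracting the terms that describe the input text"))
```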

This algorithm is excellent for easily identifying the key tokens in a text, and it is easy to imagine how similarity between documents can also take into account the number of shared (or similar) keywords.

2.1.5 Named Entity Recognition

Named Entity Recognition (NER) is an NLP technique which tries to locate named entities in an input text. Those named entities are labeled into categories like person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

This task has been widely explored in the context of Information Retrieval, and is usually tackled by splitting the problem in two sub-tasks: recognition and classification. The first one is usually solved via segmentation, by attempting to recognize candidate tokens and grouping together tokens which belong to one functional named entity. The second problem is usually solved with an ontology or any kind of knowledge base.

One of the most widespread implementations of NER is provided by the Python open source library SpaCy, and in this case the implementation relies heavily on Convolutional Neural Networks.
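A minimal SpaCy usage sketch follows; it assumes the small English model "en_core_web_sm" has been downloaded, and the example sentence is illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Leonardo da Vinci painted the Mona Lisa in Florence around 1503.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. ("Leonardo da Vinci", "PERSON"), ("Florence", "GPE")
```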

2.1.6 PoS Tagging

Part of Speech tagging (or PoS tagging for short) is an NLP technique which consists in marking words with their respective part of speech. Words in a phrase are labeled as nouns, verbs, adjectives, and many more grammatical abstractions. Probabilistic models are the most widespread in this kind of application, and the Hidden Markov Model approach is one of the most successful solutions for this problem. Deterministic rule-based approaches do not work because of the intrinsic ambiguity of every human language's rules. The inner-phrase word distribution is estimated for each class co-occurrence. Probabilities are estimated by building parse trees for the input text, and by taking into account stopwords. Classic linguistic studies come in handy when estimating probabilities.

2.1.7 Summarization

Summarization is a well known task in Linguistics, but it is hard when performed automatically by machines. The problem is that of shortening the input text to the bare minimum while keeping all the relevant information of the original content.

Two approaches were developed over the years for text summarization: extractive and abstractive. The former uses phrases and words from the input text and filters out the unnecessary ones to form a summary, while the latter uses advanced NLP techniques to generate a brand new summary from scratch.

TextRank, one of the most widespread algorithms for summarization, belongs to the extractive methods, and is based on the PageRank algorithm (explained in section 2.6). The approach taken by this algorithm is to build a graph from the input text as follows:

• If more input sources are provided, concatenate all the text contained in the articles.

• Split the text into individual sentences.

• Extract sentence embeddings for each sentence.

• Compute similarities between sentence vectors and store them in a matrix.

• Convert the similarity matrix into a graph, with sentences as vertices and similarity scores as edges (this is important to run the Page Rank algorithm).

• Retrieve a certain number of top-ranked sentences to build the final summary.
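A compact sketch of this extractive pipeline is given below. It replaces learned sentence embeddings with plain bag-of-words vectors and relies on networkx's PageRank, so it is an illustration of the idea rather than the reference TextRank implementation.

```python
import math
from collections import Counter
import networkx as nx

def sentence_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def summarize(sentences, top_k=2):
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            sim = sentence_similarity(sentences[i], sentences[j])
            if sim > 0:
                graph.add_edge(i, j, weight=sim)      # similarity matrix as a weighted graph
    ranks = nx.pagerank(graph, weight="weight")       # random-walk ranking of sentences
    best = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]

sentences = [
    "The council approved the new budget on Monday.",
    "The budget includes funds for public transport.",
    "Local residents protested against the transport cuts last year.",
    "The weather was sunny during the meeting.",
]
print(summarize(sentences))
```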

The thesis will make heavy use of extractive techniques, mainly in the field of knowledge extraction. Those algorithms are presented in section 2.5. The approach used by TextRank strongly resembles the work we will propose in chapter 4, and it is for that purpose that we presented it here.


2.1.8 Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a particular NLP technique which leverages matrices and Singular Value Decomposition (SVD) to pack information into fewer dimensions. Its application to NLP is mainly inspired by the first applications of the TF-IDF approaches, which resulted in huge matrices indexing information related to each document and each word in the dictionary.

The goal is to shrink the size of this matrix by mapping the dictionary space to a lower dimensional one, in an attempt to group related concepts together.

This problem is approached from a linear algebraic perspective, by applying SVD to the input matrix. SVD is a factorization of a real matrix M into the product of three matrices, M = USV^T. In this decomposition, given M ∈ ℝ^{m×n}, we have U ∈ ℝ^{m×m}, V ∈ ℝ^{n×n}, and S ∈ ℝ^{m×n}. This decomposition maintains the original input dimensions, but our premise was that of dimensionality shrinking. In this case a variation of the classic SVD factorization, called compact SVD, comes in handy. Given the facts above and r ≤ min(m, n), with r being the rank of the input matrix M, we are able to create a decomposition M = UΣV^T, where U ∈ ℝ^{m×r}, Σ ∈ ℝ^{r×r}, and V ∈ ℝ^{n×r}. In this decomposition, Σ is the r × r diagonal matrix whose diagonal elements are the singular values of M (the square roots of the eigenvalues of MM^T) in decreasing order.

LSA uses SVD, combined with another dimensionality reduction technique, to shrink the size of the matrices. LSA is applied to the main TF-IDF matrix A ∈ ℝ^{m×n}, the terms × documents matrix. From this matrix, we extract two other matrices:

• T = AA^T: the m × m term correlation matrix (entry (i, j) is the correlation between terms i and j). We also define U ∈ ℝ^{m×r} as the matrix of the r eigenvectors of T.

• D = A^T A: the n × n document correlation matrix (entry (i, j) is the correlation between documents i and j). We also define V ∈ ℝ^{n×r} as the matrix of the r eigenvectors of D.

Given those matrices, we have A = UΣV^T, where Σ is the diagonal matrix with the singular values of A in decreasing order. At this point, LSA picks k ≪ r, keeps the k largest values on the diagonal of Σ, and zeroes out all the other diagonal entries. This reduces the decomposition to A_k = U_k Σ_k V_k^T, where U_k ∈ ℝ^{m×k}, V_k ∈ ℝ^{n×k} and Σ_k is as defined above. This shrinks the dimensions from r to k (which is usually picked orders of magnitude smaller) with approximation guarantees on A_k.


2.2 Embeddings

Another solution for content-based text mining is completely different, and is based on Machine Learning.

The baseline idea is to map textual entities (be they words, concepts or full phrases) to n-dimensional floating point vectors. Embeddings are not probabilistic, but fixed, which means that ambiguity between textual entities is not captured by this model. The main strength of this approach with respect to a standard bag-of-words one comes from the fact that embeddings tend to carry latent knowledge. For example, if two words are usually found in the same context or frequently close together in phrases, then the probability that their embedding vectors will be close in the n-dimensional space is significantly higher. Several embedding models have been developed over the years, and we are going to present the most valuable ones for the purpose of our thesis.

2.2.1 Word2Vec

Word2Vec [16] is the most widespread and well-known NLP application using embeddings. Here, words are the modeled textual entities. Vectors are trained using a two-layer Neural Network with the task of identifying the context of words in the English language.

The Neural Network's architecture is based on either of the following models:

• The Continuous Bag of Words model (CBOW) predicts the current word based on a set of context words. The fact that context words are modeled as a set means that their ordering is not relevant. This model trains faster but performs worse for rare words.

• The Continuous Skip-Gram model uses the current word to predict the neighboring ones. Weights are increased for closer words with respect to further ones. This model has a higher training time but performs better for rare words.
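A minimal gensim sketch of the two architectures is shown below (gensim ≥ 4 API assumed; older releases use `size` instead of `vector_size`). The toy corpus and hyper-parameters are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [
    ["netflix", "released", "house", "of", "cards"],
    ["spotify", "creates", "weekly", "discovery", "playlists"],
    ["netflix", "recommends", "series", "to", "users"],
]

cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> skip-gram

print(skipgram.wv["netflix"][:5])               # first components of the learned vector
print(skipgram.wv.most_similar("netflix", topn=2))
```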

2.2.2 Universal Sentence Encoder and Doc2Vec

These are two other applications of embeddings, where the text units are longer than one word, namely the Universal Sentence Encoder [17] and Doc2Vec [18]. The models are optimized on texts ranging from short phrases to paragraphs; additionally, Doc2Vec can handle long documents (while the Universal Sentence Encoder can only handle short text).

Their main use is to compute semantic similarity between paragraphs. Input text is encoded into embeddings of 512 dimensions, and the greater the distance in the Euclidean space, the more likely it is that the phrases are uncorrelated.

2.2.3 Wikipedia2Vec

Wikipedia2Vec [6, 7] is a library that adapts the embedding approach to Wikipedia pages. The library provides pre-trained models for various languages and some useful functionalities, like the ability to retrieve the closest (meaning most similar) Wikipedia pages with respect to the one provided as input.

The learning process depends on the structure of the graph, and as such it retains latent information regarding that structure. The actual learning is performed by exploiting neighbor words as contexts (skip-gram model); additionally, words that are close to Wikipedia anchors are used to keep track of the context when "jumping" to another Wikipedia page. Another thing that is taken into account is the set of neighbor nodes of a given Wikipedia page, so that neighboring entities are reflected in the final embedding vector.
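A usage sketch of a pre-trained Wikipedia2Vec model is shown below; the model file name is a placeholder, and the method names follow the library's published examples but should be checked against the installed version.

```python
from wikipedia2vec import Wikipedia2Vec

# Hypothetical local path to a pre-trained model downloaded from the project page.
wiki2vec = Wikipedia2Vec.load("enwiki_pretrained_100d.pkl")

entity = wiki2vec.get_entity("European Broadcasting Union")
vector = wiki2vec.get_entity_vector("European Broadcasting Union")

# Closest (most similar) items to the given Wikipedia page.
for item, score in wiki2vec.most_similar(entity, 5):
    print(item, score)
```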

2.2.4 DeepWalk

DeepWalk [19] is a library that exploits short random walks to learn the representation of vertices in a graph. The main idea and use for this library is to learn social relations in social graphs, to then perform anomaly analysis, social classification and similar operations. The learning process of DeepWalk is the following.

First, given an input graph G = (V, E), a set of k short truncated random walks is generated for each node u ∈ V. Second, the k · |V| generated random walks are used to feed the Word2Vec algorithm, which learns the embedding representation of the nodes, here interpreted as words in sentences formed by the nodes traversed by those random walks. The general intuition is that similar nodes should generate similar paths and thus embed to similar vectors.
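The following is a rough DeepWalk-style sketch (not the original implementation): truncated random walks are generated with networkx and fed to gensim's Word2Vec as sentences. The graph, walk lengths and hyper-parameters are illustrative.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=8, seed=42):
    """Generate k short truncated random walks for every node."""
    rng = random.Random(seed)
    walks = []
    for node in graph.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(n) for n in walk])   # nodes become "words"
    return walks

graph = nx.karate_club_graph()                     # toy social graph shipped with networkx
walks = random_walks(graph)
model = Word2Vec(walks, vector_size=32, window=4, min_count=0, sg=1)  # skip-gram on the walks
print(model.wv.most_similar("0", topn=3))          # nodes structurally close to node 0
```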

2.3 Knowledge Graphs

In this section we are going to present the concept of Knowledge Graphs, their use in Information Retrieval and in content-based recommendation, and their many variations.

A Knowledge Graph (KG) is a graph-like data structure used to integrate data and knowledge. The easiest example is the Google one, where real life concepts and entities are stored as nodes in a graph. This allows concepts to be related to one another with labeled edges that represent the kind of correlation they share.


For example, a specific person node like the one of Donald Trump can be linked to the node related to Queens, NYC with a label of type "Born in", and to another location node like White House, WDC with another label of "Lives in".

This approach allows Google to strongly empower their main application (Google Search) with functional and relation-based queries. The informative boxes (known as knowledge panels) were only added to Google Search after the introduction of the Google KG in 2012.

Google's example of a KG easily shows the power of such a method for storing real life knowledge. Examples of open access knowledge bases exist and, in the research environment, such resources are usually used. Open knowledge repositories like Wikipedia (with DBpedia) and FreeBase are wide in themes, provide content in many languages and are usually manually kept up to date with the most recent events.

Many applications based on Knowledge Graphs have emerged over the years in research, ranging from topic annotation, content-to-content recommendation and search engines, to Q&A services, expert finding, and many more.

2.4 Graph-based Recommendation

Knowledge graphs have been exploited also in the context of content-based news recommendation. However, many of those applications heavily rely on user data or user profile learning techniques which are beyond the scope of this thesis. We are going to present some ideas from some hybrid recommender systems which internally use (Knowledge) graphs.

In order to approach the problem from a graph perspective, two things have to be taken into account: graph edges and relatedness measures.

2.4.1 Graph Edges

Graph edges when dealing with textual documents can be tackled in many different ways:

• Entity co-occurrence: The first solution comes from approaching the problem in a manner similar to TF-IDF [20], where rarer entity links are valued more than common ones (rarer with respect to their labels). Entity similarity is estimated on a Weighted Unlabeled Knowledge Graph by performing a fixed number of random walks. This approach is valuable but it requires a labeled graph and a lot of running time.


• KB labels: If the graph built is effectively a subgraph of the original graph (as is the case in many applications), the original labels can be kept and used. Labels are strings which convey real world meanings.

• Entity Embeddings: As shown in section 2.2.3, edges may be weighted by using e.g. cosine similarity between the embedding vectors of the linked entities.

• DeepWalk: it can be deployed to learn the latent relationship strength between entities both in the unfiltered KB graph or in the article specific subgraph.

• Semantic enhancement: Another proposal is to merge the output of entity annotators and WordNet to obtain a network of entity semantic similarity [21]. WordNet is a lexical database which returns semantic relations between words.

2.4.2 Document Relatedness Measures

When considering the problem of document relatedness, it is important to consider that relatedness measures may change based on the specific graph implementation of choice. In [22], a relatedness measure based on node connectivity is proposed, where each document is semantically annotated (more details in section 2.5) and the relatedness measure is based on the number of paths between the two documents' annotations.

Other approaches rely on handling documents as subgraphs of the KB graph. This allows the use of PathSim [23], which computes the relatedness between two subgraphs (two documents A and B) by means of their connectivity.

2.4.3 SED recommendation

An approach that was proposed for content-based recommendation hinging on graphs is SED [10], which is based on shortest distances over entity subgraphs extracted from FreeBase. The algorithmic idea is the following.

1. Entity discovery and linking: Entity linking is performed with USTC_NELSLIP [24], a NN-based entity linker. The output is a set of named entities, together with their respective categories, taken from a fixed list (Person, Location, Organization, Geopolitical Entity, and Facility).

2. Subgraph generation: A subgraph of FreeBase is built over all the named entities found in the dataset. Nodes are entities, and edges are labeled with the same labels as in FreeBase. Each entity found in the dataset is expanded with its neighbor entities.


3. Shortest distance similarity between articles: Articles, in this newly built graph, are recognizable as vertex-induced subgraphs whose nodes are the entities found in the given article. The similarity between two articles is identified as the shortest path between their respective subgraphs.

Additionally, this solution proposes some optimizations, like prioritizing named entities found in the title, performing outlier removal, and adding to the graph the top words by TF-IDF, in order to capture as much information as possible.
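The following sketch illustrates the shortest-distance idea of step 3 on a toy entity graph with networkx. It is a simplification, not the SED reference implementation, and the entity names and edges are invented for the example.

```python
import networkx as nx

def article_distance(graph, entities_a, entities_b):
    """Minimum shortest-path length between the entity nodes of two articles."""
    best = float("inf")
    for u in entities_a:
        for v in entities_b:
            try:
                d = nx.shortest_path_length(graph, source=u, target=v)
            except nx.NetworkXNoPath:
                continue
            best = min(best, d)
    return best

# Toy entity graph; in SED the edges would come from the KB (FreeBase).
kg = nx.Graph([("Obama", "USA"), ("USA", "NATO"), ("NATO", "Germany"), ("Merkel", "Germany")])
print(article_distance(kg, {"Obama"}, {"Merkel"}))   # 4 hops in this toy graph
```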

2.5 Topic Extraction and Manipulation

Topics (equivalently called entities in this document) are at the core of this thesis. The task of topic annotators is to locate and link unstructured text to the relative entries of a Knowledge Base. The manipulation of those entities is key to a successful algorithm. Section 2.5.1 explores TagMe [4], a topic annotator. Section 2.5.2 explores SWAT [3], a novel salient entity annotator specifically designed for structured text.

It is worth noting that many additional entity annotators have been proposed over the years [25, 26], but we are going to present only the ones that have been used for the purpose of this thesis.

2.5.1 TagMe

TagMe is a topic annotator specialized in the fast annotation of short text. Topic annotation is applied on the Wikipedia Knowledge Base. The first step applied by TagMe is candidate selection, which is done by querying the dictionary of anchors in the Knowledge Base and then keeping the longest non-overlapping mentions. The second step is entity disambiguation, which is performed with a collective agreement between candidates. The vote is expressed via the Milne&Witten [8] relatedness measure, and at the end of the process only the highest ranked entities are kept. More details on the voting scheme are provided in the next paragraph.

TagMe also allows using a long text mode, which restricts the voting process only to the n closest entities, reducing both the time cost of the tagging process and the impact of topic drift.

The output of the tagging process is a list of annotations, where each element has:

• Wikipedia page identifier

• Wikipedia Title: Title of the Wikipedia page, the one displayed on top


• ρ Score: a float number ρ ∈ [0, 1] which expresses the semantic coherence of the given entity with respect to the query text. The higher it is, the closer the entity is to the text. This is the main value to check in order to separate good annotations from noise.

• Start and End: two integer values that locate the position of the anchor that is marked as a topic. This part of the text is called the spot.

• Link probability: This value is another probability, so lp ∈ [0, 1], and it is computed as the number of times that the anchor text a (the spot in the text which was highlighted as an entity) is used as an anchor, divided by the occurrences of a in Wikipedia (as an anchor or not). The higher this value is, the higher the chances that this anchor actually expresses some real world entity. Low link probability values are usually found for adjectives, articles and verbs, while subjects are usually on the higher end of the spectrum.
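The sketch below shows how such annotations can be consumed, keeping only topics whose ρ score clears a threshold. The dictionary field names mirror the description above and are assumptions about the actual TagMe JSON layout; all values are illustrative.

```python
# Hypothetical annotations as returned by a TagMe-style annotator (illustrative values only).
annotations = [
    {"id": 1, "title": "Mona Lisa", "rho": 0.62, "start": 17, "end": 26, "link_probability": 0.48},
    {"id": 2, "title": "Painting",  "rho": 0.08, "start": 9,  "end": 16, "link_probability": 0.03},
]

RHO_THRESHOLD = 0.2   # low-coherence annotations are treated as noise and discarded

good_topics = [a for a in annotations if a["rho"] >= RHO_THRESHOLD]
for a in good_topics:
    print(a["title"], a["rho"])
```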

TAGGING PROCESS. The selection process for topics with TagMe is done with a voting scheme, where the voters are candidate entities. To present the full pipeline, we define as anchor texts those tokens that appear at least once as clickable links in Wikipedia (whenever the text appears in blue and redirects to another Wikipedia page).

The pipeline is the following:

• Anchor selection: The first step is spent by TagMe parsing the input text and identifying, out of all the tokens present, which ones are also anchors, extracting the set A_T.

• Anchor disambiguation: For each element a ∈ A_T, TagMe first finds the set of pages that could be pointed to by a, and then applies a collective agreement vote (more details later). This step returns a set of annotations, associated with their link probability (defined above) and their ρ score.

• Anchor pruning: Low ρ score annotations are discarded.

The collective agreement voting process starts from the set of anchors A_T. For each anchor a ∈ A_T, we select p_a as the set of pages for which a is an anchor. For each one of those pages, a vote is cast by all the other anchors in the text (or by a close subset in case TagMe is applied in fast mode). For each anchor b ≠ a, the vote cast by b for a candidate page of a is the average relatedness between that page and all the pages p_b linked to anchor b. Relatedness between pages is computed using Jaccard similarity.


2.5.2 SWAT

SWAT (Salient Wikipedia Annotator of Text) is another topic annotator based on the Wikipedia Knowledge Base which was used during the project. It offers a different approach with respect to the TagMe annotation process. SWAT is more efficient on long and well formatted text, and outputs a set of topics together with their salience score. The higher the score is for an entity, the higher is the importance of that topic in the queried text (TagMe instead outputs the semantic coherence). Candidate entities are chosen using another tool: WAT [28]. WAT generates annotations by assuming that the input text is less noisy and more refined with respect to TagMe, and extracts more information based on the syntactic structure of the phrases.

Both WAT and SWAT outperform TagMe when dealing with structured text, but they are only available in English. We therefore decided to use both SWAT and TagMe, expecting overall better results with the former, but more availability with the latter.

The output fields of a SWAT query are very similar to the TagMe ones; the output is a list of annotations with the following fields:

• wiki_id: Wikipedia unique identifier.

• wiki_title: Wikipedia page title.

• spans: List of indexes (start and end) which reports where in the text given entity appears.

• salience_score: Score assigned by SWAT to this annotation. It lies in the range [0, 1].

• salience_class: Boolean value that is 1 when the entity is considered salient by SWAT, and 0 otherwise.

SALIENT ENTITY DETECTION PROCESS. We are going to present some rough notions on how SWAT detects and ranks salient entities in the text.

SWAT approaches this task with a pipeline comprised of three main steps:

1. Document Enrichment: the text is expanded with syntactic, semantic and latent features.

2. Feature Generation: the previously retrieved features are all mapped to vectors in order to have easily comparable entities.

3. Entity Salience Classification: the last step, which classifies entities as salient or not salient.


The Document Enrichment step deploys various state-of-the-art techniques and modules to capture as much information as possible, namely:

• CoreNLP is deployed for PoS tagging (see section 2.1.6) and for creating the co-reference chain between words (which word, be it noun, pronoun, or anything else, refers to which other).

• TextRank is deployed to perform summarization (see section 2.1.7). This also has the side effect of ranking the words in the text with a random walk.

• WAT is a topic annotator aimed at structured text, which maps mentions to entities (in our case Wikipedia pages). The disambiguation phase is performed using two metrics: commonness, which represents the probability that a specific mention is disambiguated with entity e, and coherence, which represents the semantic coherence between the extraction and the textual context (similarly to TagMe's ρ score). The weighting process is done by building a graph and scoring entities with the Jaccard relatedness measure. WAT effectively supersedes TagMe on well structured text.

• Word2Vec is deployed to enrich documents with latent information. More details on it are in section 2.2.1. Word2Vec is applied to all the entities extracted by WAT from the input text. Effectively, two kinds of latent embedding representations are used in this step: Entity2Vec [29] and DeepWalk [19].

The second step is Feature Generation. Features are built from the output of the four modules described above. The number of features generated is large, but we are going to present their main themes in the following:

• Position-based features: This set of features captures the distribution of the entities' occurrences within the input text in order to predict their salience score.

• Summarization-based features: This set of features keeps track of those entities that appear also in the summarized text, which is a strong indicator of their importance.

• Linguistic-based features: This set of features exploits the dependency trees of the sentences where the entities occur to keep track of CoreNLP dependency relations between tokens.

• Annotation-based features: This set of features exploits the output from WAT (mainly commonness and coherence) in order to increase robustness to the noise produced in the topic annotation process.


• Word2Vec-based features: This set of features aims at keeping latent information about words, entities and phrases that would otherwise be lost with standard syntactic features.

• Relatedness-based features: This set of features is introduced to capture how much each entity is related to all the other ones in the input text. Relatedness is computed both with standard measures like Milne&Witten and Jaccard similarity, and with cosine similarity between entity embedding vectors.

The last step is Entity Salience Classification, where entities are classified as salient or not. This is done by deploying gradient tree boosting on the features defined in the second step.

2.6 PageRank

The problem of estimating node importance in a graph G(V, E) is historically one of the most interesting problems regarding this data structure. It is safe to assume that, out of the many proposed solutions, the most widespread one is PageRank [9], which is allegedly used by Google for web page ranking.

PageRank is an algorithm based on random walks, and it outputs a probability distribution which represents the importance of each node in the graph once the random walk has reached its steady state.

The key idea underlying the PageRank approach is to slightly alter the Web Graph in order to guarantee the convergence of the Random Walk to a stationary distribution. This is enforced with the use of a teleport step which, when picked with a fixed probability at each step, forces the random walker to travel to a random node in the graph with uniform probability.

The PageRank formulation is the following:

$r = [\alpha \Pi^T + (1 - \alpha) e e^T] \times r$

Where:

• $\Pi(i, j) = 1/\#out(i)$ if $(i, j) \in E$, and $0$ otherwise.

• r is the principal eigenvector (|r| = |V|) and is the rank result vector once the random walk reaches the steady state. This convergence is guaranteed by the properties enforced by the presence of a teleportation step taken with probability 1 − α.

• e = [1/|V|, 1/|V|, ..., 1/|V|], with |e| = |V|. This is the uniform teleportation vector.

• α is the damping factor (α = 0.85 in the original paper) and expresses the probability that the random walker keeps travelling in the graph. 1 − α is instead the probability of teleporting with uniform distribution to any node in the graph.

• B(i) is the set of pages linking to node i, which means that a (j, i) link is indeed present in the graph.

• #out(x) is the number of outgoing edges from node x.

2.6.1 Personalized PageRank

The original PageRank algorithm has, as its main focus, estimating the importance of nodes in the whole graph. At other times, the relevance of a node with respect to another one, or to a set of nodes, may be more useful for some kinds of applications. In this case, Personalized PageRank (also introduced in the original PageRank paper [9]) proves to be more useful. The only thing varying with respect to standard PageRank is the teleportation vector, which changes from a uniform distribution to a specific probability distribution centered on the nodes with respect to which one wishes to evaluate relevance.
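As an illustration of the difference between the two teleportation vectors, the following sketch uses networkx's PageRank implementation on a toy directed graph. The graph and personalization weights are illustrative; in the setting of chapter 4, the teleportation mass would presumably be concentrated on the nodes relevant to the queried news.

```python
import networkx as nx

g = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "a")])

# Standard PageRank: uniform teleportation vector e.
print(nx.pagerank(g, alpha=0.85))

# Personalized PageRank: all teleportation mass placed on node "a".
print(nx.pagerank(g, alpha=0.85, personalization={"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}))
```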

2.7 Hierarchical Density-Based Clustering

Clustering has always been an important task in many Data Mining and Information Retrieval activities. It allows creating groups of similar items according to given metrics. The metric used between elements in the dataset is usually an instance of a distance function or a similarity function, with those functions being defined appropriately taking into account the kind of data being handled.

Several approaches have been proposed for tackling this problem in the literature, with the most prominent ones being KMeans [30], DBSCAN [31] and Hierarchical Clustering [32]. Those families of clustering are not mutually exclusive, and many algorithms were proposed extending the original ideas to obtain different kinds of results. We are going to focus on an algorithm, HDBSCAN [5], which combines standard density-based approaches (DBSCAN) and Hierarchical Clustering to ensure quality results, which have been shown to be state of the art for clustering. We are going to first explain DBSCAN, and then show the extension proposed by HDBSCAN.

2.7.1 DBSCAN

DBSCAN is the most well known clustering algorithm in the family of density-based ones. The approach taken is to label points as belonging to a cluster if they belong to an area of points of similar density, and to mark as outliers those that lie in lower density zones.


Given a set of N points {X_1, X_2, ..., X_N}, the algorithm requires two parameters to run:

• ε: the radius (distance) taken into account when considering neighboring points.

• k: the minimum number of points required to form a cluster.

Given those parameters, X_i is labeled as a core point if B(X_i, ε), the ball of radius ε centered in X_i, contains at least k points. We say that two points X_i, X_j are density connected if they are both core points and each lies within the other point's ball (X_i ∈ B(X_j, ε) and X_j ∈ B(X_i, ε)).

Clusters are then picked as non-empty maximal subsets of points which are density connected.
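A short scikit-learn sketch of the procedure just described: `eps` corresponds to ε and `min_samples` to k, and the toy 2-D points are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],    # one dense group
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.0],    # another dense group
              [9.0, 0.5]])                            # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # two clusters (labels 0 and 1); -1 marks the outlier
```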

2.7.2 HDBSCAN

HDBSCAN expands on the concepts of standard density-based clustering methods and includes elements of hierarchical clustering. This is achieved by growing a hierarchy of various DBSCAN instances, all with different values of ε. Splits are handled by checking for one of the following cases:

1. Cluster C_i contains fewer than m points. In this case, C_i is considered spurious and all its data points are marked as outliers.

2. C_i is the only cluster with more than m points. C_i is considered a continuation of its parent and no more splits are performed.

3. More than a single child cluster contains m points. This case is considered a "true" split, since it shrinks the size of the parent cluster and thus increases the density λ within the newly generated clusters, where λ = 1/ε.

After the execution of this algorithm, HDBSCAN eventually outputs the flat clustering that maximizes the sum of the stability function σ, expressed as:

$\sigma(C_i) = \sum_{X_j \in C_i} \left( \lambda_{max,C_i}(X_j) - \lambda_{min,C_i}(X_j) \right)$

where $\lambda_{max,C_i}(X_j)$ is the λ value at which $X_j$ falls out of cluster $C_i$, and $\lambda_{min,C_i}(X_j)$ is the λ value at which $X_j$ first belongs to $C_i$.
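A minimal usage sketch with the hdbscan Python package (assumed installed) is shown below; `min_cluster_size` plays the role of the parameter m above, and the synthetic 2-D data is illustrative.

```python
import numpy as np
import hdbscan

np.random.seed(0)
X = np.vstack([
    np.random.normal(loc=(0, 0), scale=0.3, size=(40, 2)),   # dense cluster
    np.random.normal(loc=(5, 5), scale=0.3, size=(40, 2)),   # second dense cluster
    np.random.uniform(low=-2, high=7, size=(10, 2)),          # scattered noise
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)
print(set(labels))   # typically {0, 1, -1}: two clusters plus outliers labelled -1
```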


3 INFRASTRUCTURE

This thesis was developed as part of a collaborative project involving the University of Pisa and the European Broadcasting Union (EBU). In this section, we describe what the EBU is and what its goals and tasks are. We then present the Peach Team, which is the team tasked with the recommendation process (and more), in direct contact with many European broadcasters.

We present the process and the collaboration we had with this team for the purpose of this thesis, and their IT infrastructure, where much of the work was done. We also present other EBU teams and projects which provided assistance in this work.

3.1 the ebu

The European Broadcasting Union (EBU) is an alliance of public service media organisations, mostly focused on European broadcasters, with headquarters in Geneva (Switzerland). The organisation is made up of 115 member organisations in 56 countries, and 34 associate members from a further 21 countries. It is best known to the general public for producing the Eurovision Song Contest.

Its primary goal is to cooperate tightly with broadcasters to provide safe platforms, services and content. For that purpose, the EBU is organized in many small teams, each one dealing with different services, ranging from news exchange to translation, events, initiatives and many more.

The specific teams that were involved in the work of this thesis are:

• The Eurovox Project, section 3.2.

• The News Project Team, section 3.3.

• The Peach Team, section 3.4.

3.2 eurovox project

This project aims at building "an open toolkit for transcription and translation, allowing citizens easy access to content in their native language". The goal is to provide a toolbox for Speech-To-Text (STT), Translation and Text-To-Speech (TTS), with open and standardized APIs (https://tech.ebu.ch/eurovox). Initially, the service is mainly aimed at journalists and broadcasters, in order to provide easy access to foreign local news.

Additionally, Eurovox provides one service (El Traductor), which was specifically designed for producing transvoiced videos.

For instance, say an Italian journalist produces a video documentary on local elections in a small Italian town which, for some exceptional reason, has regional or national relevance. This kind of news is usually hard to obtain and understand abroad, but it might be interesting content, for example as an indicator of a strong rising political trend. With a tool such as the one provided by Eurovox (El Traductor), it is possible to watch the original documentary dubbed over in any other language, by applying STT on the original audio, then translation, and then TTS.

This project is indeed valuable to broadcasters, and has received much interest recently. The use of this toolbox did not directly impact the work of this thesis, but it is important to highlight it in order to fully understand the dataset and the next project, presented in section 3.3.

3.3 news project

This project is a joint effort of many EBU teams, with the main contributing teams being Peach (more in section 3.4) and Eurovox.

The main goal is to provide content from different European partner broadcasters under one joint portal (the project is still in a proof of concept state, and access to it is restricted). The project aims to provide this content in various languages (English, Italian, Spanish, German, etc.) by exploiting the toolbox provided by the Eurovox team. News articles are translated to a set of languages, but the media type does not necessarily have to be textual. Let's say, for example, that RAI (the Italian broadcaster) publishes a video documentary regarding World War II. Other broadcasters may watch the video of that documentary with dubbed audio in their native language. They are also able to read the translated transcript of the documentary and its attached description, if any is provided.

The main aim of the project is both to provide high language availability (for ease of access) and to display all the content under a unique portal. For this purpose, most of the effort was spent obtaining the approval of as many broadcasters as possible and organizing the news feed, as each broadcaster has its own data format, pushing/pulling method and metadata. The full news feed is then translated to English using the Eurovox toolbox, and provided under one portal with (for now) minimal filtering and searching capabilities (as of the day of writing this document, the project is still in its earliest stages).

On top of this comprehensive portal, the Peach team provides article recommendation, so that broadcasters can more easily get access to foreign news. The dataset used to experiment with the recommendation algorithm developed in this thesis has been the one of the News Project. Some information regarding this dataset:

• Number of documents is close to 17 000 as of the day of writing this document, and it is continuously growing.

• Documents come from various European broadcasters.

• Most of those items are not originally written in English, but are translated through the Eurovox toolbox.

3.4 peach team

This thesis was mainly done in collaboration with the Peach team (PErsonalization for eACH). The purpose of Peach, one of the services provided by EBU, is to offer broadcasters a single sign-on and a recommendation system both for news articles and for other kinds of media (episodes, audio, video). The Peach Team is small (7 people involved in the days of writing this thesis), and their process is the standard SCRUM one [33]. The work day begins with a 15-minute daily meeting, where the time is spent telling what was done the day before, what the intent for the day is and what the problems were. We were part of this scrum process for the whole duration of the thesis (starting from March). Work cycles last three weeks, and at the end of this time frame a review is performed with the clients, where the work done in the given scrum sprint is shown to them in order to display the progress and receive valuable inputs.

Clients for Peach are broadcasters that seek a service for content recommendation, user sign-in, dashboards, recommendation evaluation tools and more. The Peach Team started as a collaboration between Bayerischer Rundfunk (BR), Radio-Télévision Suisse (RTS) and the EBU. The two main (most committed) clients, namely Sverige Radio from Sweden and BR Mediathek from Germany, also deploy one developer full time (or almost full time) specifically for the Peach project.

3.4.1 Main Products

The main products and services provided by the Peach Team are:

• Single Sign-On: This service provides a way to uniquely identify a user across multiple platforms, while being respectful of GDPR regulations. This feature is mostly appreciated for the ability it provides to better profile users, which is mandatory for better recommendation quality. Customization and branding are provided in order to make the design of the single sign-on fit the specific broadcaster platform.


• Data Science Platform: More on this topic is in section 3.4.2.

• Spectrum: Spectrum is the main tool used for recommendation evaluation. It provides a clean user interface with easy access to a specific organization's dataset, and the possibility to compare different recommendation algorithms for specific content or for specific users. Spectrum also allows easy access to a link to the original article (on the broadcaster website) and, when extracted, an easy visualization of the topics present in an article. A screenshot of Spectrum is shown in figure 1. Here we are able to see the configuration panel on the left (currently displayed organization, available recommendation endpoints, search features and settings, including the option to display topics), the search results in the left column and the results of the selected recommendation endpoints in the right columns.

• Recommendation Service: This is by far the main selling point for the Peach project.

Figure 1: Screenshot of the Spectrum UI (with German BR data)

The recommendation service is the core service of the project. It consists of a full ecosystem built for experimenting with, comparing, deploying and testing recommendation algorithms. The 4 steps towards efficient recommendation are:

1. Data Acquisition: Peach provides code libraries and templates to ease the collection of user data from the broadcaster-specific platforms.

2. Recommendation Algorithms: For more details on algorithms, see section 3.4.2.

3. Processing & Distribution: This module handles model updates and content delivery. The Peach infrastructure provides ways to frequently update the model so as to have up-to-date data and improve the recommendation process. Recommendations are provided through REST endpoints.

4. Evaluation & Monitoring: Peach provides dashboards and means to perform A/B testing directly in production in order to monitor the performance of the various algorithms. In addition to that, the previously presented tool Spectrum can be helpful to have a first glimpse at the recommendation results.

3.4.2 Data Science Platform

The Data Science platform is designed to allow data scientists to explore data sets, quickly prototype recommendation algorithms, deploy them and then measure the impact. A snapshot showing the main "ingredients" of the platform is shown in figure 2.

Figure 2: The Data Science Platform Components

Peach offers a data science platform in the form of a Jupyter Lab interface. In this lab, it is possible to write recurring code routines to update models and test your own recommendation algorithm against other known standard methods like the content-based and the collaborative filtering ones. Recurring code routines are called Tasks in the PEACH terminology, while recommendations are provided to broadcasters via REST Endpoints. The lab also gives access to various useful services when dealing with recommendation, like dashboards and testing environments.

Code is handled by integrating the Peach Lab with GitLab. The goal is to write Python or Jupyter Notebook code (with some additional configuration files), put it in a GitLab repository, and clone it in the Peach Lab environment.

Tasks are the most important part of the Peach lab. They are used to update models. They provide the option to select the executor, allowing the programmer to choose between Spark tasks for parallelizable code and standard Python for less CPU-consuming code. Tasks usually fetch fresh data and update the recommendation models accordingly.

A visual representation of tasks and endpoints is shown in figure 3.

(a) Task (b) Endpoint

Figure 3: Task and Endpoint Workflow representation

External libraries can be used, and they can be provided as dependencies in the configuration file (in YAML format). Those libraries will be installed using the standard pip format, so they can effectively be versioned. Once a task is set up on the Peach environment, the Peach lab provides a way (a notebook) to check the status of the tasks which are currently in the workflow loop. You are then able to filter tasks by organization, check the full list and their status (running, pending, finished), check whether their last execution was successful or failed, and see the duration of the last n executions of a given task and their textual output.

Recommendations are instead retrieved using endpoints. They share a configuration process similar to the one used for tasks, and have to be coded in Jupyter notebooks.
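
To fix ideas, the following is a purely illustrative sketch of the task/endpoint split described above; none of these function names, paths or helpers belong to the actual Peach API, and the "model" is a deliberately trivial placeholder:

import pickle

MODEL_PATH = "/tmp/recommendation_model.pkl"   # hypothetical model store

def update_model_task(fetch_articles):
    # Recurring task: fetch fresh content and rebuild the model on disk.
    articles = fetch_articles()
    model = {a["id"]: a["title"] for a in articles}   # placeholder "model"
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)

def recommend_endpoint(item_id, limit=5):
    # Endpoint: load the latest model and serve recommendations on request.
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    return [other for other in model if other != item_id][:limit]

The point of the split is that the expensive work (fetching data, rebuilding the model) happens on a schedule inside tasks, while endpoints only read the latest model and can therefore answer quickly.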

3.4.3 Codex and Spectrum

Codex is the metadata repository. It effectively serves as a private MongoDB database, with documented REST APIs to interact with this storage. Broadcasters use it to store their content metadata (be it video, audio, articles, or episodes) and access content through the APIs.
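
For illustration only, interacting with such a REST metadata store from Python might look like the snippet below; the base URL, endpoint paths and payload fields are hypothetical and do not reflect the documented Codex API:

import requests

BASE_URL = "https://codex.example.org/api/v1"   # hypothetical base URL

# Store one content item's metadata...
item = {"id": "article-123", "type": "article", "title": "Example title"}
requests.post(f"{BASE_URL}/items", json=item, timeout=10)

# ...and retrieve it back by id.
response = requests.get(f"{BASE_URL}/items/article-123", timeout=10)
print(response.json())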

The second product worth presenting is Spectrum, which is a visual browser interface used to test different recommendation algorithms. The goal of this tool is to provide an intuitive interface for data scientists to easily compare recommendations side by side for various queries. Of course, different algorithms require different inputs. For example, it is possible to check outputs for textual queries, user queries and queries by content id.


Spectrum displays, for each content element:

• the unique ID

• the description, which is the text content of an item (synopsis for an episode, text for an article...)

• the link to the content element on the broadcaster's website

• a thumbnail or preview, in the form of an image

• the publication date of the item

• the type of the item (article, program, episode, list, clip)

• the media type of the item (audio, video or text)

• the transcript (optionally) of the item's media

• the duration (optionally) of the media

Spectrum additionally allows showing the topics extracted from media items. Topics are stored as a list of elements, where each single element is formed as follows (an illustrative example is given after the list):

• ID: a unique id for the topic

• title: a name for the topic

• link (optional): an external link related to the topic

• source (optional): from which part of the item this topic was extracted (for example: title, text, transcript)

• score (optional): score of the single annotation with respect to the item.
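
A single topic element could therefore look like the following sketch; the values are made up for illustration and do not come from a real annotation:

topic = {
    "id": "12345",                                     # unique id for the topic
    "title": "Paris",                                  # name of the topic
    "link": "https://en.wikipedia.org/wiki/Paris",     # optional external link
    "source": "title",                                 # optional: part of the item it came from
    "score": 0.87,                                     # optional: annotation score w.r.t. the item
}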

3.4.4 Sprints

In this section we present the work done during the "sprint" events with the Peach team, and the value those activities had for the broadcasters.

• Sprint April 25 - May 15: The focus was on extracting topics using TagMe (section 2.5.1) on the SR (Swedish Radio) dataset. This is a dataset of articles in the Swedish language. The topic extraction was performed on the English Wikipedia dump, which means that we used the Eurovox Toolbox (section 3.2) to translate the articles into English before annotating them.
