
5.4 Feature Engineering

5.4.5 Movie textual overviews embedding

Overviews of the movies were converted into vectors for all of the models that used item metadata; in other words, they were stored as dense vector representations. In contrast to sparse vectors, the word "dense" refers to the fact that every entry of the vector holds a value. Movie textual plots were thus embedded in a vector space in which similar overviews lie close together. These embeddings can then be compared, for example with cosine similarity, to find sentences with similar meanings. BERT (Bidirectional Encoder Representations from Transformers) is the state-of-the-art language model for producing contextual dense vectors. Each of its encoder layers outputs one dense vector per token, whose size depends on the BERT model used and is commonly 384 or 768. Since each of these vectors is a numerical representation of a single token, we used them as contextual word embeddings. Word embeddings belonging to the same sentence were then handled together, using clustering approaches, in order to obtain a semantic representation of the text and identify its main topics. Two main methods were explored for sentence embedding:

• Using BERT pre-trained transformer models from HuggingFace [32], extracting the last hidden layer and applying a mean pooling operation. The bert-base-uncased pre-trained model was used for this purpose: it is pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia. Before passing sentences to the model, the encode method of the BertTokenizer class is used to obtain the list of input IDs of the sentence with the appropriate special tokens. Those IDs are token indices, i.e. numerical representations of the tokens building the sequences that the model takes as input.

Since each encoder layer outputs one vector per token, the result is actually a matrix of size (number of tokens, K), where K is usually 384 or 768. We can transform those vectors to create a semantic representation of the input sequence. The simplest and most commonly extracted tensor is the last_hidden_state tensor, which the BERT model conveniently returns and which represents the sequence of hidden states at the output of the last layer of the model. We convert the last_hidden_state tensor into a single vector of K dimensions with a mean pooling operation: each token has its own K values, and the pooling operation takes the mean over all token embeddings, compressing them into a single 384- or 768-dimensional sentence vector. A sketch of this approach is shown after the list below.

• Using SentenceTransformer [21] library models, in particular the all-MiniLM-L6-v2 model, trained on a large and diverse dataset of over 1 billion training pairs. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search. SentenceTransformer models take care of tokenization and encoding too. A list of sentences (our movie textual overviews), as long as the number of movies in the dataset, is passed to the model encoder, which outputs an embedding matrix of dimensions (M, 384), where M is the number of movies rated in the dataset. Under the hood, the model first passes the input through the transformer model and then applies a mean pooling operation on top of the contextualized word embeddings. The pooling layer lets us generate a fixed-size representation for input sentences of varying lengths. The authors of the SBERT paper experimented with several pooling strategies, including MEAN and MAX pooling, as well as using the CLS token that BERT produces by default. The maximum sentence length was increased from the default value of 256 tokens so as to cover the longest overview in the dataset. A sketch of this approach is shown as the second example after the list.
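The following is a minimal sketch of the first approach, assuming bert-base-uncased from the HuggingFace transformers library; the overview strings are placeholders:

```python
# Sketch of the first approach (assumed implementation): bert-base-uncased + mean pooling.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

overviews = [
    "A touching story of an Italian book seller of Jewish ancestry ...",   # placeholder
    "A Jewish boy separated from his family in the early days of WWII ...",  # placeholder
]

# Tokenize: input IDs with the special tokens [CLS] and [SEP], plus an attention mask.
encoded = tokenizer(overviews, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    output = model(**encoded)

# last_hidden_state has shape (batch_size, num_tokens, 768) for bert-base-uncased.
token_embeddings = output.last_hidden_state

# Mean pooling over the token axis, ignoring padding tokens via the attention mask.
mask = encoded["attention_mask"].unsqueeze(-1).float()            # (batch, tokens, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
# sentence_embeddings: one 768-dimensional vector per overview.
```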

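A corresponding minimal sketch of the second approach with the SentenceTransformers library; the overview strings and the increased maximum sequence length are illustrative:

```python
# Sketch of the second approach (assumed implementation): all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Raise the maximum sequence length from the default of 256 so that long overviews
# are not truncated (the exact value depends on the longest overview in the dataset).
model.max_seq_length = 512  # illustrative value

overviews = ["Overview of movie 1 ...", "Overview of movie 2 ..."]  # M placeholder overviews

# Tokenization, the transformer forward pass and mean pooling all happen inside encode();
# the result is an (M, 384) embedding matrix.
embeddings = model.encode(overviews, show_progress_bar=True)
print(embeddings.shape)  # (M, 384)
```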
The second option, with pretrained SentenceTransformer models, was preferred since it turned out to be the easiest to implement and the most efficient given our available computational resources. In the scatter plot below we show the average overview embedding per genre after PCA. Each 2-dimensional point representing a genre is obtained by averaging all the BERT embeddings of the movies belonging to that genre; for each genre we thus obtained a vector of 384 entries. Principal Component Analysis (PCA) was applied here to reduce the dimensionality of the data for illustration purposes. This technique linearly transforms the data, mapping them into a new space of lower dimensionality while trying to preserve most of the information of the original data: PCA finds the "directions" of the data that explain most of the information present in them.
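A possible way to produce such a plot is sketched below, with random placeholder embeddings and a hypothetical per-movie genre list standing in for the real data:

```python
# Sketch (assumed): reduce the per-genre average overview embeddings to 2D with PCA.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Placeholder inputs: `embeddings` stands for the real (M, 384) overview embedding
# matrix and `movie_genres` lists, for each movie, the genres it belongs to.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 384))
movie_genres = [["War", "History"], ["War"], ["Crime"],
                ["Thriller"], ["Adventure"], ["Science Fiction"]]

# Average the embeddings of all movies belonging to each genre.
genres = sorted({g for gs in movie_genres for g in gs})
genre_means = np.vstack([
    embeddings[[i for i, gs in enumerate(movie_genres) if g in gs]].mean(axis=0)
    for g in genres
])                                                           # (n_genres, 384)

# Project the 384-dimensional averages onto the first two principal components.
genre_2d = PCA(n_components=2).fit_transform(genre_means)    # (n_genres, 2)

plt.scatter(genre_2d[:, 0], genre_2d[:, 1])
for (x, y), g in zip(genre_2d, genres):
    plt.annotate(g, (x, y))
plt.show()
```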

Figure 5.4. Movie overview BERT average embedding per genre after applying PCA with n_components=2

The plot clearly shows how BERT encoding is capable of representing semantic meaning and context of sentences. Although we are averaging several embeddings and reducing the space from 384 to 2 dimensions, we can still see that overviews belonging to similar genres end up closer in the embedding space. For instance, History is reasonably close to War, Crime is close to Thriller and Mystery, and Science Fiction is close to Adventure. Once movie overview embeddings are obtained with BERT, we can compare them and find the most similar ones, which will be the closest ones in the embedding space. For instance, in the table below we show the overviews most similar to that of "Life Is Beautiful", an Italian movie about the story of a Jewish family during World War II.

Movie Title | Cosine Similarity | Overview
Life Is Beautiful | 1.0000 | A touching story of an Italian book seller of Jewish ancestry who lives in his own little fairy tale. His creative and happy life would come to an abrupt halt when his entire family is deported to a concentration camp during World War II. While locked up he tries to convince his son that the whole thing is just a game.
Europa Europa | 0.6015 | A Jewish boy separated from his family in the early days of WWII poses as a German orphan and is taken into the heart of the Nazi world as a ’war hero’ and eventually becomes a Hitler Youth. Although improbabilities and happenstance are cornerstones of the film, it is based upon a true story.
August Rush | 0.5523 | A drama with fairy tale elements, where an orphaned musical prodigy uses his gift as a clue to finding his birth parents.
The Book Thief | 0.5457 | While subjected to the horrors of WWII Germany, young Liesel finds solace by stealing books and sharing them with others. Under the stairs in her home, a Jewish refuge is being sheltered by her adoptive parents.
Schindler’s List | 0.5379 | The true story of how businessman Oskar Schindler saved over a thousand Jewish lives from the Nazis while they worked as slaves in his factory during World War II.
Captain Corelli’s Mandolin | 0.5343 | When a Greek fisherman leaves to fight with the Greek army during WWII, his fiancee falls in love with the local Italian commander. The film is based on a novel about an Italian soldier’s experiences during the Italian occupation of the Greek island of Cephalonia (Kefalonia).

Table 5.2: Most similar movies to "Life Is Beautiful" in terms of cosine similarity between movie overview BERT embeddings.

We can see that the overviews most similar in terms of cosine similarity, computed with sklearn, all belong to movies in which themes such as World War II, Jewish deportation and family relationships recur.
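A minimal sketch of how such a ranking could be computed with scikit-learn, assuming a hypothetical `titles` list of movie titles and the (M, 384) `embeddings` matrix from the SentenceTransformer sketch above, in the same order:

```python
# Sketch (assumed): rank overviews by cosine similarity to "Life Is Beautiful".
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical inputs: `titles` (list of M movie titles) and `embeddings` ((M, 384) matrix).
query_idx = titles.index("Life Is Beautiful")
similarities = cosine_similarity(embeddings[query_idx].reshape(1, -1), embeddings)[0]

# Sort from most to least similar; the query itself appears first with similarity 1.0.
for idx in np.argsort(-similarities)[:6]:
    print(f"{titles[idx]:30s} {similarities[idx]:.4f}")
```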

5.4.6 Overview clusters using HDBScan on BERT
