

2.1.1 Word2Vec

Word2Vec is a set of machine learning models whose aim is to produce a vector representation for words. It was proposed by a group of researchers led by Tomas Mikolov [23] at Google in 2013. The technique is based on the Neural Probabilistic Language Model, a task in which a machine learning model is trained to predict the next word w_t given the history h = {w_1, w_2, ..., w_{t-1}}. Such a model learns the conditional probability of the next word, given all the previous ones.

P(w_t \mid h) = \mathrm{softmax}(\mathrm{score}(w_t, h)) = \frac{\exp(\mathrm{score}(w_t, h))}{\sum_{w' \in \mathrm{vocab}} \exp(\mathrm{score}(w', h))} \qquad (2.1)

The score(w_t, h) computes the compatibility of the target word with the context h; the dot product is commonly used. The model is trained to maximize the log-likelihood of this conditional probability.
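As a minimal sketch of equation (2.1), assuming a toy vocabulary, random word vectors and a random history representation h (all hypothetical), the softmax over dot-product scores can be computed as follows:

```python
import numpy as np

# Hypothetical toy setup: a 5-word vocabulary with 3-dimensional vectors.
vocab = ["the", "cat", "sat", "on", "mat"]
word_vectors = np.random.rand(len(vocab), 3)   # one row per vocabulary word
h = np.random.rand(3)                          # representation of the history w_1, ..., w_{t-1}

# score(w, h) as a dot product, computed for every word in the vocabulary.
scores = word_vectors @ h

# Softmax of equation (2.1): exp(score) normalised over the whole vocabulary.
probs = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()

# The most probable next word under this (untrained) model.
print(vocab[int(np.argmax(probs))])
```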

J = \log P(w_t \mid h) = \mathrm{score}(w_t, h) - \log \sum_{w' \in \mathrm{vocab}} \exp(\mathrm{score}(w', h)) \qquad (2.2)

Since the model is trained on an entire corpus, the final goal is to find the right parameters θ that maximize the average of (2.2).
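For clarity, (2.2) is obtained simply by taking the logarithm of (2.1) and expanding it:

\log P(w_t \mid h) = \log \frac{\exp(\mathrm{score}(w_t, h))}{\sum_{w' \in \mathrm{vocab}} \exp(\mathrm{score}(w', h))} = \mathrm{score}(w_t, h) - \log \sum_{w' \in \mathrm{vocab}} \exp(\mathrm{score}(w', h))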

\arg\max_{\theta} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \neq 0} \log P(w_{t+j} \mid w_t; \theta) \qquad (2.3)


Equation (2.3) expresses in compact form the goal of maximizing the probability of seeing each word within a window of size c around the current word w_t.
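The window structure that (2.3) sums over can be made concrete with a short sketch; the tokenised corpus and the window size c = 2 below are purely illustrative:

```python
# Enumerate the (w_t, w_{t+j}) pairs that equation (2.3) sums over,
# for a hypothetical tokenised corpus and a window of size c = 2.
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
c = 2

pairs = []
for t, center in enumerate(corpus):
    for j in range(-c, c + 1):
        if j == 0 or not 0 <= t + j < len(corpus):
            continue
        pairs.append((center, corpus[t + j]))

# The objective averages log P(w_{t+j} | w_t; theta) over all these pairs.
print(pairs[:6])
```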

The algorithm comes in two different flavours:

• Skip-Gram

• CBOW (Continuous bag of words)

While both variants work with a window of size c, the objective function is different. CBOW is trained to predict the center word of the window, given the surrounding context (the so-called "bag of words"). Skip-Gram is instead trained to predict the context, given the center word.

CBOW relies on the "bag of words" assumption, under which the order of context words does not influence the prediction of the center word. Skip-Gram instead has a mechanism to weight nearby words more than distant ones.

While CBOW is considerably faster to train, Skip-Gram handles infrequent words better.
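A hypothetical five-word window (window size c = 2, sentence invented for illustration) makes the difference between the two training objectives explicit:

```python
# Hypothetical window of size c = 2 around the center word "sat".
window = ["the", "cat", "sat", "on", "a"]
center = window[2]
context = window[:2] + window[3:]

# CBOW: a single example predicting the center word from the whole context.
cbow_example = (context, center)            # (["the", "cat", "on", "a"], "sat")

# Skip-Gram: one example per context word, predicted from the center word.
skipgram_examples = [(center, w) for w in context]
# [("sat", "the"), ("sat", "cat"), ("sat", "on"), ("sat", "a")]
```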

In this section only the Skip-Gram algorithm is taken into consideration.

Skip-Gram

The Skip-Gram algorithm defines the conditional probability through the softmax function. In equation (2.4), w_i represents a context word (outside word) and w_t the target one (also referred to as the center word), both one-hot encoded. N and K are respectively the number of words in the vocabulary and the embedding dimension. θ represents the trained lookup matrix after the learning phase.

P(w_i \mid w_t, \theta) = \frac{\exp(\theta_{w_i}^{\top} \theta_{w_t})}{\sum_{w' \in \mathrm{vocab}} \exp(\theta_{w'}^{\top} \theta_{w_t})}, \qquad \theta \in \mathbb{R}^{N \times K}, \; w_i \in \mathbb{R}^{N} \qquad (2.4)

where \theta_{w} denotes the embedding of word w stored in the lookup matrix θ. The architecture of Skip-Gram is a shallow three-layer neural network: one input layer, one hidden layer and one output layer. The hidden layer holds a lookup matrix for the center word; from there the embedding of the input word can be retrieved by means of a simple matrix multiplication.

Because w_t is one-hot encoded, the result of this operation is exactly the embedding of w_t, i.e. the row of the lookup matrix whose index i corresponds to the entry of w_t containing the one. The output layer contains the lookup matrix for all the context words. The embeddings of the context and center words are combined by dot product; the results are then passed through a softmax and an argmax in order to predict each context word. Backpropagation is then performed to correct the network weights, which are the lookup matrices for center and context words. Optimizing this structure means tuning the embeddings of words that often co-occur so that they become as similar as possible; in this way their dot product yields a higher score.
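A minimal NumPy sketch of the forward pass just described, under hypothetical sizes; theta_in stands for the hidden-layer lookup matrix of the center word and theta_out for the output-layer lookup matrix of the context words (both names are assumptions for illustration):

```python
import numpy as np

N, K = 10, 4                       # hypothetical vocabulary size and embedding dimension
theta_in = np.random.rand(N, K)    # hidden-layer lookup matrix (center words)
theta_out = np.random.rand(N, K)   # output-layer lookup matrix (context words)

# One-hot encoding of the center word w_t, whose vocabulary index is i.
i = 3
w_t = np.zeros(N)
w_t[i] = 1.0

# The matrix multiplication simply selects the embedding of word i.
v_center = theta_in.T @ w_t        # identical to theta_in[i]

# Dot products with every context embedding, followed by softmax and argmax.
scores = theta_out @ v_center
probs = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
predicted_context_word = int(np.argmax(probs))

# Backpropagation would then adjust both theta_in and theta_out from the
# error on the true context words.
print(predicted_context_word)
```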

After training, the embedding for each word can be obtained by simply removing the output layer and feeding in a valid vocabulary index. The reader may have noticed that during training Skip-Gram tunes two different lookup matrices, even though in the end only one of them is used.

Figure 2.1: Skip-Gram architecture in an original diagram from Mikolov et al., 2013

The reason why the proposed model does not use a single matrix is merely mathematical. If the model had a single lookup matrix, then each time the dot product is performed the highest score would be assigned to the center word itself, since it is identical to itself. This is a typical problem of language models that must be avoided: a model that predicts the input word as the next word of itself has no predictive power.

During the development of Word2Vec there were two main problems to solve. The first was related to the very large corpus: with batch Gradient Descent the weights would be updated only once per epoch, and thus too seldom. To solve this, Word2Vec uses Stochastic Gradient Descent which, differently from batch Gradient Descent, updates the weights at each sample. The error is typically noisier, but the model converges faster and is more robust against local minima. The second problem was trickier to solve: it is related to the computation of the denominator in the conditional probability. Since the vocabulary contains a huge number of words, iterating over all of them at every step is not feasible. Additionally, it would be useless to compute the gradient for all the words in the vocabulary when only fixed-size windows are used, since the vector of gradients would be full of zeros.

To overcome this problem, Word2Vec uses negative sampling. Instead of iterating over the entire vocabulary each time, the idea is to train a binary logistic regression to distinguish a true pair (the center word and a word in its window) from several noise pairs (the center word paired with a random word), the so-called negative samples.
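The following is a hedged sketch of one stochastic update with negative sampling, combining the per-sample updates of Stochastic Gradient Descent with the binary logistic regression described above; the sigmoid-based gradients follow the standard negative-sampling objective, and all sizes, names and the learning rate are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 10_000, 100                 # hypothetical vocabulary size and embedding dimension
num_neg, lr = 5, 0.025             # assumed number of negative samples and learning rate
theta_in = rng.normal(scale=0.1, size=(N, K))    # center-word lookup matrix
theta_out = rng.normal(scale=0.1, size=(N, K))   # context-word lookup matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center_idx, context_idx):
    """One stochastic update: a true (center, context) pair against noise pairs."""
    neg_idx = rng.integers(0, N, size=num_neg)   # random words as negative samples
    v = theta_in[center_idx].copy()

    # Positive pair: push sigmoid(u . v) towards 1.
    u_pos = theta_out[context_idx]
    g_pos = sigmoid(u_pos @ v) - 1.0
    grad_v = g_pos * u_pos
    theta_out[context_idx] -= lr * g_pos * v

    # Negative pairs: push sigmoid(u . v) towards 0.
    for j in neg_idx:
        u_neg = theta_out[j]
        g_neg = sigmoid(u_neg @ v)
        grad_v = grad_v + g_neg * u_neg
        theta_out[j] -= lr * g_neg * v

    theta_in[center_idx] -= lr * grad_v

# Example: one update for a (hypothetical) pair of word indices.
sgd_step(42, 7)
```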

Word2Vec embeddings are still used today because the multi-dimensional space in which those vectors live has very interesting properties.

Such a space allows the use of arithmetic and analogical operations on word vectors. A much-cited example is the following operation:

e_{king} - e_{man} + e_{woman} \approx e_{queen}

Analogy-like operations can also be performed:

e_{Rome} : e_{Italy} = e_{x} : e_{France}, \qquad x \approx Paris
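Assuming a dictionary mapping words to trained embedding vectors is available (the embeddings variable below is hypothetical), such analogies can be reproduced with a nearest-neighbour search under cosine similarity:

```python
import numpy as np

def most_similar(query_vec, embeddings, exclude=()):
    """Return the word whose embedding is closest to query_vec in cosine similarity."""
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = (vec @ query_vec) / (np.linalg.norm(vec) * np.linalg.norm(query_vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With a trained (hypothetical) embeddings dictionary:
# query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# most_similar(query, embeddings, exclude={"king", "man", "woman"})  # expected: "queen"
```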

Such striking results are possible because of the rich geometry of the space in which the embeddings are built: related words are close to each other, and this brings many benefits to the final representation.

Figure 2.2: Some examples showing the rich geometry of the Word2Vec space, after PCA for dimensionality reduction.

Even though Word2Vec can capture very complex patterns in the data, it does not scale with corpus size and has a very slow training time. For this reason GloVe was proposed.
