
Neural networks for resolving semantic ambiguity in natural language processing


Dipartimento di Ingegneria dell’Informazione
Degree Programme in Computer Engineering

Neural networks for resolving semantic ambiguity in natural language processing

Supervisors:

Prof. Mario G.C.A. Cimino

Prof.ssa Gigliola Vaglini

Prof. Roman Yangarber

Candidate:

Giacomo Furlan


Abstract

In this thesis, we address the task of morphological disambiguation in morphologically rich languages. In particular, we implement neural models using unlabeled and labeled data, where the labeled data is available in very limited amounts. This is also a central problem in many resource-poor and endangered languages.

We apply deep learning techniques to resolve morphological ambiguity, relying on existing morphological analyzers. We consider the problem of disambiguating the part-of-speech (POS) and the lemma of ambiguous words, given the context of the words.

The idea is to train recurrent neural networks to understand the context and to discriminate between the analyzer’s options for POS and lemma. We evaluate single-task and multi-task models, and we achieve state-of-the-art accuracy in resolving Italian, Russian and Finnish morphological ambiguity.


Acknowledgements

I would first like to thank my thesis supervisor, Professor Roman Yangarber of the University of Helsinki, whose guidance and feedback helped me constantly during the development of this project.

I would like to thank my professor, Professor Mario Cimino, for his constant support and help whenever needed.

I wish to express my deepest gratitude to Professor John Daer for helping me to improve my English writing.

I would also like to acknowledge the whole Language Learning group at the University of Helsinki, who have helped me constantly.

I would like to thank all my friends with whom I shared this long journey.

Finally, I must express my very profound gratitude to my parents and my family for providing me with unfailing support and continuous encouragement throughout my years of study. This accomplishment would not have been possible without them. Thank you.


Contents

1 Introduction
  1.1 Problem Description
  1.2 Objectives

2 Related Work
  2.1 Rule-based Model
  2.2 Stochastic Models
    2.2.1 Hidden Markov Model
    2.2.2 Viterbi Algorithm
    2.2.3 Decision Tree Model
  2.3 Models Based on Neural Network

3 Background
  3.1 Morphological Analyzer
  3.2 Multilayer Perceptron
  3.3 Recurrent Neural Networks
    3.3.1 Vanishing and Exploding Gradient Problem
  3.4 Long Short-Term Memory
  3.5 Bidirectional RNN
  3.6 Word Embedding

4 Methods
  4.1 Data Collection
  4.2 Data Pre-processing
    4.2.1 Tokenization
    4.2.2 Building Instances
  4.3 Model Architecture
    4.3.1 Model Idea
    4.3.2 Base Model
    4.3.3 Training the Model
  4.4 Attention Model
  4.5 Multitask Model
    4.5.1 Motivation
    4.5.2 Multitask Model for POS and Lemma prediction
    4.5.3 Introduction of a hyper-parameter to balance the losses
  4.6 Transfer Learning
    4.6.1 Two-Phase Learning
  4.7 Development framework

5 Evaluation
  5.1 Experiment Description
  5.2 Evaluation Metrics
  5.3 Current State-of-the-art Models
    5.3.1 Italian
    5.3.2 Russian
    5.3.3 Finnish
  5.4 Model Settings
  5.5 Evaluation Settings
  5.6 Single-Task Model
    5.6.1 Modified version of the model: the Attention Model
    5.6.2 Analysis of the size of the annotated dataset
  5.7 Multi-Task Model
  5.8 Comparison of Results

6 Conclusion
  6.0.1 Future Work

A Appendix
  A.1 Measures of top 5 ambiguous words

List of Figures

2.1 Rule based Model
2.2 Hidden Markov Model
2.3 Decision Tree Tagging
3.1 Perceptron
3.2 Multilayer Perceptron
3.3 ReLU vs. Leaky ReLU
3.4 Recurrent Neural Network
3.5 Vanishing Gradient
3.6 LSTM Architecture
3.7 Bidirectional LSTM Architecture
4.1 Model Idea
4.2 Model Scheme
4.3 Attention Mechanism
4.4 Attention Scheme
4.5 Multitask Model
4.6 Losses Comparison Multitask Model
4.7 Weighted Losses Multitask Model
4.8 Transfer Learning Concept
4.9 System Architecture
5.1 Average Cross-Validation Accuracy of Russian models splitting the bins by lemma
5.2 Accuracy of the model using a limited amount of data for the second-phase training
5.4 Accuracy of the model using a limited amount of data for the second-phase training
5.5 Comparison of Accuracy after second-phase training for Italian models
5.6 Comparison of Accuracy after second-phase training for Finnish models
5.7 Comparison of Accuracy after second-phase training for Russian models
A.1 Average Cross-Validation Accuracy of Italian models splitting the bins by lemma
A.2 Average Cross-Validation Accuracy of Finnish models splitting the bins by lemma

List of Tables

1.1 Analysis of the Italian word “gelato”
1.2 ≠ POS, ≠ Lemma Examples
1.3 = POS, ≠ Lemma Examples
1.4 ≠ POS, = Lemma Examples
1.5 = POS, = Lemma Examples
1.6 Percentage Ambiguity in the Dataset, divided by type
3.1 Analyzer Coverage
4.1 Datasets size (Number of instances in the dataset)
5.1 Lemma Target Size
5.2 Confusion Matrix
5.3 Percentage of Ambiguity for each Dataset
5.4 Network Setting
5.5 Hyperparameter Network
5.6 Hyperparameter Adam Optimizer
5.7 Reference Percentage Accuracy Single-Task Model
5.8 Percentage Accuracy Single-Task Model with pre-training
5.9 Percentage Measures of top 5 Russian ambiguous words
5.10 Percentage Measures of top 5 Russian ambiguous words with pre-training
5.11 Russian Model Guided Accuracy (Lemma)
5.12 Russian Model Guided Accuracy (POS)
5.13 Russian Model Guided Accuracy (POS) with pre-training
5.14 Russian Model Guided Accuracy (Lemma) with pre-training
5.15 Percentage Accuracy Single-Task Attention Model
5.17 Multi-Task Model Results
5.18 Multi-Task Model Results Pre-Trained
5.19 Multi-Task Model guided accuracy of predicting both lemma and POS correctly
A.1 Percentage measures of top 5 Italian ambiguous words
A.2 Percentage measures of top 5 Italian ambiguous words after second-phase training
A.3 Percentage measures of top 5 Finnish ambiguous words
A.4 Percentage measures of top 5 Finnish ambiguous words after second-phase training
A.5 Italian Model Guided Accuracy (POS) before 2nd-phase training
A.6 Italian Model Guided Accuracy (POS) after 2nd-phase training
A.7 Finnish Model Guided Accuracy (POS)


Chapter 1

Introduction

1.1

Problem Description

All natural languages are inherently ambiguous, with different degrees of ambiguity. Ambiguity is the quality of being open to more than one interpretation.

There are different forms of ambiguity that are relevant in natural language processing[1]:

Semantic Ambiguity (also called lexical ambiguity) is the presence of two or more possible meanings within a single word. This type of ambiguity is typically related to the interpretation of a sentence.

Consider the following sentence: “The coaches train the athletes”. The word train is considered ambiguous because it could be a noun (the train) or a verb (to train), but the context makes clear which one it is.

Syntactic Ambiguity (also called structural ambiguity or grammatical ambiguity) concerns sentences or sequences of words that can be parsed in multiple syntactical forms.

Take the following sentence: “John and Mary are married.” The meaning of the words is completely clear but the sense of the sentence is not: are they married to each other or separately?

In artificial intelligence (AI) theory, the group of techniques used to handle ambiguity is known as disambiguation. From a conceptual point of view, disambiguation is the process of determining the most probable meaning of a specific word considering the context.


Languages have a wide variety of morphological processes available to create words and word forms, generating different meanings. We define a word or a sentence “ambiguous” if it can be interpreted in more than one way. This doesn’t mean that the word is vague or unclear but it means that there are two or more distinct meanings available.

Morphological ambiguity is the only kind of ambiguity that concerns us, because it contains all the information that we need to resolve the ambiguity through the analysis of the context.

Context information plays a fundamental role in this task. For example, the same piece of information may be ambiguous in one context and unambiguous, and easily solvable, in another. A person is able to distinguish the meaning of a word automatically because they are able to contextualize the situation.

Resolving the ambiguity of words is a central problem for Natural Language Processing (NLP) and NLP applications such as machine translation, information retrieval, parsing, spelling correction, reference resolution, automatic text summarization, etc.

Another important application area is language learning. Disambiguating words can be really useful in platforms for language learning such as Revita[2][3], where intelligent tutoring systems (ITS) and computer-assisted language learning (CALL) are used to build a language learning environment.

In our context we work with morphologically rich languages, and the morphology of words differs depending on the category of language (isolating languages, agglutinative languages or inflecting languages).

In morphologically rich languages, words take different forms depending on their exact meaning or how they are used in the sentence. The more morphologically rich the language is, the more inevitable the ambiguity of some words becomes. Sometimes, different combinations of morphemes can have the same written form.

Morphology is the study of form-meaning relationships between words, a morpheme being the minimal unit of meaning (a word or a part of it).

A word is the smallest unit of language which has meaning by itself and may consist of a single morpheme or of a combination of morphemes.

A word is described by some morphological features that explain its structural characteristics. A word is defined by:

• Surface Form: The surface form of a word is the form of the word as it appears in the text. For example, the surface form of the verb “go” in the third person singular is “goes”, and the surface form of the noun “macchina” (car) in the plural is “macchine” (cars).

• Lemma: The lemma is the reference form under which the word is entered in a dictionary. Normally it corresponds to the nominative singular for a noun and, in Latin-based languages, to the infinitive form for a verb. For example, the lemma of the word “horses” is “horse” and the lemma of the word “parla” (he/she speaks) is “parlare” (to speak).

• Part-Of-Speech: A part of speech (POS) is the class of words defined by the function of the words in sentences, such as nouns, verbs, prepositions, etc.

Morphological analysis is a process that, after a tokenization phase (the process of splitting the text into its base elements), extracts POS, Lemma and grammatical features using a morphological analyser. The analyser returns a single analysis if the word is non-ambiguous. On the other hand, if the word is ambiguous, the analyzer returns more than one analysis for the word. The objective of this project is the identification of the correct one.

As a representative example, we report in Table 1.1 the analysis of the ambiguous Italian word “gelato”. This word could be a Noun (ice-cream), a Verb (to ice) or the adjectival form of the verb (iced). In the “tags” field the analyser inserts the morphological features of the word.

{'analyses': {'Italian': [
    [{'base': 'gelare', 'pos': 'Verb',
      'tags': {'TENSE': 'pp', 'GENDER': 'm', 'NUMBER': 'sg'}}],
    [{'base': 'gelato', 'pos': 'Adj',
      'tags': {'GENDER': 'm', 'NUMBER': 'sg'}}],
    [{'base': 'gelato', 'pos': 'Noun',
      'tags': {'GENDER': 'm', 'NUMBER': 'sg'}}]]},
 'surface': 'gelato'}

Table 1.1: Analysis of the Italian word “gelato”

Morphological ambiguity occurs when a morphological analyser gives more than one possible analysis for a surface form.

In our study case, we can separate the morphological ambiguity into four different classes, depending on the result of the analysis:

Different POS, Different Lemma : This is the ideal type of ambiguity we want to solve, since the information obtained solving one ambiguity, POS ambiguity or Lemma ambiguity, helps us to automatically disambiguate the other one. In tab. 1.2 we can see an example of this condition for each language in our study.

Language   Surface   Lemma     POS    Translation
Italian    parto     partire   Verb   to leave
                     parto     Noun   childbirth
Russian    начал     начать    Verb   to start
                     начало    Noun   beginning
Finnish    tuli      tulla     Verb   he came
                     tuli      Noun   fire

Table 1.2: ≠ POS, ≠ Lemma Examples

Same POS, Different Lemma: Since the POS is the same in every analysis, we cannot disambiguate this case considering the POS. We need to disambiguate the Lemma to understand the meaning of the sentence correctly. This is a difficult task because each possible solution could be grammatically correct within the sentence, but the meaning in context could change drastically when choosing the wrong one.

Language   Surface   Lemma        POS    Translation
Italian    accoppi   accoppiare   Verb   to pair
                     accoppare    Verb   to kill
Russian    велико    большой      Adj    big
                     великий      Adj    large
Finnish    palaa     palata       Verb   to return
                     palaa        Verb   he burns

Table 1.3: = POS, ≠ Lemma Examples

Different POS, Same Lemma: If the Lemma is the same for all the possible solutions, lemma disambiguation is not applicable and these ambiguities can be solved only by disambiguating the POS.

Language   Surface   Lemma     POS       Translation
Italian    ora       ora       Noun      hour
                     ora       Adv       now
Russian    видим     видимый   Adj       visual
                     видимый   Verb      to see
Finnish    kuusi     kuusi     Numeral   six
                     kuusi     Noun      spruce

Table 1.4: ≠ POS, = Lemma Examples

Same POS, Same Lemma: If the Lemma and POS are the same, we cannot disambiguate the word. This type of ambiguity could be solved by studying the morphological features of the word; the knowledge of both POS and Lemma is not enough. Our approach does not work with this type of ambiguity. In tab. 1.5 we report one example for each language of interest.

In some more complex cases, we can find a surface form whose ambiguity belongs to multiple classes. For example, in the previously seen tab. 1.1, we have three different possible analyses for the word ’gelato’. For this word we have an ambiguity between different POS and different Lemmas, and also an ambiguity between different POS and the same Lemma. These cases are frequent and they increase the complexity of the problem.

Language   Surface   Lemma     POS    Translation
Italian    gioca     giocare   Verb   he/she plays
                     giocare   Verb   play! (you)
Russian    вопрос    вопрос    Noun   of the issue
                     вопрос    Noun   issue
Finnish    nostaa    nostaa    Verb   he raises
                     nostaa    Verb   to raise

Table 1.5: = POS, = Lemma Examples

The introduction of a multitask model means we do not have to choose between a model that disambiguates the POS and a model that disambiguates the Lemma: the model will be able to handle all types of ambiguity automatically.

In table 1.6 we report the category distribution of ambiguities for each language.

Type              Italian   Russian   Finnish
≠ POS, ≠ Lemma      13.23     16.51     68.16
= POS, ≠ Lemma       1.67      7.63      8.93
≠ POS, = Lemma      71.88     28.90     16.87
= POS, = Lemma      13.23     46.96      6.04

Table 1.6: Percentage Ambiguity in the Dataset, divided by type


1.2

Objectives

Morphological disambiguation is one of the main tasks of automatic natural language processing. A morphological disambiguator is used to select the correct morphological analysis of a word.

The aim of this project is to design, implement and evaluate models for morphological disambiguation of words in morphologically rich languages.

The complexity of morphological disambiguation varies for each category of languages. For example, the morphological disambiguation is based on rather simple methods for English due to its poor morphology and rigid word ordering in sentences.

On the other hand, Russian has an inherent morphological ambiguity and a free word order that add complexity to the task.

In addition, in agglutinative languages such as Finnish, Turkish and Hungarian, the complex morphology of words greatly complicates morphological disambiguation, but it is a fundamental step before further processing can be carried out.

We need our approach to be language-independent, since we want to have a model that is valid for most of the different languages. In particular, the focus of the project is on those languages that lack annotated data for morphological disambiguation.

We tested on three very different languages to verify the effectiveness across different types of languages: Russian, Finnish and Italian.

The work hinges on the following hypotheses:

• a morphological analyzer is available for the language,

• there is a limited amount of annotated data available for that language.

We don't make any hypothesis about the distribution of our dataset.

Our model aims to:


• if the target word is unknown, learn how to disambiguate it from words in a similar context

• for the multitask model, the model should be able to improve accuracy using the implicit relationship between lemma and part-of-speech.

This project follows up the work of Hoya Quecedo et al. (2020)[4]; it aims to verify their findings and to improve on their results using additional strategies.


Chapter 2

Related Work

2.1

Rule-based Model

A rule-based model uses a large number of hand-crafted rules to select the correct morphological parse or POS tag of a given word in a specific context[5].

Rule-based taggers use a lexicon (or dictionary) to assign each word possible tags. If the word has more than one possible tag, the correct tag is disambiguated using hand-written rules to identify the correct tag. This method requires a deep knowledge of that specific language.

Figure 2.1: Rule based Model

In Fig. 2.1 we report an example of a disambiguation rule for an English rule-based tagging model.


The advantages of the rule-based approach are:

• the rules are human-readable and easy to understand
• they can be arbitrarily detailed for every condition
• no training phase of the model is needed

Instead, the disadvantages are the following:

• the rules are hand-written, which means that the creation process is very laborious
• the creation of the rules needs a deep knowledge of the specific language
• every exception has to have its own rule

This method is normally combined with statistical methods to reduce the number of hand-written rules and to improve accuracy, or with some automatic processes for generating the rules[6].

2.2

Stochastic Models

2.2.1

Hidden Markov Model

The HMM is a sequence tagging model. The goal of the model is to assign a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of words, it computes the distribution over possible sequences of labels and chooses the best label sequence. Given:

$x = x_1, \dots, x_T$, the sequence of words given as input, and
$y = y_1, \dots, y_T$, the sequence of tags used as labels,   (2.1)

the model finds the most probable sequence of tags given the sentence:

$y = \arg\max_y p(y|x) = \arg\max_y p(x, y)$   (2.2)

The joint probability can be expressed as:

$p(x, y) = p(x|y)\,p(y) \approx \prod_{t=1}^{T} p(x_t|y_t)\,p(y_t|y_{t-1})$   (2.3)

This is possible due to the following assumptions.

The Markov assumption:

The future of a process is conditionally independent of the past, once the present is known.

$p(y) \approx \prod_{t=1}^{T} p(y_t|y_{t-1})$   (2.4)

The conditional independence assumption:

The effect of a label $y_t$ on the observed word $x_t$ is independent of the other labels in the sequence.

$p(x|y) \approx \prod_{t=1}^{T} p(x_t|y_t)$   (2.5)

The probabilities $p(y_t|y_{t-1})$ are called “transition probabilities” and the probabilities $p(x_t|y_t)$ are called “emission probabilities”.

Figure 2.2: Hidden Markov Model

The goal of the model is to compute the y that maximizes this probability given x. To evaluate the product of equation 2.3 directly, the probability of every possible y has to be computed in order to find the best y.

2.2.2

Viterbi Algorithm

This exhaustive approach is not feasible because the complexity of the problem increases exponentially with the number of possible states: if the input sentence of the model has length T and there are S possible states, the number of possible sequences that have to be evaluated is $S^T$.

Dynamic programming is used to solve this problem efficiently. The Viterbi algorithm finds the most likely sequence of hidden states.

Let $Q_{t,s}$ be the most probable sequence of hidden states of length t that finishes with the state s and generates $o_1, \dots, o_t$. Let $q_{t,s}$ be the probability of this sequence; then $q_{t,s}$ is computed dynamically as follows:

$q_{t,s} = \max_{s'} \, q_{t-1,s'} \; p(s|s') \; p(o_t|s)$   (2.6)

The idea is to build the possible paths by choosing the most probable state at each step. Several solutions derived from the Hidden Markov Model have been proposed for sequence tagging, achieving good accuracy.

However, the main drawback of this type of solution is that an HMM cannot express dependencies between distant hidden states and is not able to capture the correlation of the words in a sentence.
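To make the dynamic program of equation 2.6 concrete, the following is a minimal Viterbi sketch in Python; the states and the transition/emission probabilities are toy values invented for the example, not taken from the thesis.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for `observations` (dynamic program of eq. 2.6)."""
    # q[t][s]: probability of the best path of length t+1 that ends in state s
    q = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        q.append({})
        back.append({})
        for s in states:
            # q_{t,s} = max_{s'} q_{t-1,s'} * p(s|s') * p(o_t|s)
            prev, prob = max(
                ((sp, q[t - 1][sp] * trans_p[sp][s] * emit_p[s][observations[t]])
                 for sp in states),
                key=lambda item: item[1])
            q[t][s], back[t][s] = prob, prev
    # Backtrack from the most probable final state.
    last = max(q[-1], key=q[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy example: tag "coaches train athletes" with invented probabilities.
states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.4, "Verb": 0.6}, "Verb": {"Noun": 0.7, "Verb": 0.3}}
emit_p = {"Noun": {"coaches": 0.4, "train": 0.2, "athletes": 0.4},
          "Verb": {"coaches": 0.1, "train": 0.8, "athletes": 0.1}}
print(viterbi(["coaches", "train", "athletes"], states, start_p, trans_p, emit_p))
# ['Noun', 'Verb', 'Noun']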


2.2.3

Decision Tree Model

The TreeTagger (Schmid, 1994)[7] is a special type of Markov Model that uses a decision tree to capture the context and improve the tagging performance of the Markov model. The algorithm builds a decision tree where the nodes are the possible tags of the previous words of the sentence. The tag is predicted using the information of the previous tags. In Fig. 2.3 we can see a sample structure of the tree proposed in the paper.

Figure 2.3: Decision Tree Tagging

This method was originally created for German but it has also been applied to several languages such as Greek, Italian and Russian.

This model can achieve a very good accuracy (more than 95%) but is very dependent on the quality of the training data for the construction of the tree.


2.3

Models Based on Neural Network

The use of neural networks for POS tagging was introduced by Schmid (1994)[8].

Currently, the sequence tagging/labelling problem is one of the main topics in Natural Language Processing.

The introduction of deep learning techniques in natural language processing has led to an improvement in accuracy over the classic methods for tasks like POS tagging, Named Entity Recognition and sequence labelling.

Currently, the trending research topics are:

• Long Short-Term Memory architectures
• Convolutional architectures
• Attention-based architectures

The structure of the Long Short-Term Memory (LSTM) architecture is described in detail in section 3.4; it is the base structure of our model. This strategy allows the model to learn the context of the input, remembering long-term dependencies.

An example of a convolutional network for sequence tagging was proposed by Collobert et al., 2011[9], where the model consists of convolutional layers (and subsampling) without any fully connected layer. The use of convolutional networks reduces the number of parameters by sharing weights and makes the learned features invariant to location. This invariance property can be an unwanted trait if the input ordering is critical.

Recently Zhang et al., 2018[10], proposed a document-level and corpus-level attention strategy for name tagging, using a bidirectional long short-term memory for encoding. The attention architecture focuses the model on the information that it has to learn.

Neural network models outperform the classic models based on statistics or manually crafted rules. In addition, they have some desirable properties: they do not require language-specific knowledge and they learn automatically from the data, without any manual work.

However, they also have some disadvantages: they require powerful machines and a lot of data for training, and they work as a black box, so what is learned between the input and the output is unknown to us.

This project focuses mainly on developing a model that learns in situations in which the annotated data are limited.


Chapter 3

Background

3.1

Morphological Analyzer

Morphological analysis may be defined as the process of obtaining grammatical information from tokens, given their suffix information. A morphological analyser is a model that segments the text into surface forms and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information.

For our project, we have used the following analyzers:

• For Russian, we used the Crosslator analyser.[11]
• For Italian, we used the analyser from Apertium.[12]
• For Finnish, we used the analyser from the Giellatekno platform.

The coverage of the analyzer is a relevant metric to understand how many tokens the analyzer is not able to analyze.

            Italian   Russian   Finnish
Coverage    94.22%    97.79%    95.14%

Table 3.1: Analyzer Coverage


3.2

Multilayer Perceptron

A multilayer perceptron is a feed-forward neural network with multiple layers of perceptrons. A neural network has a very strong representational power: due to the Universal Approximation theorem [13], any bounded continuous function can be represented using a model with only one hidden layer. In Fig. 3.1 we can see the structure of the perceptron: we have the input values, the weights and bias, a weighted sum and an activation function.

The output of a perceptron $o$ is defined as:

$o = \varphi\left(\sum_i w_i x_i + b\right)$   (3.1)

where $\varphi$ is our activation function, $w_i$ the weight associated with each input value and $b$ the bias.

If we use a non-linear activation function, the multilayer perceptron is considered a universal approximator, which means it can model any function (Hornik, 1991)[14].

Figure 3.1: Perceptron

In a multilayer perceptron (MLP), the structure is the same as that of a single-layer perceptron but with multiple hidden layers. Each layer has its own weight matrix, bias vector and activation function, and the output of the MLP is the composition of all its layers.

The backpropagation algorithm involves two phases: a forward phase and a backward phase. In the forward phase the values are propagated from the input to the output layer and the output is computed; in the backward phase, the error between the computed output and the label values is propagated backwards in order to correct the weights and biases.

Figure 3.2: Multilayer Perceptron

In our model we use a modified version of the ReLU activation function called Leaky ReLU (Maas et al., 2013)[15]:

$\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01 \cdot x & \text{otherwise} \end{cases}$   (3.2)

The difference between the ReLU function and the Leaky ReLU function is that the leaky rectified linear hidden unit alleviates potential problems caused by the hard 0 activation of ReLU; one such problem is the vanishing gradient problem discussed in section 3.3.1.
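As a minimal sketch of how such a block looks in PyTorch (the library used in this project, see section 4.7), the snippet below builds a one-hidden-layer MLP with Leaky ReLU; the layer sizes are arbitrary placeholders.

import torch
import torch.nn as nn

# One hidden layer with Leaky ReLU (negative slope 0.01, as in eq. 3.2).
# The layer sizes are arbitrary placeholders.
mlp = nn.Sequential(
    nn.Linear(300, 128),                    # input layer -> hidden layer
    nn.LeakyReLU(negative_slope=0.01),
    nn.Linear(128, 5),                      # hidden layer -> 5 output classes
)
logits = mlp(torch.randn(4, 300))           # a batch of 4 input vectors
print(logits.shape)                         # torch.Size([4, 5])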


3.3

Recurrent neural networks

Recurrent neural networks, or RNNs (Rumelhart et al., 1986a)[16], are a family of neural networks for processing sequential data. These types of neural networks are specialized for processing a sequence of values $x_1, \dots, x_n$. The strength of this model derives from the sharing of parameters across different parts of the model. Sharing parameters allows us to use the model with inputs of different lengths and to generalize across them. Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence. In a traditional fully connected feedforward network, all the inputs are considered independent of each other. A traditional model would have to learn all the rules of the language separately at each position in the sentence and it would not be able to use sequential information. On the other hand, the output of an RNN depends on its hidden state, generated from the sequential information of a time series.

Figure 3.4: Recurrent Neural Network

The RNN processes a sequence of vectors x by applying a recurrence formula at every time step:

$h_t = f_W(h_{t-1}, x_t)$   (3.3)

Where:

• $h_t$ is the new state
• $h_{t-1}$ is the old state
• $x_t$ is the input vector at the current time step

The major advantages of the use of recurrence are:

• regardless of the sequence length, the model specification is always the same, because it is expressed in terms of state transitions rather than a variable-length history of states
• the same transition function f with the same weight matrix W can be used at each time step

3.3.1

Vanishing and Exploding Gradient Problem

RNNs work upon the fact that the result of a piece of information is dependent on its previous state or previous n time steps. Regular RNNs might have a difficulty in learning long range dependencies.

RNNs compute $h^{(t)} = W^T h^{(t-1)}$ multiple times. These long-term dependencies introduce the problem of vanishing or exploding gradients.

The vanishing gradient problem is far more threatening than the exploding gradient problem, in which the gradients become very large due to single or multiple gradient values becoming very high. The reason is that an exploding gradient problem can easily be solved by clipping the gradients at a pre-defined threshold value. From figure 3.5 we can understand how the vanishing gradient behaves. Suppose we want to backpropagate the error of the model, $\delta E$. To calculate the error contribution at time step seven we need to compute a long chain of dependencies:

$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial y_7} \cdot \frac{\partial y_7}{\partial y_6} \cdot \ldots \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial W}$   (3.4)

Here we apply the chain rule, and if any one of the gradients approaches 0, all the gradients rush to zero exponentially fast because of the multiplication. Such states would no longer help the network to learn anything. This is known as the vanishing gradient problem. We can mitigate the vanishing gradient problem using LSTM (Long Short-Term Memory) cells.


Figure 3.5: Vanishing Gradient

3.4

Long Short-Term Memory

Long Short Term Memory networks (LSTMs) are a special kind of RNN, created specifically to learn long-term dependencies. They were proposed by Hochreiter & Schmidhuber (1997)[17], and they have been widely used and refined for a large variety of problems.

In Fig. 3.6 the architecture of a LSTM cell is presented. The goal of each cell is to decide which information to memorize and which information to forget.

Figure 3.6: LSTM Architecture

An LSTM has three different inputs: the previous cell state $c_{t-1}$, the output of the previous cell $h_{t-1}$ and the encoded word $x_t$. In the forget gate layer we decide which information we want to forget:

$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f)$   (3.5)

where $[x, y]$ is the concatenation of x and y, and $\sigma$ is the standard logistic activation function defined as:

$\sigma(x) = \frac{1}{1 + e^{-x}}$   (3.6)

The forget gate layer outputs a value in $[0, 1]$ for each element of the previous cell state $c_{t-1}$, where 0 means forgetting the value and 1 means keeping it.

Then we decide which information we want to store in the cell. In the input gate layer we use a sigmoid activation function to decide which values to update, and we use a tanh activation function to create a new vector of candidate values $c'_t$:

$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i)$   (3.7)

$c'_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c)$   (3.8)

where

$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$   (3.9)

These values are combined to compute the current state of the cell: we sum the vector with the old information and the vector with the new information:

$c_t = f_t \cdot c_{t-1} + i_t \cdot c'_t$   (3.10)

In the output gate layer, instead, we determine the output of our memory cell. This value is obtained by combining the previous cell output and a filtered version of the cell state:

$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o)$   (3.11)

The function tanh pushes the values of $c_t$ between -1 and 1, and the output of the cell is obtained as:

$h_t = o_t \cdot \tanh(c_t)$

3.5

Bidirectional RNN

Bidirectional RNNs (BRNNs) were introduced by Schuster and Paliwal, 1997[18] to overcome the limitation of using only input information from the past. In a bidirectional RNN we train two RNNs simultaneously, in the positive and negative time directions, learning from both the previous and the following data. Each layer has a completely independent forward pass and backward pass, and the output of the BRNN is the concatenation of the forward and backward hidden states $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$. In our model we use a bidirectional LSTM, a special bidirectional RNN that is better suited to learning the context of the sentence. The following figure (Fig. 3.7) shows the structure of a bidirectional layer.

Figure 3.7: Bidirectional LSTM Architecture

Bidirectional LSTMs have been widely used for different natural language tasks like sequence tagging and Named Entity Recognition. In particular, Huang et al.[19] compared different deep learning methods for sequence tagging and NER. Their results showed that a bidirectional LSTM model can produce state-of-the-art (or close to it) accuracy and that the bidirectional LSTM model obtains better tagging accuracy than a single LSTM layer model with identical feature sets.
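A minimal PyTorch sketch of such a bidirectional LSTM layer is shown below; the embedding and hidden sizes are placeholders, and the output at each step is the concatenation of the forward and backward hidden states.

import torch
import torch.nn as nn

embedding_dim, hidden_dim = 300, 128        # placeholder sizes
bilstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(2, 9, embedding_dim)        # 2 sentences of 9 embedded tokens each
outputs, (h_n, c_n) = bilstm(x)
# outputs[:, t, :] is the concatenation of the forward and backward hidden states at step t.
print(outputs.shape)                        # torch.Size([2, 9, 2 * hidden_dim]) = [2, 9, 256]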


3.6

Word Embedding

Word embedding is a word representation that allows words with similar meaning to have a similar representation. Word embedding is in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space.

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks.

One of the most used libraries for embedding is Facebook’s fastText embedding library.[20] The main goal of fastText embedding is to take into account the internal structure (morphology) of the word while learning the word representation. It uses a skip-gram model, where each word is represented as a bag of character n-grams.

A vector representation is associated with each character n-gram and words are represented as the sum of these representations. This representation is used to determine the likelihood of each word, given the context of the word.

In the paper “Bag of Tricks for Efficient Text Classification”, Joulin et al.[21] compared the fastText model with recently proposed methods inspired by deep learning and showed that fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.

In our model, the input is made up of surface forms, represented as strings of characters. In order to process this information, we need to embed each word, transforming the string into a meaningful representation for the neural network. This representation is a real-valued vector of size 300.

The advantages of fastText are:

• We can get embeddings of surface forms for morphologically rich languages like Finnish
• It allows us to obtain an embedding for out-of-vocabulary words, using its set of character n-grams
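As an illustration, the snippet below shows how surface forms can be embedded with the fastText Python bindings; the model file name is an assumption (pre-trained 300-dimensional Italian vectors downloaded from the fastText website).

import fasttext

# Pre-trained 300-dimensional Italian vectors (file name assumed, from the fastText website).
ft = fasttext.load_model("cc.it.300.bin")

vec = ft.get_word_vector("gelato")          # in-vocabulary surface form
oov = ft.get_word_vector("gelatone")        # out-of-vocabulary: built from character n-grams
print(vec.shape, oov.shape)                 # (300,) (300,)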


Chapter 4

Methods

4.1

Data Collection

One of the many issues of most machine learning tasks is the lack of usable data for building an accurate model. To collect our data we used the Universal Dependencies Treebank repository[22]. UD is an open community effort that provides reliable annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.

The data are distributed in a standard data format called CoNLL-U[22], derived from the CoNLL-X data format[23]. The data consist of automatically annotated sentences that have been corrected manually.

For each word we have the following information:

• ID: Word index
• FORM: Word form or punctuation symbol
• LEMMA: Lemma or stem of the word form
• UPOSTAG: Universal part-of-speech tag (Google universal POS tags)
• XPOSTAG: Language-specific part-of-speech tag
• FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension

• HEAD: Head of the current word, which is either a value of ID or zero

• DEPREL: Universal Stanford dependency relation to the HEAD
• DEPS: List of secondary dependencies
• MISC: Any other annotation

However, we used only the FORM, the LEMMA and the UPOSTAG of the morphological annotation. For Finnish, as the number of annotated instances is quite small, Hoya Quecedo et al. (2020)[4] used a collection of 1M proprietary news articles to increase the dataset. We consider the data of the Universal Dependencies Treebank as our ground truth and we use these files to build our instances.
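As a sketch, the snippet below reads a treebank file with the third-party conllu package and keeps only the three fields we use; the file name is a placeholder, and older versions of the package expose the POS field as "upostag" instead of "upos".

from conllu import parse_incr               # third-party package: pip install conllu

# Path to a UD treebank file in CoNLL-U format (placeholder).
with open("it-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        # Keep only the fields we actually use: FORM, LEMMA and UPOSTAG.
        triples = [(tok["form"], tok["lemma"], tok["upos"]) for tok in sentence]
        print(triples[:5])
        break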

4.2

Data Pre-processing

4.2.1

Tokenization

Tokenization is one of the most important steps of the pre-processing of natural language text preparation. Tokenization is the process of tokenizing or splitting a string or a text into a list of tokens.

In linguistic analysis it is important to define which elements constitute a word or a sentence. In some languages this process is straightforward thanks to the identification of word borders and punctuation marks separated by spaces and line breaks. In contrast, in other languages the process is not trivial because, for example, explicit word borders or even sentence boundaries may be undefined.[24]

For example, we can see how the following sentence, “Il ragazzo mangia la mela.” (the boy eats the apple), is tokenized:

Il ragazzo mangia la mela.
Il | ragazzo | mangia | la | mela | .

Even in space-delimited languages there are several language dependent challenges:

• Managing punctuation: question marks and apostrophes are a major source of tokenization ambiguity. For example, in English we can have “we’ve” for “we have”, or in Italian we can have “un’ape” for “una ape”.


Even the full stop can be ambiguous: it can mark the end of a sentence, an abbreviation or a decimal point.

• Compound words: some categories of words are not separated by whitespace. This is very common in languages like Finnish, German or Dutch, but it is also present in Italian. For example, the word “tuinfeest” (Dutch for garden party) is split into “tuin” (garden) and “feest” (party), and the word “portagli” (Italian for take it to him) is split into “porta” (bring/take) and “gli” (to him/her).

For non-space-delimited languages (called scriptio continua languages) like Mandarin or Thai, the problem of separating the words or even the sentences is more challenging.

For our Finnish training corpus, the Turku OpenNLP model[25] has been used for tokenization.[4] It is a statistical model, trained to deal with all the challenging cases. All the tokens in the Universal Dependencies treebanks are already provided in tokenized form.

4.2.2

Building Instances

We use the tokens obtained from the tokenization phase to build the instances. The instances have different features depending on the model and on the task. For the single task model (the prediction of lemma or POS) the structure is the following:

• Training Instances: A context window (representing the sentence), the word and the label for the task (Lemma or POS)

• Testing Instances: A context window (representing the sentence), the word, the label for the task (Lemma or POS) and all the possible options for the label

For the multi task model (the prediction of both lemma and POS simultaneously) there is more information needed for each instance:

• Training Instances: A context window (representing the sentence), the word and the labels both for Lemma and POS

• Testing Instances: A context window (representing the sentence), the word, the labels both for Lemma and POS and all the possible options for both Lemma and POS


The labels are collected from the CoNLL-U files presented in section 4.1 and they represent our ground truth. On the other hand, the options are extracted from the analysis of the word.

There may be some discrepancy between the possible options returned by the analyzer and the manually assigned labels. In addition, we must consider that analysers may be incomplete and contain errors. For this reason, if there is a conflict between options and labels, we discard the instance.

Context window: the context window is an ordered list of surface forms containing our target word and the previous and following words. A context window is defined by the target word and by a radius, the number of preceding and following elements to consider when we want to extract information about the context. Each element of the window is a token, or the special tag “<pad>” if the information is not available.

Suppose, for example, that we have the target word “stato” with a radius of 4; the context window is the following:

<pad> <pad> <pad> Lo stato garantisce la libertà .

This window is created using the whole text document and not only a single sentence. This can be really useful when, for example, the information needed to disambiguate the word is in the next sentence and not in the current one, especially if the sentence is very short.
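A small sketch of how such a window can be built in Python (padding at the document borders with the “<pad>” tag) is shown below; the helper name is ours, not from the original implementation.

def context_window(tokens, index, radius=4, pad="<pad>"):
    """Return the tokens around tokens[index], padded at the document borders."""
    window = []
    for i in range(index - radius, index + radius + 1):
        window.append(tokens[i] if 0 <= i < len(tokens) else pad)
    return window

tokens = ["Lo", "stato", "garantisce", "la", "libertà", "."]
print(context_window(tokens, tokens.index("stato")))
# ['<pad>', '<pad>', '<pad>', 'Lo', 'stato', 'garantisce', 'la', 'libertà', '.']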

Hoya Quecedo et al. (2020)[4] improved the tokenization using information from the analysis. For compound words, which are very frequent in Finnish, they split the word into maximal compounds by analyzing the compound identified by the analyzer.

For example, the Finnish compound word eläinlääkäriasema (“veterinary clinic”) is made up of three elementary stems: eläin (“animal”) + lääkäri (“doctor”) + asema (“station”).

However, since the analyser has eläinlääkäri (“veterinarian”) in its lexicon, the compound is split into eläinlääkäri + asema. This reduces the vocabulary size, since there are many possible compounds in Finnish.

For Russian and Italian this is not a problem as there are very few compound words or the cases are well defined and already identified by the analyzer that splits this type of compound into tokens.

Label: the label could be a POS or a Lemma, depending on the target of the model. The set of possible labels is determined from the output of the analyzer.

For example, the target word “stato” could be a Verb (“to be”) or a Noun (“state”). The POS label for the context presented in the previous example is Noun, and the possible options are [Noun, Verb].

We ignored categories of words such as personal names, which are usually disambiguated using a technique called Named Entity Recognition (NER).


4.3

Model Architecture

4.3.1

Model Idea

Figure 4.1: Model Idea

The model aims to disambiguate the word’s lemma and the word’s POS, given a working morphological analyzer for that language. The idea of the model is presented in Fig. 4.1. We extract a context vector of the sentence using bidirectional RNNs and we use that information, together with the embedding of the word, as input to a classifier, more precisely a multilayer perceptron (MLP) classifier. The classifier returns the probability associated with each class, represented in Fig. 4.1 by a color scale. Since we know the analysis for that word, we can use that information to compute the maximum probability over the possible classes only, reducing the number of possible predictions. We use an “options” vector to drop the neurons of classes that cannot be predicted, keeping only the ambiguous classes for that word.

4.3.2

Base Model

In this paragraph we describe the basic architecture of our model. The input of our model is a sequence of words $x = [x_0, \dots, x_w]$ and our output is the predicted class $y$.


Figure 4.2: Model Scheme

The sequence x is made up of the surface forms of the words, and the first step is to feed our sentence to an encoder layer. The encoder we use is fastText, as previously described in section 3.6. The embedding function returns an embedded vector of size n; in our case we set n = 300. After that, the embedding vectors are fed to the BiLSTM to get the context vector.

$\overrightarrow{h_t} = \mathrm{LSTM}^L_t(x_{0:t}) \qquad \overleftarrow{h_t} = \mathrm{LSTM}^L_t(x_{w:t})$   (4.2)

where $x_t$ is the target word, and $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the contexts of the previous words and of the following words, respectively.

The concatenation of the previous and the following context with the embedding of the target word $x_t$ is used as input for the multilayer perceptron (MLP):

$\hat{x} = \overrightarrow{h_t} \oplus \overleftarrow{h_t} \oplus x_t$   (4.3)

where $\oplus$ means concatenation.

$z = \mathrm{MLP}(\hat{x})$   (4.4)

To obtain the probability distribution for the prediction we compute $\hat{y}$ by applying a softmax function. The softmax function is defined as:

$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_k e^{x_k}}$   (4.5)

The function normalizes the values and returns a probability distribution $\mathbb{R}^{|V|} \to [0, 1]^{|V|}$. The predicted class is given by $\hat{y}$:

$\hat{y} = \mathrm{softmax}(z)$   (4.6)
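The following is a condensed PyTorch sketch of this architecture (embedding of the context window, BiLSTM context, concatenation with the target embedding, MLP, and masking of the classes excluded by the analyzer). The layer sizes and the masking constant are placeholder assumptions, not the exact implementation.

import torch
import torch.nn as nn

class Disambiguator(nn.Module):
    """Sketch of the base model: BiLSTM context + target embedding -> MLP -> masked scores."""

    def __init__(self, emb_dim=300, hidden=128, n_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden + emb_dim, 128),
            nn.LeakyReLU(0.01),
            nn.Linear(128, n_classes),
        )

    def forward(self, window_emb, target_pos, options_mask):
        # window_emb: (batch, window, emb_dim) fastText vectors of the context window
        # target_pos: index of the target word inside the window
        # options_mask: (batch, n_classes), 1 for classes allowed by the analyzer, 0 otherwise
        ctx, _ = self.bilstm(window_emb)                        # (batch, window, 2 * hidden)
        x_hat = torch.cat([ctx[:, target_pos, :],               # forward/backward context at the target
                           window_emb[:, target_pos, :]],       # embedding of the target word
                          dim=-1)
        z = self.mlp(x_hat)
        # "Drop" the neurons of the classes excluded by the analyzer before the softmax.
        return z.masked_fill(options_mask == 0, -1e9)

model = Disambiguator()
scores = model(torch.randn(2, 9, 300), target_pos=4,
               options_mask=torch.tensor([[1, 1, 0, 0, 0], [0, 1, 1, 0, 0]]))
print(torch.softmax(scores, dim=-1).argmax(dim=-1))             # predicted classes (eq. 4.6)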

4.3.3

Training the Model

To train the model we apply the gradient descent algorithm to a function we want to minimize, called the objective function or loss function. Given a loss function L, gradient descent is performed as:

$\nabla L(\vec{w}) = \left[\frac{\partial L}{\partial w_0}, \dots, \frac{\partial L}{\partial w_n}\right]$   (4.7)

$\vec{w} \leftarrow \vec{w} - \alpha \nabla L(\vec{w})$   (4.8)

Minimizing the loss means finding the parameters of the model that are responsible for the error and updating/correcting them. The learning rate α is defined by the optimization function we are using.

We use a loss function called cross-entropy. Cross-entropy is a measure of the divergence between two probability distributions: the empirical distribution from the data and the predicted distribution from the model.

Cross-entropy is calculated using the probability distributions as follows:

$H(\hat{y}, y) = -\frac{1}{n} \sum_i^n y_i \log(\hat{y}_i)$   (4.9)

where H() is the cross-entropy function, y is the target distribution and $\hat{y}$ is the predicted distribution.
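A minimal training-step sketch with PyTorch’s cross-entropy loss and the Adam optimizer (the optimizer referenced in Table 5.6) is shown below; it reuses the Disambiguator sketch from section 4.3.2, and the synthetic batch and learning rate are placeholders.

import torch
import torch.nn as nn

# Reuses the Disambiguator sketched in section 4.3.2; the batch below is synthetic.
model = Disambiguator()
criterion = nn.CrossEntropyLoss()                          # cross-entropy of eq. 4.9
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate is a placeholder

window_emb = torch.randn(2, 9, 300)                        # embedded context windows
options_mask = torch.tensor([[1, 1, 0, 0, 0], [0, 1, 1, 0, 0]])
labels = torch.tensor([0, 2])                              # ground-truth classes from CoNLL-U

for step in range(100):                                    # a few gradient descent steps (eq. 4.8)
    optimizer.zero_grad()
    scores = model(window_emb, target_pos=4, options_mask=options_mask)
    loss = criterion(scores, labels)                       # softmax + cross-entropy on masked scores
    loss.backward()                                        # backpropagate the error
    optimizer.step()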


4.4

Attention Model

The attention mechanism has been widely used to improve neural machine translation by selectively focusing on parts of the source sentence during translation. Attention, as the name suggests, provides a mechanism that creates a focus on certain input time steps for an input sequence of arbitrary length.

The attention mechanism has been widely used in Natural Language Processing, especially for sentence-to-sentence translation.[26]

Yang et al. (2016)[27] propose a hierarchical attention network for document classification, proving that the attention mechanism can be used for classification problems and not just for sequence generation. In their paper they proposed a “word attention” mechanism: not all words contribute equally to the representation of the sentence meaning. The attention mechanism focuses on words that are important to the meaning of the sentence and creates a representation of those informative words to form a sentence vector.

Figure 4.3: Attention Mechanism

We calculate $s_i$, the attention similarity score for a word. $u_i$ is its hidden representation, which we can see as a non-linearity applied to the RNN output for that word. We compute the dot product between the context vector and $u_i$, and then we compute the score $s_i$ by normalizing this value.

In our model we use a simplified version of the word attention model proposed in the paper by Yang et al.[27]

The idea behind the paper is that not all words contribute equally to the representation of the sentence meaning. Using the word attention mechanism we extract words that are important to the meaning of the sentence.


We denote by $u_i$ the hidden representation of $x_i$, the output of the bidirectional RNN.

$\alpha_t = \frac{\exp(u_t^T c_w)}{\sum_t \exp(u_t^T c_w)}$   (4.11)

where $c_w$ is the word context vector. This vector is randomly initialized and learned during the training process. The value $\alpha$ represents a normalized weight that generates an importance score for the words in the sentence.

We use this score to weight the output of the bidirectional RNN and then we pass the weighted features to the second layer of bidirectional RNN.

$c_t = \alpha_t h_t$   (4.12)

The mechanism creates a fixed-length embedding c of the input sequence by computing an adaptive weighted average of the state sequence h.
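A sketch of this word-attention weighting (equations 4.11 and 4.12) in PyTorch is shown below; the hidden size is a placeholder and the module name is ours.

import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Sketch of the word-attention weighting of eqs. 4.11 and 4.12."""

    def __init__(self, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)           # non-linearity on the RNN outputs
        self.context = nn.Parameter(torch.randn(hidden_dim))    # word context vector c_w (learned)

    def forward(self, h):
        # h: (batch, seq_len, hidden_dim), outputs of the first bidirectional RNN layer
        u = torch.tanh(self.proj(h))                            # hidden representations u_t
        scores = u @ self.context                               # u_t^T c_w, shape (batch, seq_len)
        alpha = torch.softmax(scores, dim=1)                    # eq. 4.11
        return alpha.unsqueeze(-1) * h                          # c_t = alpha_t * h_t (eq. 4.12)

att = WordAttention()
weighted = att(torch.randn(2, 9, 256))       # weighted features, fed to the second BiLSTM layer
print(weighted.shape)                        # torch.Size([2, 9, 256])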


4.5

Multitask Model

4.5.1

Motivation

In Machine Learning, we typically create a single model or an ensemble of models to perform a specific task and we try to optimize a particular metric. We try to find the perfect model for that specific task and we use fine-tuning to train the models until their performance no longer increases.

Using this strategy, we focus on our single task and we are able to achieve really good results, but we ignore context information that might be useful to achieve even better results.

The idea is that we might gain information from the training of related tasks. By sharing representations between related tasks, we can enable our model to generalize better on our original task. This approach, called multi-task learning, has been used successfully across all applications of machine learning, from natural language processing and speech recognition to computer vision.

In Natural Language Processing, Collobert et al. (2008)[28] demonstrated that learning tasks simultaneously can improve generalization performance in specific contexts.

They implemented a single convolutional neural network architecture that, given a sentence, outputs a number of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense (grammatically and semantically) using a language model.

The entire network is trained jointly on all these tasks using multitask learning. They showed that the accuracy of each task has improved using a multi-task model.

4.5.2

Multitask Model for POS and Lemma prediction

Multi-task learning aims to improve learning efficiency and prediction accuracy by learning multiple objectives from a shared representation.

The architecture we used is called “Hard Parameter Sharing”. In hard parameter sharing, a common hidden layer is used for all tasks but several task-specific layers are kept separated towards the end of the model. This technique is very useful as by learning a representation for various tasks performed by a common hidden layer, we reduce the risk of overfitting.


In Fig. 4.5 we can see the architecture of our model.

Figure 4.5: Multitask Model

In our model we use a shared layer consisting of a bidirectional RNN and MLP$_0$, a multilayer perceptron, in which the model learns a generalized representation of our data. The output of the shared MLP is fed to two layers, MLP$_1$ and MLP$_2$, each specific to one task (Lemma or POS prediction). For each task a prediction is computed and a loss is calculated using the cross-entropy (H).
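The sketch below illustrates hard parameter sharing in PyTorch: a shared BiLSTM and MLP_0 trunk followed by two task-specific heads. Sizes and class counts are placeholders, and the option masks of section 4.3.1 are omitted for brevity.

import torch
import torch.nn as nn

class MultitaskDisambiguator(nn.Module):
    """Sketch of hard parameter sharing: shared BiLSTM + MLP_0, task-specific MLP_1 and MLP_2."""

    def __init__(self, emb_dim=300, hidden=128, n_pos=5, n_lemma=1000):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.shared = nn.Sequential(                        # MLP_0: shared representation
            nn.Linear(2 * hidden + emb_dim, 256), nn.LeakyReLU(0.01))
        self.pos_head = nn.Linear(256, n_pos)               # MLP_1: POS prediction
        self.lemma_head = nn.Linear(256, n_lemma)           # MLP_2: lemma prediction

    def forward(self, window_emb, target_pos):
        ctx, _ = self.bilstm(window_emb)
        x_hat = torch.cat([ctx[:, target_pos, :], window_emb[:, target_pos, :]], dim=-1)
        shared = self.shared(x_hat)
        return self.pos_head(shared), self.lemma_head(shared)

model = MultitaskDisambiguator()
pos_logits, lemma_logits = model(torch.randn(2, 9, 300), target_pos=4)
# One cross-entropy loss per task (H in Fig. 4.5).
loss_pos = nn.CrossEntropyLoss()(pos_logits, torch.tensor([1, 3]))
loss_lemma = nn.CrossEntropyLoss()(lemma_logits, torch.tensor([42, 7]))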

4.5.3

Introduction of a hyper-parameter to balance the losses

The main problem of the multitask model is the optimization of the gradient descent algorithm when there are two losses instead of a single one.

The losses can have different scales, and the gradient descent algorithm would reduce the bigger loss without taking the other losses into account. The naive solution is to use a weighted sum of losses, where the loss weights are uniform or manually tuned:

$L_{tot} = \sum_i w_i L_i$   (4.13)

However, this solution needs an appropriate choice of weighting between each task’s loss and searching for an optimal weighting is computationally expensive and difficult to resolve with manual tuning.

In figure 4.6 we can see that the losses of the two tasks have different scales and this prevents the model from learning correctly.

In particular, the model has not been able to learn the POS because the loss for the Lemma was far bigger. Multiple solutions to this problem have been proposed, but we apply the idea proposed in the paper “Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics” [29].

We introduced a new hyperparameter to balance the losses. This parameter is learned automatically during the training of our model. This solution reduces a specific type of uncertainty that does not change with the input data and is task-specific.

Let $f^W(x)$ be the output of a neural network with weights W on input x. For our classification we squash the model output through a softmax function, and sample from the resulting probability vector:

$p(y|f^W(x)) = \mathrm{Softmax}(f^W(x))$   (4.14)

In maximum likelihood inference, we maximise the log likelihood of the model. In our model we define the cross-entropy loss:

$L_i(W) = -\sum_k y_k \cdot \log \mathrm{Softmax}_k(f^W(x)) = -\log \mathrm{Softmax}_j(f^W(x))$   (4.15)

where only $y_j$, the correct label to be predicted, is one, and all other $y_k$ are zero. Kendall et al. [29] proved that the joint loss can be approximated as:

$L(W, \sigma_0, \sigma_1) \approx \frac{1}{\sigma_0^2} L_0(W) + \log \sigma_0 + \frac{1}{\sigma_1^2} L_1(W) + \log \sigma_1$   (4.16)

Minimizing this objective with respect to $\sigma_0$ and $\sigma_1$ can be interpreted as learning the relative weights of the losses for each output. For large scale values $\sigma_i$, the contribution of $L_i(W)$ will decrease, whereas for a small scale $\sigma_i$ its contribution will increase. The objective function is penalized when setting $\sigma_i$ too large. The function will try to minimize the weights, but the addition of $\log \sigma_i$ prevents them from converging to zero. The graph at the bottom of Fig. 4.6 shows how the function balances the weights of the losses. In particular, we can notice how the scales of the two losses become the same.

Figure 4.6: Losses Comparison Multitask Model

The upper graph and the lower graph show the loss before and after the balancing, respectively. We indicate with the letter “W” the Weighted Losses.


Observing Fig. 4.7 we can see how the losses are initially not balanced. The backward pass will adjust the weight corresponding to the higher loss. Once the losses are balanced, they are both able to be corrected.

Figure 4.7: Weighted Losses Multitask Model

The graphs show the loss for each task for the first 400 iterations. We indicate with “W Loss Pos” the Weighted Losses for POS and with “W Loss Lemma” the Weighted losses for Lemma.
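A sketch of this loss balancing in PyTorch is shown below. Following a common implementation of the method of Kendall et al., it learns $s_i = \log \sigma_i^2$ for numerical stability; this parametrization is an assumption, not necessarily the exact one used in the project, and the weighting module's parameters must be added to the optimizer together with the network weights.

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Sketch of eq. 4.16: balance two losses with learned log-variance parameters."""

    def __init__(self):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(2))   # s_i = log(sigma_i^2), learned during training

    def forward(self, loss_pos, loss_lemma):
        s0, s1 = self.log_vars[0], self.log_vars[1]
        # (1 / sigma_i^2) * L_i + log(sigma_i)  ==  exp(-s_i) * L_i + s_i / 2
        return (torch.exp(-s0) * loss_pos + 0.5 * s0 +
                torch.exp(-s1) * loss_lemma + 0.5 * s1)

weighting = UncertaintyWeightedLoss()
total_loss = weighting(loss_pos, loss_lemma)   # losses from the multitask sketch above
total_loss.backward()                          # gradients reach both the network and log_vars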


4.6

Transfer Learning

The classic supervised machine learning paradigm is based on learning in isolation, a single predictive model for a task using a single dataset. This approach requires a large number of training samples for learning the parameters.

Transfer learning refers to a set of methods that extend this approach by reusing a model developed for one task on a second task, leveraging its generalization properties.[30]

Transfer learning has been used in a large variety of fields, but it is especially effective in situations where obtaining data samples and manual data labeling are very time-consuming and expensive. Milcevic et al.[31] showed how to apply transfer learning for supervised image classification using CNNs in situations where the dataset is limited.

In our case, the dataset with ambiguous words was too small for training a single model and, in the previous work, the idea was to use only the non-ambiguous words for training the model.[4]

                      Italian             Russian              Finnish
                   Lemma     POS       Lemma      POS       Lemma     POS
Non-ambiguous     209140   206873     993980   1021643     192373   189254
Ambiguous          60561    62828     112310     84647      14120    17239
Ratio non-A / A     3.45     3.29       9.25     12.07      14.02    11.38

Table 4.1: Datasets size (Number of instances in the dataset)

The idea proposed by Hoya Quecedo et al. (2020)[4] is to train a model for tagging non-ambiguous words and to use that information for another task: disambiguating the ambiguous words. They proved that a good accuracy (more than 75%) can be achieved using only the information of the non-ambiguous instances.

They proposed this solution because the labeled data were insufficient to build a deep neural model trained on ambiguous words. As we can see in tab. 4.1 the amount of data available for ambiguous words is really small compared to the non-ambiguous ones. This is not a problem for languages such as English for which we have a huge source of labeled data available, but it is really restrictive when we work with less popular languages. These languages are the target of the project.


4.6.1

Two-Phase Learning

One drawback of the idea of Hoya Quecedo et al. (2020)[4] is that we lose all the labeled data of the ambiguous words in our dataset. This part of the dataset is smaller, but its information value is huge because these instances could teach the network to solve our specific task. We can take the idea another step forward and use all our data to train the model. The idea is to use transfer learning to apply the knowledge learned from the non-ambiguous data to solve a new problem faster or with a better solution; in this case, the morphological disambiguation of a word.

Figure 4.8: Transfer Learning Concept

In machine learning, transfer learning is usually realized through the use of a pre-trained model, i.e., a model that was trained on a sufficiently large dataset to solve a problem similar to the one we want to solve. We then continue training the model on our limited-size dataset to adjust the parameters for the specific task.


Our training stage is made up of two phases:

• PRETRAINING PHASE: we use the non-ambiguous data to train the model until we obtain a satisfactory result on the basic task (word tagging);

• TRAINING PHASE: we use the ambiguous data to learn how to disambiguate an ambiguous word.

This two-phase approach has been widely used in situations where the cost of labeling the data is high, especially in computer vision tasks. [32]
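The two phases can share a single training loop. The sketch below is only a minimal illustration of the idea: model, pretrain_loader, finetune_loader and loss_fn are placeholders for the actual components described in this chapter, and the optimizer settings are indicative.

import torch

def run_phase(model, loader, loss_fn, epochs, lr):
    """Generic training loop shared by both phases."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()

# Example usage (the objects below are placeholders for the real ones):
#   run_phase(model, pretrain_loader, loss_fn, epochs=20, lr=1e-3)  # phase 1: non-ambiguous data
#   run_phase(model, finetune_loader, loss_fn, epochs=20, lr=1e-3)  # phase 2: ambiguous data,
#                                                                   # starting from the pretrained weights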


4.7 Development framework

The project has been developed using Python as the programming language of choice. Python is the most popular programming language for machine learning and AI-based projects due to its simplicity and consistency, the availability of excellent libraries and frameworks for AI and machine learning (ML), its flexibility and platform independence, and a wide community of users. For the implementation of the neural networks we have used PyTorch. [33] PyTorch is a Python-based library used to perform scientific computing operations, with acceleration on graphical processing units (GPUs). It enables neural network modeling, training, and testing, with a focus on deep learning and high performance.

Figure 4.9: System Architecture

The development of the project can be divided into the following phases:

• DATA COLLECTION: we extract our data from a Universal Dependencies treebank, as previously mentioned in section 4.1. From the file, we extract the sentences and the ground truth for each word. The next step is to interact with the analyzer: the input of the analyzer is a single word, and the output is a JSON message, converted into a Python dictionary, containing the analysis of the word. If the message contains more than one analysis, the word is ambiguous (a minimal parsing sketch is shown after this list).


• DATA PRE-PROCESSING: the pre-processing phase is composed of a first step, data cleaning, and a second step, data embedding. For the embedding step we used the FastText embeddings, as explained in section 3.6.

• MODEL SELECTION: we implemented different possible model configurations. The model can be single-task or multitask, and it can optionally include an attention layer. We load the configuration file for the model, which is language-independent. The model is implemented using PyTorch.

• MODEL TRAINING: we train the model with the configuration parameters described in sec. 5.4, using the High-Performance Computing servers of the University of Helsinki. In particular, we used a GPU server, which allows PyTorch to parallelize most of the computations and speed up the training phase.

• MODEL EVALUATION: to evaluate the model we used the Python sklearn.metrics library. The sklearn.metrics module was created for machine learning and implements functions for assessing prediction error for specific purposes.
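As referenced in the data collection step above, the snippet below sketches how an analyzer reply could be parsed and the word flagged as ambiguous. The field names ("analyses", "pos", "lemma") and the example reply are purely illustrative, since the exact JSON schema depends on the analyzer in use.

import json

def parse_analysis(raw_json: str):
    """Convert the analyzer's JSON reply into a dict and flag ambiguity.

    The field names used here are only illustrative; the real schema
    depends on the analyzer.
    """
    reply = json.loads(raw_json)
    analyses = reply.get("analyses", [])
    is_ambiguous = len(analyses) > 1          # more than one analysis -> ambiguous word
    options = [(a.get("pos"), a.get("lemma")) for a in analyses]
    return {"ambiguous": is_ambiguous, "options": options}

# Example with a made-up reply for an ambiguous Italian word ("porta"
# can be a noun or a form of the verb "portare"):
example = ('{"analyses": [{"pos": "Noun", "lemma": "porta"},'
           ' {"pos": "Verb", "lemma": "portare"}]}')
print(parse_analysis(example))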


Chapter 5

Evaluation

5.1 Experiment Description

The data we used for training and evaluation are in the CoNLL-U 2006/2007 data format from the Universal Dependencies Treebank, as previously mentioned in section 4.1.

The target of the model can be:

• Part-of-Speech

• Lemma

• Both Part-of-Speech and Lemma

We define 5 classes for the POS whereas the number of classes for the lemma is variable and language-dependent.

The POS classes are the following [4]:

• Noun: Nouns and noun-like POS, such as numerals or acronyms.

• Adjective: Adjectives and adjective-like POS, such as participles (in Russian).

• Verb: All types of verbs, including auxiliary verbs.

• Adverb: Adverbs and adverb-like POS, such as adpositions.


The sizes of the lemma classes are reported in tab. 5.1.

Language   Italian   Russian   Finnish
Size       11 651    36 928    18 708

Table 5.1: Lemma target size

For each model we define two different types of accuracy:

• Blind Accuracy: the blind accuracy is computed by keeping the prediction with the highest probability among all possible classes. This prediction does not take into account the options given by the analyzer: it treats the model as a simple tagger rather than as a disambiguation model.

• Guided Accuracy: the guided accuracy is computed considering only the classes given by the analyzer. The highest probability is chosen only among the possible classes, and the task of the model is to disambiguate between these classes.
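To make the difference concrete, the sketch below computes both predictions from a vector of class scores. It is a minimal illustration: scores stands for the model's output over all classes and candidate_ids for the indices of the classes proposed by the analyzer (both names are illustrative).

import torch

def blind_and_guided_prediction(scores: torch.Tensor, candidate_ids: list):
    """Return the blind prediction (argmax over all classes) and the
    guided prediction (argmax restricted to the analyzer's candidates)."""
    blind = int(torch.argmax(scores))
    # mask out every class that the analyzer did not propose
    mask = torch.full_like(scores, float("-inf"))
    mask[candidate_ids] = 0.0
    guided = int(torch.argmax(scores + mask))
    return blind, guided

# Example: 5 POS classes, the analyzer allows only classes 1 and 3
scores = torch.tensor([2.1, 0.3, 1.7, 0.9, -0.5])
print(blind_and_guided_prediction(scores, [1, 3]))   # -> (0, 3)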

5.2 Evaluation Metrics

The evaluation metrics we have used to evaluate our models are: accuracy, precision, recall and the F1 measure (calculated from precision and recall).

For evaluating our model we need to compute four different basic metrics:

• True positives (TP): Number of positive tuples that were correctly classified by the model

• True negatives (TN): Number of negative tuples that were correctly classified by the model

• False positives (FP): Number of tuples that were incorrectly classified as positive

• False negatives (FN): Number of tuples that were incorrectly classified as negative

These terms are summarized in a table called a "confusion matrix". From the confusion matrix we can easily compute all the metrics that we need to evaluate a model.


                              Predicted Class
                              Positive    Negative    Total
Actual Class    Positive      TP          FN          P
                Negative      FP          TN          N
                Total         P'          N'          Total

Table 5.2: Confusion Matrix

The model accuracy is the percentage of the test-set tuples that are correctly classified:

Accuracy = \frac{TP + TN}{Total}    (5.1)

Precision is the percentage of tuples that the model classified as positive which are actually positive:

Precision = \frac{TP}{TP + FP}    (5.2)

Recall is the percentage of positive tuples that the model classified as positive:

Recall = \frac{TP}{TP + FN}    (5.3)

The F-score (F1) combines precision and recall into a single measure and represents the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (5.4)

F1 is usually more informative than accuracy, especially when the dataset is unbalanced.
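As noted in section 4.7, these quantities were computed with the sklearn.metrics module. The example below, with made-up labels, shows standard functions from that module that implement the formulas above; the exact calls used in our evaluation scripts may differ.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Made-up gold labels and predictions for a small binary example
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / Total
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(precision, recall, f1)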

5.3 Current State-of-the-art Models

The state-of-the-art models we discuss in this section are not only models for morphological disambiguation; they are also taggers. This means that their accuracy is computed over all possible tokens. Our model, on the other hand, is built only for the disambiguation task, because it relies on the analysis of an analyzer. To compare it with these results, we computed a token-level accuracy in which the predictions of our model on non-ambiguous words are counted as 100% accurate. [4]


The token-level accuracy T is defined as:

T = \frac{(100 - A) + A \cdot G}{100}    (5.5)

where A is the percentage of ambiguous tokens (reported in tab. 5.3) and G is the (guided) accuracy of our model on the ambiguous tokens. T is thus the weighted average of the accuracy over all the tokens, ambiguous and non-ambiguous.

Target              Italian    Russian    Finnish
Ambiguity   Lemma   22.45      10.15      8.35
            POS     23.30      7.65       6.84

Table 5.3: Percentage of ambiguity for each dataset
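For instance, taking the Italian POS ambiguity A = 23.30 from tab. 5.3 and a purely hypothetical guided accuracy G = 0.90, the token-level accuracy would be T = (100 − 23.30 + 23.30 · 0.90) / 100 ≈ 0.977, i.e., about 97.7% at the token level.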

5.3.1 Italian

The state-of-the-art models for Italian are the following:

• POS Tagging: (98.00% accuracy) TINT [34], a model derived from Stanford CoreNLP that provides the part-of-speech annotation through the Maximum Entropy implementation (Toutanova et al., 2003) [35]. This model is built using the same dataset we used for our model (the Universal Dependencies dataset for Italian).

• Lemmatization: (96.06% accuracy) [36], achieved using a deep neural network architecture. The proposed model is context-sensitive, with a two-stage bidirectional gated recurrent neural network.

5.3.2 Russian

The state-of-the-art models for Russian are the following:

• POS Tagging: (96.94%) TreeTagger [7], a probabilistic tagger tested on Russian by Dereza et al. (2016) [37].

• Lemmatization: (98.00%) AnIta, based on a large hand-written lexicon and two-level rule-based morphology.


5.3.3 Finnish

The state-of-the-art models for Finnish are the following:

• POS Tagging: (97.70%) TurkuNLP [38], a system based on a sequence-to-sequence neural machine translation architecture [39].

• Lemmatization: (95.54%) TurkuNLP for morphological tagging, using a sequence-to-sequence neural network architecture with a morphosyntactic context representation. [38][40]

This model is built using the same dataset we used for our model (Universal Dependencies).

5.4 Model Settings

The network settings we have used are the following:

Window size          21
Embedding size       300
LSTM hidden units    512
LSTM layers          1
MLP hidden neurons   1024

Table 5.4: Network settings
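The sketch below shows roughly how these settings translate into a PyTorch module for the single-task case. It is only an approximation of the architecture described in chapter 4: it assumes a bidirectional LSTM (section 3.5) and that the representation of the central word of the window is fed to the MLP; the real model may combine the LSTM states differently, and the class name is illustrative.

import torch
import torch.nn as nn

class DisambiguationTagger(nn.Module):
    """Sketch of a single-task tagger matching the settings in tab. 5.4:
    300-dim embeddings, one bidirectional LSTM layer with 512 hidden
    units, and an MLP head with 1024 hidden neurons."""
    def __init__(self, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=300, hidden_size=512, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * 512, 1024),   # 2 * 512: forward + backward states
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(1024, num_classes),
        )

    def forward(self, window_embeddings):
        # window_embeddings: (batch, window_size=21, 300)
        outputs, _ = self.lstm(window_embeddings)
        # use the representation of the central (target) word
        center = outputs[:, outputs.size(1) // 2, :]
        return self.mlp(center)

# Example: a batch of 20 windows of 21 words, 5 POS classes
model = DisambiguationTagger(num_classes=5)
logits = model(torch.randn(20, 21, 300))   # -> shape (20, 5)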

For training, we used the hyperparameters reported in tab. 5.5. Dropout [41] is used during training as a form of regularization to avoid overfitting.

Batch size     20
Dropout rate   0.1
Epochs         20

Table 5.5: Training hyperparameters

As optimizer we used the Adam optimizer, with the settings reported in tab. 5.6.

The Adam (Adaptive Moment Estimation) optimization algorithm combines Momentum and RMSProp, and its update is obtained with the following equations:

\Delta w_t = -\eta \, \frac{v_t}{\sqrt{s_t} + \epsilon}, \qquad w_{t+1} = w_t + \Delta w_t    (5.6)

where v_t is the exponential average of the gradients along w_i, s_t is the exponential average of the squares of the gradients along w_i, \epsilon is a term added to the denominator to improve numerical stability, and \eta is the learning rate. The values of v_t and s_t are defined as:

v_t = \beta_1 v_{t-1} + (1 - \beta_1) g_t, \qquad s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2    (5.7)

where g_t is the gradient at time t along w_i, and \beta_1 and \beta_2 are two hyperparameters.

β1   0.9
β2   0.999
ε    0.0001
η    0.001

Table 5.6: Adam optimizer hyperparameters
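These settings map directly onto PyTorch's built-in optimizer, as sketched below; the linear layer is only a placeholder for the actual network being trained.

import torch

model = torch.nn.Linear(10, 2)   # placeholder for the actual network

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,             # eta (learning rate)
                             betas=(0.9, 0.999),  # beta_1, beta_2
                             eps=1e-4)            # epsilon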

5.5 Evaluation Settings

For the evaluations we applied two different strategies:

• Holdout Method with Random Split: the data are randomly partitioned into two independent sets: the training set and the test set. The training set is usually between 70% and 90% of the whole dataset and it is used to derive the model. We performed the holdout method n times using different seeds for the random split and then we took the average of the results.

• 10-Fold Cross-Validation: the data are randomly partitioned into 10 mutually exclusive subsets. We performed training and testing 10 times, using 9 of the 10 subsets for training and the remaining one for testing. Each subset was used the same number of times, and we computed the average of the 10 results.
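Both strategies correspond to standard utilities in scikit-learn. The sketch below uses a placeholder dataset and only illustrates how the splits can be produced; the actual evaluation code may organize this differently.

import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(100).reshape(-1, 1)   # placeholder instances
y = np.arange(100) % 2              # placeholder labels

# Holdout with a random split (here 80% train / 20% test),
# repeated with different seeds and then averaged
for seed in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # ... train on (X_train, y_train), evaluate on (X_test, y_test)

# 10-fold cross-validation: each fold is used once as the test set
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... train and evaluate, then average the 10 results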
