
POLITECNICO DI MILANO
Facoltà di Ingegneria
Scuola di Ingegneria Industriale e dell'Informazione
Dipartimento di Elettronica, Informazione e Bioingegneria

Master of Science in Computer Science and Engineering

DeepSumm: A Deep Learning Approach to Text Summarization

Advisor: Prof. Licia Sbattella
Co-advisor: Ing. Roberto Tedesco

Master Graduation Thesis by: Riccardo Campo
Student Id n. 875724


ACKNOWLEDGMENTS

I would like to thank my advisors Prof. Licia Sbattella and Ing. Roberto Tedesco for the help and insights they provided throughout this work. They gave me the opportunity to learn and discover new topics I am now passionate about.

I would also like to thank my family and my friends who gave me their unconditional support during my academic career.


CONTENTS

Abstract
Sommario
1 Introduction
  1.1 Motivations
  1.2 Goals
  1.3 Document Outline
2 Background and Related Works
  2.1 Natural Language Processing
  2.2 Automatic Text Summarization
  2.3 Extractive vs. Abstractive
    2.3.1 Extractive Summarization problem formulation
    2.3.2 Abstractive Summarization formulation
  2.4 Extractive techniques
    2.4.1 Common features
    2.4.2 Graph methods
    2.4.3 Clustering methods
    2.4.4 Fuzzy Logic methods
    2.4.5 Neural Networks methods
    2.4.6 Ontology based methods
    2.4.7 Recent advancements and state of the art
  2.5 Abstractive techniques
    2.5.1 Word graph methods
    2.5.2 Seq-to-Seq methods
  2.6 Abstractive PAS methods
3 Techniques and Tools
  3.1 Words and Syntax
    3.1.1 Tokenizer
    3.1.2 Part of Speech (POS) tagger
    3.1.3 Stemmer
  3.2 Semantics
    3.2.1 Predicate Argument Structure (PAS)
    3.2.2 Anaphora resolution
  3.3 Deep Learning
    3.3.1 Convolutional Neural Network (CNN)
    3.3.2 Recurrent Neural Network (RNN)
    3.3.3 Long short-term memory (LSTM)
  3.4 Embedding vectors
    3.4.1 Word Embeddings
    3.4.2 Sentence Embeddings
4 Proposed Method
  4.1 Abstractive algorithm overview
  4.2 Preprocessing and SRL
  4.3 Feature extraction
    4.3.1 Sentence position score
    4.3.2 Sentence length score
    4.3.3 TF-IDF score
    4.3.4 Numerical data score
    4.3.5 Centrality score
    4.3.6 Title similarity score
  4.4 Deep Model Architecture
  4.5 Summary Composition
  4.6 The extractive algorithm
  4.7 Training the model
    4.7.1 Loss function
    4.7.2 Accuracy problems
  4.8 Score Extraction
    4.8.1 2-clusters
    4.8.2 N-clusters
    4.8.3 Best-N
5 Datasets
  5.1 DUC 2002 dataset
  5.2 The New York Times annotated corpus
  5.3 Dataset differences
6 System Evaluation
  6.1 Summary evaluation metrics
    6.1.1 ROUGE
    6.1.2 Pyramid
  6.2 Experiment settings
    6.2.1 Choosing the score extraction method
    6.2.2 RNN tuning
    6.2.3 Improving summary quality
  6.3 Results
    6.3.1 Abstractive
    6.3.2 Extractive
    6.3.3 Comparing results
  6.4 Analysis of the generated summaries
    6.4.1 External programs problems
    6.4.2 Readability and meaning
7 Conclusions and Future Works
  7.1 Conclusions
  7.2 Future works
Bibliography
A Generated Summaries
  A.1 Abstractive Summaries
  A.2 Extractive Summaries


LIST OF FIGURES

Figure 2.1  Generic single document summarizer architecture
Figure 3.1  POS tagging sliding window
Figure 3.2  Stemming
Figure 3.3  Anaphora resolution of a quote by A. Turing
Figure 3.4  A convolutional layer
Figure 3.5  RNN unfolding
Figure 3.6  An LSTM cell
Figure 3.7  Seq-to-Seq model applied to machine translation
Figure 3.8  Sentence Embeddings
Figure 4.1  System overview diagram
Figure 4.2  Document matrix
Figure 4.3  DNN Architecture
Figure 4.4  Extractive system overview diagram
Figure 4.5  2-clusters
Figure 4.6  N-clusters
Figure 4.7  Best-N
Figure 6.1  Example of Pyramid
Figure 6.2  Epochs
Figure 6.3  Activation functions
Figure 6.4  Dense layers


LIST OF TABLES

Table 3.1   Most common POS tags from the Penn Treebank POS list
Table 5.1   NYT annotated corpus statistics
Table 5.2   Direct speech in dataset documents
Table 5.3   SRL tag percentages
Table 5.4   Sample summaries
Table 6.1   2-cluster score extraction method evaluation
Table 6.2   Maximum ROUGE-1 recall scores
Table 6.3   DUC abstractive models ROUGE-1 scores
Table 6.4   NYT abstractive models ROUGE-1 scores
Table 6.5   NYT abstractive tested on DUC, ROUGE-1 scores
Table 6.6   DUC abstractive tested on NYT, ROUGE-1 scores
Table 6.7   DUC extractive models ROUGE-1 scores
Table 6.8   NYT extractive models ROUGE-1 scores
Table 6.9   Comparing abstractive methods trained on DUC
Table 6.10  Comparing abstractive methods trained on NYT
Table 6.11  Comparing extractive methods trained on DUC
Table 6.12  Comparing extractive methods trained on NYT
Table 6.13  SENNA annotation errors
Table 6.14  CoreNLP co-reference annotation errors
Table 6.15  Wrong sub-sentence meaning
Table 6.16  Selecting subordinate clauses
Table 6.17  Anaphora resolution does not affect ROUGE recall


LIST OF EXAMPLES

Example 3.1  Period ambiguity
Example 3.2  POS tags of a quote by A. Turing
Example 3.3  SRL of a quote by A. Turing
Example 4.1  Equivalent scorings

ACRONYMS

PAS    Predicate Argument Structure
NLP    Natural Language Processing
ATS    Automatic Text Summarization
AATS   Abstractive Automatic Text Summarization
EATS   Extractive Automatic Text Summarization
DL     Deep Learning
ROUGE  Recall-Oriented Understudy for Gisting Evaluation
SENNA  Semantic/syntactic Extraction using a Neural Network Architecture
POS    Part of Speech
SRL    Semantic Role Labeling
BMCP   Budgeted Maximum Coverage Problem
NLG    Natural Language Generation
CNN    Convolutional Neural Network
RNN    Recurrent Neural Network
DNN    Deep Neural Network
NN     Neural Network
LSTM   Long short-term memory
BLSTM  Bidirectional Long short-term memory
TF     Term Frequency
IDF    Inverse Document Frequency
MLP    Multilayer Perceptron
SVM    Support Vector Machine
HAC    Agglomerative Hierarchical Clustering
MSE    Mean Squared Error
SCU    Summary Content Unit
DUC    Document Understanding Conference


ABSTRACT

Automatic Text Summarization (ATS) is one of the most difficult Natural Language Processing (NLP) tasks. It is commonly divided into two categories: extractive summarization, which selects the most significant sentences, and abstractive summarization, which generates new text. Research on ATS, especially in the abstractive case, has not yet found a good solution: a system capable of producing coherent and readable summaries. From this perspective, in this work we propose a method to perform abstractive and extractive summarization with the aim of improving readability and performance in ATS.

The abstractive method we propose is based on the Predicate Argument Structure (PAS) of the sentence. We will show that it is possible to extract small units of meaning from a sentence by leveraging the relations between each predicate and its arguments. We employ Deep Learning (DL) techniques to train a model to predict a ranking over the sentences (or sub-sentences). To perform this task, we embed the content of each sentence using six statistical and semantic features plus the sentence embeddings, derived using a pre-trained model. The presented model can be adapted to different domains by changing the features to reflect specific characteristics of interest.

We trained our model using news articles from the DUC and NYT datasets; we obtained good results in terms of the ROUGE recall metric compared with state-of-the-art ATS algorithms.


SOMMARIO

Automatic summary generation is one of the most complex problems in Natural Language Processing. It is commonly divided into two categories: extractive summarization, which selects the most significant sentences, and abstractive summarization, which produces new text. Research in this field, especially on abstractive summaries, has not yet found a good solution, that is, a system that produces coherent and readable summaries. From this perspective, the present work proposes a method to perform abstractive and extractive summarization, with the goal of achieving better performance and readability than current systems.

The proposed abstractive method is based on the Predicate Argument Structure of the sentence. We show that it is possible to extract short semantic units from a sentence by exploiting the relations between each predicate and its arguments. Deep Learning techniques were employed to train a model to predict a ranking over the sentences (or sub-sentences). To do so, the content of each sentence is embedded into six numerical, syntactic, and semantic features, together with sentence embeddings derived from a pre-trained model. The presented model can be adapted to different domains by changing its features to reflect particularly significant aspects of the text.

Our model was trained on newspaper articles from the DUC and NYT corpora. We obtained good results on the ROUGE recall metric, compared with the state of the art in automatic summarization algorithms.


1 INTRODUCTION

Non exiguum temporis habemus, sed multum perdidimus.¹
— Seneca, De Brevitate Vitae

1.1 Motivations

The abundance of data in the form of text we face these days has led Natural Language Processing (NLP) researchers to design algorithms that extract the most useful information out of long and redundant text.

While state-of-the-art Question Answering systems have reached notable results (Tayyar Madabushi, Lee, and Barnden, 2018), research in Automatic Text Summarization (ATS) still has not achieved good enough results in terms of readability and human comprehension (especially when summarizing long text, as stated in Paulus, Xiong, and Socher, 2018; See, Liu, and Manning, 2017). Recent research on ATS is focusing more on Abstractive Automatic Text Summarization (AATS), that is, the process of creating a new text that summarizes the given input text, as opposed to Extractive Automatic Text Summarization (EATS), which just selects the best sentences out of the input text. However, unlike EATS, most abstractive methods derive just a short headline from the input text.

One of the main reasons summarization is difficult is the lack of a fully satisfactory evaluation metric for summaries. Indeed ROUGE (Lin, 2004), the most widely adopted metric in ATS evaluation, is merely based on word overlap, so it does not take into account the semantics and syntax of the summary (Cohan and Goharian, 2016). Thus, one of the motivations driving this work is to research a new summarization algorithm that tries to improve the readability of automatically generated summaries.

Recent research in AATS employs Deep Learning (DL) techniques to produce short abstracts maximizing the ROUGE recall score (Rush, Chopra, and Weston, 2015; Paulus, Xiong, and Socher, 2018; See, Liu, and Manning, 2017).

¹ It is not that we have a short space of time, but that we waste much of it.


These methods, although reaching high ROUGE scores, fail to produce long summaries with good readability. This approach differs from EATS methods based on statistical and semantic features (Radev et al., 2004; Yadav and Meena, 2016). We believe that these features are relevant and should be taken into consideration, as they have proved to work in ATS.

Therefore, another motivation is to take statistical and semantic information into account while using novel techniques based on DL. We can shape the set of features to satisfy a specific application need or type of text. Indeed, the summarization task is highly dependent on the domain the text belongs to, so features that are relevant to one domain may not be relevant to another.

1.2 Goals

The main goal of this thesis is to design an algorithm to produce a summary starting from a given text. The second goal is to use a DL approach based on statistical and semantic features, as well as sentence embeddings, to represent relations between sentences (Cer et al., 2018). We aim at controlling, through the use of features, the information that labels a sentence as important. At the same time, we will explore sentence embeddings as a new method to represent and compare sentences.

A third goal is to find a method to automatically extract a ranking over sentences, given the reference summary. Indeed, to train our Deep Neural Network (DNN) we will need to transform the text-summary pairs into a continuous representation.

To fulfill these goals we will employ both extractive and abstractive summarization. To perform the latter, we make use of Semantic Role Labeling (SRL) to extract sub-phrases from the original text sentences (sentence compression). This approach is often considered a hybrid between extractive and abstractive summarization; indeed, the output text is not made of entire sentences from the text, but it is, nonetheless, made of subparts of the original sentences.

Our system has been applied to the domain of newspaper articles, as is most research in ATS, since this domain provides a wide selection of corpora.

1.3 Document Outline

We start this work, in Chapter 2, by introducing the reader to ATS, delving deep into the difference between abstractive and extractive summarization. We will also explore the most relevant and recent literature on ATS, both extractive and abstractive.

We continue, in Chapter 3, by presenting the tools and technologies we adopted in our work. These include text processing tools, DL, and sentence embeddings. In Chapter 4 we extensively present our approach to ATS, starting with a system overview and then going through the algorithm details: from the adopted features to the DL technique we used and its training.


We dedicate Chapter 5 to the datasets we employed in our work; we will discuss their characteristics and differences. The evaluation of our system is discussed in Chapter 6: we will explain the experiment settings and the paths that led us to the final results; then we will compare these results to those of related articles.

Finally, in Chapter 7, we present the conclusions of our work and some ideas to improve our system, in terms of both readability and information extraction.


2 BACKGROUND AND RELATED WORKS

In this chapter we provide an overview of the approaches to ATS and review the literature on this topic.

We will first introduce the concept of NLP and its main purposes. Then we present ATS as a particular NLP task, underlining the differences between extractive and abstractive summarization. Finally, we review the literature on ATS, showing the most relevant works related to our thesis.

2.1 Natural Language Processing

NLP (also called Human Language Technology) is an interdisciplinary field that aims at automating tasks involving human language; it is a subfield of Artificial Intelligence. NLP tasks may handle human-machine communication, improve human-human communication, or perform useful processing of text or speech (Jurafsky and Martin, 2009, Chapter 1).

Although NLP has been talked about since 1950 (when the Turing test was proposed), only in the last few years has its usage spread to a variety of fields: from everyday applications like Google Translate and chatbots like Siri, Cortana, and Alexa, to more challenging tasks like YouTube's automatic Content ID checker (Hosseini, Xiao, and Poovendran, 2017).

It is clear that NLP is gaining more and more attention from non-IT industry too: for example, many company websites now feature a chatbot for customer care.

2.2 Automatic Text Summarization

Text Summarization is "the process of distilling the most important information from a text to produce an abridged version for a particular task and user" (Mani, 1999).

We can list the main characteristics of a summary:

• Compactness. It contains the salient information of the analyzed text.

• Source. It may be produced from one or more documents, but also from speech or multimedia input.

• Length. Always shorter than the original source, and often not longer than a couple of hundred words.

• Shape. It may be of different typologies: from simple text to a list of elements, to a text graph.

The main goal of the summary is to display useful information extracted and rearranged from the source text. In other words, the summary allows the user to save time in reading the whole document by reading only the salient parts.

In the literature there are various classifications of summaries based on different linguistic concepts. In terms of generality we can distinguish between:

• Informative summary. Its aim is to present the concepts of the original text in a shorter form, while preserving the salient information.

• Descriptive summary. Gives a general idea of the source text without going into the specific details of its content.

We can also differentiate summaries based on the topic they deal with:

• Query-focused. The summary is produced to answer a specific user query.

• Topic-based. The summary focuses its content on a small set of topics of interest.

• Generic. Provides the source information without considering any particular user need.

Finally, we can distinguish between abstracts and extracts, i.e., whether the summary is composed of the original text sentences or of newly generated sentences.

Regarding the summarization process, we can divide it into two categories concerning the source to summarize: single-document and multi-document summarization. The latter utilizes a set of documents from which common important information is extracted; in this case, redundancy is a significant indicator of importance.

In this work we will focus on single-document summarization (Figure 2.1). In both cases, however, ATS performs these fundamental operations:

• Content selection. The main problem of summarization: choosing the most important sentences or pieces from the original text.

• Information ordering. Ordering and structuring the extracted units.

• Sentence realization. Cleaning up the units to produce a fluent text.

(Jurafsky and Martin, 2009, Chapters 23, 24–26)


Figure 2.1: Generic single document summarizer architecture (content selection, with sentence segmentation and sentence extraction; information ordering; sentence realization)

2.3 Extractive vs. Abstractive

We will spend a few more words on the difference between extractive and abstractive summarization. Although their aim, producing a summary, is the same, they are quite different in terms of problem formulation.

2.3.1 Extractive Summarization problem formulation

As explained in Takamura and Okumura, 2010, single document extractive summarization can be formulated as a Budgeted Maximum Coverage Problem (BMCP). The document content is represented as conceptual units that can correspond to words. The model selects the subset of sentences so that the sum of the benefits of the conceptual units covered by those sentences is maximized:

$$
\begin{aligned}
\max \quad & \sum_j b_j z_j \\
\text{s.t.} \quad & \sum_i c_i x_i \le K, \qquad \sum_i a_{ij} x_i \ge z_j \;\; \forall j \\
& x_i \in \{0, 1\} \;\; \forall i, \qquad z_j \in \{0, 1\} \;\; \forall j
\end{aligned}
\tag{2.1}
$$

where:
$z_j$ is 1 if conceptual unit $e_j$ is covered, 0 otherwise;
$x_i$ is 1 if sentence $s_i$ is selected, 0 otherwise;
$b_j$ is a non-negative number indicating the benefit of conceptual unit $e_j$;
$a_{ij}$ is 1 if sentence $s_i$ contains conceptual unit $e_j$;
$c_i$ is the length of sentence $s_i$;
$K$ is the maximum legitimate summary length.

Basically, we want to select the sentences that give us the best information while respecting the length constraint. Therefore we need a method to rate the importance of each sentence, so that we can select the best sentences until the maximum length is reached. Most unsupervised methods employ the notion of term importance, either to directly rank sentences or to deduce relations between sentences that will then be used to rank them.
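To make the selection process concrete, the following is a minimal Python sketch of the standard greedy heuristic for budgeted maximum coverage: repeatedly pick the affordable sentence with the largest marginal benefit until the budget K is exhausted. All names and the toy data are illustrative; this is an illustration of the problem in Equation 2.1, not the method proposed in this thesis.

def greedy_summary(sentences, benefit, length, K):
    # sentences: id -> set of conceptual units; benefit: unit -> b_j;
    # length: id -> c_i; K: the summary length budget.
    selected, covered, used = [], set(), 0
    while True:
        best, best_gain = None, 0.0
        for s, units in sentences.items():
            if s in selected or used + length[s] > K:
                continue
            # Marginal benefit: units this sentence would newly cover.
            gain = sum(benefit[u] for u in units - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            break
        selected.append(best)
        covered |= sentences[best]
        used += length[best]
    return selected

# Example: three sentences over four conceptual units, budget of 15 words.
sents = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
print(greedy_summary(sents, {"a": 3, "b": 1, "c": 2, "d": 2},
                     {0: 8, 1: 6, 2: 7}, K=15))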


The usual method to determine term importance is to compute the Term Frequency (TF)-Inverse Document Frequency (IDF) score:

$$
TF_{i,j} = \frac{n_{i,j}}{|d_j|} \tag{2.2}
$$

$$
IDF_i = \log \frac{|D|}{|\{d : i \in d\}|} \tag{2.3}
$$

where:
$n_{i,j}$ is the number of occurrences of term $i$ in document $j$;
$|d_j|$ is the number of terms in document $j$;
$D$ is a collection of $|D|$ documents.
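As a concrete illustration, the following minimal Python sketch transcribes Equations 2.2 and 2.3 directly (the toy corpus and function names are ours, purely for illustration):

import math
from collections import Counter

def tf_idf(term, doc, docs):
    # TF (Eq. 2.2): occurrences of the term in the document over |d_j|.
    tf = Counter(doc)[term] / len(doc)
    # IDF (Eq. 2.3): log of |D| over the number of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
print(tf_idf("cat", docs[0], docs))  # "cat" occurs in 2 of the 3 documents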

2.3.2 Abstractive Summarization formulation

We can generalize the extractive summarization formulation to the abstractive case:

$$
\begin{aligned}
\max \quad & \sum_j b_j z_j \\
\text{s.t.} \quad & \sum_{s \in Abs} c(s) \le K \\
& Abs \subset \mathcal{S}, \qquad z_j \in \{0, 1\} \;\; \forall j
\end{aligned}
\tag{2.4}
$$

where:
$z_j$ is 1 if conceptual unit $e_j$ is covered, 0 otherwise;
$b_j$ is a non-negative number indicating the benefit of conceptual unit $e_j$;
$\mathcal{S}$ is the set of all possible sentences;
$c : \mathcal{S} \mapsto \mathbb{N}$ returns the length of each sentence;
$K$ is the maximum legitimate summary length.

Notice that in this case we are not constrained to selecting sentences belonging to the original document. We have also dropped the constraint stating that all the concepts are contained at least once in the selected sentences; indeed, this is trivially satisfied considering the set $\mathcal{S}$ of all possible sentences. Since the set $\mathcal{S}$ has infinite cardinality, we cannot call this problem a BMCP and we cannot bound its complexity.

Ideally, we would want to store the main concepts from the source and generate a new text from them. However, this task is particularly difficult, so the main solutions involve reshaping the original text. Some possible techniques are:

• Selecting the best sentences and using rules to compress them into minimal meaningful sentences.

• Training a Natural Language Generation (NLG) model to produce a summary starting from the source text.


• Selecting the words from a word graph constructed from the original text.

We will explore these techniques in Section 2.5.

2.4 Extractive techniques

In this section we briefly present the most recent and relevant works on EATS.

2.4.1 Common features

We first list some of the most common features used in many EATS methods to rank sentences or words.

• Keyword feature. We assign each word a score based on TF-IDF. Words with a high score are considered keywords and regarded as important. We then use the cumulative score of a sentence to determine its importance.

• Sentence location. A higher score is assigned to sentences closer to the beginning or the end of the document, as that is where most information usually resides.

• Sentence length. Sentence length plays an important role in identifying key sentences: shorter texts do not hold many concepts, and very long sentences take up too much space.

• Title similarity feature. The title often conveys the general topic of the text, thus if a sentence is similar to the title it might be useful for the summary.

• Sentence centrality. Based on sentence similarity of some sort (TF-IDF cosine, sentence embeddings, WordNet), we assign a higher score to a sentence that is close to a large number of other sentences. A sentence is thus considered more important if many sentences are similar to it, meaning that the concept it expresses is relevant.

• Numerical data. In some applications it is important to include numerical data in the summary. This feature assigns a higher score to sentences containing numbers.

Other, less common features have been used, including:

• Proper nouns (Naik and Gaonkar, 2017). If the sentence contains named entities, like the name of a person or an institution, then it might be important.


• Sentence reference index (Gupta and Siddiqui, 2012). Gives more importance to sentences that precede a pronominal reference, as they may contain the information that is subsequently referred to.

• Paragraph frequency (John and Wilscy, 2013). This feature tries to recognize paragraph relevance using the frequency of the words that appear in it.

2.4.2 Graph methods

In this class of unsupervised methods, sentences are arranged in a fully connected graph where each sentence is a node and edges carry similarity scores between the connected sentences. In the multi-document case we can use TF-IDF cosine similarity, i.e., the similarity between the word vectors obtained using the TF-IDF score: we can represent a word as a vector with one element for every document in the collection, where each element is the TF-IDF score of the word in that document.

Once the graph is built, the best sentences are selected using the PageRank algorithm or scored with a different method. An approach similar to the one described is adopted in Mohamed and Oussalah, 2016 and Salton et al., 1997.
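A minimal sketch of this family of methods, assuming the networkx and scikit-learn libraries (a generic TextRank-style illustration, not the exact systems cited above): sentences become nodes, TF-IDF cosine similarities become edge weights, and PageRank ranks the nodes.

import itertools
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["Machines take me by surprise with great frequency.",
             "Machines often take me by surprise.",
             "Science is a differential equation."]
# Pairwise TF-IDF cosine similarity matrix between sentences.
sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))

g = nx.Graph()
for i, j in itertools.combinations(range(len(sentences)), 2):
    if sim[i, j] > 0:
        g.add_edge(i, j, weight=sim[i, j])
scores = nx.pagerank(g, weight="weight")
best = sorted(scores, key=scores.get, reverse=True)
print([sentences[i] for i in best[:2]])  # the two most central sentences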

In Han et al., 2016 the document is arranged as an undirected weighted graph. Graph nodes are sentences represented as bags of frames, while arcs represent semantic similarity between pairs of sentences. Using semantic frames allows the system to take into account both sentence-level semantic information and word order. Semantic similarity between sentences is computed as the maximum similarity score between pairs of frames that fit the sentences. Frame similarity is the sum of the similarities between the sets of terms in all the roles of the frames.

2.4.3 Clustering methods

Conceptually similar to graph methods, this class of methods tries to reduce redundancy by identifying sentence clusters, i.e., groups of sentences that are similar according to some similarity measure.

Zhang and Li, 2009 compute clusters using the K-means algorithm, with a sentence similarity measure based on the weighted sum of three features:

• Word form (with weight 0.2): the number of co-occurring words in the sentences.

• Word order (with weight 0.1): describes the sequence similarity between two sentences.

• Semantic similarity (with weight 0.7): based on HowNet (the Chinese counterpart of WordNet).


In Gupta and Siddiqui, 2012 multi-document summarization is performed by clustering the sentences resulting from the summaries of each document. First, for each document, a summary is computed by ranking sentences through the use of statistical features. Then all the sentences of the resulting summaries are clustered using syntactic and semantic similarity measures: the former relies on the positions of similar words in the sentences, while semantic similarity is computed using WordNet to derive relations between the words of the sentences.

2.4.4 Fuzzy Logic methods

We use Fuzzy Logic to define a set of rules that help in marking the sentences as important or not important.

First of all, we extract a number of features for each sentence; then a fuzzifier takes these features as input and assigns them fuzzy importance values (e.g., Low, Median, High). Using a set of rules, we deem sentences important or not important. Example (from Suanmali, Salim, and Binwahlan, 2009):

IF (NoWordInTitle is VH) and (SentenceLength is H) and (TermFreq is VH) and (SentencePosition is H) and (SentenceSimilarity is VH) and (NoProperNoun is H) and (NoThematicWord is VH) and (NumericalData is H) THEN (Sentence is important)

Finally a Defuzzifier is used to assign to each sentence a numerical score that will be used to produce the summary.

In Yadav and Meena, 2016 the summary is generated by taking the common sentences of three different summaries:

• A summary created using fuzzy rules to select the best sentences, where sentences are represented as feature vectors (title similarity, sentence length, sentence location, numerical data, sentence centrality).

• A summary built with the Bushy path method (Salton et al., 1997), selecting the most connected sentences in a similarity graph.

• A summary obtained by ranking sentences based on the sum of the scores of the terms that compose them, where the term score is computed using WordNet.

2.4.5 Neural Networks methods

Several EATS algorithms using Neural Networks (NNs) have been developed; we present four of them. Although they all use NNs, they are very unlike each other.

In Kaikhah, 2004 a three-layered feed-forward NN is trained to learn the types of sentences that should be included in the summary. This is done using a dataset where each sentence in the documents is marked by a human reader as a summary sentence or not.


Fattah and Ren, 2008 instead adopt a Probabilistic NN. A set of sentence features (among which: sentence position, sentence centrality, title similarity) is extracted from the documents and used to train the model. The NN learns the probability distribution for a feature vector to be a summary sentence; therefore the model learns to label each sentence as part of the summary or not, based on its features.

In Tsai et al., 2016 two CNNs are employed. Words within a document are represented as distributed vectors using word embedding techniques. The vectors are joined together to form a matrix, which is the input of the first CNN. The sentence matrix of the constituent sentences is likewise obtained and fed into the second CNN. The output of the CNNs is the input of a Multilayer Perceptron (MLP) that induces a ranking over the sentences.

In Sinha, Yadav, and Gahlot, 2018 the document is fed to a NN with one input layer, one hidden layer, and one output layer. Each sentence in the document is previously converted into a vector with the Fasttext library (Joulin et al., 2017) using word embeddings (obtained with word2vec, Mikolov et al., 2013). The output layer provides a ranking over the sentences.

2.4.6 Ontology based methods

In this class of methods we make use of ontological knowledge to determine sentence importance.

In Hennig, Umbrath, and Wetzker, 2008 sentences are represented as sub-trees of an ontology. Each node of the ontology is populated by a bag of words obtained from web searches. Once the sentences have been mapped to sub-trees of the ontology, a set of features is extracted from these sub-trees. The features are then used with a trained Support Vector Machine (SVM) to classify sentences for the extractive summary.

In Ramezani and Feizi-Derakhshi, 2015, instead, sentences are ranked using centrality scores deduced from the ontology.

2.4.7 Recent advancements and state of the art

Wan, 2010 presents a state-of-the-art algorithm for both single- and multi-document EATS. The algorithm is based on the interdependent concepts of local and global saliency, that is, how important a sentence is in its document and in the entire document set.

Sentence saliency is computed using cosine similarity between the sentence itself and the sentences in the whole document set. This score is then adjusted with a term that takes into account the position of the sentence in its document.

A modern approach using Reinforcement Learning is proposed in Narayan, Cohen, and Lapata, 2018. A CNN sentence encoder is used to encode sentences, which are fed to an LSTM document encoder that transforms documents into a continuous representation. Another LSTM is adopted as a sentence extractor that labels each sentence representation as a summary or non-summary sentence. The network is trained in a reinforcement learning framework that directly optimizes the final evaluation metric.

Finally, we present the work of Durrett, Berg-Kirkpatrick, and Klein, 2016, a discriminative model for single-document summarization combining compression and anaphoricity constraints. Basically, they select sentences based on a set of weighted features and then delete content while respecting rules imposed by anaphora. Thus, they try to preserve coherence while shortening the text. To make sure that the selected sentences do not contain any pronoun without a reference, the system constrains the sentence selection based on the co-reference relations between sentences.

They developed both a sentence-extractive and a compressive model. The first performs extractive summarization, retaining the best sentences. The second utilizes a combination of discourse and syntactic compression to shorten the original sentences, which categorizes the method as abstractive (it is conceptually similar to the one we adopted).

2.5 Abstractive techniques

We now explore AATS methods, leaving PAS methods to the next, dedicated section. Most of the methods presented in this section are meant to produce a short abstract, one or two sentences long.

2.5.1 Word graph methods

The methods in this class arrange the text into a graph where each node is a word. Each node is connected to the others by a similarity relation of some sort. The resulting summary is obtained as a path in the graph.

Ganesan, Zhai, and Han, 2010 build the Opinosis graph, a graph that represents word units linked by directed edges carrying information about the position of each word. The abstractive summarization problem is cast as one of finding appropriate paths in the graph.

The graph is built starting from a set of sentences relevant to a specific topic. Each sentence is divided into word units, i.e., the word plus its POS annotation. Each unique word unit forms a node; therefore it is necessary to keep track of all the sentences a word belongs to.

The nodes are then linked with directed edges representing the order of the words in each sentence. This method naturally captures highly redundant discussions in subgraphs.

The abstract is generated by repeatedly searching the Opinosis graph for appropriate subgraphs that both encode a valid sentence and have high redundancy scores. A path's redundancy score is the number of overlapping


sentences covered by the path. The sentences encoded by the selected subgraphs then form an abstractive summary. A similar approach is adopted in Le and Le, 2013.

In Moawad and Aref, 2012 a Rich Semantic Graph is extracted from the source document. The semantic graph is then pruned to form a reduced graph, from which the abstract is composed.

Syntactic and morphological tags are applied to each word. For each sentence, the domain ontology is accessed to instantiate, interconnect, and validate the sentence concepts, building rich semantic sub-graphs. Finally, the rich semantic sub-graphs are merged together to represent the whole document semantically, creating the Rich Semantic Graph.

A set of heuristic rules is applied to the generated graph to reduce it by merging, deleting, or consolidating graph nodes. These rules exploit WordNet semantic relations: hypernymy, holonymy, and entailment. The system then queries the domain ontology to generate sentences corresponding to the concepts in the Rich Semantic Graph. Besides, the WordNet ontology is accessed to generate multiple texts according to word synonyms. The generated texts are evaluated and ranked according to two criteria: the most frequently used words and the discourse sentence relations.

2.5.2 Seq-to-Seq methods

Many of the methods we will show employ the LSTM Encoder-Decoder architecture, which we will further discuss in Section 3.3.

Summarization is modeled as a sequence-to-sequence problem, where the input text is regarded as a sequence that is transformed into a shorter sequence: the summary.

Rush, Chopra, and Weston, 2015 employ a probabilistic formulation of the abstractive summarization problem. Define the set $\mathcal{Y} \subset (\{0,1\}^V, \ldots, \{0,1\}^V)$ of all possible sentences of length $N$, where $\{0,1\}^V$ represents a word in the vocabulary of size $V$ (one-hot representation), and let $\mathcal{X}$ indicate the set of possible input words. Defining the scoring function $s : \mathcal{Y} \times \mathcal{X} \mapsto \mathbb{R}$, an abstractive system tries to find the optimal sequence in the set $\mathcal{Y}$:

$$
\operatorname*{argmax}_{y \in \mathcal{Y}} \; s(x, y) \tag{2.5}
$$

The system models the log probability of a summary, $\log p(y \mid x; \theta)$. The distribution is a conditional language model based on the input sentence $x$; a NN Language Model is adopted.

Unlike a standard language model, however, this system employs an encoder that outputs a vector representing the input and the current context. Finally, a beam search decoder produces the summary from the intermediate representation.

Nallapati, Xiang, and Zhou, 2016 extended the model above by introducing the attention mechanism (originally proposed for machine translation by Bahdanau, Cho, and Bengio, 2014). Zeng et al., 2016 further improved the system, proposing a mechanism that dynamically copies words from the input sequence while decoding, to tackle the problem of the small, fixed dimension of the decoder vocabulary. In Suzuki and Nagata, 2017, the redundant sentence generation problem is mitigated by a Word Frequency Estimation sub-model, which explicitly manages how many times each word has been generated so far and might be generated in the future during the decoding process. In Liu et al., 2018 a similar algorithm is enriched with Diverse Beam Search in order to select summaries that are less extractive (a common problem of these methods).

State-of-the-art abstractive summarization is achieved by Celikyilmaz et al., 2018. The presented method employs deep communicating agents in an encoder-decoder architecture. Various collaborating agents perform the task of encoding the text; the encoders are connected to a single decoder that generates the summary.

The text is divided into paragraphs, each of which is assigned to an agent. Each agent is made of a local encoder and a contextual encoder; the latter allows communication between agents to absorb information about the context.

2.6 Abstractive PAS methods

We will discuss the Predicate Argument Structure (PAS) theory in Chapter 3. For now, we consider a PAS to be a semantic unit associated with the verb, subject, object, and various other arguments of a sentence. We can extract PAS's via SRL.

In Khan, Salim, and Kumar, 2015 PAS's are extracted from the input text and arranged in a semantic graph. Each node of the graph corresponds to a PAS, while the edges represent the semantic similarity between them. Semantic similarity weights combine PAS-to-PAS semantic similarity and the PAS-to-document-set relationship. PAS-to-PAS similarity is obtained using WordNet relations between predicates and main arguments, combined with the edit distance between the PAS's. The PAS-to-document-set relationship is represented by different features, weighted and optimized by a genetic algorithm.

The salient graph nodes are ranked with a method similar to PageRank. The ranking algorithm may assign equal scores to PAS's representing the same concept, so the final summary may contain redundant information. In order to reduce redundancy, clustering is applied to re-rank the selected PAS's. The top-scoring PAS's are chosen based on the summary compression rate and are fed to a language generation phase (using SimpleNLG, Gatt and Reiter, 2009).

In Khan, Salim, and Jaya Kumar, 2015 an Agglomerative Hierarchical Clustering (HAC) algorithm based on the average linkage method is employed to cluster semantically similar PAS's. Alshaina, John, and Nath, 2017 adopt a hybrid approach of K-means and HAC to cluster similar PAS's: K-means is selected due to its run-time efficiency, while HAC is used due to its quality. Twelve features (including location, length, headline words) are extracted from each PAS. PAS's are ranked according to optimized feature weights; thus feature weights play a central role in the algorithm, and a genetic algorithm is adopted for optimal feature weighting. Then high-scoring PAS's are collected, and the summary is generated using SimpleNLG.


3 TECHNIQUES AND TOOLS

In this chapter we provide an overview of the tools and technologies we will employ in our system. First, we will introduce tools regarding NLP, then we will briefly outline the concept of Deep Learning (DL) and sentence embeddings, as they will be later employed in our work.

3.1 Words and Syntax

A fundamental part of any application regarding human language is the analysis of each individual word in the text. Many important tools have been developed to handle this task.

3.1.1 Tokenizer

A tokenizer demarcates the text, producing tokens that can be either individual words or sentences. Tokenizing is a non-trivial, basic step of almost every NLP task.

In our work we employed the Punkt Sentence Tokenizer (Kiss and Strunk, 2006) to split the text into sentences. Sentence tokenization is not as easy as it might seem. Indeed, the period, usually used as a sentence boundary marker, may have many other uses in a sentence (for example, marking abbreviations; see Example 3.1). Thus, this task has to be regarded as an instance of ambiguity resolution. Moreover, an error in the tokenization phase will be propagated along all the following sentences.

The Punkt Sentence Tokenizer is a language-independent sentence boundary detection method; its accuracy, tested on corpora in 11 languages, reaches 98.74%. The system first detects possible abbreviations in the text, based on the fact that they are rather short truncated words, followed by, and possibly interspersed with, periods. Punkt adopts a two-stage method: the first stage produces an intermediate annotation of the text, identifying abbreviations and ordinary words; the second, token-based stage employs heuristics to refine the annotated text. The heuristics are based on orthography, collocation, and frequent sentence starters.
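A pre-trained Punkt model ships with the NLTK library; a minimal usage sketch, assuming NLTK and its downloaded Punkt data (the library choice is ours for illustration):

import nltk
nltk.download("punkt", quiet=True)  # fetch the pre-trained Punkt models

text = ('Many attribute the title "father of A.I." to A. Turing. '
        'He was an English mathematician and computer scientist.')
# Punkt decides which periods actually end a sentence.
print(nltk.sent_tokenize(text))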


Many attribute the title "father of A.I." to A. Turing. He was an English mathematician and computer scientist.

Example 3.1: Period ambiguity. This sentence contains an acronym and an abbreviation; both use the period as something other than a boundary marker, making the process of demarcating sentences harder.

Machines/NNS take/VBP me/PRP by/IN surprise/NN with/IN great/JJ frequency/NN ./.

Example 3.2: POS tags of a quote by A. Turing


3.1.2 Part of Speech (POS) tagger

A Part of Speech (POS) is a category of words that share similar grammatical properties. POS's are syntactic roles like noun, verb, adjective (Example 3.2). A POS tagger associates each word with its POS. Common POS tagging techniques include rule-based methods, Markov models, and transformation-based learning. POS tags are usually taken from appropriate lists; among the many POS lists, one of the most used is the Penn Treebank POS tag list (Table 3.1).

Our system uses the POS annotations of Semantic/syntactic Extraction using a Neural Network Architecture (SENNA) (Collobert et al., 2011). SENNA employs NNs to annotate the text with various annotations besides POS tags: SRL, chunking, syntactic parsing, and named entity recognition.

Each word is fed to the model as an index into a dictionary; the first layer of the network then maps the indices to feature vectors. A different word representation is used for different tasks. Moreover, two methods are used: a word-level (window) approach and a sentence-level (convolutional) approach. The first approach assumes that a word's tags depend on the neighboring words (Figure 3.1). Thus words are grouped in a sliding window and passed to the NN to predict the tags. The SENNA POS tagger reaches an accuracy of 97.29%.

Machines take me by surprise with great frequency

Figure 3.1: POS tagging sliding window. Each word is tagged alongside its neighbouring words, to take the context into account.
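Our system relies on SENNA for POS tags; as a readily reproducible stand-in, NLTK's default tagger emits the same Penn Treebank tag set. A minimal sketch (the use of NLTK here is our illustration, not part of the system; resource names may vary with NLTK version):

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Machines take me by surprise with great frequency.")
# Expected tags as in Example 3.2, e.g. ('Machines', 'NNS'), ('take', 'VBP'), ...
print(nltk.pos_tag(tokens))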


Tag  Description                              Example
CC   coordinating conjunction                 and
DT   determiner                               the
NN   noun, singular or mass                   table
NNS  noun, plural                             tables
VB   verb be, base form                       be
VBP  verb be, sing. present, non-3rd          am, are
PRP  personal pronoun                         me
IN   preposition, subordinating conjunction   in, of, like
JJ   adjective                                green
MD   modal                                    could, will

Table 3.1: Most common POS tags from the Penn Treebank POS list

Oscillators → Oscillator → Oscillate → Oscill → Oscil

Figure 3.2: Stemming. A series of rules is applied in sequence to obtain the stem from the original word. In this case the stem is not a word.

3.1.3 Stemmer

Stemming is the process of reducing inflected words to their word stems, i.e., the word without any affixes. This is particularly useful in ATS, when we need to analyze term frequency. Stemming methods range from simply searching the word in lookup tables to more complex methods: production techniques, suffix stripping, stochastic algorithms.

We adopted the Porter Stemmer (Porter, 1980). It is probably the most employed stemming algorithm: a suffix stripping method, very simple but effective. Each word is processed by applying a large set of rules, one after another. The result is the stem, which might not be a word, as in Figure 3.2 (if we had wanted words we could have used a lemmatizer).


[Machines]ARG0 [take]V [me]ARG1 [by surprise]MANNER [with great frequency]MANNER

Example 3.3: SRL of a quote by A. Turing. ARG0 indicates the subject while ARG1 is the object of the verb "take". There are two MANNER modifiers, which specify "how" the verb affects the object.

3.2 Semantics

Understanding the meaning of written text is a necessary step for many NLP tasks. Since natural language is ambiguous, analyzing its semantics can be very hard and, for the same reason, is of central importance.

In this work we will consider the meaning structure of language, that is, form-meaning associations, word-order regularities, and the other means by which human language conveys meaning.

3.2.1 Predicate Argument Structure (PAS)

Human language is arranged in predicate-argument form, asserting the dependencies that hold among the various phrases and words in a sentence. From the meanings of several parts of the text, this structure composes a single meaning representation. Verbs pose many constraints on their arguments: number, grammatical category, location of the phrases. Thus, the structure affects both the syntax and the meaning of the sentence, forcing the words to follow particular rules (selectional restrictions). The core of the PAS is the predicate itself: it links the various arguments and assigns them semantic roles, defining the way in which each specific argument interacts with the remaining arguments through the predicate (Jurafsky and Martin, 2009, Chapter 17). Therefore we are interested in performing SRL to identify the semantic roles in a sentence. An example of SRL is shown in Example 3.3.

We used SENNA SRL annotations. We already talked about the general architecture of SENNA above. To perform SRL, the convolutional approach is employed, as the window approach fails in this case; the problem is solved using a Time-Delayed NN. The SENNA SRL annotator has an F1 score of 75.49%.

3.2.2 Anaphora resolution

Anaphora resolution is another challenging NLP task. In linguistics, anaphora (more generally, coreference) is the use of an expression whose interpretation depends on another expression: its antecedent or postcedent. Resolving an anaphora means assigning to each referring expression its referent (Figure 3.3). This operation is fundamental to many high-level NLP applications that require natural language understanding (such as ATS).


A computer would deserve to be called intelligent if it could deceive a human into believing that it was human

Figure 3.3: Anaphora resolution of a quote by A. Turing. In this sentence, both pronouns "it" refer to the word "computer". We can resolve this anaphora by substituting both of them with "computer". (Results obtained using CoreNLP coreference annotations.)

In our work we adopted Stanford CoreNLP (Manning et al., 2014). Like SENNA, the tool performs various tasks, such as POS tagging, dependency parsing, and sentiment analysis. We employed it to perform anaphora resolution, using the coreference annotations, by simply substituting each pronoun with the noun it refers to.

CoreNLP implements the algorithm described in Lee et al., 2013. This approach takes inspiration from both supervised and unsupervised models, as well as deterministic rule-based systems. The algorithm is composed of two stages. The first is the mention detection stage, where nominal and pronominal mentions are identified. The second stage is the coreference resolution: it is composed of ten independent coreference models, which are applied in sequence from the highest to the lowest precision. The CoreNLP coreference annotator has an F1 score of 60.0%.

3.3 Deep Learning

Deep Learning (DL) is a class of Machine Learning based on learning data representations. It addresses the problem of representation learning by introducing representations that are expressed in terms of other, simpler ones. DL allows the computer to learn from experience and to represent a problem as a complex hierarchy of concepts built from relations between simpler concepts. This approach shifts the problem from the formal specification of the knowledge to collecting enough data for the computer to learn the representation (Goodfellow, Bengio, and Courville, 2016, Chapter 1).

DL uses NNs with multiple hidden layers (i.e., layers between the input and the output layer) that allow the network to capture non-linear behaviors. Each layer of the representation can be seen as the state of a computer that is trying to solve a problem, so networks with more layers can execute more instructions. For this reason, ideally, a DNN with infinitely many layers is able to approximate any function. In more general terms, non-linear behaviors are learned thanks to the depth of the DNN. The most famous DNN architectures are the CNN and the RNN.

3.3.1 Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a feed-forward artificial NN that exploits spatial dependencies in the input. It is mostly used to analyze images, as pixels depend on their neighboring pixels; in general, CNNs are employed to process data that has a known, grid-like topology.


Figure 3.4: A convolutional layer (n filters of size k × k × 3 applied to 3 input feature maps, e.g. RGB, producing n feature maps)

The name of these networks derives from the convolution operation they employ. CNNs are simply networks that use a convolutional layer in place of at least one of their layers (Figure 3.4). Roughly speaking, the convolution operation gives more importance to certain values and less to others, usually stressing the neighboring values.

3.3.2 Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs), on the other hand, exploit temporal dependencies; therefore they are used to analyze sequences, like text. An RNN can process sequences much longer than what a standard NN could; moreover, it allows the use of sequences of variable length. RNNs are based on the idea of sharing parameters across the model. This particular feature allows the network to transfer information from one point in the sequence to another, for example when a word is repeated throughout the sentences of a document.

Parameter sharing occurs in a recurrent way: each output is produced using the same update rule that processed the previous output, which in turn is used as input for the update rule (Figure 3.5). We are particularly interested in RNNs, as they can be used to learn dependencies in text. A specific kind of RNN, the LSTM, is presented in the next section.


Figure 3.5: RNN unfolding. We can unfold an RNN into a computational graph that has a repetitive structure (Goodfellow, Bengio, and Courville, 2016, pp. 375–378).

3.3.3 Long short-term memory (LSTM)

LSTMs solve the problem of long-term dependencies. Gradients propagated over many stages tend to either vanish or explode; this problem is solved by introducing self-loops that produce paths where the gradient can flow for long durations (Figure 3.6).

An LSTM can accumulate information to be passed to the following steps, and forget this information after it has been used. The network, through the forget gate, is able to learn when to keep or flush some information. These networks are composed of so-called LSTM cells, which include an internal recurrence in addition to the outer one of the RNN.

LSTMs are particularly useful and efficient in NLP as they capture the relations between words in the text. One of the most notable applications is language translation (Figure 3.7). Using an LSTM Encoder-Decoder architecture, the input text is encoded into an intermediate representation and then decoded into the target language. The same architecture can be used to perform ATS, as we discussed in Chapter 2.

Finally, Bidirectional Long short-term memory (BLSTM) networks extend LSTM functionality by considering dependencies in sequences in both directions. This feature is crucial when considering written text; indeed, a word may be linked both with words preceding it and with words following it.
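As a minimal sketch, a BLSTM layer in Keras reading variable-length sequences of 128-dimensional vectors and producing a single score per sequence (all dimensions are illustrative only, not the architecture of Chapter 4):

import tensorflow as tf

# A bidirectional LSTM over sequences of 128-d embeddings; the Dense
# head maps the final state to one score, e.g. a sentence importance.
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64),
                                  input_shape=(None, 128)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
model.summary()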


Figure 3.6: An LSTM cell. The cell state c_t is regulated by a forget gate f_t, an input gate i_t, and an output gate o_t, all driven by the input x_t; h_t is the cell output.

Figure 3.7: Seq-to-Seq model applied to machine translation ("the blue house <eos>" is encoded and decoded into "la casa blu <eos>")


3.4 Embedding vectors

In many NLP tasks we may need to represent the meaning of text or words as numbers. An embedding vector is a high-dimensional vector of real numbers that tries to capture the semantic and syntactic information of words or sentences. The main purpose of embedding vectors is to derive relations between the natural language elements they represent by looking at the relations between their vectorial representations.

Sentence and word embeddings are a way to transfer learning from one NLP task to another. For particular tasks (like ATS) annotating data is quite expensive, so we want to learn from easier-to-get data and transfer the knowledge to perform related tasks.

3.4.1 Word Embeddings

Word embeddings incorporate information about words and the relations between them: words with close meanings result in close word embeddings. Some of the older approaches to word embeddings employed Latent Semantic Analysis or word co-occurrence matrices. The latest and best-performing techniques rely on NNs.

One of the most famous word embedding applications is word2vec (Mikolov et al., 2013). The model learns the data representation by maximizing the probability of that representation given the context. The model has been trained on more than 30 billion words; the ability to learn from unstructured data makes it easy to find appropriate corpora.

3.4.2 Sentence Embeddings

While word embeddings have been explored and used for quite some time, sentence embeddings are a newer tool that presents many possibilities in NLP. Sentence embeddings can be used to compute the similarity between two sentences: ideally, the closer the meanings of two sentences, the smaller the distance between their embeddings (Figure 3.8). This allows us, in the scope of ATS, to derive relevant features, like sentence centrality and many other features based on some concept of distance.

We employed nnlm-en-dim128, a pre-trained sentence embedder available as a TensorFlow Hub module. The embedder maps sentences into 128-dimensional embedding vectors. It is based on a NN Language Model (Bengio et al., 2003) with three hidden layers. The model has been trained on the English Google News 200B corpus.
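A minimal usage sketch through the TF2 hub.load interface (the thesis-era code would have used the older TF1 module API, but the published nnlm-en-dim128 model is the same):

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
vectors = embed(["Machines take me by surprise with great frequency",
                 "Machines often take me by surprise"])
print(vectors.shape)  # (2, 128): one 128-d embedding per sentence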


"Machines take me by surprise with great frequency" "Machines often take me by surprise"

"Science is a differential equation, religion is a boundary condition"

Figure 3.8: Sentence Embeddings. Sentences with similar meaning or structure have close embeddings.


4 PROPOSED METHOD

In this chapter we discuss our approach to EATS and AATS. After a general overview of the abstractive algorithm architecture, we explore the steps of the algorithm in detail, explaining the various phases that transform the input text into a summary. Then we present the adaptation of this algorithm to the extractive case. Finally, we discuss our design decisions regarding the training of our models.

4.1 Abstractive algorithm overview

The main idea behind this work is to perform AATS by extracting sub-sentences from the input text and selecting the best of them using a DNN.

Using SRL, we extract the PASs of the sentences and select subparts of them. The resulting summary will not only contain the best sentences, but also reduce their length by ignoring some parts of the longest ones. A PAS is composed of one verb and many arguments, thus for almost every verb in the sentence a PAS is extracted. Some verbs, however, may be associated with a redundant subordinate clause. The aim of our system is to avoid those clauses and select only the most significant ones.

First off, the input text undergoes a preprocessing phase. Then we perform SRL and POS tagging on the cleaned and tokenized text to produce sub-sentences. At the same time, with the unique terms, we obtain stems and compute TF-IDF scores. These, together with the extracted sub-sentences, are used to extract six features. The features, joined with the sentence embeddings, form the vectorial representation of our sub-sentences.

We then feed the vectors to our trained DNN model, which predicts a score for each of them. The ranking is used to select the best sub-sentences. Finally, we arrange the selected ones following the order of extraction to form the summary.

In the following we present the detailed steps for the abstractive algorithm.


Figure 4.1: Pipeline of the abstractive algorithm. Preprocessing (1. text cleanup, 2. tokenization, 3. anaphora resolution) feeds the Semantic Role Labeller and POS tagger, which produce sub-sentences; the stemmer, feature extraction, and sentence embedder build the document matrix; the DNN model assigns scores used in summary composition to produce the summary.



4.2 Preprocessing and SRL

In this first phase we prepare the input to be processed by the Semantic Role Labeller. For this purpose, we normalize the text using a reduced set of symbols (for example using ["] as quotation marks, instead of ['] or ["]). We also eliminate any anomaly in the text that could interfere with the SRL step.

Consequently, we employ a tokenizer to divide the text into sentences; at this point, the sentences are ready to be labeled.
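A minimal sketch of this normalization and tokenization step; the regular expressions and the NLTK tokenizer are illustrative assumptions, not the exact rules prescribed by our system:

import re
from nltk.tokenize import sent_tokenize  # assumes NLTK with the 'punkt' model installed

def normalize(text: str) -> str:
    """Map quote variants to a reduced symbol set and drop whitespace anomalies."""
    text = re.sub(r'[“”«»]', '"', text)
    text = re.sub(r"[‘’]", "'", text)
    return re.sub(r"\s+", " ", text).strip()

raw_text = 'He said: “machines often take me by surprise”. They do.'
sentences = sent_tokenize(normalize(raw_text))  # ready for the SRL step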

We extract PASs and POS tags from the processed text. We use POS tags to reconstruct sub-sentences starting from the SRL annotations. We build each sub-sentence by putting together the arguments and the predicate contained in the extracted PAS, following the order of the original sentence. Since the predicate is sometimes stripped of its auxiliary verbs, we use the POS tags to reconstruct it.

We then populate a list of the unique stems in the sub-sentences. All these pieces of information, together with the sentence embedding of each sub-sentence, will be used in the next step.

4.3 Feature extraction

The following features are extracted from the processed sub-sentences.

4.3.1 Sentence position score

This feature favors the sub-sentences extracted from a sentence near the beginning of the document:

$$SPos(s) = \frac{|D| - i_S}{|D|}\qquad(4.1)$$

where:
s denotes a sub-sentence extracted from sentence S;
i_S is the index of sentence S in the document D;
|D| is the number of sentences in document D.

Notice that for every sub-sentence extracted from sentence S the SPos score will be the same. We adopted this method because there is not always a strong positional order among sub-sentences in the same sentence. Besides, in the writer's intention, they all occupy the same position with respect to the document.


4.3.2 Sentence length score

Sub-sentences extracted from a long sentence have a higher score. A shorter sentence not only may consist of a single sub-sentence, but it is in general less informative:

$$SLen(s) = \frac{|S|}{\max_{S_1 \in D} |S_1|}\qquad(4.2)$$

where |S| is the number of words in sentence S.

Even in this case, every sub-sentence extracted from sentence S will have the same SLen score. As for the SPos score, the cue of importance is the length of the original sentence, not of the extracted sub-sentence. Moreover, since our goal is to obtain shorter versions of the original sentence, considering sub-sentence length could yield the opposite result.
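A minimal sketch of these two scores, assuming the sentence indices and lengths have already been computed (function and variable names are illustrative):

def spos(sentence_index: int, num_sentences: int) -> float:
    """Equation (4.1): sub-sentences from earlier sentences score higher."""
    return (num_sentences - sentence_index) / num_sentences

def slen(sentence_len: int, max_sentence_len: int) -> float:
    """Equation (4.2): sub-sentences from longer sentences score higher."""
    return sentence_len / max_sentence_len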

4.3.3 TF-IDF score

This score embeds the importance of the sub-sentence's terms:

$$TF\text{-}IDF(s) = \frac{\sum_{t \in s} TF(t) \times IDF(t)}{|s|}\qquad(4.3)$$

where:
|s| is the number of words in sub-sentence s;
TF(t) is obtained by equation (2.2) and IDF(t) by equation (2.3).

Since we are performing single-document summarization, we do not have a pool of documents to compute the IDF scores with. We therefore used all the documents in our training dataset, to give an idea of general inverse frequency.
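A hedged sketch of equation (4.3), assuming the per-term TF and IDF values of equations (2.2) and (2.3) have been precomputed and are passed in as dictionaries:

def tfidf_score(sub_tokens, tf, idf):
    """Equation (4.3): average TF-IDF over the sub-sentence's terms.

    tf and idf map each term to its score; their exact definitions follow
    equations (2.2) and (2.3), with IDF computed over the training dataset."""
    if not sub_tokens:
        return 0.0
    return sum(tf.get(t, 0.0) * idf.get(t, 0.0) for t in sub_tokens) / len(sub_tokens)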

4.3.4 Numerical data score

A sub-sentence containing numerical data is considered important, because a numerical value may be a date, time, or percentage, which might be relevant information:

$$ND(s) = \frac{\text{No. of numerical values in } s}{|s|}\qquad(4.4)$$
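A small illustrative implementation of equation (4.4); treating any token containing a digit as a numerical value is an assumption of this sketch:

import re

def nd_score(sub_sentence: str) -> float:
    """Equation (4.4): fraction of tokens carrying numerical data."""
    tokens = sub_sentence.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if re.search(r"\d", t)) / len(tokens)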

4.3.5 Centrality score

This score indicates how close a sub-sentence is to all the other sub-sentences in the document. A high score means that the sub-sentence is similar to many others; thus, the writer insisted on this particular concept, which should therefore be relevant:

$$C(s) = \frac{\sum_{s_2 \in SUB} sim(s, s_2)}{\max_{s_1 \in SUB} \sum_{s_2 \in SUB \setminus \{s_1\}} sim(s_1, s_2)}\qquad(4.5)$$



Figure 4.2: Document matrix. The vectorial representation of a sub-sentence (or sentence) is formed by six features plus the sentence embedding; all the sub-sentence vectors together compose the document matrix.

where SUB is the set of all sub-sentences extracted from the document and the similarity function sim: s × s → R is the inner product of the sub-sentences' sentence embeddings.

4.3.6 Title similarity score

As said in Section 2.3, the title holds important cues about the content of the whole document. Thus, a high score is assigned to sub-sentences which are similar to the title:

$$TSim(s) = sim(s, title)\qquad(4.6)$$

where sim is the usual inner product and title is the sentence embedding of the document title.
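A sketch of equations (4.5) and (4.6) over a matrix of sentence embeddings, one row per sub-sentence; the numpy layout and names are illustrative:

import numpy as np

def centrality_scores(emb: np.ndarray) -> np.ndarray:
    """Equation (4.5): sum of inner-product similarities to all sub-sentences,
    normalized by the largest such sum with self-similarity excluded."""
    sims = emb @ emb.T                      # pairwise inner products
    totals = sims.sum(axis=1)
    denom = (totals - np.diag(sims)).max()  # max over s1 of the sum excluding s1
    return totals / denom

def title_similarity(emb: np.ndarray, title_emb: np.ndarray) -> np.ndarray:
    """Equation (4.6): inner product with the title's embedding."""
    return emb @ title_emb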

4.4 Deep model architecture

The features described above, joined with the sentence embedding, form the sub-sentence vectorial representation. We organize the document as an array of such vectors; the resulting matrix is the input of our DNN (Figure 4.2).

The NN takes as input a fixed-length matrix, so we pad the document with zero vectors until the maximum document length is reached. As stated above, we rank each sub-sentence using a DNN to predict its score. We will now describe the adopted architecture in detail.

Since the document can be seen as a sequence of sentences (or sub-sentences), here represented by vectors, we exploit the temporal dependencies between sentences, in both directions, using a BLSTM. Doing so, we capture not only the relations between one sentence and its successors, but also those with its preceding sentences. This is a relevant feature: when composing a summary, we want to take into account whether a sentence is semantically linked to another.

The input document matrix is passed through a masking layer to remove the padding. The filtered matrix is then the input of the BLSTM, which produces a score for each sentence in the document. These scores are fed to a feed-forward network with ten hidden layers, whose last layer uses a hard sigmoid activation.
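A minimal Keras sketch of this architecture; the number of LSTM units, the hidden-layer width, and the maximum document length are assumptions (the actual hyperparameter tuning is discussed in Chapter 6):

from tensorflow.keras import layers, models

MAX_DOC_LEN = 300   # assumed maximum number of sub-sentences (padding target)
VEC_DIM = 134       # 6 features + 128-dimensional sentence embedding

model = models.Sequential()
# Masking layer: skips the zero-vector padding rows.
model.add(layers.Masking(mask_value=0.0, input_shape=(MAX_DOC_LEN, VEC_DIM)))
# BLSTM over the sequence of sub-sentence vectors, in both directions.
model.add(layers.Bidirectional(layers.LSTM(100, return_sequences=True)))
# Feed-forward network with ten hidden layers, applied at each position.
for _ in range(10):
    model.add(layers.TimeDistributed(layers.Dense(64, activation="relu")))
# Hard sigmoid output: one score per (sub-)sentence.
model.add(layers.TimeDistributed(layers.Dense(1, activation="hard_sigmoid")))

model.compile(optimizer="adam", loss="mse")  # MSE against the optimal scoring (Section 4.7.1)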

4.5 Summary composition

Having the ranking over sub-sentences, we just have to select the best ones until the maximum length is reached. However, a few adjustments are required.

Firstly, we exclude sub-sentences with three or fewer words, as they bear little information. Then, since sub-sentences extracted from the same sentence may overlap, we retain only one per sentence: the longest, which is usually the most significant. This may seem in contrast with our goal of selecting a sub-part of a sentence to reduce its length, but the fact that more than one sub-sentence is extracted often means that parts of that sentence are left out of the summary anyway.

Finally, the sub-sentences are arranged following the order of the original sentences they were extracted from.
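A hedged sketch of the composition step; the data layout (sentence index, text, score) and the 100-word budget are illustrative assumptions:

def compose_summary(scored_subs, max_words=100):
    """Compose the summary from (sentence_idx, text, score) triples.

    The triple layout and the word budget are illustrative assumptions."""
    # 1. Drop sub-sentences with three or fewer words.
    subs = [s for s in scored_subs if len(s[1].split()) > 3]
    # 2. Keep only the longest sub-sentence for each original sentence.
    longest = {}
    for idx, text, score in subs:
        if idx not in longest or len(text.split()) > len(longest[idx][0].split()):
            longest[idx] = (text, score)
    # 3. Greedily select the best-scoring ones until the budget is reached.
    ranked = sorted(longest.items(), key=lambda kv: kv[1][1], reverse=True)
    picked, used = [], 0
    for idx, (text, _) in ranked:
        n = len(text.split())
        if used + n <= max_words:
            picked.append((idx, text))
            used += n
    # 4. Arrange the selection in original sentence order.
    return " ".join(text for _, text in sorted(picked))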

4.6 The extractive algorithm

We can derive an extractive algorithm from our abstractive algorithm by considering the particular case in which we extract from each sentence just one sub-sentence, coinciding with the sentence itself. Indeed all subsequent operations are performed over sentences.

In the extractive case we can thus skip the SRL and POS tagging phases, as we don't need to extract sub-sentences. After a preprocessing phase, which also includes stemming and sentence embedding derivation, we can proceed to the feature extraction phase. All the features can be restricted to the case of full sentences, and sentence embeddings can be derived in the same way as for sub-sentences.

Thus, we can produce the vectorial representation of each sentence and form the document matrix. This is fed to an adequately trained DNN to produce a score for each sentence. Finally, as in the abstractive case, the best-ranked sentences are selected to compose the summary.



[1 2 3 4 5 6 7 8 9 10]
[101 102 103 104 105 106 107 108 109 110]

Example 4.1: Equivalent scorings. In this example, two scorings produce the exact same summary but they are far apart in terms of MSE.

4.7 Training the model

4.7.1 Loss function

As we will further discuss in Chapter 6, ROUGE is the most utilized evaluation metric for summaries. To evaluate a summary it requires a reference summary, usually written by a human annotator. Thus the best option for a loss function¹ would be −ROUGE(summary, reference). Unfortunately, in our case we cannot adopt this loss function: it would require selecting the best sentences (or sub-sentences) from the text and comparing them to the reference summary, but this operation is not differentiable (a necessary condition for a loss function when using gradient descent).

If we assign a score to each sentence then, to select the best sentences, it suffices to retain the best-scoring ones. Given a document and a reference abstractive summary, we can derive a scoring over the document sentences. This scoring allows us to select the sentences forming the summary that resembles the reference summary the most. So, now, we have an optimal scoring that our model should learn.

The model will produce a scoring over the document sentences, and the loss will simply be the Mean Squared Error (MSE)² between this scoring and the optimal one. In this way, in an ideal scenario of 100% accuracy, we would get the exact optimal scoring, and what we take to be the optimal summary.

4.7.2 Accuracy problems

Unfortunately, as we will show in Chapter 6, it is hard to achieve high accuracy using this loss function. However, we noticed that, despite having very low accuracy (around 20%), the models trained with this technique performed well in terms of ROUGE score: they produced fairly good summaries. In our opinion, the reason behind this peculiar behavior is that what we are asking the model to learn is, in a certain sense, "too much".

What we are looking for is a ranking over the sentences that maintains the same order as the optimal ranking and is close in terms of each individual score's magnitude. The second constraint is of course superfluous, as we just need the ordering with respect to the optimal one to be preserved.

1 A function that quantifies how wrong our model is during the training phase. Its value is used to update the weights of the NN at each training step.
2 Given two vectors, the Mean Squared Error (MSE) is the mean of the squares of the element-wise differences (errors).
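A small numeric check of Example 4.1's point: the two scorings select the same top sentences even though the MSE between them is large:

import numpy as np

optimal = np.arange(1.0, 11.0)      # [1 .. 10]
predicted = optimal + 100.0         # [101 .. 110]

mse = np.mean((predicted - optimal) ** 2)   # 10000.0: a huge error...
same_top3 = np.array_equal(np.argsort(optimal)[-3:], np.argsort(predicted)[-3:])
print(mse, same_top3)               # ...yet the same top-3 sentences are selected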
