
Department of Computer Science
Master Degree in Computer Science

Headline Generation and Analysis of Writing Styles in Journalism

Supervisors:

Prof. Davide Bacciu

Prof. Malvina Nissim

Candidate:

Michele Cafagna

Fall Session


Contents

1 Introduction

2 Background and Related Work
  2.1 Text Generation
    2.1.1 Deep Learning Methods
    2.1.2 Encoder-Decoder Architectures
    2.1.3 Headline Generation as Summarisation
    2.1.4 Stylistic Control
  2.2 Models
    2.2.1 Sequence to Sequence
    2.2.2 Pointer Generator Network
    2.2.3 Attentive Autoencoder
  2.3 Evaluation
    2.3.1 Intrinsic Methods
    2.3.2 Extrinsic Methods
    2.3.3 Relationship Between Evaluation Methods

3 Harvesting Corpora of News
  3.1 Scraping Data
  3.2 Data Understanding
  3.3 Newspapers Classification
  3.4 News Topic Alignment

4 Headline Generation as Summarisation
  4.1 Models
  4.2 Experiments
    4.2.1 Settings
    4.2.2 Analysis
  4.3 Human Evaluation
    4.3.1 Settings
    4.3.2 Analysis
    4.3.3 Agreement
    4.3.4 Discussion
  4.4 Classification as Evaluation
    4.4.1 Approach
    4.4.2 Data
    4.4.3 Settings
    4.4.4 Analysis
    4.4.5 Discussion

5 Style Transfer Analysis
  5.1 Approaches
    5.1.1 Style as Latent Representation
    5.1.2 Style as Translation
    5.1.3 Results and Discussion
  5.2 Embedding Analysis
    5.2.1 Data
    5.2.2 Embeddings and Measures
    5.2.3 Analysis
    5.2.4 Discussion

Abstract

The thesis deals with generative language models in the journalism domain, focusing on headline generation and providing an analysis of writing styles in Italian newspapers. The work covers the whole natural language processing pipeline, from data collection to human evaluation; it proposes a method to study stylistic aspects by looking at embedding shifts, and it investigates stylistic transfer, experimenting with various approaches. We show that news sources are distinguishable by machines, and that machines are also able to generate headlines according to a specific newspaper's style. However, through human evaluation we also observe that the attractive component is not captured by the models. In parallel, we use an automatic evaluation strategy based on classification, with which we show that machines are able to pick up stylistic features from corpora. We study a technique to detect stylistic shifts among corpora and use it to analyse style, correlating word shifts with their usage in the corpus. Lastly, we try to model satirical style for transfer purposes with an alternative cross-lingual approach, concluding that this aspect needs further investigation and deeper analysis.


Chapter 1

Introduction

Text summarization is the task of producing a condensed version of a text while preserving its meaning. When the original text is a document and the target text is a single sentence, the task specialises to headline generation [26]. Headline generation is useful in several scenarios, like compressing text for mobile device users [19] and generating tables of contents [30], and it also has potential for more advanced artificial intelligence applications. The task is challenging because it requires the generation of short, fluent and informative text. When approaching the headline generation task, the main focus is on aspects related to the content and to the grammatical and syntactic correctness of the generation; equally important, however, is the way that content is expressed, because different words can give different nuances to the same content by simply lexicalizing the same meaning differently, depending upon the environment. This is particularly relevant for language generation tasks such as machine translation [106] [3], caption generation [50] [114] and natural language generation [112] [51]. Recent research efforts have focused on "controlled" language generation, aiming at separating the semantic content of what is said from the stylistic dimensions of how it is said. These include approaches relying on heuristics, and neural approaches like deep generative models controlling for a particular stylistic aspect, e.g., politeness [101], sentiment, or tense [45] [103]. The latter approaches to style transfer, while more powerful and flexible than heuristic methods, have yet to show that in addition to transferring style they effectively preserve the meaning of input sentences. Journalism is a natural field in which these two aspects are tightly mixed, and they represent two key points that contribute to producing the headline of a news article. Even though research is progressing on new models and architectures, the evaluation of generation tasks still lacks a consistent and effective framework. In this work we experiment with various approaches to headline generation, focusing mainly on the evaluation. We explore various approaches to the transfer of stylistic aspects, focusing on analysing style across corpora.

Outline

The thesis starts with an initial stage of data collection of news from Italian newspapers. The data is analysed and preprocessed. We train different models over this data in various settings, experimenting with preprocessing techniques and topic-aligned data. We design an evaluation form to assess the models through human-based evaluation. We devise an automatic method to evaluate generation models based on classification, and we use it to assess the transfer of stylistic features originating from the corpora. We then focus on the stylistic aspect, first approaching the problem with neural approaches, and then analysing style in the corpora through an analysis based on embeddings, devised to exploit vector shifts in the embedding space.


Chapter 2

Background and Related Work

2.1 Text Generation

Text generation is a subfield of natural language processing. It leverages knowledge from computational linguistics and artificial intelligence to automatically generate natural language texts which satisfy certain communicative requirements. Two main aspects involved in this work are headline generation and style transfer. Headline generation is usually modelled as an extreme summarisation task, while style transfer can be addressed as a translation problem, even though newer approaches tackle the task using latent representations of the sentences of the documents.

2.1.1 Deep Learning Methods

Since the early 1990s, when interest in neural approaches waned in the natural language processing and AI (Artificial Intelligence) communities, cognitive science research has continued to explore their application to syntax and language production [29]. The recent interest in neural networks is in part due to advances in hardware that can support resource-intensive learning problems [39]. Neural networks are designed to learn representations at increasing levels of abstraction by exploiting backpropagation [64] [39]. Such representations are dense, low-dimensional, and distributed, making them well-suited to capturing grammatical and semantic generalisations [82] [73]. Neural networks have also scored notable successes in sequential modelling using feed-forward networks [11], log-bilinear models [85] and recurrent neural networks [81], including recurrent neural networks with long short-term memory units (LSTM) [43]. The latter are now the dominant type of recurrent neural network for language modelling tasks. Their main advantage over standard language models is that they handle sequences of varying lengths while avoiding both data sparseness and an explosion in the number of parameters, by controlling the error flow from every cell through gate units able to prevent perturbations coming from interactions with other units [43].

2.1.2 Encoder-Decoder Architectures

An influential architecture is the Encoder-Decoder framework [60], where a recurrent neural network is used to encode the input into a vector representation, which serves as the auxiliary input to a decoder recurrent neural network. This decoupling between encoding and decoding makes it possible in principle to share the encoding vector across multiple natural language processing tasks in a multi-task learning setting (see [25] [71] for some recent case studies). Encoder-Decoder architectures are particularly well-suited to Sequence-to-Sequence (Seq2Seq) tasks such as Machine Translation, which can be thought of as requiring the mapping of variable-length input sequences in the source language to variable-length sequences in the target (e.g., [49] [114]). It is easy to adapt this view to data-to-text natural language generation. For example, [31] adapt Seq2Seq models for generating text from abstract meaning representations. A further important development within the Encoder-Decoder paradigm is the use of attention-based mechanisms, which force the encoder, during training, to weight parts of the input encoding more when predicting certain portions of the output during decoding ([3], [45]). This mechanism obviates the need for direct input-output alignment, since attention-based models can learn input-output correspondences based on loose couplings of input representations and output texts. In natural language generation, many approaches to response generation in an interactive context (such as dialogue or social media posts) adopt this architecture. For example, Wen et al. [111] use semantically-conditioned LSTMs rather than a word-based recurrent neural network. [111] also use a reranker during decoding to rank beam search outputs, penalising those that omit relevant information or include irrelevant information. Their evaluation shows that the joint optimisation set-up is superior to a seq2seq model that generates trees for subsequent time-steps. Mei et al. and Angeli et al. ([76], [2]) use a bidirectional LSTM encoder to map input records to a hidden state, followed by an attention-based aligner which models content selection, determining which records to mention as a function of their prior probability and the likelihood of their alignment with words in the vocabulary.

2.1.3 Headline Generation as Summarisation

Previous headline generation methods can generally be categorised into extractive and abstractive methods. Extractive methods treat sentences from the original document as candidates and exploit sentence compression techniques to produce the headline. Abstractive methods may treat phrases, concepts or events as candidates, and exploit sentence synthesis techniques to generate the headline. Fully abstractive methods do not even choose candidates from the original document, but generate a sentence from scratch using natural language generation techniques based on an understanding of the document. As headlines are highly condensed, extractive methods usually suffer from generating less informative headlines; abstractive methods are therefore essentially more appropriate for the task. However, it is more difficult for abstractive methods to ensure the grammaticality of the generated headlines, due to the difficulty and immaturity of natural language generation techniques. The recent success of neural sequence-to-sequence (Seq2Seq) models provides an effective way to generate text, with impressive progress achieved on tasks like machine translation, image captioning, and sentence summarisation. Rush et al. [96] explore training a sentence summarisation model on headlines and lead (first) sentences of news articles, which shows the capacity of sequence-to-sequence models to generate informative, fluent headlines conditioned on recapitulative sentences. Rush et al. assume that the lead sentence of most news articles summarises the important information, and eliminate those articles whose first sentences do not significantly overlap with the headlines. Extensive later studies follow this assumption and concentrate on further improving the sentence summarisation model.


2.1.4 Stylistic Control

In the past few years, stylistic and especially affective natural language generation has witnessed renewed interest from researchers working on neural approaches to generation. The trends that can be observed here mirror those outlined in our general overview of deep learning approaches. Many models focus on response generation (in the context of dialogue, or social media exchanges), where the task is to generate a response given an utterance; thus, these models fit well within the Seq2Seq or Encoder-Decoder framework. Often, these models exploit social media data, especially from Twitter, a trend that goes back at least to [94], who adapted a Phrase-Based Machine Translation model to response generation. For example, [66] proposed a persona-based model in which the decoder LSTM is conditioned on embeddings obtained from tweets pertaining to individual speakers/authors. An alternative model conditions on both speaker and addressee profiles, with a view to incorporating not only the 'persona' of the generator but also its variability with respect to different interlocutors. This has the advantage of enabling the generator to be tuned to specific personality settings without re-training to adapt to a particular speaker style.

'Good' writers not only present their ideas in coherent and well-structured prose. They also succeed in keeping the attention of the reader through narrative techniques, occasionally surprising the reader, for example through creative language use such as small jokes or well-placed metaphors (e.g. [33], [86], [66]). The natural language generation techniques and applications discussed so far arguably do not simulate good writers in this sense, and as a result automatically generated texts can be perceived as somewhat boring and repetitive. Alternative strategies have been studied to control and arbitrarily reproduce style. Some of these assume a shared latent representation among corpora with different styles: by continuous refinements, they tune the latent variables to obtain distinguishable feature vectors that represent content and style. This task is typically performed by autoencoders [103]. An autoencoder is a data compression algorithm where the compression and decompression functions are learned from data; applications include data denoising [110] and constraining the latent parameters to perform sampling, as in Variational Autoencoders [55], [45]. Another approach is followed by Zhang et al. [118], who use attentive autoencoders. They extend the LSTM autoencoder [104], an implementation of an autoencoder for sequence data using an encoder-decoder architecture, with the attention mechanism [3].

2.2 Models

2.2.1 Sequence to Sequence

The sequence-to-sequence framework follows the encoder-decoder architecture and can be formalised as follows. Denote the input text $X = \{x_1, ..., x_M\}$ as a sequence of $M$ words, where each word $x_t$ is a one-hot vector of size $|V|$ from the word vocabulary $V$. The seq2seq model takes $X$ as input and generates the output $Y = \{y_1, ..., y_N\}$. The goal is to find $\hat{Y}$ that maximises the conditional probability of $Y$ given $X$, as $\hat{Y} = \arg\max_Y P(Y \mid X; \theta)$, where $\theta$ indicates the parameters the model learns. The seq2seq model processes the input and produces the output sequentially. The model processes the input sequence $X$ into a low-dimensional vector representation $h^*$ with an encoder. Then the model generates the output $Y$ word by word with a decoder, with each generated word $y_t$ conditioned on the input representation $h^*$ and the previously generated words $\{y_1, ..., y_{t-1}\}$:

$$P(Y \mid X; \theta) = \prod_{t=1}^{N} P(y_t \mid \{y_1, ..., y_{t-1}\}, h^*; \theta) \qquad (2.1)$$

Encoder. In the seq2seq model, the encoder processes the input sequence into its vector representation $h^*$. The idea of a recurrent neural network (RNN) is to perform the same task for every element of a sequence, with the output depending on the previous computations. Specifically, as the encoder, the RNN computes a hidden state $h_t = f(x_t, h_{t-1})$ for each input word $x_t$ in the input sequence $X$, where $f$ is the function of the RNN unit. Then $h^* = g(h_1, ..., h_M)$, where $g$ is a function that computes $h^*$ from the hidden states; a typical choice is to take the last hidden state, $h^* = h_M$.

Decoder. The decoder is used to generate the output sequence given the input representation $h^*$. It generates one word at every step based on the input representation and the previously generated words. An RNN is also mostly used as the decoder. Word $y_t$ is generated as:

$$y_t = \arg\max_{y'} P(y' \mid \{y_1, ..., y_{t-1}\}, h^*; \theta) \qquad (2.2)$$

In primitive decoder models, $h^*$ is the same for generating all the output words, which requires $h^*$ to be a good representation of the whole input sequence. The attention mechanism [3] was introduced to allow the decoder to pay different attention to different parts of the input when generating different words, and many successful applications show its effectiveness. The attention mechanism sets a different context vector $h^*_t$ when generating each word $y_t$, as $h^*_t = \sum_{i=1}^{M} \alpha^t_i h_i$, where $\alpha^t_i$ indicates how much the $i$-th word $x_i$ from the source input contributes to generating the $t$-th word, and is usually computed as:

$$\alpha^t_i = \frac{\exp(h_i \cdot h_{y_t})}{\sum_{j=1}^{M} \exp(h_j \cdot h_{y_t})} \qquad (2.3)$$

where $h_{y_t}$ represents the hidden state of the decoder when generating word $y_t$. The attention distribution can be viewed as a probability distribution over the source words that tells the decoder where to look to produce the next word.

2.2.2 Pointer Generator Network

The pointer-generator network presented in [100] uses an attention distribution $\alpha^t$ and a context vector $h^*_t$ like a sequence-to-sequence model with attention, but it additionally uses a generation probability $p_{gen} \in [0, 1]$ for time-step $t$, computed from the context vector $h^*_t$, the decoder state $s_t$ and the decoder input $x_t$:

$$p_{gen} = \sigma(w_{h^*}^T h^*_t + w_s^T s_t + w_x^T x_t + b_{ptr}) \qquad (2.4)$$

where the vectors $w_{h^*}$, $w_s$, $w_x$ and the scalar $b_{ptr}$ are learnable parameters and $\sigma$ is the sigmoid function.

$p_{gen}$ is used as a soft switch between generating a word from the vocabulary distribution $P_{vocab}$ and copying a word from the source via the attention distribution. Letting the extended vocabulary denote the union of the vocabulary and all words appearing in the source document (with source-only words appended to the vocabulary), we obtain the following probability distribution over the extended vocabulary:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} \alpha^t_i \qquad (2.5)$$

Note that if $w$ is an out-of-vocabulary (OOV) word, then $P_{vocab}(w)$ is zero; similarly, if $w$ does not appear in the source document, then $\sum_{i: w_i = w} \alpha^t_i$ is zero. The ability to produce OOV words is one of the primary advantages of pointer-generator models. The loss function is computed with respect to the modified probability distribution $P(w)$ given in Equation 2.5.
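A minimal sketch of Equation 2.5 follows, assuming token ids where indices at or above $|V|$ denote source-only OOV words; the function and variable names are illustrative.

import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids, extended_size):
    # Eq. 2.5: mix the vocabulary distribution with the copy
    # distribution induced by the attention over source tokens.
    p = np.zeros(extended_size)
    p[:len(p_vocab)] = p_gen * p_vocab           # generation component
    for i, w in enumerate(src_ids):              # copy component
        p[w] += (1.0 - p_gen) * attn[i]          # scatter-add attention mass
    return p

# toy usage: |V| = 6, the source contains one OOV word with extended id 6
p = final_distribution(p_gen=0.7,
                       p_vocab=np.full(6, 1 / 6),
                       attn=np.array([0.5, 0.3, 0.2]),
                       src_ids=[2, 6, 2],
                       extended_size=7)
assert np.isclose(p.sum(), 1.0)                  # still a valid distribution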

Figure 2.1: Pointer-generator model. For each decoder time-step a generation probability $p_{gen} \in [0, 1]$ is calculated, which weights the probability of generating words from the vocabulary versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which the prediction is made. Note that out-of-vocabulary article words such as "2-0" are included in the final distribution.

Coverage Mechanism. See et al. [100] introduce a coverage mechanism, employing a coverage vector $c^t$ which is the sum of the attention distributions over all previous decoder time-steps:

$$c^t = \sum_{t'=0}^{t-1} \alpha^{t'} \qquad (2.6)$$

$c^t$ is an (unnormalised) distribution over the source document words that represents the degree of coverage those words have received from the attention mechanism so far. The coverage vector is used as an extra input to the attention mechanism, giving:

$$e^t_i = v^T \tanh(W_h h_i + W_s s_t + w_c c^t_i + b_{attn}) \qquad (2.7)$$

where the attention is computed as $\alpha^t = \mathrm{softmax}(e^t)$ and $w_c$ is a learnable parameter vector of the same length as $v$. This ensures that the attention mechanism takes its previous decisions into account, making it easier for it to avoid repetitions. Lastly, an extra loss term is introduced to penalise any overlap between the coverage vector $c^t$ and the new attention distribution $\alpha^t$:

$$\mathrm{covloss}_t = \sum_i \min(\alpha^t_i, c^t_i) \qquad (2.8)$$

This discourages the network from attending to anything that has already been covered.
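The coverage bookkeeping is simple to express in code. The sketch below accumulates the coverage vector of Equation 2.6 over decoder steps and sums the per-step losses of Equation 2.8; it is an illustration of the mechanism, not the training implementation used in the experiments.

import numpy as np

def coverage_loss(attn_steps):
    # attn_steps: list of attention distributions alpha^t over source words
    coverage = np.zeros_like(attn_steps[0])         # c^0 = 0 (Eq. 2.6)
    total = 0.0
    for attn in attn_steps:
        total += np.minimum(attn, coverage).sum()   # covloss_t (Eq. 2.8)
        coverage += attn                            # c^{t+1} = c^t + alpha^t
    return total

# a decoder that keeps attending to the same position is penalised more
repetitive = [np.array([0.9, 0.1]), np.array([0.9, 0.1])]
spread = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
assert coverage_loss(repetitive) > coverage_loss(spread)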

2.2.3 Attentive Autoencoder

The sequence autoencoder [23] is similar to sequence-to-sequence learning (also known as seq2seq) [106]. It employs a recurrent network as an encoder to read an input sequence into a hidden representation; the representation is then fed to a decoder recurrent network to reconstruct the input sequence itself. The sequence autoencoder is an unsupervised learning model and a powerful tool for modelling sentence representations with large-scale unannotated data. Zhang et al. [117] present the first transduction model relying entirely on self-attention to compute representations of the input and output sequences, without using RNNs or convolution. Attentive autoencoders follow the encoder-decoder architecture shown in Figure 2.2. Each layer of the encoder has two sub-layers: the first is a self-attention mechanism and the second is a position-wise feed-forward network. In the decoder, the self-attention is masked so that the output for position $i$ can depend only on the known outputs at positions less than $i$. The authors also employ residual connections and layer normalisation around each of the sub-layers. The role of the attention mechanism is to map a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed based on the query and the corresponding key. In attentive autoencoders, there are three types of attention: encoder self-attention, encoder-decoder attention and decoder self-attention.

Figure 2.2: Schema of the attention autoencoder

The attention is computed on a set of queries simultaneously:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (2.9)$$

where $Q \in \mathbb{R}^{n_q \times d_k}$, $K \in \mathbb{R}^{n_k \times d_k}$ and $V \in \mathbb{R}^{n_k \times d_k}$ are the queries, keys and corresponding values.


The encoder transforms a sentence into a list of vectors, one vector per input symbol. Given the input embedding sequence $x = (x_1, ..., x_n)$, it produces the hidden representations $h_e = (h_{e_1}, ..., h_{e_n})$ with the following equations:

$$\alpha'_e = \mathrm{Attention}(xW_e^q, xW_e^k, xW_e^v) \qquad (2.10)$$
$$\alpha_e = \mathrm{LayerNorm}(\alpha'_e + x) \qquad (2.11)$$
$$h'_e = \mathrm{ReLU}(\alpha_e W_{e_1} + b_{e_1})W_{e_2} + b_{e_2} \qquad (2.12)$$
$$h_e = \mathrm{LayerNorm}(h'_e + \alpha_e) \qquad (2.13)$$

where $W_e^q \in \mathbb{R}^{d_m \times d_k}$, $W_e^k \in \mathbb{R}^{d_m \times d_k}$, $W_e^v \in \mathbb{R}^{d_m \times d_k}$ and $W_{e_1} \in \mathbb{R}^{d_m \times d_f}$ are parameter matrices and $b_{e_1}$, $b_{e_2}$ are bias vectors; LayerNorm denotes layer normalisation and ReLU is the activation function. The encoder and decoder are connected through an attention module, which allows the decoder to focus on different parts of the input sequence during the course of decoding. Through Equation 2.10 and Equation 2.11, $\alpha_d = (\alpha_{d_1}, ..., \alpha_{d_n})$ is computed in the decoder self-attention layer. The encoder-decoder attention is computed following the self-attention layer as:

$$\alpha'_{ed} = \mathrm{Attention}(\alpha_d W_a^q, h_e W_a^k, h_e W_a^v) \qquad (2.14)$$
$$\alpha_{ed} = \mathrm{LayerNorm}(\alpha'_{ed} + \alpha_d) \qquad (2.15)$$

where $W_a^q \in \mathbb{R}^{d_m \times d_k}$, $W_a^k \in \mathbb{R}^{d_m \times d_k}$ and $W_a^v \in \mathbb{R}^{d_m \times d_m}$ are the parameter matrices of the encoder-decoder attention layer. Then $\alpha_{ed}$ is fed to the position-wise feed-forward layer to produce the hidden representation $h_d = (h_{d_1}, ..., h_{d_n})$. Given $h_d$ and the previous $(i-1)$ words, the probability of generating word $w_i$ is:

$$P(w_i \mid w_1, ..., w_{i-1}, h_{d_i}) \propto \exp(W_p h_{d_i} + b_p) \qquad (2.16)$$

The objective is the sum of the log-probabilities for the input sequence itself:

$$J(\theta) = \sum_i \log P(w_i \mid w_1, ..., w_{i-1}, h_{d_i}) \qquad (2.17)$$
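As an illustration, a single encoder layer (Equations 2.9-2.13) can be sketched in NumPy as follows, assuming $d_m = d_k$ so that the residual connections line up; parameter initialisation and the learned gains of layer normalisation are omitted for brevity.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V, d_k):
    # scaled dot-product attention (Eq. 2.9)
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def encoder_layer(x, Wq, Wk, Wv, W1, b1, W2, b2):
    d_k = Wq.shape[1]
    a = attention(x @ Wq, x @ Wk, x @ Wv, d_k)   # Eq. 2.10
    a = layer_norm(a + x)                        # Eq. 2.11 (residual + norm)
    h = np.maximum(0, a @ W1 + b1) @ W2 + b2     # Eq. 2.12 (ReLU feed-forward)
    return layer_norm(h + a)                     # Eq. 2.13

# toy usage: n = 5 tokens, d_m = d_k = 8, feed-forward size d_f = 16
rng = np.random.default_rng(0)
shapes = [(8, 8), (8, 8), (8, 8), (8, 16), (16,), (16, 8), (8,)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
out = encoder_layer(rng.normal(size=(5, 8)), *params)   # shape (5, 8)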

2.3 Evaluation

In general, the evaluation of natural language generators is marked by a great deal of variety, and it is difficult to compare systems directly. There are at least two reasons why this is the case. A fundamental methodological distinction, due to [48], is between intrinsic and extrinsic evaluation methods. In the case of natural language generation, an intrinsic evaluation measures the performance of a system without reference to other aspects of the set-up, such as the system's effectiveness in relation to its users. Gatt and Krahmer [36] provide an example scenario, shown in Figure 2.3. Questions related to text quality, correctness of output and readability qualify as intrinsic, whereas the question of whether the system achieves its goal in supporting adequate decision-making on the offshore platform is extrinsic.

Figure 2.3: Hypothetical evaluation scenario: a weather report generation system embedded in an offshore oil platform environment. Possible evaluation methods, focussing on different questions, are highlighted at the bottom, together with the typical methodological orientation (subjective/objective) adopted to address them.

2.3.1 Intrinsic Methods

Intrinsic evaluation in natural language generation is dominated by two methodologies, one relying on human judgements (and hence subjective) and the other on corpora. Human judgements are typically elicited by exposing naive or expert subjects to system outputs and asking them to rate them on some criteria. Common criteria include:

Subjective (Human) Judgements

• Fluency or readability of the output text, reflecting its linguistic quality;

• Accuracy, adequacy, relevance or correctness relative to the input, reflecting the system's rendition of the content (e.g. [65], [93], [46]), a criterion often used in subjective evaluations of image-captioning systems as well (e.g. [61], [84]).

Though they are the most common, these two sets of criteria do not exhaust the possibilities. For example, subjective ratings have also been elicited for argument effectiveness in a system designed to generate persuasive text for prospective house buyers [18]. The use of scales to elicit judgements raises some concerns. While discrete, ordinal scales are the dominant method, a continuous scale, for example one involving a visually presented slider ([35], [9]), might give subjects the possibility of giving more nuanced judgements. For example, a text generated by the hypothetical weather report system (as in [36]) might be judged so difficult as to be given the lowest rating on an ordinal scale.

An additional concern with subjective evaluations is inter-rater reliability. Multiple judgements by different evaluators may exhibit high variance, a problem that was encountered in the case of Question Generation [95]. Godwine et al. [38] suggest that such variance can be reduced by an iterative method whereby the training of judges is followed by a period of discussion, leading to the updating of evaluation guidelines. This, however, is more costly in terms of time and resources.

It is probably fair to state that subjective, human evaluations are often carried out via online platforms such as Amazon Mechanical Turk and CrowdFlower, though this is probably more feasible for widely-spoken languages such as English. A seldom-discussed issue with such platforms concerns their ethical implications (for example, they involve large groups of poorly paid individuals; see [17]) as well as the reliability of the data collected, though measures can be put in place to ensure, for instance, that contributors are fluent in the target language (see e.g. [40], [75]).

Objective Humanlikeness Measures using Corpora. Intrinsic methods that rely on corpora can generally be said to address the question of 'human-likeness', that is, the extent to which the system output resembles human-authored text. A variety of corpus-based metrics often used earlier in related fields such as Machine Translation or Summarisation have been used in natural language generation evaluation, like METEOR [5], ROUGE [69] and BLEU [16]. Measures of n-gram overlap or string edit distance, usually originating in Machine Translation or Summarisation, are frequently used for evaluating surface realisation (e.g. [113], [14]) and occasionally also to evaluate the short texts characteristic of data-driven systems (e.g. [91], [57]) and image captioning (see [12], [52]). The focus of these metrics is on the output text, rather than its fidelity to the input. In a limited number of cases, surface-oriented metrics have been used to evaluate the adequacy with which output text reflects content ([6], [91]). However, if content determination is the focus, a measure of surface overlap is at best a proxy, relying on an assumption of a straightforward correspondence between input and output. This assumption may be tenable if texts are brief and relatively predictable. In some cases, it has been possible to use metrics to measure content determination directly, based on semantically annotated corpora. Direct measurements of content overlap between generated and candidate outputs will likely increase as automatic data-text alignment techniques make such 'semantically transparent' corpora more readily available for end-to-end natural language generation (see e.g. [20] [68]). An important development away from pure surface overlap is the use of semantic resources (as in the case of METEOR [5]) or word embeddings (as in [62]) to compute the proximity of output to reference texts beyond literal string overlap. In a comparative evaluation of metrics for image captioning, [52] found some advantages of this approach compared to other metrics.
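For intuition, a recall-oriented n-gram overlap in the spirit of ROUGE-N can be written in a few lines; this is a simplified sketch, not the official ROUGE implementation used later in Chapter 4.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    # clipped n-gram matches divided by the reference n-gram count (recall)
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("il governo approva la riforma".split(),
              "il governo vara la riforma fiscale".split()))   # 4/6 = 0.667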

2.3.2 Extrinsic Methods

In contrast to intrinsic methods, extrinsic evaluations measure effectiveness in achieving a desired goal. Clearly, 'effectiveness' depends on the application domain and purpose of a system. Gatt and Krahmer [36] provide some examples in various scenarios:

• persuasion and behaviour change (for example, through exposure to personalised smoking cessation letters [92]);

• purchasing decisions after the presentation of arguments for and against options on the housing market, based on a user model [18];

• engagement with ecological issues after reading blogs about migrating birds [74];

• decision support in a medical setting following the generation of patient reports [89], [46];

• enhancing linguistic interaction among users with complex communication needs via the generation of personal narratives [107];

• enhancing learning efficacy in tutorial dialogue [24] [21].

While questionnaire-based or self-report studies can be used to address extrinsic criteria (e.g., [46], [88], [18]), in many cases evaluation relies on some objective measure of performance or achievement. This can be done with the target users in place, but it can also take the form of a task that models the scenarios for which the natural language generation system has been designed.

A potential drawback of extrinsic studies, in addition to time and expense, is their reliance on an adequate user base (which can be difficult to obtain when users have to be sampled from a specific population) and on the possibility of carrying out the study in a realistic setting. Such studies also raise significant design challenges, due to the need to control for intervening and confounding variables, to compare multiple versions of a system, or to compare a system against a gold standard or baseline. For example, [18] note that evaluating the effectiveness of arguments presented in text needs to take into account aspects of a user's personality which may affect how receptive they are to arguments in the first place.

2.3.3 Relationship Between Evaluation Methods

When it comes to collecting results, it turns out that multiple evaluation methods seldom give converging verdicts on a system, or on the relative ranking of a set of systems under comparison. Although corpus-based metrics used in Machine Translation and Summarisation are typically validated by demonstrating their correlation with human ratings, meta-evaluation studies in natural language generation have produced mixed results. The correlation between a metric and human judgements appears to differ across studies, suggesting that metric-based results are highly susceptible to variation due to generation algorithms and datasets. For instance, [57] find that, on corpus-based metrics, the best-performing version of their model does not outperform that of [53] on another domain or corpus.

An important contribution is the study by [37], which addressed the validity of corpus-based metrics in relation to human judgements within the domain of weather forecast generation. In a first experiment, focussing on linguistic quality, the authors found a high correlation between expert and non-expert readers' judgements, but the correlation between human judgements and the automatic metrics varied considerably (from 0.3 to 0.87), depending on the version of the metric used and on whether the reference texts were included in the comparison by human judges. The second experiment evaluated both linguistic quality, by asking human judges to rate clarity/readability, and content determination, by eliciting judgements of accuracy/appropriateness (comparing texts to the raw data). The automatic metrics correlated significantly with judgements of clarity, but far less with accuracy, suggesting that they were better at predicting linguistic quality than correctness. Other studies have yielded similarly inconsistent results. In a study on paraphrase generation, [105] found that automatic metrics correlated highly with judgements of adequacy (roughly akin to accuracy), but not of fluency. Various factors can be adduced to explain the inconsistency of these meta-evaluation studies:

• Metrics such as BLEU [16] are sensitive to the length of the texts under comparison: with shorter texts, n-gram based metrics are likely to result in lower scores;

• The type of overlap matters: for example, many evaluations in image captioning rely on BLEU-1 ([28], [27] were among the first to experiment with longer n-grams), but longer n-grams are harder to match, though they capture more syntactic information and are arguably better indicators of fluency;

• Semantic variability is an important issue: generated texts may be similar to reference texts, but differ in some near-synonyms or in subtle word order variations;

• Many intrinsic corpus-based metrics are designed to compare against multiple reference texts, but this is not always possible in natural language generation. For example, while image captioning datasets typically contain multiple captions per image, this is not the case in other domains, like weather reporting or restaurant recommendations.

Gatt and Krahmer [36] suggest erring in favour of diversity: using multiple methods as far as possible and reporting not only their results but also the correlation between them, or relying mainly on human or systematic measures where appropriate. Weak correlations need not imply that the results of a particular method are invalid; rather, they may indicate that the measures focus on different aspects of a system or its output.


Chapter 3

Harvesting Corpora of News

3.1 Scraping Data

We collected data by scraping directly from the web using news-please [42], a generic, multi-language, open-source crawler and extractor specific for news. News-please is a versatile, powerful and easily configurable scraping tool, providing crawling, extraction and data storage functionalities, and allowing almost every news website to be scraped. During the course of the project it was necessary to perform multiple scraping sessions, according to the needs and issues that emerged. In the first scraping stage, we scraped Il Giornale and La Repubblica, two of the most important Italian newspapers, known to sit at opposite ends of the political spectrum, and Lercio, a popular satirical newspaper. The scraping process took two days, after which it had produced the following raw data:

• Il Giornale: 202736 news items including headline, abstract, author, date and full text

• La Repubblica: 150635 news items including headline, abstract, author, date and full text

• Lercio: 16707 news items including headline, full text and date

To address the lack of satirical news data, we performed further scraping sessions on various Italian satirical newspapers, gathering very little data which in the end was not used, as it was not sufficiently plentiful. Afterwards, in the context of an experiment exploiting a cross-lingual approach, we scraped news from 26 of the most popular satirical newspapers from the USA, UK and Australia, gathering 494690 news items.

(Sources: http://www.ilgiornale.it/, https://www.repubblica.it/, https://www.lercio.it)

Data Cleaning. The raw data underwent a cleaning process comprising some very basic common operations, like dropping duplicates and removing records with null values, and some newspaper-specific operations like removing JavaScript code and advertisement text, normalising the date format and removing other unnecessary text extracted during the scraping phase. After this process the available data were:

• Il Giornale: 202421 news items

• La Repubblica: 74248 news items

• Lercio: 6494 news items (only 2020 with full text)

• Satirical English newspapers: 130987 news items

The satirical English news has been used for a particular experiment described later; since this work focuses on Italian news, we do not carry out the same detailed analysis for the English news.
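The cleaning process described above can be summarised in a short pandas sketch; the column names and the junk-matching pattern are illustrative assumptions, since the exact newspaper-specific rules are not reproduced here.

import re
import pandas as pd

def clean_corpus(df: pd.DataFrame) -> pd.DataFrame:
    # drop duplicates and records with null values
    df = df.drop_duplicates(subset=["headline", "full_text"])
    df = df.dropna(subset=["headline", "full_text"])
    # newspaper-specific junk: leftover JavaScript and advertisement text
    junk = r"(?s)<script.*?</script>|window\.\w+\s*=[^;]*;"
    df["full_text"] = (df["full_text"]
                       .str.replace(junk, " ", regex=True)
                       .str.replace(r"\s+", " ", regex=True)
                       .str.strip())
    # normalise the date format
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    return df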

3.2 Data Understanding

We computed some statistics on the corpora, conducting two separate analyses for satirical news and regular news, and keeping the English satirical news aside.

Il Giornale and La Repubblica. As shown in Figure 3.1, the two newspapers have similar chronological news distributions. Most of the publications lie between 2004 and 2019 and present an increasing trend in recent years. Both have some news published before 2004, but overall they cover the same time frame.


Figure 3.2: La Repubblica news topic distribution

Figure 3.1: Left: news distribution over the years for La Repubblica . Right: news distribution over the years for Il Giornale .

According to the topic distribution (Figure 3.2), La Repubblica tends to publish news related to world news, politics and general news, while no topic information is available for Il Giornale. Figure 3.3 shows statistics about headlines and abstracts: both newspapers have a similar median headline length of around 10 words. Il Giornale has many outliers and shows much variation in length. Concerning the abstracts, La Repubblica features a longer median length and a wider variation with respect to Il Giornale, which has shorter abstracts and less variation.


Figure 3.3: Left: headlines length box plot for La Repubblica and Il Giornale . Right: abstract length box plot for La Repubblica and Il Giornale .

Another interesting piece of information is provided by the most common words used in the headlines. The two newspapers share most of the words shown in Figure 3.5 and Figure 3.4; "italia" is the most common word for both, and the other word counts are quite similar too. Many words are politically related, like "renzi", "pd", "m5s", "salvini" for La Repubblica and "renzi", "salvini", "berlusconi" for Il Giornale, and they are also indicative of the political parties. Other common topics are the European Union ("ue", "europa", "euro"), recent news like "migranti", and world news like "trump" and "usa".


Figure 3.5: Most common words used in the headlines of Il Giornale .

Lercio. Lercio shows the opposite trend in article distribution per year with respect to La Repubblica and Il Giornale: we see a progressive decrease in published articles over the years. Looking at the headline length distributions, we see values similar to those of La Repubblica and Il Giornale (Figure 3.6).

Figure 3.6: Left: Lercio headlines box plot . Right: news distribution over the years in Lercio .

Obviously the word counts are not comparable to those of La Repubblica and Il Giornale, but it is interesting to see that in Lercio too the most common words are very similar to those used by the other newspapers. This suggests that Lercio has a vocabulary more similar to La Repubblica and Il Giornale than expected, and that it may also share many topics, especially politics, the European Union and general news, according to the shared words shown in Figure 3.7.


Figure 3.7: Lercio most common words used in the headlines.

3.3 Newspapers Classification

As previously noted, La Repubblica and Il Giornale have similar common words and probably share some writing features. The next step is to understand whether it is possible to distinguish and characterise them by extracting some meaningful features. We used an SVM as a baseline for the task and to extract features for interpretative purposes, and we compared it with an LSTM-based classifier to understand the difficulty of the task and figure out which approach performs better. The data has been preprocessed following the strategy proposed by [1] to build the itWac Italian word embeddings, which briefly consists of the following steps (sketched in code after the list):

• keep only digits in the interval 0-2100, replacing all others with the placeholder @Dg

• replace URLs with the placeholder URL

• replace words longer than 26 characters with the placeholder LONG-LONG
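A minimal sketch of these three normalisation rules; the regular expressions and the whitespace tokenisation are illustrative simplifications.

import re

def normalise(text: str) -> str:
    tokens = []
    for tok in text.split():
        if re.fullmatch(r"https?://\S+|www\.\S+", tok):
            tokens.append("URL")                     # URL placeholder
        elif re.fullmatch(r"\d+", tok):
            # keep only numbers in the range 0-2100
            tokens.append(tok if int(tok) <= 2100 else "@Dg")
        elif len(tok) > 26:
            tokens.append("LONG-LONG")               # over-long words
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(normalise("Nel 2019 su www.esempio.it un articolo di 3000 parole"))
# Nel 2019 su URL un articolo di @Dg parole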

The data has been split into 90% for training and 10% for testing; the training partition is composed of an overall number of 133646 headlines, perfectly balanced between La Repubblica and Il Giornale, and we used the remaining 14862 headlines for testing.

Linear SVM. As features we used n-grams in the range 1-3, and we trained a Linear SVM for the classification. We performed different experiments using character-level and word-level tokenizers, POS tagging and bleaching. The latter, presented in [109], is a normalisation technique proved to be effective in a gender prediction task: it consists of replacing every character with placeholders, so as to preserve only upper-case, lower-case, digits, length and punctuation. We used the standard configuration for the Linear SVM.
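The bleaching transformation can be sketched as follows; this is an illustrative variant in the spirit of [109] (case, digits, length and punctuation survive, everything else is masked), not the original implementation.

def bleach(text: str) -> str:
    out = []
    for ch in text:
        if ch.isupper():
            out.append("U")          # upper-case letter
        elif ch.islower():
            out.append("l")          # lower-case letter
        elif ch.isdigit():
            out.append("D")          # digit
        else:
            out.append(ch)           # punctuation and spaces kept as-is
    return "".join(out)

print(bleach("Renzi: 2 voti!"))      # Ullll: D llll!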

Setting / Newspaper        Precision  Recall  F1-score

Word tokenization
  La Repubblica            0.770      0.803   0.786
  Il Giornale              0.794      0.761   0.777

Char tokenization
  La Repubblica            0.767      0.828   0.796
  Il Giornale              0.813      0.749   0.780

Bleaching
  La Repubblica            0.677      0.693   0.685
  Il Giornale              0.686      0.669   0.677

POS tagging
  La Repubblica            0.651      0.670   0.660
  Il Giornale              0.660      0.640   0.650

Table 3.1: Results obtained by the SVM with different tokenization methods.

Table 3.1 shows the results of the experiments. Char and word tokenization perform considerably better than bleaching and POS tagging, probably because of the information lost in the normalisation of the tokens imposed by the latter two. Char tokenization performs slightly better than its word-level counterpart, but it probably takes advantage of some intra-word relationships which are more frequent but less meaningful than the inter-word relationships captured by the word-level variant. For example, among the top char features for La Repubblica we see different n-grams belonging to the same word, like "m i l a n" and "i l a n", or "' i s i s" and "l ' i s i s", and other meaningless features such as "i . ", ", " or ", i". Word features are more informative and provide a better understanding of the style or the topics featured by the two newspapers. Among Il Giornale's top word features there is also some noise due to punctuation, but also some more interesting words, such as "choc", "fiat", "milan", "milano", "sexy", "malpensa". Some features are basically the same, like "milan" and "m i l a n", but it is better to consider them as words rather than characters if we want to find some interesting interpretation.

LSTM-based Classifier. We used a character-level encoder which encodes character embeddings for a single word using a convolutional layer with kernel size 4, followed by a pooling layer of size 2; the output is then flattened and fed into a fully connected layer of 20 units with ReLU activation. The output is concatenated to the corresponding word embedding of size 256, obtaining an embedding with both char-level and word-level information. We then encode these embeddings with a Bi-LSTM with 15 units, dropout 0.5 and recurrent dropout 0.5, using binary cross-entropy as the loss; the network is optimised with Adam. We train the model with batch size 256 using early stopping [10], fixing the maximum headline length to 25 words and the maximum number of chars per word to 20. The results are promising: the model performs well in testing, reaching good scores in precision, recall and F1-score as shown in Table 3.2, even outperforming the Linear SVM by several points.

Newspaper        Precision  Recall  F1-score
La Repubblica    0.812      0.813   0.812
Il Giornale      0.813      0.811   0.812

Table 3.2: Results obtained by the LSTM-based classifier.


Figure 3.8: LSTM classifier's schema: the char encoding is a convolutional network producing an embedding for each of the token's chars. The char embeddings are then concatenated and attached to the corresponding token's word embedding, obtaining a unique embedding vector for each token which includes character and word information.
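A sketch of this architecture in Keras, following the description above; the vocabulary sizes, the char embedding size and the number of convolutional filters are illustrative assumptions not stated in the text.

from tensorflow.keras import layers, models

MAX_WORDS, MAX_CHARS = 25, 20            # from the training setting above
CHAR_VOCAB, WORD_VOCAB = 100, 50000      # illustrative vocabulary sizes

# char branch: a small conv encoder applied to every word's characters
char_in = layers.Input(shape=(MAX_WORDS, MAX_CHARS))
c = layers.Embedding(CHAR_VOCAB, 16)(char_in)                 # char embeddings
c = layers.TimeDistributed(layers.Conv1D(32, 4, activation="relu"))(c)
c = layers.TimeDistributed(layers.MaxPooling1D(2))(c)
c = layers.TimeDistributed(layers.Flatten())(c)
c = layers.TimeDistributed(layers.Dense(20, activation="relu"))(c)

# word branch: one 256-dimensional embedding per token
word_in = layers.Input(shape=(MAX_WORDS,))
w = layers.Embedding(WORD_VOCAB, 256)(word_in)

# concatenate char- and word-level information, then the Bi-LSTM
x = layers.Concatenate()([w, c])
x = layers.Bidirectional(layers.LSTM(15, dropout=0.5, recurrent_dropout=0.5))(x)
out = layers.Dense(1, activation="sigmoid")(x)                # binary label

model = models.Model([word_in, char_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])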

Italian Satirical News. We adopted a similar procedure to classify satirical news versus regular news. The dataset was composed of 12988 headlines, perfectly balanced between satirical and regular ones: we used all the available 6494 Lercio headlines, and 6494 regular headlines split evenly between La Repubblica and Il Giornale. Unlike regular news, satirical news items are far fewer, so we performed a grid search on the Linear SVM using a validation partition of 20% of the data, setting aside the LSTM-based approach. Aiming to exploit all the available satirical data, we did the same with the full texts. After the grid search we used, for headlines, C = 1, l2 regularisation, char-based tokenization and n-gram range (3, 6), and for full text, C = 10, l1 regularisation, char-based tokenization and n-gram range (3, 6), obtaining the results shown in Table 3.3.

         Headline  Full text
Train    0.892     0.986 (0.948)*
Valid    0.914     0.986 (0.959)*

Table 3.3: Results obtained by the Linear SVM classifier after the grid search for headlines and full text. *Scores obtained using normalization and word-based tokenization.

Figure 3.9: Top features extracted from the SVM using char-based n-grams. Positive scores are top features for Lercio, negative ones are top features for regular news.

The Lercio full texts available are far fewer than those of the regular news, meaning that we are considering a very small portion of data which is probably not representative of the satirical domain. As with the regular news, we plotted the top features for both classes; as expected, the features make no sense when using char-based tokenization, despite achieving higher scores. Besides, we noticed that most of the top char n-gram features contain special characters which are used by one newspaper rather than the other, making the task much easier for the classifier. The point is that those features are not valuable, and furthermore they could mask potentially interesting features; thus we repeated the experiment using word-based tokenization, applying a normalisation for the special characters. The resulting test scores, shown in Table 3.3 (in brackets), were slightly lower, but the features were more valuable and representative of the newspapers' styles. This is visible in Figure 3.10, which shows the top features of the last experiment. Clearly not every word shown is particularly distinctive of the style, but we can find some clues concerning style and content. For example, swear words and adverbs seem to be common in the Lercio style, whereas La Repubblica and Il Giornale lean on content: we could say that all of their features are related to the most common topics treated by the two dailies.

Figure 3.10: Top features extracted from the SVM. Positive scores are top features for Lercio, negative ones are top features for regular news.


3.4 News Topic Alignment

After the data understanding phase, it is quite clear that a newspaper's style is characterised by some peculiar features in the text, like particular words and expressions, or even more abstract features like humour and political utterances, which give the corpus different nuances and overall describe the style of a newspaper. Capturing these differences in a corpus is not trivial, because a corpus is always conditioned by other aspects, like context and topic, which are melted together and make the data really complex to deal with when trying to capture something as abstract as style. A way to deal with this problem is to make the news invariant to the other aspects, so that what we are interested in stands out. Aligned data are a practical example of this: topic-aligned news makes it possible to point out the various, different uses of the lexical and semantic features which compose the style. We devised a method to align news based on topic, and we built two further datasets of topic-aligned news, one aligning Lercio with La Repubblica and Il Giornale, and one aligning Il Giornale with La Repubblica. To perform the alignment, we compute the tf-idf vectors of all the articles of both newspapers and we create subsets of relevant news filtered by date, i.e. considering only news published in approximately the same short temporal range for the two sources. On the tf-idf vectors we then compute cosine similarities for all news in the resulting subset, we rank them, and we retain only the alignments above a certain threshold. The threshold is chosen taking into consideration a trade-off between the number of documents and the quality of the alignment.
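A minimal sketch of the alignment step with scikit-learn, assuming the date-based pre-filtering has already produced the two article lists; the threshold value is the tunable trade-off discussed above.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_news(news_a, news_b, threshold=0.5):
    # tf-idf vectors over the union of the two (date-filtered) subsets
    tfidf = TfidfVectorizer().fit_transform(news_a + news_b)
    sims = cosine_similarity(tfidf[:len(news_a)], tfidf[len(news_a):])
    pairs = []
    for i in range(sims.shape[0]):
        j = int(np.argmax(sims[i]))          # best match in the other source
        if sims[i, j] >= threshold:          # keep alignments above threshold
            pairs.append((i, j, float(sims[i, j])))
    return sorted(pairs, key=lambda p: -p[2])   # rank by similarity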


Chapter 4

Headline Generation as Summarisation

The headline is the first access point to the news: it is the trigger that activates the reader, and for this reason it is probably the most important part of the news, in some sense even more than the content. It should be attractive, raising interest and curiosity, encouraging the reader to go into the news, even fooling them with misleading and ambiguous words. This problem has been modelled as an extreme summarization task with relatively good results. Given that many different generated texts can be correct, existing measures are usually deemed insufficient [70]. The problem is even more acute for headline generation since, due to their nature and function, simple content evaluation based on word overlap is most likely not exhaustive. Human-based evaluation can provide a richer picture. In this task more than in others, human evaluation is fundamental to understand whether a generated headline is as effective as a real one, precisely because a headline is not a mere summarization. We use that approach in this work, aiming to provide a good human-based evaluation framework that could be used for other language tasks which, like generation tasks, are not easily evaluable.

4.1 Models

In this section, we give a brief description of the models employed in the experiments.

Sequence-to-Sequence with Attention (S2S). We used a sequence-to-sequence model [106] with attention [3] in the configuration used by [100], but with a bidirectional instead of a unidirectional layer; this choice applies to all the models we used. The final configuration is 1 bidirectional encoder-decoder layer with 256 LSTM cells each, no dropout and shared embeddings of size 128; the model is optimised with Adagrad with learning rate 0.15 and gradients clipped [80] to a maximum magnitude of 2.

Pointer Generator Network (PN). The hybrid pointer-generator network architecture developed by [100] is able to copy words from the source text via a pointing mechanism, and to generate words from a fixed vocabulary. This allows for better handling of out-of-vocabulary words, providing accurate reproduction of information while retaining the ability to produce novel words. The base architecture is a sequence-to-sequence model, except for the pointing mechanism and for the fact that the copy attention parameters are shared with the regular attention. An additional layer (the so-called bridge [56]) is trained between the encoder and the decoder and is fed with the latest encoder states. Its purpose is to learn to generate initial states for the decoder, instead of initialising them directly with the latest encoder states.

Pointer Generator Network with Coverage (PNC). This model is basically a Pointer Generator Network with an additional coverage attention mechanism intended to overcome the copying problem typical of sequence-to-sequence models [100]. Coverage is based on a vector computed by summing the attention distributions over all previous decoder time-steps. This unnormalised distribution over the document words is expected to represent the degree of coverage that the words have received from the attention mechanism so far. This vector, called the coverage vector, is used to penalise attention over already generated words, to minimise the risk of generating repetitive text.

4.2 Experiments

We conducted various experiments aiming to study possible improvements in generation when preprocessing the news. Preprocessing is needed and useful, mainly because the news articles are longer than what the models can effectively consume, so we experimented with truncating the text or selecting the most important sentences, using a graph-based method and a neural-based method developed from scratch. Furthermore, we performed experiments using embeddings built on ItWac and other experimental embeddings, hereinafter called XXL, built on a larger corpus including ItWac, an Italian Wikipedia dump specifically scraped for the purpose, and the top 17 Italian dailies, also specifically scraped (including La Repubblica and Il Giornale). We used a sequence-to-sequence model with attention to make the experiments comparable; we then compared the best preprocessing solution across every single model.

4.2.1 Settings

We used La Repubblica and Il Giornale to build a dataset of 276669 news items, using 70% of the samples for training (193668 news), 20% for validation (55334 news) and 10% for testing (27667 news). We performed experiments using news truncated at 300 tokens, removing partial sentences in order to keep the text consistent. Building on the hypothesis that the headline is representative of the core facts featured in the news and that some sentences are more factual than others, we performed experiments preprocessing the news in order to extract the most important sentences. Clearly, we cannot extract the most interesting sentences manually without incurring a biased model, so we used the two approaches described below to accomplish this task.

Textrank. TextRank [79] [7] is an algorithm based on PageRank [69], often used in keyword extraction and text summarization. It applies the working principle of PageRank by breaking the news down into sentences and building a graph from which it computes a ranking of the most important sentences. We used the top 5 sentences of the ranking in order to obtain full-text lengths similar, on average, to the truncated news.
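A sketch of this kind of sentence ranking, building the sentence graph from tf-idf cosine similarities and ranking nodes with PageRank; the cited TextRank implementation may differ in its similarity function and tie-breaking.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_sentences(sentences, k=5):
    # sentence graph weighted by pairwise tf-idf cosine similarity
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    scores = nx.pagerank(graph)              # PageRank over the sentence graph
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]   # keep document order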

Sequence Labeller. We built a dataset using full texts as samples and, as targets, a sequence of binary labels denoting the presence of each token in the headline, and we implemented the sequence labeller using a Bi-LSTM encoder with 10 units, embeddings of size 128 and Adam optimisation. The experiment did not produce good results: even with high accuracy, the model was biased towards the most frequent class. We found the reason in an extreme class imbalance: the class of "out-of-headline" words made up ≈ 98% of the tokens, and although we adopted weighted classes during training, we got really poor results. This was a critical problem in the data, and in the end we decided to discard the sequence labeller and continue solely with the dataset generated by Textrank.

4.2.2 Analysis

We computed the ROUGE [69] scores for each experiment (Table 4.1), aware of the limits of that metric: since it basically relies on content overlap, it provides no clues about the quality of the headline, so we also performed a qualitative human evaluation considering aspects like grammatical correctness, attractiveness and suitability to the article. However, we use ROUGE as an index to evaluate critical cases.

Settings            Rouge-1               Rouge-2               Rouge-L
                    Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1
Truncated           0.283  0.190  0.222   0.093  0.062  0.074   0.258  0.174  0.184
Truncated + XXL     0.264  0.190  0.215   0.081  0.068  0.072   0.241  0.172  0.183
Truncated + itWac   0.249  0.192  0.185   0.082  0.053  0.081   0.251  0.169  0.178
Textrank            0.215  0.151  0.172   0.059  0.033  0.048   0.195  0.137  0.144
Textrank + XXL      0.202  0.141  0.162   0.051  0.036  0.041   0.185  0.130  0.137
Textrank + itWac    0.216  0.138  0.163   0.054  0.038  0.043   0.200  0.127  0.135

Table 4.1: Results of the experiments in various settings for a sequence-to-sequence model with attention. For each experiment, Precision, Recall and F1-score of Rouge-1, Rouge-2 and Rouge-L are shown.


Model               Example
Truncated           Al Zawahiri: "Obama non ha cambiato niente"
Truncated + XXL     Genova, donna incinta di 14 anni fermata a Genova
Truncated + itWac   Corruzione, l'Antitrust boccia il posto fisso
Textrank            L'Inter fa il bis: il Napoli batte l'Inter
Textrank + XXL      Gp Malesia, Lorenzo trionfa davanti a Lorenzo
Textrank + itWac    L'appello di Amanda Knox: "La giustizia non e innocente"

Table 4.2: Some examples of generated headlines for each setting.

Comparing generations and scores, we notice that Truncated produced better generations than Textrank. The truncated ones are more realistic and correctly represent the content of the news, even when using different terms. XXL embeddings produced slightly better generations than ItWac, contrary to what Table 4.1 suggests, but in the end did not outperform Truncated. We observe that neither Textrank nor, much less, the embeddings seem to improve the generations. Textrank tends to produce headlines that are very different from the original ones, probably because of the pre-selection of sentences performed during the ranking. As a result, most of the time Textrank's generated headlines are confusing and do not correctly represent the overall content of the news. For example, the headline "L' Inter fa il bis : il Napoli batte l' Inter" in Table 4.2 refers to a completely different subject compared to the original one, "La Fiorentina si aggiudica il derby dell' Appennino : Bologna ko 2": probably that sentence was ranked as one of the most important in the news, conditioning the model to generate a headline with different content. We therefore decided to continue by truncating the input, since it produces, in general, more grammatically correct sentences and, furthermore, headlines with consistent content. We then proceeded to test the three models discussed in Section 4.1.


Model   Example
S2S     Erdogan - Netanyahu , accuse durissime : ” Israele come Hitler ” , ” No , tu sei un dittatore e stragista ”
PN      Erdogan - Israele , la replica : ” Israele e il Paese piu fascista ”
PNC     Israele , Netanyahu : ” Israele e il Paese piu sionista , Hitler fascista fra i curdi ”

Table 4.3: Examples of generated headlines for each model used in the human evaluation with the final setting.

In the final test, we compared the three models using truncated full texts and the configuration presented in Section 4.1. We trained the three models with the same number of training steps to make their generations comparable in the evaluation. We found that 125000 training steps were enough to obtain good generations from every model.

4.3 Human Evaluation

Referring to human-based (intrinsic) evaluation of summarisation models, Gatt & Krahmer [36] mention two core aspects: linguistic fluency or correctness, and adequacy or correctness relative to the input. In terms of the system's rendition of the content, and in the context of evaluating the generation of the final sentence of a story, we also need to examine grammaticality, (logical) consistency, and context relevance [67]. We took these factors into consideration when designing our evaluation settings. Since headlines must also carry some "attraction" factor that entices readers into the whole article, we included this aspect as well.

4.3.1 Settings

We prepared an evaluation form, which included five different questions for each case (see Figure 4.1). A case is basically an article and its four corresponding headlines to be evaluated, namely the three automatically generated ones and the original (gold) title. Each subject could first see the four headlines and answer questions Q1–Q3; the corresponding article was presented only afterwards, for Q4 and Q5, so that the first three answers were given on the basis of the headlines only, which matters especially for the validity of Q3. The order in which gold and generated titles were shown was randomised, though it was the same for each case for all participants. Each form comprised 20 cases to evaluate and was sent to 3 participants.

The four titles are shown (repeated for each question below):

A. Usa , la fabbrica del vetro d’ aria per il telefono d’ aria in Usa
B. Se il lavoro va ai robot : un automa vale sei operai
C. Usa , Trump : ” Trump si difende l’ occupazione e l’ economia nazionale ”
D. Usa , la beffa del condizionatore d’ aria ” made in Usa ” : ” Ecco come si difende ”

And the following questions are then asked:

[at this stage the subjects only see the titles, without the article]

Q1. Questi titoli sono scritti correttamente? ("Are these headlines written correctly?") yes/no for each
Q2. Secondo te, questi titoli parlano dello stesso articolo? ("In your opinion, do these headlines refer to the same article?") yes/no for pairs of titles
Q3. Quale di questi titoli ti invoglia maggiormente a leggere l'intero articolo? ("Which of these headlines most entices you to read the whole article?") pick one

[now the subjects also see the article]

Q4. Ritieni che il titolo sia appropriato all'articolo? ("Do you think the headline is appropriate to the article?") yes/no for each
Q5. Quale ti sembra più adatto? Ordinali. ("Which one seems most suitable to you? Rank them.") rank 1–4

Figure 4.1: Sample evaluation case. Subjects are presented with the gold and generated headlines in random order, and must answer a progression of questions, first without and then with seeing the article. Q1 targets correctness, Q2 targets similarity in topic focus, Q3 targets attractiveness, Q4 and Q5 target appropriateness (absolute, and relative to one another). In this example, A=s2s, B=gold, C=pnc, D=pn.

We created 10 different forms, thus obtaining judgements for 200 total cases from 30 participants (600 separate judgements). The participants are all native speakers of Italian, balanced for gender (15F/15M). We also aimed at a wide range of ages (17–77) and education levels (from middle school diploma to PhD).


Figure 4.2: Left: Participants’ education level. Right: Participants’ age distribution.

The headlines used for this evaluation were randomly selected from the test set. We excluded all cases where at least one model produced a headline containing at least one unknown word (represented with the special token <UNK>), since this would make the headline look odd and hardly comprehensible. This led to excluding approximately 50% of the samples. The model with the highest proportion of headlines containing at least one <UNK> was the S2S (37%), followed by the PNC (31%) and the PN (30.2%). A minimal sketch of this filter is shown below.
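The filter itself is straightforward; in this sketch, `cases` and its field names are illustrative assumptions, with each case holding the three generated headlines for one article:

```python
UNK = "<UNK>"
MODELS = ("s2s", "pn", "pnc")

def keep_case(case: dict) -> bool:
    # Keep a case only if no model produced an unknown word in its headline.
    return all(UNK not in case[m] for m in MODELS)

cases = [
    {"s2s": "Usa , Trump : ...", "pn": "Usa , la beffa ...", "pnc": "<UNK> d' aria ..."},
]
eligible = [c for c in cases if keep_case(c)]  # this example case is excluded
```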

4.3.2 Analysis

In this section, we discuss and analyse the results in detail, dwelling on each question.

Grammatical Correctness (Q1) In Q1 we asked participants to evaluate the grammatical correctness of the headlines. Results show that participants judge the headlines correct more frequently than not correct, with Gold and PN having the best yes/no ratio and S2S the worst (Figure 4.3). It is interesting to note that gold headlines too are frequently judged not correct, implying a certain strictness employed during the evaluation. This result could also suggest that correctness is a partly subjective notion when applied to headlines, which do not always follow standard grammatical rules.

Figure 4.3: Left: Correctness judgements (Q1). Right: Attractiveness judgements (Q3)

Content Similarity (Q2) We note that the two pointer networks are always the most similar in content, while all the pairs involving the gold headlines are the most dissimilar. This suggests that human titles focus on aspects of the article different from those picked by the generator, most likely because humans can abstract away from the actual text and use much more creativity.

Attractiveness (Q3) This result shows a clear distinction between generated and gold headlines. Indeed, in the large majority of the cases, the gold headline was chosen as the most inspiring towards reading the whole article (Figure 4.3). Among the models, the headlines generated by the PN are chosen most often, followed by the PNC and lastly by the S2S. Clearly, gold headlines are the ones that raised the most interest among the participants, suggesting that there is something in the way experts create headlines, most likely related to human creativity, rhetoric and communication strategies, which systems are not yet able to reproduce.

Suitability (Q4-Q5) In assessing how appropriate a headline is with respect to its article, two interesting results stand out. In terms of a binary evaluation for each headline (Figure 4.4, left), in almost every case including Gold, the headlines are deemed not appropriate more frequently than appropriate to their article, except for PN, which is the only case where positive evaluations are slightly more frequent than negative ones. The generated headlines might simply not be good enough, but in the case of Gold this could be due to the fact that the excessive creativity used to make the title attractive can make it less adherent to the actual content, misleading the reader.

Figure 4.4: Left: suitability judgement for each headline (yes/no). Right: headlines ranked from most (1) to least (4) appropriate for each corresponding article. Average ranking: PN=2.401; Seq2Seq=2.488; PNC=2.530; GOLD=2.580.

The rank shows a possibly unexpected trend (Figure 4.4, right side). The headline chosen as most appropriate (ranked 1st) is most of the time the one produced by the PN model, even more so than the gold. Moreover, the gold is also the headline that features last (ranked 4th, thus least suitable) more often than any of the other titles. This is reflected in the average rank (see caption of Figure 4.4): the gold headline comes in last, and the PN-generated title is comparatively the most preferred.

4.3.3 Agreement


We measured inter-annotator agreement over the human evaluation using Krippendorff's alpha (Table 4.4). To consider an agreement acceptable, an α ≥ 0.667 is required [59]; however, we obtained lower scores, suggesting that the task is highly subjective. This is especially true for the evaluation of how attractive a headline is towards reading the whole article. Possibly surprising is the score for the evaluation of the headline's correctness, which could be viewed as a more objective feature to assess. Such a relatively low score could be due to the vagueness of Q1, in combination with the nature of headlines, which even in their human version might be formulated in ways that do not necessarily abide by grammatical rules.

                  G      S2S    PN     PNC    tot
Correctness       0.439  0.427  0.345  0.337  0.387
Attractiveness    –      –      –      –      0.120
Suitability       0.349  0.354  0.374  0.313  0.348
Suitability-Rank  0.444  0.364  0.339  0.398  0.389

Table 4.4: Krippendorff's alpha scores for the human annotations. The rightmost column shows the agreement over all systems plus the gold headlines. The higher the score, the higher the agreement.
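The scores above can be computed with the krippendorff Python package; a hedged sketch with a toy judgement matrix (the actual annotation matrices are not reproduced here):

```python
# The reliability matrix has shape (n_raters, n_items); np.nan marks
# missing judgements. Binary yes/no answers are nominal data; the Q5
# ranks would instead use level_of_measurement="ordinal".
import numpy as np
import krippendorff

# Toy example: 3 raters judging the correctness (Q1) of 5 headlines (1=yes, 0=no).
ratings = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, np.nan],  # one missing judgement
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```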

4.3.4 Discussion

We assessed and compared, through human judgement, the quality of three different sequence-to-sequence models that generate headlines starting from an article, and contextually we used the same judgements to evaluate the original headlines as well. The best system is the pointer network model, with correctness judgements on par with the gold headlines. Evaluating the generated output on different levels, especially attractiveness, which typically characterises news headlines, uncovered an interesting aspect: gold headlines appear to be the most attractive towards reading the whole article, but are not considered the most suitable; on the contrary, they are judged as the most unsuitable of all. Therefore, when automatically generating headlines, relying on content alone might never lead to titles that are human-like and attractive enough for people to read the article. This should be considered in any future work on news headline generation, precisely because the human component of the human-written headline lies not in the content, but in an abstract aspect such as attractiveness, which in this experiment the neural models were not able to reproduce. One aspect that has not been explicitly considered in our experiments is that the headlines come from different newspapers (positioned at opposite ends of the political spectrum) and can carry newspaper-specific characteristics. Robust headline generation should consider this, too.

4.4 Classification as Evaluation

We generated headlines using a set of news coming from newspapers at opposite ends of the political spectrum. When using a generative model to generate headlines, it is reasonable to expect that the generated headlines will be conditioned by the style of the source used for training. We devised and experimented with an automatic methodology to evaluate the capability of a Natural Language Generation system to learn a specific text style, coupling generation and classification under a variety of training and testing settings.

4.4.1 Approach

The main idea is to use a classifier to assess the quality of automatically generated headlines. We trained two models to generate headlines for newspaper articles coming from La Repubblica and Il Giornale, expecting the generated headlines to carry some newspaper-specific characteristics. At the same time, we trained a discriminator on the gold headlines that learns to classify whether a headline belongs to one newspaper or the other; good performance indicates the ability to distinguish the sources. Figure 4.5 shows an overview of the approach.


Figure 4.5: Red: generation task. Blue: classification task. Darker: training. Lighter: testing.

Models We used a Pointer Generator Network with Coverage Attention [100] as the generative model, with the same configuration used in the previous experiment, and an LSTM-based architecture with word and character encoders as the classifier, as in Section 3.3, using embeddings trained with word2vec [82] on the itWac corpus. We sampled a validation set equal to 10% of the data for model selection and fine-tuning; we used cross-entropy as the loss function and the Adam optimizer. A simplified sketch of the classifier is shown below.
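The sketch uses Keras; layer sizes are illustrative assumptions, and the character-level encoder of Section 3.3 is omitted for brevity:

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, MAX_LEN = 50_000, 300, 30  # illustrative sizes

inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
# In the actual setup the embedding matrix is initialised with word2vec
# vectors trained on itWac, e.g. via
# embeddings_initializer=tf.keras.initializers.Constant(w2v_matrix).
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # Repubblica vs Giornale

clf = tf.keras.Model(inputs, outputs)
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The discriminator sees only headlines (gold at training time, generated at test time), so a high classification accuracy on generated headlines indicates that the generator has picked up newspaper-specific style.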

4.4.2 Data

The dataset was composed of La Repubblica and Il Giornale samples, equally represented; the training partition contained 70% of all available news (in terms of documents). We further processed the data in order to disentangle potential topic biases present in the newspapers and highlight newspaper-specific characteristics.
