Academic year: 2021

Condividi "Developing and Experimenting Approaches for DeepFake Text Detection on Social Media"

Copied!
187
0
0

Testo completo

(1)

Computer Engineering

Developing and experimenting approaches for DeepFake text detection on Social Media

Candidate:

Margherita Gambini

Supervisors:

Dott. Maurizio Tesconi

Dott. Fabrizio Falchi


Abstract

Social media, by connecting people and spreading ideas, have always been the perfect vehicle for shaping and altering public opinion through bots, i.e. agents that behave like human users by liking, re-posting and publishing machine-generated messages that trick people into believing they are human-written. Even the cheapest text generation techniques (e.g. the search-and-replace method) can deceive humans, as the Net Neutrality scandal (2017) proved. Meanwhile, more powerful generative models have been released, from RNN-based methods to the GPT-2 language model: these deep neural networks can produce "deepfake texts", i.e. autonomously generated, non-formulaic texts that resemble human-written content. Even though deepfake texts can already be found on social media, no misuse episode involving them has been reported yet; still, this generative capability is deeply worrying, and it is necessary to continuously probe language generators' ability to produce human-like social media texts by developing appropriate detectors. In particular, this work aimed at developing a GPT-2-based detector and testing it over a dataset composed of both human-written and machine-generated tweets produced by GPT-2, RNNs and other unspecified deep generative models. The results are satisfactory: our detector can discriminate deepfake texts from human-written ones with an accuracy of about 91%.


Contents

1 Introduction
  1.1 Deepfakes: a Growing Threat
    1.1.1 Deepfakes Definition
    1.1.2 Deepfakes: the Good and the Bad
      The Good: Creativity, Education and Entertainment
      The Bad: Fraud, Humiliation and Misinformation
    1.1.3 Deepfakes to Show Potential Misuse
    1.1.4 Deepfakes Public Accessibility
    1.1.5 Deepfakes Detection
  1.2 Social Media: Spreading Ideas and Mis-Disinformation
  1.3 Contribution of This Thesis

2 Evolution of Natural Language Generation
  2.1 Natural Language Generation Definition
  2.2 The Traditional NLG Pipeline
  2.3 Language Models for NLG
    2.3.1 N-Gram models
      Text Generation
      Drawbacks
    2.3.2 Recurrent Neural Network
      Text Generation
      Drawbacks
    2.3.3 Long Short-Term Memory
      Text Generation
      Drawbacks
    2.3.4 Conditional Language Model
      Drawbacks
    2.3.5 The Attention Mechanism with Neural Networks
      RNNs
      Seq2Seq
  2.4 Transformer
    2.4.1 Attention Mechanism on its own
      Self-Attention
      Multi-Head Attention
      Masked Self-Attention
    2.4.2 Transformer: Encoder-Decoder
      How the Transformer Works
    2.4.3 GPT-2 Transformer
      The Model
      The Training Dataset
      The Input Representation: BPE
      The Creation of Written Content
    2.4.4 GROVER: Extending GPT-2 to Generate News
    2.4.5 CTRL: Controllable Generation
  2.5 Other Deep Learning Generative Methods
    2.5.1 GAN
      Text generation process is discrete
      Reinforcement Learning
      Gumbel-Softmax
      Working on the Continuous Space
    2.5.2 VAE
      OPTIMUS
  2.6 Sampling the Next Token from Multinomial Probability Distribution

3 Deepfake Text Detection
  3.1 Automated Machine-Generated Text Detection Systems
    3.1.1 Standard Machine-Learning Text Classifiers
      Features Extraction
      Popular Machine Learning Text Classifiers
      Deep Learning Text Classifiers
      Machine-Learning Text Classifier as a Baseline
    3.1.2 Linguistic-Statistical Detection
    3.1.3 Zero-Shot Detection
      GPT-2 zero-shot downstream tasks
    3.1.4 Fine-Tuning Detection
  3.2 Automated Machine-Generated Text Detection Results
    3.2.1 OpenAI Baselines
    3.2.2 Grover Detector
    3.2.3 GLTR: Human-Machine Teaming
    3.2.4 OpenAI RoBERTa Detector
    3.2.5 Energy-Based Model Detector
    3.2.6 Factors Affecting the Detection
      Sampling Techniques
      Length of Text Samples
      Conditioned or Not Conditioned Text Samples
      Detector Model's Size
      Generalization Capabilities over Datasets and Architectures
      Machine-Learning Classification vs Zero-shot vs Fine-Tuning Approaches
  3.3 Automated Machine-Generated Text Detection on Social Media
      Social Networks
      Consumer Review Networks
    3.4.1 About Social Media

4 Datasets
  4.1 Twitter Dataset
    4.1.1 Data Collection
    4.1.2 Train, Validation and Test Sets
  4.2 Standard Sentiment Treebank v2 Dataset
    4.2.1 Train, Validation and Test Sets
  4.3 GROVER Dataset
    4.3.1 Train, Validation and Test Sets

5 GPT2-Based Detector
  5.1 Choice of GPT2-Based Detector
  5.2 Model construction
    5.2.1 GPT-Based Classifier: the Starting Point
    5.2.2 A Different Classifier Module
    5.2.3 GPT2-Based Detector: the Complete Architecture
      GPT-2 Model
      Text Prediction Head
      Task Classifier Head
      Code Influence
  5.3 Parameters and Architectural Choices
    5.3.1 GPT-2 Model's Size
    5.3.2 Gradient Descent Learning
    5.3.3 Gradient Descent Optimizer Choice
    5.3.4 Scheduler Choice
  5.4 Model's Validation
    5.4.2 GROVER Dataset

6 Experiments
  6.1 Experiments' Design
  6.2 Results Experiment 1: Default Values
    6.2.1 Results per Tweet-Generator
  6.3 Results Experiments 2 and 3: Investigating Some Parameters
    6.3.1 Results per Tweet-Generator
  6.4 Results Experiment 4: Without Training on GPT2-Tweets
    6.4.1 Results per Tweet-Generator
  6.5 Results Experiment 5: Without Training on RNN-Tweets
    6.5.1 Results per Tweet-Generator

1 Introduction

During the last decade, social media platforms - developed to connect people and let them share ideas and opinions through multimedia content (images, video, audio and text) - have also been used to manipulate and alter public opinion through bots, i.e. computer programs that control a fake social media account as a legitimate human user would: by "liking", sharing and posting old or new media, which may be real, forged through simple techniques (e.g. video editing, gap-filling texts and search-and-replace methods) or deepfake. The generation and sharing of deepfake multimedia over social media - tricking people into believing that they are human-generated - have already caused distress in several fields (such as politics and science). It is therefore necessary to continuously probe generative models' ability to produce deceiving multimedia by enhancing and developing appropriate detectors. This is more necessary than ever in the text generation field: in 2019, for the first time, a generative model (Radford et al.'s GPT-2 [1]) showed text generation capabilities that deeply worry the research community. Deepfake social media texts (GPT-2 samples included) can already be found, but no misuse episode involving them has been reported yet; therefore, we are still in time to study and raise shields against machine-generated social media texts. This is the purpose of this thesis.

The next sections offer a broad view of the deepfake threat, along with the misuse of social media enabled by their connecting and sharing structure.


1.1 Deepfakes: a Growing Threat

1.1.1 Deepfakes Definition

The term "Deepfake" - a portmanteau of "deep learning"1 and "fake" - refers to AI-generated multimedia (images, videos, audio, and texts) that are potentially deceptive [3], although good uses of deepfakes can be found [4]. Deepfake technologies first arose in the computer vision field2 [5, 6, 7, 8], followed by effective attempts at audio manipulation [9, 10] and text generation [1]. Deepfakes in computer vision usually deal with face manipulation - such as entire face synthesis, identity swap, attribute manipulation, and expression swap [11] - and body re-enacting [8]. Recently, audio deepfakes have involved the generation of speech audio from a text corpus using the voice of different speakers after 5 seconds of listening time [10]. In 2019 Radford et al. developed GPT-2 [1], an unsupervised language model3 which can autonomously generate coherent, human-like paragraphs of text [13]; in the same year, Zellers et al. [14] contributed to text generation with GROVER, a new approach for efficient and effective learning and generation of multi-field documents such as journal articles. Soon after, Keskar et al. released CTRL [15], a conditional language model which uses control codes to generate text with a specific style, content, and task-specific behaviour.

1 "Deep learning" is a family of machine-learning methods that use ANNs (Artificial Neural Networks) to learn a hierarchy of representations of the input data, from low-level to highly non-linear feature representations. Deep learning is employed in computer vision and computer text generation, whose major achievements are, respectively, ImageNet [2] and GPT-2 [1].

2 "Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the human visual system can do." https://en.wikipedia.org/wiki/Computer_vision

3 "Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence." Further details about language modeling for text generation can be found in chapter 2. GAN-based text generation had been studied, but it was not as effective as the language modeling technique [12].


1.1.2 Deepfakes: the Good and the Bad

As with all technological innovations, deepfakes can bring both benefits and pitfalls to human lives.

The Good: Creativity, Education and Entertainment

Positive examples are to be found in the artistic, educational and entertainment fields: researchers at Samsung's AI lab in Moscow transformed Da Vinci's Mona Lisa into a video [16][17], Dalí was brought back to life at the Dalí Museum in Florida [4], a Scottish company (CereProc) was able to create the 'lost' audio of the speech John F. Kennedy was due to give the day he was assassinated [18] - a technology that could be useful to give a voice back to people who lost it due to illness or accident - and so on [19]. Other possible good uses of deepfakes can be conceived, from the medical field [20, 21] to journalism [22].

The Bad: Fraud, Humiliation and Misinformation

Unfortunately, deepfake technology is also a major menace, as it supports the spreading of fake news and deceptions by playing a dangerous game with transparency, authenticity, and trust. Indeed, the term "deepfake" was coined with malicious intent by a Reddit user who employed deep learning to swap the faces of celebrities onto the bodies of actresses in pornographic videos. Initially, the deepfake technique was mainly used for deepfake pornography - primarily targeting female celebrities [23] - but it was soon employed to cause financial and political distress too.

Cheap-fakes  Since 2017, it has been shown that even "cheap-fake" multimedia (e.g. synonym replacement in text, low-tech doctored videos) can steer public opinion and politics very effectively: in 2017, around 96% of the 22 million comments on the FCC's public comment website about net neutrality were fake [9] - most of them were duplicated comments, and 1.3 million comments used synonym replacement [24] - and they supported both sides of the debate, although the pro-repeal side was preponderant. Jeff Kao [24] showed that only 3-4% of the total comments were real, 99.7% of which supported net neutrality. The FCC ultimately voted in favour of repealing net neutrality: the concern is that fake anti-net-neutrality comments might have been used to justify the repeal [25]. Editing a video just by cutting some frames or manipulating its speed can have a powerful effect on people too: in 2018 a Pakistani awareness-raising video about child kidnapping was stripped of the frames which would have explained that the kidnapping was staged. The video was then spread on WhatsApp in India, inspiring vigilante mobs to protect children from whoever stopped to talk to them; at least 20 individuals were lynched or beaten to death in incidents across India [26]. Again, in 2019 the American politician Nancy Pelosi was the victim of a low-tech doctored video: in this case, the video's speed was slowed down to spread the false claim that the politician slurred her words [27].

Real Deepfakes  If even "cheap-fakes" can achieve such results, deepfakes - which it is much more reasonable to believe are real4 - have what it takes to be a calamity for individuals and for national security. Examples of this threat can already be found: in March 2019 a deepfaked voice was used to trick the CEO of a UK energy firm into believing that his boss was calling [29]. The scammed CEO recognized the voice as belonging to his boss - it had a subtle German accent and carried "the man's melody" - so he had no doubts or misgivings when he transferred €243,000 to the bank account of a Hungarian supplier as the "boss" had ordered. Another high-repercussion deepfake case dealt with pornography at the expense of the investigative journalist and writer Rana Ayyub [30]: it all started with a disinformation campaign to discredit her (through images of fake tweets, badly doctored if examined by an expert eye), then a deepfake porn video and doxing followed. The journalist has suffered from mental and physical distress ever since.

Even buggy and flickery video deepfakes have caused concern among the public: this is the case of the 2019 Matteo Renzi deepfake video posted on the Twitter page of the Italian satirical news show "Striscia la notizia" [31]: the video is so outrageous that it is clearly a parody, but a few Twitter users fell for it. The video was convincing enough at a glance, and that is all that is needed to spread misinformation.

4 As Hany Farid, professor at the University of California, stated: "In January 2019, deepfakes were buggy and flickery. Nine months later, I've never seen anything like how fast they're going. This is the tip of the iceberg" [28]

Deepfakes Sow Mistrust  What is important to notice is that merely knowing that deepfake technology exists is enough to sow fear, uncertainty, and mistrust, which can lead people to question everything [32] and anyone. A clear example of deepfake fear is to be found in Gabon, where a video of president Ali Bongo delivering his New Year's address raised doubts among some Gabonese. In late 2018 Ali Bongo had not been seen in public for months, which made some people suspect that he was ill or even dead and that the government was covering it up. To end the speculation, the government released an announcement saying that Bongo had suffered a stroke but remained in good health, followed by the video of Ali Bongo's New Year's address. Instead of reassuring, the video did exactly the opposite: soon after, some people started to debate the authenticity of the video - Bongo looked odd in his eye and head movements - and one week later an unsuccessful coup was launched by the military, citing the video as part of the motivation. To date, no one has ever found evidence of manipulation in Ali Bongo's New Year's address video, but the doubt remains. An additional case of mistrust is the publication of a video allegedly showing the Malaysian economic affairs minister Azmin Ali in a tryst with another minister's male aide [33]. Again, no proof of deepfakery has been found in that 35-second clip, but the suspicion was enough to spark a debate about it.

1.1.3 Deepfakes to Show Potential Misuse

Apart from deepfake pornography and the cases above, most deepfake content is labeled as such and used to show the potential misuse and threat of this deep learning technique: obvious examples are Buzzfeed's Obama speech [34] - produced using Adobe After Effects and the proprietary desktop application FakeApp [35] - the numerous movie clips with swapped actors [36], the GPT-2 samples showing its ability to craft passages in the style of a New Yorker article in seconds [37], the fake-news website containing news articles generated by Grover, Grover's multi-field document samples [38], machine-generated tweets5, and so forth. An important proof of deepfake text generation capabilities is a 2019 study [39] on a public comment process in the USA: Max Weiss, a senior student at Harvard University, generated around one thousand comments with GPT-2 and submitted them to the "Centers for Medicare & Medicaid Services" (CMS) website, proving that federal public comment websites are not secure against bot attacks. The submitted deepfake texts made up over 55% of the 1,810 total comments posted. Weiss then asked people to tell whether the comments on the CMS website were generated by a bot or written by a human. Participants' accuracy was close to random guessing, proving that people are not able to distinguish machine-generated text from human-written text.

5 e.g. https://twitter.com/botustrump

1.1.4 Deepfakes Public Accessibility

Initially, deepfake technology was accessible only to academics and skilled machine-learning practitioners, but in 2018 commercial development of deepfake apps usable by ordinary people began to rise. Examples of these commodity deepfake apps are FakeApp [35], Faceswap [40], the more recent command-line based DeepFaceLab [41] and Zao [42], an application developed by the mobile app giant Momo which allows users to superimpose their faces on actors' bodies in a TV or movie clip by submitting just one picture. The aim is entertainment, but misuses are immediately found, as the DeepNude app case shows [43]. Zao's ease of use - just one picture of the target person is required - and effectiveness are raising awareness of the deepfake threat, which affects not only public figures but ordinary people too.

1.1.5 Deepfakes Detection

Of course, deepfake detection strategies are continuously being developed - from deepfake video to audio and text detection methods6 - but Hao Li, professor at the University of Southern California and CEO of Pinscreen, says that "at some point, it's likely that it's not going to be possible to detect [AI fakes] at all. So a different type of approach is going to need to be put in place to resolve this" [47]. Professor Li said so because deepfake videos are usually generated with a GAN approach7, with the corresponding detectors exploiting flaws in deepfake generation. The detectors just replicate the GAN dynamic in the wider research landscape, where each new deepfake detection paper gives the deepfake makers a new challenge to overcome, as happened in the eye-blinking case [48, 47]. This could result in an endless race between deepfake video generation and detection, like the endless fight between computer viruses and antiviruses. A different deepfake video detection approach must soon be devised. However, the conflict between generators and detectors could find an end in the deepfake text field, as current machine text generation techniques exploit language modeling rather than the GAN approach. In any case, the deepfake text battle is still ongoing too.

Furthermore, Delip Rao - VP of research at the AI Foundation - noted that even deepfake detectors with an accuracy of around 97% may not be enough at the scale of internet platforms [47], which deal with millions of multimedia contents every day. He suggested that, first and foremost, social media platforms should make further efforts to help deepfake detection, and that automatic deepfake detection should work together with human education.

In view of the 2020 US election, on January 7th 2020 Facebook announced [49] a new policy banning AI-manipulated "deepfake" videos, but not "cheap-fakes", i.e. videos made using conventional editing tools. On the other hand, Twitter and YouTube will adopt a banning policy on all kinds of manipulated media, both deepfakes and cheapfakes, by banning or labeling faked pictures, videos, and other media that are "deceptively shared" [50, 51].

7 This is a type of machine learning system comprised of two neural networks operating in concert: one network generates the fake and the other tries to detect it, with the content bouncing back and forth and improving with each volley.


1.2 Social Media: Spreading Ideas and Mis-Disinformation

Social media are computer-mediated technologies that allow people to interact with each other in real time by creating and sharing information, ideas, career interests and other forms of expression via virtual communities and networks [52, 53]. Over the last decade, social media usage has increased dramatically [54], becoming part of the fabric of everyday life for one-third of the world population. Some social media, like Twitter, were conceived from the beginning as information networks [55, 56] - to express and share users' ideas and thoughts - unlike other platforms, like Facebook, which were born as social networks connecting family and friends, although recently Facebook has been used as a means to spread opinions as well. Besides, over the last few years more and more people have been relying on social media as a source of news [57, 58, 59, 60, 61]: in 2019, the Pew Research Center estimated [62] that 55% of US adults get their news from social media either "often" or "sometimes".

The sharing of ideas and opinions, and the distribution and consumption of online news on social media, have both positive and negative effects. On one side, discussion and confrontation help people grow, and the presence of worldwide news on immediate social platforms keeps the population connected and aware of the context it lives in. On the other side, the spreading of thoughts and of old or new beliefs - aided by the fact that social media usually connect a person with others who share common interests (the "following"/"followers" practice) - contributes to fuelling dormant extremist movements [63, 64] and to no longer believing in science and facts [65], to cite a few examples. Additionally, the change in how we consume news [60] - by just "liking", "retweeting" or "sharing", by reading only the headline, or by paying attention only to immediate media like audio or video without reading carefully and fact-checking - has helped "fake news" websites get more attention with "sensational headlines and ridiculous story lines". As Aral et al. [66] affirm, "falsehood diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information".


Twitter case  Even though Twitter is not the most popular social media platform, it is often in the spotlight: terrorist and extremist movements are more likely to organize and gain a following on Twitter [67, 68, 69] thanks to the possibility of anonymously sharing and receiving information; examples of rumours, false flags, misinformation and fake news on Twitter can be readily pinpointed, from medical issues (e.g. Ebola in 2014 [70], COVID-19 in 2020 [71]), to events (the Boston Marathon bombing in 2013 [72]), to politics (the 2016 US presidential election [73, 74]), to personal harassment [75].

1.3 Contribution of This Thesis

The previous sections have shown how deepfakes and social media's capacity8 for spreading ideas and news could result in a great threat to people as individuals (e.g. personal harassment, mistrust of everything we see and read) and to national/international security (misinformation diffusion, disinformation campaigns, etc.). Although, to the best of our knowledge, there are still no episodes of deepfake text misuse - machine-generated text misuse so far has involved cheap-fake texts, as in the Net Neutrality scandal and the 2016 US presidential election9 - it is necessary to stay vigilant. Thus, it is crucial to continuously assess the capabilities of deepfake text generators - both current transformer-based text generators (such as GPT-2 [13], GROVER [14] and CTRL [15]) and older techniques involving other deep neural network typologies (like RNN-LSTM [77] and Markov chains [78]) - applied to social media messages, by employing both human judges and automatic machine-generated text detection methods. Unless differently specified, from now on the term "machine-generated" will refer to new natural language texts autonomously generated by an AI model, without the use of schemes, templates or non-linguistic data inputs (only natural language inputs are allowed, but not with the aim of summarization or translation). This is the definition of "deepfake texts".

Most social media research (about bots, disinformation campaigns, etc.) is conducted on Twitter, which is always in the public eye and, unlike Facebook, has an open API to crawl with. This study will therefore focus on tweets.

8 Twitter in particular.

Of course, this detection research requires a dataset of human-written and machine-generated tweets. Currently, no one has ever created a properly labelled Twitter dataset containing only human-written and machine-generated tweets: the ones used for detecting "auto-generated" tweets (like in Garcia et al. [79] and Lundberg et al. [80]'s works) usually include tweets coming from all kinds of bots, and their labelling is done by humans at the Twitter-account level. The fact that they comprise a large variety of bots [81] (spam bots, social bots, sockpuppets, cyborgs) means that the detection task is not focused only on machine-generated tweets which comply with the definition given above. Besides, human labelling made at the account level is not reliable: a study conducted by Cresci et al. [82] has shown that it is actually not possible for humans to understand whether an account is a bot or not by looking at the tweets it produces. Therefore, unfortunately, these works on the detection of "machine-generated" tweets cannot be taken into account. However, Fagni et al. (2020)10 have recently created a Twitter dataset11 (see section 4.1) which does not rely on human labelling: it has been built by manually taking tweets from verified real users (e.g. @realDonaldTrump) and from accounts labelled as machine-generated12 (e.g. @DeepDrumpf). These researchers are also currently experimenting and creating baselines for the detection of AI-generated tweets belonging to this dataset, expecting a wide diffusion of social bots that can autonomously generate new social media messages in the (near) future: older techniques such as Bag-of-Words and Char-CNN, as well as recent methods such as transformer-based detectors (BERT, RoBERTa, XLNet, etc.), are tested. The purpose of this thesis is, thereby, to develop and experiment with approaches for automatic deepfake text detection on Fagni et al.'s dataset, using, in particular, a GPT-2-based method.

10 This work hasn't been published yet.
11 Which was kindly lent to me: https://www.kaggle.com/mtesconi/twitter-deep-fake-text
12 Which comply with the definition "natural language contents autonomously generated by an AI model".


2 Evolution of Natural Language Generation

2.1 Natural Language Generation Definition

Figure 2.1: Natural Language Processing, Natural Language Understanding and Natural Language Generation diagram

In general terms, NLG (Natural Language Generation) and NLU (Natural Language Understanding) are sub-fields of the more general NLP domain that encompasses all software which interprets or produces human language, in either spoken or written form.

• NLU reads human language and turns its unstructured data into structured data understandable to computers. In simpler words, NLU models can read but not write. Some examples are question answering, sentiment analysis, relation extraction and semantic parsing.

• NLG, based on structured data, generates an output (written text or speech) in a way that is easily understandable by humans. In simpler words, NLG models can write but not read. Some examples are the generation of analytic reports in natural language and the creation of autonomously written content like articles and stories.

• In practice, NLU and NLG are strongly related, and they can both be used to build models which can read and write content in the form of natural language (fig. 2.2). Some combined NLU and NLG application examples are the creation of written content like articles and stories from a text input, summarization, language translation, and automatic spelling, grammar and text correction. In the following sections, unless otherwise specified by the context, the term NLG will be used to refer to systems that generate text (not audio) from both linguistic and non-linguistic data inputs.

Figure 2.2: NLU and NLG flow

So far, NLG has gone through two different stages: at the beginning, text generation was based only on templates or rule-based systems, which were fairly interpretable and well-behaved for specific, narrow applications. But scaling such systems required infeasible amounts of hand-engineering effort; simply put, natural language doesn't have clear-cut, well-defined rules, which is why it is frequently compared to a wild west. For this reason, aided by progress in machine learning, the field moved to language models. The following sections will quickly go over the traditional text generation pipeline used during the first NLG phase, mainly for data-to-text generation, and then carefully review the more advanced and capable language-model techniques (focusing on the latest and most talked-about NLG task of autonomously writing new human-like texts).

2.2 The Traditional NLG Pipeline

At the end of the twentieth century and the beginning of the twenty-first century, some researchers [83,84] presented a standard pipeline architecture for the development of NLG systems. This architecture comprises six components, each performing an important task to generate a coherent output1.

• Content Determination is the problem of deciding which information should be included in the text under construction, and which should not. Typically, the input data contain more information and/or are more detailed than what we want to express through text2. This step has to take into account several issues: the communicative goal of the text, the size and level of detail of the generated text, and how unusual or unexpected the information is.

• Document Structuring is strongly related to Content Determination, as it involves deciding the order and grouping (for example into paragraphs) of sentences in a generated text.

• Sentence Aggregation is the task of combining multiple sentences into a single one. This is needed because, typically, not every sentence produced in the Document Structuring phase needs to be expressed as a separate sentence. Sentence Aggregation makes the generated text more fluid and readable. In fact, this step could be performed in all the stages of the NLG process except Content Determination and Linguistic Realization.

1 For extra details, Santhanam et al. (2019) [85] and Gatt et al. (2018) [86] wrote comprehensive reviews of Natural Language Generation techniques.
2 E.g. the input data comprise four different pieces of information: "the baby is being given morphine via an IV drip", "the baby's heart rate shows bradycardias", "the baby's temperature is normal", "the baby is crying". Which information should be included in the generated text?

Figure 2.3: Example of the NLG pipeline in the neonatal intensive care domain, taken from [86]. (a) Content Determination: first, the system has to decide which are the most important events/pieces of information to work on. (b) Document Structuring: then it has to decide in which order to present the data, i.e. how to structure the output document. (c) Aggregation, Lexicalization and Referring Expression Generation follow (in this case only Lexicalization is shown). (d) Linguistic Realization: finally, the output sentences are generated.

• Lexicalization is the task of choosing the right words to express the contents of the message. Lexicalization comprises two categories: Conceptual Lexicalization and Expressive Lexicalization. Conceptual Lexicalization is the act of converting data into linguistically expressible concepts3, while Expressive Lexicalization concerns how the lexemes4 available in a language can be used to represent a conceptual meaning.

• Referring Expression Generation (REG) focuses on the creation of referring expressions5 that identify specific entities called targets. This task can be split into two sub-steps: Content Selection determines which set of properties distinguishes the intended target, and the Linguistic Realization part defines how these properties are translated into natural language.

• Linguistic Realization is the task of ordering the different parts of a sentence and using the right morphology, along with function words (such as auxiliary verbs and prepositions) and punctuation marks, to produce a syntactically and orthographically correct text. The main approaches to Linguistic Realization are: Templates6, Hand-Coded Grammar-Based Systems - which use a small set of hand-crafted grammar rules as features to generate alternative representations of the output text - and Statistical Approaches - which employ statistical techniques either to filter out alternative realizations (generated with a small hand-crafted grammar) or to generate the alternatives themselves (in place of hand-crafted grammar-based systems).

3 Semantic entities like motion, manner, path, cause, etc., and surface entities like the verb and the satellite.
4 A lexeme is a unit of meaning in a language, consisting of a word or group of words.
5 A referring expression (RE), in linguistics, is any noun phrase, or surrogate for a noun phrase, whose function in discourse is to identify some individual object (thing, being, event...).
6 Consider the template "[person] is leaving [country]": in this scenario the person and country values can be replaced by the system during the output phase.

This pipeline is important to understand because it shows the milestones and challenges of natural language generation. Until machine learning techniques were applied to the NLG field, those six sub-tasks were arranged and organized in different ways to devise various text generation approaches. These techniques evolved and varied significantly together with technological development: from the simple gap-filling approach - which used hand-crafted, rule-based methods - to the more recent, more accurate stochastic and domain-independent ones that rely on corpus data7.

Nonetheless, it is with language model research (typically used for NLU tasks), along with machine learning and deep learning improvements, that NLG made giant leaps in both data-to-text and text-to-text tasks. But while data-to-text generation still relies on that pipeline8, and mainly on schemes or templates to structure the text output, text-to-text generation has reached higher peaks by employing "end-to-end" machine learning without separate stages. The latter field is highly interesting, since recently the GPT-2 language model [1] succeeded in autonomously generating new, coherent, human-like corpora (stories and articles) given just a short sentence as input. This particular task has gained a lot of attention during the last few years, as we are entering an era in which AI is able to write like a human. As extensively shown in the Introduction chapter, this AI capability made it clear that fake news, disinformation campaigns, personal harassment, etc. are no longer a human-only issue. Therefore, it is important to understand how language models and deep learning methods led to GPT-2's development - specifically, how an AI can generate new and correct texts on its own - in order to devise appropriate detection techniques. These are the subjects of the next sections.

7 A comprehensive description of all these approaches can be found in Gatt et al.'s work [86].
8 Even though language modeling and machine learning can be used in some steps. [87,
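To make the template-based realization approach described above concrete, here is a minimal, hypothetical sketch: the "[person] is leaving [country]" template comes from the footnote, while the function and slot values are illustrative placeholders of mine, not material from the thesis.

```python
# Minimal illustration of template-based linguistic realization.
# The template is the "[person] is leaving [country]" example from the
# footnote above; the slot values are invented for illustration.

TEMPLATE = "{person} is leaving {country}."

def realize(template: str, **slots: str) -> str:
    """Fill the template's gaps with the values chosen during the
    earlier content determination / lexicalization steps."""
    return template.format(**slots)

print(realize(TEMPLATE, person="Mary", country="Spain"))
# -> Mary is leaving Spain.
```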

2.3 Language Models for NLG

Given example sequences of the form x = (x_1, ..., x_m), where each x_i comes from a vocabulary of tokens9 V = {x_1, ..., x_{|V|}}, the goal of Language Modelling is to learn the probability p(x). Since x is a sequence, it is natural to factorize this distribution using the chain rule of probability (Bengio et al., 2003 [89]):

p(x) = \prod_{i=1}^{m} p(x_i | x_{<i})    (2.1)

From this equation it can be immediately seen that language modelling decomposes into next-word predictions p(x_i | x_{<i}), i.e. the prediction of the word x_i given the sequence of words already present (x_1, ..., x_{i-1}).

Language Models have been extensively used for both text-to-data NLP tasks - like sentiment analysis, speech recognition, part-of-speech tagging etc. -, and text-to-text NLG tasks, which encompass machine translation, summarization and the recent field of autonomous creation of written content like articles and stories.

9 A token could be a word, a character, or a Unicode code point (as in Byte Pair Encoding).


Primarily, two types of Language Models exist:

• Statistical Language Models: these models use traditional statistical techniques, like N-grams and certain linguistic rules, to learn and construct the joint probability distribution of words.

• Neural Language Models: these models were first devised to solve the data sparsity problem of the statistical language models, and they have far surpassed the statistical ones in effectiveness. The main neural language models which became milestones for text generation are the RNN and its variations (LSTM, GRU) and the recent Transformer architecture.

Since neural networks always have parameters to determine, Maximum Likelihood Estimation10 can be employed: the parameter values are chosen so as to maximize the likelihood of the model producing the training observations (the tokens/words in the training text corpora) given those parameter values. In particular, current state-of-the-art neural network methods (e.g. Dai et al., 2019 [90] and Radford et al., 2019 [1]) train a neural network with parameters θ to minimize the negative log-likelihood11 (eq. (2.2)) over a vocabulary of tokens V = {x_1, ..., x_{|V|}}:

L(V) = - \sum_{k=1}^{|V|} \log p_\theta(x_i^k | x_{<i}^k)    (2.2)

The training data format differs from one language model to another, but it is generally composed of a set of text corpora (sentences, paragraphs or combined paragraphs, paired if necessary for the task), appropriately tokenized. Additionally, standard neural net training algorithms such as stochastic gradient descent with back-propagation (Bengio et al., 2008 [91]) are employed to train the neural language model.
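As a rough illustration of this training objective, the sketch below performs one stochastic-gradient step that minimizes the negative log-likelihood of eq. (2.2) for a generic next-token model; `TinyLM`, the layer sizes and the random batch are placeholders of mine, not the models or data used in this thesis.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: one gradient step that minimizes the negative
# log-likelihood of eq. (2.2) for a generic next-token language model.
class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                         # logits: (batch, seq_len, vocab)

vocab_size = 100
model = TinyLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (8, 20))      # toy training batch
logits = model(tokens[:, :-1])                      # predict token i from tokens < i
loss = nn.functional.cross_entropy(                 # cross-entropy == the NLL of eq. (2.2)
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()                                     # back-propagation
optimizer.step()                                    # stochastic gradient descent step
```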

10 A simple explanation of MLE can be found here.

11 Calculated by applying the equivalent cross-entropy error, for example. Maximizing the log-likelihood (MLE) is equivalent too. Taking the logarithm of an expression is always a way to simplify calculations - in this case the derivative used in the stochastic gradient descent training algorithm.


Let's now focus on the evolution of language models' ability to autonomously create new text corpora, which led to the much-talked-about GPT-2.

2.3.1 N-Gram models

The N-gram13 model [92] is a statistical language model, and the simplest one. The problem is to calculate the joint probability of a sequence of words, hence deriving the probability of a word x_i given its history h = (x_{i-1}, ..., x_1). A straightforward way to calculate the latter is by counting: "out of the times we saw the history h, how many times was it followed by the word x_i":

P(x_i | x_1, ..., x_{i-1}) = \frac{count(x_1, ..., x_{i-1}, x_i)}{count(x_1, ..., x_{i-1})}    (2.3)

To compute the exact probability of a word given a long sequence of preceding words, all available text corpora in the language (e.g. the text of every website) should be taken into consideration; but even then they would not be enough, because language is creative: new sentences are created all the time, and we may want to compute the probability of a word given a sentence history which does not exist in our dataset. For this reason, instead of computing the probability of a word given its entire history, the latter can be approximated by just the last N words: that's where the Nth-order Markov property comes in handy, i.e. predicting the probability of some future unit without looking too far into the past. Thus, the general equation for the N-gram approximation of the conditional probability of the next word in a sequence is:

p(x_i | x_{i-1}, ..., x_1) \approx p(x_i | x_{i-(N-1)}, ..., x_{i-1})    (2.4)

which can be estimated by applying MLE, i.e. counting the number of times each word appeared after the n-1 history words and dividing by the number of times those n-1 words appeared in the training data:

P(x_i | x_{i-(N-1)}, ..., x_{i-1}) = \frac{count(x_{i-(N-1)}, ..., x_{i-1}, x_i)}{count(x_{i-(N-1)}, ..., x_{i-1})}    (2.5)


Text Generation

Generating text is straightforward:

1. Look at the last N − 1 grams, i.e. the historic data.

2. Get the distribution of the next gram (word) by using eq. (2.5), i.e. get the probabilities of each vocabulary gram given the historic data.

3. Sample a gram from that next gram distribution. There exist different techniques to sample the next gram, which are described in section 2.6.

4. If either the <stop> word or a pre-defined sequence length is reached, then stop. Otherwise, loop over the above steps, updating the history with the generated gram at each turn.
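A toy sketch of these four steps for a bigram model (N = 2) follows; the tiny corpus and the <stop> marker are invented purely for illustration and are not the data used in this work.

```python
import random
from collections import Counter, defaultdict

# Toy bigram (N = 2) generator following the four steps above.
corpus = "the cat is walking in the bedroom . the dog is running in the room . <stop>".split()

# MLE estimate of eq. (2.5): count(w_{i-1}, w_i) / count(w_{i-1})
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(history: str = "the", max_len: int = 10) -> str:
    out = [history]                                    # step 1: the historic data
    for _ in range(max_len):
        dist = counts[out[-1]]                         # step 2: next-gram distribution
        if not dist:
            break
        words, freqs = zip(*dist.items())
        nxt = random.choices(words, weights=freqs)[0]  # step 3: sample the next gram
        if nxt == "<stop>":                            # step 4: stop condition
            break
        out.append(nxt)
    return " ".join(out)

print(generate())
```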

Drawbacks

However, the N-gram model shows some drawbacks that impact its performance:

• Sparsity problem: N-grams are a sparse representation of language. This is because the model is built based on the probability of co-occurring words. It will give zero probability to all the words that are not present in the training corpus. In these cases, either the numerator or denominator of eq. (2.5) would be zero. To avoid such situations, smoothing and back-off techniques are used:

1. if a gram never occurred in the historic data, N-gram assigns 0 probability (0 at the numerator of eq. (2.5)). Consequently, the probability distribution should be smoothed, as everything should have at least a small probability assigned to it.

2. if an (N − 1)-gram sequence never occurred in the historic data, the probability of the next gram cannot be calculated (0 in the denominator of eq. (2.5)). In this case, conditioning on a shorter history (i.e. using an (N − 1)-gram model instead) does the trick. This technique is called backoff.


• Curse of Dimensionality: this problem limits the size of the text corpora to work on. The fact is that, as language models are trained on larger and larger texts, the number of unique words (the vocabulary) increases, and the number of possible word sequences increases exponentially with the size of the vocabulary, causing the data sparsity problem above14. Thus, statistics are needed to properly estimate probabilities. Although the smoothing and back-off techniques reduce the data sparsity problem, they are still not enough.

• Nth-order Markov property: due to this simplifying assumption, the model is not as good at predicting the coming words as it would be if it kept the entire context (history). Therefore, incoherent text is usually produced (e.g. "while production of bag lasts and shoe industry , the stock market has risen just after").

It might be thought that a larger context (increasing N) would give a better predictor, but by increasing N the sparsity becomes a bigger problem, producing lower-quality texts, and it additionally requires more computational power.

• Exact pattern: the N-gram model has to rely on an exact pattern, i.e. on word/string sequence matching: for example, an n-gram model cannot recognize that a sequence like "the cat is walking in the bedroom" is syntactically and semantically similar to "a dog was running in the room".

It can be safely stated that N-gram performance is limited by the model's lack of complexity and context awareness. Simplistic models like this one cannot achieve fluency, enough language variation and a correct writing style for long texts.

2.3.2 Recurrent Neural Network

It is known that deep neural network models, with their continuous-space representation of tokens/words, work much better than the traditional (statistical) language models for NLP tasks [93]: they allow conditioning on increasingly large context sizes15 with only a linear increase in the number of parameters, support generalization across different contexts, and solve the sparsity problem16, thereby eliminating the curse of dimensionality [91]. Nonetheless, not all kinds of neural networks can work as language models (e.g. densely connected networks and convnets17), because they have no memory: each input shown to them is processed independently, with no state kept between inputs. With such networks, in order to process a sequence or a temporal series of data points (such as text), you have to show the entire sequence to the network at once (which is useless). In 2010, Recurrent Neural Networks addressed this issue [94, 95]. They are networks with loops in them, allowing information to persist.

14 A word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.

Figure 2.4: Recurrent Neural Network rolled. (Source github)

An RNN is therefore able to exploit the sequential nature of the input. At step i, the RNN receives both the representation vector18 of the i-th token (x_i) of the input sequence and the RNN's previous hidden state; those inputs go through a feed-forward network and the next hidden state is produced. The hidden state acts as the neural network's accumulated internal memory, holding information on previous data that the network has seen. This "memory" possessed by RNNs makes them great for language generation, as they can remember the context of the conversation over time19.

15 Due to the memory capability.
16 Learning a distributed representation for words allows the model to be informed by each training sentence about an exponential number of semantically neighbouring sentences (maybe never seen before). Furthermore, there is no need to store all observed n-grams.
17 Here it is explained very well why.
18 The encoding vector.
19 RNNs have been widely used for a variety of problems like speech recognition, language ...; more can be found here: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Figure 2.5: Recurrent Neural Network unrolled: a RNN can be thought of as multiple copies of the same network, each passing a message to a successor. (Source github)

Text Generation

In practice, RNN’s goal is to learn a general function (A in the RNN diagram in fig. 2.4) on how it should deal with a token given the context of previous tokens, while learning a general representation of language and context.

In every iteration of the RNN, the model stores in its memory the previously encountered words and calculates the probability of the next one. For example (fig. 2.6), if the model generates the text "We need to rent a ", it now has to figure out the next word in the sentence. For every word in the vocabulary V, the model assigns a probability based on the previous words it has seen20. In this example, the words "house" or "car" will have a higher probability than words like "sunny" or "promenade". By applying some sampling method (section 2.6) the next word is selected and stored in memory, and the model then proceeds with the next iteration.

Figure 2.6: Example of RNN text generation. (Source medium)
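The loop below is an illustrative sketch of this generation process with an untrained PyTorch RNN (so its output is random); the tiny vocabulary and sizes are my own placeholders, not the models used in this thesis.

```python
import torch
import torch.nn as nn

# Illustrative sketch (untrained, toy-sized): sampling the next words from
# a word-level RNN language model, as described in the text above.
vocab = ["we", "need", "to", "rent", "a", "house", "car", "sunny", "promenade"]
hidden = 32

embed = nn.Embedding(len(vocab), hidden)
rnn = nn.RNN(hidden, hidden, batch_first=True)   # the single recurrent layer
head = nn.Linear(hidden, len(vocab))             # dense layer over the vocabulary

def generate(prompt, steps=3):
    tokens = [vocab.index(w) for w in prompt]
    out, h = rnn(embed(torch.tensor([tokens])))  # h = accumulated "memory"
    for _ in range(steps):
        probs = torch.softmax(head(out[:, -1]), dim=-1)
        nxt = torch.multinomial(probs, 1).item()          # sample the next word
        tokens.append(nxt)
        out, h = rnn(embed(torch.tensor([[nxt]])), h)     # feed it back in
    return " ".join(vocab[t] for t in tokens)

print(generate(["we", "need", "to", "rent", "a"]))
```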

20 Thanks to a dense layer positioned after the single neural network layer of the RNN.


Drawbacks

There are cases in which an RNN performs really well, and others in which it doesn't. For example, it is easy to predict the last word in the sentence "The sun is shining in the sky". In such cases, where the gap between the relevant information and the place where it is needed is small, RNNs can learn to use the past information. But it is harder to predict the last word of the sentence "The weather during November is mainly sunny." when, 100 words earlier, there was the sentence "I am planning a trip to Cyprus.". Here the gap between the relevant information and the point where it is needed becomes very large. The fact is that, as the gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling "long-term dependencies": a programmer could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them. The reasons lying behind this are the so-called vanishing gradient problem and the linear decay of the context information. It is known that an RNN learns by updating its weights with the partial derivative of the error function. The problem is that, in some cases, the gradient will be vanishingly small: this translates into insignificant changes in the weights and thus no significant learning. In particular, the vanishing gradient issue is due to the use of non-linear activation functions (such as tanh) and to back-propagation, which computes gradients by the chain rule.

2.3.3 Long Short-Term Memory

LSTMs have a chain-like structure similar to RNNs; however, instead of having a single neural network layer like RNNs, LSTMs comprise four neural network layers interacting in a very special way. These components allow the LSTM to remember or forget words over arbitrary time intervals by regulating the flow of information into and out of the LSTM cell21 (fig. 2.7). In particular, the vanishing gradient issue is minimized by the presence of the forget gate and by the additive structure of the LSTM cell22.

21 The description of this new structure is out of the scope of this thesis, but detailed explanations can be found in this blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

(a) The repeating module in a standard RNN contains a single layer.

(b) The repeating module in an LSTM contains four interacting layers.

Figure 2.7: The RNN and LSTM repeating module structure. The cell's state is represented by the horizontal line running through the top of the diagram. The neural network layers are the yellow rectangles. (Source github)

Several variants of the LSTM have been developed (like the GRU), but Greff et al. [96] found that they all perform about the same.

Text Generation

Suppose that you give the text "I am from Spain. I am fluent in ___." as input to an LSTM model (fig. 2.8). To correctly predict the next word as "Spanish", the model focuses on the word "Spain" in the earlier sentence and "remembers" it using the cell's memory. This information is stored by the cell while processing the sequence, and is then used when predicting the next word. When the full stop '.' is encountered, the forget gate realizes that there may be a change in the context of the sentence, and the current cell state information can be overlooked. This allows the network to selectively keep track of only the relevant information, while also minimizing the vanishing gradient problem, which lets the model remember information over a more extended period.

Figure 2.8: Example of LSTM text generation. (Source medium)
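As a small, assumed sketch (not code from the thesis): the RNN sampling loop shown earlier works unchanged with PyTorch's LSTM module; the practical difference is that the recurrent state becomes a pair of hidden state and cell state, the cell state being the memory that the gates write to and erase from.

```python
import torch
import torch.nn as nn

# Toy illustration of the LSTM's two-part recurrent state.
lstm = nn.LSTM(input_size=32, hidden_size=32, batch_first=True)
x = torch.randn(1, 6, 32)            # a toy embedded sequence of 6 tokens
out, (h, c) = lstm(x)                # h: last hidden state, c: last cell state
print(out.shape, h.shape, c.shape)   # (1, 6, 32), (1, 1, 32), (1, 1, 32)
```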

Drawbacks

Unfortunately, the LSTM has flaws as well:

• memory issue: LSTMs and their variations (GRU etc.) seem to be the answer to the vanishing gradient problem for generating coherent sentences. However, there is a limit to how much information can be saved, as the path from previous cells to the current one is a complex sequential path. This limits the length of the sequences that an LSTM can remember to just a few hundred words.

• training: An additional pitfall is that LSTMs are very difficult to train due to high computational requirements. Because of their sequential nature, they are hard to parallelize, limiting their ability to take advantage of modern computing devices such as GPUs and TPUs.


2.3.4 Conditional Language Model

Let's set aside the memory and parallelization problems of LSTMs. The language models seen so far (n-gram, RNN, LSTM) can generate text, but are they able to generate text based on a condition other than the starting text (fig. 2.9)?

Figure 2.9: Examples of Conditional LM text generation. (Source towardsdatascience)

A language model is said to be conditioned if the goal is to assign probabilities to a sequence of words y given some conditioning context x (eq. (2.6)):

P(y | x) = P(y_1 | x) \cdot P(y_2 | y_1, x) \cdot P(y_3 | y_1, y_2, x) \cdots P(y_T | y_1, ..., y_{T-1}, x) = \prod_{t=1}^{T} P(y_t | y_1, ..., y_{t-1}, x)    (2.6)

This paradigm is called Seq2Seq Modelling [97], and in 2014 it was a big improvement for neural machine translation23. Its architecture is a single neural network model which comprises two RNNs (fig. 2.10):


• An encoder, which creates a fixed-length encoding (a vector of real numbers) that encapsulates information about the input.

• A decoder (the real conditional LM part), which is a language model that generates the target sequence conditioned on the encoding created by the encoder.

The decoder is trained with the teacher forcing method (as happens for standard RNN and LSTM training), which consists in forcing the inputs of the decoder to be the gold target tokens instead of its own predictions. For example, if the decoder predicts "a" instead of "Mary", the seq2seq network discards this output after calculating the error and feeds in "Mary" as part of the input at the subsequent time step.
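A hypothetical sketch of teacher forcing on a toy GRU decoder follows (the encoder summary vector, sizes and random targets are stand-ins of mine, not the thesis's setup): at every step the gold token, not the model's own prediction, is fed back as the next input, and a single loss is accumulated end to end.

```python
import torch
import torch.nn as nn

# Illustrative sketch of teacher forcing during decoder training.
vocab_size, hidden = 50, 32
embed = nn.Embedding(vocab_size, hidden)
decoder = nn.GRU(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

target = torch.randint(0, vocab_size, (1, 7))     # gold target sequence (toy data)
enc_state = torch.zeros(1, 1, hidden)             # encoder summary vector (stand-in)

loss, h = 0.0, enc_state
for t in range(target.size(1) - 1):
    x = embed(target[:, t:t + 1])                 # feed the *gold* token at step t
    out, h = decoder(x, h)
    logits = head(out[:, -1])                     # predict the token at step t + 1
    loss = loss + nn.functional.cross_entropy(logits, target[:, t + 1])
loss.backward()                                   # one end-to-end loss for the whole sequence
```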

Additionally, the backpropagation in Seq2Seq operates end-to-end24, learning the entire system with respect to a single loss (fig. 2.11).

Figure 2.10: Seq2Seq example for generating a message based on some search criteria. (Source towardsdatascience)

Drawbacks

The embedding created by the encoder becomes the initial state of the decoder. It's like forcing all of the information about the source sentence to be captured in this single vector, as this is the only thing that is given to the decoder. However, this practice runs into the information bottleneck problem: if some information isn't in that embedded vector, then there's no way the decoder is going to be able to, for example, translate the sentence correctly.

24 1st end: the beginning of the encoder RNN; 2nd end: the loss function value calculated at the decoder's output.

Figure 2.11: Seq2Seq loss backpropagation. (Source towardsdatascience)

The Attention mechanism described in the next section provided a solution to this problem while also helping with other problems including the vanishing gradient problem and the ”longer-term” dependencies (LSTM flaw).

2.3.5 The Attention Mechanism with Neural Networks

Generally speaking, the attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. Clarifying this concept with an example: when reading any text, we always focus on the word we read, but at the same time our mind still holds the important keywords of the text in memory in order to provide context (see fig. 2.12 and fig.2.13). That’s the ability that researchers successfully gave to neural networks - first to RNNs and Seq2Seq, then to Transformers25 - for the text generation task.

RNNs

A standard RNN builds up its understanding of the input sentence by looking at each token of that sequence. After seeing a token, it updates its memory cell with what it thinks represents the meaning of the whole sentence so far. Then, its next-word prediction is based on the accumulated memory. The more recently processed tokens tend to overpower earlier tokens. As a result, RNNs (LSTM and GRU included) don't work very well on long pieces of text.

Figure 2.12: Measurement of how strongly the words in the phrase "Abraham Lincoln was born in 1809" relate to "born". (Source medium)

Figure 2.13: Measurement of how strongly the words in the phrase "Abraham Lincoln was born in 1809" relate to "Lincoln". (Source medium)

Figure 2.14: Making RNNs better with Attention. (Source: medium)

With the attention mechanism, the model keeps track of its memory values after having seen each word (fig. 2.14). The prediction of the next word is based on both the current memory values and a weighted version of all the past memories. By weighting the old memory states differently, the model pays more or less attention to each word depending on how much it thinks that word will help in predicting the next one. This mechanism helps with the RNN and LSTM's short memory.

Seq2Seq

This mechanism has been successfully applied to the Seq2Seq architecture too, solving the information bottleneck problem elegantly.

Imagine the encoder and decoder modules as humans who want to translate a sentence from French to English: the encoder, instead of only writing down the translation of the sentence in the imaginary language (the representation vector), also writes down keywords that are important to the semantics of the sentence. Then, it gives them to the decoder in addition to the regular translation (in the imaginary language). These keywords make the translation much easier for the decoder, because it knows which parts of the sentence are important and which key terms give the context of the sentence.

In more technical terms, the attention mechanism assigns a score to each token of the processed source sentence. These scores represent how important each token is inside the sentence, given what the decoder has already produced. Then, the scores are summarized to help the decoder produce the next token in a more "intelligent" way. Comparing the old with the new Seq2Seq structure, this is what happens at step t (fig. 2.15):

1. the dot product of the decoder's last hidden state $s_t$ (step $t$) with every encoder hidden state $h_i$ is calculated in order to get the attention scores $e^t = [s_t^T h_1, \ldots, s_t^T h_N] \in \mathbb{R}^N$;

2. the attention distribution for the current step is obtained through a softmax layer, $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$;

3. the attention output (for step $t$) is calculated as the sum of the encoder hidden states weighted by the attention distribution values, $a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$;

4. finally, the attention output is concatenated with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, and then the model proceeds as in the non-attention version of Seq2Seq. This time, if for some reason some information isn't in the encoder's embedded vector, the attention mechanism provides it.

Figure 2.15: Attention on Seq2Seq architecture. (Source: towardsdatascience)
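The four steps above translate almost line by line into code. The following is a minimal PyTorch sketch (shapes and names are my own, not taken from the cited papers): `s_t` is the decoder hidden state at step t and `H` is the matrix of encoder hidden states.

```python
import torch
import torch.nn.functional as F

def attention_step(s_t, H):
    """s_t: (h,)   decoder hidden state at step t.
    H:   (N, h) encoder hidden states h_1 ... h_N."""
    e_t = H @ s_t                      # 1) attention scores, shape (N,)
    alpha_t = F.softmax(e_t, dim=0)    # 2) attention distribution, shape (N,)
    a_t = alpha_t @ H                  # 3) weighted sum of encoder states, shape (h,)
    return torch.cat([a_t, s_t])       # 4) concatenation, shape (2h,)

h, N = 4, 6                            # hypothetical sizes
out = attention_step(torch.randn(h), torch.randn(N, h))
print(out.shape)                       # torch.Size([8])
```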

In general, the attention mechanism brought significant performance enhancements to RNN and Seq2Seq text generation, but coherence across entire paragraphs of text still had to be improved. That's where Transformers come in.

2.4 Transformer

2.4.1 Attention Mechanism on its own

After noticing the giant enhancement that the attention mechanism brought to RNNs and Seq2Seq, Vaswani et al. (2017) [98] asked themselves whether Attention could perform well on its own (Self-Attention): maybe the way that Attention weights the predictive value of each previous word in the sentence is what really matters. Before diving into the Transformer structure, let's review some important concepts.

Self-Attention

Take the phrase "Abraham Lincoln was born in 1809" (see fig. 2.12 and fig. 2.13) as an example: if a model is trained to predict a specific missing word (e.g. "born") based on the remaining words in the sentence, it is also possible to measure how much each remaining word was responsible for that prediction, i.e. to obtain a measurement of how strongly each word in the sentence is related to the missing word "born". This procedure can be repeated for each word in the sentence to find how strongly related each word is to every other word. This is what is called self-attention, or an attention unit/block/module.
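A rough sketch of how such word-to-word "relatedness" scores can be computed with scaled dot-product self-attention (the formulation of Vaswani et al. [98]). The learned embeddings and projection matrices are replaced by random tensors here, so the printed numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tokens = ["Abraham", "Lincoln", "was", "born", "in", "1809"]
d_model = 16

# Stand-ins for learned token embeddings and query/key projection matrices.
X = torch.randn(len(tokens), d_model)
W_q, W_k = torch.randn(d_model, d_model), torch.randn(d_model, d_model)

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / d_model ** 0.5          # how much each token attends to every other one
attn = F.softmax(scores, dim=-1)           # each row is a distribution over the sentence

row = tokens.index("born")
for tok, weight in zip(tokens, attn[row]):
    print(f"{tok:>8}: {weight:.2f}")       # how strongly "born" relates to each word
```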

Multi-Head Attention

Any given word (in any language) can have multiple meanings and relate to other words in different ways. For example, the word "fair" has distinct meanings according to the context surrounding it: when used as an adjective, it can describe someone as agreeable, but it can also describe someone who has light skin or hair; as a noun, a "fair" is typically a local event that celebrates a certain person, place, or historical moment. How do we capture all these flavours of a word? The idea is to train different self-attention units in parallel for each single word (e.g. "fair") that is processed: each unit will pay attention to a single flavour of the word under consideration (fig. 2.16).

Now suppose we apply multi-head attention to the words of millions and millions of sentences scraped from the web: the model will learn how every word (in English) relates to every other word in every possible context.
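In practice, each head applies its own projection of the token representations before running self-attention, and the heads' outputs are concatenated and projected back. PyTorch ships a ready-made layer for this, used below on dummy data (the sizes are arbitrary assumptions, not values from the thesis).

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 32, 4, 6          # hypothetical sizes
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)          # one sentence of 6 token embeddings
out, weights = mha(x, x, x)                   # self-attention: queries = keys = values
print(out.shape)                              # (1, 6, 32): refined token representations
print(weights.shape)                          # (1, 6, 6): attention weight per token pair
```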


Figure 2.16: As we encode the word ”it”, one attention head is focusing most on ”the animal”, while another is focusing on ”tired” – in a sense, the model’s representation of the word ”it” bakes in some of the representation of both ”animal” and ”tired”. (Source github)

Masked Self-Attention

A normal Self-Attention block is bidirectional rather than auto-regressive: while processing a word in a sentence, it is allowed to peek at tokens²⁶ at both its left and its right. This kind of attention unit is used in the BERT language model [99].

However, other transformer language models, like GPT-2 [1], use the so-called Masked Self-Attention: while processing a word in a sentence, the model is allowed to peek at the words on its left only (fig. 2.17).

With standard Self-Attention the context of a word is represented more completely, but language models with Masked Self-Attention work pretty well too.
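A minimal sketch of the causal (left-only) mask that turns standard self-attention into masked self-attention: the scores of positions to the right of the current token are set to -inf before the softmax, so they receive zero attention weight. The variable names are my own.

```python
import torch
import torch.nn.functional as F

seq_len = 5
scores = torch.randn(seq_len, seq_len)        # raw attention scores (random, for illustration)

# Upper-triangular mask: token i may only look at tokens 0..i (its left context).
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(causal_mask, float("-inf"))

attn = F.softmax(masked_scores, dim=-1)
print(attn)                                   # each row i has zeros for every position j > i
```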

Actually, all Transformer model variants derive from the standard one conceived by Vaswani et al. [98]. That's why it's important to understand how this traditional Transformer works. Let's get into it.

²⁶ Unless differently stated, a token could refer to a word, a character or even Unicode code points (Byte Pair Encoding for GPT-2). But to understand the concepts easily, you can think of a token as a word.


Figure 2.17: Self-Attention and Masked Self-Attention. (Source github)

2.4.2 Transformer: Encoder-Decoder

As CNNs teach, if something works well on its own, why not try to stack it up? Generally speaking, then, the Transformer is an architecture for transforming one sequence into another, composed of a stack of multi-head attention units, each one producing a better, more refined representation of the tokens²⁷ in the sequence.

Of course, it's a little more complex than this: the first devised Transformer architecture comprises two parts, an Encoder and a Decoder²⁸. As can be seen in fig. 2.18, both the Encoder and the Decoder are composed of N stacked blocks, which can be referred to as the encoder block and the decoder block. Those blocks mainly consist of (Masked) Multi-Head Attention and Feed Forward layers.
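To make the block structure concrete, here is a stripped-down encoder block in PyTorch: multi-head self-attention, a position-wise feed-forward network, and the "add & norm" residual connections described below. It is a sketch of the idea with arbitrary sizes, not a faithful reimplementation of [98].

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)        # multi-head self-attention
        x = self.norm1(x + attn_out)            # add & norm (residual connection)
        x = self.norm2(x + self.ff(x))          # feed-forward, then add & norm
        return x

# Stacking N blocks, as the Transformer encoder does:
blocks = nn.Sequential(*[EncoderBlock() for _ in range(6)])
print(blocks(torch.randn(2, 10, 64)).shape)     # (2, 10, 64)
```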

Let’s quickly go over the Transformer’s main components:

• Maximum Context Size: it refers to the maximum number of tokens that an input sentence can be composed of. Looking at the Transformer's math, there is no hint of this input size limitation; the limit is instead due to the amount of memory needed to compute self-attention over a corpus of text. The memory needed by self-attention is quadratic in the length of the input, which means that increasing the maximum input length drastically increases the memory required. The maximum length/context size is therefore fixed at a convenient value so as not to go out of memory (plus, the token embeddings also occupy a lot of memory). Nonetheless, recent transformer-based language models can learn dependencies beyond a fixed length without disrupting temporal coherence (Transformer-XL [90]), or they introduce optimizations to handle much longer texts far more efficiently (Reformer [100]).

²⁷ Remember, a token could be a word, a char, or even Unicode code points like in GPT-2.  ²⁸ It differs from the previously described/existing Seq2Seq models because it does not

Figure 2.18: Encoder-Decoder Transformer architecture. The Encoder is the module on the left, while the Decoder is on the right. From 'Attention Is All You Need' by Vaswani et al.

• Output sequence shifted right and masked (fig. 2.19) Imagine that the task is French-to-English translation, where the input phrase is "Je suis étudiant" and the output phrase is "I am a student". The input phrase is fed to the Encoder part, while the output phrase is fed to the Decoder part. The output phrase is conveniently shifted right to prevent the model from learning the copy/paste task: if the decoder sequence were not shifted, the model would learn to simply 'copy' the decoder input, since the target token (word/character) to predict for position i would be token i of the decoder input. Thus, by shifting the decoder input one position to the right, our model needs to predict the target token for position i having only seen tokens 1, ..., i − 1 of the decoder sequence. The copy/paste task is also prevented by the use of a Masked Multi-Head Attention layer in the Decoder, instead of a standard Multi-Head Attention one: the idea is to hide all tokens from the i-th one that has to be predicted up to the end of the text corpus²⁹.

Figure 2.19: Output sequence shifted one right and masked.

• Input Embedding and Output Embedding (fig. 2.20) As text generation techniques require, the inputs and outputs (target sentences) are first embedded into an n-dimensional³⁰ space, since we cannot use strings directly. First and foremost, a vocabulary mapping "from token text to integer" of all the possible tokens has to be defined (e.g. "Example": 637). Those vocabulary integers are used as indexes into the real-valued Token Embeddings matrix, where each row contains a fixed-size vector representing the embedding of a certain token. The values of this matrix are learned during the language model training. Of course, if the task is translation there will be two vocabularies³¹ and two token embeddings matrices.

²⁹ For simplicity, in this section we talk about the text corpus as a sentence, but it could

Figure 2.20: The mapping between the vocabulary integers and the Token Embedding matrix.

• Positional Encoding (fig. 2.21) The information about the order of the sequence (the position of words in a sentence) is very important. Since there is no recurrence that can remember how sequences are fed into the model, this information about the absolute (or relative) position in a sequence is represented by means of "positional encodings" (e.g. built with sine/cosine functions). These encodings are added to the embedded representation (n-dimensional vector) of each token; a minimal sketch of sinusoidal positional encodings is given after this list.

• Multi-Head Attention In both the training and prediction³² phases, the Encoder and Decoder parts are given a sentence as input (the input sentence and the output sentence, respectively): the tokens of a sentence are processed in parallel through the encoder and decoder sides (fig. 2.22). A Multi-Head Attention block produces a matrix³³ for each processed token: a row of this matrix is a token feature vector, i.e. the output of a self-attention head whose job is to understand one of the possible meanings of the token/word under consideration inside the target sentence. Therefore, the output of a multi-head attention block is not a single matrix (one vector for each word of the target sentence) but a set of n³⁴ matrices.

³⁰ n is the maximum sentence length.  ³¹ Obtained by pre-processing the dataset.  ³² See later for further details.

Figure 2.21: Summing the embedding and the positional encoding vectors for each token.

• Feed Forward Neural Network The issue is that the output of an encoder block should be a single matrix (containing a vector for each processed word of the target sentence), as this is the shape of the input accepted by the following encoder/decoder block. To condense the multi-head attention's output matrices (n matrices, one matrix for each word) into a single one, further processing is needed: a Feed Forward Neural Network does the trick.

• Add and Norm The "add and norm" layer also plays a key role in connecting the inputs and outputs of other layers smoothly. This layer contains a residual structure (to optimize the deep network) and

³³ In fig. 2.22 it is shown as a vector, but it is indeed a matrix.  ³⁴ n is the length of the text sentence under consideration.
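As anticipated in the Positional Encoding item above, this is a minimal sketch of the sinusoidal encodings proposed by Vaswani et al. [98]: even dimensions use a sine, odd dimensions a cosine, at wavelengths that grow geometrically with the dimension index. The resulting matrix is simply added to the token embeddings; the sizes used here are arbitrary.

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of positional encodings."""
    pos = torch.arange(max_len).unsqueeze(1).float()      # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even dimension indices
    angle = pos / 10000 ** (i / d_model)                  # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                        # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)                        # cosine on odd dimensions
    return pe

token_embeddings = torch.randn(10, 64)                    # 10 tokens, hypothetical d_model = 64
x = token_embeddings + sinusoidal_positional_encoding(10, 64)   # input to the first block
print(x.shape)                                            # torch.Size([10, 64])
```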
