
UNIVERSITÀ DI PISA

MASTER'S THESIS

MusÆ: Adversarial Autoencoder for Music

Author:

Andrea Valenti

Supervisors:

Dr. Davide Bacciu

Dott. Antonio Carta

Examiner:

Prof. Francesco Romani

A thesis submitted in fulfillment of the requirements

for the Master’s Degree in Computer Science

in the

Computational Intelligence & Machine Learning Group

Dipartimento di Informatica


Declaration of Authorship

I, Andrea Valenti, declare that this thesis titled, “MusÆ: Adversarial Autoencoder for Music” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.


“What I cannot create, I do not understand.”


UNIVERSITÀ DI PISA

Abstract

Dipartimento di Informatica

Master’s Degree in Computer Science

MusÆ: Adversarial Autoencoder for Music

by Andrea Valenti

Automatic music modelling and generation is a challenging task. The ability of deep generative models to learn from big data collections makes them well-suited for modelling musical data. Among them, the adversarial autoencoder stands out for its intrinsic flexibility and seems to be a natural choice for dealing with complex data distributions, such as the one of music. Despite that, in the literature there are no mentions of adversarial autoencoders applied to music. This thesis intends to fill this gap, presenting a novel architecture for symbolic music generation, called MusÆ. The experiments show that MusÆ has a higher reconstruction accuracy than similar models based on standard variational autoencoders. It is also able to create realistic interpolations between two musical sequences, smoothly changing the dynamics of different tracks. Experiments on the learned latent space show that some latent dimensions have a significant correlation with some low-level properties of the songs, allowing us to perform changes to the generated pieces in a principled way. We have uploaded a selection of the generated songs on YouTube, and we encourage the reader to judge the quality of the results by listening to them.


Acknowledgements

I would first like to thank my thesis supervisor Dr. Davide Bacciu of the Computer Science department at the University of Pisa. The door to Dr. Bacciu's office was always open whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this thesis to be my own work, but steered me in the right direction whenever he thought I needed it.

I would also like to thank the examiner who was involved in the validation survey for this research project: Prof. Francesco Romani. Without his passionate participation and input, the validation survey could not have been successfully conducted.

I would also like to acknowledge Dott. Antonio Carta of the Computer Science department at the University of Pisa as the second reader of this thesis, and I am gratefully indebted to him for his very valuable comments on this thesis.

Finally, I must express my very profound gratitude to my parents, my grandparents and to all of my friends for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them.

Thank you.


Contents

Declaration of Authorship iii

Abstract vii

Acknowledgements ix

Contents xi

List of Figures xiii

List of Tables xvii

List of Abbreviations xix

1 Introduction 1

2 Background 7

2.1 Survey of Music Modelling and Generation . . . 8

2.1.1 Early Attempts . . . 9

2.1.2 Neural Networks Models . . . 10

2.1.3 Deep Generative Models. . . 11

2.2 Music Representations . . . 14

2.2.1 Symbolic . . . 15

2.2.2 Audio . . . 16

2.2.3 Performance Control . . . 19


2.3 Generative Models . . . 22

2.3.1 Variational Autoencoders . . . 23

2.3.2 Generative Adversarial Networks . . . 28

Wasserstein GAN . . . 31

2.3.3 Adversarial Autoencoders . . . 32

3 The Model 37

3.1 Data Representation . . . 38

3.2 Architecture . . . 39

3.2.1 Encoder . . . 41

3.2.2 Decoder . . . 42

3.2.3 Discriminator . . . 44

3.3 Training . . . 44

3.3.1 Reconstruction Phase . . . 46

3.3.2 Regularisation Phase . . . 47

4 Experiments 49

4.1 Data Preprocessing . . . 50

4.2 Sequence Reconstruction . . . 52

4.3 Latent Space Interpolation . . . 55

4.4 Latent Space Sweep . . . 56

5 Conclusion 61


List of Figures

2.1 Hierarchical architecture of the Music VAE model [4]. The high-level conductor creates a series of intermediate codes, used by low level decoders to generate the actual piano roll. (Figure credits by Roberts et al. [4]). . . 12

2.2 System architecture of the MuseGAN model [44] for multi-track sequential data generation. (Figure credits by Dong et al. [44]) . . . 12

2.3 Excerpt of Chopin’s Nocturne op.9 no.2, as an example of musical score notation. . . 15

2.4 Example of raw audio representations. The same sound can be represented either as a waveform or a spectrogram. . . 17

2.5 Evolution of piano roll representation. . . 18

2.6 Intro of "Hotel California" song by Eagles, shown in its tab representation. . . 18

2.7 Typical architecture of a variational autoencoder. The encoder approximates the posterior qλ(z|x), whilst the decoder approximates pθ(x|z). Dashed lines denote stochastic operations, full lines denote deterministic operations. . . 23

2.8 This figure illustrates the reparameterisation trick of VAEs. Red arrows show gradient backpropagation flow during training. Dashed lines denote stochastic operations, full lines denote deterministic operations. . . 24


2.9 Typical setting of generative adversarial networks. The generator produces fake training samples x = gθ(z) starting from latent variables z. The discriminator, fed with both real and fake samples, assigns to each sample the probability of coming from the actual data distribution. . . 27

2.10 Kullback-Leibler and Jensen-Shannon divergences between two normal distributions p ∼ N(0, 1) and q ∼ N(µ, 1). When the two distributions become very different from each other, the gradient of both divergences tends to vanish. . . 28

2.11 Outputs of optimal GAN and WGAN discriminators in the data space. While GAN discriminator output tends to have vanishing gradient when its predictions are very confident, WGAN discriminator provides informative gradient for the generator during every stage of the training. . . 30

2.12 Typical architecture of an adversarial autoencoder. Dashed lines denote non-differentiable random operations, while full lines denote differentiable operations. As for VAEs, the encoder approximates the posterior qλ(z|x), whilst the decoder approximates pθ(x|z). The discriminator has to distinguish between fake latent factors ˜z generated by the encoder, and real latent factors z directly sampled from the chosen prior distribution. . . 33

3.1 Encoder’s architecture. The forward and backward parts of the 3-layer bi-LSTM are shown, respectively, in blue and red. All of these layers use tanh activation functions. The linear layers that use the outputs of the biLSTM to finally generate the latent factors distribution’s parameters are shown in orange. 40


3.2 Decoder’s architecture. Each of the four tracks follows the same basic architecture. The 3-layer LSTMs with tanh activations are shown in blue. The linear layers used to initialise the LSTM hidden states with z are shown in orange. The final softmax layer is shown in green: this layer transforms the LSTM output into a distribution over the note states, from which the final piano roll is sampled. . . 43

3.3 Training phases of an adversarial autoencoder. Dashed lines denote stochastic operations, while full lines denote deterministic ones. Red arrows show gradient backpropagation flow during training. . . 45

4.1 Piano roll of reconstructions of song Honesty by Billy Joel. . . 53

4.2 Examples of interpolations created by the trained model. . . . 55

4.3 Examples of variations of several different dimensions of a latent code z. . . 56

A.1 Examples of song reconstructions. . . 66

A.1 Examples of song reconstructions. . . 67

A.2 Examples of bar interpolations. . . 68

A.2 Examples of bar interpolations. . . 69

A.3 Examples of latent space sweeps. . . 70


List of Tables

2.1 Summary of the main MIDI event messages. For a complete list, see [7]. . . 21

4.1 Reconstruction accuracies of MusÆ on the Lakh MIDI dataset, compared to the Music VAE model [4]. A higher score is better. 52

4.2 Correlation coefficients between latent factor 1 and the evaluation metrics. . . 57


List of Abbreviations

ML Machine Learning

DL Deep Learning

ASCII American Standard Code (for) Information Interchange

MIDI Musical Instrument Digital Interface

RNN Recurrent Neural Network

CNN Convolutional Neural Network

LSTM Long Short-Term Memory

biLSTM Bidirectional Long Short-Term Memory

RBM Restricted Boltzmann Machine

VAE Variational AutoEncoder

GAN Generative Adversarial Networks

WGAN Wasserstein Generative Adversarial Networks

WGAN-GP Wasserstein Generative Adversarial Networks (with) Gradient Penalty

AAE Adversarial AutoEncoder

ELBO Evidence Lower BOund

KL Kullback-Leibler


Chapter 1

Introduction

The rise of artificial intelligence in recent years is arguably one of the events in the history of technological development with the potential to have a major impact on our society and on our current way of life.

Since the invention of the first machine learning techniques, scientists all over the world have been able to find new ways of applying such algorithms in an increasingly wide range of domains. Helped by an ever-increasing amount of sheer computing power that became available over the years, some of these models have beaten the previous state of the art and have now become the de-facto standard solution to many different problems. Machine vision and natural language processing are two notable examples of research areas almost completely revolutionised by the advent of machine learning [68][5].

Despite this encouraging series of successes, it may seem that there will always be areas of human expertise that the machines will fail to entirely grasp in all of their facets, sparing them from this process of automation. One example of such areas seems to be the fine arts, such as poetry, painting or music, where human intuition and creativity play (or are thought to play) a determinant role in producing high quality results. Many key concepts related to these arts are based on informal, intuitive definitions: for example, the division of music into different genres is mainly based on heuristics, which are, by definition, capable of considering only a subset of the important aspects that characterise a particular genre. It is often impossible to describe exactly with words the reason why we like a specific painting, or why a particular section of a song is able to evoke within us feelings that we have not felt for a long time. To complicate things even further, sometimes our degree of appreciation of a piece of art is not related at all to the quality of the piece of art itself. Instead, we like it only for collateral reasons: it might remind us of a happy moment in our lives, or of a beloved person (or both). Therefore, it would seem at first glance that the criteria human beings use to evaluate a piece of art are more subjective and emotional, rather than objective and rational. This can be a good thing if you are a music critic (or just passionate about the subject), since it allows you to sustain endless conversations about whether Bruce Springsteen has to be considered a rock songwriter or a folk rockstar, or which band was the most innovative between The Beatles and The Rolling Stones, or whether D minor is the saddest of all chords, without having to deal with the unpleasant fact that only one of the possible options must be true. On the other hand, it can be a real nightmare if you are trying to teach a machine the notion of musical genre. Looking at the problems listed above, it may seem that any attempt at building an automatic art generator machine is doomed to end in failure.

Thinking about the constituent parts of an artwork, it is possible to make a very broad division into a subjective part and an objective part: the subjective part contains things such as the sensations, emotions and (possible) memories that arise in the subject experiencing the artwork. Even if it can heavily affect the perceived quality of a piece of art, it is not of much interest in this context, mainly because it depends on the particular subject experiencing the artwork rather than on the artwork itself. There will always be people that hate the Mona Lisa just because it reminds them of that time they visited the Louvre museum during a trip to Paris with their ex-boyfriend/ex-girlfriend, and there is nothing Leonardo da Vinci could have done to prevent this. The objective part, on the other hand, can be described as the composition of features that, together, constitute a specific piece of art. These features are often organised in a hierarchical way. For a painting, the lowest-level features will describe things such as: type of paint, type of brush, choice of colours and so on, whilst the highest-level features will represent, for example, the general content of the picture (still life, landscape, etc.) and the artistic style (Realism, Cubism, Impressionism, etc.). For a piece of music, the lowest-level features will be something like the tempo, the choice of the instruments, the main tonal key, etc., whilst at the opposite end of the hierarchy we find features such as the musical genre. Every piece of art can be represented by a unique pattern of its characterising features. Similar artworks will contain similar patterns, whereas very different artworks will be composed of patterns that differ in a large number of feature values. Each person will appreciate different kinds of patterns, and each pattern will be appreciated by a different number of people.

A hypothetical automatic art generator should be able to capture, and to reproduce, the underlying mechanism that contributes to the objective quality of the artwork. This means that it has to be, in its essence, a pattern recognition machine. It would need to have an understanding of the lowest-level features of the specific art it is trying to generate, and of how those features combine together to form higher-level features. It would also need to know how to modify such features in a meaningful way, possibly in a way that allows it to increase the total amount of objective quality of the resulting artwork. The main issue with that is that we, as humans, usually do not possess knowledge about the arts that is both formal and detailed enough to define such features in a way a machine could easily understand. For example, let us suppose our goal is to create the ultimate automatic music generator machine. Such a machine will have to be able to generate any kind of music, of any genre, and with any instrument or combination of instruments. As a first approach, we could try to hard-code into a computer program all the rules of standard music theory and harmony [59]. This may allow us to generate some pretty realistic baroque-sounding music for some time, but its limitations will become unbearable very soon. We could, then, try to hard-code some exceptions to the previous rules, in order to make the generated music more interesting, but we have to be very careful, since it is easy to slip from "interesting" to "cacophonic". After a staggering amount of work, we might end up being satisfied with our music generator again. We are now able to generate, with a good degree of realism, various types of classical music. This is, again, fine for some time, but now we would like to generate jazz music, and then pop, rock, disco, electronic, hip hop, and so on. The complexity grows exponentially and soon everything becomes either impossible or impractical. An alternative approach could be to create a general-purpose pattern recogniser, general enough to be able to learn such features and patterns by itself, given only examples of the songs we would like it to learn. If the training examples constitute a representative sample of the real data distribution, the music generator will soon be able to identify by itself the underlying common features, as well as the differences, between the songs. It will structure the learned feature space in a meaningful way, or at least in a way that allows it to generate realistic-sounding music (it does not necessarily need to structure the feature space in a way that is meaningful to humans).


Here is where Machine Learning (ML for short) comes to our help: in particular, there exists a family of unsupervised neural network models, called neural generative models [10], that use ML techniques to learn the underlying probability distribution of the training data. Modelling the whole data distribution allows these models to perform many interesting tasks, such as: data compression, feature extraction, clustering, generation of new data samples, etc. As we will show in Chapter 2, neural generative models represent the current state of the art of automatic music and image generation, consistently outperforming other types of approaches.

The work of this thesis is built upon the theory of generative models. In particular, it extends one of them, called the adversarial autoencoder (AAE for short) [60]. Adversarial autoencoders combine the mechanisms of Variational Autoencoders (VAE) [27] and Generative Adversarial Networks (GAN) [40] (both to be introduced in Section 2.3) in order to provide flexibility and stability during the training process, making them a natural choice when dealing with complex and multimodal data distributions, such as the one of music. Despite that, to the best of the author's knowledge, there are no examples in the literature of AAEs applied to music generation. This thesis wants to fill this gap, training for the first time an adversarial autoencoder on actual MIDI songs. This model, called MusÆ as a reference to ancient Greece's Muses, has been subjected to experiments that proved it to be a competitive approach for music-related tasks, such as obtaining faithful reconstructions of musical sequences, generating convincing interpolations between two songs and identifying meaningful factors of variation in the data.

The remaining part of this thesis is structured in the following way: Chapter 2 will provide the reader with a theoretical overview of the most important topics related to the automatic music generation problem. The chapter contains a survey of the many approaches that have been tried over the years, with particular regard to the latest neural network models, such as the aforementioned neural generative models. In Chapter 3 the MusÆ model is presented, and its architecture is described in detail. In Chapter 4 the experiments conducted in order to test the capabilities, as well as the limitations, of MusÆ are described. The experimental results are then discussed and compared to other similar results found in the literature. Finally, Chapter 5 contains a recap of the work done in this thesis and outlines some possible further developments of this line of research.


Chapter 2

Background

Automatic music generation is a challenging task.

Music is, by its very nature, quite different from other forms of art, presenting some peculiar issues that need to be addressed [23][87]. First of all, music "happens" over time. The time dimension is not only relevant, but crucial for this type of art. The artistic value of a note (or a chord) is not absolute, but depends entirely on the temporal context the note is played in. The same exact notes can either be a pleasant conclusion to a well-structured musical phrase or a cacophonic intermezzo in an otherwise good-sounding riff. Furthermore, a musical piece is often the resulting composition of different instruments, so the overall pleasantness of a song cannot be measured in an additive way, i.e. by adding the pleasantness of its individual tracks. An instrument playing the wrong notes, or even the right notes but at the wrong time, can ruin an entire part of a song, no matter what the other instruments are doing. To complicate things even more, other forms of art often utilise standardised digital representations that are easy to store and to manipulate inside a computer, while also being able to capture all the relevant low-level features of the particular artwork. For example, a painting can be easily represented by a pixel matrix, whilst a poem can be encoded in one of the many character encoding standards such as ASCII or Unicode. This is not the case with music [71]. Traditional music notations, such as the Western score notation, are still heavily dependent on the performer's interpretation; and lower-level representations like raw audio waves, while containing all the necessary information, are usually quite difficult to manipulate and do not allow one to exploit the intrinsic structures of a musical piece. A partial solution to these problems can be sought in intermediate representations such as MIDI [7] (see Section 2.2).

This chapter is divided into three sections: in Section 2.1 the efforts made over the years in the attempt of solving the automatic music generation problem are presented and discussed. The peculiar characteristics of neural networks have allowed them to establish themselves among the most used models for automatic music composition nowadays, especially considering their generative versions [51]. Notable examples of such models are described in Section 2.1.3 and Section 2.1.2. In Section 2.2 some of the most widely used forms of music representation are outlined. Finally, Section 2.3 contains a formal introduction to the theory underlying deep generative models, which constitutes the basis of the MusÆ model.

2.1 Survey of Music Modelling and Generation

Despite the many difficulties, automatic music generation has been a very appealing problem for a long time and many attempts have been made over the years. However, only with the recent breakthroughs in artificial neural networks has it been possible to produce any result really worthy of notice. This has been mainly accomplished through (deep) neural generative models [51] such as the variational autoencoder and generative adversarial networks, which will be subject to further discussion in Section 2.3. Therefore, this survey is mainly focused on the attempts regarding Machine Learning and its most promising generative models. The interested reader is invited to check these more comprehensive surveys: [51][34][28][74].

2.1.1 Early Attempts

The first practices of music automation can be traced back to medieval times, when the famous musician Guido d'Arezzo designed a rule-based vowel-to-pitch mapping algorithm to generate sequences of notes [59]. Fast-forwarding to modern times, it is possible to find some notable works from Kirnberger [54], Menabrea and Lovelace [66], dating back up to 200 years [25]. The first large-scale attempts to build general automatic music generators became possible only with the development of modern computers. Since the late 1950s, a variety of computer-music programming languages have been invented [24][64][13][35], making the process of music creation via programming even more efficient. The first piece of music generated by a computer dates back to 1957 [51]. It was a 17 seconds long melody named "The Silver Scale" and was generated by a software for sound synthesis named Music I. Prior to the recent neural network renaissance, most systems used either human-encoded rules or Markov models, such as in [22], where a semi-automatic approach based on Markov models was able to create music in the style of a given composer. However, these solutions mostly consisted in tools meant to assist human musicians in their creative process. They still could not totally replace the human factor, hence they could not be considered fully-automatic music generators.


2.1.2 Neural Networks Models

The use of neural networks in symbolic music generation has seen a resurgence in interest (as surveyed in [51] by Briot et al.), and early works on music composition with artificial neural networks include Todd [?], Bharucha and Todd [12], Mozer [69][70], Chen and Miikkulainen [15] and Eck and Schmidhuber [29][30], mostly based on plain Recurrent Neural Networks (RNN). In more recent years, Boulanger-Lewandowski et al. [73] presented one of the first neural network systems for generating polyphonic music, combining Long Short-Term Memory (LSTM) [47] units and Restricted Boltzmann Machines (RBM) [79] to simultaneously model the temporal structure of music, as well as the harmony between notes that are played at the same time. Chu et al. [42] use domain knowledge to model a hierarchical RNN architecture that produces multi-track polyphonic music. Brunner et al. [14] combine a hierarchical LSTM model with learned chord embeddings that form the Circle of Fifths, showing that even simple LSTMs are capable of learning music theory concepts from data. Some other works have been focused on specific domains. Hadjeres et al. [36] introduce DeepBach, an LSTM-based system that can harmonise melodies by composing accompanying voices in the style of Bach Chorales. This is considered to be a difficult task even for professionals. Other works that mimic the style of J.S. Bach are BachBot [58] and CoCoNet [17]. This last model uses an architecture based on Convolutional Neural Networks (CNNs) to capture the time dependencies of the compositions. Johnson et al. [52] use parallel LSTMs with shared weights to achieve transposition-invariance (similar to the translation-invariance of CNNs). Chuan et al. [21] investigate the use of an image-based Tonnetz representation of music, and apply a hybrid LSTM/CNN model to music generation. One of the most famous examples of symbolic music generation is probably the family of MelodyRNN and PerformanceRNN models proposed by the Magenta project from the Google Brain team [82]. The main function of MelodyRNN is to generate a melody sequence by first conditioning the model on a priming melody. Other recent systems for generating polyphonic music include JamBot [38], which generates chords and notes in a two-step process, DeepJ [49], which generates polyphonic piano music where a user can control several style parameters, and a model from Roy et al. [76] able to generate variations on lead sheets. Another line of research focuses on directly manipulating the raw audio waves that compose the musical piece. Van der Oord et al. introduced WaveNet [1], a CNN-based model for the conditional generation of speech, which can also be used to generate quite realistic piano music. Engel et al. [32] incorporated WaveNet into an autoencoder architecture to generate musical notes with the timbre of different instruments. Mehri et al. developed SampleRNN [83], an RNN-based model for the unconditional generation of raw audio. Despite these models, the domain of raw audio generation is very high dimensional, so it is much more difficult to generate music that sounds both realistic and pleasing [81]. Therefore, the majority of the existing works are based on some form of symbolic music representation (more on this in Section 2.2).

2.1.3 Deep Generative Models

Recent years have seen increasingly successful applications of generative models [51] such as the variational autoencoder (VAE) [27] and generative adversarial networks (GAN) [40]. The MIDI-VAE model by Brunner et al. [14] is based on VAEs and considers pitches, velocities and instruments to perform style transfer between two genres. Roberts et al. introduce Music VAE [4], a hierarchical sequence-to-sequence VAE model (see Figure 2.1).


FIGURE 2.1: Hierarchical architecture of the Music VAE model [4]. The high-level conductor creates a series of intermediate codes, used by low-level decoders to generate the actual piano roll. (Figure credits by Roberts et al. [4].)

FIGURE 2.2: System architecture of the MuseGAN model [44] for multi-track sequential data generation. (Figure credits by Dong et al. [44].)


The hierarchy is expressed by using a two-level decoder: in the first level, the so-called "conductor" decoder takes as input the latent factors generated by the encoder and in turn generates some intermediate codes, which are then used by the low-level decoder to generate 16 bars of the actual piano roll. The hierarchical structure of Music VAE makes it able to capture long-term structure in polyphonic music, which is generally considered to be a difficult problem.

Generative adversarial networks, while powerful in principle, are very difficult to train effectively [61], especially when applied to sequential data. However, this has not discouraged scientists, and many recent works successfully apply GAN to music generation. Mogren [67], Yang et al. [57] and Dong et al. [44] have shown that CNN-based GAN can be effectively applied to music composition. In particular, the MuseGAN model by Dong et al. [44] (Figure 2.2) considers four types of latent vectors as input: inter-track time-independent, intra-track time-independent, inter-track time-dependent and intra-track time-dependent. Time-independent factors are meant to contain static features of the song, while time-dependent factors contain the part of information that changes over time during song execution. In order to generate a new song, first the time-dependent factors are fed to a high-level generator (in a similar way to the conductor decoder of Music VAE), which creates intermediate latent codes, one for each time step. Such latent codes are then concatenated with the time-independent information, and the result is used as input of the low-level generator, responsible for the generation of the final song. Finally, it is worth noticing the SeqGAN model by Yu et al. [55], which successfully applied an RNN-based GAN to music by incorporating reinforcement learning techniques.


2.2 Music Representations

As outlined at the beginning of this chapter, the problem of representation is not trivial. This is fundamentally related to the multi-modal and multi-level character of music representation. Music can be read, listened to or performed, and we can rely on score, sound, or some other type of intermediate representation. For these reasons, it is possible to find in the literature many discussions on representations [23][87][71]. A big divide in terms of the choice of representation (both for input and output) is audio versus symbolic. This also corresponds to the divide between continuous and discrete generative processes [81][63]. The relationship between music score and sound is similar to the one between text and speech: the score serves as a symbolic and highly abstract visual representation with the purpose of efficiently recording and communicating musical ideas, whereas sound is a set of continuous signals that encode all the details we can hear. In addition, there exist many musical representations that fall in-between the previous two. Such representations are often based on the usual musical score, with some additional parameters that allow one to control the playing performance in a machine-friendly way. MIDI is one of the most famous examples of these intermediate representations [7], and will be the object of a more in-depth discussion towards the end of this section. Despite the big differences between the different types of representations, it is important to notice that, in many cases, the actual processing of these representations by machine learning models is not that different. The actual audio and symbolic architectures for music generation may be quite similar in practice [51]. For example, the WaveNet [1] audio architecture has been transposed to the MidiNet [57] symbolic architecture. This kind of polymorphism is one of the advantages of using a machine learning-based approach to solve these kinds of problems.


FIGURE 2.3: Excerpt of Chopin's Nocturne op.9 no.2, as an example of musical score notation.

2.2.1 Symbolic

Score representation exists in many forms, including sheet music notation, lead sheet, chord chart and numbered musical notation. Most of them are highly symbolic and can encode many abstract music features indicated by the composer, such as: sound information like tonality, pitch, chord, timing; structure information like phrases and repetitions; and, to a certain extent, performance information like dynamics. These features tend to be both discrete and expressed in different measurement scales. For example, pitch is a discrete variable that expresses continuous, but fixed, fundamental frequency values. That is, between two consecutive notes (e.g. A4 = 440.00 Hz and A#4 = 466.16 Hz in the equal temperament scale [9]) there is no legal note. Other features such as chords are represented by nominal variables, and dynamics is an ordinal variable, with values (ppp, pp, p, f, ff, fff) ranging from pianissimo (ppp, the softest) to fortissimo (fff, the loudest). These characteristics make the development of generative models based on score representations very challenging, and in practice this representation is rarely considered for real-world applications [81].
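The relation between pitches and fundamental frequencies mentioned above can be made concrete with a short sketch. The following Python snippet (an illustrative aside, not part of the thesis model) applies the standard equal-temperament formula f = 440 · 2^((p − 69)/12), where p is the MIDI pitch number and A4 corresponds to p = 69:

```python
# Illustrative sketch: MIDI pitch number -> frequency under twelve-tone
# equal temperament, with A4 (MIDI pitch 69) tuned to 440 Hz.

def midi_pitch_to_hz(pitch: int, a4_hz: float = 440.0) -> float:
    """Return the equal-temperament frequency of a MIDI pitch number."""
    return a4_hz * 2.0 ** ((pitch - 69) / 12.0)

if __name__ == "__main__":
    print(round(midi_pitch_to_hz(69), 2))  # A4  -> 440.0 Hz
    print(round(midi_pitch_to_hz(70), 2))  # A#4 -> 466.16 Hz
    print(round(midi_pitch_to_hz(60), 2))  # C4  -> 261.63 Hz
```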

2.2.2 Audio

On the other end of the representation spectrum, we have the raw sound representation, which can be seen as an acoustic realisation of the corresponding score through a certain instrument (or a group of them). The most direct representation is the raw audio signal, a.k.a. the waveform (see Figure 2.4). Using transformed representations of the audio signal usually leads to data compression and higher-level basic features, at the cost of losing some information and introducing some bias [51]. Common transformed representations are: spectrograms (Figure 2.4), where the raw audio signal is divided into its composing frequencies via a Fourier transform; and chromagrams, which are essentially spectrograms where the considered frequencies are restricted to the pitch classes. Sound representation is essentially continuous and very rich in acoustic details. All symbolic abstractions and performance control information are no longer explicitly described and get hidden in the audio wave. If score representation is difficult to deal with because of its excessive abstraction and its need of further interpretation, sound wave representation brings with it issues that are about as challenging, but for the opposite reasons: generative models must now do extra work to extract the relevant features from the extremely entangled raw data space. Such models usually rely heavily on complex data preprocessing procedures in order to reach a good starting representation, and usually a compromise of some kind between efficiency and accuracy of the generated pieces is needed.
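As a concrete illustration of the waveform-to-spectrogram transformation described above, the following sketch (assuming NumPy and SciPy are available; the signal is a synthetic 440 Hz sine rather than real music) computes a magnitude spectrogram with the short-time Fourier transform:

```python
# A minimal sketch of going from a raw waveform to a magnitude spectrogram
# via the short-time Fourier transform (STFT).
import numpy as np
from scipy.signal import stft

sr = 16000                                   # sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / sr)              # one second of audio
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz sine (A4) as a toy waveform

# Each column of |Z| is the magnitude spectrum of one short window, which is
# exactly what a spectrogram displays over time.
freqs, times, Z = stft(wave, fs=sr, nperseg=1024, noverlap=512)
spectrogram = np.abs(Z)                                 # (frequency bins, time frames)
log_spectrogram = 20 * np.log10(spectrogram + 1e-10)    # dB scale, as usually plotted

print(spectrogram.shape)
```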


(A) Waveform representation. (B) Spectrogram representation.

FIGURE 2.4: Example of raw audio representations. The same sound can be represented either as a waveform or a spectrogram.


(A) Traditional piano roll. (B) Digital piano roll.

FIGURE 2.5: Evolution of the piano roll representation.

FIGURE 2.6: Intro of the "Hotel California" song by Eagles, shown in its tab representation.


2.2.3 Performance Control

A performance control encodes an interpretation of the corresponding score, relying on which a performer turns the score into performance motions. A commonly used control representation is the Musical Instrument Digital Interface (MIDI) [7]. MIDI is a technical standard that describes a protocol, a digital interface and connectors for interoperability between various electronic musical instruments, software and devices. MIDI can be considered as an intermediate level of representation between symbolic and audio [81]: while being fundamentally a symbolic representation, it includes ways to fine-control the dynamics and the performance of the playing tracks, allowing the protocol to express a wider range of dynamics than a traditional score. Each note is encoded by its pitch, dynamics, onset (starting time), and duration. It also has a number of controllers such as pedal and pitch bend for more performance nuances. Pitches are represented as integers, with the semitone being the smallest unit. Dynamics are integers in velocity units, ranging from 1 to 127, and timings are floating point numbers in seconds. The MIDI protocol carries event messages that specify real-time note performance data as well as control data. They consist of a status byte, which can be followed by one or more data bytes. A summary of the most important event messages can be found in Table 2.1. For a detailed description of such events, the reader is referred to the official MIDI specifications [7]. In [48] Huang and Wu claim that one drawback of encoding MIDI messages directly is that it does not effectively preserve the notion of multiple notes being played at once through the use of multiple tracks. In their experiments, they concatenate tracks end-to-end and thus posit that it will be difficult for such a model to learn that multiple notes in the same position across different tracks can really be played at the same time. The piano roll, to be introduced in the next paragraph, does not have this limitation (but at the cost of other limitations) [51].

(42)

20 Chapter 2. Background

The piano roll representation of music is inspired by the automated piano, which has a continuous roll of paper with holes punched into it. Each perforation triggers a specific note. The pitch is controlled by the localisation of the perforation, whilst the duration of the note corresponds to the length of the perforation. Thus, in this representation, time is discretised into many steps of the same duration. There are several music environments using the piano roll as a basic visual representation, in place of or in complement to a score, as it is more intuitive than the traditional score notation. An example is Hao Staff piano roll sheet music [43], with the time axis being horizontal rightward and notes represented as green cells. Another example is tabs, where the melody is represented in a piano roll-like format [51], in complement to chords and lyrics. Tabs are used as an input by the MidiNet model [57]. The piano roll is one of the most commonly used representations, despite its limitations. An important one, compared to the MIDI representation, is that there is no note-off information. As a result, there is no way to distinguish between a long note and a repeated short note. For a more detailed comparison between MIDI and piano roll, see [48] and [85]. Compared to score representation, the key characteristics of performance control are the enriched and detailed timing and dynamics information, which more or less determines the musical expression of a performance. On the other hand, most structural information such as phrases, repetitions, and chord progressions is flattened and becomes implicit during the translation from the score to performance control. Finally, it is important to notice that performance control is largely independent from the actual instrument.
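As an illustration of how such a piano roll can be obtained in practice, the following sketch uses the third-party pretty_midi library (one possible choice among several; the input file name is hypothetical) to turn each non-drum track of a MIDI file into a binary 128 × T matrix:

```python
# A sketch of converting a MIDI file into piano-roll matrices: one 128 x T
# matrix per track, with the time axis discretised at a fixed rate.
import numpy as np
import pretty_midi

midi = pretty_midi.PrettyMIDI("example_song.mid")  # hypothetical input file

fs = 8  # time resolution: 8 columns per second of audio
rolls = []
for instrument in midi.instruments:
    if instrument.is_drum:
        continue                                   # drum tracks carry no pitch in the usual sense
    roll = instrument.get_piano_roll(fs=fs)        # shape (128 pitches, T steps), values = velocity
    rolls.append((roll > 0).astype(np.uint8))      # binarise: note on/off, dropping velocity

# Note the piano-roll limitation mentioned in the text: once binarised, a long
# held note and a rapidly repeated note produce the same matrix.
print(len(rolls), rolls[0].shape if rolls else None)
```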


Status byte | Data byte(s) | Description

Channel Voice Messages [nnnn = 0-15 (MIDI Channel Number 1-16)]

1000nnnn | 0kkkkkkk 0vvvvvvv | Note Off event. This message is sent when a note is released (ended). (kkkkkkk) is the key (note) number. (vvvvvvv) is the velocity.

1001nnnn | 0kkkkkkk 0vvvvvvv | Note On event. This message is sent when a note is depressed (started). (kkkkkkk) is the key (note) number. (vvvvvvv) is the velocity.

1010nnnn | 0kkkkkkk 0vvvvvvv | Polyphonic Key Pressure (After-touch). This message is most often sent by pressing down on the key after it "bottoms out". (kkkkkkk) is the key (note) number. (vvvvvvv) is the pressure value.

1011nnnn | 0ccccccc 0vvvvvvv | Control Change. This message is sent when a controller value changes. Controllers include devices such as pedals and levers. Controller numbers 120-127 are reserved as "Channel Mode Messages". (ccccccc) is the controller number (0-119). (vvvvvvv) is the controller value (0-127).

1100nnnn | 0ppppppp | Program Change. This message is sent when the patch number changes. (ppppppp) is the new program number.

1101nnnn | 0vvvvvvv | Channel Pressure (After-touch). This message is most often sent by pressing down on the key after it "bottoms out". This message is different from polyphonic after-touch. Use this message to send the single greatest pressure value (of all the currently depressed keys). (vvvvvvv) is the pressure value.

1110nnnn | 0lllllll 0mmmmmmm | Pitch Bend Change. This message is sent to indicate a change in the pitch bender (wheel or lever, typically). The pitch bender is measured by a fourteen bit value. Center (no pitch change) is 2000H. Sensitivity is a function of the receiver, but may be set using RPN 0. (lllllll) are the least significant 7 bits. (mmmmmmm) are the most significant 7 bits.

TABLE 2.1: Summary of the main MIDI event messages. For a complete list, see [7].
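The structure of the channel voice messages in Table 2.1 can be illustrated with a small decoding sketch (illustrative only, not part of the thesis): the upper four bits of the status byte select the message type and the lower four bits select the channel.

```python
# Decoding the MIDI channel voice messages summarised in Table 2.1.

MESSAGE_TYPES = {
    0x8: "Note Off",
    0x9: "Note On",
    0xA: "Polyphonic Key Pressure",
    0xB: "Control Change",
    0xC: "Program Change",
    0xD: "Channel Pressure",
    0xE: "Pitch Bend Change",
}

def decode(status, data):
    kind = MESSAGE_TYPES.get(status >> 4, "Unknown")   # upper nibble: message type
    channel = (status & 0x0F) + 1                      # lower nibble: channel, numbered 1-16
    return f"{kind} on channel {channel}, data bytes {data}"

# 0x90 0x45 0x64 -> Note On, channel 1, key 69 (A4), velocity 100
print(decode(0x90, [0x45, 0x64]))
```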


2.3 Generative Models

As shown in Section 2.1, automatic music composition is not a new idea, and computational models for algorithmic composition have been developed using a large set of different techniques. However, their ability to learn from big data collections, as well as their intrinsic flexibility, has allowed deep generative models to establish themselves as the de-facto standard for art and music generation [51][34]. Machine Learning models can be divided into two broad categories: discriminative models and generative models. Given an observable variable x (usually referred to as the "data sample") and a target variable y (the "label"), a discriminative model learns the conditional probability p(y|x), whereas a generative model learns the joint probability p(x, y). In other words, a discriminative model learns the distribution needed to classify a given sample x into a class y, whereas a generative model learns to model the entire data distribution. The joint distribution p(x, y) can be transformed into p(y|x) using Bayes' rule

\[ p(y|x) = \frac{p(x|y)\, p(y)}{p(x)} = \frac{p(x, y)}{p(x)} \tag{2.1} \]

so a generative model can be used for classification as well. However, it can do more: for example, it is possible to sample from p(x, y) in order to generate new (x̂, ŷ) pairs (where x̂ and ŷ are specific instances of x and y). That is, a generative model is able to generate brand new data samples (and their corresponding labels). When applied to art, generative models can potentially be used to automatically create new artworks (often with surprising realism and accuracy). However, not only can they generate new data, but they can also change the properties of existing data in a principled way. This allows, for example, the generation of meaningful interpolations between two given data samples.
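A toy sketch can make the distinction concrete. Assuming a joint distribution p(x, y) modelled with class priors and one-dimensional class-conditional Gaussians (an arbitrary illustrative choice, not the thesis setting), the same model can both generate new (x, y) pairs and classify an observed x via Bayes' rule (Equation 2.1):

```python
# A toy generative model of p(x, y): p(y) is a class prior, p(x|y) a Gaussian.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
prior = {0: 0.5, 1: 0.5}          # p(y)
mean = {0: -2.0, 1: +2.0}         # p(x|y) = N(mean[y], 1)

def sample_pair():
    """Generate a brand-new (x, y) pair from the joint p(x, y) = p(y) p(x|y)."""
    y = int(rng.random() < prior[1])
    x = rng.normal(loc=mean[y], scale=1.0)
    return x, y

def posterior(x):
    """Bayes' rule (Eq. 2.1): p(y|x) = p(x|y) p(y) / p(x)."""
    joint = {y: norm.pdf(x, loc=mean[y], scale=1.0) * prior[y] for y in (0, 1)}
    evidence = sum(joint.values())
    return {y: j / evidence for y, j in joint.items()}

print(sample_pair())   # a generated sample together with its label
print(posterior(0.5))  # class probabilities for an observed x
```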


FIGURE 2.7: Typical architecture of a variational autoencoder. The encoder approximates the posterior qλ(z|x), whilst the decoder approximates pθ(x|z). Dashed lines denote stochastic operations, full lines denote deterministic operations.

The rest of this section describes two of the most widely used generative models within the scope of the deep learning paradigm, namely the Variational Autoencoder (VAE) [27] and Generative Adversarial Networks (GAN) [40]. An additional model, called the adversarial autoencoder (AAE) [60], is then discussed: it can be viewed as a hybrid model between VAE and GAN and it constitutes the basis architecture of the MusÆ model presented in this thesis.

2.3.1 Variational Autoencoders

The variational autoencoder (see Figure 2.7), first introduced by Kingma and Welling in [27], is a generative approach that learns to model a data distribution with a set of hidden (latent) variables using only pure gradient-descent methods. Let x be a random variable representing the observed data sample and z the set of latent variables. Then, the generative model over the pair (x, z) is

\[ p(x, z) = p(z)\, p(x|z) \tag{2.2} \]


(A) The encoder cannot receive gradient information during training due to the non-differentiable random sampling operation needed to produce z.

(B) With the reparameterisation trick, the encoder generates the parameters of the prior distribution in a deterministic way, so that the gradient can be freely backpropagated. Stochastic behaviour is then obtained by adding a source of fixed random noise ε.

FIGURE 2.8: This figure illustrates the reparameterisation trick of VAEs. Red arrows show gradient backpropagation flow during training. Dashed lines denote stochastic operations, full lines denote deterministic operations.


where p(z) is the prior distribution of the latent variables and p(x|z) is the conditional likelihood distribution. The prior distribution is often assumed to behave as a known distribution, such as a Normal or a Bernoulli, while the likelihood distribution is implemented by a neural network pθ(x|z), which is referred to as the decoder. VAEs learn to perform approximate inference using another neural network, called the encoder qλ(z|x), to approximate the true posterior p(z|x). Thus we have

\begin{align}
z &\sim \mathrm{Encoder}(x) = q_\lambda(z|x) \tag{2.3} \\
\tilde{x} &\sim \mathrm{Decoder}(z) = p_\theta(x|z) \tag{2.4}
\end{align}

so that the approximate posterior (the encoder) and the likelihood distribution (the decoder) are parameterised by λ and θ, respectively. Following the framework of variational inference [26], the encoder and the decoder networks are then optimised together by maximising the evidence lower bound (ELBO) associated with a data sample x, given by

\begin{align}
\mathcal{L}_{VAE}(q) &= \mathbb{E}_{z \sim q_\lambda(z|x)}[\log p_\theta(x, z)] + H(q_\lambda(z|x)) \tag{2.5} \\
&= \mathbb{E}_{z \sim q_\lambda(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\lambda(z|x) \,\|\, p(z)) \tag{2.6} \\
&\leq \log p(x) \tag{2.7}
\end{align}

In Equation 2.5, the first term is the joint log-likelihood of both the visible and the latent variables under the approximate posterior over the latent variables. The second term is the entropy of the approximate posterior itself. In Equation 2.6, the first term is the reconstruction log-likelihood, which can be found in other variants of autoencoders as well. The second term is the Kullback-Leibler divergence between the encoder distribution and the prior latent distribution:

\[ D_{KL}(q_\lambda(z|x) \,\|\, p(z)) = \int q_\lambda(z|x) \log \frac{q_\lambda(z|x)}{p(z)} \, dz \tag{2.8} \]

This forces the approximate posterior qλ(z|x) to stay close to the prior p(z). During training, the gradient is not directly computable from the ELBO, because z is obtained from a random sampling operation, which is not differentiable. A way to get around this issue is the so-called reparameterisation trick [27] (see Figure 2.8): it is done by replacing z ∼ N(µ, σI) with

\begin{align}
z &= \mu + \sigma \varepsilon \tag{2.9} \\
\varepsilon &\sim \mathcal{N}(0, I) \tag{2.10}
\end{align}

where I denotes the identity matrix. Thus, z is now expressed as a deterministic (and differentiable) transformation of the parameters of the distribution plus a fixed random noise. With the reparameterisation trick the whole architecture becomes differentiable end-to-end, making gradient-descent based training possible.
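The following PyTorch sketch (an illustrative toy model, not the MusÆ implementation) shows the reparameterisation trick of Equations 2.9-2.10 and the ELBO of Equation 2.6 for a diagonal Gaussian encoder and a Bernoulli decoder, using the usual closed form of the KL term for this choice of prior and posterior:

```python
# Toy VAE: Gaussian encoder with reparameterisation, Bernoulli decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick (Eq. 2.9-2.10): z = mu + sigma * eps, eps ~ N(0, I).
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def negative_elbo(logits, x, mu, logvar):
    # Negative reconstruction log-likelihood (first term of Eq. 2.6).
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder (cf. Eq. 2.8).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl   # minimising this maximises the ELBO

vae = ToyVAE()
x = torch.rand(8, 784).round()            # a fake batch of binary data
logits, mu, logvar = vae(x)
loss = negative_elbo(logits, x, mu, logvar)
loss.backward()                           # gradients now reach the encoder through z
```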

The VAE approach is elegant, theoretically well-founded and simple to implement, making the VAE framework easy to extend to a wide range of model architectures. This constitutes an important advantage over other generative models such as Restricted Boltzmann Machines [39]. An important property of the VAE is that the joint training of both the parametric encoder and the decoder forces the model to learn a predictable latent space that can be captured by the encoder. This makes it an excellent model for manifold learning [39].


FIGURE 2.9: Typical setting of generative adversarial networks. The generator produces fake training samples x = gθ(z) starting from latent variables z. The discriminator, fed with both real and fake samples, assigns to each sample the probability of coming from the actual data distribution.


FIGURE 2.10: Kullback-Leibler and Jensen-Shannon divergences between two normal distributions p ∼ N(0, 1) and q ∼ N(µ, 1). When the two distributions become very different from each other, the gradient of both divergences tends to vanish.

2.3.2 Generative Adversarial Networks

Generative adversarial networks [40], introduced by Goodfellow et al., are a framework for training generative models. GAN define a game-theoretic scenario where there are two competing models: the generator gθ and the discriminator dφ (see Figure 2.9). The generator takes as input a random noise z sampled from the prior distribution of the latent space and maps it to the data space, producing x = gθ(z). On the other hand, the discriminator, given a sample x, has to distinguish between real data from the training set and fake data generated by gθ, and outputs dφ(x), that is, the probability that x belongs to the real data. During training, the two networks compete against each other in a zero-sum minimax game:

\[ \mathcal{L}_{GAN} = \min_\theta \max_\phi \; \mathbb{E}_{x \sim p_r}[\log d_\phi(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - d_\phi(g_\theta(z)))] \tag{2.11} \]

where p_r and p_z respectively represent the real data distribution and the prior distribution of z.


This adversarial formulation of the loss function prompts both models to get better and better at their respective tasks as the training progresses: the discriminator will try to maximise L_GAN by assigning high probability to real samples and low probability to generated ones, while the generator will try to minimise L_GAN by fooling the discriminator into assigning high probability to generated samples. Ideally, at full convergence, gθ's samples are indistinguishable from the real ones, and dφ outputs 1/2 everywhere. Unfortunately, GAN are notoriously difficult to train, especially when both gθ and dφ are neural networks and the loss is not convex [61][39]. The equilibrium points of the adversarial game are the saddle points of L_GAN, that is, points that are local minima for both the generator and the discriminator. In general, such points are difficult to reach during training: it is possible for gθ and dφ to take turns in increasing and then decreasing L_GAN forever, never reaching exactly a saddle point. Moreover, it can be proved that minimising L_GAN with an optimal discriminator is equivalent to minimising the Jensen-Shannon divergence between the real data distribution p_r and the generated data distribution p_g = gθ(z) [61][40]

\[ D_{JS}(p_r \,\|\, p_g) = \frac{1}{2} D_{KL}\!\left(p_r \,\Big\|\, \frac{p_r + p_g}{2}\right) + \frac{1}{2} D_{KL}\!\left(p_g \,\Big\|\, \frac{p_r + p_g}{2}\right) \tag{2.12} \]

where D_KL is the Kullback-Leibler divergence as defined in Equation 2.8. The problem with the Jensen-Shannon divergence is that its gradient tends to vanish when the two distributions are very different from each other (see Figure 2.10). In practice, this means that when dφ is close to optimal and gθ is still weak, the gradient for gθ gets smaller and smaller, to the point that it can no longer learn.
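A schematic training loop for the minimax game of Equation 2.11 is sketched below (a toy PyTorch example on two-dimensional data, not the thesis implementation; the generator step uses the common non-saturating variant of the objective rather than the literal minimax form):

```python
# Toy GAN training loop: the discriminator and the generator are updated in turns.
import torch
import torch.nn as nn

z_dim, x_dim = 8, 2
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def real_batch(n=64):                     # stand-in for the true data distribution p_r
    return torch.randn(n, x_dim) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(1000):
    x_real = real_batch()
    x_fake = G(torch.randn(x_real.size(0), z_dim))

    # Discriminator step: push d(x_real) towards 1 and d(x_fake) towards 0.
    d_loss = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
             bce(D(x_fake.detach()), torch.zeros(x_real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into outputting 1 on fake samples.
    g_loss = bce(D(x_fake), torch.ones(x_real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```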


FIGURE 2.11: Outputs of optimal GAN and WGAN discriminators in the data space. While the GAN discriminator output tends to have vanishing gradient when its predictions are very confident, the WGAN discriminator provides informative gradient for the generator during every stage of the training.


Wasserstein GAN

A big part of the follow-up research on GAN is devoted to mitigating the aforementioned convergence problems. Arjovsky et al. [6] introduce the use of the Wasserstein distance, also called the Earth Mover's distance,

\[ W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\, \|x - y\| \,] \tag{2.13} \]

in order to stabilise the training process and to avoid mode collapse. The term Π(p_r, p_g) denotes the set of all joint distributions γ(x, y) whose marginals are respectively p_r and p_g. The Wasserstein distance for p_r and p_g is defined as the greatest lower bound over all transport plans: it represents the minimum cost of transporting probability mass in converting p_g into p_r. The resulting model, called Wasserstein GAN (WGAN), is guaranteed to have a smooth gradient everywhere, so that the generator can learn even when it is not producing good data samples (see Figure 2.11). The problem now is that the Wasserstein distance as expressed in Equation 2.13 is highly intractable. However, we can apply the Kantorovich-Rubinstein duality [31] to simplify the calculation to

\[ W(p_r, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \tag{2.14} \]

Therefore, f must be a 1-Lipschitz continuous function, i.e. it must respect the following constraint

\[ \|f(x_1) - f(x_2)\| \leq K \|x_1 - x_2\| \tag{2.15} \]

with K = 1.


The resulting discriminator is almost the same as the original GAN discriminator, except that now it outputs a scalar score value instead of a probability. Enforcing the Lipschitz constraint of Equation 2.15 is crucial for the stability of WGAN. In the original WGAN paper, Arjovsky et al. propose a weight clipping strategy [6]. Weight clipping is simple, but the training process tends to be very sensitive to the clipping parameter and, moreover, it has been found to cause some optimisation difficulties. In [41], Gulrajani et al. then propose an additional gradient penalty term to be added to the WGAN loss function, which forces the gradients of the discriminator to have norm close to 1 almost everywhere. The resulting loss function thus becomes

\[ \mathcal{L}_{WGAN\text{-}GP} = \mathbb{E}_{x \sim p_r}[d_\phi(x)] - \mathbb{E}_{z \sim p_z}[d_\phi(g_\theta(z))] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\!\left[ (\|\nabla_{\hat{x}} d_\phi(\hat{x})\|_2 - 1)^2 \right] \tag{2.16} \]

where x̂ is computed as x̂ = αgθ(z) + (1 − α)x and α is sampled uniformly between 0 and 1. The resulting WGAN-GP model has been found to have faster convergence to better optima while requiring less parameter tuning [41]. This variation of GAN has been chosen as the basis for the adversarial part of MusÆ, the adversarial autoencoder proposed in this thesis and described in more detail in Chapter 3.
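The gradient penalty term of Equation 2.16 can be sketched as follows (an illustrative PyTorch fragment under the stated assumptions, not the MusÆ code): the critic's gradient norm at points interpolated between real and generated samples is pushed towards 1.

```python
# Gradient penalty of WGAN-GP: penalise (||grad_xhat d(xhat)||_2 - 1)^2.
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    alpha = torch.rand(x_real.size(0), 1)                         # one alpha per sample
    x_hat = (alpha * x_fake + (1.0 - alpha) * x_real)             # xhat = alpha*g(z) + (1-alpha)*x
    x_hat = x_hat.detach().requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]             # d critic / d xhat
    grad_norm = grads.norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# One common form of the critic loss (to be minimised), with d_phi the critic
# and x_fake = g_theta(z) detached from the generator:
#   loss_d = d_phi(x_fake).mean() - d_phi(x_real).mean() \
#            + gradient_penalty(d_phi, x_real, x_fake)
```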

2.3.3 Adversarial Autoencoders

Adversarial Autoencoders (AAE), first introduced in [60] by Makhzani et al., can be viewed as a hybrid model between VAE and GAN: while the basic encoder-decoder architecture resembles the one of a VAE, the variational inference mechanism on the latent space is implemented in an adversarial way, just as in GAN (see Figure 2.12).


FIGURE 2.12: Typical architecture of an adversarial autoencoder. Dashed lines denote non-differentiable random operations, while full lines denote differentiable operations. As for VAEs, the encoder approximates the posterior qλ(z|x), whilst the decoder approximates pθ(x|z). The discriminator has to distinguish between fake latent factors ˜z generated by the encoder, and real latent factors z directly sampled from the chosen prior distribution.


More formally, let x be the data sample and z be the corresponding latent variables of the AAE. Now let p(z) be the prior we want to impose on the latent variables, qλ(z|x) be the encoder conditional distribution and pθ(x|z) be the decoder conditional distribution. Finally, let p_d(x) be the actual data distribution and p_g(x) be the model distribution (i.e. the distribution of data generated by the model). Then, the encoder defines an aggregated posterior distribution qλ(z) on the latent variables z as follows

\[ q_\lambda(z) = \int q_\lambda(z|x)\, p_d(x)\, dx \tag{2.17} \]

In the adversarial autoencoder the aggregated posterior $q_\lambda(z)$ is made to match the arbitrary prior $p(z)$. This is, in principle, no different from what VAEs already do. However, standard VAEs use a KL divergence penalty (see Equations 2.8 and 2.6) to accomplish this, whereas in AAEs an adversarial network is attached on top of the latent variables $z$. The loss function of the AAE therefore becomes

$$\mathcal{L}_{AAE} = \mathbb{E}_{z \sim q_\lambda(z|x)}[\log p_\theta(x|z)] + \beta \mathcal{L}_{GAN} \qquad (2.18)$$

where $\beta$ regulates the amount of desired adversarial regularisation. Of course, it is possible to use any adversarial loss other than the original $\mathcal{L}_{GAN}$. During training, the autoencoder learns to minimise the reconstruction error, while the adversarial network guides the encoder to match the imposed prior. Thus the encoder $q_\lambda(z|x)$ also acts as the generator of the adversarial part, and the adversarial network is exactly equivalent to the GAN's discriminator. At the end of training, the decoder of the autoencoder implements a generative model that maps the imposed prior $p(z)$ to the data distribution. There is a fundamental difference with respect to regular GAN training, where the imposed distribution is the data distribution $p_d(x)$, matched to the generator's output $p_g(x)$.


In AAEs, the actual data distribution is captured by the autoencoder part, and the adversarial training is used only to match a prior distribution on the latent variables. Such a distribution is simpler (usually much simpler than $p_d(x)$) and lower dimensional than the one matched by a GAN, and this results in a better log-likelihood, as shown in [60]. An important consequence of using adversarial regularisation on the latent variables is that VAEs, in order to backpropagate through the KL divergence, need access to the exact functional form of the prior $p(z)$, whereas in AAEs it is only necessary to be able to sample from the chosen prior distribution in order to induce the encoder to match $p(z)$. In practice this means that the AAE can model complex distributions without having access to their explicit functional form [56][60]. The adversarial autoencoder is a very flexible model, and it can solve supervised, semi-supervised and unsupervised tasks with minimal changes to its basic architecture. In particular, the MusÆ model (described in the following chapter) is based on a semi-supervised variant of the standard AAE.
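The two-phase training procedure just described can be summarised in pseudo-code. The sketch below is a generic AAE training step with placeholder components (encoder, decoder, discriminator, sample_prior), written in a PyTorch style; for simplicity it uses the original GAN loss, although, as noted above, any adversarial loss can be substituted.

import torch
import torch.nn.functional as F

def aae_training_step(encoder, decoder, discriminator, x, sample_prior, beta=1.0):
    # One AAE training step (Equation 2.18), split into the two phases
    # described above. Optimiser updates are omitted for brevity.
    # --- reconstruction phase: encoder + decoder minimise the reconstruction error
    z_fake = encoder(x)                      # samples from the aggregated posterior q(z|x)
    x_rec = decoder(z_fake)
    rec_loss = F.binary_cross_entropy(x_rec, x)

    # --- regularisation phase (discriminator): separate prior samples from encoder outputs
    z_real = sample_prior(x.size(0))         # samples from the imposed prior p(z)
    d_real = discriminator(z_real)
    d_fake = discriminator(z_fake.detach())
    disc_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
                 F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

    # --- regularisation phase (encoder as generator): try to fool the discriminator
    d_gen = discriminator(z_fake)
    enc_adv_loss = beta * F.binary_cross_entropy(d_gen, torch.ones_like(d_gen))

    return rec_loss, disc_loss, enc_adv_loss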


Chapter 3

The Model

In this chapter a novel architecture for music modelling and generation is presented. Its name is MusÆ (a reference to the Muses of ancient Greek mythology, the inspirational goddesses of Literature, Science and Art) and its architecture is based on adversarial autoencoders. Adversarial autoencoders offer several advantages over standard variational autoencoders: using adversarial regularisation in place of the KL divergence allows more flexibility in the choice of the prior distribution of the latent variables. In principle, the latent variables can be made to follow any probability distribution, even one with an unknown functional form (for example, in one of their experiments Makhzani et al. trained an adversarial autoencoder whose latent factors are forced to follow a Swiss roll distribution [60]). This constitutes an important advantage of adversarial autoencoders over variational autoencoders, where the imposed prior distribution must necessarily be chosen so as to make the KL divergence term computable [56] (see Equation 2.6). The flexibility of the AAE is also reflected in its architecture, which can be easily modified to adapt to the specific task. For example, it is possible to exploit label information in the data by adding a set of discrete latent variables following a categorical distribution [2]. Such factors can be trained to match the dataset labels through an additional supervised training phase, or they can be left unsupervised, in which case the categorical factors will learn to cluster the training set according to newly discovered discrete features.

MusÆ is capable of reconstructing musical phrases with high accuracy, as well as of interpolating between two phrases in a pleasant-sounding way. It can perform meaningful changes to songs by modifying the latent space factors. To the best of the author's knowledge, this is the first time an adversarial autoencoder has been successfully applied to automatic music generation. The general adversarial autoencoder model has already been illustrated in Section 2.3.3 of the previous chapter. Section 3.1 introduces the data representation used by the model, Section 3.2 describes in detail the several components of which the model is composed, and finally Section 3.3 presents the training procedure of the model.

3.1 Data Representation

The model’s input data are MIDI songs encoded in a piano roll representa-tion (see Secrepresenta-tion 2.2). The model processes the song in chunks of either 2 or 16 bars, depending on the specific dataset, and each bar is represented by a sequence of 16 timesteps sampled at regular intervals. Thus, for 2-bar datasets the training samples are composed of 32 timesteps, while for 16-bar dataset they are composed of 256 timesteps. The timesteps contain the play-ing information for each of the four tracks considered for each song: drums, bass, guitar/piano, strings (for more information about how the tracks have been selected from the original songs and preprocessed, see Section4.1). For a single track, each timestep is represented by a categorical variable, taking one of the 130 binary discrete state:


• One "hold note" state, which indicates to keep playing the note which was played at the previous timestep.

• One "silent note" state, which indicates a timestep when no notes are playing.

Thus, each training sample is represented by an $n_{timesteps} \times n_{notes} \times n_{tracks}$ matrix.
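As an illustration of this encoding, the following NumPy sketch builds the one-hot tensor for a toy 2-bar example. The constants follow the description above, while the helper names and the index assignment of the "hold" and "silent" states are hypothetical choices made only for the example.

import numpy as np

N_NOTES = 130              # 128 MIDI pitches + "hold note" + "silent note"
HOLD, SILENT = 128, 129    # index assignment chosen for this example only
N_TRACKS = 4               # drums, bass, guitar/piano, strings
N_TIMESTEPS = 32           # 2 bars x 16 timesteps per bar

def encode_track(events, n_timesteps=N_TIMESTEPS):
    # Turn a list of per-timestep events (a MIDI pitch, "hold" or "rest")
    # into an (n_timesteps, N_NOTES) one-hot matrix for a single track.
    roll = np.zeros((n_timesteps, N_NOTES), dtype=np.float32)
    for t, event in enumerate(events):
        if event == "hold":
            roll[t, HOLD] = 1.0
        elif event == "rest":
            roll[t, SILENT] = 1.0
        else:                        # an integer MIDI pitch in [0, 127]
            roll[t, event] = 1.0
    return roll

# A training sample stacks the four tracks into an
# (n_timesteps, n_notes, n_tracks) tensor, as stated above.
sample = np.stack([encode_track(["rest"] * N_TIMESTEPS) for _ in range(N_TRACKS)], axis=-1)
assert sample.shape == (N_TIMESTEPS, N_NOTES, N_TRACKS)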

The latent variables $z$ are continuous and follow a normal distribution

$$z \sim \mathcal{N}(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\|z - \mu\|_2^2}{2\sigma^2}\right) \qquad (3.1)$$

where $\mu$ and $\sigma^2$ are the parameters of the prior normal distribution and $\exp$ is the exponential function. Having continuous factors enables the model to encode any subtle variation of the musical pieces. This allows musicians to fine-control their creations and to modify them in a meaningful and principled way. Section 4.4 describes a set of experiments designed to explore this ability of the model.
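For instance, the "real" latent codes that the discriminator compares against the encoder's outputs can be drawn directly from this prior. The sketch below assumes a 256-dimensional standard normal prior: the dimensionality is the one stated in Section 3.2.1, while the zero-mean, unit-variance choice is an assumption made for the example.

import numpy as np

LATENT_DIM = 256   # dimensionality of z used by the encoder (see Section 3.2.1)

def sample_prior(batch_size, mu=0.0, sigma=1.0):
    # Draw "real" latent codes z ~ N(mu, sigma^2) for the discriminator,
    # as in Equation 3.1. A standard normal prior is assumed here.
    return np.random.normal(loc=mu, scale=sigma, size=(batch_size, LATENT_DIM))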

3.2 Architecture

The MusÆ model is based on the adversarial autoencoder architecture (see Figure 2.12), which is composed of two main parts: the autoencoder part and the adversarial part. From an architectural point of view, there are three main components, described in the remainder of this section: the encoder, the decoder and the discriminator. For more information about the training procedure, see Section 3.3.


FIGURE 3.1: Encoder's architecture. The forward and backward parts of the 3-layer bi-LSTM are shown, respectively, in blue and red. All of these layers use tanh activation functions. The linear layers that use the outputs of the biLSTM to generate the parameters of the latent factors' distribution are also shown.

3.2.1 Encoder

The encoder's main task is to take the piano roll representation of a music sequence as input and to transform it into the corresponding latent space representation, that is, a particular combination of latent factors $z$ (see Figure 3.1). In order to do that, it uses 3 layers of bidirectional LSTMs (biLSTM) [80] with 1024 units each and tanh activation functions. Let $x$ be a training sample. Then, the encoder's behaviour can be described by

$$h = \mathrm{biLSTM}(x) \qquad (3.2)$$
$$\mu = W_\mu h \qquad (3.3)$$
$$\sigma_{pre} = W_\sigma h \qquad (3.4)$$
$$\sigma = \exp(\sigma_{pre}) \qquad (3.5)$$

where $W_\mu$ and $W_\sigma$ are the weight matrices of the linear layers used to compute, respectively, $\mu$ and $\sigma_{pre}$. A bidirectional LSTM combines two standard LSTMs: the first one processes the input time series (in this case, the piano roll) forward in time, while the second one processes the same input backward in time, starting from the last timestep. The hidden states of the two LSTMs are then concatenated at each timestep [80]. Thus, at each timestep it is possible to have context information not only about the notes that have already been played, but also about the notes that are going to be played in the future. This is a desirable property in the case of music, since the meaning of a note depends on the whole phrase and is not restricted to the previous notes only. Hence, using a bidirectional LSTM allows the encoder to produce more informative latent variables. To obtain the latent factors $z$, the encoder first generates the mean $\mu$ and the log-variance $\sigma_{pre}$ of the Gaussian distribution from which $z$ is sampled, by attaching two linear layers after the last biLSTM layer. The factors $z$ are then sampled from a 256-dimensional normal distribution with mean $\mu$ and variance $\sigma$.
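A compact sketch of this encoder is given below, written with PyTorch modules purely for illustration: the thesis does not state the framework used, and the way the biLSTM outputs are summarised before the two linear heads (here, the output of the last timestep) is an assumption.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # 3-layer bidirectional LSTM encoder producing mu and sigma_pre
    # (Equations 3.2-3.5) and a sampled latent code z.
    def __init__(self, input_size, hidden_size=1024, latent_dim=256, num_layers=3):
        super().__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
                              batch_first=True, bidirectional=True)
        # linear heads over the concatenated forward/backward states
        self.w_mu = nn.Linear(2 * hidden_size, latent_dim)
        self.w_sigma = nn.Linear(2 * hidden_size, latent_dim)

    def forward(self, x):
        # x: (batch, timesteps, features), the flattened piano roll
        h, _ = self.bilstm(x)                    # (batch, timesteps, 2 * hidden_size)
        h_last = h[:, -1, :]                     # summary of the whole sequence
        mu = self.w_mu(h_last)
        sigma = torch.exp(self.w_sigma(h_last))  # sigma = exp(sigma_pre), Equation 3.5
        z = mu + torch.sqrt(sigma) * torch.randn_like(sigma)  # z ~ N(mu, sigma)
        return z, mu, sigma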

3.2.2 Decoder

The decoder's task mirrors the encoder's: given a latent code $z$ as input, it produces the piano roll corresponding to that particular combination of latent factors (see Figure 3.2). Let $z$ be a latent code. Then, the decoder's behaviour can be described by

$$h^0_{track} = W_{track} \, z \qquad (3.6)$$
$$\tilde{x}_{track} = \mathrm{Softmax}(\mathrm{LSTM}(z, h^0_{track})) \qquad (3.7)$$
$$\tilde{x} = [\tilde{x}_{drums} \,|\, \tilde{x}_{bass} \,|\, \tilde{x}_{guitar} \,|\, \tilde{x}_{strings}] \qquad (3.8)$$

where $track \in \{drums, bass, guitar, strings\}$, $W_{track}$ denotes the weight parameters of the linear layer used to compute the initial hidden state of the corresponding LSTM, $\tilde{x}$ is the final piano roll generated by the decoder, and the symbol $|$ denotes the concatenation operator. The architecture is composed of a 3-layer LSTM with 1024 units and tanh activation functions for each of the four tracks. Having an LSTM for each track allows each track's decoder to specialise on the particularities of that track's generation, at the cost of putting more burden on the encoder (and more computational cost during training). The latent variables $z$ are passed through a linear layer in order to initialise the hidden states and the cell states of the LSTMs, which are then executed for 32 timesteps to produce the final 2-bar piano roll. The concatenated $z$ vector is fed as a constant input to the LSTMs during the entire duration of the generation process. After generation, each timestep is passed through a softmax layer that transforms the LSTM output into a distribution over the note states.
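A corresponding sketch of a single-track decoder branch is shown below. As before, this is a PyTorch-style illustration with placeholder names; details such as how $z$ initialises all three LSTM layers are assumptions, not the thesis' exact implementation.

import torch
import torch.nn as nn

class TrackDecoder(nn.Module):
    # One per-track decoder branch (Equations 3.6-3.7): a linear layer maps z
    # to the initial LSTM states, then a 3-layer LSTM, fed z at every timestep,
    # produces a softmax distribution over the 130 note states.
    def __init__(self, latent_dim=256, hidden_size=1024, n_notes=130,
                 num_layers=3, n_timesteps=32):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.n_timesteps = n_timesteps
        self.init_state = nn.Linear(latent_dim, 2 * num_layers * hidden_size)
        self.lstm = nn.LSTM(latent_dim, hidden_size, num_layers=num_layers,
                            batch_first=True)
        self.out = nn.Linear(hidden_size, n_notes)

    def forward(self, z):
        batch = z.size(0)
        # initialise hidden and cell states from z (Equation 3.6)
        states = self.init_state(z).view(batch, 2 * self.num_layers, self.hidden_size)
        h0 = states[:, :self.num_layers].transpose(0, 1).contiguous()
        c0 = states[:, self.num_layers:].transpose(0, 1).contiguous()
        # z is repeated as a constant input for every timestep
        z_seq = z.unsqueeze(1).expand(-1, self.n_timesteps, -1)
        out, _ = self.lstm(z_seq, (h0, c0))
        return torch.softmax(self.out(out), dim=-1)  # distribution over note states

# The full decoder runs one such branch per track and concatenates the four
# outputs along the track dimension (Equation 3.8).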


FIGURE 3.2: Decoder's architecture. Each of the four tracks follows the same basic architecture. The 3-layer LSTMs with tanh activations are shown in blue. The linear layers used to initialise the LSTM hidden states with $z$ are shown in orange. The final softmax layer is shown in green: this layer transforms the LSTM output into a distribution over the note states.