POLITECNICO DI MILANO

Facoltà di Ingegneria Industriale e dell'Informazione

Corso di Laurea Magistrale in Ingegneria Biomedica

Automatic speech analysis for early detection of functional cognitive impairment in elderly population

Supervisor: Prof. Simona FERRANTE

Co-supervisor: Emilia AMBROSINI, PhD

Master's Thesis by:

Matteo CAIELLI

Student ID: 883607


Acknowledgements

I thank Professor Simona Ferrante, who gave me the opportunity to work on this thesis, which concludes my studies in Biomedical Engineering. A truly heartfelt thanks to Dr. Emilia Ambrosini, who with her great availability and patience supported me throughout this work, which lasted almost two years, and finally through the writing of this document. I thank the staff of the Geriatric Unit of the Fondazione IRCCS Ca' Granda, Ospedale Maggiore Policlinico in Milan, and in particular Domenico Azzolino, for assisting me for many months in the data acquisition. I thank the staff of SignalGenerix for providing the tool needed for the recordings. I also thank all the volunteers who took part in the acquisitions with patience and willingness. Thanks also to Alessia Paglialonga and Sara Moccia, who with their experience and availability helped me with the voice analysis and with my first steps into the complex world of machine learning. Finally, thanks to all the people who stood by me and supported me throughout my studies and the writing of this thesis.


Index

List of Figures
List of Tables

1 Abstract
1.1 Introduction and Aim of the Work
1.2 State of the Art
1.3 Methods
1.4 Results and Discussion
1.5 Conclusions

2 Summary
2.1 Introduction and Aim of the Work
2.2 State of the Art
2.3 Methods
2.4 Results and Discussion
2.5 Conclusions

3 Introduction
3.1 Background
3.2 The MoveCare Project
3.3 Aim of the work

4 State of the Art
4.1 Alterations in expressive prosody in dementia
4.2 The vocal tract
4.3 Acoustic features of the speech to support the diagnosis of dementia
4.3.1 Voice periodicity related features
4.3.2 Glottal pulses related features
4.3.3 Formants related features
4.3.4 Syllables related features
4.4 Classification of different levels of cognitive function based on acoustic features of speech

5 Methods
5.1 Participants and recruitment
5.2 Data collection
5.2.1 Experimental protocol
5.2.1.1 Sentence-repeating task
5.2.1.2 Story-telling tasks
5.2.1.3 Picture description task
5.2.2 Acquisition toolbox
5.3 Data analysis
5.3.1 Pre-processing
5.3.1.1 Manual cut
5.3.1.2 Polarity estimation
5.3.1.3 Standardization
5.3.2 Features extraction
5.3.2.1 Voice periodicity related features
5.3.2.2 Glottal pulses related features
5.3.2.3 Formants related features
5.3.2.4 Syllables related features
5.4 Statistical analysis
5.5 Classification algorithms
5.5.1 Logistic Regression
5.5.2 Support Vector Machines
5.5.3 Random Forests
5.5.4 k-Nearest Neighbors
5.5.5 AdaBoost

6 Results
6.1 Statistics results
6.2 Classification results

7 Discussion and Limitations

8 Conclusions and Future Developments

Appendices
B Geriatric Depression Scale – GDS


List of Figures

3.1 Life expectancy at birth in the Euro Area
3.2 Number of people with dementia in low and middle income countries compared to high income countries
3.3 Forecasted global costs of dementia 2015–2030
4.1 Sagittal section of human vocal tract
4.2 Human speech production system
4.3 Pitch contour extracted from a voice signal
4.4 EGG waveform, showing the glottal opening and closure phases
4.5 Spectrogram of a speech signal with the first five formants superimposed
5.1 Cookie Theft picture description task
5.2 Home page of the acquisition toolbox
5.3 Workflow of the study
5.4 Audacity software interface
5.5 Correction on voiced/unvoiced frames distribution
5.6 Raw and voiced signal
5.8 Spectrogram of a speech signal with the third formant superimposed
6.1 Histograms of the most effective features, all extracted from the positive story task


List of Tables

5.1 Summary of algorithms and extracted features
6.1 Groups' characteristics
6.2 Voice features – Positive Story Task
6.3 Voice features – Negative Story Task
6.4 Voice features – Episodic Story Task
6.5 Voice features – Picture Task
6.6 Classification – Multiclass settings
6.7 Classification – Binary settings (Group 1 vs. Group 2)
6.8 Classification – Binary settings (Group 2 vs. Group 3)
6.9 Classification – Binary settings (Group 1 vs. Group 3)


Chapter 1

Abstract

1.1 Introduction and Aim of the Work

The constant increase in life expectancy and the growth of the elderly population in Western societies pose important challenges for the healthcare system. As age increases, the need and demand for care increase due to the physiological and/or pathological decline of physical and cognitive functions. The ageing of the population goes hand in hand with the fast growth of the number of people with dementia worldwide. In 2018 the number of people affected by this pathology was 50 million, and it is expected to rise to 131.5 million by 2050. Consequently, the costs of direct medical assistance and social care are projected to increase. Indeed, in 2015 the worldwide cost was estimated at US$818 billion, corresponding to 1.09% of global GDP (Gross Domestic Product). In 2018 it reached US$1 trillion, and it is expected to double by 2030.

The transitional stage between physiological ageing and dementia is known as Mild Cognitive Impairment (MCI), a condition characterized by cognitive impairments that are apparent on clinical exam or formal cognitive testing, but that are not yet producing a clinically significant impairment in daily functioning. At present, neuropsychological tests performed in a clinical environment are among the most useful tools for identifying patients with MCI and for differentiating between types of dementia.

Currently, there is no effective treatment for dementia, but current therapies have been shown to be more effective when applied promptly in the early stages of degeneration. Studies have shown that cognitive training started in the pre-dementia stage extends the duration of this phase and of independent living.

In this context, the European MoveCare Project emphasizes the importance of transparent monitoring, dedicated to still-independent elderly people, in order to increase their acceptance of the technology, both to improve daily life and to keep their health status, both physical and cognitive, under control. MoveCare provides a multiple-actors platform to support the elderly in an innovative way in their home environment, and it is composed of different modules, each one fulfilling a specific function.

Within the MoveCare Project, the present study focuses on the early detection of the first signs of decline of the cognitive function based on the analysis of acoustic features of speech. Automatic voice processing and machine learning have proved to be techniques that can be exploited for early detection of cognitive impairment and monitoring of disease progression.

The main objective of this study is to prove that automatic voice analysis of non-linguistic content during spontaneous speech can provide useful information for cognitive assessment and can represent an additional objective assessment tool for diagnosis support purposes. The second and more challenging one is to verify whether it is possible to predict the cognitive level of a subject using machine learning algorithms based only on non-verbal voice features.

1.2 State of the Art

It has been demonstrated that various types of dementia significantly affect human speech and language. Therefore, speech can be considered a source of information for early dementia assessment and diagnosis. Dementia affects speech at two levels: the linguistic level (what is said) and the paralinguistic level (how it is said).

Anomia is often found in the early stages of dementia; its associated vocal characteristics seem related to temporal parameters of speech, such as longer hesitation times and lower speech rates [1]. People affected by Alzheimer's disease (AD) tend to increase the number of pauses needed to find words and the percentage of voiceless segments [2–5].

A significant increase in the percentage and duration of voice breaks has been found in subjects with AD when compared to age-matched control subjects [5, 6].

In [5, 7] the Authors concluded that AD patients produce a higher number of pauses (which also last longer) and have a significantly slower speech rate and articulation rate. The duration of the syllables is also increased, leading to higher values of phonation time.

Evidence for the above can be drawn from the following studies, in which these and other features have been used to discriminate among different levels of cognitive impairment.

In [8] it was possible to distinguish between AD (167) and healthy subjects (97) with a classification accuracy of 81.9%. A large number of features was extracted both from linguistic variables derived from transcripts and from acoustic variables derived from the associated audio files.

Similar results were found in [9], where acoustic features extracted from speech recordings of 64 subjects (15 healthy, 23 MCI, and 26 AD) provided high accuracy rates in classifying healthy versus AD (87%), MCI versus AD (80%), and healthy versus MCI (79%). Even higher classification accuracies were found in a subsequent work by the same group [10].

In [7] the Authors were able to discriminate AD patients (35) and controls (35) with an accuracy of 80%, using features extracted from a test with an oral reading task. In a following study based on the same approach but performed on a total of 127 subjects, the Authors again discriminated AD patients and controls, with an accuracy of 87% [6].

In [5], AD patients were classified correctly with respect to healthy subjects with an accuracy of 84.8%. Audio recordings were collected during the administration of a sentence-reading task.

In [11], using data collected from three tasks (two based on recall and one on spontaneous speech), the Authors were able to separate MCI patients (48) from the control group (38) with an F1 score of 78.8%. The acoustic parameters were extracted from the recorded speech signals, first manually and then automatically, with an Automatic Speech Recognition (ASR) based tool.

It is worth noting that, with the partial exception of [11], in all the previous studies acoustic features were manually extracted using the Praat software. Moreover, this is the first time a study of such an extent has been performed on Italian samples. It is also important to note the value of analyzing free speech, since it is the best approximation of daily-life conversations.

1.3 Methods

Voice data were collected from a sample of 153 people (aged over 65 years), who were recruited and divided into three groups based on the Mini-Mental State Examination (MMSE) score:

• Group 1: 62 subjects with MMSE > 26
• Group 2: 46 subjects with 20 ≤ MMSE ≤ 26
• Group 3: 45 subjects with 10 < MMSE < 20

The subjects, after receiving an explanation of the purpose of the interview and signing an informed consent, complied with the following recording protocol:

• a sentence-repeating task, in which they were asked to read 10 sentences

• three story-telling tasks (positive, negative, and episodic), in which they were asked to tell three stories in an uninterrupted way for two minutes each

• a picture description task, in which they were asked to describe the Boston Diagnostic Aphasia Examination picture in an uninterrupted way for two minutes

The recording protocol was approved by the ethical committee of the Fondazione IRCCS Ca' Granda, Ospedale Maggiore Policlinico in Milan, Italy, where the recruitment took place.

The three spoken tasks were recorded for the subsequent extraction of voice features, as previously stated in the informed consent, by using an ad-hoc toolbox developed by SignalGenerix LTD. An external USB microphone was used to record the voice. The sampling frequency was set to 16 kHz.

For each spoken task, the voice recordings were processed in four steps. The first step consisted of pre-processing the voice signals, applying the following procedure: a manual cut to remove impulse noises, speech polarity detection, and standardization.
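As an illustration of this step, the following Matlab sketch applies a minimal version of the last two operations. It is only a plausible reconstruction, not the thesis's actual code: the skewness-based polarity check is one common heuristic among several, and zero-mean, unit-variance scaling is assumed as the standardization.

    % Minimal pre-processing sketch (illustrative reconstruction).
    % x is a mono voice signal recorded at 16 kHz, as in the protocol.
    % skewness requires the Statistics and Machine Learning Toolbox.
    function y = preprocess_voice(x)
        x = x(:);                 % force column vector
        x = x - mean(x);          % remove DC offset
        % Polarity heuristic (assumption): speech with positive polarity
        % tends to have positively skewed samples; flip the sign otherwise.
        if skewness(x) < 0
            x = -x;
        end
        y = x / std(x);           % standardize to unit variance
    end

The manual cut of impulse noises is not reproduced here, since it was performed by hand on the recordings.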

In the second step, voice features were extracted using speech signal processing techniques. Specifically, the following voice features were computed: percentage of unvoiced signal; mean value of the fundamental frequency; mean, median, and 15th and 85th percentiles of the durations of voiced (periodic) and unvoiced (aperiodic) segments; shimmer; percentage of voice breaks; standard deviation of the third formant; speech rate and articulation rate; phonation percentage; mean duration of syllables and pauses. The data analysis to extract the voice features was performed in Matlab.

In the third step, a subset of the most relevant voice features was selected based on statistical tests that assessed their power to distinguish the different groups. A linear mixed model analysis was carried out on each of the previously extracted voice features, with age and years of education included as covariates in the model; group, age, and years of education were entered as fixed effects and the voice feature as the dependent variable. When the overall p-value across all groups was found to be significant (i.e. <0.05), Post Hoc comparisons with Bonferroni correction between groups (1 vs. 2, 2 vs. 3, 1 vs. 3) were also performed.
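The statistical screening just described can be expressed compactly with Matlab's fitlme. The sketch below is one plausible formulation under stated assumptions: the input file, the variable names (feature, group, age, edu, subject), and the random-intercept structure are illustrative, not taken from the thesis.

    % Hypothetical long-format table: one row per subject and task,
    % with columns feature, group, age, edu, subject (assumed names).
    T = readtable('voice_features.csv');
    T.group = categorical(T.group);
    T.subject = categorical(T.subject);

    % Group, age and education as fixed effects on the voice feature;
    % a per-subject random intercept is one plausible structure.
    lme = fitlme(T, 'feature ~ group + age + edu + (1|subject)');
    disp(anova(lme));   % overall p-values of the fixed effects

    % If the group term is significant (p < 0.05), pairwise post hoc
    % contrasts with Bonferroni correction would follow, e.g. via
    % coefTest on the group contrasts.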

It must be noted that in this phase and in the previous one, only the acoustic features extracted from the free speech tasks (the 3 stories and the picture) were used. Features extracted from the sentence-repeating task were used to test the algorithm in the early phases of the work.

The last step consisted in the training and validation of automatic classifiers for determining the class to which a given set of voice features, representing a subject, most probably belongs. Two datasets were considered: the first one including only the voice features that were found to be significantly different among the three groups, and a second one adding age and years of education. Different state-of-the-art classification algorithms developed to solve supervised learning problems (AdaBoost, k-Nearest Neighbors, Logistic Regression, Random Forests, Support Vector Machines) were evaluated, implementing them both in multiclass and in binary classification. In both cases, the classifiers were applied both on the single tasks separately and on all tasks together. Validation was performed using the Leave-One-Out cross-validation (LOOCV) approach.
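A minimal LOOCV loop for one of the binary settings might look as follows in Matlab. This is a sketch under assumptions: X (one row of selected features per subject) and y (group labels) are presumed already built, and fitcsvm stands in for any of the evaluated learners (fitcknn, fitcensemble, etc. would be drop-in replacements).

    % Leave-one-out cross-validation sketch for a binary setting.
    n = size(X, 1);
    correct = false(n, 1);
    for i = 1:n
        train = true(n, 1);
        train(i) = false;                       % hold out subject i
        mdl = fitcsvm(X(train, :), y(train), 'Standardize', true);
        correct(i) = isequal(predict(mdl, X(i, :)), y(i));
    end
    accuracy = mean(correct);                   % LOOCV accuracy

With only 153 subjects, LOOCV makes the most of the available data, at the cost of a higher variance of the accuracy estimate.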


1.4 Results and Discussion

From the analysis of the characteristics of the recruited subjects, significant differences in terms of age, years of education, and MMSE scores were found between the three groups, with people with a lower MMSE score being characterized by fewer years of schooling and an older age, which is consistent with the literature.

Concerning the acoustic features, as seen in the previous studies, a significant increase in the unvoiced percentage, in the mean and 85th percentile of the durations of unvoiced segments, and in the percentage of voice breaks, as well as a significant decrease in speech rate and phonation percentage, was found across all the tasks.

Regarding classification, the learners that generally obtained the best results were Logistic Regression and Support Vector Machines. In the multiclass case, the best performance was achieved by Logistic Regression using the features from all tasks together, with an accuracy of 67%. In the comparison between Groups 1 and 2, the best result was obtained by Support Vector Machines in the Picture task, with an accuracy of 81%; in the comparison between Groups 2 and 3, it was achieved by Logistic Regression in the setting with features from all the tasks, with an accuracy of 68%; lastly, in the comparison between Groups 1 and 3, it was scored again by Logistic Regression in the Positive Story task, with an accuracy of 94%.

The setting of greatest interest is the binary comparison between Groups 1 and 2, since it represents the transition from healthy subjects to subjects with mild cognitive impairment. In this case, an overall accuracy of 75% was found when the significant voice features were used alone. The accuracy increased up to 81% when age and years of education were included.

This performance is only slightly lower than the accuracy in classifying healthy subjects versus MCI obtained in [10]; however, in that study the Authors pointed out the importance of choosing tasks that require sufficient cognitive effort in order to detect early signs of cognitive decline in the voice, which goes against one of the main objectives of this thesis, i.e. keeping the monitoring completely transparent to the subject and applying it in daily life. Also in [11] the focus was placed on the distinction between MCI and healthy subjects. The 81% achieved in this study is slightly higher than the F1 scores presented in their work. The use of an ASR-based tool is an interesting approach, but it does not permit the development of a language-independent tool.

The main limitation of this study is the heterogeneity of the population groups, which significantly differed in terms of age and years of education, with more cognitively impaired subjects being older and having a lower education level. Another limitation is the limited size of the sample, which led to the adoption of LOOCV. Lastly, a possible limitation could lie in the use of the MMSE, which is not very specific and, moreover, has been shown to be highly affected by age and education level.

1.5 Conclusions

The results presented in this thesis showed that voice features extracted from continuous free speech have a good capability to discriminate between healthy controls and subjects with a mild impairment of the cognitive function.

Given the progressive nature of cognitive decline, longitudinally recording the voice signals of subjects over time is crucial for the purposes of early detection of cognitive decline, and the development of a mobile app, which could be used on a day-by-day basis by the elderly in a transparent and non-intrusive manner, might favor longitudinal assessment.

This study, although with its limitations, represents a first step towards such development and acts as a proof of concept in this direction.

Future work should further explore differences among the groups (especially 1 and 2) on a larger sample size and investigate the extension of this framework to other languages, in order to define a language-independent toolkit that might support the detection of early signs of functional cognitive decline.

Part of the results obtained in this thesis have been submitted to the 41st Engineering in Medicine and Biology Conference (EMBC) 2019, Berlin, Germany.


Chapter 2

Summary

2.1 Introduction and Aim of the Work

The constant increase in life expectancy and the growth of the elderly population in Western societies pose a difficult challenge for the healthcare system. As age increases, the need and demand for care increase due to the physiological and/or pathological decline of physical and cognitive functions.

This ageing of the population goes hand in hand with the rapid increase in the number of people affected by dementia worldwide. In 2018 the number of people affected by this pathology was 50 million, and it is expected to rise to 131.5 million by 2050. Consequently, an increase in the costs of direct medical assistance and social care is foreseen. Indeed, in 2015 the worldwide cost was estimated at US$818 billion, corresponding to 1.09% of global GDP (Gross Domestic Product). In 2018 it reached US$1 trillion, and it will double by 2030.

The transitional stage between physiological ageing and dementia is known as Mild Cognitive Impairment (MCI), a condition characterized by cognitive impairments that are apparent on clinical exam or formal cognitive testing, but that are not yet producing a clinically significant impairment in daily functioning. At present, neuropsychological tests performed in a clinical environment are the most useful tools for identifying patients with MCI and for differentiating between types of dementia.

Currently, there is no effective treatment for dementia, but today's therapies have been shown to be more effective in the early stages. Studies have shown that cognitive training started in the pre-dementia stage extends the duration of this phase and of independent living.

In this context, the European MoveCare project stresses the importance of transparent monitoring, dedicated to still-independent elderly people, in order to increase their acceptance of the technology, both to improve daily life and to keep their health status, both physical and cognitive, under control. MoveCare offers a multiple-actors platform to support the elderly in an innovative way in their home environment, and it is composed of several modules, each dedicated to a specific function.

Within the MoveCare project, the present study focuses on the early detection of the first signs of decline of the cognitive function based on the analysis of acoustic features of speech. Automatic voice processing and machine learning are techniques that have been shown to be exploitable for the early diagnosis of cognitive impairment and the monitoring of disease progression.

The main objective of this study is to demonstrate that automatic voice analysis of non-verbal content during spontaneous speech can provide useful information for cognitive assessment and can represent an additional objective assessment tool for diagnostic support purposes. The second and more challenging aim is to verify whether it is possible to predict the cognitive level of a subject using machine learning algorithms based only on non-verbal voice features.

2.2 State of the Art

It has been demonstrated that various types of dementia significantly affect human language. Therefore, speech can be considered a source of information for the early diagnosis of dementia. Dementia affects speech at two levels: the linguistic level (what is said) and the paralinguistic level (how it is said).

Anomia is often found in the early stages of dementia; its associated vocal characteristics seem linked to temporal parameters of speech, such as longer hesitation times and lower speech rates [1]. People with AD (Alzheimer's disease) tend to increase the number of word-finding pauses and the percentage of voiceless segments [2–5].

A significant increase in voice breaks and in the duration of voiced segments has been found in subjects with AD compared to age-matched control subjects [5, 6].

In [5, 7] the Authors concluded that AD patients produce a higher number of pauses (which also last longer) and have a significantly slower speech rate and articulation rate. The duration of syllables also increases, leading to higher values of phonation time.

Evidence for the above can be drawn from the following studies, in which these and other characteristics have been used to discriminate between different levels of cognitive impairment.

In [8] it was possible to distinguish between AD (167) and healthy (97) subjects with a classification accuracy of 81.9%. A large number of features was extracted both from linguistic variables obtained from transcripts and from acoustic variables of the associated audio files.

Similar results were found in [9], where acoustic features extracted from speech recordings of 64 subjects (15 healthy, 23 MCI, and 26 AD) provided high accuracy values in classifying healthy versus AD (87%), MCI versus AD (80%), and healthy versus MCI (79%). Even higher accuracies were found in a subsequent work by the same group [10].

In [7] the Authors were able to discriminate AD patients (35) and healthy subjects (35) with an accuracy of 80%, using features extracted from a test with an oral reading task. In a subsequent study based on the same approach but performed on a total of 127 subjects, the Authors again discriminated AD patients and healthy subjects, with an accuracy of 87% [6].

In [5], AD patients were correctly classified with respect to healthy subjects with an accuracy of 84.8%. The audio recordings were collected during the administration of a sentence-reading test.

In [11], using data collected from three tasks (two based on recall and one on spontaneous speech), the Authors were able to separate MCI patients (48) from the control group (38) with an F1 score of 78.8%. The acoustic features were extracted from the recorded voice signals, first manually and then automatically, with a tool based on Automatic Speech Recognition (ASR).

It is important to stress that, with the partial exception of [11], in all the previous studies the acoustic features were extracted manually using the Praat software. Besides that, this is the first time that a study of such an extent has been performed on Italian samples. Note also the importance of analyzing free speech, since it is the best approximation of daily-life conversations.

2.3 Methods

Voice data were collected from a sample of 153 people (aged over 65 years), recruited and divided into three groups based on the score obtained in the Mini-Mental State Examination (MMSE):

• Group 1: 62 subjects with MMSE > 26
• Group 2: 46 subjects with 20 ≤ MMSE ≤ 26
• Group 3: 45 subjects with 10 < MMSE < 20

After receiving an explanation of the purpose of the interview and signing an informed consent, the subjects complied with the following recording protocol:

• a sentence-repeating task, in which they were asked to read 10 sentences

• three story-telling tasks (positive, negative, and episodic), in which they were asked to tell three stories in an uninterrupted way for two minutes each

• a picture description task, in which they were asked to describe the Boston Diagnostic Aphasia Examination picture in an uninterrupted way for two minutes

The recording protocol was approved by the ethical committee of the Fondazione IRCCS Ca' Granda, Ospedale Maggiore Policlinico in Milan, Italy, where the recruitment took place.

The three tasks were recorded for the subsequent extraction of acoustic features, as previously stated in the informed consent, using an ad-hoc toolbox developed by SignalGenerix LTD. An external USB microphone was used to record the voice. The sampling frequency was set to 16 kHz.

For each task, the recordings were processed in four steps. The first step consisted of pre-processing the voice signals, applying the following procedure: a manual cut to remove impulse noises, the polarity check, and standardization.

In the second step, the acoustic features were extracted using several voice signal processing techniques. Specifically, the following features were computed: percentage of unvoiced signal; mean value of the fundamental frequency; mean, median, and 15th and 85th percentiles of the durations of voiced (periodic) and unvoiced (aperiodic) segments; shimmer; percentage of voice breaks; standard deviation of the third formant; speech rate and articulation rate; phonation percentage; mean duration of syllables and pauses. The data analysis to extract the features was performed in Matlab.

In the third step, a subset of the most relevant acoustic features was selected based on statistical tests that assessed their capability to distinguish the different groups. A linear mixed model analysis was carried out on each of the previously extracted features, with age and years of education included as covariates in the model; group, age, and years of education were added as fixed effects and the feature as the dependent variable. When the overall p-value across all groups was significant (i.e. <0.05), Post-hoc comparisons with Bonferroni correction between the pairs of groups (1 vs. 2, 2 vs. 3, 1 vs. 3) were also performed.

It should be noted that in this phase and in the previous one, only the acoustic features extracted from the free speech tasks (i.e. the 3 stories and the picture) were used. The features extracted from the sentence-repeating task were used to test the algorithm in the early phases of the work.

The last step consisted in the training and validation of automatic classifiers to determine the most probable class to which a given set of features representing a subject belongs. Two datasets were considered: the first including only the features that were found to be significantly different among the three groups, and a second with age and years of education added. Several classification algorithms developed to solve supervised learning problems (AdaBoost, k-Nearest Neighbors, Logistic Regression, Random Forests, Support Vector Machines) were evaluated, implementing them both in multiclass and in binary classification. In both cases, the classifiers were applied both on the single tasks separately and on all tasks aggregated. Validation was performed using the Leave-One-Out Cross-Validation (LOOCV) approach.

2.4 Results and Discussion

From the analysis of the characteristics of the recruited subjects, significant differences in terms of age, years of education, and MMSE score were found between the three groups, with people with a lower MMSE score characterized by fewer years of schooling and a more advanced age, which is in line with the literature.

Regarding the features, as seen in previous studies, a significant increase in the unvoiced percentage, in the mean and 85th percentile of the durations of unvoiced segments, and in the percentage of voice breaks, as well as a significant decrease in speech rate and phonation percentage, was found across all the tasks.

Regarding classification, the techniques that generally obtained the best results were Logistic Regression and Support Vector Machines. In the multiclass case, Logistic Regression achieved the best performance using all the tasks together, with an accuracy of 67%. In the comparison between Groups 1 and 2, the best result was obtained by Support Vector Machines in the picture task, with an accuracy of 81%; in the comparison between Groups 2 and 3, it was achieved by Logistic Regression in the setting with features from all the tasks, with an accuracy of 68%; finally, in the comparison between Groups 1 and 3, it was again obtained by Logistic Regression in the positive story task, with an accuracy of 94%.

The setting of greatest interest is the binary comparison between Groups 1 and 2, since it represents the transition from healthy subjects to subjects with MCI. In this case, an overall accuracy of 75% was found when only the significant features were used. The accuracy increased up to 81% when age and years of education were included. This performance is only slightly lower than the accuracy in classifying healthy subjects versus subjects with MCI obtained in [10]; however, in that study the Authors stressed the importance of choosing tasks that require sufficient cognitive effort in order to detect the first signs of cognitive decline, which goes against one of the main objectives of this thesis, i.e. keeping the monitoring completely transparent to the subject and applying it in daily life.

Also in [11] the focus was placed on the distinction between healthy subjects and subjects with MCI. The 81% reached in this study is slightly higher than the F1 scores presented in their work. The use of an ASR-based tool is an interesting approach, but it does not allow the development of a language-independent tool.

The main limitation of this study is the heterogeneity between the groups, which differed significantly in terms of age and years of education, with more cognitively impaired subjects being older and having a lower education level. Besides this, there is the limited size of the sample, which led to the use of LOOCV. Finally, a possible limitation could lie in the use of the MMSE, which is not a very specific test and, moreover, as shown by several studies, is strongly affected by age and education level.


2.5 Conclusions

The results presented in this thesis showed that the acoustic features extracted from free speech have a good capability to discriminate between healthy subjects and subjects with a mild impairment of the cognitive function.

Given the progressive nature of cognitive decline, the longitudinal recording of subjects' voice signals over time is crucial for the early diagnosis of cognitive decline and for the development of a mobile application, which could be used day by day by the elderly in a transparent and non-intrusive manner.

This study, although with its limitations, represents a first step towards such development and stands as a proof of concept in this direction.

Future developments should further explore the differences between the groups (in particular 1 and 2) on a larger sample and investigate the extension of this tool to other languages, in order to define a language-independent tool that could support the detection of the first signs of cognitive decline.

Part of the results obtained in this thesis have been submitted to the 41st Engineering in Medicine and Biology Conference (EMBC) 2019 in Berlin, Germany.


Chapter 3

Introduction

3.1 Background

The constant increase in life expectancy and the growth of the elderly population in Western societies (Figure 3.1 – source: OECD 2016), and particularly in Italy and Spain, pose important challenges for the healthcare system. As age increases, the need and demand for care increase due to the physiological and/or pathological decline of physical and cognitive functions.

Older people usually desire to continue living in their homes, even when they are alone. Despite this wish, among many other psychological and contextual factors, receiving constant care seems to be significantly connected to a low quality of life [12]. On the other hand, [13–17] illustrate that elders are generally open to robot assistance, although their acceptance is affected by the task (it is usually higher for help with chores and information management than for tasks related to personal care and leisure time [18]). These findings suggest that cohabitation with a robotic caregiver could help maintain not only independence in the management of their daily lives, but also a good quality of life.

Figure 3.1: Life expectancy at birth in the Euro Area

In this context, the so-called Information and Communication Technologies (ICTs) can play a very important role. ICTs have the possibility to support elders in their everyday life, for example by helping them stay in touch with relatives and friends or by giving assistance in activities related to social inclusion, and thus to maintain their health and social care [19].

The ageing population phenomenon goes hand in hand with the fast growth of the number of people with dementia worldwide (Figure 3.2 – source: World Alzheimer Report 2015). In 2018 the number of people affected by this pathology was 50 million, and it is expected to rise to 131.5 million (of which more than two-thirds in developing countries) by 2050: a new case of dementia arises somewhere in the world every 3 seconds.

Figure 3.2: Number of people with dementia in low and middle income countries compared to high income countries

Dementia also has huge economic implications in terms of direct medical and social care costs. In 2015, worldwide, the cost was estimated at US$818 billion, corresponding to 1.09% of global GDP (Gross Domestic Product). In 2018 it reached US$1 trillion, and it will double by 2030 (Figure 3.3 – source: World Alzheimer Report 2015) [20].

Figure 3.3: Forecasted global costs of dementia 2015-2030

The transitional stage between physiological ageing and dementia is known as Mild Cognitive Impairment (MCI). MCI is a condition characterized by cognitive impairments that are apparent on clinical exam or formal cognitive testing, but that are not yet producing a clinically significant impairment in daily functioning [21]. The prevalence of both MCI and dementia is quite high, and MCI has been shown to be the strongest predictor for developing dementia. At present, neuropsychological tests performed in a clinical environment are among the most useful tools for identifying patients with MCI and for differentiating between types of dementia.

At present, there is no effective treatment for dementia, but current therapies (pharmacological and non-pharmacological) have been shown to be more effective in the early stages of dementia, when everyday functioning is not affected [22]. Studies have shown that cognitive training started in the pre-dementia stage extends the duration of this phase and, as a consequence, the duration of independent living [23]. Therefore, accurate tools for the early detection of signs of cognitive decline have become a primary need in medical research.

3.2 The MoveCare Project

MoveCare (Multiple-actOrs Virtual Empathic CARgiver for the Elder) [24] is an EU-funded project that started at the beginning of 2017. It develops and field-tests an innovative multi-actor platform that supports the independent living of the elder at home by monitoring, assisting, and promoting activities to counteract decline and social exclusion.

MoveCare comprises three hierarchical layers:

1. The first layer endows objects of everyday use with advanced processing capabilities and integrates them into a distributed pervasive monitoring system, both physical and cognitive, to derive degradation indexes linked to decline.

2. A context-aware Virtual Caregiver, embodied in a service robot, is the core layer. It uses artificial intelligence and machine learning to propose to the elder a personalized mix of physical/cognitive/social activities as exergames. It evaluates the elder's status, detects risky conditions, sends alerts, and assists in critical tasks and in therapy and diet adherence.

3. The users' community strongly promotes socialization, acting as a bridge towards the elder's ecosystem: other elders, clinicians, caregivers, and family.

The target users of MoveCare are elderly people (aged over 65 years) living alone, who are considered pre-frail and therefore in an initial state of vulnerability, with a high risk of losing their autonomy.

3.3 Aim of the work

Within the MoveCare project, the present study is focused on the early detection of the first signs of decline of the cognitive function based on the analysis of acoustic features of speech.

Based on the evidence in [25–27], automatic voice processing and machine learning techniques can be exploited for early detection of cognitive impairment and monitoring of disease progression.


The majority of the studies conducted in this context performed automatic voice analysis on controlled spoken tasks, for example reading aloud or repetition of semantic units [3–7, 11], and few dealt with spontaneous speech or conversations [8, 11, 28, 29]. The MoveCare project aims to overcome this shortcoming by developing a mobile application able to perform automatic voice analysis on-the-fly during phone conversations, in order to provide a valid, complementary, and 'transparent' method for the identification of the first signs of decline of the cognitive function.

According to these premises, the main objective of this thesis is to demonstrate that automatic voice analysis of non-linguistic content during spontaneous speech can provide useful information about the cognitive function and can represent an additional objective assessment tool, with the final aim of supporting a diagnosis of dementia.

The second and more ambitious objective is to verify whether it is actually possible to predict the cognitive level of a subject using machine learning algorithms based on those voice features alone.

In order to meet these objectives, the first step was the recruitment of 153 subjects divided into 3 groups depending on the level of cognitive decline. Then, after a review of the current literature, the most promising acoustic features were identified and automatically extracted thanks to the development of a Matlab-based algorithm. Of these features, the most significant ones were singled out through statistical analysis and then fed to different machine learning techniques in order to perform classification.

Part of the results obtained in this thesis have been submitted to the 41st Engineering in Medicine and Biology Conference (EMBC) 2019, Berlin, Germany.

Chapter 4

State of the Art

4.1 Alterations in expressive prosody in dementia

It has been demonstrated that various types of dementia significantly affect human speech and language.

A typical patient affected by a mild language impairment shows normal fluency (i.e. the temporal cycles during spontaneous speech production) and articulation (the capacity to utter a word clearly), but a loss in naming and comprehension capabilities. Operations at the syntactic level tend to be preserved, but they are lost at the semantic one. Regarding syntax, the rating of fluency takes into account such variables as sentence length and structure, besides the use of grammatical modifiers and connecting words. Overall, high fluency, low naming, and low information fit well with the clinical description of the speech of subjects affected by Alzheimer's disease as circuitous and verbose, yet empty and lacking meaningful content. The language impairment in Alzheimer's disease (AD) is part of a more pervasive cognitive disorganization.

Therefore, since language impairment is one of the first symptoms to develop, speech can be considered a source of information for early dementia assessment and diagnosis. Dementia affects speech at two levels: the linguistic level (what is said) and the paralinguistic level (how it is said).

Anomia, i.e. the inability to name objects or to recognize the written or spoken names of objects, is a consistently found language abnormality in the early stages of dementia which affects speech fluency. Particularly, in the early phase of dementia, the associated vocal characteristics seem related to temporal parameters of speech, notably longer hesitation times and lower speech rates [1]. People with AD tend to show decreased speech fluency, with an increased number and decreased length of voice segments [30], as well as an increased number of word-finding pauses and percentage of voiceless segments [2–5]. Rhythm alterations in people with moderate dementia of the Alzheimer's type have also been highlighted in terms of significantly less pitch modulation with respect to age-matched controls [31]. Other measured changes are related to the control of amplitude variability, or sound intensity variability (i.e. the shimmer value), which is expected to be lower in AD patients [32].

A significant increase in the percentage and duration of voice breaks has been found in subjects with AD when compared to age-matched control subjects [5, 6].

In [32], it is shown that the oscillations of the third formant present significant changes in people with AD, changes which are absent in control and MCI subjects.


Different studies have explicitly concluded that AD patients produce a higher number of pauses (which also last longer) and have a significantly slower speech rate and articulation rate, in both spontaneous speech and reading tasks, than asymptomatic older adults without a diagnosis of dementia and matched for age. The duration of the syllables is also increased, leading to higher values of phonation time [5, 7].

Voice features are also strongly influenced by the emotional state of the speaker [33]. For example, lower values of mean pitch are measured when subjects express sadness [34]. Therefore, it is important to monitor the emotional state of the subject when automatic voice analysis is used to assess cognitive decline, in order to reduce confounding factors.

4.2 The vocal tract

Since this work is focused on the analysis of voice signals, a brief introduction about how and where voice is generated is needed.

The vocal tract is a container of air starting from the top of the vocal folds and going all the way up to the edge of the lips (an anatomical representation of all its parts can be seen in Figure 4.1 – source: Gray's Anatomy, 1918 edition). It acts as a resonator and can also be a filter for all the parts of the sound that the vocal folds produce. All sounds created by the vocal folds must pass through the vocal tract before they can be heard; the vocal folds' raw sound cannot be heard on its own.

Almost every instrument has a resonator, which can be any container of air that amplifies part of the vibrations. The voice has a unique resonator, since it can actively change its shape. There are many muscle groups that can change the size and the shape of the vocal tract. Some of them make the vocal tract longer or shorter, give it a wider opening, or firm up its walls. Every time the size and shape change anywhere along the container of air, the change has a drastic effect on which parts of the original sound are amplified or filtered out, changing the type of sound that is produced at the output of the vocal tract.

In this sense, the vocal tract can be seen as a column of air, but actually it is composed of hundreds of air columns stacked on top of one another, each of them with its own individual pitch, i.e. the relative highness or lowness of a tone as perceived by the ear (the concept of pitch will be further explained in Subsection 4.3.1). If the vocal tract creates a shape such that the pitch of the air boosts the pitch being produced, the vocal folds can work much more efficiently. The vocal tract can boost the sound or fight it, in both cases drastically affecting the vibrations that the vocal folds are able to create.

The density of the walls of a resonator also determines the frequencies that are brought out of the sound. If the walls are soft, the energy will be spread out over many frequencies. If the walls are harder, more energy will be given to a few frequencies.


A typical speech sentence signal consists of two main parts: one carries the speech information, and the other includes silent or noise sections that are between the utterances, without any verbal information.

The verbal (informative) part of speech can be further divided into two categories: voiced speech and unvoiced speech. Voiced speech consists mainly of vowel sounds. It is produced by forcing air through the glottis; proper adjustment of the tension of the vocal cords results in their opening and closing, and in the production of almost periodic pulses of air. These pulses excite the vocal tract. Psychoacoustic experiments show that this part holds most of the information of the speech and thus holds the keys for characterizing a speaker. Unvoiced speech sections are generated by forcing air through a constriction formed at a point in the vocal tract (usually toward the mouth end), thus producing turbulence. Being able to distinguish between the three classes (silence, voiced, and unvoiced) is very important for speech signal analysis.
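As a toy illustration of this three-way distinction, the following Matlab sketch labels 30 ms frames using short-time energy and zero-crossing rate. The thresholds are ad-hoc assumptions for a standardized 16 kHz signal, and the detector actually used in this work is not reproduced here.

    % Classify frames as silence / voiced / unvoiced (illustrative only).
    function labels = classify_frames(x, fs)
        frameLen = round(0.030 * fs);            % 30 ms frames
        hop      = round(0.010 * fs);            % 10 ms hop
        nFrames  = floor((length(x) - frameLen) / hop) + 1;
        labels   = strings(nFrames, 1);
        for k = 1:nFrames
            f   = x((k-1)*hop + (1:frameLen));
            e   = mean(f.^2);                    % short-time energy
            zcr = mean(abs(diff(sign(f)))) / 2;  % zero-crossing rate
            if e < 1e-4
                labels(k) = "silence";           % low energy
            elseif zcr < 0.1
                labels(k) = "voiced";            % periodic, few crossings
            else
                labels(k) = "unvoiced";          % noise-like, many crossings
            end
        end
    end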

The system presented in Figure 4.2 schematizes how human speech is produced.


4.3 Acoustic features of the speech to support the diagnosis of dementia

An extensive literature review about acoustic features able to discriminate between subjects with different levels of cognitive impairment was performed. The acoustic features which have been shown to have the most discriminative power can be divided into four major categories, depending on the voice characteristics they are based on:

• Voice periodicity related
• Glottal pulses related
• Formants related
• Syllables related

4.3.1 Voice periodicity related features

Pitch, in speech, is the relative highness or lowness of a tone as perceived by the ear, which depends on the vibration frequency of the vocal cords.

Pitch tracking refers to the task of estimating the contour of the fundamental frequency F0, i.e. the harmonic corresponding to the lowest frequency present in the voice signal. Such a system is of particular interest in several applications of speech processing, such as speech coding, analysis, synthesis, or recognition [35], since the detection of periodic elements throughout the speech signal is associated with the identification of its voiced parts.
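A textbook way to detect such periodicity is the autocorrelation method, sketched below in Matlab for a single frame. The 60–300 Hz search band and the 0.3 voicing threshold are generic assumptions for adult speech, not parameters taken from this work; the frame is assumed to be at least about 40 ms long so that the longest candidate period fits inside it.

    % Autocorrelation-based F0 estimate for one frame (illustrative).
    function f0 = estimate_f0(frame, fs)
        frame = frame - mean(frame);
        [r, lags] = xcorr(frame, 'coeff');       % normalized autocorrelation
        r = r(lags >= 0);                        % keep non-negative lags
        minLag = floor(fs / 300);                % upper F0 bound: 300 Hz
        maxLag = ceil(fs / 60);                  % lower F0 bound: 60 Hz
        [peak, idx] = max(r(minLag+1:maxLag+1)); % strongest periodic peak
        if peak > 0.3                            % crude voicing decision
            f0 = fs / (minLag + idx - 1);
        else
            f0 = NaN;                            % frame treated as unvoiced
        end
    end

Frames where no sufficiently strong peak is found are treated as unvoiced, which is how pitch tracking and the voiced/unvoiced segmentation are tied together.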


A typical adult male has a fundamental frequency in the range from 85 to 155 Hz, while that of a typical adult female ranges from 165 to 255 Hz. Children and babies have even higher fundamental frequencies.

Figure 4.3 shows the pitch contour extracted using the PRAAT software (www.praat.org, [36]) superimposed on the voice signal itself (the subject was a 74-year-old woman saying 'Sul qui e sul qua l'accento non va').

Starting from the tracking of the pitch contour throughout the speech signal, the division into voiced and unvoiced segments can be performed. From the extraction of these parts and the estimation of F0, many features can be computed. First of all, the percentage of unvoiced segments, which refers to the ratio between the amount of speech signal without periodic nature and the total length of the signal. The mean pitch is the mean value F0 assumes throughout all the voiced parts of the signal. Then there are the mean, median, and 15th and 85th percentiles of the duration of voiced and unvoiced segments. Lastly, the shimmer is a quantified measurement describing the random cycle-to-cycle temporal changes of the amplitude of the vocal fold vibration. High values of shimmer contribute to the perception of a rough or harsh voice quality. It is considered an indicator of dysphonia: the smaller the value, the better the voice quality.
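Of these features, shimmer is the least self-explanatory: in its local form it is simply the mean absolute amplitude difference between consecutive vocal-fold cycles, relative to the mean amplitude. A minimal Matlab sketch, assuming a vector A of per-cycle peak amplitudes has already been obtained from a prior cycle segmentation:

    % Local shimmer from per-cycle peak amplitudes A (assumed given).
    shimmer = mean(abs(diff(A))) / mean(A);  % relative measure
    shimmer_pct = 100 * shimmer;             % commonly reported in percent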


Figure 4.3: Pitch contour extracted from a voice signal

4.3.2 Glottal pulses related features

Glottal pulse is a term used in linguistics to describe the variances in voice quality caused by the manipulation of the vocal folds when speaking. In terms of mechanics, a glottal pulse is produced by a flap of tissue in the region of the vocal cords and the gap between them, which are jointly referred to as the glottis. The frequency produced in the glottal pulse results from the vibration of the vocal cords resonating against the larynx. This creates a buzz or hum that gives a distinctive quality to the voice of each individual.

The practical usefulness of determining glottal pulses primarily lies in gaining an understanding of how speech is processed.

Figure 4.4 (source: [37]) shows an electroglottography (EGG) trace, in which the phases of opening and closure of the glottis are highlighted. The intervals elapsing between the openings of the glottis (Glottal Opening Instants – GOIs) and the subsequent closures (Glottal Closure Instants – GCIs) are referred to as glottal periods.

Figure 4.4: EGG waveform, showing the glottal opening and closure phases

Starting from the extraction of the glottal pulses, the percentage of voice breaks, a parameter concerning the temporal course of the voice, is evaluated. The term voice break refers to transitions between different vocal registers of the voice.
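A possible computation of this percentage from a vector t of glottal pulse instants (in seconds) is sketched below in Matlab. The convention that an inter-pulse interval longer than 1.25 divided by the pitch floor counts as a break follows Praat, and the 75 Hz floor is an assumed default, not necessarily the value used in this work.

    pitchFloor = 75;                          % Hz (assumed default)
    maxPeriod  = 1.25 / pitchFloor;           % break threshold in seconds
    gaps       = diff(t);                     % inter-pulse intervals
    breakTime  = sum(gaps(gaps > maxPeriod)); % total duration of breaks
    voiceBreaksPct = 100 * breakTime / (t(end) - t(1));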

4.3.3 Formants related features

A formant is a concentration of acoustic energy around a particular frequency in the speech wave. There are several formants, each at a different frequency, roughly one in each 1000 Hz band. They are usually referred to as F1, F2, F3, etc. Each formant corresponds to a resonance in the vocal tract. With vowels, the frequencies of the formants determine which vowel is heard and, in general, are responsible for the differences in quality among different periodic sounds. Formants can be seen clearly in a wideband spectrogram, where they are displayed as dark bands, as in Figure 4.5 (obtained with Praat from the same signal as in Figure 4.3). The darker a formant is reproduced in the spectrogram, the stronger it is (the more energy there is, or the more audible it is). The first five formants are also displayed in red.

Figure 4.5: Spectrogram of a speech signal with the first five formants superimposed

In this study, the standard deviation of the third formant has been estimated, as it represents a degree of tonal modulation of the voice.
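One standard way to obtain a third-formant track, from which that standard deviation can be computed, is linear predictive coding (LPC): the angles of the LPC polynomial roots correspond to the vocal tract resonances. The Matlab sketch below is generic, not the thesis's code; the LPC order of 18 follows the fs/1000 + 2 rule of thumb at 16 kHz and is an assumption, as is the 90 Hz cut used to discard near-DC roots.

    % LPC-based formant estimates for one voiced frame (illustrative).
    function formants = estimate_formants(frame, fs)
        a = lpc(frame(:) .* hamming(length(frame)), 18);
        rts = roots(a);
        rts = rts(imag(rts) > 0);             % one root per conjugate pair
        freqs = sort(atan2(imag(rts), real(rts)) * fs / (2*pi));
        formants = freqs(freqs > 90);         % drop near-DC resonances
    end

Collecting formants(3) over all voiced frames and taking its standard deviation then yields the feature described above.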

4.3.4 Syllables related features

A syllable is the minimal phonetic unit of organization for a sequence of speech sounds. It is typically made up of a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants). Syllables are usually considered the phonological building blocks of words. They can influence the rhythm of a language, its prosody, its poetic metre, and its stress patterns.

From the identification of the syllabic nuclei, the following features can be estimated (a sketch of their computation is given after this list):

• Speech rate, the number of syllables per second

• Percentage of phonation time, referring to the presence of syllables throughout the speech signal

• Articulation rate, the number of syllables over the phonation time

• Mean duration of both syllables and inter-syllabic pauses

All these parameters are connected to speech fluency.
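Once the syllable nuclei have been detected, these features reduce to simple ratios, as in the Matlab sketch below. The detection stage is assumed to have already produced nuclei (nucleus times in seconds), voicedDur and pauseDur (durations of voiced stretches and inter-syllabic pauses), and the total duration T; all of these variable names are illustrative.

    nSyll            = numel(nuclei);
    speechRate       = nSyll / T;               % syllables per second
    phonationTime    = sum(voicedDur);          % total phonation (s)
    phonationPct     = 100 * phonationTime / T; % percentage of phonation
    articulationRate = nSyll / phonationTime;   % syllables per phonated s
    meanSyllDur      = phonationTime / nSyll;   % mean syllable duration
    meanPauseDur     = mean(pauseDur);          % mean inter-syllabic pause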

4.4 Classification of different levels of cognitive function based on acoustic features of speech

In [8], analyzing linguistic features from transcripts and acoustic features obtained from 167 patients diagnosed with 'possible' or 'probable' AD and 97 controls (all English-speaking), it was possible to distinguish between AD and healthy subjects with a classification accuracy of 81.9% using the Logistic Regression algorithm. A large number of features (370) was extracted both from linguistic variables from transcripts and from acoustic variables from the associated audio files.

Similar results were found in [9] using the Support Vector Machines classifier, where acoustic features extracted from speech recordings provided high accuracy rates in classifying healthy versus AD (87%), MCI versus AD (80%), and healthy versus MCI (79%). In this study, 15 subjects were recruited for the healthy control group, 23 for the MCI group, and 26 for the AD group, all French-speaking. From a countdown and a picture description task, features related to different data types (voice, silence, and periodic and aperiodic segments' durations) were extracted: mean and ratio mean, median and ratio median, standard deviation and ratio standard deviation, sum of segments and ratio sum of segments, and segment counts. The extraction of the features was performed manually in Praat.

Even higher classification accuracies were found in a subsequent work of the same group, in which speech analysis was performed on a mobile application, though still in a controlled environment [10]. In this case, 165 participants were recruited, again French-speaking. Each participant performed a set of spoken tasks, namely a sentence-repeating task, a denomination task, two different verbal fluency tasks, a counting backwards task and three story-telling tasks. The features were similar to the ones used in the previous study and were again extracted in Praat.

Using a ROC curve approach based only on speech rate, in [7] the Authors were able to discriminate AD patients and controls with an accuracy of 80% (specificity 74.2%, sensitivity 77.1%). A test with an oral reading task was administered to 70 Spanish-speaking subjects, comprising 35 AD patients and 35 controls. The features (total duration of the recording, number of pauses, pause proportion, phonation time and speech rate) were obtained with Praat.
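As an illustration of this kind of single-feature ROC analysis (not the code of [7]), an operating threshold on the speech rate could be selected with scikit-learn as follows; the use of the Youden index as selection criterion is an assumption.

import numpy as np
from sklearn.metrics import roc_curve

def speech_rate_threshold(speech_rate, labels):
    # labels: 1 = AD, 0 = control; a lower speech rate is assumed to
    # indicate AD, hence the sign flip so that a higher score means AD
    scores = -np.asarray(speech_rate, dtype=float)
    fpr, tpr, thr = roc_curve(labels, scores)
    best = np.argmax(tpr - fpr)  # Youden's J = sensitivity + specificity - 1
    # Classify as AD when speech_rate <= returned threshold
    return -thr[best], tpr[best], 1.0 - fpr[best]  # threshold, sens, spec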

In a following study ([6]), based on the same approach but performed on a total of 127 subjects, again all native Spanish speakers, the Authors again discriminated AD patients from controls, with an accuracy of 87% (specificity 81.7%, sensitivity 82.2%).

In [5], performing Linear Discriminant Analysis with diagnosis (AD and control subjects) as the grouping variable and percentage of voice breaks, number of voice periods, number of voice breaks, shimmer and noise-to-harmonics ratio as features, AD patients were classified correctly with an accuracy of 84.8%. Audio recordings obtained during the execution of a sentence-reading task were collected from 66 English-speaking participants divided into two groups, a normal control group (36) and an AD group (30). Spectrographic analysis of temporal and acoustic characteristics was carried out using the Praat software.

In [11], using data collected from three tasks (two based on recalling and one on spontaneous speech), the Authors were able to separate MCI patients from the control group with an F1 score of 78.8% using the Random Forest classifier on acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an Automatic Speech Recognition (ASR) based tool. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients, all Hungarian-speaking.

Compared to the studies described above, the main novelties of the present work are the following:

• the classification was based only on spontaneous speech, instead of controlled spoken tasks such as reading aloud, repetition of semantic units, verbal fluency tasks, counting down and picture description
• the focus was set on early signs of cognitive decline
• the development of custom software, in order to facilitate porting the processing to the mobile app, instead of using Praat


Chapter 5

Methods

5.1 Participants and recruitment

A sample of 153 people was recruited at the Geriatric Unit of Fondazione IRCCS Ca' Granda, Ospedale Maggiore Policlinico in Milan, Italy. Participants were divided into three groups based on the Mini-Mental State Examination (MMSE) score they achieved:

• Group 1: 62 subjects with MMSE > 26
• Group 2: 46 subjects with 20 ≤ MMSE ≤ 26
• Group 3: 45 subjects with 10 < MMSE < 20

The basic requirements for participation in the interviews were an age over 65 years and a good command, at least orally, of the Italian language.

Exclusion criteria were:

• Subjects with MMSE score ≤ 10
• Subjects unwilling to participate in the study or unable to provide their consent
• Clinically unstable subjects, in the clinical judgment of the investigator, or subjects affected by a terminal illness (life expectancy below 6 months)
• Subjects affected by severe hearing loss, major visual deficits or aphasia
• Depressed subjects (evaluated using the 30-item Geriatric Depression Scale – GDS)

The MMSE (or Folstein test, presented in Appendix A) is a questionnaire extensively used in clinical and research settings to evaluate cognitive function [38]. The test is composed of 30 simple questions or problems (the score ranges from 0 to 30) aimed at investigating deficiencies in the following cognitive areas: orientation to time and space, registration of words, attention and calculation, recall, language, complex commands. It is important to point out that the MMSE is not intended, on its own, to provide a diagnosis of any particular disease.

The GDS (presented in Appendix B) is a self-report assessment used to identify depression in the elderly [39]. One point is assigned to each answer and the cumulative score is rated on a scoring grid. Subjects with a score over 9 are considered depressed (mildly, and then severely as the score increases), thus only those with a score < 10 were included in this study. However, in contrast with the case of the MMSE, exclusion from the study was applied after the subjects' participation in the interviews. Also in this case, it must be highlighted that a diagnosis of clinical depression should not be based on GDS results alone.

5.2 Data collection

5.2.1 Experimental protocol

Data were collected through individual interviews. Each interview was recorded in Italian according to the following protocol:

• Welcome, explanation of the reasons for the interview, signature of the informed consent (subjects belonging to Group 1 – MMSE score > 26 – were also asked to fill in the GDS)
• Sentence-repeating task
• Three story-telling tasks
• Picture description task

The participants signed an informed consent approved by the European Community. No names or other personal information that could identify the participants are reported in the investigation report. Confidentiality has been maintained in accordance with the regulations existing in Italy. All data were stored on password-protected computers and in a secure shared folder, accessible only to the researchers responsible for data management and analysis.

5.2.1.1 Sentence-repeating task

Within this task, subjects were asked to repeat the following sentences:

1. So solo che oggi dobbiamo aiutare Giovanni. (I only know today we have to help Giovanni.)

2. Il gatto si nascondeva sempre sotto il divano quando c’erano cani nella stanza.

(The cat always hid under the couch when there were dogs in the room.)

3. Non c'è se né ma che tenga. (There are no ifs or buts about it.)

4. Sul qui e sul qua l'accento non va. (On ‘qui’ and on ‘qua’ the accent does not go.)

5. C'era una volta un pezzo di legno. Non era un legno di lusso, ma un semplice pezzo da catasta, di quelli che d'inverno si mettono nelle stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze. (Once upon a time there was a piece of wood. Not a luxury wood, but just a piece from a woodpile, one of those that in winter are put in stoves and fireplaces to light the fire and warm the rooms.)

6. Oggi voglio preparare una torta per il compleanno della mia nipotina. (Today I want to prepare a cake for my niece’s birthday.)

7. Oggi vado dal dottore perché non mi sento molto bene. (Today I'll go to the doctor because I don't feel very well.)

8. Dato che c’`e il sole, pensavo di andarmi a fare una passeggiata al parco. (Since it’s a sunny day, I was thinking about going for a walk at the park.)


9. È una settimana che non smette di piovere, speriamo arrivi presto la primavera. (It has not stopped raining for a week; let's hope Spring comes soon.)

10. Sei riuscito a passare al supermercato a comprare il detersivo? (Have you been able to stop by the supermarket to buy the detergent?)

This task is not based on spontaneous speech and therefore the corresponding results were not used for classification; however, since the audio signals are relatively short, and thus simpler and faster to analyze, they were useful in the first part of the work to test and optimize the algorithms for the automatic extraction of the features.

5.2.1.2 Story-telling tasks

As previously stated in Section 4.1 on page 27, emotions can influence voice parameters. In order to take this factor into account, subjects were asked to tell three short stories in an uninterrupted way, for approximately two minutes each:

1. Positive story: tell something about the first pleasant event coming to your mind, with no constraint on when it happened

2. Negative story: tell something about the first unpleasant event coming to your mind, with no constraint on when it happened

3. Episodic story: tell something about an event that occurred in the recent past, possibly something which does not involve strong emotions, in order to keep a neutral tone


5.2.1.3 Picture description task

Within this task, subjects were asked to describe a picture freely in an uninterrupted way for 2 minutes, trying to add as many details about the scene as they could. The picture was the ‘cookie theft’ picture of the Boston Diagnostic Aphasia Examination test, presented in Figure 5.1. It was chosen because it is considered an ecologically valid approximation to spontaneous speech.

Figure 5.1: Cookie Theft picture description task

5.2.2 Acquisition toolbox

The three spoken tasks were recorded using an ad-hoc toolbox (whose home page is shown in Figure 5.2) developed by SignalGenerix LTD [40].

This toolbox can display the sentences for the sentence-repeating task and the picture for the picture description task, but this function was not exploited. In fact, both the sentences and the picture were printed on paper sheets in a large font size and presented by the examiner, in order to make them easier for the subjects to read.

An external USB microphone was used to record the voice, kept at a distance of approximately 20 cm from the mouth of the speaker with an inclination of 45° in order to reduce breath artifacts.

Figure 5.2: Home page of the acquisition toolbox

After the choice of the language and of the group the subject belongs to, the system lets the examiner create the profile of a new user or load/delete an existing one. If the first option is chosen, the gender and age of the subject and a 5-digit password must be inserted in order to proceed with the recording phases (User IDs are progressive and set automatically by the system). It is also possible to insert notes about the subject. The toolbox is powered by Matlab.

5.3 Data analysis

Figure 5.3 presents a simple flowchart outlining the steps constituting this work.

First of all, a database was built by acquiring voice samples from the subjects involved in the study. The files were recorded with a sampling frequency (Fs) of 16 kHz, with a 16-bit quantization, and saved in WAVE format. The 'preprocessing' block shows the phase preceding the actual extraction of the features: the manual cut of the audio signals was performed using the Audacity software [41], while the polarity check and the standardization were performed in MATLAB [42]. The steps outlined in the 'feature extraction' block represent the Matlab-based algorithms implemented to derive the acoustic features from the speech signals. Then there is the stage of statistical analysis, implemented using the IBM SPSS Statistics software [43]. Lastly, the 'classification' block presents the two main strategies carried out to classify the recruited subjects, both implemented in Python [44].
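For illustration, the entry point of such a pipeline could look like the following Python sketch (the function name is hypothetical, and the actual feature extraction in this work was Matlab-based):

import soundfile as sf

def load_recording(path):
    # Recordings are 16-bit WAVE files sampled at 16 kHz;
    # soundfile returns the samples as floats in [-1, 1]
    x, fs = sf.read(path)
    assert fs == 16000, f"unexpected sampling frequency: {fs} Hz"
    return x, fs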

Figure 5.3: Flowchart of the steps constituting this work

5.3.1 Pre-processing

The pre-processing phase was performed for each of the audio signals. It consisted of three subsequent steps:

• Manual cut
• Polarity estimation
• Standardization

5.3.1.1 Manual cut

As anticipated, the manual cut was performed in Audacity. It was used to remove silences at the beginning and at the end of each signal, any environmental noise, and voices other than the interviewee's.

Figure 5.4 shows the interface of the software.

Its use is quite straightforward: to remove a part of the signal, it is sufficient to select it directly on the waveform and then delete it. It is also possible to zoom in on a specific point and to change the playback speed, making it easier to locate and remove noises.

5.3.1.2 Polarity estimation

Figure 5.4: Audacity software interface

Detecting the correct speech polarity is a necessary step prior to applying speech processing techniques. An error in its determination can adversely impact the performance of such techniques, even though the human ear is mostly insensitive to polarity changes. When a microphone is used to record a speaker, inverting its electrical connections will cause an inversion of the polarity of the acquired speech signals. The origin of the polarity in the speech signal comes from the asymmetric glottal waveform exciting the vocal tract resonances (see Subsection 4.3.2).

To detect polarity, the algorithm described in [45] was implemented. The key idea behind this technique is that the excitation signals contain relevant information about the speech polarity, as their behaviour reflects the asymmetry of the glottal production. These excitation signals are:

• the residual signal, obtained by estimating through Linear Prediction analysis the coefficients of an auto-regressive model of the speech signal, and by removing the contribution of this spectral envelope by inverse filtering
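A minimal sketch of the residual-based part of this idea is given below. It is not the full algorithm of [45], which compares several excitation signals; in particular, the sign convention linking the skewness of the residual to the polarity is an assumption here.

import numpy as np
import librosa
from scipy.signal import lfilter
from scipy.stats import skew

def polarity_from_residual(x, order=18):
    # Estimate the coefficients of an auto-regressive model of the
    # speech signal through Linear Prediction analysis
    a = librosa.lpc(np.asarray(x, dtype=float), order=order)
    # Inverse filtering removes the spectral envelope, leaving the residual
    residual = lfilter(a, [1.0], x)
    # The residual's asymmetry reflects the asymmetry of the glottal
    # production; assumed convention: positive skewness -> positive polarity
    return 1 if skew(residual) > 0 else -1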
