
CORSO DI LAUREA MAGISTRALE IN INGEGNERIA BIOMEDICA

Classification of resting state fMRI datasets: machine learning methods for the identification of patients with anxiety disorders

Master's Thesis

Student: Viola Fanfani

Supervisors: Ing. Nicola Vanello, Ing. Luca Citi, Prof. Claudio Gentili

Co-examiner: Prof. Luigi Landini

"Have you ever understood, when they talk, whether they are on your side? One ought to study so as to learn to do without those who study."

Cesare Pavese, "Il compagno"


Acknowledgements

I would like to thank my supervisors: Luca Citi, who welcomed me at the University of Essex and then patiently mentored me; Nicola Vanello, who made this thesis possible in the first place and then provided constant supervision of the work; and Claudio Gentili, for the dataset to explore and for giving useful advice on neuroscience.

I would like to thank my officemates at the University of Essex for guiding me through the daily challenges of this work. A particular thanks to Ana, and with her to all the lunch-mates, for including me in many moments, serious or less serious, of their lives.

Finally, I would like to thank my friends Bianca Borrani and Ilenia Mazza for the help with the graphical issues I encountered along the way.

Data were provided in part by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.


These will be long acknowledgements. This thesis comes at the end of my Master's degree in biomedical engineering, but it also marks the end of my, by now twenty-year-long, journey through education. I have always done well with books, a little less so with everyday life. I have lived a privileged life in every sense, but in particular I recognise the luck of having always found magnificent people at my side. Without them I would not be here, or I would not be the person I am.

I thank my grandparents, who first fought for my future and then took care of my present.

A special thanks to my parents, whom I did not choose but whom I would not trade for anything in the world; without you it would have been impossible to reach this goal.

I thank my aunts and uncles for always treating me like a daughter, helping me whenever I needed it.

I thank my lifelong friends, because we survived adolescence together and by now they have become my family. Thanks also to all my friends, by now the friends of a lifetime, who made these years easier and happier than I could have hoped. All together, you are the home I can always return to.

I thank Baba, Claudia, Costanza and Salvatore. For them it would perhaps be enough to say that they are my finest friends, but it is not enough. The time spent with them exceeded any expectation I had for my university years; they shook me up and changed me. Thanks also to Marianna, who put up with us in the worst moments and gave us so much laughter in return. Without them I would perhaps still be here writing these acknowledgements, but who knows how many experiences I would have missed in the meantime.

I thank all the friends and colleagues with whom I shared so many moments, of joy and sadness, from Pacinotti to Vettovaglie: thanks to Chiara, Elena A., Elena M. and Giulia for their constant, daily support. Thanks to Stefano: the endless hours of lectures and study seemed shorter thanks to him too.

I thank all the other people I have met in these years, all my flatmates, all the other university classmates, all the other friends and acquaintances: they made this city a second home.


Abstract

Machine learning is increasingly being applied to fMRI. These methods, suited to high-dimensional datasets, seem particularly appropriate for the challenge of decoding the brain's intrinsic behaviour from resting state recordings (rs-fMRI). The goal of this work is to find a classification algorithm able to distinguish patients with anxiety-related disorders from healthy controls by analysing resting state fMRI data. This work explores the possibility of using Support Vector Machines as the classifying algorithm and presents two different approaches to feature extraction. First, it tests the power of features obtained from linear and non-linear analyses of the fMRI time series. Then, it investigates the use of functional connectivity matrices to classify the patients. The positive semi-definite structure of covariance matrices requires a Riemannian geometry framework in order to be used properly with SVMs: this work tackles the problem by building a reliable and practical algorithm for rs-fMRI dataset classification. To check the reliability of this novel technique, the Human Connectome Project open access data have been used in addition to the anxiety-related data. Despite the number of different settings tested, only the most interesting results are shown in this work, in order to explain and discuss the methodological challenges of whole-brain rs-fMRI classification.


Sommario

Machine learning methods are increasingly being employed for the analysis of functional magnetic resonance imaging (fMRI) data. In particular, since such algorithms can handle large amounts of data, they seem particularly suited to exploring resting state fMRI data in order to understand the intrinsic behaviour of the brain. In this thesis we looked for an algorithm able to classify patients with anxiety disorders and healthy controls. Working with Support Vector Machines, two different types of features were used: first, synthetic measures of activity were extracted through linear and non-linear analyses of the time series; in a second phase we explored the possibility of using covariance matrices, i.e. functional connectivity data. This second approach, which uses positive semi-definite matrices as SVM features, requires some tools from Riemannian geometry. One of the challenges of this thesis was the construction of a reliable algorithm suited to rs-fMRI data. Since this type of approach is new in the field of functional magnetic resonance imaging, we also tested the algorithm on a dataset made available by the Human Connectome Project. In the course of this work various combinations of data and methods were tried, but in the following we present only those that gave the best results, so as to discuss the merits and limits of the methods under examination.


Contents

List of Figures
List of Tables

1 Introduction

2 State of the art and our Dataset
  2.1 Resting state fMRI and machine learning
    2.1.1 Anxiety disorders studies
  2.2 The materials
    2.2.1 The original dataset
    2.2.2 HCP 900
    2.2.3 Software tools

3 Machine Learning and SVM
  3.1 Introduction
  3.2 Support Vector Machines
    3.2.1 Linear Classifier
    3.2.2 Hard Margin SVM
    3.2.3 Soft Margin SVM
    3.2.4 Nonlinear SVMs
  3.3 Assessing the performance
  3.4 Cross-validation
    3.4.1 K-fold vs Leave One Out
  3.6 Data dimension
    3.6.1 Unbalancement

4 Classification with Synthetic Features
  4.1 Introduction
    4.1.1 ALFF and fALFF
    4.1.2 ReHo
    4.1.3 Hurst Exponent
  4.2 Feature Extractions
  4.3 Post-hoc Analysis
  4.4 SVM of DMN
    4.4.1 Results
  4.5 Discussion

5 Classification with SVMs and Covariance Matrices
  5.1 Introduction
  5.2 Covariance Matrix
    5.2.1 Computation of covariance matrices
    5.2.2 Operations with covariance matrices
  5.3 Riemannian Geometry
    5.3.1 Riemannian metric
    5.3.2 Mean of Symmetric Positive Semi-definite Matrices
    5.3.3 Gradient Descent algorithm
    5.3.4 Mahalanobis distance
  5.4 Classification of HCP 900 datasets
    5.4.1 Methods
    5.4.2 Results
  5.5 Classification of rs-fMRI datasets
    5.5.1 Methods
    5.5.2 Results
  5.6 Discussion

6 Conclusions

A Appendix A

Bibliography

List of Figures

2.1 Number of papers working on the same disorder
3.1 Linearly separable case
3.3 Non linear problem, RBF kernel
3.6 ROC curve
4.1 Default Mode Network from FINDlab map
4.2 Weights of the classification with ReHo
4.3 Weights of the classification with fALFF
5.1 Manifold and tangent space
5.2 Positive semi-definite matrices' cone and tangent planes
5.4 Gyri frequency in CV
5.5 Frequency of the highest weighted features in the HCP-900 classification
5.6 Frequency of the highest weighted features in the resting state classification
5.7 Distance between ROIs of the Brainnetome map
5.8 Most important features for the classification. The data in this figure are the same as in the above matrices; increasing frequency is represented with thicker and darker links

List of Tables

2.1 Papers combining rs-fMRI data with SVMs
3.1 Two-class confusion matrix
5.1 EMOTION results
5.2 rs-fMRI classification results: HC vs (SAD+GAD+Panic+dental)
5.3 rs-fMRI classification results: HC vs anxiety

1 Introduction

Resting state functional magnetic resonance imaging, rs-fMRI, is increasingly becoming one of the favourite tools for investigating brain behaviour. Even during activations, the background activity of the brain accounts for about 95% of the whole [1]. Considering that the brain has an average consumption of around 25% of the blood volume, this has led neuroscientists to think that even the resting activity of the brain carries useful information for understanding its behaviour. Since the first paper in 1995 [2], there have been many attempts to decode brain connectivity through rs-fMRI, especially in order to find characteristic traits of psychiatric disorders [3].

One of the biggest discoveries of rs-fMRI is the existence of specific short- and long-range connections, the so-called resting state networks, RSNs [4]. These networks are formed by sets of areas that exhibit common behaviours at rest and, interestingly, those regions largely overlap when compared across different subjects. In order to reach clinical reliability [5], the following step has been to assess if and where those networks are disrupted in pathological subjects. To date, the majority of studies have focused on degenerative or depressive disorders, yet a good number of them have tried to decode brain activity in anxiety-related disorders [6]. Having a dataset made of healthy controls and patients with a variety of anxiety disorders, we are looking for a suitable algorithm to discriminate the different subjects and possibly to infer common characteristics of the disorders.

The privileged approaches for studying disorder traits in rs-fMRI have been seed-based correlation analysis and independent component analysis, ICA [7]. The former consists of building a map of correlations with respect to a seed ROI, arbitrarily placed by the experimenter at a point considered relevant for the objective of the study. The resulting maps are checked to assess which voxels' time courses are highly correlated or anticorrelated with respect to the seed and how those correlations differ from healthy controls to patients. Seed-based correlation analysis needs an a priori assumption on the model; it then provides a large set of voxelwise information on the connections between the seed ROI and the rest of the brain. On the other side, with ICA, independent components of the signal are identified, obtaining a map of areas with a common underlying source, which are the resting state networks. Looking for differences in those maps between patients and control subjects is the key to assessing network disruptions caused by disorders. ICA is a model-free approach which gives less information than the other method, but it explores the data without the bias introduced by a priori choices.

In parallel with fMRI, computer science has seen massive improvements in, and applications of, machine learning methods. These are data-driven approaches in which algorithms learn a model from common patterns found in the datasets. The essential asset of machine learning is the intrinsic ability to manage high-dimensional datasets without a priori hypotheses; on top of that, these methods improve their behaviour as the number of given examples grows. Having permeated every branch of the biomedical sciences, ML methods seem fit for rs-fMRI, in particular for finding the common or disrupted patterns in patients' recordings. It is for this purpose that a good number of works applied methods such as support vector machines and k-NN to rs-fMRI datasets. We will see in chapter 3 how the datasets and the goals of the analysis influence the choice of the ML framework. We decided to use support vector machines, SVMs, since they are supervised classification methods with large adaptability and a reasonable computational cost; on top of that, the majority of published papers used SVMs, increasing our chances of comparing the results.

Once the machine learning approach has been chosen, the experimenter needs to extract some features from the dataset: these are numerical representations of the data's properties, and their number and nature depend on the patterns one is looking for. The first aspect we would like to look into is the local characteristic behaviour of the brain. It is a common belief that the brain's intrinsic oscillations at rest carry a large load of information [8]. Following a widespread strategy in rs-fMRI studies, we will extract, from linear and non-linear analyses of the time series, synthetic parameters of the activity: its power at low frequencies, its fractality and its homogeneity. Our goal in chapter 4 is to assess how those measures vary between different populations and which areas are more informative.


The second aspect we investigate is functional connectivity, which applies to the case of both covariance and correlation matrices, even if we focus on the former in order to follow a path already paved for task-evoked signals. Before proceeding with the actual data analysis, we reconstructed the Riemannian geometry framework which is fundamental for handling PSD matrices with a tool that works in a vector space. Afterwards we tested this approach on our dataset, to check how it improves the discrimination ability of the SVM. This is a novel method, seldom applied to fMRI; therefore we aim to build a feasible algorithm able to discriminate different disorders. To build and validate the algorithm we also use a second dataset, available from the Human Connectome Project.

Alongside the classification goal, we discuss the complementary issues raised by this type of analysis. The first question is how many instances, in this case subjects, are needed to perform a reliable classification, and moreover how much the imbalance between the groups can affect the results. Another problem we have to address is how many features, and at which resolution, we can extract and manipulate. One approach could be to work with voxelwise features, without discarding local information. A second one is to use a parcellation map, built from anatomical or functional data, to extract an averaged measure for each ROI.

Finally, we would like to understand how many of the extracted features are actually useful for the classification, and which features carry the greatest informative power. The major goal of using SVMs for psychiatric disorder classification would be to find recurrent patterns in resting state data, patterns that are difficult to recognise with other analysis methods. Therefore, we will analyse the results of the classification to assess their adherence to previous studies on anxiety disorders.

2 State of the art and our Dataset

2.1 Resting state fMRI and machine learning

Resting state fMRI is a relatively new paradigm for functional studies: patients are asked to lie in the scanner, with their eyes closed, without thinking of anything in particular. This method has proved to be an interesting and reliable source of information about ongoing brain activity: during what we call rest, brain activity is still strong, and it can help us understand the underlying connections, which are less visible in the signals recorded during task-related experiments.

The objectives of rs-fMRI are manifold, but disease recognition can be identified as one of its major branches. Depending on the available dataset and the purpose of the experimenters, patients with a condition can be compared to healthy controls or to patients with the same disease but different medications; alternatively, the goal could be to classify subjects who originally had the same type of condition but, with time, developed different diseases. On the other hand, supervised machine learning mainly supports two kinds of goals: classification and regression. In the first task, the algorithm is built to learn a model able to classify two or more different groups. In regression, instead, the model aims to find a description of the relationship between two or more variables. Both approaches learn a model from the dataset they are given: this means that there is usually the chance to evaluate the model, a posteriori, to identify which variables played a key role in its estimation. The reasons why ML looks like a feasible solution to the task of disorder recognition can be found in the structure and dimensions of the problem.

Rs-fMRI scans are usually large datasets. With a 3D resolution of up to 1 mm, a single brain can amount to on the order of $10^6$ voxels, each of them with a temporal recording of variable length, usually with more than 100 samples. Unlike task-induced signals, the resting state ones cannot be filtered a priori by region and period of activation or other functional markers, making it difficult to discard information and forcing one to keep nearly the entire dataset. Moreover, to identify reliable biomarkers describing an entire population, the number of examples, which in this case are subjects, should not be limited to small groups, even though we will see later in this chapter that this requirement is not always fulfilled. Finally, the usual way to identify an ill patient is an examination by a medical doctor with a consequent binary diagnosis (ill/healthy) or a value on a predefined scale, defining the degree of the condition. This kind of problem, by its very definition, looks suited to supervised machine learning, which can tackle the challenge by using the known information provided by the physician to find the variables that best describe the pathology.

Given the nature of our dataset, we decided to employ a classification SVM, whose principles will be described in Chapter 3, with different features as input. There are many ways to perform a classification experiment with fMRI data, and they can differ considerably from one study to another [9] [10]. In the following we describe the mandatory steps [11], so that we can then analyse the literature later in this chapter.

The first step, which drives all the following ones, is feature extraction. Once we have a collection of 3D time-varying brain signals, we need to extract a list of values for each subject, which will be the input features of the SVM. This step greatly depends on the type of biomarkers one is looking for. Apart from the preprocessing steps, necessary for every kind of analysis on rs-fMRI datasets, the single volume can be considered voxelwise or can be parcellated into a set of ROIs. After that, the experimenter can decide what kind of parameters to extract from the dataset: it could be the simple average of the signal in a voxel or a synthetic value extracted from a graph analysis of the network. Once the features have been extracted from every subject, there may be a need to reduce their dimensionality through feature selection. An SVM works well with a large number of examples, which in this case are subjects, but may not be particularly robust when the examples-to-features ratio is highly unbalanced. To cope with that, the classifier itself can discard some features if they do not help in finding the optimal hyperplane. To avoid the risk of selecting features "randomly", one can select them with a strategic technique, using various approaches. Finally, after the model has been learned by the machine, it is important to evaluate the results. The machine is trained on a dataset, which is usually a subset of the original one, and then tested to assess to what degree the classifier is able to properly classify the patients. To have a statistically significant evaluation of the method, this training-testing procedure should be inserted in a cross-validation cycle that repeatedly trains the model on a different subset of examples and then tests it on the remaining instances. It is important to remember, and we will always follow this rule, that feature selection must be made only using the training data; therefore, in the case of a cross-validation procedure, it should be performed inside each cycle.
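As a concrete illustration of this rule, the following sketch (a minimal example on placeholder data; the array names, the number of retained features and the fold count are assumptions, not the actual settings of this thesis) wraps a univariate feature selection step and a linear SVM in a single scikit-learn pipeline, so that the selection is re-fitted on the training folds only, at every iteration of the cross-validation.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Placeholder data standing in for extracted rs-fMRI features:
# X has one row per subject, one column per feature; y holds the class labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5000))
y = np.repeat([0, 1], 30)          # e.g. 0 = healthy controls, 1 = patients

# Feature selection lives inside the pipeline, so it only ever sees training folds.
clf = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=250)),  # univariate F-test ranking
    ("svm", SVC(kernel="linear", C=1.0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", scores, "mean:", scores.mean())
```

Calling `cross_val_score` on the whole pipeline, rather than on a pre-selected feature matrix, is what prevents information from the test folds leaking into the selection step.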

To obtain an overview of the state of the art of SVMs applied to rs-fMRI datasets, we performed a meta-analysis of the literature focusing on the topics investigated in this work. We carried out an online search on the Scopus database [12], combining "rs-fMRI", "resting state", "fMRI", "SVM", "support vectors" and "classification" as keywords. We then performed a first selection, discarding the off-topic results, especially those that had not used an SVM as classifier or that were not trying to distinguish patients from healthy controls. The results are listed in Table 2.1, while in the next section we discuss the state of the art starting from these works, focusing on the most common strategies. Part of the results overlap with those of a similar review [13], which focuses on connectome features and covers all ML approaches.

Table 2.1: Papers combining rs-fMRI data with SVMs

Year | Author | Disorder | N Patients | Parcellation | Features | ML method | FS | CV
2017 | Sundermann, B. [14] | MDD | 720: 360 MDD, 360 HC | 38 peak coordinates / 200 GM ROIs atlas | FC, Pearson's Z-transformed | Lin/RBF SVM | - | hold-out
2017 | Hojjati, S. H. [15] | MCI converting to AD | 18 MCI converters, 62 MCI non-converters | AAL 90 | Connectivity measures | SVM | FS, Gini, Kruskal-Wallis, Chi square | 9-fold
2017 | Golbabaei, S. [16] | MCI, AD | 46 HC, 76 MCI, 32 AD | Atlas, 160 ROIs | FC, Pearson's, then connectivity measures | SVM | FS + sequential forward selection | 10-fold
2017 | Abós, A. [17] | Parkinson | 38 HC, 70 PD (train), 27 PD (test) | 246 ROIs, Brainnetome | Partial correlations between the first eigenvariates | SVM | RLR | LOO
2017 | Subbaraju, V. [18] | Autism Spectrum Disorder | 530 neurotypical, 505 ASD | AAL 90 | Spatial feature-based selection of connectivity matrices | SVM | - | 20-fold
2016 | Khazaee, A. [19] | MCI, AD | 89 MCI, 34 AD, 45 HC | Atlas, 264 ROIs | Graph measures | SVM | Fisher, FSFS | Holdout
2016 | Jie, B. [20] | MCI | 12 MCI, 25 HC | AAL 116 | Connectivity hypernetworks | Lin SVM, multi-kernel SVM | M2TFD | LOO
2016 | Wang, S. [21] | Psychosis risk syndrome | 34 PRS, 37 HC | ReHo + statistical clustering | ReHo | RBF SVM | - | LOO
2016 | Chen, Q. [22] | Minimal hepatic encephalopathy | 16 MHE vs 19 no MHE | voxelwise | ReHo | SVM | ReliefF | LOO
2016 | Premi, E. [23] | Granulin disease | 31 GRN Thr272fs carriers vs 33 HC | Voxelwise, then SVD | Various measures: VBM, ReHo, ALFF, fALFF | SVM | t-tests + PCA | LOO
2016 | Ramasubbu, R. [24] | MDD, different severities | 45 MDD (3 severity groups) vs 19 HC | voxelwise, then ICA | LearnFMRI, ICA/GLM features | lin SVM | - | 5-fold
2015 | Suk, H. [25] | MCI | 12 MCI vs 25 HC | AAL | Supervised discriminative group sparse representation | SVM | RFE, t-test, mRMR | LOO
2015 | Khazaee, A. [26] | Alzheimer | 20 AD vs 20 HC (ADNI DB) | AAL 90 | Correlations + graph measures | SVM | F-score, mRMR, chi square | LOO
2015 | Du, Y. [27] | Schizophrenia, bipolar disorder, schizoaffective disorder | 20 Sz, 20 BP, 20 SDM, 13 SDS, 20 HC | voxelwise | GIG-ICA | Multiclass SVM, tested on 16 new subjects | RFE | No
2015 | Cheng, H. [28] | Schizophrenia | 19 Sz vs 29 HC | 278-ROI parcellation from homogeneity | Pearson's correlations + betweenness centrality measures | SVM | - | LOO
2015 | Wu, X. [29] | Acquired Brain Injury | 99 ABI (5 groups), 34 HC | voxelwise | Graphs, then FC strength | SVM | - | LOO
2015 | Rehme, A. K. [30] | Motor impairment after stroke | 40 patients (20 MI, 20 no MI), 20 HC | voxelwise | seed-based correlation | SVM | F-score | LOO
2015 | Wu, J. [31] | Glioma | 18 LG glioma, 17 HG glioma | tumor segmentation | fALFF, ReHo, SIC | RBF SVM | - | -
2015 | Chyzhyk, D. [32] | Schizophrenia with auditory hallucinations | 28 HC, 26 HC, 14 no HC | LATTICE | h-functions | SVM | - | -
2015 | Savio, A. [33] | Schizophrenia | 72 Sz, 74 HC | voxelwise | ALFF, fALFF, ReHo, VMHC | Lin and RBF SVM | PCC, Welch test, Bhattacharyya dist | 10-fold
2015 | Zhang, W. [34] | SAD | 40 SAD, 40 HC | voxelwise | ReHo, 27 voxels | SVM | - | LOO
2015 | Hassan, A. [35] | Epilepsy | 100 E vs 80 HC | AAL 90 | Connectivity matrices, then GCA and iBSA | SVM | - | 50-50 splits, LOO, 5-fold
2014 | Li, W. [36] | Diabetes | 15 DB, 16 HC | AAL 90 | Correlations + graph measures | SVM | - | 50-50
2014 | Watanabe, T. [37] | Schizophrenia | 54 Sz, 67 HC | 347 ROIs, grid-based parcellation | Correlation matrix | SVM | Lasso + GraphNet | -
2014 | Dey, S. [38] | ADHD | 776 train, 197 test | 190 functional ROIs | Correlation matrix + MDS projection + graph measures | poly SVM | - | -
2014 | Arbabshirani, M. R. [39] | Schizophrenia | 195 Sz, 175 HC | PCA + ICA, 47 ICNs | Functional connectivity + autoconnectivity | SVM | mRMR | 10-fold
2014 | Li, Y. [40] | Mild Cognitive Impairment | 12 MCI vs 25 HC | AAL 116 | MAR effective connectivity | SVM | SVM-RFE | LOO
2014 | Dos Santos Siqueira, A. [41] | ADHD | 269 ADHD, 340 HC | 400 ROIs | Pearson's correlation, graph measures | SVM | - | LOO
2014 | Guo, S. [42] | Schizophrenia | 69 Sz, 62 HC | 90 ROIs | Measures of FC effects | SVM | - | -
2013 | Li, Y. P. [43] | Alzheimer | 10 AD, 11 HC | AAL 90 | Correlation matrix and topology measures | SVM | - | -
2013 | Wang, X. [44] | ADHD | 23 ADHD, 23 HC | voxelwise | ReHo | SVM-SMO | P-values | LOO
2013 | Liu, P. [45] | Functional dyspepsia | 30 patients, 30 HC | voxelwise | ReHo | SVM | PCA | LOO
2013 | Wei, M. [46] | MDD | 20 MDD, 20 HC | spatial ICA, 12 areas | Hurst exponent | SVM | - | LOO
2013 | Tang, Y. [47] | Antisocial personality disorder | 32 ASPD, 30 HC | AAL 116 | Correlation matrix | RBF SVM | LLE-based | LOO
2013 | Yu, Y. [48] | Schizophrenia | 24 Sz, 25 healthy siblings, 22 HC | AAL 116 | Correlation coefficients | SVM, one against the rest | PCA | LOO
2013 | Fu, J. [49] | ADHD | 21 ADHD, 27 HC | AAL 116 | Low frequency drift | LS-SVM | - | 50-50
2013 | Jiao, Y. [50] | Minimal hepatic encephalopathy | 32 MHE, 20 HC | AAL 116 | Correlation matrix | SVM | kPCA | 10-fold
2013 | Liu, F. [51] | SAD | 20 SAD, 20 HC | AAL | Pearson's correlation coefficients | SVM | Fisher | LOO
2012 | Cheng, W. [52] | ADHD | 98 ADHD, 141 HC | AAL 90 | fALFF, ReHo, spatial and temporal FC | Gaussian SVM | T-test, BWAS | LOO
2012 | Lord, A. [53] | MDD | - | AAL 95 | Pearson's correlations and graph measures | SVM | mRMR | 1000 bootstrap
2012 | Tang, Y. [54] | Schizophrenia | 22 Sz, 22 HC | 90 ROIs atlas | Functional connectivity | SVM | PCA | LOO
2012 | Zhang, J. [55] | Epilepsy | 80 HC, 100 Ep | K matrix | Asymmetry measures | SVM | - | 50-50
2011 | Wang, X. [56] | ADHD | 21 ADHD, 25 HC | 116 ROIs | Functional connectivity + KPCA | SVM | - | LOO
2011 | Bassett, D. [57] | Schizophrenia | 29 Sz, 29 HC | AAL 116 | Correlation matrix and graph measures | SVM | - | 50-50
2009 | Craddock, R. [58] | MDD | 20 MDD, 20 HC | 15 MDD-related ROIs | SVD, then correlations | SVM | TF, RF, RFE, RRFE | LOO

Disorders and Number of Patients As can be seen in Fig. 2.1, the majority of the papers focus on schizophrenia, depressive disorders and ADHD. This follows the trend of the most studied psychiatric disorders and makes it clear that there are still not many works on the classification of anxiety disorders. It is worth noticing that, for ADHD, four papers [38] [41] [56] [52] used open access data from the ADHD-200 project, which is part of the 1000 Functional Connectomes Project [59], while [26] used datasets from the ADNI DB [60]. For autism spectrum disorder another dataset, ABIDE, has been used by [18]. It is clear that making large datasets available, with every subject preprocessed with the same steps, helps the comparability and interpretability of the results, pushing more people to test their methods on the same data. The second important aspect of the datasets is their dimensionality.

Figure 2.1: Number of papers working on the same disorder

The vast majority of them have fewer than 100 subjects, less than 50 patients and 50 healthy controls, and in many cases the patient group is around 20 subjects. Another characteristic is the imbalance of the datasets, with a variable HC/patients ratio. The biggest datasets come from the open access resources [38] [41] or are mainly found in the more recent papers [14] [18], meaning that this dimensionality problem is progressively being addressed.

Feature Extraction There are two main strategies in the choice of feature extraction: a univariate analysis or a multivariate one. In the first case the focus is on the single time series, from which different parameters can be extracted. This strategy allows one to extract information about the single area: how it works, at what frequencies and with what power. The features used to date concern the power of the signal (fALFF, ALFF) [33], [23], [31], [52], the spatial homogeneity of the time series (ReHo) [21], [22], [23], [33], [31], [34], [44], [45], [52] and other specific properties, such as fractality (Hurst exponent) [46] and the low-frequency drift [49].

On the contrary, the multivariate approach builds a map of interactions between different areas. Thus, it uses as features the correlations themselves or other parameters of the network. By far the most common approach is to evaluate the Pearson correlation matrix between each pair of time courses. Afterwards, the majority of papers threshold or binarize the matrix to create an adjacency matrix from which the properties of the network are extracted. Those graph measures are representative of properties of the network such as small-worldness, betweenness centrality, and functional integration or segregation. All those papers reach good levels of accuracy using a limited number of features; therefore the problem of feature selection is greatly reduced or removed, meaning greater reliability of the results. The other side of the coin is the interpretability of those results: the extracted features are indices of network complexity, which are powerful synthetic measures but are difficult to map back into the voxel space in order to understand which areas are the most discriminative or which connections play the most relevant role in the network.

Less often, the values of the matrix themselves are taken as features. Interestingly, one of the papers that directly used the Pearson correlation values is the only one using functional connectivity measures to classify SAD patients [51]. Although this is the general trend of these works, a number of papers work with specific features and extraction algorithms. Unfortunately, it is not possible to make a comparison of techniques and results across those studies.
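To make the connectivity-based option concrete, the sketch below (a minimal example with hypothetical array names, assuming the ROI-averaged time series are already available) computes the Pearson correlation matrix of one subject and unrolls its upper triangle into the kind of feature vector typically fed to an SVM.

```python
import numpy as np

def connectivity_features(roi_ts: np.ndarray) -> np.ndarray:
    """roi_ts: (n_rois, n_timepoints) ROI-averaged time series for one subject.
    Returns the upper triangle of the Pearson correlation matrix as a 1-D feature vector."""
    corr = np.corrcoef(roi_ts)                 # (n_rois, n_rois) Pearson correlations
    iu = np.triu_indices_from(corr, k=1)       # indices strictly above the diagonal
    return corr[iu]                            # n_rois * (n_rois - 1) / 2 features

# Toy usage: a Brainnetome-sized parcellation (246 ROIs) and 200 time points
rng = np.random.default_rng(0)
toy_ts = rng.standard_normal((246, 200))
feats = connectivity_features(toy_ts)
print(feats.shape)   # (30135,) = 246 * 245 / 2
```

The quadratic growth of the feature count with the number of ROIs is exactly the dimensionality issue discussed below.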

Up to now we have referred to the single time series without specifying whether they are voxel time series. To date, probably due to computational limitations, the majority of papers, especially those building a network, preferred to use a volume parcellation. After parcellation, each brain dataset is reduced to an n × T matrix, with n the number of ROIs and T the time series length. The number and placement of the nodes used for the parcellation is quite variable. Reducing the a priori hypotheses to a minimum, [37] used a grid-like parcellation of the volume, with the nodes uniformly distributed and each ROI chosen as a sphere around the node centre. The larger body of work has used an anatomically based parcellation, of which the most common is the AAL labelling with 116 ROIs, 90 cortical and 26 subcortical; otherwise there are other predefined atlases, made of a variable number of already mapped ROIs. One of the papers employed the same map we have used in this work, the Brainnetome atlas [17]. The last method, very popular in rs-fMRI but seldom used in these papers, is an ICA-based one, which defines the areas using the functional information in the dataset. The advantage of considering areas derived from functional information is that the model already includes an analysis of the data, which can help the following steps. On the other hand, an anatomical or functional open access atlas helps the reproducibility of the experiment and makes the results comparable to other studies. In almost all cases where the extracted features come from a linear or non-linear analysis of the time series, the parcellation is avoided. This is probably due to the fact that the resulting features are on the order of V, the number of voxels, while in a correlation analysis they would grow up to V².

SVM pipeline Feature selection is not a mandatory step and mostly depends on the number of features extracted. Whenever a voxelwise analysis is made, feature selection is always included, while with a smaller number of features it has been avoided. This step is performed mostly with univariate tests such as F-score, t-tests, Chi square or mRMR. In all those cases, which are often the methods included in ML software packages, each feature is tested to assess its overall variance, or a supervised test between the two classes is performed.

A very interesting paper is [33], which explores FS approaches for ALFF, fALFF, ReHo and VMHC (voxel-mirrored homotopic connectivity). Unlike the majority of the other papers, which report only the best result, this one follows a more methodological path, using three different tests (Pearson's correlation, Bhattacharyya distance and Welch's test) to check for the most discriminating features. Unlike most works, they used a 150-subject dataset, which can be considered large compared to the others, making the results quite reliable. The variance in accuracy between the different selection strategies is very high: for the same measure, with the same SVM kernel, the accuracy can vary by up to 15%.

Another frequently used approach is cross-validated feature selection such as RFE, RFS, RFECV or RRFE [58]: in RFECV, for example, an inner cycle evaluates the number of best-performing features, and only those are used for the subsequent classification task. This strategy is difficult to implement when the number of examples is low, since it requires a nested cycle with another cross-validation scheme. In a few papers the feature selection step is done with a more sophisticated algorithm: in [37] the whole approach is based on a feature selection method combining LASSO regularization and the GraphNet algorithm, which performs a spatially guided FS embedded in the SVM.


Not every work specified in detail the SVM kernel used; we can assume that most of them used a linear machine, while some of the others were performed with an RBF or Gaussian kernel. Even in this case, the dimensionality of the datasets probably played a fundamental part in the choice of the kernel. A recommended strategy in ML is to use a nested CV cycle to evaluate the best parameters of the kernel. This algorithm requires a large number of instances, because many of them are used only for the search of the best parameters. When working with 40 to 100 subjects, this approach is more difficult to apply.

Finally, to test the performance of the algorithm and its reliability, a cross-validation scheme is needed; for a more detailed description refer to section 3.4. More than 80% of the papers in this analysis use a Leave One Out cross-validation. This choice is probably forced by the limited number of subjects available, but none of the works explains this choice in detail.

2.1.1 Anxiety disorders studies

After reviewing the principal strategies used to tackle the clinical classification of disorders, we now go a little deeper into the results obtained for anxiety patients. The first thing worth noticing is that the number of studies focusing on anxiety-related conditions is quite low, and they used different approaches.

In particular, there are only two papers focusing on SAD, one using a voxelwise ReHo analysis [34] and one building the SVM with Pearson correlation features [51]. This second study reports that a linear SVM reached up to 82% accuracy using the 250 best-ranked features according to the F-score. They built a Pearson correlation matrix for each patient, after parcellating the brain with the AAL atlas, and then trained the classifier. We will see in chapter 5 that directly using a correlation matrix with an SVM classifier requires care and that the issue of correlated features must be addressed. They then show an interesting result on the accuracy of the classifier as the number of selected features changes, reporting a peak when using 250 of them out of the more than 6500 available.

This latter result introduces the dimensionality problem that we have experienced and discussed throughout this whole thesis: a 40-subject dataset can hardly be considered reliable for training and testing an SVM with a number of features that is at least one order of magnitude greater than the number of instances. For example, a similar paper [44] works with a dataset of about the same number of subjects, taken from the 1000 Functional Connectomes Project [59], and explores ADHD features. Performing FS with a t-test on a voxelwise map of ReHo values, they reported an accuracy increasing up to 80% with 6000 of them. Although the pathologies are different, it is hard to think, pending further investigation of the disorders' effects on the brain, that there could be such a gap between the numbers of informative voxels. Without doubt an accuracy of 82% is a very good result but, as stated by the same authors, before coming close to diagnostic reliability these methods need to be fed with a far greater number of samples.

The other paper [34] works with an 80-subject dataset and evaluates the regional homogeneity of each voxel with its 26 neighbouring ones. After reducing the number of features with a grey matter mask, they trained and tested a support vector machine with a LOO CV. The accuracy is 76% and no feature selection strategy is reported. Starting from this work we tried something similar in chapter 4, and we will see that even in this case the number of features seems to be too large compared to the number of subjects. Unfortunately this work lacks a discussion of feature selection and, in general, of the SVM methods; therefore a deeper discussion of the technique is not possible.

2.2 The materials

2.2.1 The original dataset

The dataset we worked on is made of 225 subjects; 110 of them are healthy controls, while the others suffer from anxiety-related disorders. In particular:

• 110 healthy controls, HC
• 37 subjects with Social Anxiety Disorder, SAD
• 18 subjects with Generalized Anxiety Disorder, GAD
• 37 subjects with Panic Disorder, PD
• 17 subjects with Post Traumatic Stress Disorder, PTSD
• 11 subjects with Dental Phobia, DP

All these disorders were included among the "anxiety disorders" in the DSM-IV [61], the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders, while in DSM-5 [62] post-traumatic stress disorder has been moved to the "Trauma- and Stressor-Related Disorders". The goal of this work is to test different solutions to classify the different conditions and to extract some information about the disorders themselves. The ideal objective would be to find a well-performing classification algorithm able to distinguish all kinds of disorders. We will see in the next chapters that this distribution of the subjects raises some problems for the classification algorithm. Another attempt is made to classify all the patients against the healthy controls, for which a dataset of 110 controls vs 118 patients is better suited.

Atlases To reduce the dimensionality of the volumes we needed a brain atlas. For this purpose we found two different solutions: the Brainnetome atlas [63] and the fROIs atlas from FIND Lab [64]. The human Brainnetome atlas provides a probabilistic map, with 210 cortical and 36 subcortical subregions, extracted from both functional and structural scans [65]. The Brainnetome atlas is the one we rely on throughout the whole thesis, since it is a bilateral, finely parcellated map. A template is also available for this atlas, which allowed us to check the registration with our dataset. Table A.1 lists the names and locations of the areas. The other atlas is a 90-ROI functional atlas, built using various task paradigms, which identifies common resting state networks [66]. This atlas would have been very interesting, since it includes a functional parcellation, with the DMN already considered as a single area. From a visual analysis, this map correctly selects the areas that are usually identified by functional resting-state connectivity.

Acquisition and preprocessing A Siemens 3 Tesla Magnetom scanner was used. The scans were acquired with a repetition time TR = 2080 ms and a voxel size of 2.3 × 2.3 × 3 mm (slice thickness = 3 mm). After the scans were imported into AFNI [67] and a 4D volume was built for each subject, a preprocessing pipeline was performed. First of all, each time volume was registered within the same subject; then a 4.0 mm FWHM Gaussian blurring was applied. Before registering each volume to the MNI template, a deconvolution was performed in order to remove the signal due to movement and the baseline. The output we used for our experiments was the residual error time series of the deconvolution. After these steps, the largest part of the noise, the random effects through the blurring filter and the movement artifacts through the deconvolution, should have been removed. We will see in the following chapters how this dataset has been further processed in order to perform a classification task.


2.2.2 HCP 900

The Human Connectome Project [68] is an ambitious program through which data from 1200 subjects have been collected and released, mostly as open access [69].

The HCP consortium collected a multi-modality dataset and preprocessed the data to different levels. The available sets come from structural MRI acquisitions, fMRI paradigms, both task fMRI (tfMRI) and resting state, and MEG/EEG experiments. To date the 1200-subject DB is available, but we worked on the previous 900-subject release: we downloaded a dataset, with 100 unrelated subjects, of the Emotion task fMRI experiment. The downloaded dataset was already preprocessed with a standard pipeline [70] which had already removed artifacts and motion effects, performed a cross-modal within-subject registration and an overall registration of each subject to the standard MNI space.

The emotion processing dataset was recorded using a task-evoked paradigm first used in [71]. It is a blocked experiment with two different alternating tasks. For each trial the subject sees a 3-second cue followed by an 18 s task block: it can be a sensorimotor control task, with shapes displayed, or the emotion task proper, with fearful or angry faces. Since we needed a control dataset on which to test a novel approach for tfMRI signals, before extending it to rs-fMRI data, we decided to download an HCP subset of already preprocessed 4D volumes. This allows an easier comparison with other results in the literature, while letting us work with a large number of subjects whose preprocessing had already been addressed with high quality standards. We will see in section 5.4 how, in detail, we used this dataset.

2.2.3 Software tools

The whole preprocessing and parcellation tasks have been performed using the AFNI package [67]. The datasets have subsequently been imported into Python [72] using the Nibabel package [73], while all the classification pipelines have been written using the Scikit-Learn package [74]. We used MATLAB [75] for the evaluation of the Hurst exponent.
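As an indication of how these tools fit together, the following sketch (a simplified example; the file names and atlas path are hypothetical) loads a preprocessed 4D volume and an atlas label image with Nibabel and averages the time series within each ROI, producing an n × T matrix of the kind used in the following chapters.

```python
import numpy as np
import nibabel as nib

# Hypothetical file names: a preprocessed 4D residual volume and an atlas of integer ROI labels
func_img = nib.load("subject01_residuals.nii.gz")     # shape (X, Y, Z, T)
atlas_img = nib.load("brainnetome_labels.nii.gz")      # shape (X, Y, Z), voxel values 1..n_rois

func = func_img.get_fdata()
labels = np.rint(atlas_img.get_fdata()).astype(int)

n_rois = labels.max()
n_t = func.shape[-1]
roi_ts = np.zeros((n_rois, n_t))
for roi in range(1, n_rois + 1):                       # assumes every label value is present
    mask = labels == roi
    roi_ts[roi - 1] = func[mask].mean(axis=0)          # average time course over the ROI voxels

print(roi_ts.shape)    # (n_rois, T)
```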

3 Machine Learning and SVM

3.1 Introduction

Machine Learning is a relatively novel approach within computer science that, generically, aims to solve pattern-recognition-like problems. Machine learning is intended to find a solution for all those problems we cannot solve with a conventional algorithm, because we lack information about the dataset's patterns, or with a look-up table, because the available data are limited. The goal of an ML method is to adaptively build an algorithm to interpret the input dataset: the underlying idea is to replicate, with statistics and distance measures, the ability of the human brain to recognise patterns and distributions, generalising the results to predict the outcome of new examples.

Within ML, three major groups of methods can be identified: supervised, unsupervised and reinforcement learning. Supervised learning builds the model starting from labels given by an external "teacher"; unsupervised learning is an approach where the system finds patterns using only the data, without outside inputs; and in reinforcement learning the feedback to the model is given by the subsequent evolution and response of the environment. The ML method to use is only one of the issues that need to be addressed to build a complete experiment. In the remainder of the chapter we will describe SVMs and the complementary strategies and tools that are usually needed, in particular the ones we used for our algorithms.

3.2 Support Vector Machines

Support Vector Machines (SVMs) are supervised learning models [76], widely used for classification and regression tasks. Originally formulated for pattern recognition purposes [77], they are now widely used in a multitude of fields. The basic idea on which they rely is finding the hyperplane that optimally separates two classes and maximizes the margin, defined as the distance of the hyperplane from the closest instances. Evolving from this binary classification model, SVMs can be generalized to non-linear classification problems, multiclass classification and regression.

3.2.1 Linear Classifier

A linear classifier makes its decision based on the value of a linear combination of the features. Let us consider a binary classification encoded with labels $r = \{+1, -1\}$. The feature vector is $\mathbf{x}$ and $\theta = [\mathbf{w}, b]$ is the vector of parameters we want to find. The linear classifier is thus defined as

\[ h(\mathbf{x} \mid \mathbf{w}, b) = \sum_{i=1}^{P} w_i x_i - b \tag{3.1} \]

so that the predicted class will be $\mathrm{sign}(h(\mathbf{x} \mid \mathbf{w}, b))$:

\[ r = \begin{cases} +1 & \text{if } h(\mathbf{x}) > 0 \\ -1 & \text{if } h(\mathbf{x}) < 0 \end{cases} \tag{3.2} \]

which is equivalent to requiring that $r\,h(\mathbf{x}) > 0$.
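As a small worked example of this decision rule (with arbitrarily chosen numbers), take $\mathbf{w} = (1, -2)$, $b = 0.5$ and an instance $\mathbf{x} = (3, 1)$: then

\[ h(\mathbf{x}) = 1 \cdot 3 + (-2) \cdot 1 - 0.5 = 0.5 > 0, \]

so the instance is assigned the label $r = +1$; a point with $h(\mathbf{x}) < 0$ would instead be assigned to the class $-1$.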

The goal now is to define a function to be maximized (or minimized) in order to find the optimal hyperplane from a training set $X = \{(\mathbf{x}^t, r^t)\}_{t=1}^{N}$ of $N$ examples. Aiming to correctly classify as many examples as possible, we can parametrize the losses and risks associated with a wrong classification and then try to minimize them. We can do so by defining a loss function that quantifies the loss incurred by assigning a sample to the wrong class. Let us define the 0-1 loss function:

\[ l_{01}(r\,h(\mathbf{x})) = \begin{cases} 1 & \text{if } r\,h(\mathbf{x}) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.3} \]

This corresponds to saying that a misclassification leads to a loss of 1, while all other decisions are costless. This description of the loss has the intrinsic limit of not measuring the degree of the error. Clearly, more sophisticated loss functions can be used to weight hits and errors more carefully. We can now formulate the empirical risk associated with the choice of a parameter vector $\theta = [\mathbf{w}, b]$ for the classifier:

\[ R_{emp}(\mathbf{w}, b \mid X) = \frac{1}{N} \sum_{t=1}^{N} l_{01}\bigl(r^t h(\mathbf{x}^t \mid \mathbf{w}, b)\bigr) \tag{3.4} \]

In practice, the empirical risk evaluates the percentage of misclassifications occurring with the choice of the hyperplane defined by the parameters $[\mathbf{w}, b]$. It is now straightforward that, in order to find the optimal set of parameters $\hat{\theta}$ providing the best separating hyperplane according to the loss function, we have to minimize $R_{emp}$:

\[ [\hat{\mathbf{w}}, \hat{b}] = \operatorname*{arg\,min}_{[\mathbf{w}, b]} R_{emp}(\mathbf{w}, b \mid X) \tag{3.5} \]

The parameters obtained by solving this optimization problem define the optimal separating hyperplane, whose decision function is a linear combination of the features.

3.2.2 Hard Margin SVM

Moving on from the previous description of a linear classifier, we can now describe support vector machines in the linearly separable case, i.e. all the cases for which a hyperplane perfectly separating all the instances can be found.

We consider all linear classifiers such that:

\[ r^t h(\mathbf{x}^t) \geq 1 \tag{3.6} \]

It is worth noticing that the instances are not only required to be on the right side of the hyperplane, but also to be some distance away: we want to maximize the margin in order to find the optimal separating hyperplane. Formally, the requirement can be written as:

\[ \frac{r^t(\mathbf{w}^T \mathbf{x}^t + b)}{\|\mathbf{w}\|} \geq \rho \quad \forall t \tag{3.7} \]

requiring the maximization of $\rho$.

Without further constraints the problem has infinitely many solutions obtained by rescaling $\mathbf{w}$; for a unique solution we require $\rho\|\mathbf{w}\| = 1$. The task can now be formulated as

\[ \min \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad r^t(\mathbf{w}^T \mathbf{x}^t + b) \geq 1 \tag{3.8} \]

which is equivalent to maximizing the margin $\frac{2}{\|\mathbf{w}\|}$. This is a quadratic optimization problem whose solution is given by the saddle point of the Lagrangian functional. Using the Representer theorem, we can write the optimal parameter as:

\[ \hat{\mathbf{w}} = \sum_{t=1}^{M} \alpha^t r^t \mathbf{x}^t \tag{3.9} \]

so that $\mathbf{w}$ is a weighted sum of the support set. It should be highlighted that the summation does not involve all $t = 1, \dots, N$, but only the subset of them for which $\alpha > 0$: those points are the support vectors. It can also be seen that the support vector subset is much smaller than the original set of samples. Thus, the driving points are the ones closer to the margin, while those far away have little impact on the hyperplane solution. For this reason, SVMs are relatively easy to build even with a high number of points, since all the samples far from the boundary are not involved in the computation.
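In practice, the support vectors and their coefficients can be read directly off a fitted classifier. The sketch below (a toy example on synthetic 2-D data, not related to the fMRI datasets of this work) fits a linear SVC with scikit-learn and shows that only a small subset of the training points ends up with a non-zero coefficient, and that the weight vector can be recovered as the weighted sum of Eq. 3.9.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs in 2-D
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.repeat([-1, 1], 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("n training points :", len(X))
print("n support vectors :", clf.support_vectors_.shape[0])   # typically far fewer than 100
print("dual coefficients :", clf.dual_coef_)                   # alpha_t * r_t for each support vector

# w reconstructed as the weighted sum of the support vectors (Eq. 3.9)
w = clf.dual_coef_ @ clf.support_vectors_
print("w from dual form  :", w, " w from coef_ :", clf.coef_)
```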


Figure 3.1: Linearly separable case

3.2.3 Soft Margin SVM

Up to now we have considered separable classes, for which it is possible to define an optimal separating hyperplane dividing the two classes while keeping a margin from their instances. Clearly, that is too restrictive a hypothesis for real datasets. Formally, it is equivalent to saying that if two classes are not linearly separable, there will be no solution satisfying the constraint $r^t(\mathbf{w}^T \mathbf{x}^t + b) \geq 1$. The soft margin formulation introduces a loss function for misclassified points which overcomes the problems of the 0-1 loss. Therefore, we can now define the hinge loss: some instances are allowed to violate the constraint, but a price is paid, proportional to the distance from the margin boundary.

\[ l_h(z) = \begin{cases} 1 - z & \text{if } z < 1 \\ 0 & \text{otherwise} \end{cases} \tag{3.10} \]

The new optimization problem is:

\[ [\hat{\mathbf{w}}, \hat{b}] = \operatorname*{arg\,min}_{[\mathbf{w}, b]} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{t=1}^{N} l_h\bigl(r^t(\mathbf{w}^T \mathbf{x}^t + b)\bigr) \tag{3.11} \]

and defines a soft-margin linear support vector machine.

It is now easier to understand what the support vectors are, compared to the other training points: with the hinge loss function, the cost of a point that is correctly classified and lies beyond the margin is zero, no matter how far from the boundary it is. Therefore, the support vectors are the points that are misclassified or insufficiently far from the hyperplane, i.e. those contributing to the summation in the second term of Eq. 3.11.

(a) Non-separable case, $C = 10^{-3}$; (b) Non-separable case, $C = 10^{100}$

The term $C$ in Eq. 3.11 is the regularization parameter and represents the trade-off between margin maximization and error minimization. When $C$ is small, the solution tends to maximize the margin, regardless of the penalty for misclassification. On the other hand, with a larger $C$ the cost of misclassification becomes the driving term, and even a single outlier can heavily penalize the solution; in the limit $C \to \infty$ no violations are allowed and the problem reduces to the hard margin SVM of Eq. 3.8. Usually the value of $C$ is unknown and must be tuned using cross-validation.
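A common way to perform this tuning is a grid search over $C$ wrapped in cross-validation. The sketch below (a minimal example on placeholder data) shows the scikit-learn idiom; for an unbiased accuracy estimate, this search should itself sit inside an outer cross-validation loop, the nested scheme mentioned earlier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 300))     # placeholder feature matrix (subjects x features)
y = np.repeat([0, 1], 40)

param_grid = {"C": [1e-3, 1e-2, 1e-1, 1, 10, 100]}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=inner_cv, scoring="accuracy")
search.fit(X, y)
print("best C:", search.best_params_["C"], "inner-CV accuracy:", search.best_score_)
```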

3.2.4 Nonlinear SVMs

So far, the problem considered was linear, which is a hypothesis that is not always satisfied. Let us consider the case of Fig. 3.3: to the human eye it is clear that this is a separable case, and it is easy to define a non-linear boundary between the two classes.


One solution to this problem is to map the features to a new space that linearizes the problem, and then to use a linear classifier over the new set of features. The new dimensions are calculated through the basis functions $\mathbf{z} = \phi(\mathbf{x})$, mapping the original points, lying in a k-dimensional space, to a new p-dimensional space. Let us consider how $\mathbf{w}$ changes from Eq. 3.9:

\[ \hat{\mathbf{w}} = \sum_{t=1}^{M} \alpha^t r^t \phi(\mathbf{x}^t) \tag{3.12} \]

Thus, for the optimization problem of Eq. 3.11 we have to substitute the two terms containing $\mathbf{w}$:

\[ \|\mathbf{w}\|^2 = \langle \mathbf{w}, \mathbf{w} \rangle = \sum_{t=1}^{M} \sum_{s=1}^{M} \alpha^t \alpha^s r^t r^s \langle \phi(\mathbf{x}^t), \phi(\mathbf{x}^s) \rangle \tag{3.13} \]

\[ h(\mathbf{x} \mid \mathbf{w}, b) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle = \sum_{t=1}^{M} \alpha^t r^t \langle \phi(\mathbf{x}^t), \phi(\mathbf{x}) \rangle \tag{3.14} \]

The idea in kernel machines is to replace the inner product of basis functions in Eq. 3.13 and Eq. 3.14 with a kernel function between the instances of the original input space: $\langle \phi(\mathbf{x}_1), \phi(\mathbf{x}_2) \rangle \rightarrow K(\mathbf{x}_1, \mathbf{x}_2)$. This solution, the so-called kernel trick, avoids calculating a whole new feature set, which can be computationally expensive in high-dimensional spaces. Even the use of these non-linear SVMs becomes easier through the decision function:

\[ h(\mathbf{x}) = \sum_{t=1}^{M} \alpha^t r^t K(\mathbf{x}^t, \mathbf{x}) \]

The most popular kernel functions reflect common data distributions:

• Polynomial: $K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1^T \mathbf{x}_2 + 1)^d$

• Radial Basis Function: $K(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\gamma \|\mathbf{x}_1 - \mathbf{x}_2\|^2\right)$
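As a quick check of what the kernel trick buys, the snippet below (a toy illustration, not part of the thesis pipeline) evaluates the RBF kernel between two points directly from the formula above and verifies that it matches scikit-learn's pairwise implementation, without ever constructing an explicit feature map φ.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0, 3.0]])
x2 = np.array([[0.5, -1.0, 2.0]])
gamma = 0.1

# explicit formula: K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
k_manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
k_sklearn = rbf_kernel(x1, x2, gamma=gamma)[0, 0]

print(k_manual, k_sklearn)   # identical values, no explicit feature map needed
```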

3.3 Assessing the performance

We have now explained how a classifier is built and tested, but we still need to formalize a metric for performance assessment [78]; in particular we will focus on the case of two-class classification. First of all, we can start from the confusion matrix of Table 3.1, displaying the actual labels of the examples and how they have been classified. On the diagonals of the table we can identify how many samples have been correctly classified (T = TP + TN) and how many were misclassified (F = FP + FN). The most common measures are the accuracy, or correct classification rate, and the error rate:

$$Accuracy = \frac{TP + TN}{P' + N'} \qquad ErrorRate = \frac{FP + FN}{P' + N'} = 1 - Accuracy$$

These two metrics are particularly valuable for balanced datasets, for which a random classifier would score Accuracy = ErrorRate = 0.5, making the result immediately clear.



As soon as the dataset is unbalanced (see Section 3.6.1), or we simply need further insight into the results, other metrics become necessary [79].

Two other frequently used scores are:

$$Specificity = \frac{TN}{N} \qquad Sensitivity = \frac{TP}{P}$$

They are also called true negative rate and true positive rate: they normalize the correctly classified instances by the actual number of instances in each class, and are helpful to understand with which class the classifier performs better, or whether its performance is balanced between the two classes.

Similarly, Precision and Recall can be used. Precision is the number of correctly classified positive examples out of all the samples classified as positive:

$$Precision = \frac{TP}{P'}$$

Recall, on the other hand, is equal to Sensitivity.

                          Prediction
                    Positive              Negative              Total
Actual   Positive   TP (True Positive)    FN (False Negative)   P
Value    Negative   FP (False Positive)   TN (True Negative)    N
         Total      P'                    N'

Table 3.1: Two-class confusion matrix

The last tool using these values is the ROC curve, Receiver Operating Characteristic, which is often used together with cross-validation (Section 3.4) and which displays multiple scores of the classifier for different settings.

This plot represents how the behaviour of the classifier varies: the ideal point is (0, 1), in the upper left corner, but whenever the points lie above the random-performance line the classifier can be considered better than a random one.
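The metrics above can be computed directly from the confusion matrix; the following sketch (Python with scikit-learn, using made-up labels and scores purely for illustration) derives accuracy, sensitivity, specificity and precision from the four cells of Table 3.1 and builds the ROC curve by sweeping a threshold over the classifier scores:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# hypothetical true labels and classifier decision scores
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05])
y_pred = (scores >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
precision   = tp / (tp + fp)

# ROC curve: one (FPR, TPR) point for every threshold on the scores
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(accuracy, sensitivity, specificity, precision, roc_auc_score(y_true, scores))
```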

3.4 Cross-validation

In a classification task, we are supposed to train a classifier on the known samples to infer the distinctive features of the population and to be able to properly classify a newly collected example. In practice, since the samples are limited, one way to assess the performance of the classifier is to split the available dataset into two groups, a training and a testing set; both training and testing are thus performed on a subsample of the dataset. Even though this strategy allows us to test the classifier, it is clear that we are discarding all the information contained in the testing set, increasing the risk of overfitting and of finding a solution specific to the training set.


Figure 3.6: ROC curve

At the same time, the performance of the classifier is assessed only on the testing set, which means the result would not be general. To overcome these limitations and to provide a more robust assessment of the model, a resampling method [80] is commonly used.

Since it has been used throughout the experiments of this work, we will now give a brief description of cross-validation. It consists in repeatedly drawing m out of n samples for the training set, fitting a classifier on these data and, finally, testing it on the remaining validation set. How the samples are drawn depends on the cross-validation approach, which should be chosen based on the dimensionality of the dataset.

3.4.1 K-fold vs Leave One Out

In the k-fold CV, the dataset is split into k < n groups, or folds. A classifier is then trained on k − 1 folds and tested on the remaining one. The number of folds is usually k = 10 or k = 5, meaning that the training and testing procedure has to be performed 10 or 5 times. In the Leave-One-Out CV, LOOCV, the number of folds is k = n, so that the classifier is trained n times and tested each time on a single sample. The first big difference between the two approaches is the computational load, which is usually smaller for k-fold CV. Obviously, it depends on the number of samples: on a 10-sample dataset, a 10-fold CV is computationally identical to LOO-CV. At the same time, if the dataset is extremely large, LOO-CV can become an unfeasible strategy.



The second difference between the two approaches is related to the accuracy of the estimation and its two main sources of error, bias and variance [81]. LOO-CV has a very low bias: since the classifier is always trained on n − 1 instances, it is almost equivalent to training it on the whole dataset. On the other hand, k-fold CV performs better than LOO-CV in terms of variance: with a very small testing set the error estimate can vary widely, for example when the single test sample happens to be an outlier. From a practical point of view, a 10-fold CV has proved to be a reliable and feasible procedure, with neither a high bias nor a high variance.
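A minimal comparison of the two approaches (a scikit-learn sketch on a synthetic dataset, not the thesis experiments) is the following:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0)

# 10-fold CV: 10 fits, each tested on roughly 6 samples
kfold_scores = cross_val_score(clf, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=0))

# LOO-CV: 60 fits, each tested on a single sample (per-fold accuracy is 0 or 1)
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```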

3.5 Feature Selection

Feature selection is a common step that can be included in a ML pipeline to select only a subset of relevant features, particularly when their number is extremely high. The whole point of feature selection is deciding how to assess whether a feature is relevant or not. The other methodological point is that the feature selection step needs to be included in the CV loop, so that the features are not chosen before the test instances have been removed. We will now see the most common FS strategies for SVMs, although the number of possible approaches is virtually unlimited.

Variance threshold. It is one of the easiest ways to perform FS: the variance of every feature is evaluated, and only the features above a threshold, more or less directly chosen by the experimenter, are selected. If the variance is low, it means that all the instances take similar values for that variable, so the feature is probably not a discriminating one. This step can be preceded by a scaling step, especially when the features have different orders of magnitude.

Univariate selectors. The variance selector does not take into account the supervised differences between the groups. Univariate selectors work by testing each feature with a univariate statistical test and then selecting a subset of features based on their scores and p-values. The most common selectors use ANOVA F-scores, chi-square tests or mutual information, but many others can be chosen. As we have seen in the meta-analysis, many studies used t-tests to select the features, probably to help the interpretation of the results, since this is the standard method to assess differences in rs-fMRI datasets.

Recursive Feature Selection. A model-based selection is performed in order to recursively discard a number of features from the whole set (RFE, recursive feature elimination) or to recursively add them to an initial seed subset. Instead of performing a univariate test, the SVM itself is run and only the features with the highest coefficients are kept: the idea is to assess which features perform best in the actual SVM task.


3.6 Data dimension

Machine learning is a particularly effective approach to high-dimensional datasets. In its simplest classification task, an SVM has to learn a model able to distinguish which of two classes a new example belongs to. Thus, the ideal configuration for training an SVM would be a large number of samples, uniformly distributed between the two classes. As soon as the datasets are not synthetic, but come from a real collection of measurements, this requirement of large, balanced groups is seldom fulfilled. In particular, for our dataset of patients with anxiety-related disorders, the groups are unbalanced and small, with a number of features far larger than the number of samples. In Section 3.4 we have already discussed the impact of the number of samples on the two main CV approaches. We will now address the problems related to unbalancement and the most common ways to overcome them.

3.6.1 Unbalancement

We will now address the behaviour of a classifier when the dataset is unbalanced. What typically happens in this case is that the classifier learns to classify the bigger class better than the smaller one. The result is a different accuracy for the two classes, with the bigger group well classified in the majority of cases and a higher error rate for the other class. The first problem raised by this condition is that the results cannot be easily interpreted, and the accuracy score needs at least to be accompanied by other measures such as precision and recall. Even with a richer description of the results, a low accuracy for one of the two classes can be particularly serious, especially when the smaller class is the one we need to identify. This is the case in our dataset when a group of patients is classified against the healthy controls: classifying a patient as an HC is a more severe error than the opposite one, but since the HCs outnumber the patients with SAD, that is what is likely to happen.

To address this problem, many solutions have been proposed and continue to be improved, depending on the machine learning method used [79]. For SVMs in particular, the risks are intrinsic to the formulation of the problem. The machine tries to build a boundary maximally distant from both distributions; from a statistical point of view, however, the boundary should lie closer to the bigger group: drawing more points from a distribution improves its estimate, while, if more samples of the smaller class were available, points further from its median would probably be found. The other risk is associated with the C parameter in Eq. 3.11, which weights the misclassification errors: misclassifying a point of the smaller class and an outlier of the bigger one carry the same weight [82].


The most common solutions act directly on the dataset, through undersampling and oversampling. Random undersampling obtains a balanced dataset by removing instances from the bigger group, using only real samples. This is a popular approach, since it resolves the unbalancement without adding complexity to the algorithm; on the other hand, it discards information on the underlying distribution. Opposite to undersampling, oversampling adds synthetic points to the smaller group. These new examples can be replicas of existing points or combinations of them. The biggest problem of this approach is that it does not necessarily add information about the distribution: it increases the risk of overfitting, creating a solution tailored to the available dataset which may not be representative of the underlying class. Both methods can be refined with more or less complex variants, which can be specific to the kind of problem at hand, the classification algorithm used and the existing dataset.
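Two common ways of mitigating the unbalancement are sketched below (a scikit-learn/NumPy example on a synthetic unbalanced dataset, not the data of this thesis): re-weighting the per-class misclassification cost, and randomly undersampling the bigger class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# unbalanced toy dataset: roughly 90% class 0 ("controls"), 10% class 1 ("patients")
X, y = make_classification(n_samples=200, n_features=20, weights=[0.9, 0.1], random_state=0)

# option 1: re-weight the per-class misclassification cost instead of resampling
weighted_svm = SVC(kernel="linear", C=1.0, class_weight="balanced")
print(cross_val_score(weighted_svm, X, y, cv=5, scoring="balanced_accuracy").mean())

# option 2: random undersampling of the bigger class (plain NumPy sketch)
rng = np.random.default_rng(0)
idx_major = np.flatnonzero(y == 0)
idx_minor = np.flatnonzero(y == 1)
keep = rng.choice(idx_major, size=idx_minor.size, replace=False)
idx_balanced = np.concatenate([keep, idx_minor])
X_bal, y_bal = X[idx_balanced], y[idx_balanced]
print(cross_val_score(SVC(kernel="linear"), X_bal, y_bal, cv=5).mean())
```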


Features

4.1 Introduction

The first thing we want to investigate is whether the brain activity is locally different between healthy subjects and patients. Endogenous brain fluctuations [8] have shown a non-random pattern, even though the physiological mechanisms underlying those signals are not yet fully understood. Moreover, many studies have been able to recognise several resting state networks (RSN) [4], showing the presence at rest of a common behaviour between close and distant areas. Those networks have been reported as dysfunctional in a variety of diseases [83] [84], analysed with different approaches. Since the first rs-fMRI papers were published, there has been a big effort to find synthetic measures of the local activity. The methods used to analyse task-induced signals, like deconvolution, cannot be used with resting state datasets. Consequently, the attention has moved to the power and patterns of the signals, trying to recognize recurrent aspects. We selected four of the most used indices, which synthesise three major properties of the signal: power, spatial coherence and fractality of the time courses. Afterwards, we used these measures as features for SVM classifications.

4.1.1 ALFF and fALFF

Whenever a brain analysis method is approached, the frequency analysis of the recorded time series is usually a fundamental step. The frequency range of brain signals has been hierarchically divided into "oscillation classes" by [85] and, most likely due to time resolution and data sources, each of them is of particular interest for a different brain imaging modality. For resting state fMRI, the low frequency oscillations (LFO, f < 0.1 Hz) have proven to carry the majority of the information. To assess the amount of activity in the LFO spectrum, two synthetic measures have been proposed and widely used throughout many studies. The Amplitude of Low Frequency Oscillations, ALFF [86], is evaluated as the
