Figure 4.3: Histogram representing frequencies of β-values in a representative sample
methylated at that position. The β-values are therefore intensity estimates, with β ∈ [0, 1]. As can be seen in figure 4.3, a large portion of the probes takes values at the extremes of the interval.
The SPECT dataset instead consists of 3D grayscale images that have been aligned together and saved in the NIfTI-1 data format. Each image is represented by a 91 × 109 × 91 tensor, with values normalized to the range [0, 1].
4.4 Data preprocessing
4.4.1 DNA methylation data
As a first explorative step, an unsupervised analysis of the methylation matrix has been performed through principal component analysis. The projections of the data on the two principal components are shown in figure 4.4.
Figure 4.4: Principal component analysis of methylation data
From these projections, the data appear to be divided into two clusters related to the biological sex of the patients. Cases and controls seem to be split evenly between the two clusters. For this reason, all the probes corresponding to DNA positions on the sex chromosomes have been excluded from the analysis.
Moreover, [52] identified a number of probes that, because of the specific DNA region they refer to, should be excluded during probe selection. We therefore further pre-filter those probes from the data matrix.
A further selection step is performed considering the final goal of embedding the features in a network of genes: only the CpGs located at DNA positions corresponding to a gene are kept in the analysis.
Subsequently, the 10000 probes most relevant to Parkinson's disease classification have been selected on the basis of univariate ANOVA statistical tests.
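As an illustration, this selection step can be expressed with scikit-learn's SelectKBest and the ANOVA F-test (f_classif). The following is a minimal sketch on a synthetic stand-in matrix, not the thesis code; the real β-matrix and label vector come from the dataset described above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
beta = rng.random((450, 50000))    # stand-in (samples x probes) beta-matrix
y = rng.integers(0, 2, size=450)   # illustrative PD / control labels

# Keep the 10000 probes with the highest ANOVA F-statistic w.r.t. the labels.
selector = SelectKBest(score_func=f_classif, k=10000)
beta_selected = selector.fit_transform(beta, y)          # -> (450, 10000)
kept_probe_idx = np.flatnonzero(selector.get_support())  # indices of retained CpGs
```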
As previously said, the goal is to embed the selected features on a biologically meaningful graph. There are many different biological graphs that could be considered, which differ in the number of genes included and in the amount and type of relationships between them. For the purpose of this work we decided to use the protein-protein interaction graph. In particular, we refer to the interaction data recorded in the STRING database [13], which contains information on gene associations recovered from multiple sources, such as biological experiments, the literature and computational prediction. We decided to focus in particular on the associations that have been biologically proved, in order to avoid adding further uncertainty to the problem.
The resulting graph dataset is composed of 450 graphs, each with 10000 nodes, where each node represents a location in the DNA (a CpG). The edges between nodes are constructed from the biological interactions between the proteins encoded by the genes at the corresponding CpG locations; a possible construction is sketched below.
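One way to build this shared edge structure is sketched here, assuming a mapping from each retained CpG node to its gene and a set of validated gene-gene interactions extracted from STRING; the gene names and toy interaction set are purely illustrative.

```python
import torch

# Toy inputs: node index -> gene symbol, and a set of validated PPI pairs.
cpg_gene = {0: "SNCA", 1: "LRRK2", 2: "PRKN", 3: "PINK1"}
ppi = {("SNCA", "LRRK2"), ("LRRK2", "PRKN"), ("PRKN", "PINK1")}

# Connect two CpG nodes whenever their genes' proteins interact.
edges = []
nodes = sorted(cpg_gene)
for a in nodes:
    for b in nodes:
        if a < b and ((cpg_gene[a], cpg_gene[b]) in ppi
                      or (cpg_gene[b], cpg_gene[a]) in ppi):
            edges += [(a, b), (b, a)]       # undirected graph: both directions

edge_index = torch.tensor(edges, dtype=torch.long).t()   # PyG format: (2, E)
```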
4.4.2 SPECT imaging data
The volumetric SPECT data are first pre-processed using an algorithm for superpixel generation. Different algorithms have been proposed for this purpose, but one of the fastest and most effective is the simple linear iterative clustering (SLIC) algorithm. SLIC generates superpixels by clustering pixels based on their color similarity and proximity in the image [1]. An example of the results produced by SLIC is given in figure 4.5, where it is applied to a slice of a SPECT image. The SLIC algorithm is run through the skimage.segmentation.slic function on the 3D images, attempting to create around 1000 superpixels for each image.
Figure 4.5: Slice of a SPECT image on the left; resulting boundaries given by the SLIC algorithm on the right
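A minimal sketch of this step on a random stand-in volume follows; the compactness value is illustrative, since the thesis only specifies the function and the target number of segments.

```python
import numpy as np
from skimage.segmentation import slic

volume = np.random.rand(91, 109, 91).astype(np.float32)  # stand-in for one scan

# channel_axis=None marks the input as a single-channel 3D volume
# (older skimage versions use multichannel=False instead).
labels = slic(volume, n_segments=1000, compactness=0.1, channel_axis=None)
n_superpixels = len(np.unique(labels))  # close to, but rarely exactly, 1000
```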
We then construct a graph based on this division into superpixels:
• each superpixel is associated with a node in the graph;
• the feature vector of each node consists of four features: the first is the mean of the intensity values of the elements in the superpixel, while the other three are the coordinates of the center of mass z ∈ R³ of the superpixel itself;
• edges between nodes are formed based on the Euclidean distance between their centers of mass, with weights given by the following formula [11]:

W_{ij} = \exp\left( -\frac{\|z_i - z_j\|^2}{\sigma^2} \right)

with σ = 0.1π. Only the edges connecting each node to its k = 20 nearest neighbours are actually kept in the final graph (a sketch of this construction follows the list).
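Continuing from the `volume` and `labels` arrays of the previous sketch, the node features and the weighted k-nearest-neighbour edges could be computed as follows. Normalizing the centres of mass to [0, 1] is an assumption made here so that σ = 0.1π yields non-vanishing weights; the thesis does not state the coordinate units.

```python
import numpy as np
from scipy.spatial.distance import cdist

sigma, k = 0.1 * np.pi, 20
ids = np.unique(labels)

# Node features: mean intensity plus the centre of mass of each superpixel,
# with coordinates normalized to [0, 1] (assumption, see text above).
feats = np.zeros((len(ids), 4))
for n, s in enumerate(ids):
    mask = labels == s
    feats[n, 0] = volume[mask].mean()
    feats[n, 1:] = np.argwhere(mask).mean(axis=0) / np.array(volume.shape)

z = feats[:, 1:]                              # centres of mass
d2 = cdist(z, z, metric="sqeuclidean")        # pairwise squared distances
W = np.exp(-d2 / sigma**2)                    # Gaussian edge weights

# Keep only the edges towards the k nearest neighbours of each node.
nn = np.argsort(d2, axis=1)[:, 1:k + 1]       # column 0 is the node itself
edges = [(i, j, W[i, j]) for i in range(len(ids)) for j in nn[i]]
```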
Starting from an initial input size of 91 × 109 × 91 = 902629 voxels, the dimensionality of the input is thus reduced to around 4 × 1000 node features, while the newly created graphs contain on average 20 × 1000 = 20000 edges.
4.5 Experiments and results
The code has been implemented in Python, using the PyTorch [38] and PyTorch Geometric [14] libraries for the neural-network components.
The model constructed for classifying the methylation graphs consists of 3 convolutional layers followed by 2 fully connected layers. The size of the hidden node representations has been set to 4 after some experiments. Skip connections after each convolutional layer are also included to make the model more flexible. The large size of the underlying graph limits the construction of a deeper and more complex architecture.
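A sketch of such an architecture in PyTorch Geometric, shown here with GraphSAGE layers; the mean-pooling readout and the single β-value node feature are assumptions, since the thesis does not spell them out.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, global_mean_pool

class MethylationGNN(torch.nn.Module):
    """3 convolutional layers (hidden size 4), skip connections, 2 FC layers."""

    def __init__(self, in_dim=1, hidden=4, n_classes=2):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.conv3 = SAGEConv(hidden, hidden)
        self.fc1 = torch.nn.Linear(hidden, hidden)
        self.fc2 = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))      # no skip here: dims change
        h = h + F.relu(self.conv2(h, edge_index))  # skip connection
        h = h + F.relu(self.conv3(h, edge_index))  # skip connection
        g = global_mean_pool(h, batch)             # graph-level readout (assumed)
        return self.fc2(F.relu(self.fc1(g)))
```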
The network has been trained with the so-called RMSProp (Root Mean Square Propagation) algorithm, using a standard cross-entropy loss function.
The training is performed for 200 epochs, with early stopping if the accuracy does not improve for a given number of epochs. The hyperparameters used during training are summarized in table 4.1.
Training hyperparameters (methylation)
  Learning rate            10⁻³
  Batch size               16
  Number of epochs         200
  Early stopping (epochs)  50
  L2 weight decay          10⁻¹

Table 4.1: Values of hyperparameters used during training with RMSProp (methylation data classifier)
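A minimal training loop matching table 4.1 could look as follows; `train_loader`, `val_loader` and `evaluate` are hypothetical placeholders for the data pipeline, not the thesis code.

```python
import torch

model = MethylationGNN()   # from the sketch above
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, weight_decay=1e-1)
criterion = torch.nn.CrossEntropyLoss()

best_acc, patience, stall = 0.0, 50, 0
for epoch in range(200):
    model.train()
    for data in train_loader:          # hypothetical PyG DataLoader, batch size 16
        optimizer.zero_grad()
        loss = criterion(model(data.x, data.edge_index, data.batch), data.y)
        loss.backward()
        optimizer.step()

    acc = evaluate(model, val_loader)  # hypothetical validation helper
    if acc > best_acc:
        best_acc, stall = acc, 0
    else:
        stall += 1
        if stall >= patience:          # early stopping after 50 stale epochs
            break
```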
The experiments have been performed with stratified 3-fold cross-validation, across different types of convolutional layers (a minimal sketch of the protocol follows the list):
• ChebNet refers to the convolutional operation described in formula (3.16), with r = 3;
• GCN refers to the convolution described in (3.20);
• GraphSAGE refers to the operation in (3.23);
• GAT is instead the attentional convolution in (3.32), with 3 heads.
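The evaluation protocol could look like the following sketch, scored with the ROC-AUC metric discussed below; `train_on` and `predict_proba` stand in for the per-fold training and inference code and are hypothetical, as is the toy label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

labels = np.array([0, 1] * 225)                 # illustrative per-graph labels
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

aucs = []
for train_idx, test_idx in skf.split(np.zeros((len(labels), 1)), labels):
    model = train_on(train_idx)                 # hypothetical: fit one fold
    scores = predict_proba(model, test_idx)     # hypothetical: positive-class scores
    aucs.append(roc_auc_score(labels[test_idx], scores))

print(f"mean AUC over folds: {np.mean(aucs):.3f}")
```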
The results obtained with the different layer types are reported in figure 4.6, in terms of Area Under the ROC (Receiver Operating Characteristic) curve. A ROC curve summarizes the performance of a binary classification model on the positive class by plotting the True Positive Rate against the False Positive Rate as the threshold parameter of the classifier varies. The Area Under the ROC Curve (AUC) is a good metric for evaluating the discriminative power of a classifier, and it is preferable to accuracy in the case of imbalanced datasets with binary labels.
Figure 4.6: Average value of AUC obtained using different types of convolutional layers in the model
We can observe that none of the classifiers provides a high average ROC-AUC value. However, the one composed of GraphSAGE convolutional layers seems to perform slightly better than the rest, providing an average AUC of 0.74 and an average classification accuracy of 75.8%.
The attentive model proposed in section 3.3 has also been tested on the methylation graphs; however, it does not seem able to capture the information in the methylation signal.
There are several possible reasons for the low performance observed in figure 4.6.
First of all, the limited amount of data available, together with the high dimensionality of the dataset and the amount of noise it contains, makes it difficult to train any model in a 3-fold cross-validation setting. Only 66.6% of the data is in fact used during training, while the remainder is split equally between validation and test sets.
Secondly, we made a strong assumption by choosing a specific biological structure for the underlying graph. Different experiments should be performed to analyse whether the results improve with different biological graphs. Moreover, the large size of the graphs made training a deeper convolutional network difficult, while the feature selection step was also strongly affected by the few samples available for such noisy data.
Differently from the classifier used for the methylation data, the models constructed for classifying the SPECT graphs exploit the framework introduced in section 3.3. They consist of 6 convolutional layers and, also in this case, three different types of convolution (ChebConv, GraphSAGE and GAT) are analyzed. The latent feature vectors are this time composed of 10 values, which is also the number of variables constituting the graph representation g produced by the fully connected layers attached at the end of the convolutional block. The attentive mechanism described in equation (3.38) is applied every two layers, and the attention function used is (3.39).
This time the model does not include a skip-connection mechanism, as the attentive connections already allow the information contained in the early layers to be considered directly when building the final feature vector. The optimization algorithm used during training is Adam, an adaptation of the classical stochastic gradient descent procedure, and the hyperparameters used are reported in table 4.2.
Training hyperparameters (SPECT)
  Learning rate            5 · 10⁻³
  Batch size               16
  Number of epochs         100
  Early stopping (epochs)  20

Table 4.2: Values of hyperparameters used during training with Adam (SPECT data classifier)
Different experiments have been carried out concerning the final integration of the attentive vectors (an illustrative sketch follows the list):
• case 1: the attentive vectors {g_a^l}_{l∈L} are averaged together and the resulting vector is used for the final classification;
• case 2: the vectors {g_a^l}_{l∈L} are concatenated together before performing the final classification;
• case 3: the global vector g and the vectors {g_a^l}_{l∈L} are averaged together and then used for classification.
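The three variants amount to different readouts over the attentive vectors; an illustrative implementation (the function and mode names are ours, not the thesis code) could be:

```python
import torch

def combine(g, g_att, mode):
    """Combine the global vector g (batch x d) with the per-scale attentive
    vectors g_att (a list of batch x d tensors, one per attended layer)."""
    if mode == "average":         # case 1: mean of the attentive vectors
        return torch.stack(g_att).mean(dim=0)
    if mode == "concat":          # case 2: classifier input grows to len(g_att) * d
        return torch.cat(g_att, dim=1)
    if mode == "global_average":  # case 3: include the global vector in the mean
        return torch.stack(g_att + [g]).mean(dim=0)
    raise ValueError(f"unknown mode: {mode}")
```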
Figure 4.7: Average value of AUC obtained using different kinds of convolutional layers in the model, with different integrations of the attentive vectors
The results of the experiments, analyzed with different convolutional layers, are shown in figure 4.7. They suggest that considering the global vector g together with {g_a^l}_{l∈L} probably brings more noise into the model, resulting in the worst performance consistently across layer types. The model that seems to perform best is instead the one that averages the features of the attentive vectors at the different scales.
Even though more experiments should be performed towards a more systematic hyperparameter selection, adapted to each model separately, we decided to focus on the model that performed best in the previous analysis: the 6-layer model with ChebConv convolutional layers. A comparison is carried out to verify what happens when the attentive mechanism is applied after each of the first three layers (early attention) or after each of the last three layers (late attention).
Attention model    Accuracy (%)    ROC-AUC
Early attention    80.81 ± 2.03    0.887 ± 0.014
Late attention     79.86 ± 1.05    0.844 ± 0.018

Table 4.3: Results from the SPECT graph classifier
Interestingly, the results differ from what was expected: the attention model seems to work slightly better, especially in terms of AUC, when applied after the first layers of the network. In the case of convolutional networks applied to Euclidean images, [22] observed that attention models attached to deeper layers led to better results. This was easily explained by the fact that in natural images, such as those analyzed in that paper (CIFAR-100), the last layers are the ones that convey the most information about the objects present in the image. In our case, however, the setting is quite different: all the images represent the same object (a patient's brain), and it is more probable that the details important for classification are found at smaller scales. Moreover, the division into superpixels already coarsened the input data, so that each superpixel and its immediate neighbours already contain aggregated information ready to be used by the network.
More experiments concerning network depth and variations of the attentional functions should be considered in order to further optimize the models.
Chapter 5
Conclusions and future work
In this work an extensive introduction to graph neural networks has been provided, with a description of the main methods developed in this emerging field. After describing how the convolutional operations typical of CNNs can be extended to graphs, a particular focus has been given to attentive mechanisms. In the context of graph classification, a new method for computing multi-scale self-attentive features has been introduced. Attention mechanisms had previously been proposed for various structures inside graph neural networks; however, no mechanism allowed attention to be integrated in a multi-scale fashion, taking into account representations at different scales simultaneously and allowing them to attend over one another.
The presented methods have been applied to graphs constructed from biological data, and an analysis of how different models perform on different data has been carried out. The search for new models able to integrate biological knowledge into standard machine learning pipelines is a field that is just starting to be explored and may bring many important advantages. On the one hand, the integration of biological biases may help increase both the accuracy and the interpretability of the models. On the other hand, insights given by interpretable models may serve as input for generating new biological hypotheses, thus opening new research directions. A sector in which graph neural networks may be useful in the future is the search for important biological pathways correlated with specific diseases and, in this context, attentive mechanisms on graphs could play an important role.
However, when adding biological networks to the analysis, it is important to verify the intrinsic value of the underlying graphs selected as bias. One idea for future work is the analysis of how different prior graphs affect the final results and which ones bring the most valuable knowledge for the task at hand.
Concerning DNA methylation data, it could be interesting to analyze whether different gene interaction graphs, such as gene regulatory or co-expression networks, could add more value to the analysis. Moreover, it could be interesting to approach the problem with a multiplex framework, where nodes representing genes are connected by multiple edge types in a unified multi-layer structure.
In the context of volumetric images, it is important to note that analyses at different scales may bring different results. Low levels of abstraction tend to retain information about the details of the images, while with bigger superpixels only the global information is preserved. For this reason it could be interesting to integrate different levels of abstraction in the same graph neural network, as proposed by [25], and analyze how this would improve the results in the context of biomedical imaging.
It is also important to note that, in the context of Parkinson's disease, the amount of available multimodal data is slowly increasing. Among these data, functional magnetic resonance imaging (fMRI) is naturally predisposed to be represented as graphs. After the necessary preprocessing steps, specific regions of interest in the brain can be encoded as nodes; then, by computing the correlations between the signals of these regions, a correlation matrix can be constructed. The final edges are derived from the correlation matrix by applying a task-specific thresholding operation. The resulting connectivity graphs, describing the functional relationships between regions of the brain, represent the perfect input for graph neural networks [32].
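As an illustration of this construction (toy data only; the ROI count, series length and threshold are arbitrary choices, not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
ts = rng.standard_normal((90, 200))   # toy data: 90 ROIs x 200 time points

corr = np.corrcoef(ts)                # (90, 90) region-by-region correlations
np.fill_diagonal(corr, 0.0)           # no self-loops

threshold = 0.3                       # task-specific choice
adjacency = np.where(np.abs(corr) > threshold, corr, 0.0)
edges = np.argwhere(adjacency != 0)   # edge list for a graph neural network
```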
In this setting, an attentive model could potentially bring insights into the parts of the brain that relate to specific functional problems, such as those connected to Parkinson's disease.
In the context of data integration in the biomedical field, at least at the current stage, one of the main limitations is that not all data types are available for every patient, so classical data-integration techniques may fail because of the small amount of data in the intersection of the datasets. It could therefore be interesting to explore how data encoders, trained separately on different data modalities, would perform in a downstream integration task when presented with previously unseen data.
Finally, integrating other datasets and exploring transfer learning techniques might be among the most promising directions, with the goal of expanding the available knowledge in the context of Parkinson's disease.
Bibliography
[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels, June 2010.
[2] Moran Artzi, Einat Even-Sapir, Hedva Shacham, Avner Thaler, Avi Orr-Urtreger, Susan Bressman, Karen Marder, Talma Hendler, Nir Giladi, Dafna Ben Bashat, and Anat Mirelman. DaT-SPECT assessment depicts dopamine depletion among asymptomatic G2019S LRRK2 mutation carriers. PLOS ONE, 12, 04 2017.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[4] Paul Bertin, Mohammad Hashir, Martin Weiß, Geneviève Boucher, Vincent Frappier, and Joseph Paul Cohen. Analysis of gene interaction graphs for biasing machine learning models. ArXiv, abs/1905.02295, 2019.
[5] Lucijano Berus, Simon Klancnik, Miran Brezocnik, and Mirko Ficko. Classifying Parkinson's disease based on acoustic measures using artificial neural networks. Sensors, 19:16, 12 2018.
[6] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34:18–42, 2017.
[7] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.
[8] Catalina Cangea, Petar Velickovic, Nikola Jovanovic, Thomas Kipf, and Pietro Liò. Towards sparse hierarchical graph classifiers. ArXiv, abs/1811.01287, 2018.
[9] Hongyoon Choi, Seunggyun Ha, Hyung-Jun Im, Sun Paek, and Dong Lee. Refining diagnosis of Parkinson's disease with deep learning-based interpretation of dopamine transporter imaging. NeuroImage: Clinical, 16, 09 2017.
[10] Yu-Hsuan Chuang, Kimberly Paul, Jeff Bronstein, Yvette Bordelon, Steve Horvath, and Beate Ritz. Parkinson's disease is associated with DNA methylation levels in human blood and saliva. Genome Medicine, 9, 12 2017.
[11] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
[12] Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko, and Yoshua Bengio. Towards gene expression convolutions using gene interaction graphs. ArXiv, abs/1806.06975, 2018.
[13] Damian Szklarczyk et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res., 47(D1):D607–D613, 2019.
[14] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. ArXiv, abs/1903.02428, 2019.
[15] Adriana Galvan, Annaelle Devergnas, and Thomas Wichmann. Alterations in neuronal activity in basal ganglia-thalamocortical circuits in the parkinsonian state. Frontiers in Neuroanatomy, 9:5, 2015.
[16] Hongyang Gao and Shuiwang Ji. Graph u-nets. In ICML, 2019.
[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[18] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734, July 2005.
[19] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
[20] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. ArXiv, abs/0912.3848, 2008.
[21] Steve Horvath and Ken Raj. DNA methylation-based biomarkers and the epigenetic clock theory of ageing. Nature Reviews Genetics, 19, 04 2018.
[22] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr. Learn to pay attention. ArXiv, abs/1804.02391, 2018.
[23] Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ArXiv, abs/1609.02907, 2016.
[24] Ivan S. Klyuzhin, Nikolay Shenkov, Arman Rahmim, and Vesna Sossi. Use of deep convolutional neural networks to predict Parkinson's disease progression from DaTscan SPECT images. 2018.
[25] Boris Knyazev, Xiao Lin, Mohammed Abdel Rahman Amer, and Graham W. Taylor. Image classification with hierarchical multigraph networks. ArXiv, abs/1907.09000, 2019.
[26] Boris Knyazev, Graham W. Taylor, and Mohammed Abdel Rahman Amer. Understanding attention and generalization in graph neural networks. In NeurIPS, 2019.
[27] Antonina Kouli, Kelli M. Torsney, and Wei-Li Kuan. Parkinson's disease: Etiology, neuropathology, and pathogenesis. In Parkinson's Disease: Pathogenesis and Clinical Aspects. 2018.
[28] John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '18, pages 1666–1674, New York, NY, USA, 2018. Association for Computing Machinery.
[29] John Boaz Lee, Ryan A. Rossi, Sungchul Kim, Nesreen Ahmed, and Eunyee Koh. Attention models in graphs: A survey. ArXiv, abs/1807.07984, 2018.