
4.3.2 Design choices in a GNN

When designing a Graph Neural Network, we are presented with a number of design choices. A GNN layer is composed of two operations: message transformation and aggregation. Different GNN architectures differ in how they define these two operations. Messages can be transformed before being aggregated, for example by multiplying them with a weight matrix. The aggregation operator must be permutation invariant. The way GNN layers are stacked is also a design option.

GNN layers can simply be chained together, or skip connections can be added.

When layers are chained, each layer l receives its input from the previous layer l−1. A skip connection allows information to be received from any earlier layer.

Finally, the computational graph defined in the previous section can be augmented in order to learn better embeddings. In the following, we look more specifically at the most common types of GNN layers and design options.
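To make this concrete, the following is a minimal sketch of a generic GNN layer, assuming PyTorch and a dense adjacency matrix (neither is prescribed by this chapter; all class and variable names are illustrative). The layer transforms each incoming message with a linear map and aggregates with a permutation-invariant sum; the small two-layer model shows plain chaining plus a skip connection.

```python
import torch
import torch.nn as nn

class SimpleMessagePassingLayer(nn.Module):
    """One generic GNN layer: transform neighbor messages, then aggregate them."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.transform = nn.Linear(in_dim, out_dim)  # message transformation

    def forward(self, h, adj):
        # h:   (num_nodes, in_dim)  node embeddings from the previous layer
        # adj: (num_nodes, num_nodes) binary adjacency matrix
        messages = self.transform(h)     # transform every node's embedding
        aggregated = adj @ messages      # sum over neighbors (permutation invariant)
        return torch.relu(aggregated)

class TwoLayerGNN(nn.Module):
    """Two chained layers plus a skip connection from the first layer's output."""
    def __init__(self, dim):
        super().__init__()
        self.layer1 = SimpleMessagePassingLayer(dim, dim)
        self.layer2 = SimpleMessagePassingLayer(dim, dim)

    def forward(self, h, adj):
        h1 = self.layer1(h, adj)
        h2 = self.layer2(h1, adj)
        return h2 + h1  # skip connection
```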

Graph Convolutional Networks (GCN) [14]: these were among the first proposed GNN layers. The layer is defined as:

$$ h_v^l = \sigma\left( \sum_{u \in N(v)} \frac{W^l \, h_u^{l-1}}{|N(v)|} \right) \qquad (4.6) $$

In this case the message is computed as $m_u^l = \frac{W^l h_u^{l-1}}{|N(v)|}$: at each layer l, the message $m_u^l$ computed by node u is the result of multiplying the embedding vector of the previous layer, $h_u^{l-1}$, by a weight matrix $W^l$, normalized by the size of the target node's neighborhood. The aggregation operation is the sum: the messages from all neighbors are summed.
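A minimal sketch of Eq. (4.6) under the same assumptions (PyTorch, dense adjacency, illustrative names), with ReLU standing in for the generic non-linearity σ:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """GCN layer as in Eq. (4.6): sum the transformed neighbor embeddings,
    normalized by the neighborhood size |N(v)|."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        # h: (num_nodes, in_dim), adj: (num_nodes, num_nodes) binary adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)| for each node v
        messages = self.W(h)                             # W^l h_u^{l-1}
        return torch.relu(adj @ messages / deg)          # sum over N(v), normalize, sigma
```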

GraphSAGE [15]: the GNN layer is defined as:

$$ h_v^l = \sigma\left( W^l \cdot \mathrm{CONCAT}\left( h_v^{l-1},\ \mathrm{AGG}\left( \left\{ h_u^{l-1},\ \forall u \in N(v) \right\} \right) \right) \right) \qquad (4.7) $$

In this case the aggregation function does not have a single definition. GraphSAGE allows different types of aggregation: mean (taking a weighted average of all neighbors), pool (transforming the neighbor vectors and then applying a mean or max function), or more advanced techniques that we will not cover (such as Long Short-Term Memory). Moreover, the message aggregation is done in two phases. First, only the neighbors of node v are considered; this is what happens inside the AGG function. Then the result is concatenated with the embedding of node v itself.
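A minimal sketch of Eq. (4.7) with mean aggregation, under the same assumptions as the previous sketches:

```python
import torch
import torch.nn as nn

class GraphSAGELayer(nn.Module):
    """GraphSAGE layer as in Eq. (4.7), using the mean aggregator."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W^l acts on the concatenation of the node's own embedding and the aggregate.
        self.W = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = adj @ h / deg                  # AGG = mean over neighbors
        concat = torch.cat([h, neigh_mean], dim=1)  # CONCAT(h_v, AGG(...))
        return torch.relu(self.W(concat))
```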

Graph Attention Networks (GAT) [16]: we would also like to mention this powerful kind of GNN layer. In both GCN and GraphSAGE, all the neighbors of the target node are equally important. The idea of GAT is to add learnable attention weights $a_{vu}$ which allow the network to understand which neighbors are more important.
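The following is a simplified, single-head attention layer in the spirit of GAT, not the exact formulation of [16]; the scoring function and all names are illustrative, and the point is only that the aggregation weights are learned rather than fixed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head attention layer: neighbors are weighted by learned scores."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)  # scores a pair (v, u)

    def forward(self, h, adj):
        z = self.W(h)                                   # (N, out_dim)
        n = z.size(0)
        # Raw scores e_vu for every ordered pair (v, u).
        z_v = z.unsqueeze(1).expand(n, n, -1)
        z_u = z.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([z_v, z_u], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float('-inf'))      # only real neighbors compete
        alpha = torch.softmax(e, dim=1)                 # attention weights a_vu
        alpha = torch.nan_to_num(alpha)                 # handle isolated nodes
        return torch.relu(alpha @ z)
```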

GNNs allow many classical deep learning modules to be inserted between their layers. We will cover the Batch Normalization and Dropout modules.

Batch Normalization [17]: When we train a machine learning model, we map some input data to an appropriate output. It is usually assumed that the input distribution remains the same for the whole duration of training. Since in neural networks each layer depends on the outputs coming from the preceding layer, this assumption may not hold. Batch Normalization helps stabilize the network by re-centering the node embeddings to zero mean and rescaling them to unit variance. The normalization is done in two steps: first we compute the mean and variance over the embeddings, then we normalize using the computed mean and variance. However, this normalization can reduce the expressive power of the neural network. To solve this problem, two learnable parameters are added to the formula so that the network is allowed to choose an arbitrary mean and variance.
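Since node embeddings are simply the rows of a matrix, a standard batch-normalization module can be placed between GNN layers. A minimal sketch, assuming PyTorch (torch.nn.BatchNorm1d already contains the two learnable parameters mentioned above, its weight and bias):

```python
import torch
import torch.nn as nn

embedding_dim = 64
bn = nn.BatchNorm1d(embedding_dim)    # learnable rescale (weight) and re-center (bias)

h = torch.randn(100, embedding_dim)   # 100 node embeddings from some GNN layer
h_normalized = bn(h)                  # zero mean / unit variance per dimension,
                                      # then rescaled by the learnable parameters
```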

Dropout and overfitting [18]: What every machine learning model does is to train and tune its parameters using available data (the training dataset) and then apply the same rules on unseen data, i.e., models are useful only if they are able to generalize. When a model becomes increasingly good at fitting the training dataset, its generalization power may decrease. This is because the training dataset often has a different distribution than the original data. The more data we have, the better we can approximate the real distribution; however, most practical cases have far less data than needed. If trained too much, the model will learn to fit the training data exactly. This usually causes bigger mistakes when evaluating the model on unseen data. A common example of overfitting can be seen when trying to fit a polynomial. Dropout is a technique used to reduce the effect of overfitting in neural networks. The main concept is to remove a certain percentage of neurons when computing the output of a neural network layer. This prevents certain neurons from becoming more important than others during training. The technique is divided into three phases:

• Assigning a dropout rate. This will control the percentage of neurons that will be removed.

• Remove random neurons until the desired number of neurons is reached.

• The output is computed using the remaining neurons only.

This way, all neurons are given a chance to contribute to the final output. In GNNs, dropout is applied in the message transformation phase.

[Figure: polynomial fits showing (a) underfitting, (b) a balanced fit, and (c) overfitting]
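A minimal sketch of a GNN layer with dropout applied in the message transformation phase, under the same assumptions as the earlier sketches (PyTorch, dense adjacency, illustrative names; the dropout rate of 0.5 is arbitrary):

```python
import torch
import torch.nn as nn

class GNNLayerWithDropout(nn.Module):
    """GCN-style layer where dropout is applied to the transformed messages."""
    def __init__(self, in_dim, out_dim, dropout_rate=0.5):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.dropout = nn.Dropout(p=dropout_rate)   # phase 1: choose the dropout rate

    def forward(self, h, adj):
        messages = self.dropout(self.W(h))          # phases 2-3: drop random units,
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # compute with the rest
        return torch.relu(adj @ messages / deg)
```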

Activation Functions: The most commonly used non-linearities for GNNs are:

• Rectified Linear Unit (ReLU): defined as $\mathrm{ReLU}(x_i) = \max(x_i, 0)$. It is the most commonly used.

• Sigmoid: $\sigma(x_i) = \frac{1}{1 + e^{-x_i}}$. Usually used when the range of our embeddings needs to be restricted.

• Parametric ReLU (PReLU): adds a trainable parameter $a_i$ to the ReLU: $\mathrm{PReLU}(x_i) = \max(x_i, 0) + a_i \min(x_i, 0)$. Usually, it performs better than the classical ReLU.
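A short usage sketch of the three activations, assuming PyTorch (the example input is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu = nn.ReLU()         # max(x_i, 0)
sigmoid = nn.Sigmoid()   # 1 / (1 + exp(-x_i)), squashes values into (0, 1)
prelu = nn.PReLU()       # max(x_i, 0) + a_i * min(x_i, 0), with a_i learnable

print(relu(x))      # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(sigmoid(x))   # values in (0, 1)
print(prelu(x))     # negative inputs scaled by the learnable parameter a_i
```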


[Figure: plots of the (a) ReLU, (b) Sigmoid, and (c) Parametric ReLU activation functions]

Graph Augmentation: As we said in the previous section, each node defines a computational graph, i.e., a graph that represents how message passing happens for that target node. We explained how these computational graphs can be built by simply unrolling a node's neighborhood. However, this is not the only way to do it, and it is unlikely that the vanilla computational graph is the optimal choice for computing embeddings. There are four problems with vanilla computational graphs. If the graph is too sparse, message passing will be inefficient. This is solved by adding virtual nodes or edges to allow more message passing in the graph. A possible way to do it is to connect 2-hop neighbors using virtual edges. This is especially useful when working with bipartite graphs.

It is also a good idea to add a virtual node that connects to all other nodes in the graph, so that any two nodes are at a distance of at most two. This makes the graph much less sparse and is beneficial for message passing. If the graph is too dense, it might be beneficial not to consider the whole node neighborhood. Usually, at each layer we sample a different portion of a node's neighborhood. This is done to reduce the computational cost of message passing when each node has a large neighborhood. Also, if a graph is too large there might be problems in fitting it on GPUs.

Node features are also important when computing messages, as the first message at layer 0 is composed of the node feature vector. If nodes lack features, it might be useful to add them. A common approach is to simply assign a constant value like "1" to each node. This is especially useful in an inductive setting (when we want to generalize to unseen graphs): all nodes are identical, but the GNN will still learn from the graph structure. For transductive settings, a possibility is to assign each node an ID and use it as a feature. This procedure is more expressive, as node-specific information is stored, but since a GNN cannot know an unseen node's ID, it does not generalize well. If node features are not present, it is much harder for a GNN to differentiate between nodes. As an example, the red nodes in Fig 4.9a and Fig 4.9b will generate the same computational graph (Fig 4.9c). To solve this problem, a possible augmented feature could be the cycle count: each node knows the length of the cycle it resides in. This way, the red node in Fig 4.9a will have a value of 3, as the cycle it resides in is of length 3, while the red node in Fig 4.9b will have a value of 4. Now the two nodes can be distinguished by a GNN since their features are different (the structure of the computational graph remains the same).


Figure 4.9: Red nodes in Fig 4.9a and Fig 4.9b will both generate the computational graph in Fig 4.9c.
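A minimal sketch of the feature and structure augmentations discussed above, assuming PyTorch tensors and a small toy graph (all values are illustrative): a constant feature, a one-hot node-ID feature, and a virtual node connected to every real node.

```python
import torch

# Toy graph with 4 featureless nodes.
num_nodes = 4
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float)

# Feature augmentation: a constant feature works in the inductive setting...
constant_feat = torch.ones(num_nodes, 1)

# ...while a one-hot node ID is transductive only (it cannot cover unseen nodes).
id_feat = torch.eye(num_nodes)

# Structure augmentation: a virtual node connected to every real node,
# so any two nodes are at most two hops apart.
adj_aug = torch.zeros(num_nodes + 1, num_nodes + 1)
adj_aug[:num_nodes, :num_nodes] = adj
adj_aug[-1, :num_nodes] = 1.0
adj_aug[:num_nodes, -1] = 1.0
```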

Prediction setting: Using GNNs we can model different kinds of problems. It is useful to reduce a problem to known patterns or prediction tasks. The prediction task can usually be expressed in one of the following ways:

• Node-level: predict a discrete (classification) or continuous (regression) value for nodes.

• Link-level: Classification or regression on node pairs or predicting new links.

• Graph-level: Classification or regression on the whole graph.

A GNN will produce embeddings at its last layer. For node-level tasks, we simply use the resulting embeddings to make predictions. For link-level tasks, we can generate new embeddings by concatenating the original embeddings in pairs. Alternatively, we can simply use the dot product between two embeddings. In this way we predict a binary value that usually represents the existence (or not) of an edge.
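A minimal sketch of the two link-level options just described, assuming PyTorch and randomly generated embeddings for illustration:

```python
import torch
import torch.nn as nn

# h: final-layer node embeddings produced by a GNN, shape (num_nodes, dim).
h = torch.randn(5, 16)

# Option 1: dot product of the two embeddings, squashed into (0, 1) and read as
# the probability that the edge (u, v) exists.
u, v = 0, 3
score_dot = torch.sigmoid(torch.dot(h[u], h[v]))

# Option 2: concatenate the pair and feed it to a small prediction head.
head = nn.Linear(2 * 16, 1)
score_concat = torch.sigmoid(head(torch.cat([h[u], h[v]])))
```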

For graph-level tasks, embeddings must be aggregated in other ways. Common techniques consist of taking the mean, the max, or the sum of all embeddings.

Of course, by naively using these techniques, much expressive power is lost. Let's suppose we have a graph G with node embeddings $z_G = [−1, −1, −2, 1, 3]$ and a graph H with node embeddings $z_H = [−10, −10, −20, 10, 30]$, and we use the sum operation to aggregate these embeddings. Both G and H will have a resulting embedding of 0, so our GNN will not be able to differentiate between these two very different graphs. This can be solved by hierarchically aggregating the embeddings: for each graph, we first aggregate the first two values and feed the result to an activation function; the same is done with the last three values, and the two partial results are aggregated once again.
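A minimal sketch of this example, assuming PyTorch; the hierarchical grouping (first two values, then the last three) and the choice of ReLU as the activation are illustrative:

```python
import torch

z_g = torch.tensor([-1.0, -1.0, -2.0, 1.0, 3.0])
z_h = torch.tensor([-10.0, -10.0, -20.0, 10.0, 30.0])

# Flat sum pooling cannot tell the two graphs apart: both sums are 0.
print(z_g.sum(), z_h.sum())          # tensor(0.) tensor(0.)

# Hierarchical aggregation: pool the first two values, then the last three,
# pass each partial result through a non-linearity, then aggregate again.
def hierarchical_pool(z):
    part1 = torch.relu(z[:2].sum())
    part2 = torch.relu(z[2:].sum())
    return torch.relu(part1 + part2)

print(hierarchical_pool(z_g), hierarchical_pool(z_h))   # 2. vs 20.: now they differ
```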

Setting up the dataset: Using a graph dataset is trickier than using an average dataset, e.g., an image dataset. In an image dataset, each image is an independent data point: the prediction on an image only depends on that specific image's characteristics. When we want to classify nodes, the setting is different. Each node is connected to other nodes by one or more edges (Fig. 4.10), so other nodes will affect the prediction on our target node. We can use our dataset in different ways depending on whether our task is transductive or inductive. In a transductive setting we have only one input graph, which is available in all three phases (training, validation, test) of a neural network training pipeline; however, the nodes of the graph are split into training, validation, and test groups. During training, the embeddings are computed using the entire graph, but only the training node labels are used. The same goes for validation: the evaluation is only done on the validation nodes. In the inductive setting the dataset does not consist of only one graph. The objective of the inductive setting is to generalize to unseen graphs, not to predict missing node labels. In this case the dataset is split into training, validation, and test sets just as we would do for an image dataset. Node-level and link-level tasks exist for both the inductive and the transductive setting; however, only the inductive setting is well defined for graph-level tasks.


Figure 4.10: We divided 7 data points into a training (red), validation (yellow), and test (light blue) set. Figure (a) shows an image dataset where all points are independent. Figure (b) shows a graph dataset, where nodes are connected with each other. Figure (c) shows a graph dataset split in an inductive setting: the original graph is split into three independent graphs.
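A minimal sketch of the transductive setting, assuming PyTorch: a single graph whose nodes are partitioned by boolean masks, with the loss computed on the training nodes only (labels and logits are random stand-ins for real data and real GNN output):

```python
import torch
import torch.nn.functional as F

num_nodes = 7
labels = torch.randint(0, 2, (num_nodes,))      # stand-in node labels

# Transductive split: one graph, but nodes are partitioned into train/val/test.
train_mask = torch.tensor([1, 1, 1, 0, 0, 0, 0], dtype=torch.bool)
val_mask   = torch.tensor([0, 0, 0, 1, 1, 0, 0], dtype=torch.bool)
test_mask  = torch.tensor([0, 0, 0, 0, 0, 1, 1], dtype=torch.bool)

# The GNN always sees the whole graph; only the loss is restricted to a split.
logits = torch.randn(num_nodes, 2)              # stand-in for GNN output
loss = F.cross_entropy(logits[train_mask], labels[train_mask])  # train nodes only
```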
