2.2 Regression algorithms review
2.2.2 Gaussian Process Regression
Figure 18: Geometry of Elastic-Net, Ridge and Lasso Image taken from [4]
predictive distribution, which will estimate the prediction at the new points of interest x∗.
p(f∗|x∗, y, X) =
Z
ω
p(f∗|x∗, ω)p(ω|y, X)dω (2.8) The first step of the GPR is to define the prior mean m(x) and the covariance function k(x, x0):
m(x) = E[f (x)] (2.9)
k(x, x0) = E[(f (x) − m(x))(f (x0) − m(x0))] (2.10) where E represents the expectation. These functions allow to identify a Gaussian Process as:
f (x) ∼ GP (m(x), k(x, x0)) (2.11) The dataset is divided into test and training subsets where X is the observation matrix and y are the variables.
Assuming now that y is the set of training outputs and f∗ a set of test outputs, the prior is defined as:
"
y f∗
#
∼ N
"
µ µ∗
#
,
"
K(X, X) + σ2nI K(X, X∗) K(X∗, X) K(X∗, X∗)
#!
(2.12) Where K(X, X) is a matrix whose i, jth element is equal to k(xi, xj), K(X, X∗) is a column vector whose ith element is equal to k(xi; x∗).
The covariance function can also be called kernel that is used to compute the GPR’s covariance between pairs of datapoints. The implemented kernel functions can be controlled through some hyperparameters in order to better adapt to different functions.
One of the most used kernels for GPR is Radial-basis function (RBF) kernel which is a stationary kernel. It is also known as the "square exponential" kernel. It is parameterized by a length scale parameter l, which can be a scalar or vector with the same number of dimensions as the inputs.
This was chosen given the low complexity of the dataset used and can be represented mathematically with the formula:
ky(xp, xq) = σ2fexp − 1
2l2(xp− xq)2
!
+ σ2nδpq (2.13)
Results and critical issues
This chapter proposes the results following an in-depth study on the analyzed data. In the first section some variables will be analyzed in relation to the onset of the COVID or to the antibody values developed by volunteers. This operation can help to get a better idea of how certain variables affect the development of these antibodies.
Finally the results of the supervised learning algorithms described in chapter 2 are reported.
To carry out all these operations, Python was used, a high-level programming language that allows, by importing numerous libraries, to work better on the data available.
Table 4 shows the list of libraries used:
Python library Use
Pandas Data analysis and manipulation Datetime Dates and times manipulation
Numpy Numerical computing
Seaborn Data visualization Matplotlib Data visualization
Sklearn Predictive data analysis Table 4: Python libraries imported
The algorithms used were those proposed by the open source scikit-learn plat-form [23]. Before applying the chosen models, the datasets have been cleaned up, deleting all invalid entries such as missing or invalid data and shuffled.
Subsequently, the datasets were divided into training for 70% and test sets for 30%.After this procedure the data were preprocessed, categorical variables were
transformed to 0 or 1 and Numerical variables were normalized using "Min-Max normalization" [24]. It is to be preferred over standardization in cases where the distribution of the data is not known or when it is known that the distribution is not Gaussian as in this case.
The formula is as follows:
z = x − min(x)
max(x) − min(x) (3.1)
This was necessary to have all the data on a scale from 0 to 1.
Results are presented in the same way for both the first and the second dataset.
For each algorithm a scatterplot is proposed which represents the comparison between predicted and real data. The same graph also shows two lines, one dashed which represents the line of perfect prediction and the second which is the regression line based on the points shown. In addition, for each model there is a bar graph showing the weight ”ω” associated with each variable in the regression equation, representing the importance of each of them.
Finally, to univocally compare the performance of each algorithm, the following evaluation parameters will be used: Root Mean Square Error (RMSE), R-squared and variance, defined as follows.
• root mean square error (RMSE): As the name implies is the square root of the mean of the square of all errors. It is a great metric for measuring the accuracy of a model. It is scale dependent, which means that it measures the prediction error for a particular variable and not between variables.
The formula for calculating RMSE is:
RM SE =
s PN
i=1( ¯yi− yi)2
N (3.2)
where:
¯
yi = Predicted value for the ith observation;
yi = Observed value for the ith observation;
N = Total number of observations.
• R-squared (R2): It is a statistical measure that represents how well a regression model can adapt. The maximum value is 1, the closer the R-square value is to 1, the better the model fits.
R2 is calculated with the formula:
R2 = 1 − SSres
SStot (3.3)
where:
SSres = Mean square error;
SS = Total sum of squares.
• Variance (σ2): It is defined as the sum of the squared distances of each term in the distribution from the mean µ, divided by the number of terms in the distribution N [25]. These measures are useful for making comparisons between data sets that go beyond simple visual impressions.
The variance is calculated with the formula:
σ2 =
PN
i=1( ¯yi− µ)2
N (3.4)
where:
¯
yi = Predicted value for the ith observation;
µ = Mean of predicted values.
µ = 1 N
N
X
i=1
¯
yi (3.5)
3.1 Data analysis
In the period between May and December 2020, 1273 out of a total of 1340 volunteers carried out at least one nasopharyngeal swab for SARS-CoV-2, of these only 216 were positive.
An interesting fact is that 177 people experienced COVID-like symptoms, but only 118 were truly positive, i.e. 67%, looking at the results of the swabs.
Figure 19 shows the different percentages based on the age group.
Figure 19: Histogram of subjects with COVID who got a positive swab test In addition to declaring the hospital ward they work in, each of the participants also had to report what kind of care they have been taking part in since May 2020. There were 4 choices available, with the possibility of inserting multiple answers:
1. High intensity COVID departments 2. Low intensity COVID departments 3. Services for COVID patients
4. ’Non-COVID’ departments and services
In the first sub-figure of figure 20 the positivity of the first 3 types of services have been considered, in the second sub-figure only personnel who have never worked in COVID departments are taken into consideration.
It is evident that those who have worked in environments with infected patients have contracted the virus more easily.
Figure 20: Percentage of positives in COVID and non-COVID departments