4 – Machine Learning: Main Supervised Characteristics and Algorithms

are other types of regression algorithms that can be used in particular cases when the most common ones, mentioned here, are not suitable for the problem to be solved. This section lists the most important.

### 4.3.1 Linear Regression

Linear regression is the first and most basic machine learning method for regression problems. As already mentioned above, its hypothesis function is:

h_{ϑ}(x) = ϑ_{0}+ ϑ_{1}∗ x_{1}+ ϑ_{2}∗ x_{2}+ · · · + ϑ_{n}∗ x_{n}

As you can see, the prediction formula is linear in both the coefficients and the inputs. The coefficients ϑ have to be estimated, and the features x are the input data taken from the database. The cost function to minimize in order to find the ϑ with the smallest error is:

J(ϑ) = \frac{1}{2m} \sum_{i=1}^{m} (h_{ϑ}(x^{(i)}) − y^{(i)})^{2}

The cost function measures the squared error between the predictions made by the hypothesis function with given parameters and the desired outputs. Minimizing it means that the parameters must be changed several times before reaching a satisfactory result. The parameters are therefore selected iteratively (gradient descent) according to the following formula, applied to every ϑ_{j} at each step:

ϑ_{j} = ϑ_{j} − \frac{α}{m} \sum_{i=1}^{m} [(h_{ϑ}(x^{(i)}) − y^{(i)}) x^{(i)}_{j}]

This formula introduces a new parameter, the so-called learning rate α. Tuning this parameter changes the speed of convergence and its accuracy. A large value of α may cause oscillations during convergence or even divergence, while too small a value gives great precision but very slow convergence, so the number of iterations required to reach the minimum of the cost function will be huge. Choosing the learning rate well is therefore essential for obtaining good parameters.
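The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy data (generated from y = 1 + 2x) and the chosen values of α are assumptions made only for this example.

```python
def predict(theta, x):
    """Hypothesis h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def cost(theta, X, y):
    """Cost J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Apply the update rule simultaneously to every theta_j."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # The intercept theta_0 uses the constant feature x_0 = 1.
        grad = [sum(errors) / m]
        for j in range(n):
            grad.append(sum(e * xi[j] for e, xi in zip(errors, X)) / m)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2*x (assumed for illustration only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = gradient_descent(X, y)  # converges close to [1, 2]

# Hypothetical "too large" learning rate for this data: the cost grows
# instead of shrinking, which is the divergence described in the text.
theta_diverged = gradient_descent(X, y, alpha=0.6, iterations=50)
```

On this toy data, the default α reaches the generating coefficients, while the larger α ends with a higher cost than the all-zero starting point. The threshold between the two behaviours depends on the data, so these values should not be read as general recommendations.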

Figure 4.14 shows a linear regression fit to data in a two-dimensional representation. As the number of features increases, the straight line becomes a hyperplane in a higher-dimensional space, and the solution can no longer be plotted.

4.3 – Regression Learning Algorithms

Figure 4.14: A typical example of linear regression fitting in 2D.

### 4.3.2 Polynomial Regression

Polynomial regression has the same cost function as linear regression, from which it derives. The objective is still to find the coefficients that minimize the error between prediction and true output.

What changes is the form of the hypothesis function. The input features are now combined into polynomial terms of degree greater than one, while the hypothesis remains linear in the coefficients. In fact, it would be more accurate to describe this kind of algorithm as linear regression with polynomial features.

Polynomial features of any degree can be used; the higher the degree, the larger the total number of features. For example, think of a machine learning problem that involves only 2 features. The hypothesis function of a normal linear regression is:

h_{ϑ}(x) = ϑ_{0}+ ϑ_{1}∗ x_{1}+ ϑ_{2}∗ x_{2}

while for polynomial regression of degree 2 we have:

h_{ϑ}(x) = ϑ_{0}+ ϑ_{1}∗ x_{1}+ ϑ_{2}∗ x_{2}+ ϑ_{3}∗ x^{2}_{1}+ ϑ_{4}∗ x^{2}_{2}+ ϑ_{5}∗ x_{1}∗ x_{2}

and for degree 3 it becomes:

h_{ϑ}(x) = ϑ_{0}+ ϑ_{1}∗ x_{1}+ ϑ_{2}∗ x_{2}+ ϑ_{3}∗ x^{2}_{1}+ ϑ_{4}∗ x^{2}_{2}+ ϑ_{5}∗ x_{1}∗ x_{2}+ ϑ_{6}∗ x^{3}_{1}+ ϑ_{7}∗ x^{3}_{2}+ ϑ_{8}∗ x^{2}_{1}∗ x_{2}+ ϑ_{9}∗ x_{1}∗ x^{2}_{2}

As you can see, the complexity of the hypothesis function grows quickly as the degree of the polynomial rises. Consequently, if the selected degree is high, the fit to the training data becomes more and more precise, with a great risk of overfitting. Moreover, the number of features grows rapidly: with n input features and degree d the hypothesis has \binom{n+d}{d} coefficients (6 and 10 in the two cases above), so computational cost is a determining factor in polynomial regression.
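As a minimal sketch of how the polynomial terms can be generated, the following plain Python function enumerates every monomial of the inputs up to a given degree. The function name and the sample input are assumptions for illustration. The terms come out grouped by degree, so their order differs slightly from the formulas above (x_{1}∗x_{2} appears before x^{2}_{2}). The count of terms is compared against the binomial coefficient \binom{n+d}{d}.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_features(x, degree):
    """Return all monomials of the inputs up to `degree`, constant term first."""
    features = []
    for d in range(degree + 1):
        # Each combination of indices (with repetition) is one monomial,
        # e.g. (0, 0) -> x1^2 and (0, 1) -> x1*x2.
        for combo in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for idx in combo:
                term *= x[idx]
            features.append(term)
    return features


x = [2.0, 3.0]                     # hypothetical sample with 2 features
phi2 = polynomial_features(x, 2)   # 1, x1, x2, x1^2, x1*x2, x2^2
phi3 = polynomial_features(x, 3)

# Number of coefficients for n features and degree d: C(n + d, d).
n_terms_degree_2 = comb(2 + 2, 2)
n_terms_degree_3 = comb(2 + 3, 3)
```

The expanded feature vector can then be passed to an ordinary linear regression, which is why the method stays linear in the coefficients.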

If you have only one feature, you can also represent this hypothesis function in a 2D plane, as shown in figure 4.15, where the fit is no longer a straight line but a curve that approximates the real values.

Figure 4.15: Polynomial regression, the hypothesis function is now a curve.

### 4.3.3 Ridge and Lasso Regression

These regression algorithms are based on linear or polynomial regression with the addition of a regularization term in the cost function. This device is designed to avoid overfitting: the term reduces the importance of certain coefficients in the hypothesis function, so that fitting the data does not lead to that problem.

In detail, the Ridge regression has the following cost function:

J(ϑ) = \frac{1}{2m} \sum_{i=1}^{m} (h_{ϑ}(x^{(i)}) − y^{(i)})^{2} + λ \sum_{j=1}^{n} ϑ^{2}_{j}

λ (lambda) is the tuning parameter, which multiplied by the sum of the squared coefficients (excluding the intercept) defines the penalty term. Clearly, λ = 0 means having no penalty in the model, so we would obtain the same estimates as with least squares. Conversely, a very large λ means a high penalty effect, which will bring many coefficients close to zero but will not exclude them from the model. Ridge regression shrinks coefficients toward zero but never sets them exactly to zero, so no predictor is removed. From the point of view of the accuracy of the estimate this may not be a problem. The limitation instead affects the interpretability of the coefficients when the number of predictors is high.
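The effect of λ can be illustrated with a small sketch that adds the ridge penalty to gradient descent. It assumes the penalty is λ times the sum of the squared coefficients with the intercept left unpenalized, as described above, so each penalized coefficient gets an extra gradient term 2λϑ_{j}. The function name, toy data and step size are assumptions for illustration only.

```python
def ridge_gradient_descent(X, y, lam, alpha=0.02, iterations=10000):
    """Gradient descent on the squared-error cost plus lam * sum(theta_j^2)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [theta[0] + sum(t * v for t, v in zip(theta[1:], xi)) - yi
                  for xi, yi in zip(X, y)]
        grad = [sum(errors) / m]  # intercept: no penalty term
        for j in range(n):
            data_term = sum(e * xi[j] for e, xi in zip(errors, X)) / m
            # Derivative of lam * theta_j^2 is 2 * lam * theta_j.
            grad.append(data_term + 2 * lam * theta[j + 1])
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2*x (assumed for illustration only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta_ls = ridge_gradient_descent(X, y, lam=0.0)       # least-squares fit
theta_ridge = ridge_gradient_descent(X, y, lam=1.0)    # slope shrunk
theta_strong = ridge_gradient_descent(X, y, lam=10.0)  # shrunk further
```

On this data the slope shrinks as λ grows but stays strictly positive, which matches the behaviour described above: coefficients approach zero without being excluded. The step size α was chosen small enough for these values of λ; a larger penalty would need a smaller step.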

The lasso method (least absolute shrinkage and selection operator) overcomes this disadvantage of ridge regression. It allows coefficients to be excluded from the model by setting them exactly to zero. The lasso formula is very similar to that of ridge regression; the only difference is the structure of the penalty, which sums the absolute values of the coefficients:

J(ϑ) = \frac{1}{2m} \sum_{i=1}^{m} (h_{ϑ}(x^{(i)}) − y^{(i)})^{2} + λ \sum_{j=1}^{n} |ϑ_{j}|

As in ridge regression, the lasso method forces the estimated coefficients toward zero, but the absolute-value penalty forces some of them to be exactly equal to zero.
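The following sketch shows how the absolute-value penalty produces exact zeros. The text does not prescribe a solver, so this example uses coordinate descent with soft-thresholding, a common choice because the absolute value is not differentiable at zero. The function names and the toy data (a strong dependence on the first feature and a weak one on the second) are assumptions for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize 1/(2m) * squared error + lam * sum(|theta_j|), one theta_j at a time."""
    m, n = len(X), len(X[0])
    theta0, theta = 0.0, [0.0] * n
    for _ in range(sweeps):
        # Unpenalized intercept: mean of the residuals computed without it.
        theta0 = sum(yi - sum(t * v for t, v in zip(theta, xi))
                     for xi, yi in zip(X, y)) / m
        for j in range(n):
            rho, z = 0.0, 0.0
            for xi, yi in zip(X, y):
                # Prediction that leaves feature j out.
                pred_without_j = theta0 + sum(
                    t * v for k, (t, v) in enumerate(zip(theta, xi)) if k != j)
                rho += xi[j] * (yi - pred_without_j)
                z += xi[j] ** 2
            theta[j] = soft_threshold(rho / m, lam) / (z / m)
    return theta0, theta


# Toy data: y = 1 + 2*x1 + 0.1*x2 (assumed for illustration only).
X = [[0.0, 1.0], [1.0, -1.0], [2.0, -1.0], [3.0, 1.0]]
y = [1.1, 2.9, 4.9, 7.1]

ls_intercept, ls_theta = lasso_coordinate_descent(X, y, lam=0.0)
lasso_intercept, lasso_theta = lasso_coordinate_descent(X, y, lam=0.5)
```

With λ = 0 both coefficients are recovered; with λ = 0.5 the weak second coefficient is set exactly to zero while the first is only shrunk, so the second feature is effectively removed from the model.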

In general, ridge regression can be expected to do a better job when many predictors contribute to the output, and the lasso when only a small number of predictors do.

### 4.3.4 Support Vector Regression

The support vector machine (SVM) method, already seen in the classification algorithms part, can also be used for regression while keeping its characteristic maximum-margin principle intact. When used for regression it is called Support Vector Regression (SVR), with some differences compared to SVM: since the output is a real number with infinitely many possible values, it cannot be predicted exactly, so a tolerance range is used within which the solution is considered ideal. The main idea, however, remains the same: minimize the error by finding the hyperplane that maximizes the margin, taking into account a certain tolerance in the error.

The kernel function also plays the same role as in SVM: it transforms the data into a higher-dimensional feature space where a linear fit becomes possible.
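The tolerance range described above is usually expressed through an ε-insensitive loss: errors smaller than ε cost nothing, and larger errors are penalized only by the amount that exceeds ε. The symbol ε, the function name and the sample values below are assumptions taken from the standard SVR formulation, not from this text. This is a sketch of the loss only, not of a full SVR solver.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the tolerance range, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical targets and predictions with a tolerance of 0.1.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
```

The first prediction falls inside the tolerance range and contributes no loss, while the other two are penalized by their distance from the edge of the range (about 0.4 and 0.9).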


Figure 4.16: Support vector regression with the usage of tolerance range for finding the maximum margin.