
4.4 Supervised Algorithm Choice

4.4.1 Choose feasible algorithms


Figure 4.16: Support vector regression, using a tolerance range to find the maximum margin.


Linearity of the database

Some machine learning algorithms rely on a linearity assumption, meaning that the classes or the features of the database are assumed to be separable by a linear function, such as a straight line or its equivalent hyperplane in a higher-dimensional space. This assumption is difficult to verify and often does not hold in real cases: the features are frequently dependent and not linearly separable, so linear algorithms obtain poor results. More realistically, the data require polynomial interpolations or very complex decision boundaries. However, a high-degree polynomial interpolation is computationally very expensive and may therefore make a probabilistic method preferable to an exact one because of the high complexity required.

Linear algorithms are nevertheless an excellent starting point to understand how the dataset is structured and to direct the choice towards more complex models, avoiding the search for the best model with a brute-force technique.
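As an illustration, the following sketch (assuming scikit-learn; the synthetic two-class dataset and the degree of the expansion are illustrative choices) compares a purely linear classifier with the same classifier applied after a polynomial feature expansion: on data that are not linearly separable, only the second one reaches a good accuracy, at a higher computational cost.

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic classes that are not linearly separable (two concentric circles).
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# Purely linear decision boundary: expected to perform poorly here.
linear_model = make_pipeline(StandardScaler(), LogisticRegression())

# The same linear model on degree-2 polynomial features: the boundary becomes
# non-linear in the original space, at a higher computational cost.
poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LogisticRegression())

print("linear accuracy:", cross_val_score(linear_model, X, y, cv=5).mean())
print("polynomial accuracy:", cross_val_score(poly_model, X, y, cv=5).mean())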

Missing values in the dataset

A very common feature of databases is the presence of missing values among the data. These are due to different phenomena, such as the diversification of the examples in the dataset, data sampling errors, errors during the pre-processing phase, and so on. It is very important to take missing values into account because some algorithms work badly, or do not work at all, in their presence. This problem can be addressed in two ways: extending the pre-processing phase by deleting examples or modifying the data, or choosing an algorithm that is less sensitive to missing values.

In both cases the solution is not optimal, because time is lost and the results are generally worse than those obtained with a homogeneous dataset without missing values. There are many techniques for treating these values, and they will be discussed when needed.
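The sketch below illustrates the two options (assuming scikit-learn; the synthetic dataset and the fraction of missing entries are illustrative assumptions): imputing the missing values in an extended pre-processing step, or choosing an estimator that tolerates missing values natively.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data in which about 10% of the entries are replaced by NaN.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Option 1: extend pre-processing, replacing each missing value with the column mean.
imputing_model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
imputing_model.fit(X, y)

# Option 2: choose an algorithm that handles missing values internally
# (on old scikit-learn versions this estimator needs an extra experimental import).
tolerant_model = HistGradientBoostingClassifier().fit(X, y)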

Data repeatability

Any algorithm, in order to work well, requires that the outputs to be predicted depend on features very similar to those with which the algorithm was trained. However, some models are more sensitive to changes in the input data to be predicted: they adapt well to the training set but behave poorly on the test set and on future predictions. This is due to the more or less significant change of the input data over time. In these cases it is therefore better to select an algorithm with worse performance on the training set but with greater versatility and repeatability of accuracy, which increases the reliability of the output.

Another solution, if the same high-performance algorithm is to be kept, is to increase the training time and the amount of training data so as to cover all possible cases. Finally, the best performance is obtained by making the algorithm self-learning over time, i.e. using the prediction data as subsequent training data, increasing the total dataset and adding the information carried by the new examples. With this care the algorithm adapts to the changes that can occur over time and the output remains reliable. The drawback is that this continuous training requires high computational costs to keep the model up to date, which is often not acceptable during a normal application of the model.
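A minimal sketch of this self-learning scheme, assuming scikit-learn and synthetic data (both illustrative assumptions), could look as follows: new labelled examples are appended to the dataset and the model is re-trained on the enlarged set, which keeps it up to date at the price of repeating the full training cost at every update.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# During operation new examples arrive; once their true labels are known,
# they are appended to the dataset and the model is trained again on the
# enlarged set, so the total training cost grows at every update.
for _ in range(5):
    X_new = rng.normal(size=(20, 4))
    y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
    X_train = np.vstack([X_train, X_new])
    y_train = np.concatenate([y_train, y_new])
    model = LogisticRegression().fit(X_train, y_train)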

Number of examples and features

The size of the database is relevant not only for the training time but also for the intrinsic ability of an algorithm to properly manage a large amount of information. For some algorithms it becomes difficult to extract information about which features are relevant, causing overfitting problems that are difficult to identify and solve. In the same way, there are opposite problems of underfitting if the size of the database is not adequate. In either case, it is necessary to act on the features, using regularization or feature-extension techniques.

Some algorithms are better than others at extracting this information without modifying the database, which can reduce the time to dedicate to the pre-processing phase.
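The following sketch (assuming scikit-learn; the dataset sizes and the regularization strength are illustrative assumptions) shows the two remedies mentioned above: regularization when the features are many with respect to the examples, and feature extension when they are too few to describe the problem.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Many features and few examples: plain least squares overfits, Ridge regularizes.
X_wide = rng.normal(size=(30, 100))
y_wide = X_wide[:, 0] + 0.1 * rng.normal(size=30)
ridge = Ridge(alpha=1.0).fit(X_wide, y_wide)

# Too few features to describe the target: extending them with polynomial terms
# gives the model enough capacity to avoid underfitting.
X_small = rng.uniform(-1, 1, size=(200, 1))
y_small = X_small[:, 0] ** 2 + 0.05 * rng.normal(size=200)
extended = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_small, y_small)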

Number of algorithm parameters

The parameters of the algorithms are those coefficients that characterize the setup of the model. An algorithm with many parameters is more difficult to optimize than one with few, and requires a lot of experience to be used. In general, an algorithm with several parameters is more flexible and adapts well to many cases, but it is difficult to find the right combination of these parameters to achieve the best possible performance.
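As an illustration of the cost of tuning many parameters, the sketch below (assuming scikit-learn; the grid values are illustrative assumptions) exhaustively evaluates every combination of a small grid: each added parameter multiplies the number of models to be trained and validated.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 x 3 x 2 = 18 parameter combinations, each evaluated with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "kernel": ["rbf", "poly"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)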

Class imbalance

For classification, an unbalanced dataset guides the user towards the algorithms that will perform best. With a large imbalance between the classes, more robust algorithms specifically designed for this situation tend to be used. Algorithms based on decision trees are very powerful in these conditions.
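A possible sketch of this situation, assuming scikit-learn and a synthetic dataset with a 95/5 class ratio (an illustrative assumption), uses a tree-based ensemble with class weighting so that the minority class is not ignored.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic problem where 95% of the examples belong to class 0 and 5% to class 1.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" increases the cost of errors on the minority class,
# so the trees do not simply predict the majority class everywhere.
model = RandomForestClassifier(class_weight="balanced", random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean())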