
5.6.3 Implementation of some algorithms on the real dataset

To put into practice what has been explained in the previous pages, both manual and automatic feature selection methods have been implemented. The first to be implemented is the manual feature selection process, which relies on the correlational and statistical analyses explained in the previous paragraphs. We will then show some results of the automatic feature selection process, which is not customizable and whose output is somewhat cryptic and not always well performing.

It should be added that in our case the purpose of feature selection is not so much to decrease the number of features, and therefore the dimensionality of the dataset, because the number of variables is relatively small. Rather, it aims to increase the quality of the data fed to the machine learning algorithm: one should not expect a big increase in predictive performance so much as an improvement in training time and in the interpretability of the algorithm.

Manual feature selection

If only one customer is selected through the corresponding feature, the various ProcessorID columns, derived from the encoding of the categorical features, can be eliminated because they are constant for a given customer, so the number of columns shrinks a little. If, on the other hand, a single table is considered for all customers, there are three more features, one identifying each customer. In both cases the manual feature selection process consists of four steps (a code sketch covering all of them follows the list):

1. Missing values: with Pandas it is easy to see how many missing values there are in the dataset: just run df.isnull().sum() and the list of features with their number of null values will be printed on screen. In our case the null counts are always zero: the tables used contain no null values.


2. Pairwise correlation: as seen in paragraph 5.4, you can get correlation values between features with the .corr() method. By analyzing the coefficients obtained and sorting them in descending order of correlation, it is easy to identify the redundant features. The code written shows the correlations between features in decreasing order; a value of 1 indicates that two features are perfectly redundant. It was immediately decided to eliminate this redundancy by deleting the features Grind_MC, NGlassInSess, Spess1_sup and IsDiagonal_MC, which have perfect correlation with other features (as also shown in paragraph 5.4.2). Other pairs have correlation values close to 1, but since our dataset is unbalanced it is better to avoid hasty conclusions and to wait for the evaluation of the correlation with the target.

3. Correlation with the target: this analysis is also based on correlation, this time not between features but between the target (BadTerminated_MC) and the various features. Obviously the features weakly related to the target should be eliminated, as they give little information about the desired output. A minimum threshold therefore has to be chosen, fixed here at 0.005, under which features are eliminated. The threshold has been set very low because the dataset is unbalanced, so even a small correlation can still be important to identify when the target assumes the value 1. Another reason for this choice is that we have few features, so it is not necessary to remove much information, as the dataset remains small either way. The aim is therefore to delete only the completely useless features, keeping those that carry even a little information.

4. Amount of variation: to study the variance you first need to scale the data. After that, the Pandas method .describe() is applied to the normalized dataset; it computes statistics per feature and returns them in a small DataFrame object.

It is now possible to sort the object obtained in descending order of standard deviation. As you can see in figure 5.18, the decrease of the standard deviation as a function of the features is linear and reaches zero for only two features, which means they always take the same value. They are StepManual_C (manual steps are in fact no longer considered in this treatment, as described in paragraph 5.1) and the ProcessorID of a given customer, when only that customer's machine is considered in the table.
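To make the four steps concrete, the following is a minimal Pandas sketch of the whole manual pipeline. It is a sketch under assumptions, not the thesis code: the file name customer_table.csv is hypothetical and MinMaxScaler is assumed as the scaling method for step 4.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("customer_table.csv")  # hypothetical table from chapter 5
    target = "BadTerminated_MC"

    # Step 1 - missing values: count the nulls per feature (here always zero)
    print(df.isnull().sum())

    # Step 2 - pairwise correlation, sorted in decreasing order;
    # keep only the upper triangle so that each pair appears once
    corr = df.drop(columns=target).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    print(upper.stack().sort_values(ascending=False))
    # pairs with value 1 are perfectly redundant: one feature of each is dropped

    # Step 3 - correlation with the target: drop features below the 0.005 threshold
    target_corr = df.corr()[target].abs().drop(target)
    df = df.drop(columns=target_corr[target_corr < 0.005].index)

    # Step 4 - amount of variation: scale the data, then sort by standard deviation
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
    print(scaled.describe().loc["std"].sort_values(ascending=False))
    # features with zero standard deviation are constant and can be removed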


Figure 5.18: Plot of the normalized standard deviation decrease as a function of the related feature.

Automatic feature selection

Of the various algorithms presented in the theoretical paragraph, five have been chosen for implementation, namely:

1. Univariate Statistical Test (Chi-squared for classification)

2. Recursive Feature Elimination

3. Principal Component Analysis

4. Feature Importance

5. Select From Model

Their implementation is very simple, and the code for the univariate statistical test is given as an example (figure 5.19). For the others the code is similar: a few changes to the chosen feature selection algorithm are enough. For PCA it is not possible to identify the chosen features, because it creates new ones and the physical meaning of the numbers is lost, so the various columns no longer have an eloquent index.
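As a brief illustration of why the chosen features cannot be identified, the following hypothetical sketch uses scikit-learn's PCA on a scaled feature matrix X (the variable name is an assumption): the resulting columns are anonymous linear combinations of all the original features.

    import pandas as pd
    from sklearn.decomposition import PCA

    # Project the scaled feature matrix X onto 30 principal components
    pca = PCA(n_components=30)
    X_pca = pd.DataFrame(pca.fit_transform(X),
                         columns=[f"PC{i+1}" for i in range(30)])

    # The columns are now PC1 ... PC30: the original, physically
    # meaningful feature names are lost in the linear combinations
    print(X_pca.columns)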

For these automatic feature selection methods, the maximum number of features to be selected can be passed as a parameter; for the purpose of this thesis a value of 30 can be considered optimal, thanks also to the considerations made during the manual feature selection. The five automatic feature selection methods create five tables for each client, and their effectiveness will be tested in the chapter on the evaluation of the best algorithm.

Figure 5.19: Code implementing the univariate statistical test feature selection.
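Since the code of figure 5.19 is not reproduced in this text, here is a minimal sketch of what such a univariate test looks like with scikit-learn. SelectKBest with the chi2 score function is an assumption consistent with the chi-squared test named above, and k=30 matches the number of features discussed earlier; the actual code in the figure may differ.

    from sklearn.feature_selection import SelectKBest, chi2

    # X: scaled feature matrix (chi2 requires non-negative values), y: target
    X = df.drop(columns="BadTerminated_MC")
    y = df["BadTerminated_MC"]

    # Keep the 30 features with the highest chi-squared score
    selector = SelectKBest(score_func=chi2, k=30)
    X_new = selector.fit_transform(X, y)

    # Recover the names of the surviving columns
    print(X.columns[selector.get_support()])

    # The other methods only require swapping the selector, for example:
    #   RFE(estimator, n_features_to_select=30)
    #   SelectFromModel(estimator, max_features=30)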

Chapter 6

Error Prediction: Algorithm evaluation

In this chapter we continue with the error prediction: after having prepared the data during the preprocessing phase, we now evaluate the performance of the algorithms. A good result is desirable given the importance of the problem to be solved, so as to avoid abnormal stops of the machine or badly terminated pieces. For further details on the problem see chapter 1.2. The methods used for machine learning are described in the machine learning theory chapter, section 4.2. The algorithms used are both classic and probabilistic.

Starting from the tables created thanks to the procedure described in chapter 5 on the preprocessing phase, we will describe how to divide the examples for the training and test phases, how to concretely evaluate the algorithms with the evaluation metrics described in paragraph 4.1.5, and finally we will discuss the selection of the best solution, trying to understand, in light of the final results, why the dataset at our disposal behaves better with one solution rather than another.

6.1 Training and Test sets and