
Feature                Type
CutLength              int64
Grind_MC               int64
Cut_MC                 int64
TSup_MC                int64
TInf_MC                int64
Heating_MC             int64
Detach_MC              int64
IsDiagonal_MC          int64
BadTerminated_MC       int64
GlassNumberInSession   int64
StepNumberInGlass      int64
GrindEnabled           int64
StepType               object
Spess1_sup             float64
Spess2_inf             float64
SpessPVB               float64
Pmax_sup               float64
Pmax_inf               float64
parWheelBreakoutInf    float64
parBreakoutInfTrim     float64
parBreakoutSupTrim     float64
NGlassInOpt            float64
NGlassInSess           float64
GlassLength            float64
GlassHeight            float64
GlassTotCut(mm)        float64
PiecesInGlass          float64
WastePiecesInGlass     float64
TimeFromLastSession    float64
TimeFromLastGlass      float64
TimeFromLastType       float64

Table 5.2: Features after data merging, the first step of the preprocessing phase.

These categories cannot be provided as they are to the machine learning algorithm, because the algorithm would tend to look for mathematical correlations between the numbers 1, 2, 3 (e.g. 3 is triple 1) that have no meaning between the categories, which are only groups and could in fact be identified by three letters A, B, C without changing their meaning. This raises the problem of managing categorical features. The main techniques for converting them into a mathematical form, and for extracting information from them, are now analysed.

5.3.1 Common methods

There are two main classes of categorical data: nominal and ordinal. In nominal categorical data there is no concept of ordering between the categories, i.e. no category comes before another and they cannot be compared, because they are essentially disconnected from each other (e.g. weather conditions, musical genres, ...).

In ordinal categorical data, instead, there is a concept of ordering between the classes. An example is clothing sizes (S, M, L, XL): they are related to the size of the garment, although it is not clear whether linearly, polynomially or otherwise.

It is therefore necessary to transform these categories into numbers in order to be able to process the data. This transformation of categorical features is called encoding. We will now analyse the encoding methods most useful in machine learning applications, the so-called classic encodings. Remember that there are other types of feature encoding that will not be analysed here.1

1 The encoding techniques not presented here are the so-called contrast encoders and the bayesian encoders. A good Python library in which to find implementations of these encoding types is called category_encoders.

One-hot encoding

This is the most used encoder in machine learning because it is very reliable and performs well. It achieves a complete separation of the categories without loss of information, but it is not a good solution if the number of categories per feature is high. It works by transforming the m categories into m binary features, so that for each example of the dataset only one of these newly created m features has value one, the one related to the category of the selected entry; all the others have value zero. In this way there is a total separation of the categories, with a numerically comprehensible representation of them for the machine learning algorithm.
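As a minimal sketch, one-hot encoding can be reproduced in Python with pandas (the weather column and its values are purely illustrative):

import pandas as pd

# Toy nominal feature with m = 3 categories.
df = pd.DataFrame({"weather": ["sun", "rain", "snow", "sun"]})

# One-hot encoding: the 3 categories become 3 binary columns,
# exactly one of which is 1 for each row.
one_hot = pd.get_dummies(df, columns=["weather"], dtype=int)
print(one_hot)
#    weather_rain  weather_snow  weather_sun
# 0             0             0            1
# 1             1             0            0
# 2             0             1            0
# 3             0             0            1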

Dummy coding scheme

Very similar to the previous one, with the difference that the m categories are transformed into m − 1 features, where the m-th category is represented by all the m − 1 features having value zero; all the other categories are instead identified by a 1 in the corresponding feature (as for one-hot encoding).
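A sketch of the same idea with pandas, where drop_first=True discards the column of one category (the toy data is again illustrative):

import pandas as pd

df = pd.DataFrame({"weather": ["sun", "rain", "snow", "sun"]})

# Dummy coding: m = 3 categories become m - 1 = 2 binary columns;
# the dropped category ("rain") is encoded as all zeros.
dummy = pd.get_dummies(df, columns=["weather"], drop_first=True, dtype=int)
print(dummy)
#    weather_snow  weather_sun
# 0             0            1
# 1             0            0
# 2             1            0
# 3             0            1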

Effect coding scheme

Practically equal to the dummy coding scheme, with the only difference that the category previously indicated by zeros in all the m − 1 features is now represented by the value −1 in all the m − 1 features.
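Effect coding can be obtained from dummy coding by replacing the all-zero rows of the reference category with −1; a minimal sketch continuing the previous toy example:

import pandas as pd

df = pd.DataFrame({"weather": ["sun", "rain", "snow", "sun"]})
dummy = pd.get_dummies(df, columns=["weather"], drop_first=True, dtype=int)

# Effect coding: the reference category, encoded by dummy coding
# as all zeros, becomes all -1 instead.
effect = dummy.copy()
effect[(effect == 0).all(axis=1)] = -1
print(effect)
#    weather_snow  weather_sun
# 0             0            1
# 1            -1           -1
# 2             1            0
# 3             0            1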

Bin-counting scheme

The previous encoding methodologies work very well when the number of categories is small, but become problematic when the number of categories rises, because too many new binary features are created. The problems that can emerge in this case concern storage, computational training time and the dimensionality of the dataset. In fact, if the number of examples becomes comparable to the number of features, the machine learning model runs into overfitting problems (in jargon this is known as the curse of dimensionality).

If there are many categories, the bin-counting scheme represents a good encoding to avoid the problems mentioned above. It is based on the assignment of a probability to each category based on its historical occurrence.

It is clear that for this type of encoding we need datasets containing all the categories in order to calculate true probabilities.
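A minimal sketch of bin-counting as frequency encoding with pandas (in practice the probabilities would come from historical data; the toy column is illustrative):

import pandas as pd

df = pd.DataFrame({"weather": ["sun", "sun", "sun", "rain", "snow"]})

# Each category is replaced by its observed relative frequency,
# used here as an estimate of its historical probability.
frequencies = df["weather"].value_counts(normalize=True)
df["weather_freq"] = df["weather"].map(frequencies)
print(df)
#   weather  weather_freq
# 0     sun           0.6
# 1     sun           0.6
# 2     sun           0.6
# 3    rain           0.2
# 4    snow           0.2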

Binary encoding

First, each category is transformed into a number, if it was not one already; then this decimal number is converted into binary, and the bits that compose it are separated one by one, each creating a new feature, so that each new column corresponds to a specific bit. This encoding creates a limited number of new columns and is therefore also suitable when the number of categories is high. It is more efficient than one-hot encoding both at the computational and storage levels, but the performance obtained is lower, as there is no complete separation between the categories. In fact, equal bits in a column do not correspond to the same category, and there is the risk that the machine learning algorithm seeks erroneous associations between these elements. It therefore represents a compromise between efficiency and performance.
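A sketch with the BinaryEncoder of the category_encoders library mentioned in the footnote (the genre column is illustrative):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"genre": ["rock", "jazz", "pop", "folk", "rock"]})

# Each category is mapped to an integer, which is then written in
# binary: the 4 categories here fit in 3 bit columns instead of
# the 4 columns that one-hot encoding would create.
encoder = ce.BinaryEncoder(cols=["genre"])
encoded = encoder.fit_transform(df)
print(encoded)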


Hashing encoders

This last encoding algorithm is similar to one-hot, but with fewer columns created and some information lost. It is based on the concept of the hashing trick.2 The number of new features created is established a priori, and therefore the growth of the dataset can be controlled before applying the encoding algorithm.
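A sketch with the HashingEncoder of category_encoders, where n_components fixes a priori the number of columns created (the data is illustrative):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"genre": ["rock", "jazz", "pop", "folk"]})

# The output always has n_components columns, no matter how many
# distinct categories appear; hash collisions lose some information.
encoder = ce.HashingEncoder(cols=["genre"], n_components=4)
encoded = encoder.fit_transform(df)
print(encoded)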

5.3.2 Application on the case study

After creating the extended table of cuts, as explained in section 5.2, the categorical features need to be managed. The only column represented by categories is StepType, which represents the type of cut made: recall that it can be StepVsxTaglio5X8 or StepVsxTaglioDiagonale5X8. These two strings are the categories of the feature StepType.

There are only 3 machines that send data in the format described in the chapter on the database (chapter 3). The ProcessorID, which represents the machine, may seem a categorical feature but, as we will see, the studies will be kept separate for each machine because of their diversity. This means that in every algorithm training the ProcessorID will be unique for each machine and will be eliminated from the treatment, so it is not necessary to treat it as a categorical feature.

Since the categories of StepType are few, we will use the classic one-hot method, which works well when there are not too many categories to split. It can be implemented directly with the SKLearn machine learning library, but it is easier and more intuitive to use a library dedicated to encoding called category_encoders. Its use is simple: the object capable of encoding is created through the class OneHotEncoder, passing it the names of the features to be managed; then the fit_transform method is applied to the dataframe to be modified. The method returns another dataframe with the newly created columns added.
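A minimal sketch of this step (the dataframe and its content are illustrative stand-ins; the encoder usage follows the description above, with use_cat_names=True producing the column names shown in table 5.3):

import pandas as pd
import category_encoders as ce

# Toy stand-in for the extended table of cuts.
df_cuts = pd.DataFrame(
    {"StepType": ["StepVsxTaglio5X8", "StepVsxTaglioDiagonale5X8"]}
)

# One OneHotEncoder handles the single categorical column StepType.
encoder = ce.OneHotEncoder(cols=["StepType"], use_cat_names=True)
df_cuts = encoder.fit_transform(df_cuts)

# df_cuts now contains the binary columns
# StepType_StepVsxTaglio5X8 and StepType_StepVsxTaglioDiagonale5X8.
print(df_cuts.columns.tolist())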

In table 5.3 you can see how the features have increased after handling the categorical features. Moreover, the type of the categorical feature was object, which could represent a problem for many machine learning algorithms. This is no longer a problem thanks to the transformation into numerical features.

2 For a clear and satisfactory explanation of the hashing trick see "Don't be tricked by the hashing trick" by Lucas Bernardi.


Feature                              Type
CutLength                            int64
Grind_MC                             int64
Cut_MC                               int64
TSup_MC                              int64
TInf_MC                              int64
Heating_MC                           int64
Detach_MC                            int64
IsDiagonal_MC                        int64
BadTerminated_MC                     int64
GlassNumberInSession                 int64
StepNumberInGlass                    int64
GrindEnabled                         int64
StepType_StepVsxTaglio5X8            int64
StepType_StepVsxTaglioDiagonale5X8   int64
Spess1_sup                           float64
Spess2_inf                           float64
SpessPVB                             float64
Pmax_sup                             float64
Pmax_inf                             float64
parWheelBreakoutInf                  float64
parBreakoutInfTrim                   float64
parBreakoutSupTrim                   float64
NGlassInOpt                          float64
NGlassInSess                         float64
GlassLength                          float64
GlassHeight                          float64
GlassTotCut(mm)                      float64
PiecesInGlass                        float64
WastePiecesInGlass                   float64
TimeFromLastSession                  float64
TimeFromLastGlass                    float64
TimeFromLastType                     float64

Table 5.3: Features after handling categorical features; now all features are of int or float type, only the datetime is of object type.