
It can happen that, because the database is small and unbalanced, some train or test datasets contain no examples of machine errors. If this happens in the train set, the algorithm cannot find correlations and will never predict 1. If, on the other hand, there are no positive examples in the test set, the algorithm can never produce a true positive, again obtaining null scores.
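One common way to avoid splits that lack error examples is stratified sampling. The sketch below uses synthetic data (the real machine features are not shown here) to illustrate how a stratified split preserves the rare error class in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic unbalanced labels: 1 = machine error (rare), 0 = normal operation.
rng = np.random.default_rng(0)
y = np.array([1] * 5 + [0] * 95)
X = rng.normal(size=(100, 4))

# A plain random split can leave all five errors on one side;
# stratify=y preserves the class ratio in both train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(y_tr.sum(), y_te.sum())  # both splits contain error examples
```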

6.6 Algorithm Choice

From the results shown above it is clear that the algorithm that currently best predicts the errors of the 548 Lam machine is the Random Forest Classifier. As already mentioned, however, probabilistic algorithms are in general the ones that perform best, which is typical of unbalanced datasets.

Another feature of our database is that it has few obvious correlations with the target, so a deterministic mathematical model cannot find linear links between the data to approximate the correct result. This is the second reason why probabilistic algorithms work better.

Further studies can now be carried out on the algorithm selected as the best available. For example, one can search for the causes of the poor training-set score, or analyse the predictions through the two evaluation metrics that make up the F1 score, namely recall and precision. Unlike the F1 score, these have a more concrete meaning, so it is easier to understand what the results mean physically and in practice.
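The relationship between these metrics can be illustrated with a small example (the labels below are toy values, not results from the machine data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy predictions: 1 = machine error, 0 = normal operation.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # of predicted errors, how many were real
recall = recall_score(y_true, y_pred)        # of real errors, how many were caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)  # each equals 2/3 in this toy case
```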

6.6.1 Random forest learning curve

The behaviour of the Random Forest Classifier algorithm can be studied in more detail through its learning curves (see paragraph 4.1.4). The cross-validation score curve increases as the number of examples in the dataset grows, while the training score curve stays almost constant around the maximum value. This last fact probably indicates overfitting, since the training curve would otherwise decrease slightly.

Another possible cause of the gap between the two curves is a lack of data. Since the validation curve rises steadily as the number of examples increases, further growth can be expected as more data become available.
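Curves of this kind can be produced with scikit-learn's `learning_curve` utility. The sketch below uses a synthetic unbalanced dataset as a stand-in for the machine data, with the same F1 scoring as the curves discussed here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the machine data (~10% error examples).
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="f1",  # same metric as the learning curves discussed here
)

# A training score near the maximum with a lower, still-rising
# cross-validation score suggests overfitting and/or too little data.
print(train_scores.mean(axis=1), cv_scores.mean(axis=1))
```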

Figure 6.8: Random forest learning curve for client 1. On the x axis, the size of the dataset; on the y axis, the F1 score.

In figures 6.9 and 6.10 you can see the learning curves for customers 2 and 3. The comments are similar to those made for customer 1, with the difference that the scores are lower. Here too there may be a problem of overfitting or a lack of data.

Figure 6.9: Learning curve for client 2

A further possibility is the uniqueness of each error situation: an algorithm can be trained to predict only errors like those it was trained on, and it is difficult for it to find correlations with the test set. In reality there are certainly causes that lead to a machine error; when the available data (features) are not representative of the problem to be solved, we speak of dataset mismatch, already discussed in paragraph 4.1.4.

Figure 6.10: Learning curve for client 3

If the scarcity of results is due to a dataset mismatch, machine learning algorithms and their tuning cannot overcome the problem. The best solution to obtain good error-prediction results would in this case be a new project, collecting new features and new data.

Chapter 7

Time Prevision Solution

This chapter discusses the second problem cited in the introduction, chapter 1.3, i.e. the prediction of plate processing times. Forecasting processing times is a problem of great importance for Bottero, as customers often expressly request a processing time. Since the machine can operate in many different ways, indicating a time is not at all easy.

Before this work, a deterministic approach had been attempted for some applications; it often turned out to be incorrect and, above all, covered only certain processes. In this chapter we will explain how the process can be generalized while obtaining satisfactory results.

This second part of the thesis therefore has a more practical approach: its purpose is not only to optimize the results, but above all to create an algorithm that can be recreated directly on the machine, so that customers know how much time will be spent before the processing of the glass plate begins.

The time prediction is based on regression machine learning algorithms. The idea of using machine learning in this second work also stems from the need to generalize the solution quickly and reliably. Unlike the error prediction part, we will not discuss the process that leads to the result (i.e. mathematical formulation, data merging, ...) in as much depth; this second part focuses instead on the structure of the practical work needed to port and recreate the prediction platform directly on the 548 Lam machine.
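The overall shape of such a regression pipeline can be sketched as follows. The features and target below are entirely hypothetical stand-ins (plate dimensions, tool parameters, etc. are not described here), and Random Forest regression is only one plausible choice among the regression algorithms mentioned:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical job features (stand-ins, not the real Bottero features).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
# Hypothetical processing time in seconds, driven mostly by two features.
y_time = 10 + 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(random_state=0)
scores = cross_val_score(model, X, y_time, cv=5, scoring="r2")

model.fit(X, y_time)
# Estimate the processing time of a new job before it starts.
predicted_seconds = model.predict(X[:1])
```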