
4 Simulations, graphs and comparisons

4.10 Random Forest regressor

4.10.4 Tuning the Random Forest model

The initial setup of the hyperparameters, with max_depth set to 2 and random_state set to 0, is described in section 4.9.1.5. However, as argued above, the Random Forest's other hyperparameters are crucial for strong performance. The first step of a grid search is therefore to define a search space: the set of hyperparameter values to explore in order to find the best combination. To implement it, a dictionary (named grid in the code below) must be defined.

grid = {'n_estimators': [200, 300, 400],
        'max_features': ['sqrt', 'log2'],
        'max_depth': [6, 9, 18, 32]}

The names of the hyperparameters serve as the dictionary's keys. Next, a grid search object must be created, and every piece of information related to the search must be passed in as a parameter. The first parameter is the model, in this instance a Random Forest Regressor. The search space dictionary follows, together with several scoring metrics: the negative mean squared error, the negative mean absolute error, and R2. Finally, refit is a crucial parameter: setting it to 'r2' makes the grid search return a model optimised for the r-squared measure.

In addition, 5-fold cross-validation has been configured. Cross-validation is a resampling strategy used to evaluate machine learning models on a limited sample of data. The data are divided into groups according to a single parameter, k, which is why the technique is usually referred to as k-fold cross-validation.

Cross-validation is commonly used to estimate how well a machine learning model performs on unseen data: the model is evaluated on a held-out sample that was not used during training. In general, the procedure is as follows:

1. Randomly shuffle the dataset.

2. Create k groups from the dataset.

3. For every distinct group:

a. Use the group as the holdout (test) data set.

b. Use the remaining groups as the training data set.

c. Fit a model on the training data and evaluate it on the test data.

d. Retain the evaluation score and discard the model.

4. Finally, summarise the model's skill using the sample of evaluation scores collected.
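The steps above can be sketched as follows. This is an illustrative example on synthetic data, not the thesis dataset; all variable names here are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1-2: shuffle, split into k groups
for train_idx, test_idx in kf.split(X):                # step 3: each group is the holdout once
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])              # steps 3b-c: fit on the remaining groups
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))  # step 3d: keep the score

print(np.mean(scores))                                 # step 4: summarise the k evaluations
```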

The last setting, verbose, controls how much information will be printed.

GS = GridSearchCV(estimator=RandomForestRegressor(), param_grid=grid,
                  scoring=['neg_mean_squared_error',
                           'neg_mean_absolute_error', 'r2'],
                  refit='r2', cv=5, verbose=4)
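To make the setup concrete, the following self-contained sketch builds and fits such a grid-search object end to end. The data and the reduced grid are illustrative assumptions, not the thesis's training set or full search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# reduced illustrative grid (the real search space is larger)
grid = {'n_estimators': [50, 100],
        'max_features': ['sqrt'],
        'max_depth': [3, 6]}

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))                    # placeholder training data
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=60)

GS = GridSearchCV(estimator=RandomForestRegressor(random_state=0),
                  param_grid=grid,
                  scoring=['neg_mean_squared_error',
                           'neg_mean_absolute_error', 'r2'],
                  refit='r2', cv=5, verbose=0)
GS.fit(X_train, y_train)

print(GS.best_params_)   # best combination according to R2
print(GS.best_score_)    # mean cross-validated R2 of that combination
```

After fitting, GS behaves as the refitted best estimator, so GS.predict can be called directly.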

The following are some of the most crucial hyperparameters that were optimised for the Random Forest model:

• Trees in the forest (the number): Generating forests with a large number of trees (a high number of estimators) yields a more robust aggregate model with less variance, at the expense of a longer training time. The usual way to choose this value is to analyse the data: how many observations are available and how many attributes each observation includes. Because of the randomness of Random Forest, with few trees or extensive data, specific features with high predictive power might be left out of the forest and be used little or not at all. The same applies to the observations: if each tree is not trained on the complete dataset and the number of trees is limited, some observations might never be used. With all other hyperparameters held constant, increasing the number of trees decreases model error at the expense of a longer training period. Because Random Forests seldom overfit, many trees can be used to avoid these difficulties and achieve good results. The central point is that as the number of trees increases, the model variance lowers and the model error approaches its optimal value; beyond that point, building a forest with 10,000 trees is a futile strategy.
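A small experiment on synthetic data (not the thesis dataset) illustrates this diminishing-returns behaviour, here measured with the out-of-bag R2 score, where each sample is predicted only by trees that did not see it during training:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=200)

oob = {}
for n in [20, 100, 500]:
    rf = RandomForestRegressor(n_estimators=n, oob_score=True,
                               random_state=0).fit(X, y)
    oob[n] = rf.oob_score_          # OOB R2: an internal estimate of test error
    print(n, round(rf.oob_score_, 3))
```

The score typically improves sharply at first and then flattens, which is why adding trees far past the plateau only wastes training time.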

• The criterion by which to split at each tree node: Decision trees make locally optimal decisions at each node by calculating which feature, and which value of that feature, best separates the observations seen so far. To do this they employ a particular metric: MAE or MSE for regression. The general guideline is to use MSE when the data do not contain many outliers, because it severely penalises observations that deviate from the mean.

• The maximum depth of individual trees: The number of possible feature/value combinations rises as the depth of individual trees increases. The more splits a tree has, the more information it captures about the data it uses. In a single tree this leads to overfitting. In a Random Forest, overfitting is harder to achieve because of how the ensemble is built, although it is still possible at very large depth values. Given the number of features, this variable should be set to a reasonable value and then fine-tuned: neither stumps (very shallow trees) nor needlessly enormous trees.

• The number of random features considered at each split: This is one of the most crucial hyperparameters to adjust in the Random Forest ensemble.

The most straightforward way to determine the optimal value of this hyperparameter is to conduct a grid search with cross-validation, taking the following into consideration:

o A small value means fewer features are considered when splitting at each node, reducing ensemble variance at the expense of a larger individual-tree (and likely aggregate) bias.

o This value should be chosen by considering how noisy the features are (many outliers) and how many informative, high-quality features are present. For example, the number of random features per split can be quite small if the dataset contains clean, polished, high-quality features, since any feature considered will tend to be a good one. On the other hand, this value should probably be larger if the data are noisy, as that increases the likelihood that a good feature enters the contest at each split.

o Increasing the maximum number of random features examined at a split helps to lower the model's bias, since there is a greater likelihood that favourable features will be included; however, this may come at the expense of increased variance. Furthermore, training slows down as more features are tested at each node.
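The trade-offs above can be compared empirically by cross-validating each candidate value of max_features, as in this sketch on synthetic data (the thesis instead tunes it inside the full grid search):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=150)

results = {}
for mf in ['sqrt', 'log2', None]:   # None = consider all features at each split
    rf = RandomForestRegressor(n_estimators=100, max_features=mf,
                               random_state=0)
    results[mf] = cross_val_score(rf, X, y, cv=5, scoring='r2').mean()
    print(mf, round(results[mf], 3))
```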

Cross-validating the candidate alternatives and keeping the model that produces the best results, while accounting for the factors above, is the most realistic course of action here. The best result, obtained after looping over the length of the Logic Table and performing 80 trials, is:


The best hyperparameter values are: {'criterion': 'squared_error', 'max_depth': 18, 'max_features': 'sqrt', 'n_estimators': 200}

Best score according to the metric passed in refit: R2 = -0.2471368089372133

The grid search was initially run while tuning only a few hyperparameters and considering the best R2 score. The latter metric, though, can be deceptive.

R2 is rarely the appropriate statistic to assess how well a new output, y, can be predicted from a new input, x, for the following two relatively straightforward reasons:

1. The R2 value is the same if the historical data are reversed, with y becoming x and x becoming y. A measure of predictive power must distinguish the model's input from its output.

2. The R2 value can be computed before the slope and intercept of the model are even determined.

The second statement suggests that R2 can be determined without fitting a regression model.

Therefore, using it to evaluate predictive skill is illogical. The formula depends only on the raw data and is independent of any model. This is not some mathematical trick whereby terms cancel out or simplify: R2 simply measures the degree of association between two time series, x and y, as intended, and should only be used to measure the correlation between two sequences.
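The symmetry in point 1 can be checked numerically: for simple linear regression with an intercept, R2 equals the squared Pearson correlation, which is unchanged when x and y swap roles. The data below are synthetic, for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=100)

# R2 of regressing y on x
r2_xy = r2_score(y, LinearRegression().fit(x, y).predict(x))
# R2 of regressing x on y (roles reversed)
y2d = y.reshape(-1, 1)
r2_yx = r2_score(x[:, 0], LinearRegression().fit(y2d, x[:, 0]).predict(y2d))

print(round(r2_xy, 6), round(r2_yx, 6))   # the two values coincide
```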

The new dictionary has been set as follows:

grid = {'max_depth': [3, 9, 12, 18, 35, None],
        'max_features': ['auto', 'sqrt', 'log2'],
        'min_samples_leaf': [1, 2, 4, 6],
        'min_samples_split': [2, 5, 10, 15],
        'n_estimators': [100, 200, 300],
        'criterion': ['squared_error', 'absolute_error', 'poisson']}

The only parameter altered within the GridSearch function is refit, which has been set to the negative mean squared error. To obtain a more precise tuning of the hyperparameters, a study of the Random Forest variable importance was carried out; it is described in detail in the following section. After that, specific inputs were removed from the model and the grid search was run again.
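The kind of inspection that guides this pruning step can be sketched with the impurity-based importances of a fitted forest. The data and feature names below are illustrative assumptions; the thesis's actual features are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 is informative

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(['f0', 'f1', 'f2', 'f3'], rf.feature_importances_):
    print(name, round(imp, 3))
# low-importance inputs are candidates for removal before re-running the grid search
```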

After removing useless parameters and setting refit to equal negative mean squared error, the final result is:

The best hyperparameter values are: {'criterion': 'poisson', 'max_depth': 35, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}

Best score according to the metric passed in refit:

Negative mean squared error = -0.303991474696633
