
(1)

Data Base and Data Mining Group of Politecnico di Torino


Random Forest

Elena Baralis

Politecnico di Torino

(2)


Random Forest

Ensemble learning technique

multiple base models are combined

to improve accuracy and stability

to avoid overfitting

Random forest = set of decision trees

a number of decision trees are built at training time

the class is assigned by majority voting (a minimal sketch follows below)

Bibliography: Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Springer, 2009
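
As a minimal illustration of the voting step, the aggregation can be sketched in Python as follows, assuming each base tree exposes a predict method that returns a single class label (the names here are illustrative, not from the slides):

```python
from collections import Counter

def majority_vote(trees, x):
    """Aggregate an ensemble's predictions for instance x
    by returning the most frequent class label."""
    votes = [tree.predict(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]
```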

(3)


Random Forest

[Figure: the original training data D is split into random subsets D1, ..., Dj, ..., DB; on each subset a decision tree is learned on a random set of features; the resulting classifiers are aggregated and the class is assigned by majority voting.]

(4)


Bootstrap aggregation

Given a training set D of n instances, bootstrap aggregation draws B random samples with replacement from D and trains one tree on each sample (see the sketch after these steps)

For b = 1, ..., B

sample n′ training examples with replacement, with n′ ≤ n; a dataset subset Db is generated

train a classification tree on Db
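
A minimal sketch of this loop, assuming X and y are NumPy arrays and using scikit-learn's DecisionTreeClassifier as the base learner (the helper name and defaults are illustrative; B and n_prime mirror the slide's B and n′):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, B=100, n_prime=None, seed=0):
    """Bootstrap aggregation: draw B samples with replacement
    from the training set and fit one tree per sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_prime = n_prime or n                       # n' <= n; default n' = n
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n_prime)   # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees
```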

(5)


Feature Bagging

For each candidate split in the learning process, a random subset of the features is selected

with p the number of features, √p features are typically selected

Trees are decorrelated

feature subsets are sampled randomly, hence different features can be selected as the best attribute for the split (see the sketch below)
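
How the random feature subset enters the split search can be sketched as follows; score_split is a hypothetical helper (not part of the slides) assumed to score a single feature, e.g. by Gini gain:

```python
import numpy as np

def best_split_with_feature_bagging(X, y, score_split, rng):
    """Pick the best split among m = round(sqrt(p)) randomly chosen
    features. score_split(X, y, j) is assumed to return
    (gain, threshold) for feature j."""
    p = X.shape[1]
    m = max(1, round(np.sqrt(p)))
    candidates = rng.choice(p, size=m, replace=False)  # random feature subset
    scored = [(j, *score_split(X, y, j)) for j in candidates]
    feature, gain, threshold = max(scored, key=lambda t: t[1])  # highest gain
    return feature, threshold
```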

(6)


Random Forest – Algorithm Recap

Given a training set D of n instances with p features

For b = 1, ..., B

Sample randomly with replacement n′ training examples; a subset Db is generated

Train a classification tree on Db

During the tree construction, for each candidate split

m ≪ p random features are selected (typically m ≈ √p)

the best split is computed among these m features

The class is assigned by majority voting among the B predictions
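
These steps map directly onto scikit-learn's RandomForestClassifier; a minimal end-to-end sketch on a toy dataset (parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# B trees, each grown on a bootstrap sample (bagging) and, at every
# candidate split, restricted to m ≈ √p random features (feature bagging).
clf = RandomForestClassifier(n_estimators=100,     # B
                             max_features="sqrt",  # m ≈ √p
                             bootstrap=True)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # predict() aggregates the B trees by majority voting
```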

(7)


Random Forest

Strong points

higher accuracy than decision trees

fast training phase

robust to noise and outliers

provides global feature importance, i.e. an estimate of which features are important in the classification

Weak points

results can be difficult to interpret

a prediction is given by hundreds of trees

but feature importance at least provides an indication (see the sketch below)
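
The global feature importance mentioned above is exposed by scikit-learn as the feature_importances_ attribute of a fitted forest; a minimal sketch (the dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
clf.fit(data.data, data.target)

# Impurity-based importance of each feature, averaged over all trees
# and normalized to sum to 1.
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```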
