• Non ci sono risultati.

An introduction to Machine Learning

N/A
N/A
Protected

Academic year: 2023

Condividi "An introduction to Machine Learning"

Copied!
72
0
0

Testo completo

(1)

An introduction to Machine Learning

Gianfranco Lombardo, Ph.D [email protected]

DIPARTIMENTO DI INGEGNERIA E ARCHITETTURA

(2)

Currently I am a Postdoctoral Researcher in the Social web, intelligent and distributed systems engineering (SoWIDE) Group at the Department of Engineering and Architecture (DIA) of the University of Parma.

I received the Ph.D in Information Technologies (2021) and the Master Degree in Computer Engineering (2017) both at the University of Parma.

In 2019 I have been Visiting Researcher at the Center for Applied Optimization (CAO) at the Herbert Wertheim College of Engineering of the University of Florida (United States)

My research activity is focused on:

Machine Learning and Deep Learning

Natural Language Processing

Network Science and Graph Data Mining

Fintech technologies

Social Networks

Predictive Maintenance

For questions reach me at [email protected] Materials: http://sowide.unipr.it/en/teaching/ai

@me

(3)

Gianfranco Lombardo, Ph.D ([email protected])

Machine learning fundamentals

Learning by examples

Training and test set

Loss function

Overfitting and Underfitting

Regression tasks

Linear Regression

Root Mean Square Error (RMSE)

Gradient descent

Classification tasks

Why Linear regression is not suitable?

Logistic regression

Confusion matrix and metrics

Regularization and polynomial regression

Lecture summary

(4)

Stanford CS229: Machine Learning — Andrew Ng, Stanford University

Free lectures on youtube:

Hands-On Machine Learning with Scikit-learn & Tensorflow - Aurèlien Gèron

Book (O’Reilly editor)

Pattern Recognition and Machine Learning - Christopher Bishop

References

(5)

Gianfranco Lombardo, Ph.D ([email protected])

● ML is the field of study that gives computers the ability to learn from data without being explicitly programmed

A machine-learning system is trained rather than explicitly programmed

● It is a subset of the Artificial Intelligence

● Also known as statistical learning

What is Machine learning (ML) ?

(6)

● We are never certain something will happen, but we usually know (or can estimate rather well) how likely it is to happen or, at least, what is most likely to happen, based on the experience we have acquired throughout our life.

● Experience in Machine learning means: Data in the form of examples

● Explore data to find patterns to understand the hidden laws that regulates the domain

Learning by experience

(7)

Gianfranco Lombardo, Ph.D ([email protected])

Training Set: the dataset from which we learn

Validation set: A dataset we use to tune model’s parameters

Test Set: A dataset which must include patterns describing the same problem which do not belong to the training set

The test set is used to verify whether what has been learnt in the training phase can be correctly generalized on new data

We want to avoid overfitting (the learning algorithm memorizes the training set instead of learning general rules, the model performs well on the training data, but it does not generalize well). We use the validation set to ensure that the model is not overfitting

How to use data

(8)

How to really use data

Full Dataset

Training data (~ 70/80%) Test set (~ 30/20%)

Training set Validation

set

ML

Algorithm ML Model Evaluation

Select parameters for the model

(9)

Gianfranco Lombardo, Ph.D ([email protected])

Supervised learning: Each example includes an input pattern and the desired output (teaching input) the model is expected to produce when the pattern is input. Learning modifies the model such that actual model outputs become more and more similar to the teaching inputs.

Unsupervised learning: Each example is represented only by the input pattern. Learning modifies the model such that it reflects some regularities and similarities that characterize the data (e.g., by grouping similar data or identifying regions in the pattern space where they are most likely to be located)

Two ways of learning from examples

(10)

A model is mathematical tool that maps our examples (observations) X to the desired output y in the form y ~ F(X) where F() is our model

F() is a parametric function and we call its parameters W

● So, a model is y=F(X,W) and the goal of learning is to estimate these parameters W

● Machine learning algorithms provides different ways to get these parameters

○ Most of them are based on statistics and differential calculus

A model in supervised learning

(11)

Gianfranco Lombardo, Ph.D ([email protected])

Example: House price prediction

Mq Price

50 80,000 $

75 100,000 $

125 130,000 $

35 50,000 $

200 220,000 $

60 90,000 $

100 110,000 $

● We want to build an intelligent system that helps to predict prices of houses in a city

● We only have two information:

square meters and the price

● Square meters is what we call a feature

● Price represents our target output

(12)

Example: House price prediction

● X in this case is represented by square meters column

● y in this case is the price column

● Our goal is to identify a model that on the basis of these examples can understand the underlying rule of this domain

(13)

Gianfranco Lombardo, Ph.D ([email protected])

● A split of data in training and test set to evaluate if the model is able to generalize

● A ML algorithm to build the model

● A metric for this evaluation

What we need ?

(14)

Splitting data

Mq Price

50 80,000 $

75 100,000 $

125 130,000 $

35 50,000 $

200 220,000 $

60 90,000 $

100 110,000 $

● A suggested division is to use the 70 % of data in training and 30 % for the test

Training setTest set

(15)

Gianfranco Lombardo, Ph.D ([email protected])

Dataset VS training set and test set

TRAINING SET DATASET

(16)

● We want to find the most easy linear model that fits our data to get predictions on new data

● In our case let’s consider the equation of a line

○ y = mx + q

■ Where m is the slope

■ q is the intercept

● If we write that equation as

y = w

1

X + w

0

○ y is our target output

○ X is our feature

○ W are the two parameters that we have to learn

● If X0=1 we can write the equation as Y = WT X

Linear model

(17)

Gianfranco Lombardo, Ph.D ([email protected])

(18)

We want w0 and w1 to have a line that has the minimum distance from our point

Let’s call the output of our linear model y* = F(xi , W)

● We want the best W that minimize the following cost function:

○ Loss function L

○ Where m is the number of the training examples

● To minimize this function we should compute the derivative (gradient) of L with respect to W and find where this derivative is equal to 0

○ ∇ w L = 0

Learning: An optimization problem

(19)

Gianfranco Lombardo, Ph.D ([email protected])

Minimization of J(w)

(20)

● We have a J(w0, w1) cost function and we want to find the parameters w0, w1 that better minimizes J

○ We start with a random choice of W

○ Keep changing W to reduce J(W) until we hopefully end up at a minimum

○ Weights are updated simultaneously

Gradient descent to minimize J(w)

Repeat until convergence {

}

(21)

Gianfranco Lombardo, Ph.D ([email protected])

Gradient descent to minimize J(w)

(22)

Gradient descent to minimize J(w)

(23)

Gianfranco Lombardo, Ph.D ([email protected])

Gradient descent to minimize J(w)

● Actually, in order to simplify the computation of the gradient, the cost function is usually defined with an addition

(24)

Gradient descent uses the whole training set to compute gradients at every step.

For this reason it is very slow when the training set is large

Stochastic-Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. It is faster and it makes it possible to train on huge training sets (less memory required)

Due its random nature SGD is less regular than Gradient Descent. Instead of gently decreasing until the minimum, the cost function will bounce up and down, decreasing only on average

Final parameter values are good but no guarantees about optimality

When the function is very irregular, SGD can help to jump from a local minima

Stochastic-Gradient Descent (SGD)

(25)

Gianfranco Lombardo, Ph.D ([email protected])

● A traditional statistics tool that fits a straight line as a model

Y = WT X

○ Where X is the feature matrix (examples) and Y is the column vector with the target values

● It is considered the first ML algorithm with the idea of learning something from data

● It computes the gradient of the loss and estimates the most performing line

● It is available in the python module scikit-learn (sklearn)

Linear Regression

(26)

Example

from sklearn import linear_model

from sklearn.metrics import mean_squared_error import matplotlib.pyplot as plt

import numpy as np

X_train = [50,75,125,35,200]

X_test = [60,100]

y_train=[80000,100000,130000,50000,220000]

y_test=[90000,110000]

(27)

Gianfranco Lombardo, Ph.D ([email protected])

Example

# Create linear regression object

regr = linear_model.LinearRegression()

# Train the model using the training sets

regr.fit(np.array(X_train).reshape(-1, 1), y_train)

# Make predictions using the testing set

y_pred = regr.predict(np.array(X_test).reshape(-1, 1))

# The coefficients

print('Coefficients: \n', regr.coef_) print('Coefficients: \n', regr.intercept_)

# The mean squared error

print('Mean squared error: %.2f'% mean_squared_error(y_test, y_pred))

(28)

Plot

plt.plot(X_train,y_train,"ro") plt.plot(X_test,y_test,"bo")

X =np.concatenate([X_train,X_test])

y = [ x_i*regr.coef_ + regr.intercept_ for x_i in X]

plt.plot(X,y, color='orange', linewidth=3) plt.show()

(29)

Gianfranco Lombardo, Ph.D ([email protected])

● We are computing the mean square error on the test set trying to understand if our model is good or not

● What is the answer ?

○ MSE = 85318750.59

○ However MSE is not in the same unit of our samples

○ An idea: Root mean square error : 9236.814958911717

■ Is it good or not ?

How good is our model ?

(30)

● Data are purely invented so we cannot expect great results

○ Data are important, AI is not magic

● Maybe a polynomial or a non linear model could be better choices

● Dataset is too small

○ Data are important and we need the bigger dataset of observation that is possible

○ Overfitting and underfitting can be relevant with a small dataset

Some problems

(31)

Gianfranco Lombardo, Ph.D ([email protected])

Overfitting VS underfitting

(32)

● A model ‘learnt from data’ is generally as good as the data from which it has been derived

● A good dataset should be

○ Large

○ Correct (affected by noise as little as possible)

○ Consistent (examples must not be contradictory and a pattern should exist)

○ Well balanced (all classes adequately represented)

Good datasets for good models

(33)

Gianfranco Lombardo, Ph.D ([email protected])

● Scikit learn provides several public datasets to develop intelligent system but in particular for teaching reasons

from sklearn import datasets

○ Complete list:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

○ https://scikit-learn.org/stable/datasets/

Scikit-learn datasets

(34)

Examples with several features

(35)

Gianfranco Lombardo, Ph.D ([email protected])

Diabete

from sklearn import datasets, linear_model

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

# Load the diabetes dataset

X, y = datasets.load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

(36)

Diabete

# Create linear regression object

regr = linear_model.LinearRegression()

# Train the model using the training sets regr.fit(X_train, y_train)

# Make predictions using the testing set y_pred = regr.predict(X_test)

# The coefficients

print('Coefficients: \n', regr.coef_)

# The mean squared error

print('Mean squared error: %.2f'

% mean_squared_error(y_test, y_pred)) from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_pred)) print(rmse)

(37)

Gianfranco Lombardo, Ph.D ([email protected])

BREAK

(38)

● When the target value we aim to predict is a numerical real value, we are performing a regression task

● When the target value to be predicted is a class among two or more choices but limited, we are performing a classification task

○ Examples:

■ Price prediction is regression

■ Detect a cat or a dog in a image is classification

■ Detect a cancer (yes or not) is a binary classification

● In machine learning there algorithms only for regression tasks (e.g linear regression) and some only for classification. Finally some others for both

Regression and classification

(39)

Gianfranco Lombardo, Ph.D ([email protected])

● We want to estimate a parametric function F(X,W) that maps our input features X to one or more of our target labels in y, where y is a limited set and is called class.

● The goal is to find the decision boundaries in the features space

Model for classification task

(40)

● We want to train a model to predict if a tumor is benign or malignant using a single feature (tumor size)

● y ∊ {0,1} : 0 is malignant , 1 is benign

● X has only one feature

Can we adapt Linear Regression for a classification task?

(Yes) 1

(No) 0

Tumor size

Malignant ?

X X X X

X X X X

(41)

Gianfranco Lombardo, Ph.D ([email protected])

We can find the best Y = WT X from our dataset and then select a threshold over y to classify when the tumor is malignant or not

Can we adapt Linear Regression for a classification task?

(Yes) 1

(No) 0

Tumor size

0.5

X X X X

X X X X

Y = W

T

X

t

(42)

● Does it really work? Or it was just lucky? What happens if i add another training example ?

Can we adapt Linear Regression for a classification task?

(Yes) 1

(No) 0

Tumor size

0.5

X X X X X

X X X X

t

These are now misclassified

(43)

Gianfranco Lombardo, Ph.D ([email protected])

● Moreover, linear regression returns a y that is not bounded between 0 and 1

● We want a hW(X) bounded: 0<= hW(X) <= 1, the simplicity of a linear regression and a good fit to our binary classification task

○ y = h

W

(X) = 𝞼(W

T

X)

● Logistic regression (sigmoid) can be the perfect solution

Logistic regression

(44)

Logistic function

(45)

Gianfranco Lombardo, Ph.D ([email protected])

● Logistic regression (despite the name) is a classification algorithm

● The name is due to the past when the logit function was invented and used as a regression algorithm

● As a classification algorithm it estimates the probability that an instance belongs to a class or not (Binary classification).

○ If the estimated probability is greater than 50% then the model predicts that class

● The logistic function is a sigmoid function ( S-shape) that outputs a number between 0 and 1

Logistic Regression

(46)

Logistic function

● Just like a linear regression model, a logistic regression computes a weighted sum of the input features plus a bias term

● But instead of outputting the result directly, it outputs the logistic of this result

○ So when the probability that an instance belongs to the positive class is computed, the

y

* can be computed easily:

■ Class 0 (or negative) if p < 0.5

■ Class 1 (or positive) if p>= 0.5

● When possible labels are more than two, several binary classifier are built to identify the correct class

(47)

Gianfranco Lombardo, Ph.D ([email protected])

● In our case, hW(X) with Logistic regression estimated the probability that y=1 on input X

● For example, If hW(X) = 0.7 for a sample x , It means that patient has the probability of 70% of having a malignant tumor.

● If you are familiar with probability, we can see hW(X) in the following way:

○ hW(X) = P(y=1 | X; W)

○ The probability that y=1 given X parametrized by W

Logistic function

(48)

Decision boundaries

● We predict y=1 when hW(X) >= 0.5 that means when WTX >=0 and vice-versa for y=0

○ Remember y = hW(X) = 𝞼(WTX) with

(49)

Gianfranco Lombardo, Ph.D ([email protected])

Decision boundaries

● Suppose the case we have the dataset in figure, with 2 features X1 and X2

○ We trained a Logistic regression and we found that the best parameters W are respectively W0= -3 , W1=1 , W2=1

h

W

(X) = g(w

0

+ w

1

x

1

+ w

2

x

2

)

(50)

Decision boundaries

● We predict y=1 when -3 + x

1

+ x

2

>= 0

○ So when x

1

+ x

2

>= 3

The green line is an example of a decision boundary

(51)

Gianfranco Lombardo, Ph.D ([email protected])

Non-linear decision boundaries

h

W

(X) = g(w

0

+ w

1

x

1

+ w

2

x

2

+

w

3

x

2 1

+ w

4

x

2 2

)

(52)

Learning: An optimization problem (again)

How to choose parameters W ?

(53)

Gianfranco Lombardo, Ph.D ([email protected])

Learning: An optimization problem (again)

How to choose parameters W ?

(54)

Learning: An optimization problem (again)

This cost function definition introduces a non linearity that makes the function we want to minimize non-convex

With a Non-convex function, there are not guarantees to converge in a global optimum

(55)

Gianfranco Lombardo, Ph.D ([email protected])

Learning: An optimization problem

Let’s define the cost function in the following way to have a convex function

y

i*

Cost = 0 if y=1 and y*=1 Cost -> ∞ if y=1 and y*= 0

We will penalize learning algorithm by a very large cost

yi*

(56)

Learning: An optimization problem

Merge Cost in a single function and apply Gradient Descent

Repeat until convergence {

}

(57)

Gianfranco Lombardo, Ph.D ([email protected])

To evaluate a classification model we can use the accuracy

● The accuracy is defined as the ratio among the number of correct predictions over the total number of predictions

● The accuracy is often expressed as a percentage

Pay attention: Accuracy has a meaning only when the test-set is balanced

● If the dataset is unbalanced we have to check other metrics as precision, recall, F1-score etc..

How to evaluate a classification task?

(58)

Number of Instances

150 (50 in each of three classes)

Number of Attributes

4 numeric, predictive features and the class

Features Information

sepal length in cm

sepal width in cm

petal length in cm

petal width in cm

class:

Iris-Setosa

Iris-Versicolour

Iris-Virginica

Example: Iris plants dataset

(59)

Gianfranco Lombardo, Ph.D ([email protected])

Confusion matrix

● The diagonal elements represent the number of points for which the predicted label is equal to the true label. The off-diagonal elements are those that are mislabeled by the classifier.

● Perfect classification: diagonal matrix

Total Predicted

Iris-setosa Iris-versicolor Iris-virginica Real

class

Iris setosa 14 1 1

Iris-versicolor 0 16 0

Iris-virginica 0 1 15

Accuracy = (14+16+15)/48

= 0,9375

(60)

Other metrics from the Confusion Matrix

● In a binary classification we can divide predictions in True Positive (TP), True Negative (TN), False positive (FP), False Negative (FN)

Predicted Negative Positive Real

class

Negative TN FP

Positive FN TP

from sklearn.metrics import confusion_matrix confusion_matrix(y_true, y_pred)

For the other metrics: see sklearn documentation:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

(61)

Gianfranco Lombardo, Ph.D ([email protected])

Other metrics from the Confusion Matrix

Precision can be thought as the accuracy of the positive predictions

Recall (also known as Sensitivity or True positive rate) is the ratio of positive instances that are correctly

detected by the classifier

F1 is the harmonic mean of precision and recall.

Whereas the regular mean treats all values equally, the harmonic one gives much weight to low values. As a consequence we can get a high F1 score only if both precision and recall are high

(62)

import numpy as np

from sklearn.linear_model import LogisticRegression from sklearn import datasets

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

# import some data to play with iris = datasets.load_iris()

X = iris.data # we only take the first two features.

y = iris.target

logreg = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

Iris plants

(63)

Gianfranco Lombardo, Ph.D ([email protected])

# Create an instance of Logistic Regression Classifier and fit the data.

logreg.fit(X_train, y_train)

# Make predictions using the testing set y_pred = logreg.predict(X_test)

acc= accuracy_score(y_test, y_pred) print(acc)

Iris plants

(64)

● Example (From Bishop): We have a dataset of points (blue circle) generated from the function sin(2πx) + N (a random noise) . Our goal is to predict the value of t for some new value of x, without the knowledge of the green curve (sin(2πx)) and how this dataset has been generated

Model selection and Regularization

Value of x is our single feature

Value t is our target value to learn and predict

Question: It’s a regression task, but how can we select the order M of the polynomial ?

N= 10

(65)

Gianfranco Lombardo, Ph.D ([email protected])

Model selection and Regularization

(66)

Model selection and Regularization

(67)

Gianfranco Lombardo, Ph.D ([email protected])

Model selection and Regularization

For M=9, the training set error goes to zero but the test set error has become very large

This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases

Furthermore, we might suppose that the best predictor of new data should be something similar to the function sin(2πx)

Remember that power series expansion of that function contains terms of all orders ( see mathematical analysis 1’s notes for more details), so we might expect that results should improve monotonically as we increase M

As M increases, the magnitude of coefficients typically gets larger

With M=9 the model is becoming increasingly tuned to the random noise on the target values

(68)

Model selection and Regularization

If we use a bigger dataset with N=100 samples the overfitting becomes less severe

Rough heuristic: Number of data points should be no less than some multiple (say 5 or 10) of the number of parameters in the model.

(69)

Gianfranco Lombardo, Ph.D ([email protected])

L

2

Regularization

● To control the over-fitting phenomenon a common technique involves adding a penalty term to the cost function in order to discourage the coefficients from reaching large values

○ The simplest penalty terms is the sum of squares of all of the coefficients:

This particular case of a quadratic regularizer is called Ridge regression and in the context of neural networks is known as Weight decay

● If we set the regularization parameter λ to a large value, when we minimize we force the weights to be very small

(70)

Regularization

In this case the over-fitting has been suppressed if λ is too much large we get again a poor representation

(71)

Gianfranco Lombardo, Ph.D ([email protected])

● If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-order polynomials, so that the model looks like this:

Polynomial regression

● It is still linear:

● We need to transform the input data matrix into a new data matrix of a given degree. From [x1,x2] to [ 1, x1 , x2 ,x12 , x1x2 , x22 ]

from sklearn.preprocessing import PolynomialFeatures poly = PolynomialFeatures(degree=2)

X= poly.fit_transform(X)

(72)

from sklearn import linear_model

regr = linear_model.Ridge(alpha=5.)

L2 regularization

Riferimenti

Documenti correlati

In a later contribution, Kanoulas et al. propose a large-scale study on the effect of the distribution of labels across the different grades of relevance in the training set on

During the Nazi occupation of Rome, in 1943, the Germans requi- sitioned and dispersed the original collection, but in 1949 the Cineteca Nazionale was instituted by the State as

may very well be the case that current traffic rules do not fit a scenario with self-driving

– While standard NN nodes take input from all nodes in the previous layer, CNNs enforce that a node receives only a small set of features which are spatially or temporally close

I risultati ottenuti per queste registrazioni sono da imputarsi alle condizioni near fault che con- traddistinguono le registrazioni del transetto di valle Aterno

Al di là di una più diretta riflessione storica, stimolata dagli ambienti francesi, le implicazioni radicali della ricezione del- la cultura politica francese di matrice

2959, GC, 1993, 1, 2229, avevano da tempo riconosciuto che «fonte legale del dovere di corrispondere il premio non è il solo fatto materiale del rinvenimento del bene; occorrendo,

Il giudizio complessivo sul testo costituzionale focalizza una grossa ed insuperabile contraddizione, che secondo Paladin influenzerà fortemente lo snodarsi della storia