An introduction to Machine Learning

(1)

An introduction to Machine Learning

Gianfranco Lombardo, Ph.D [email protected]

DIPARTIMENTO DI INGEGNERIA E ARCHITETTURA

(2)

Currently I am a Postdoctoral Researcher in the Social web, intelligent and distributed systems engineering (SoWIDE) Group at the Department of Engineering and Architecture (DIA) of the University of Parma.

I received the Ph.D in Information Technologies (2021) and the Master Degree in Computer Engineering (2017) both at the University of Parma.

In 2019 I have been Visiting Researcher at the Center for Applied Optimization (CAO) at the Herbert Wertheim College of Engineering of the University of Florida (United States)

My research activity is focused on:

● Machine Learning and Deep Learning

● Natural Language Processing

● Network Science and Graph Data Mining

● Fintech technologies

● Social Networks

● Predictive Maintenance

For questions reach me at [email protected] Materials: http://sowide.unipr.it/en/teaching/ai

@me

(3)

Gianfranco Lombardo, Ph.D ([email protected])

● Machine learning fundamentals

○ Learning by examples

○ Training and test set

○ Loss function

○ Overfitting and Underfitting

● Regression tasks

○ Linear Regression

○ Root Mean Square Error (RMSE)

○ Gradient descent

● Classification tasks

○ Why Linear regression is not suitable?

○ Logistic regression

○ Confusion matrix and metrics

○

● Regularization and polynomial regression

Lecture summary

(4)

● Stanford CS229: Machine Learning — Andrew Ng, Stanford University

○ Free lectures on youtube:

● Hands-On Machine Learning with Scikit-learn & Tensorflow - Aurèlien Gèron

○ Book (O’Reilly editor)

● Pattern Recognition and Machine Learning - Christopher Bishop

References

(5)

● ML is the field of study that gives computers the ability to learn from data without being explicitly programmed

● A machine-learning system is trained rather than explicitly programmed

● It is a subset of the Artificial Intelligence

● Also known as statistical learning

What is Machine learning (ML) ?

(6)

● We are never certain something will happen, but we usually know (or can estimate rather well) how likely it is to happen or, at least, what is most likely to happen, based on the experience we have acquired throughout our life.

● Experience in Machine learning means: Data in the form of examples

● Explore data to find patterns to understand the hidden laws that regulates the domain

Learning by experience

(7)

● Training Set: the dataset from which we learn

● Validation set: A dataset we use to tune model’s parameters

● Test Set: A dataset which must include patterns describing the same problem which do not belong to the training set

○ The test set is used to verify whether what has been learnt in the training phase can be correctly generalized on new data

● We want to avoid overﬁtting (the learning algorithm memorizes the training set instead of learning general rules, the model performs well on the training data, but it does not generalize well). We use the validation set to ensure that the model is not overﬁtting

How to use data

(8)

How to really use data

Full Dataset

Training data (~ 70/80%) Test set (~ 30/20%)

Training set Validation

set

ML

Algorithm ML Model Evaluation

Select parameters for the model

(9)

● Supervised learning: Each example includes an input pattern and the desired output (teaching input) the model is expected to produce when the pattern is input. Learning modifies the model such that actual model outputs become more and more similar to the teaching inputs.

● Unsupervised learning: Each example is represented only by the input pattern. Learning modifies the model such that it reflects some regularities and similarities that characterize the data (e.g., by grouping similar data or identifying regions in the pattern space where they are most likely to be located)

Two ways of learning from examples

(10)

● A model is mathematical tool that maps our examples (observations) X to the desired output y in the form y ~ F(X) where F() is our model

● F() is a parametric function and we call its parameters W

● So, a model is y=F(X,W) and the goal of learning is to estimate these parameters W

● Machine learning algorithms provides different ways to get these parameters

○ Most of them are based on statistics and differential calculus

A model in supervised learning

(11)

Example: House price prediction

Mq Price

50 80,000 $

75 100,000 $

125 130,000 $

35 50,000 $

200 220,000 $

60 90,000 $

100 110,000 $

● We want to build an intelligent system that helps to predict prices of houses in a city

● We only have two information:

square meters and the price

● Square meters is what we call a feature

● Price represents our target output

(12)

Example: House price prediction

● X in this case is represented by square meters column

● y in this case is the price column

● Our goal is to identify a model that on the basis of these examples can understand the underlying rule of this domain

(13)

● A split of data in training and test set to evaluate if the model is able to generalize

● A ML algorithm to build the model

● A metric for this evaluation

What we need ?

(14)

Splitting data

Mq Price

50 80,000 $

75 100,000 $

125 130,000 $

35 50,000 $

200 220,000 $

60 90,000 $

100 110,000 $

● A suggested division is to use the 70 % of data in training and 30 % for the test

Training setTest set

(15)

Dataset VS training set and test set

TRAINING SET DATASET

(16)

● We want to find the most easy linear model that fits our data to get predictions on new data

● In our case let’s consider the equation of a line

○ y = mx + q

■ Where m is the slope

■ q is the intercept

● If we write that equation as

y = w

₁

X + w

₀

○ y is our target output

○ X is our feature

○ W are the two parameters that we have to learn

● If X₀=1 we can write the equation as Y = W^T X

Linear model

(17)

(18)

● We want w₀ and w₁ to have a line that has the minimum distance from our point

● Let’s call the output of our linear model y^* = F(x_i, W)

● We want the best W that minimize the following cost function:

○ Loss function L

○ Where m is the number of the training examples

● To minimize this function we should compute the derivative (gradient) of L with respect to W and find where this derivative is equal to 0

○ ∇ _w ^{L = 0}

Learning: An optimization problem

(19)

Minimization of J(w)

(20)

● We have a J(w₀, w₁) cost function and we want to find the parameters w₀, w₁ that better minimizes J

○ We start with a random choice of W

○ Keep changing W to reduce J(W) until we hopefully end up at a minimum

○ Weights are updated simultaneously

Gradient descent to minimize J(w)

Repeat until convergence {

}

(21)

Gradient descent to minimize J(w)

(22)

Gradient descent to minimize J(w)

(23)

Gradient descent to minimize J(w)

● Actually, in order to simplify the computation of the gradient, the cost function is usually defined with an addition

(24)

● Gradient descent uses the whole training set to compute gradients at every step.

For this reason it is very slow when the training set is large

● Stochastic-Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. It is faster and it makes it possible to train on huge training sets (less memory required)

● Due its random nature SGD is less regular than Gradient Descent. Instead of gently decreasing until the minimum, the cost function will bounce up and down, decreasing only on average

● Final parameter values are good but no guarantees about optimality

● When the function is very irregular, SGD can help to jump from a local minima

Stochastic-Gradient Descent (SGD)

(25)

● A traditional statistics tool that fits a straight line as a model

● Y = W^T X

○ Where X is the feature matrix (examples) and Y is the column vector with the target values

● It is considered the first ML algorithm with the idea of learning something from data

● It computes the gradient of the loss and estimates the most performing line

● It is available in the python module scikit-learn (sklearn)

Linear Regression

(26)

Example

from sklearn import linear_model

from sklearn.metrics import mean_squared_error import matplotlib.pyplot as plt

import numpy as np

X_train = [50,75,125,35,200]

X_test = [60,100]

y_train=[80000,100000,130000,50000,220000]

y_test=[90000,110000]

(27)

Example

# Create linear regression object

regr = linear_model.LinearRegression()

# Train the model using the training sets

regr.fit(np.array(X_train).reshape(-1, 1), y_train)

# Make predictions using the testing set

y_pred = regr.predict(np.array(X_test).reshape(-1, 1))

# The coefficients

print('Coefficients: \n', regr.coef_) print('Coefficients: \n', regr.intercept_)

# The mean squared error

print('Mean squared error: %.2f'% mean_squared_error(y_test, y_pred))

(28)

Plot

plt.plot(X_train,y_train,"ro") plt.plot(X_test,y_test,"bo")

X =np.concatenate([X_train,X_test])

y = [ x_i*regr.coef_ + regr.intercept_ for x_i in X]

plt.plot(X,y, color='orange', linewidth=3) plt.show()

(29)

● We are computing the mean square error on the test set trying to understand if our model is good or not

● What is the answer ?

○ MSE = 85318750.59

○ However MSE is not in the same unit of our samples

○ An idea: Root mean square error : 9236.814958911717

■ Is it good or not ?

How good is our model ?

(30)

● Data are purely invented so we cannot expect great results

○ Data are important, AI is not magic

● Maybe a polynomial or a non linear model could be better choices

● Dataset is too small

○ Data are important and we need the bigger dataset of observation that is possible

○ Overfitting and underfitting can be relevant with a small dataset

Some problems

(31)

Overfitting VS underfitting

(32)

● A model ‘learnt from data’ is generally as good as the data from which it has been derived

● A good dataset should be

○ Large

○ Correct (affected by noise as little as possible)

○ Consistent (examples must not be contradictory and a pattern should exist)

○ Well balanced (all classes adequately represented)

Good datasets for good models

(33)

● Scikit learn provides several public datasets to develop intelligent system but in particular for teaching reasons

○ from sklearn import datasets

○ Complete list:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

○ https://scikit-learn.org/stable/datasets/

Scikit-learn datasets

(34)

Examples with several features

(35)

Diabete

from sklearn import datasets, linear_model

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

# Load the diabetes dataset

X, y = datasets.load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

(36)

Diabete

# Create linear regression object

regr = linear_model.LinearRegression()

# Train the model using the training sets regr.fit(X_train, y_train)

# Make predictions using the testing set y_pred = regr.predict(X_test)

# The coefficients

print('Coefficients: \n', regr.coef_)

# The mean squared error

print('Mean squared error: %.2f'

% mean_squared_error(y_test, y_pred)) from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_pred)) print(rmse)

(37)

BREAK

(38)

● When the target value we aim to predict is a numerical real value, we are performing a regression task

● When the target value to be predicted is a class among two or more choices but limited, we are performing a classification task

○ Examples:

■ Price prediction is regression

■ Detect a cat or a dog in a image is classification

■ Detect a cancer (yes or not) is a binary classification

● In machine learning there algorithms only for regression tasks (e.g linear regression) and some only for classification. Finally some others for both

Regression and classification

(39)

● We want to estimate a parametric function F(X,W) that maps our input features X to one or more of our target labels in y, where y is a limited set and is called class.

● The goal is to find the decision boundaries in the features space

Model for classification task

(40)

● We want to train a model to predict if a tumor is benign or malignant using a single feature (tumor size)

● y ∊ {0,1} : 0 is malignant , 1 is benign

● X has only one feature

Can we adapt Linear Regression for a classification task?

(Yes) 1

(No) 0

Tumor size

Malignant ?

X X X X

(41)

● We can find the best Y = W^T X from our dataset and then select a threshold over y to classify when the tumor is malignant or not

Can we adapt Linear Regression for a classification task?

(Yes) 1

(No) 0

Tumor size

0.5

X X X X

Y = W

^T

X

t

(42)

● Does it really work? Or it was just lucky? What happens if i add another training example ?

Can we adapt Linear Regression for a classification task?

(Yes) 1

(No) 0

Tumor size

0.5

X X X X X

X X X X

t

These are now misclassified

(43)

● Moreover, linear regression returns a y that is not bounded between 0 and 1

● We want a h_W(X) bounded: 0<= h_W(X) <= 1, the simplicity of a linear regression and a good fit to our binary classification task

○ y = h

_W

(X) = 𝞼(W

^T

X)

● Logistic regression (sigmoid) can be the perfect solution

Logistic regression

(44)

Logistic function

(45)

● Logistic regression (despite the name) is a classification algorithm

● The name is due to the past when the logit function was invented and used as a regression algorithm

● As a classification algorithm it estimates the probability that an instance belongs to a class or not (Binary classification).

○ If the estimated probability is greater than 50% then the model predicts that class

● The logistic function is a sigmoid function ( S-shape) that outputs a number between 0 and 1

Logistic Regression

(46)

Logistic function

● Just like a linear regression model, a logistic regression computes a weighted sum of the input features plus a bias term

● But instead of outputting the result directly, it outputs the logistic of this result

○ So when the probability that an instance belongs to the positive class is computed, the

y

^*can be computed easily:

■ Class 0 (or negative) if p < 0.5

■ Class 1 (or positive) if p>= 0.5

● When possible labels are more than two, several binary classifier are built to identify the correct class

(47)

● In our case, h_W(X) with Logistic regression estimated the probability that y=1 on input X

● For example, If h_W(X) = 0.7 for a sample x , It means that patient has the probability of 70% of having a malignant tumor.

● If you are familiar with probability, we can see h_W(X) in the following way:

○ h_W(X) = P(y=1 | X; W)

○ The probability that y=1 given X parametrized by W

Logistic function

(48)

Decision boundaries

● We predict y=1 when h_W(X) >= 0.5 that means when W^TX >=0 and vice-versa for y=0

○ Remember y = h_W(X) = 𝞼(W^TX) with

(49)

Decision boundaries

● Suppose the case we have the dataset in figure, with 2 features X₁ and X₂

○ We trained a Logistic regression and we found that the best parameters W are respectively W₀= -3 , W₁=1 , W₂=1

h

_W

(X) = g(w

₀

+ w

₁

x

₁

+ w

₂

x

₂

)

(50)

Decision boundaries

● We predict y=1 when -3 + x

₁

+ x

₂

>= 0

○ So when x

₁

+ x

₂

>= 3

The green line is an example of a decision boundary

(51)

Non-linear decision boundaries

h

_W

(X) = g(w

₀

+ w

₁

x

₁

+ w

₂

x

₂

+

w

₃

x

²₁

+ w

₄

x

²₂

)

(52)

Learning: An optimization problem (again)

How to choose parameters W ?

(53)

Learning: An optimization problem (again)

How to choose parameters W ?

(54)

Learning: An optimization problem (again)

This cost function definition introduces a non linearity that makes the function we want to minimize non-convex

With a Non-convex function, there are not guarantees to converge in a global optimum

(55)

Learning: An optimization problem

Let’s define the cost function in the following way to have a convex function

y

_i^*

Cost = 0 if y=1 and y^*=1 Cost -> ∞ if y=1 and y*= 0

We will penalize learning algorithm by a very large cost

y_i^*

(56)

Learning: An optimization problem

Merge Cost in a single function and apply Gradient Descent

Repeat until convergence {

}

(57)

● To evaluate a classification model we can use the accuracy

● The accuracy is defined as the ratio among the number of correct predictions over the total number of predictions

● The accuracy is often expressed as a percentage

● Pay attention: Accuracy has a meaning only when the test-set is balanced

● If the dataset is unbalanced we have to check other metrics as precision, recall, F1-score etc..

How to evaluate a classification task?

(58)

Number of Instances

150 (50 in each of three classes)

Number of Attributes

4 numeric, predictive features and the class

Features Information

● sepal length in cm

● sepal width in cm

● petal length in cm

● petal width in cm

● class:

○ Iris-Setosa

○ Iris-Versicolour

○ Iris-Virginica

Example: Iris plants dataset

(59)

Confusion matrix

● The diagonal elements represent the number of points for which the predicted label is equal to the true label. The off-diagonal elements are those that are mislabeled by the classifier.

● Perfect classification: diagonal matrix

Total Predicted

Iris-setosa Iris-versicolor Iris-virginica Real

class

Iris setosa 14 1 1

Iris-versicolor 0 16 0

Iris-virginica 0 1 15

Accuracy = (14+16+15)/48

= 0,9375

(60)

Other metrics from the Confusion Matrix

● In a binary classification we can divide predictions in True Positive (TP), True Negative (TN), False positive (FP), False Negative (FN)

Predicted Negative Positive Real

class

Negative TN FP

Positive FN TP

from sklearn.metrics import confusion_matrix confusion_matrix(y_true, y_pred)

For the other metrics: see sklearn documentation:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

(61)

Other metrics from the Confusion Matrix

● Precision can be thought as the accuracy of the positive predictions

● Recall (also known as Sensitivity or True positive rate) is the ratio of positive instances that are correctly

detected by the classifier

● F1 is the harmonic mean of precision and recall.

Whereas the regular mean treats all values equally, the harmonic one gives much weight to low values. As a consequence we can get a high F1 score only if both precision and recall are high

(62)

import numpy as np

from sklearn.linear_model import LogisticRegression from sklearn import datasets

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

# import some data to play with iris = datasets.load_iris()

X = iris.data # we only take the first two features.

y = iris.target

logreg = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

Iris plants

(63)

# Create an instance of Logistic Regression Classifier and fit the data.

logreg.fit(X_train, y_train)

# Make predictions using the testing set y_pred = logreg.predict(X_test)

acc= accuracy_score(y_test, y_pred) print(acc)

Iris plants

(64)

● Example (From Bishop): We have a dataset of points (blue circle) generated from the function sin(2πx) + N (a random noise) . Our goal is to predict the value of t for some new value of x, without the knowledge of the green curve (sin(2πx)) and how this dataset has been generated

Model selection and Regularization

● Value of x is our single feature

● Value t is our target value to learn and predict

Question: It’s a regression task, but how can we select the order M of the polynomial ?

N= 10

(65)

Model selection and Regularization

(66)

Model selection and Regularization

(67)

Model selection and Regularization

● For M=9, the training set error goes to zero but the test set error has become very large

○ This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases

○ Furthermore, we might suppose that the best predictor of new data should be something similar to the function sin(2πx)

■ Remember that power series expansion of that function contains terms of all orders ( see mathematical analysis 1’s notes for more details), so we might expect that results should improve monotonically as we increase M

■ As M increases, the magnitude of coefficients typically gets larger

■ With M=9 the model is becoming increasingly tuned to the random noise on the target values

(68)

Model selection and Regularization

● If we use a bigger dataset with N=100 samples the overfitting becomes less severe

● Rough heuristic: Number of data points should be no less than some multiple (say 5 or 10) of the number of parameters in the model.

(69)

L

₂

Regularization

● To control the over-fitting phenomenon a common technique involves adding a penalty term to the cost function in order to discourage the coefficients from reaching large values

○ The simplest penalty terms is the sum of squares of all of the coefficients:

● This particular case of a quadratic regularizer is called Ridge regression and in the context of neural networks is known as Weight decay

● If we set the regularization parameter λ to a large value, when we minimize we force the weights to be very small

(70)

Regularization

In this case the over-fitting has been suppressed if λ is too much large we get again a poor representation

(71)

● If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-order polynomials, so that the model looks like this:

Polynomial regression

● It is still linear:

● We need to transform the input data matrix into a new data matrix of a given degree. From [x₁,x₂] to [ 1, x₁, x₂,x₁², x₁x₂, x₂² ]

from sklearn.preprocessing import PolynomialFeatures poly = PolynomialFeatures(degree=2)

An introduction to Machine Learning