1.4 Optimizers and model accuracy
We have previously dealt with the loss function, a mathematical way of measuring how wrong our predictions are. During training, we adjust the parameters (weights) of the model to minimize that loss function and make our predictions as accurate as possible. But how exactly do we do that? How do we change the parameters of the model, by how much, and when?
This is where optimizers come in. They tie together the loss function and model parameters by updating the model in response to the output of the loss function.
In simpler terms, optimizers shape and mold your model into its most accurate possible form by futzing with the weights. The loss function is the guide to the terrain, telling the optimizer when it’s moving in the right or wrong direction.
For a useful mental model, think of a hiker trying to get down a mountain while blindfolded. She cannot know which direction to walk in, but she can tell one thing: whether she is going down (making progress) or up (losing progress). If she keeps taking steps that lead her downwards, she will eventually reach the base. Similarly, it is impossible to know what the model's weights should be right from the start. However, with some trial and error guided by the loss function (whether the hiker is descending), the right weights can be found eventually.
1.4.1 Gradient Descent
Gradient descent is used across all types of machine learning (and many other optimization problems). It is a first-order optimization algorithm that depends on the first-order derivative of the loss function: it calculates in which direction the weights should be altered so that the function can reach a minimum.
Through backpropagation, the loss is propagated from one layer to another, and the model's parameters (also known as weights) are modified according to the loss so that it can be minimized. Gradient descent can be implemented with the following steps:
1. Calculate what a small change in each individual weight would do to the loss function (i.e., which direction the hiker should walk in).
2. Adjust each individual weight based on its gradient (i.e., take a small step in the determined direction).
3. Repeat steps 1 and 2 until the loss function gets as low as possible.
In mathematical form, this can be translated into

θ = θ − α · ∇J(θ)    (1.16)
The tricky part of this algorithm (and optimizers in general) is understanding the gradients, which represent what a small change in a weight or parameter would do to the loss function. Gradients are partial derivatives and are a measure of change.
They connect the loss function and the weights; they tell us what specific operation we should do on our weights—add 5, subtract .07, or anything else—to lower the output of the loss function and thereby make our model more accurate.
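To make this concrete, here is a toy sketch (the names and values are ours, not from the text) that estimates the gradient of a loss with respect to a single weight by finite differences and then applies the update rule of Equation 1.16:

def numerical_gradient(loss_fn, w, eps=1e-6):
    # Central difference: how the loss changes for a small change in w.
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

def loss(w):
    return (w - 3.0) ** 2    # toy loss, minimized at w = 3

w, alpha = 0.0, 0.1
for _ in range(100):
    w -= alpha * numerical_gradient(loss, w)   # theta = theta - alpha * grad
print(w)   # close to 3.0, where the loss is lowest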
The cost function can be mathematically defined as follows:

J(θ) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²    (1.17)

Note that this cost function is specific to linear regression; for other algorithms the cost function will be different, and the gradients would have to be derived from it. Other important parameters of gradient descent are:
• Learning rate: a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward the minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.
• Epochs: one epoch is one pass of the entire dataset forward and backward through the neural network. Since the dataset is usually too big to feed to the computer at once, it is divided into several smaller batches. Passing the entire dataset through the network once is typically not enough, so the full dataset is passed through the same network multiple times.
• Batch size: the total number of training examples present in a single batch. As mentioned above, the entire dataset cannot be passed into the network at once, so it is divided into a number of batches.
• Iterations: the number of batches needed to complete one epoch (see the example below).
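For example, with a training set of 2,000 examples and a batch size of 500, one epoch consists of 4 iterations; training for 10 epochs therefore performs 40 parameter updates.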
1.4.2 Gradient Descent Algorithm
Gradient descent can be implemented as follows. First of all, we define the cost function:
import numpy as np

def cal_cost(theta, X, y):
    '''
    Calculates the cost for given X and y. The following shows an
    example for a single-dimensional X.
    theta = vector of thetas
    X = rows of x's, np.zeros((2, j))
    y = actual y's, np.zeros((2, 1))
    where:
        j is the number of features
    '''
    m = len(y)
    predictions = X.dot(theta)
    # Mean squared error cost, J(theta) = 1/(2m) * sum((h - y)^2)
    cost = (1 / (2 * m)) * np.sum(np.square(predictions - y))
    return cost
Then we can define the Gradient Descent algorithm:
def gradient_descent(X, y, theta, learning_rate=0.01, iterations=100):
    '''
    X = matrix of X with added bias units
    y = vector of y
    theta = vector of thetas, e.g. np.random.randn(j, 1)
    learning_rate = step size alpha
    iterations = number of iterations
    Returns the final theta vector and the history of the cost and of
    theta over the iterations.
    '''
    m = len(y)
    cost_history = np.zeros(iterations)
    theta_history = np.zeros((iterations, 2))
    for it in range(iterations):
        prediction = np.dot(X, theta)
        # Update rule (Eq. 1.16): theta = theta - alpha * (1/m) * X^T (X theta - y)
        theta = theta - (1 / m) * learning_rate * X.T.dot(prediction - y)
        theta_history[it, :] = theta.T
        cost_history[it] = cal_cost(theta, X, y)
    return theta, cost_history, theta_history
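As a quick sanity check, here is a minimal usage sketch (the synthetic data and hyperparameter values are ours, not from the text): we fit the line y = 4 + 3x to noisy points and verify that the recovered parameters approach the true ones.

np.random.seed(0)
X = 2 * np.random.rand(100, 1)             # 100 one-dimensional inputs
y = 4 + 3 * X + np.random.randn(100, 1)    # noisy targets of y = 4 + 3x
X_b = np.c_[np.ones((100, 1)), X]          # add a bias column of ones

theta0 = np.random.randn(2, 1)             # random initial parameters
theta, cost_history, theta_history = gradient_descent(
    X_b, y, theta0, learning_rate=0.05, iterations=1000)
print(theta)              # close to [[4.], [3.]]
print(cost_history[-1])   # final cost, close to the noise floor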
1.4.3 Adaptive Moment Estimation (ADAM)
Adam is a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments. Adam differs from classical stochastic gradient descent: stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training. In Adam, by contrast, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds. Adam combines the advantages of two other extensions of stochastic gradient descent. Specifically:
• AdaGrad (Adaptive Gradient Algorithm): maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g., natural language and computer vision problems).
• RMSProp (Root Mean Square Propagation): also maintains per-parameter learning rates, adapted based on the average of the recent magnitudes of the gradients for each weight (i.e., how quickly it is changing).
This means the algorithm does well on online and non-stationary problems (e.g., noisy objectives). Instead of adapting the parameter learning rates based only on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance).
Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
Because the moving averages are initialized to zero, and beta1 and beta2 are recommended to be close to 1.0, the moment estimates are biased towards zero. This bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates. [1]
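To see why the correction matters, consider the first step with beta1 = 0.9: since m0 = 0, the first moment estimate is m1 = 0.9 · 0 + 0.1 · g1 = 0.1 · g1, only a tenth of the observed gradient. Dividing by (1 − β1^t) = (1 − 0.9) = 0.1 recovers the bias-corrected m̂1 = g1.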
Figure 1.5: Comparison of Adam to Other Optimization Algorithms Training a Multilayer Perceptron [1]
1.4.4 Adam Configuration Parameters
• alpha: also referred to as the learning rate or step size; the proportion by which weights are updated (e.g., 0.001). Larger values (e.g., 0.3) result in faster initial learning before the rate is adapted; smaller values (e.g., 1.0E-5) slow learning right down during training.
• beta1: the exponential decay rate for the first moment estimates (e.g., 0.9).
• beta2: the exponential decay rate for the second moment estimates (e.g., 0.999). This value should be set close to 1.0 on problems with sparse gradients (e.g., NLP and computer vision problems).
• epsilon: a very small number that prevents any division by zero in the implementation (e.g., 10E-8).
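These four parameters map directly onto the Adam constructors of common deep learning frameworks; as an illustration (TensorFlow/Keras is our choice here, not the text's):

import tensorflow as tf

opt = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # alpha
    beta_1=0.9,           # beta1
    beta_2=0.999,         # beta2
    epsilon=1e-8)         # epsilon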
Algorithm 1 Adam optimizer algorithm. All operations are element-wise, including powers. Good values for the constants are α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸. ε is needed to guarantee numerical stability.
1: procedure Adam(α, β1, β2, f, θ0)
2:   ▷ α is the stepsize
3:   ▷ β1, β2 ∈ [0, 1) are the exponential decay rates for the moment estimates
4:   ▷ f(θ) is the objective function to optimize
5:   ▷ θ0 is the initial vector of parameters which will be optimized
6:   ▷ Initialization
7:   m0 ← 0    ▷ First moment estimate vector set to 0
8:   v0 ← 0    ▷ Second moment estimate vector set to 0
9:   t ← 0     ▷ Timestep set to 0
10:  ▷ Execution
11:  while θt not converged do
12:    t ← t + 1    ▷ Update timestep
13:    ▷ Gradients are computed w.r.t. the parameters to optimize,
14:    ▷ using the value of the objective function at the previous timestep
15:    gt ← ∇θ f(θt−1)
16:    ▷ Update first- and second-moment estimates using the
17:    ▷ previous values and the new gradients (still biased)
18:    mt ← β1 · mt−1 + (1 − β1) · gt
19:    vt ← β2 · vt−1 + (1 − β2) · gt^2
20:    ▷ Bias-correction of the estimates
21:    m̂t ← mt / (1 − β1^t)
22:    v̂t ← vt / (1 − β2^t)
23:    θt ← θt−1 − α · m̂t / (√v̂t + ε)    ▷ Update parameters
24:  end while
25:  return θt    ▷ Optimized parameters are returned
26: end procedure
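As a complement to the pseudocode, here is a minimal NumPy sketch of Algorithm 1 (the function name and the fixed number of steps are our simplifications; the original loops until θ converges):

import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    # grad_fn(theta) returns the gradient of the objective f at theta.
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # first moment estimate, m0 = 0
    v = np.zeros_like(theta)   # second moment estimate, v0 = 0
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                    # gt
        m = beta1 * m + (1 - beta1) * g       # biased first moment
        v = beta2 * v + (1 - beta2) * g**2    # biased second moment
        m_hat = m / (1 - beta1**t)            # bias-corrected estimates
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta;
# both coordinates end up near 0.
print(adam(lambda th: 2 * th, np.array([1.0, -2.0]),
           alpha=0.01, num_steps=2000))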