Gradient Descent Algorithm
An optimization algorithm is at the heart of almost every machine learning method. We'll be covering one such algorithm in this post today.
Gradient descent is by far the most prominent optimization approach in machine learning and deep learning right now. Nonetheless, it perplexes many newcomers. If you're new to gradient boosting, the math might seem a little challenging. In this post, we'll explore gradient descent in detail and help you get a better grasp of this concept.
Gradient descent: What is it?
When training a machine learning model, gradient descent is a type of optimization algorithm. It is based on a convex function that iteratively changes its parameters to reduce a given function to its local minimum. Gradient descent begins by establishing the starting parameter values and then iteratively adjusts the values using calculus to minimize the specified cost function. It's necessary to understand gradients in order to grasp this notion fully.
A gradient is a measure of change in all weights in relation to the change in error. For better understanding, a gradient can be analogous to the slope of a function. A higher gradient can be considered as a steeper slope. This results in the model learning faster. Similarly, if the slope is zero, the model ceases to learn. In math terminology, a gradient is nothing but the partial derivative of its inputs.
Let's go through an example to understand the idea of gradient descent better. Take a look at the three-dimensional graph below in the perspective of a cost function.
Our aim is to go from the mountain in the upper right corner (at a high cost) to the sea in the lower-left corner (low cost). Beginning at the peak, we take our first step downhill in the direction indicated by the negative gradient. For each subsequent step, we recalculate the gradient with the new coordinates as the input. We repeat this procedure until we reach the bottom of our graph, or reach a point where we can no longer proceed downhill (local minimum).
Let’s understand a few more technical jargon before we proceed further.
Referencing the example above, the learning rate refers to the magnitude of the steps we took. We would cover more space with each step if we have a high learning rate, but since the slope of the hill changes at every point, we could potentially risk overshooting our local minima. On the other hand, with a low learning rate, we can guarantee more accuracy but it will take a long time to find the local minima.
A cost function helps us analyze how good or bad our model is at making predictions for a given set of inputs. The slope of this curve indicates how we should adjust our parameters to improve the model's accuracy.
Gradient Descent Algorithm
It is an iterative process that calculates the coordinates of the next point using gradient at the current point, multiples it(by a factor of learning rate) and subtracts the resultant from the current point. Thereby, completing one step. We are subtracting here as our goal is to minimize the function. Mathematically, the process of step calculation would look like this:
Here, pn+1 is the value of the new weight to be calculated pn is the value of the old weight. The parameter η is the learning rate here and has quite some influence on the model’s behavior(as seen earlier). η is then multiplied by the gradient at that particular point(calculated using derivative)
To learn more about how this equation is derived, feel free to check this post out.
Let’s try and implement the above algorithm into code.
# gradient function computes the derivative of the cost function
Procedure gradient_descent(start, gradient, learning_rate, max_iter, tol):
1. steps ← start, x ← start # variables initialization
2. for iterations = 1 to max_iter do
3. diff ← learning_rate*gradient(x) # computing the learning_rate*gradient
4. if abs(diff) < tol do # if the score is not going to improve
6. end if
7. x ← x - diff # x are the weights to be updated
8. return steps, x
9. end procedure
The function gradient_descent takes five parameters:
- start: It defines the starting point of the algorithm(usually initialized randomly).
- gradient: This function has to be pre-specified.
- learning_rate: This is the learning rate as we discussed above.
- max: This limits the maximum number of iterations that can take place.
- tolerance: To conditionally stop the function, we can assign a default value. In this case, it is 0.001.
Types of Gradient Descent
In current machine learning and deep learning algorithms, there are three primary forms of gradient descent. The primary cause for these variances is computing efficiency. It boils down to the gradient's accuracy vs. the time it takes to compute the next step. With that said, let's have an overview of the three types of gradient descent, understand each type and go over their pros and cons.
Batch Gradient Descent
The most basic kind is Batch Gradient Descent. Within the training set, it estimates the error for each example. It changes the model parameters only when it has evaluated all of the training instances.
- Using a fixed learning rate during training eliminates the risk of learning rate degradation.
- It calculates gradients in an impartial manner. The standard error is reduced as the number of examples increases.
- It will take a long time to run through all of the examples, especially when dealing with massive datasets.
- Each learning step occurs after reviewing all examples, some of which may be repetitive and offer nothing to the update.
Stochastic Gradient Descent
Stochastic Gradient Descent adjusts the parameters based on the error's gradient in relation to a single training sample. Depending on the task, this can make Stochastic Gradient Descent quicker than Batch Gradient Descent.
- Regular updates have the advantage of providing a comprehensive rate of improvement.
- As the network processes a single training sample, it is easy to fit into memory.
- The steps taken towards the minima are highly noisy as a result of the frequent updates. This frequently causes the gradient to descend in unexpected directions.
- Furthermore, due to noisy steps, convergence to the loss function minima may take longer.
Mini Batch Gradient Descent
Mini Batch Gradient Descent is a popular approach because it combines Stochastic Gradient Descent with Batch Gradient Descent. It simply divides the training set into tiny chunks and updates each of these batches. It does this by striking a compromise between the effectiveness of Batch Gradient Descent and the resilience of Stochastic Gradient Descent.
- It is faster than the Batch version since it runs through fewer examples.
- Choosing instances at random can help you avoid redundant examples that don't offer much to the learning.
- It will not converge. Because of the noise, the learning step may go back and forth on each iteration. As a result, it circles the minimal region but never converges.
- Because of the noise, the learning steps contain more oscillations, necessitating the addition of a learning decay to reduce the learning rate as we go closer to the minimum.
Frequently Asked Questions
- Does gradient descent work for all functions?
The gradient descent algorithm does not work for all functions. There are two specific requirements. A function has to be differentiable & convex.
- What is the difference between local minima and global minima?
Global minima are the points at which the function has the least value. Local minima are locations where the function has the lowest value in the vicinity of a given point.
- How does the learning rate affect the Gradient descent algorithm?
A small learning rate is more accurate at the cost of being more computationally expensive, on the other hand, a large learning rate is faster a the cost of the accuracy of the algorithm.
- What are some drawbacks of Gradient descent?
Convergence to a local minimum might be extremely sluggish.
There is no guarantee that the algorithm will locate the global minimum if there are many local minima.
This article should provide you with a general understanding of the gradient descent method in machine learning. In this post, we looked at how a Gradient Descent method works, when it may be utilized, and what typical issues arise while utilizing it.
If you’re interested in learning more about Machine Learning, you should definitely check this course out!