Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models
Full explanation on Linear Regression and how it learns
Just like Mr. Miyagi taught young Daniel LaRusso karate through repetitive simple chores, which ultimately transformed him into the Karate Kid, mastering foundational algorithms like linear regression lays the groundwork for understanding the most complex of AI architectures such as Deep Neural Networks and LLMs.
Through this deep dive into the simple yet powerful linear regression, you will learn many of the fundamental parts that make up the most advanced models built today by billion-dollar companies.
What is Linear Regression?
Linear regression is a simple mathematical method used to understand the relationship between two variables and make predictions. Given some data points, such as the one below, linear regression attempts to draw the line of best fit through these points. It’s the “wax on, wax off” of data science.
Once this line is drawn, we have a model that we can use to predict new values. In the above example, given a new house size, we could attempt to predict its price with the linear regression model.
The Linear Regression Formula
Y is the dependent variable, that which you want to calculate — the house price in the previous example. Its value depends on other variables, hence its name.
X are the independent variables. These are the factors that influence the value of Y. When modelling, the independent variables are the input to the model, and what the model spits out is the prediction or Ŷ.
β are parameters. We give the name parameter to those values that the model adjusts (or learns) to capture the relationship between the independent variables X and the dependent variable Y. So, as the model is trained, the input of the model will remain the same, but the parameters will be adjusted to better predict the desired output.
Parameter Learning
We require a few things to be able to adjust the parameters and achieve accurate predictions.
- Training Data — this data consists of input and output pairs. The inputs will be fed into the model and during training, the parameters will be adjusted in an attempt to output the target value.
- Cost function — also known as the loss function, is a mathematical function that measures how well a model’s prediction matches the target value.
- Training Algorithm — is a method used to adjust the parameters of the model to minimise the error as measured by the cost function.
Let’s go over a cost function and training algorithm that can be used in linear regression.
Cost Function: Mean Squared Error (MSE)
MSE is a commonly used cost function in regression problems, where the goal is to predict a continuous value. This is different from classification tasks, such as predicting the next token in a vocabulary, as in Large Language Models. MSE focuses on numerical differences and is used in a variety of regression and neural network problems, this is how you calculate it:
- Calculate the difference between the predicted value, Ŷ, and the target value, Y.
- Square this difference — ensuring all errors are positive and also penalising large errors more heavily.
- Sum the squared differences for all data samples
- Divide the sum by the number of samples, n, to get the average squared error
You will notice that as our prediction gets closer to the target value the MSE gets lower, and the further away they are the larger it grows. Both ways progress quadratically because the difference is squared.
Training Algorithm: Gradient Descent
The concept of gradient descent is that we can travel through the “cost space” in small steps, with the objective of arriving at the global minimum — the lowest value in the space. The cost function evaluates how well the current model parameters predict the target by giving us the loss value. Randomly modifying the parameters does not guarantee any improvements. But, if we examine the gradient of the loss function with respect to each parameter, i.e. the direction of the loss after an update of the parameter, we can adjust the parameters to move towards a lower loss, indicating that our predictions are getting closer to the target values.
The steps in gradient descent must be carefully sized to balance progress and precision. If the steps are too large, we risk overshooting the global minimum and missing it entirely. On the other hand, if the steps are too small, the updates will become inefficient and time-consuming, increasing the likelihood of getting stuck in a local minimum instead of reaching the desired global minimum.
Gradient Descent Formula
In the context of linear regression, θ could be β0 or β1. The gradient is the partial derivative of the cost function with respect to θ, or in simpler terms, it is a measure of how much the cost function changes when the parameter θ is slightly adjusted.
A large gradient indicates that the parameter has a significant effect on the cost function, while a small gradient suggests a minor effect. The sign of the gradient indicates the direction of change for the cost function. A negative gradient means the cost function will decrease as the parameter increases, while a positive gradient means it will increase.
So, in the case of a large negative gradient, what happens to the parameter? Well, the negative sign in front of the learning rate will cancel with the negative sign of the gradient, resulting in an addition to the parameter. And since the gradient is large we will be adding a large number to it. So, the parameter is adjusted substantially reflecting its greater influence on reducing the cost function.
Practical Example
Let’s take a look at the prices of the sponges Karate Kid used to wash Mr. Miyagi’s car. If we wanted to predict their price (dependent variable) based on their height and width (independent variables), we could model it using linear regression.
We can start with these three training data samples.
Now, let’s use the Mean Square Error (MSE) as our cost function J, and linear regression as our model.
The linear regression formula uses X1 and X2 for width and height respectively, notice there are no more independent variables since our training data doesn’t include more. That is the assumption we take in this example, that the width and height of the sponge are enough to predict its price.
Now, the first step is to initialise the parameters, in this case to 0. We can then feed the independent variables into the model to get our predictions, Ŷ, and check how far these are from our target Y.
Right now, as you can imagine, the parameters are not very helpful. But we are now prepared to use the Gradient Descent algorithm to update the parameters into more useful ones. First, we need to calculate the partial derivatives of each parameter, which will require some calculus, but luckily we only need to this once in the whole process.
With the partial derivatives, we can substitute in the values from our errors to calculate the gradient of each parameter.
Notice there wasn’t any need to calculate the MSE, as it’s not directly used in the process of updating parameters, only its derivative is. It’s also immediately apparent that all gradients are negative, meaning that all can be increased to reduce the cost function. The next step is to update the parameters with a learning rate, which is a hyper-parameter, i.e. a configuration setting in a machine learning model that is specified before the training process begins. Unlike model parameters, which are learned during training, hyper-parameters are set manually and control aspects of the learning process. Here we arbitrarily use 0.01.
This has been the final step of our first iteration in the process of gradient descent. We can use these new parameter values to make new predictions and recalculate the MSE of our model.
The new parameters are getting closer to the true sponge prices, and have yielded a much lower MSE, but there is a lot more training left to do. If we iterate through the gradient descent algorithm 50 times, this time using Python instead of doing it by hand — since Mr. Miyagi never said anything about coding — we will reach the following values.
Eventually we arrived to a pretty good model. The true values I used to generate those numbers were [1, 2, 3] and after only 50 iterations, the model’s parameters came impressively close. Extending the training to 200 steps, which is another hyper-parameter, with the same learning rate allowed the linear regression model to converge almost perfectly to the true parameters, demonstrating the power of gradient descent.
Conclusions
Many of the fundamental concepts that make up the complicated martial art of artificial intelligence, like cost functions and gradient descent, can be thoroughly understood just by studying the simple “wax on, wax off” tool that linear regression is.
Artificial intelligence is a vast and complex field, built upon many ideas and methods. While there’s much more to explore, mastering these fundamentals is a significant first step. Hopefully, this article has brought you closer to that goal, one “wax on, wax off” at a time.
Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
from Datascience in Towards Data Science on Medium https://ift.tt/FMwvoTz
via IFTTT