Machine Learning Algorithms Part 11: Ridge Regression, Lasso Regression And Elastic-Net Regression
Supervised learning problems can be further grouped into classification and regression problems. As opposed to classification, regression has the task of predicting a continuous quantity (e.g. weight, income).
Ridge Regression
As the name implies, ridge regression falls under the latter category. According to the sklearn cheat-sheet, ridge regression is useful in solving problems where you have fewer than one hundred thousand samples or when you have more parameters than samples.
Before we can begin to describe Ridge and Lasso Regression, it’s important that you understand the meaning of variance and bias in the context of machine learning.
Bias
In this context, bias does not refer to the y-intercept; rather, it is the extent to which the model fails to come up with a fit that is in line with the samples.
Variance
In contrast to the statistical definition, variance here does not mean the spread of the data but rather how much a model's accuracy changes when it is evaluated on different datasets.
The squiggly line from the preceding image performs radically differently on other datasets. Therefore, we say it has high variance. On the other hand, the straight line has relatively low variance because its sum of squared residuals is similar across different datasets.
Ridge regression is almost identical to linear regression (sum of squares) except that we introduce a small amount of bias. In return, we get a significant drop in variance. In other words, by starting with a slightly worse fit, Ridge Regression can provide better long-term predictions.
The bias added to the model is also known as the Ridge Regression penalty. We compute it by multiplying lambda by the squared weight of each individual feature.
For example, we can plot the salary as a function of years of experience.
y = wx + b, where y is the salary, x is the years of experience, w is the weight (slope) and b is the y-intercept.
In simple linear regression, we determine the best fitting line by minimizing the sum of the squared residuals.
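As a minimal sketch of that step (the salary figures below are made up purely for illustration), scikit-learn's LinearRegression fits exactly this kind of line by minimizing the sum of squared residuals:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (made-up numbers)
X = np.array([[1], [2], [3], [5], [7], [10]])               # years of experience
y = np.array([45000, 50000, 60000, 80000, 95000, 120000])   # salary

# Ordinary least squares: minimizes the sum of squared residuals
model = LinearRegression().fit(X, y)
print("slope (w):", model.coef_[0])
print("intercept (b):", model.intercept_)
```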
For the simple example above, the Ridge Regression penalty is: λ × w²
In the case of multiple linear regression, the output is a function of multiple features. Therefore, when calculating the ridge regression penalty, we incorporate all of those weights squared: λ × (w₁² + w₂² + … + wₙ²).
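As a rough sketch of that objective (the data, weights and lambda below are placeholders, not real values), the quantity Ridge Regression minimizes is the sum of squared residuals plus the penalty:

```python
import numpy as np

def ridge_cost(X, y, w, b, lam):
    """Sum of squared residuals plus the ridge penalty: lam * sum(w_i ** 2)."""
    residuals = y - (X @ w + b)
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Tiny made-up example: one feature, one weight
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([45000.0, 52000.0, 61000.0])
print(ridge_cost(X, y, w=np.array([8000.0]), b=37000.0, lam=1.0))
```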
To come up with a value for lambda, we try a range of values and use cross-validation to determine which one results in the lowest variance (i.e. the best performance on held-out data).
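In scikit-learn, lambda is exposed as alpha, and RidgeCV will run that cross-validation over a list of candidate values for us. The data and candidate alphas below are arbitrary, just to show the mechanics:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Made-up data: 100 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Try a range of lambda (alpha) values and keep the one that cross-validates best
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print("best alpha:", model.alpha_)
print("coefficients:", model.coef_)
```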
Lasso Regression
Lasso Regression is almost identical to Ridge Regression, the only difference being that the penalty uses the absolute values of the weights instead of their squares: λ × (|w₁| + |w₂| + … + |wₙ|).
As a result of taking the absolute value, Lasso Regression can shrink a slope all the way down to 0, whereas Ridge Regression can only shrink a slope asymptotically close to 0.
Since Lasso Regression can exclude useless variables from equations by setting the slope to 0, it is a little better than Ridge Regression at reducing variance in models that contain a lot of irrelevant features.
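A quick sketch of that behaviour on synthetic data (only the first two of five features actually matter here): Lasso tends to drive the weights of the useless features exactly to zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only the first two features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

print("Lasso coefficients:", Lasso(alpha=0.5).fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=0.5).fit(X, y).coef_)
# The Lasso coefficients for the irrelevant features come out exactly 0,
# while the Ridge coefficients are merely small.
```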
Elastic-Net Regression
By combining the Lasso and Ridge penalties we get Elastic-Net Regression. Elastic-Net Regression groups the parameters associated with correlated variables and shrinks them together, either keeping them all in the equation or removing them all at once.
Note: The Lasso Regression penalty and Ridge Regression penalty each get their own lambda.
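In scikit-learn's ElasticNet, the two penalties are controlled together through an overall strength (alpha) and a mixing parameter (l1_ratio) rather than two explicit lambdas; the data and parameter values below are arbitrary, chosen only to illustrate the correlated-features case:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data with two highly correlated features (made-up for illustration)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly a copy of x1
X = np.column_stack([x1, x2, rng.normal(size=200)])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=200)

# alpha sets the overall penalty strength, l1_ratio the lasso/ridge mix
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients:", model.coef_)
```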
Cory Maklin