Ridge Regression Python Example
Overfitting, whereby a model performs well on the training samples but fails to generalize, is one of the main challenges in machine learning. In this article, we'll cover how we can use regularization to help prevent overfitting. To be specific, we'll talk about Ridge Regression, a close cousin of Linear Regression, and how it can be used to determine the best fitting line.
Before we can begin to describe Ridge Regression, it’s important that you understand variance and bias in the context of machine learning.
Bias
In this context, the term bias does not refer to the y-intercept but to the extent to which the model fails to come up with a fit that approximates the samples. For example, the following line has a high bias since it fails to capture the underlying trend in the data.
On the other hand, the following line has a relatively low bias. If we were to measure the mean square error, it would be much lower than in the previous example.
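For reference, the mean square error mentioned here is just the average of the squared differences between the observed and predicted values. A minimal sketch (the helper name mean_square_error is ours, not part of any library):
import numpy as np

def mean_square_error(y_true, y_pred):
    # average of the squared residuals
    return np.mean((y_true - y_pred) ** 2)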
Variance
In contrast to the statistical definition, variance here does not refer to the spread of the data relative to the mean. Rather, it characterizes the difference in fits between datasets. In other words, it measures how much the accuracy of a model changes when it is presented with a different dataset. For example, the squiggly line in the following image performs radically differently on other datasets. Therefore, we say it has a high variance.
On the other hand, the straight line has relatively low variance because the mean square error is similar for different datasets.
Ridge Regression is almost identical to Linear Regression except that we introduce a small amount of bias. In return for said bias, we get a significant drop in variance. In other words, by starting out with a slightly worse fit, Ridge Regression performs better against data that doesn’t exactly follow the same pattern as the data the model was trained on.
Adding bias is often referred to as regularization. As the name implies, regularization is used to develop a model that excels at predicting targets for data that follows the general pattern rather than the specifics of the training set. Said another way, the purpose of regularization is to prevent overfitting. Overfitting tends to occur when we use a higher degree polynomial than is needed to model the data.
To get around this problem, we introduce a regularization term to the loss function. In Ridge Regression, the loss function is the linear least squares function and the regularization is given by the l2-norm.
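As a rough sketch of what is being minimized (the function name ridge_loss is just for illustration), the loss is the residual sum of squares plus alpha times the sum of the squared coefficients:
import numpy as np

def ridge_loss(X, y, w, alpha):
    # residual sum of squares plus the l2 penalty on the coefficients
    residuals = y - X.dot(w)
    return np.sum(residuals ** 2) + alpha * np.sum(w ** 2)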
Since we are trying to minimize the loss function, and the penalty on w is added to the residual sum of squares, the model is forced to find a balance between minimizing the residual sum of squares and keeping the coefficients small.
For a high degree polynomial, the coefficients of the higher order variables will tend towards 0 if the underlying data can be approximated just as well with a low degree polynomial.
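As an illustrative sketch (not part of the original example, and the exact numbers depend on the random noise), fitting Ridge on degree-5 polynomial features of data that is actually linear leaves the higher order coefficients much smaller than the linear one:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50)
y = 2 * x + rng.normal(scale=0.5, size=x.shape)  # the underlying trend is linear

# expand x into [x, x^2, ..., x^5] and fit a ridge model
X_poly = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x.reshape(-1, 1))
model = Ridge(alpha=1.0).fit(X_poly, y)
print(model.coef_)  # the coefficients on x^2 ... x^5 come out much smaller than the one on x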
If we set the hyperparameter alpha to some very large number, then, in trying to find the minimum value of the cost function, the model will shrink the coefficients towards 0. In other words, the regression line will approach a slope of 0.
Algorithm
Finding the coefficients given the added regularization term isn't all that difficult. We take the cost function, perform a bit of algebra, take the partial derivative with respect to w (the vector of coefficients), set it equal to 0, and then solve for w.
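For reference, carrying out that derivation gives the familiar closed-form solution, which is exactly what the code below implements:
w = (X^T X + alpha * I)^(-1) X^T y
where I is the identity matrix and alpha is the regularization strength.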
Python Code
Let’s see how we can go about implementing Ridge Regression from scratch using Python. To begin, we import the following libraries.
from sklearn.datasets import make_regression
from matplotlib import pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
We can use the scikit-learn library to generate sample data which is well suited for regression.
X, y, coefficients = make_regression(
    n_samples=50,
    n_features=1,
    n_informative=1,
    n_targets=1,
    noise=5,
    coef=True,
    random_state=1
)
Next, we define the hyperparameter alpha. Alpha determines the regularization strength: the larger the value of alpha, the stronger the regularization. In other words, when alpha is a very large number, the bias of the model will be high. An alpha of 0 would result in a model identical to Linear Regression; here we start with an alpha of 1.
alpha = 1
We create the identity matrix. In order for the equation we saw previously to respect the rules of matrix operations, the identity matrix has to be the same size as the matrix X transpose dot X.
n, m = X.shape
I = np.identity(m)
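As an optional sanity check, both matrices have the same shape, which for a single feature is just 1 x 1:
print(np.dot(X.T, X).shape)  # (1, 1)
print(I.shape)               # (1, 1)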
Finally, we solve for w using the equation discussed above.
w = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X) + alpha * I), X.T), y)
In comparing w to the actual coefficient(s) used in generating the data, we can see that they’re not exactly equal to one another but close.
w
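To make the comparison explicit, we can also print the true coefficient returned by make_regression next to our estimate (the exact numbers depend on the noise added to the data):
print(w)             # coefficient estimated with the closed-form solution
print(coefficients)  # true coefficient used by make_regression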
Let’s take a look at how the regression line fits the data.
plt.scatter(X, y)
plt.plot(X, w*X, c='red')
Let’s do the same thing using the scikit-learn implementation of Ridge Regression. First, we create and train an instance of the Ridge class.
rr = Ridge(alpha=1)
rr.fit(X, y)
w = rr.coef_
We get essentially the same value for w as when we solved for it using linear algebra.
w
The regression line is identical to the one above.
plt.scatter(X, y)
plt.plot(X, w*X, c='red')
Next, let’s visualize the effect of the regularization parameter alpha. To start, we set it to 10.
rr = Ridge(alpha=10)
rr.fit(X, y)
w = rr.coef_[0]
plt.scatter(X, y)
plt.plot(X, w*X, c='red')
As we can see, the regression line no longer fits the data as closely. In other words, the model has a higher bias compared to the one with an alpha of 1. For emphasis, let’s try an alpha of 100.
rr = Ridge(alpha=100)
rr.fit(X, y)
w = rr.coef_[0]
plt.scatter(X, y)
plt.plot(X, w*X, c='red')
As alpha tends towards positive infinity, the coefficients tend towards 0 and the regression line flattens out, since a flat line is what minimizes the variance across different datasets.
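As a quick sketch to see this shrinkage in action (the exact values depend on the generated data), we can refit the model for increasing values of alpha and watch the coefficient move towards 0:
for a in [1, 10, 100, 1000, 10000]:
    rr = Ridge(alpha=a)
    rr.fit(X, y)
    print(a, rr.coef_[0])  # the coefficient shrinks as alpha grows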