Statistics For Machine Learning: R-Squared Explained
Machine learning involves a lot of statistics. In this article, we’ll take a look at the concept of R-Squared, which is useful in feature selection.
Correlation (otherwise known as “R”) is a number between -1 and 1, where a value of +1 implies that an increase in x is associated with an increase in y, -1 implies that an increase in x is associated with a decrease in y, and 0 means that there isn’t any linear relationship between x and y. Like correlation, R² tells you how related two things are. However, we tend to use R² because it’s easier to interpret: R² ranges from 0 to 1 and represents the proportion of the variation in y explained by its relationship with x.
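As an aside, for a simple linear regression of y on x, R² works out to be the square of the Pearson correlation coefficient. The following minimal sketch illustrates that relationship; the sample values are purely made up.

import numpy as np

# Made-up sample: years of experience (x) and salary (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40000, 45000, 52000, 60000, 63000])

# Pearson correlation coefficient (R)
r = np.corrcoef(x, y)[0, 1]

# For a simple linear regression of y on x, R² equals r squared
print(r, r ** 2)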
The latter sounds rather convoluted, so let’s take a look at an example. Suppose we plotted the relationship between salary and years of experience. In the resulting graph, every data point represents an individual.
We can calculate the mean, or average, salary by taking the sum of all the salaries in the sample and dividing it by the number of individuals in the sample.
The variation of the entire dataset around the mean is equal to the sum of the squared differences between every data point and the mean. The differences are squared so that points below the mean don’t cancel out with points above the mean. (Strictly speaking, the variance divides this sum by the number of data points, but since that factor cancels in the R² formula below, we can work with the sum directly.)
var(mean) = sum((pi - mean)²)
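To make this concrete, here is a minimal numpy sketch of the mean and of var(mean) as defined above; the salary figures are made up purely for illustration.

import numpy as np

# Made-up salaries for five individuals
salaries = np.array([40000, 45000, 52000, 60000, 63000])

# Mean: the sum of the salaries divided by the number of individuals
mean = salaries.sum() / len(salaries)

# var(mean): the sum of the squared differences between each salary and the mean
var_mean = ((salaries - mean) ** 2).sum()
print(mean, var_mean)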
Now say we took the same people, but this time we decided to plot the relationship between their salary and height.
Notice how the average salary remains the same irrespective of what we take to be the independent variable. In other words, we can use other aspects of the people’s lives as x, but the mean salary stays the same.
Suppose that we used linear regression to find the best fitting line.
The value of R² can then be expressed as:
**R² = (var(mean) - var(line)) / var(mean)**
where var(mean) is the variance with respect to the mean and var(line) is the variance with respect to the line.
As we mentioned previously, the variance with respect to the mean can be calculated by taking the sum of the squared differences between the individual salaries and the mean.
Using the same logic, we can determine the variation around the orange line.
Suppose that we obtained the following values for the variance around the line and around the mean.
We can calculate R² using the formula described previously.
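For instance, with purely illustrative values of var(mean) = 100 and var(line) = 4, we would get R² = (100 - 4) / 100 = 0.96.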
The R² value implies that there is 96% less variation around the line than around the mean. In other words, the relationship between salary and years of experience accounts for 96% of the variation in salary. Said yet another way, years of experience is a good predictor of salary because when years of experience go up, so does salary, and vice versa.
Code
Let’s take a look at how we could go about using R² to evaluate a linear regression model. To start, import the following libraries.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
sns.set()
We’ll be using the following dataset. If you would like to follow along, copy its contents into a file named data.csv.
YearsExperience,Salary
1.1,39343.00
1.3,46205.00
1.5,37731.00
2.0,43525.00
2.2,39891.00
2.9,56642.00
3.0,60150.00
3.2,54445.00
3.2,64445.00
3.7,57189.00
3.9,63218.00
4.0,55794.00
4.0,56957.00
4.1,57081.00
4.5,61111.00
4.9,67938.00
5.1,66029.00
5.3,83088.00
5.9,81363.00
6.0,93940.00
6.8,91738.00
7.1,98273.00
7.9,101302.00
8.2,113812.00
8.7,109431.00
9.0,105582.00
9.5,116969.00
9.6,112635.00
10.3,122391.00
10.5,121872.00
We load the data into our program using pandas and plot it using matplotlib.
df = pd.read_csv('data.csv')
plt.scatter(df['YearsExperience'], df['Salary'])
plt.show()
Next, we train a linear regression model on our salary data.
# Reshape the feature into the 2D array that scikit-learn expects
X = np.array(df['YearsExperience']).reshape(-1, 1)
y = df['Salary']
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
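If you’d like a quick look at the fitted line itself (using the model variable from the snippet above), you can print the slope and intercept that the model learned:

# Slope and intercept of the best fitting line
print(model.coef_, model.intercept_)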
We can view the best fitting line produced by our model by running the following lines.
plt.scatter(df['YearsExperience'], df['Salary'])
plt.plot(X, y_pred, color='red')
plt.show()
Then, we compute R² using the formula discussed in the preceding section.
def r2_score_from_scratch(ys_orig, ys_line):
    # Horizontal line at the mean of the actual values
    y_mean_line = [ys_orig.mean() for y in ys_orig]
    # Variation around the regression line and around the mean
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    # Equivalent to (var(mean) - var(line)) / var(mean)
    return 1 - (squared_error_regr / squared_error_y_mean)

def squared_error(ys_orig, ys_line):
    # Sum of squared differences between the actual values and the line
    return sum((ys_line - ys_orig) * (ys_line - ys_orig))
r_squared = r2_score_from_scratch(y, y_pred)
print(r_squared)
Rather than implementing it from scratch every time, we can leverage scikit-learn’s r2_score function.
print(r2_score(y, y_pred))
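As a cross-check, scikit-learn’s LinearRegression also exposes a score method that returns R² directly, so it should agree with both values computed above (up to floating point error).

# LinearRegression.score returns the R² of the model's predictions on (X, y)
print(model.score(X, y))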