Memory-Based Collaborative Filtering — User-Based
In the early 90s, recommendation systems, particularly automated collaborative filtering, started seeing more widespread use. Fast forward to today, and recommendation systems are at the core of the value proposition offered by numerous companies such as Amazon, Netflix, and Spotify.
There are four types of recommendation systems:
- Content-based Filtering — Uses item metadata as the basis for the recommendation. For example, if a user liked The Matrix and The Dark Knight, both of which are categorized as action movies, the recommendation system would suggest another action movie like Star Wars.
- Collaborative Filtering — Uses similarity between users or items as the basis for the recommendation.
- Knowledge-based Filtering — Uses information supplied explicitly by the user as the basis for the recommendation. For example, if you’re searching for a house online, you would specify how many bathrooms and bedrooms you require. The recommendation system would return a list of properties satisfying those constraints.
- Hybrid — A combination of the other types.
In this article, we will be focusing on collaborative filtering. There are two types of memory-based collaborative filtering:
- User-based — User-based collaborative filtering makes recommendations based on the preferences of users who are similar to the active user. For example, if another user rates movies similarly to the user in question, we can assume the two have similar interests. Thus, if that user has seen and liked a movie the active user hasn’t seen, we would recommend it.
- Item-based — Item-based collaborative filtering suggests items similar to other items the active user liked. For example, if a user liked a Lord of the Rings book, then we would recommend another Lord of the Rings book.
Naively, if we were contemplating buying something, we might use social proof. That is, we would ask the people we know if they would recommend it. Some people would say no, others would say yes. We would go ahead with the purchase if the average rating was positive.
If we wanted to express the latter mathematically, we’d use the following formula:
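$$\hat{r}_{u,i} = \frac{\sum_{v \in U} r_{v,i}}{|U|}$$

where $\hat{r}_{u,i}$ denotes the predicted rating of item $i$ for user $u$, $r_{v,i}$ is the rating user $v$ gave to item $i$, and $U$ is the set of users in the dataset.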
In plain language, the predicted rating of item i for user u is equal to the sum of the ratings for that item divided by the total number of users in the dataset.
The preceding equation doesn’t take into account how similar one user is to another. For instance, suppose that users u and v_{1} have similar tastes while u and v_{2} have very different tastes. If user v_{2} rates a movie very highly (e.g. 5 stars), it doesn’t mean that user u will like it. On the other hand, if v_{1} rates a movie very highly, then it’s likely that user u will do so as well.
In order to account for similarity, we multiply each rating by a weight corresponding to how similar the other user v’s tastes are to those of user u.
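$$\hat{r}_{u,i} = \frac{\sum_{v \in U} w_{u,v} \, r_{v,i}}{\sum_{v \in U} |w_{u,v}|}$$

where $w_{u,v}$ is the weight capturing the similarity between users $u$ and $v$.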
It’s important to note that we divide by the sum of the absolute values of the weights, as opposed to the number of users in the dataset.
There are multiple ways of measuring the similarity between users’ preferences. The cosine similarity can be calculated as follows:
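$$w_{u,v} = \frac{\sum_{i \in I_{uv}} r_{u,i} \, r_{v,i}}{\sqrt{\sum_{i \in I_{uv}} r_{u,i}^2} \, \sqrt{\sum_{i \in I_{uv}} r_{v,i}^2}}$$

where $I_{uv}$ is the set of items rated by both users $u$ and $v$.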
The current equation still doesn’t consider the fact that people tend to rate on different scales. For example, an optimistic person might give a movie a 3-star rating even if they didn’t really like it, whereas a more hard-nosed person with similar tastes would give the movie a 1-star rating.
It’s for this reason that we need to normalize the ratings. We do this by subtracting the average rating user v gave across all items from v’s rating for item i. We then add the average rating user u gave across all items to bring the result back to the 5-star rating scale.
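$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in U} w_{u,v} \, (r_{v,i} - \bar{r}_v)}{\sum_{v \in U} |w_{u,v}|}$$

where $\bar{r}_u$ and $\bar{r}_v$ are the average ratings given by users $u$ and $v$ across all the items they rated.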
Neighbourhoods
Given m = |U| users and n = |I| items, computing all pairwise correlations takes O(m²n) time. This becomes a problem when the dataset is large. It’s for this reason that we introduce the notion of a neighbourhood. A neighbourhood is a subset of the users in the dataset used in predicting the rating.
The following methods can be used to select what users should be contained within a neighbourhood:
- Select users with a similarity score above a certain threshold
- Select at random
- Select the top-N users ranked by similarity score (see the sketch below)
- Select users within the same cluster
In theory, the more data the better. In practice, noise from dissimilar neighbours can decrease the accuracy of the predictions. In industry, a neighbourhood of between 25 and 100 users is typically used.
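As a rough sketch of the top-N approach, here's what neighbour selection might look like. The similarity matrix and the helper top_n_neighbours are made up purely for illustration; in practice, the matrix would be the user-user similarity matrix we compute in the Python section below.
import pandas as pd

# Toy user-user similarity matrix (values are illustrative)
users = [1, 2, 3, 4]
similarities = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.4],
     [0.9, 1.0, 0.2, 0.3],
     [0.1, 0.2, 1.0, 0.8],
     [0.4, 0.3, 0.8, 1.0]],
    index=users, columns=users)

def top_n_neighbours(u, n):
    # Exclude the user themselves, then keep the n most similar users
    return similarities[u].drop(labels=u).nlargest(n).index

print(list(top_n_neighbours(1, 2)))  # [2, 4]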
Let the neighbourhood V be a collection of users similar to the user whose rating we are trying to predict. The formula for calculating the rating becomes:
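$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} w_{u,v} \, (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} |w_{u,v}|}$$

That is, the same mean-centred weighted average as before, with the sum restricted to the neighbourhood $V$.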
Python
We begin by importing the following libraries:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
We download the MovieLens 100K dataset and load the training and test sets into memory.
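A minimal sketch of the download step might look as follows; the URL is the one GroupLens publishes for MovieLens 100K at the time of writing, so adjust it if the dataset has moved.
import urllib.request
import zipfile

# Download and unzip the MovieLens 100K dataset (creates the ml-100k/ directory)
urllib.request.urlretrieve("https://files.grouplens.org/datasets/movielens/ml-100k.zip", "ml-100k.zip")
with zipfile.ZipFile("ml-100k.zip") as archive:
    archive.extractall()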
train_df = pd.read_csv("ml-100k/u1.base", sep="\t", header=None, names=["user_id", "item_id", "rating", "timestamp"])
test_df = pd.read_csv("ml-100k/u1.test", sep="\t", header=None, names=["user_id", "item_id", "rating", "timestamp"])
Every row in our DataFrame contains a user’s rating for a given item.
train_df.head()
We construct the ratings matrix.
ratings_matrix = pd.pivot_table(train_df, values='rating', index='user_id', columns='item_id')
We normalize the ratings matrix by subtracting each user’s mean rating from every rating that user gave.
normalized_ratings_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis=0)
In this example, we’re using the Pearson correlation. However, if we were using cosine similarity, we’d need to impute the missing data first. The most common technique is to replace each missing value with either the user’s average rating across all items or the item’s average rating across all users. You might also see missing values replaced with 0, which is fine so long as the data has been normalized first.
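For completeness, here is a minimal sketch of what those imputation options could look like, assuming the ratings_matrix and normalized_ratings_matrix defined above (the variable names imputed_matrix and imputed_normalized_matrix are our own).
# Option 1: replace each user's missing ratings with that user's mean rating
imputed_matrix = ratings_matrix.apply(lambda row: row.fillna(row.mean()), axis=1)

# Option 2: fill the normalized matrix with 0, which corresponds to the
# user's mean rating once the data has been mean-centred
imputed_normalized_matrix = normalized_ratings_matrix.fillna(0)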
similarity_matrix = ratings_matrix.T.corr()
We define a function to calculate the score (rating) according to the formula we saw in the previous section. It’s worth noting that not all the items in the test set are found in the training set. In this case, we return 2.5 since it’s a neutral value (we could also use the average rating of the dataset).
def calculate_score(u, i):
    # Check whether the item is in the training dataset
    if i not in ratings_matrix.columns:
        return 2.5
    similarity_scores = similarity_matrix[u].drop(labels=u)
    normalized_ratings = normalized_ratings_matrix[i].drop(index=u)
    # Drop users that haven't rated the item
    similarity_scores.drop(index=normalized_ratings[normalized_ratings.isnull()].index, inplace=True)
    normalized_ratings.dropna(inplace=True)
    # If none of the other users have rated items in common with
    # the user in question, return the baseline value
    if similarity_scores.isna().all():
        return 2.5
    total_score = 0
    total_weight = 0
    for v in normalized_ratings.index:
        # It's possible that another user rated the item but that
        # they have not rated any items in common with the user in question
        if not pd.isna(similarity_scores[v]):
            total_score += normalized_ratings[v] * similarity_scores[v]
            total_weight += abs(similarity_scores[v])
    avg_user_rating = ratings_matrix.T.mean()[u]
    return avg_user_rating + total_score / total_weight
We iterate over all the user/item pairs in the test set and calculate the prediction using the function defined previously.
test_ratings = np.array(test_df["rating"])
user_item_pairs = zip(test_df["user_id"], test_df["item_id"])
pred_ratings = np.array([calculate_score(user_id, item_id) for (user_id, item_id) in user_item_pairs])
print(np.sqrt(mean_squared_error(test_ratings, pred_ratings)))
0.9725765287253909
As we can see, our recommendation system beats the baseline of always predicting the mean rating of the training set.
baseline_rating = train_df["rating"].mean()
baseline_ratings = np.array([baseline_rating for _ in range(test_df.shape[0])])
print(np.sqrt(mean_squared_error(test_ratings, baseline_ratings)))
1.1536759477860323
Conclusion
User-based collaborative filtering is an effective way to come up with recommendations. That being said, it suffers from sparsity issues. In other words, there tend to be a large number of items and a relatively small number of ratings, which results in a lot of wasted memory space. Not to mention, when we start dealing with millions of users, computing all pairwise correlations becomes very expensive. To get around this issue, we select a subset of users, called a neighbourhood, and use that when computing the rating.