Spectral Clustering Algorithm Implemented From Scratch
Spectral clustering is a popular unsupervised machine learning algorithm which often outperforms other approaches. It is also simple to implement and can be solved efficiently with standard linear algebra methods. In spectral clustering, the affinity between points, rather than their absolute location (as in k-means), determines which cluster a point falls into. This makes it particularly useful for problems where the data forms complicated shapes.
Algorithm
The algorithm can be broken down into 4 basic steps.
- Construct a similarity graph
- Determine the Adjacency matrix W, Degree matrix D and the Laplacian matrix L
- Compute the eigenvectors of the matrix L
- Using the eigenvector corresponding to the second smallest eigenvalue as input, train a k-means model and use it to classify the data
Code
In the following section, we’ll implement spectral clustering from scratch. We’re going to need the following libraries.
import numpy as np
float_formatter = lambda x: "%.3f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering, KMeans
from sklearn.metrics import pairwise_distances
from matplotlib import pyplot as plt
import networkx as nx
import seaborn as sns
sns.set()
Typically, you will have a dataset composed of samples (rows) and their features (columns). However, the spectral clustering algorithm can only be applied to a graph of connected nodes.
Therefore, we must apply transformations to our data in order to go from a table of rows and columns to a graph. Assume that we had the following dataset. We can clearly see that the data can be segregated into three clusters.
X = np.array([
[1, 3], [2, 1], [1, 1],
[3, 2], [7, 8], [9, 8],
[9, 9], [8, 7], [13, 14],
[14, 14], [15, 16], [14, 15]
])
plt.scatter(X[:,0], X[:,1], alpha=0.7, edgecolors='b')
plt.xlabel('Weight')
plt.ylabel('Height')
First, we construct the similarity matrix, an N×N matrix where N is the number of samples. We fill the cells with the Euclidean distance between each pair of points.
Then we create the adjacency matrix by copying the similarity matrix and applying a threshold: if the distance between two points is less than a predefined limit, we set the cell to 1, and to 0 otherwise.
The adjacency matrix can then be used to build a graph: whenever a cell contains a 1, we draw an edge between the nodes corresponding to its row and column.
W = pairwise_distances(X, metric="euclidean")
vectorizer = np.vectorize(lambda x: 1 if x < 5 else 0)
W = vectorizer(W)
print(W)
For the remainder of this tutorial, we’ll be using the networkx
library to visualize graphs.
def draw_graph(G):
    # Draw the graph with a spring layout, labelled nodes and translucent edges
    pos = nx.spring_layout(G)
    nx.draw_networkx_nodes(G, pos)
    nx.draw_networkx_labels(G, pos)
    nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5)
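Before moving on, we can use draw_graph to double-check the thresholded adjacency matrix we just built from the toy dataset. This is a quick sanity check of my own, not part of the original walkthrough: since every point is within distance 5 of itself, the diagonal of W is currently 1, so we zero it out to avoid self-loops. The drawing should show three separate components, matching the three visual clusters.
# Sanity check: visualize the graph encoded by the thresholded adjacency matrix
np.fill_diagonal(W, 0)            # drop self-loops (each point is within 5 of itself)
G_data = nx.from_numpy_array(W)   # an edge wherever W[i, j] == 1
draw_graph(G_data)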
Next, we randomly generate a graph and print its adjacency matrix.
G = nx.random_graphs.erdos_renyi_graph(10, 0.5)
draw_graph(G)
W = nx.adjacency_matrix(G)
print(W.todense())
Notice how the nodes form a single connected component (i.e. every node can be reached from every other node); with 10 nodes and an edge probability of 0.5, the generated graph is almost always connected.
Once we’ve built the adjacency matrix, we construct the degree matrix. For each row of the degree matrix we fill the cell along the diagonal by summing all the elements of the corresponding row in the adjacency matrix.
Then, we compute the Laplacian matrix by subtracting the adjacency matrix from the degree matrix.
# degree matrix
D = np.diag(np.sum(np.array(W.todense()), axis=1))
print('degree matrix:')
print(D)
# laplacian matrix
L = D - W
print('laplacian matrix:')
print(L)
Once we have the Laplacian matrix, we can take advantage of one of its special properties to classify our data.
- If the graph (W) has K connected components, then L has K eigenvectors with an eigenvalue of 0.
Therefore, since in our current example we only have one component, one eigenvalue will be equal to 0.
e, v = np.linalg.eig(L)
# eigenvalues
print('eigenvalues:')
print(e)
# eigenvectors
print('eigenvectors:')
print(v)
fig = plt.figure()
ax1 = plt.subplot(121)
plt.plot(e)
ax1.title.set_text('eigenvalues')
i = np.where(e < 10e-6)[0]
ax2 = plt.subplot(122)
plt.plot(v[:, i[0]])
fig.tight_layout()
plt.show()
As we can see, of the 10 eigenvalues one is equal to 0.
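We can also check this programmatically by comparing the number of (near-)zero eigenvalues with the number of connected components networkx reports. This is a small sanity check of my own:
# Sanity check: multiplicity of the zero eigenvalue vs. number of components
print(np.sum(np.abs(e) < 1e-5), nx.number_connected_components(G))  # both should be 1 here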
Let’s take a look at another example. The following graph is composed of two connected components, so two eigenvalues will be equal to 0.
G = nx.Graph()
G.add_edges_from([
[1, 2],
[1, 3],
[1, 4],
[2, 3],
[2, 7],
[3, 4],
[4, 7],
[1, 7],
[6, 5],
[5, 8],
[6, 8],
[9, 8],
[9, 6]
])
draw_graph(G)
W = nx.adjacency_matrix(G)
print(W.todense())
# degree matrix
D = np.diag(np.sum(np.array(W.todense()), axis=1))
print('degree matrix:')
print(D)
# laplacian matrix
L = D - W
print('laplacian matrix:')
print(L)
e, v = np.linalg.eig(L)
# eigenvalues
print('eigenvalues:')
print(e)
# eigenvectors
print('eigenvectors:')
print(v)
fig = plt.figure(figsize=[18, 6])
ax1 = plt.subplot(131)
plt.plot(e)
ax1.title.set_text('eigenvalues')
i = np.where(e < 10e-6)[0]
ax2 = plt.subplot(132)
plt.plot(v[:, i[0]])
ax2.title.set_text('first eigenvector with eigenvalue of 0')
ax3 = plt.subplot(133)
plt.plot(v[:, i[1]])
ax3.title.set_text('second eigenvector with eigenvalue of 0')
If we take a closer look at the plot of each eigenvector, we can clearly see that the first five nodes (the first component) map to one value and the remaining four nodes map to another. We can use this fact to place the nodes into one of two categories.
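One way to turn that observation into labels, sketched below as my own addition, is to stack the zero-eigenvalue eigenvectors column-wise and run k-means on the rows; this stays robust even if the solver happens to return an arbitrary basis of the null space rather than clean indicator vectors.
# Sketch: cluster the nodes using the eigenvectors whose eigenvalue is 0.
# Each row of U is the 2-D spectral embedding of one node.
U = np.asarray(v[:, i[:2]]).real
node_labels = KMeans(n_clusters=2, n_init=10).fit_predict(U)
print(dict(zip(G.nodes(), node_labels)))  # nodes in the same component share a label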
Let’s take a look at a slightly more complicated example. The following graph is made up of a single component. However, it looks like we have two classes.
G = nx.Graph()
G.add_edges_from([
[1, 2],
[1, 3],
[1, 4],
[2, 3],
[3, 4],
[4, 5],
[1, 5],
[6, 7],
[7, 8],
[6, 8],
[6, 9],
[9, 6],
[7, 10],
[7, 2]
])
draw_graph(G)
W = nx.adjacency_matrix(G)
print(W.todense())
# degree matrix
D = np.diag(np.sum(np.array(W.todense()), axis=1))
print('degree matrix:')
print(D)
# laplacian matrix
L = D - W
print('laplacian matrix:')
print(L)
e, v = np.linalg.eig(L)
# eigenvalues
print('eigenvalues:')
print(e)
# eigenvectors
print('eigenvectors:')
print(v)
fig = plt.figure(figsize=[18, 6])
ax1 = plt.subplot(131)
plt.plot(e)
ax1.title.set_text('eigenvalues')
i = np.where(e < 0.5)[0]
ax2 = plt.subplot(132)
plt.plot(v[:, i[0]])
ax3 = plt.subplot(133)
plt.plot(v[:, i[1]])
ax3.title.set_text('second eigenvector with eigenvalue close to 0')
Because we have a single component, only one eigenvalue will be equal to 0. However, if we look at the eigenvector corresponding to the second smallest eigenvalue, we can still observe a distinction between the two classes. If we drew a horizontal line across the plot, we’d correctly classify the nodes.
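That horizontal line can be drawn in code by thresholding the eigenvector of the second smallest eigenvalue (the Fiedler vector), for example at zero. Below is a minimal sketch of my own, assuming the two groups land on opposite sides of zero (the usual outcome here, though not guaranteed in general); note that np.linalg.eig does not sort its eigenvalues, so we sort explicitly rather than relying on the order of i.
# Sketch: split the nodes by the sign of the Fiedler vector
order = np.argsort(e.real)                           # eig() returns eigenvalues unsorted
fiedler = np.asarray(v[:, order[1]]).real.flatten()  # eigenvector of the 2nd smallest eigenvalue
node_labels = (fiedler > 0).astype(int)
print(dict(zip(G.nodes(), node_labels)))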
Let’s take a look at another example. Again, the graph will be composed of a single component, but this time, it looks as if the nodes should be placed in one of three bins.
G = nx.Graph()
G.add_edges_from([
[1, 2],
[1, 3],
[1, 4],
[2, 3],
[3, 4],
[4, 5],
[1, 5],
[6, 7],
[7, 8],
[6, 8],
[6, 9],
[9, 6],
[7, 10],
[7, 2],
[11, 12],
[12, 13],
[7, 12],
[11, 13]
])
draw_graph(G)
W = nx.adjacency_matrix(G)
print(W.todense())
# degree matrix
D = np.diag(np.sum(np.array(W.todense()), axis=1))
print('degree matrix:')
print(D)
# laplacian matrix
L = D - W
print('laplacian matrix:')
print(L)
e, v = np.linalg.eig(L)
# eigenvalues
print('eigenvalues:')
print(e)
# eigenvectors
print('eigenvectors:')
print(v)
fig = plt.figure(figsize=[18, 6])
ax1 = plt.subplot(221)
plt.plot(e)
ax1.title.set_text('eigenvalues')
i = np.where(e < 0.5)[0]
ax2 = plt.subplot(222)
plt.plot(v[:, i[0]])
ax3 = plt.subplot(223)
plt.plot(v[:, i[1]])
ax3.title.set_text('second eigenvector with eigenvalue close to 0')
ax4 = plt.subplot(224)
plt.plot(v[:, i[2]])
ax4.title.set_text('third eigenvector with eigenvalue close to 0')
fig.tight_layout()
Since we only have one component, one eigenvalue will be equal to 0. However, we can again use the eigenvector corresponding to the second smallest eigenvalue to figure out which node should be placed in which category.
In practice, we use k-means to classify the nodes based on their corresponding values in the eigenvector.
U = np.array(v[:, i[1]]).reshape(-1, 1)  # k-means expects a 2-D array of samples
km = KMeans(init='k-means++', n_clusters=3)
km.fit(U)
print(km.labels_)
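Putting the pieces together, here is how the whole from-scratch pipeline might look when applied directly to our original dataset X from the beginning of the tutorial. This is a condensed sketch of the steps above; the threshold of 5 mirrors the one used earlier, while the use of np.linalg.eigh (which assumes a symmetric Laplacian and returns sorted eigenpairs) and clustering on the first n_clusters eigenvectors rather than a single one are my own choices.
def spectral_clustering_from_scratch(data, n_clusters, threshold=5):
    # 1. similarity -> adjacency: connect points closer than the threshold
    W = pairwise_distances(data, metric="euclidean")
    W = (W < threshold).astype(float)
    np.fill_diagonal(W, 0)                  # no self-loops
    # 2. degree and Laplacian matrices
    D = np.diag(W.sum(axis=1))
    L = D - W
    # 3. eigenvectors of L, sorted by eigenvalue (eigh assumes symmetry)
    _, vecs = np.linalg.eigh(L)
    # 4. k-means on the first n_clusters eigenvectors
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs[:, :n_clusters])

print(spectral_clustering_from_scratch(X, n_clusters=3))
Points in the same visual cluster should receive the same label (the label numbering itself is arbitrary).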
Next, let’s compare k-means to spectral clustering using scikit-learn’s implementation. Suppose our data took the following shape when graphed.
X, clusters = make_circles(n_samples=1000, noise=.05, factor=.5, random_state=0)
plt.scatter(X[:,0], X[:,1])
When using k-means, we get the following.
km = KMeans(init='k-means++', n_clusters=2)
km_clustering = km.fit(X)
plt.scatter(X[:,0], X[:,1], c=km_clustering.labels_, cmap='rainbow', alpha=0.7, edgecolors='b')
In contrast, when using spectral clustering we place each circle in its own cluster.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=0)
sc_clustering = sc.fit(X)
plt.scatter(X[:,0], X[:,1], c=sc_clustering.labels_, cmap='rainbow', alpha=0.7, edgecolors='b')
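The same from-scratch recipe can also separate the circles if we swap the fixed distance threshold for an affinity better suited to the data. Below is a rough sketch of my own using a k-nearest-neighbour connectivity graph (the choice of n_neighbors=10 is arbitrary); with the rings this well separated, it should recover the same two clusters as scikit-learn’s SpectralClustering.
# Rough sketch: from-scratch spectral clustering on the circles,
# using a k-nearest-neighbour graph instead of a distance threshold.
from sklearn.neighbors import kneighbors_graph

A = kneighbors_graph(X, n_neighbors=10, include_self=False)
A = ((A + A.T) > 0).astype(float).toarray()   # symmetric 0/1 adjacency matrix
D = np.diag(A.sum(axis=1))
L = D - A
_, vecs = np.linalg.eigh(L)                   # sorted eigenpairs of the symmetric Laplacian
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vecs[:, :2])
plt.scatter(X[:,0], X[:,1], c=labels, cmap='rainbow', alpha=0.7, edgecolors='b')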
Final Thoughts
In contrast to k-means, which clusters points by their absolute positions, spectral clustering groups points by how strongly they are connected to one another, which is what allows it to recover clusters with complicated shapes.