t-SNE Python Example
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that we can visualize it. In contrast to other dimensionality reduction algorithms like PCA, which simply maximizes variance, t-SNE creates a reduced feature space where similar samples are modeled by nearby points and dissimilar samples are modeled by distant points with high probability.
At a high level, t-SNE constructs a probability distribution for the high-dimensional samples in such a way that similar samples have a high likelihood of being picked while dissimilar points have an extremely small likelihood of being picked. Then, t-SNE defines a similar distribution for the points in the low-dimensional embedding. Finally, t-SNE minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the embedding.
Algorithm
As mentioned previously, t-SNE takes a high-dimensional dataset and reduces it to a low-dimensional map that retains a lot of the original information.
Suppose we had a dataset composed of 3 distinct classes.
We want to reduce the 2D plot into a 1D plot while maintaining clear boundaries between the clusters.
Recall that simply projecting the data onto one of its axes is a poor approach to dimensionality reduction because we lose a substantial amount of information.
Instead, we can use a dimensionality reduction technique (hint: t-SNE) to achieve what we want. The first step in the t-SNE algorithm involves measuring the distance from one point with respect to every other point. Instead of working with the distances directly, we map them to a probability distribution.
In the distribution, the points with the smallest distance with respect to the current point have a high likelihood, whereas the points far away from the current point have very low likelihoods.
Taking another look at the 2D plot, notice how the blue cluster is more spread out than the green one. If we don’t address this difference in scale, the likelihood of the green points will be greater than that of the blue ones. To account for this fact, we divide by the sum of the likelihoods.
Thus, although the absolute distances between the points differ from cluster to cluster, after normalization the points end up being considered just as similar.
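To make the idea concrete, here is a minimal sketch (not from the original article) using a handful of made-up 1-D points: we turn the squared distances from one point into Gaussian likelihoods and then divide by their sum so that they form a probability distribution.
import numpy as np
# Toy data: three tightly packed points and two far-away ones
points = np.array([1.0, 1.2, 1.4, 5.0, 5.3])
# Squared distances from the first point to every other point
sq_dists = (points[1:] - points[0]) ** 2
# Unnormalized Gaussian likelihoods (the bandwidth sigma is chosen arbitrarily here)
sigma = 1.0
likelihoods = np.exp(-sq_dists / (2 * sigma ** 2))
# Dividing by the sum turns the likelihoods into a probability distribution
p = likelihoods / likelihoods.sum()
print(p)  # the nearby points receive most of the probability mass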
Let’s try and tie these concepts back to the underlying theory. Mathematically, we write the equation for a normal distribution as follows.
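$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$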
If we drop everything before the exponent and use another point instead of the mean, all the while addressing the problem of scale discussed earlier, we get the equation from the paper.
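$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

Here σ_i is the bandwidth of the Gaussian centered on x_i; the algorithm tunes σ_i for every point so that each conditional distribution has the same perplexity.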
Next, let’s address how we come up with the reduced feature space. To begin, we create an n_samples x n_components matrix (in this case: 9x1) and fill it with random values (i.e. positions).
If we take a similar approach to what’s above (measure the distances between points and map them to a probability distribution), we get the following equation.
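$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}$$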
Notice how, like before, we took the equation for a normal distribution, dropped everything in front, used another point instead of the mean, and accounted for scale by dividing by the sum of the likelihoods for all other points. The standard deviation disappears because, in the low-dimensional map, the bandwidth is simply fixed to a constant (so that 2σ² = 1) rather than tuned per point the way it is in the original space.
If we could somehow make it so the probability distribution for the points in the reduced feature space approximates the one in the original feature space, we’d get nicely defined clusters.
To accomplish this, we make use of something called the Kullback–Leibler (KL) divergence. The KL divergence is a measure of how different one probability distribution is from a second.
The lower the value of the KL divergence, the closer two distributions are to one another. A KL divergence of 0 implies that the two distributions in question are identical.
This should hopefully bring about a flush of ideas. Recall how, in the case of linear regression, we were able to determine the best-fitting line by using gradient descent to minimize the cost function (i.e. the mean squared error). Well, in t-SNE, we use gradient descent to minimize the sum of the Kullback–Leibler divergences over all the data points.
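Concretely, the cost function being minimized is

$$C = \sum_i KL(P_i \parallel Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$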
We take the partial derivative of our cost function with respect to every point in order to give us the direction of each update.
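For reference, the gradient derived in the t-SNE paper (for the Student-t based similarities that the final algorithm uses in the low-dimensional map) is

$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$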
Python Code
Oftentimes we make use of some library without really understanding what goes on under the hood. In the following section, I will attempt (albeit imperfectly) to implement the algorithm and associated mathematical equations as Python code. To help with the process, I took bits and pieces from the source code of the TSNE class in the scikit-learn library.
To begin, we’ll import the following libraries and set some properties which will come into play when we go to plot our data.
import numpy as np
from sklearn.datasets import load_digits
from scipy.spatial.distance import pdist
from sklearn.manifold.t_sne import _joint_probabilities
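# Note: in newer scikit-learn releases this private module was renamed,
# so the import may need to be: from sklearn.manifold._t_sne import _joint_probabilities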
from scipy import linalg
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import squareform
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})
palette = sns.color_palette("bright", 10)
For this example, we’ll be working with handwritten digits. The scikit-learn library provides a method for importing them into our program.
X, y = load_digits(return_X_y=True)
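# X has shape (1797, 64): each sample is an 8x8 grayscale digit flattened into 64 pixel values,
# and y holds the corresponding digit labels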
We’re going to want to select either 2 or 3 for the number of components, given that t-SNE is strictly used for visualization and we can only see things in up to 3 dimensions. Perplexity, on the other hand, is loosely related to the number of nearest neighbors considered for each point (the paper describes it as a smooth measure of the effective number of neighbors), and a different perplexity can cause drastic changes in the end results. In our case, we set it to the default value of the scikit-learn implementation of t-SNE (30). According to the numpy documentation, the machine epsilon is the smallest representable positive number such that 1.0 + eps != 1.0. In other words, anything smaller than the machine epsilon is lost to rounding when added to 1.0 because the floating-point format lacks the necessary bits. As we’ll see, the contributors use np.maximum to check whether the values in a matrix are smaller than the machine epsilon and replace them in the event they are, most likely to avoid dividing by zero or taking the logarithm of zero later on.
MACHINE_EPSILON = np.finfo(np.double).eps
n_components = 2
perplexity = 30
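As a quick sanity check of what the machine epsilon means in practice (the values in the comments are what you would typically see for 64-bit floats):
print(MACHINE_EPSILON)                   # ~2.220446049250313e-16
print(1.0 + MACHINE_EPSILON != 1.0)      # True
print(1.0 + MACHINE_EPSILON / 2 == 1.0)  # True: anything smaller is lost to rounding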
Next, we define the fit function. We’ll make a call to the fit function when we go to transform our data.
def fit(X):
    n_samples = X.shape[0]
    # Compute euclidean distance
    distances = pairwise_distances(X, metric='euclidean', squared=True)
    # Compute joint probabilities p_ij from distances.
    P = _joint_probabilities(distances=distances, desired_perplexity=perplexity, verbose=False)
    # The embedding is initialized with iid samples from Gaussians with standard deviation 1e-4.
    X_embedded = 1e-4 * np.random.mtrand._rand.randn(n_samples, n_components).astype(np.float32)
    # degrees_of_freedom = n_components - 1 comes from
    # "Learning a Parametric Embedding by Preserving Local Structure"
    # Laurens van der Maaten, 2009.
    degrees_of_freedom = max(n_components - 1, 1)
    return _tsne(P, degrees_of_freedom, n_samples, X_embedded=X_embedded)
There’s quite a bit going on in this function so let’s break it up step by step.
1. We store the number of samples in a variable for future reference.
2. We compute the squared euclidean distance between each pair of data points; this corresponds to ||x_i - x_j||^2 in the equation above.
3. We pass the euclidean distances computed in the previous step as an argument to the _joint_probabilities function, which then calculates and returns a matrix of p_ij values (using the same equation).
4. We create the reduced feature space by randomly selecting values from a Gaussian distribution with standard deviation 1e-4.
5. We define the degrees_of_freedom. There’s a comment in the source code which tells you to check out this paper explaining their reasoning. Basically, it’s been empirically shown that we get better results when we use the number of components minus one.
6. Finally, we call the _tsne function, which is implemented as follows.
def _tsne(P, degrees_of_freedom, n_samples, X_embedded):
    params = X_embedded.ravel()
    obj_func = _kl_divergence
    params = _gradient_descent(obj_func, params, [P, degrees_of_freedom, n_samples, n_components])
    X_embedded = params.reshape(n_samples, n_components)
    return X_embedded
There isn’t really much going on in this function. First, we use np.ravel to flatten the embedding into a 1-D array.
>>> x = np.array([[1, 2, 3], [4, 5, 6]])
>>> np.ravel(x)
array([1, 2, 3, 4, 5, 6])
Then we use gradient descent to minimize the KL divergence. Once it’s done, we reshape the embedding back into a 2-D array and return it.
Next, let’s take a look at something with a little more meat to it. The following block of code is responsible for computing the error in the form of the KL divergence, as well as the gradient.
def _kl_divergence(params, P, degrees_of_freedom, n_samples, n_components):
    X_embedded = params.reshape(n_samples, n_components)
    dist = pdist(X_embedded, "sqeuclidean")
    dist /= degrees_of_freedom
    dist += 1.
    dist **= (degrees_of_freedom + 1.0) / -2.0
    Q = np.maximum(dist / (2.0 * np.sum(dist)), MACHINE_EPSILON)
    # Kullback-Leibler divergence of P and Q
    kl_divergence = 2.0 * np.dot(P, np.log(np.maximum(P, MACHINE_EPSILON) / Q))
    # Gradient: dC/dY
    grad = np.ndarray((n_samples, n_components), dtype=params.dtype)
    PQd = squareform((P - Q) * dist)
    for i in range(n_samples):
        grad[i] = np.dot(np.ravel(PQd[i], order='K'),
                         X_embedded[i] - X_embedded)
    grad = grad.ravel()
    c = 2.0 * (degrees_of_freedom + 1.0) / degrees_of_freedom
    grad *= c
    return kl_divergence, grad
Again, let’s walk through the code step by step.
1. The first part calculates the probability distribution over the points in the low-dimensional map. The authors actually use a variation of the equation above which includes the degrees of freedom; judging by the code, it is

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}$$

where α represents the number of degrees of freedom of the Student-t distribution.
2. We calculate the KL divergence (hint: whenever you see np.dot, think sum). The extra factor of 2.0 appears because P and Q are stored in condensed form, with each pair of points counted only once.
3. We calculate the gradient (partial derivatives). The dist computed above plays the role of the (1 + ||y_i - y_j||^2)^{-1} factor in the gradient from the paper. Again, they use a variation of the equation above with the degrees of freedom; judging by the code, the gradient being computed is

$$\frac{\partial C}{\partial y_i} = \frac{2(\alpha + 1)}{\alpha} \sum_j (p_{ij} - q_{ij}) \left(1 + \frac{\lVert y_i - y_j \rVert^2}{\alpha}\right)^{-\frac{\alpha + 1}{2}} (y_i - y_j)$$

where α represents the number of degrees of freedom of the Student-t distribution. With n_components = 2 we have α = 1, and this reduces to the gradient given in the paper.
The gradient descent function updates the values in the embedding by minimizing the KL divergence. We stop early when either the gradient norm falls below the threshold or when we have gone too many iterations without making any progress.
def _gradient_descent(obj_func, p0, args, it=0, n_iter=1000,
                      n_iter_check=1, n_iter_without_progress=300,
                      momentum=0.8, learning_rate=200.0, min_gain=0.01,
                      min_grad_norm=1e-7):
    p = p0.copy().ravel()
    update = np.zeros_like(p)
    gains = np.ones_like(p)
    error = np.finfo(np.float64).max
    best_error = np.finfo(np.float64).max
    best_iter = i = it
    for i in range(it, n_iter):
        error, grad = obj_func(p, *args)
        grad_norm = linalg.norm(grad)
        # Adaptive gains: grow the step size for parameters whose gradient keeps
        # the same direction as the previous update, shrink it for those that flip.
        inc = update * grad < 0.0
        dec = np.invert(inc)
        gains[inc] += 0.2
        gains[dec] *= 0.8
        np.clip(gains, min_gain, np.inf, out=gains)
        grad *= gains
        # Gradient descent step with momentum
        update = momentum * update - learning_rate * grad
        p += update
        print("[t-SNE] Iteration %d: error = %.7f,"
              " gradient norm = %.7f"
              % (i + 1, error, grad_norm))
        if error < best_error:
            best_error = error
            best_iter = i
        elif i - best_iter > n_iter_without_progress:
            break
        if grad_norm <= min_grad_norm:
            break
    return p
If you’ve made it this far, give yourself a pat on the back. We’re ready to call the fit function with our data.
X_embedded = fit(X)
As we can see, the model did a fairly decent job of separating the different digits based on their pixel values.
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y, legend='full', palette=palette)
Let’s do the same thing using the scikit-learn implementation of t-SNE.
tsne = TSNE()
X_embedded = tsne.fit_transform(X)
As we can see, the model managed to take a 64-dimensional dataset and project it onto a 2-dimensional space in such a way that similar samples cluster together.
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y, legend='full', palette=palette)