Latent Dirichlet Allocation
Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA will attempt to group words and documents into a predefined number of clusters (i.e. topics). These topics can then be used to organize and search through documents.
Caveats
- LDA works off the premise that documents with the same topic will have a lot of words in common.
- LDA is a bag of words model meaning that it only considers individual tokens and not their relationships in a sentence.
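To make the bag-of-words caveat concrete, here is a minimal sketch (using two made-up sentences) showing that word order is discarded:

from collections import Counter

# two sentences containing the same words in a different order
sentence_a = "the dog chased the cat".split()
sentence_b = "the cat chased the dog".split()

# a bag-of-words model only keeps per-token counts,
# so both sentences map to the same representation
print(Counter(sentence_a) == Counter(sentence_b))  # True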
Math
The most popular method for estimating the LDA model is Gibbs sampling. The probability that word i in document d is assigned to topic j, given all the other topic assignments, can be expressed as follows:
P(z_i = j | z_-i) ∝ (C_wj^WT + η) / (Σ_w C_wj^WT + W·η) × (C_dj^DT + α) / (Σ_t C_dt^DT + T·α)
where w and d denote the current word and document, the counts exclude the current assignment of word i, and:
- W is the size of the vocabulary (i.e. the number of unique words).
- T is the number of topics.
- The matrix C_dj^DT contains the number of times topic j is assigned to some word in document d.
- The matrix C_wj^WT contains the number of times word w is assigned to topic j.
- α is a hyperparameter. A low alpha value places more weight on having each document composed of only a few dominant topics whereas a high value will return many more dominant topics.
- η (i.e. eta) is a hyperparameter. A low value for η places more weight on having each topic composed of only a few dominant words.
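To make the formula more concrete, here is a minimal sketch that evaluates it on made-up toy counts (two topics, a three-word vocabulary, and arbitrary values of α and η; the counts are assumed to already exclude the current assignment):

import numpy as np

# toy counts, chosen arbitrarily for illustration
CWT = np.array([[3, 1],   # word 0
                [0, 2],   # word 1
                [1, 1]])  # word 2
CDT = np.array([[4, 2]])  # document 0

alpha, eta = 1.0, 0.001
W, T = CWT.shape

word, doc = 0, 0
probs = []
for j in range(T):
    # probability of the word given topic j
    pwt = (CWT[word, j] + eta) / (CWT[:, j].sum() + W * eta)
    # probability of topic j given the document
    pdt = (CDT[doc, j] + alpha) / (CDT[doc, :].sum() + T * alpha)
    probs.append(pwt * pdt)

# normalize so the topic probabilities sum to 1
probs = np.array(probs) / sum(probs)
print(probs.round(2))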
Algorithm
Let’s walk through one iteration of the algorithm.
Suppose we had the following documents:
- The president declared war against Russia
- The student slept in instead of studying for his exam
We start off by splitting the sentences into tokens and then removing any English stopwords.
We assign a unique id to every word.
We randomly assign each word in each document to one of the K topics.
We calculate the matrix C_wj^WT.
We compute the matrix C_dj^DT.
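Below is a minimal sketch of these setup steps for the two example documents (the stopword list, the random seed, and therefore the resulting counts are purely illustrative and will not necessarily match the numbers used in the walkthrough below):

import numpy as np

K = 2  # number of topics
stopwords = {"the", "of", "in", "for", "his", "against", "instead"}  # small illustrative set

documents = [
    "The president declared war against Russia",
    "The student slept in instead of studying for his exam",
]

# split into tokens and remove stopwords
tokenized = [[w.lower() for w in doc.split() if w.lower() not in stopwords]
             for doc in documents]

# assign a unique id to every word
vocab = sorted(set(w for doc in tokenized for w in doc))
word2id = {w: i for i, w in enumerate(vocab)}

# randomly assign each word in each document to one of the K topics
rng = np.random.default_rng(0)
assignments = [[(w, rng.integers(K)) for w in doc] for doc in tokenized]

# word-topic counts (C_wj^WT) and document-topic counts (C_dj^DT)
CWT = np.zeros((len(vocab), K), dtype=int)
CDT = np.zeros((len(documents), K), dtype=int)
for d, doc in enumerate(assignments):
    for w, topic in doc:
        CWT[word2id[w], topic] += 1
        CDT[d, topic] += 1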
Let’s assume that we set the hyperparameters α and η to 1 and 0.001, respectively. Now, we have all we need in order to calculate the probability.
We start off by looking at the number of times the word “president” was assigned to topic 0.
In the denominator, we sum the counts of all the words assigned to topic 0 (5 in this case).
Using the C_dj^DT matrix, we look at the number of times topic 0 was assigned to a word in document d = 0 (3 in this case).
In document d = 0, topic 0 and topic 1 are assigned 3 and 1 times, respectively. We add 3 and 1 together to get 4.
We plug these values into the formula we saw previously.
As we can see, the probability of obtaining topic 0 given the word w = “president” and document d = 0 is roughly 13%.
We repeat the process for all the remaining words and documents. Then, we repeat the entire procedure n times, where n is the number of iterations (the probabilities should converge).
Python Example
To begin, we install the Natural Language Toolkit library.
pip install --user nltk
Then, we import the following libraries.
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import itertools
from collections import Counter
We download the Brown Corpus (named after Brown University), the first million-word electronic corpus of English, as well as a set of English stopwords.
For those who don’t already know, a stopword is a very common word (e.g. “the”, “of”) that carries little semantic content on its own and can be removed from the corpus.
nltk.download('brown')
nltk.download('stopwords')
stopwords = stopwords.words('english')
We set the hyperparameters.
n_iters = 200
n_documents = 15
n_topics = 5
eta = 1.5
alpha = 1.5
The sentences in the Brown Corpus are already tokenized into words. We manually remove the stopwords from the sentences as follows:
np.random.seed(42)
brown = nltk.corpus.brown
documents = np.random.choice(brown.fileids(), n_documents, replace=False)
processed_documents = {}
for document in documents:
    processed_sentences = []
    sentences = brown.sents(document)
    for sentence in sentences:
        processed_sentences.append([word.lower() for word in sentence if word.isalnum() and word not in stopwords])
    processed_documents[document] = processed_sentences
We will be working with the following documents:
processed_documents.keys()
dict_keys(['cj68', 'cc03', 'ck01', 'cf15', 'cd17', 'ck21', 'ck04', 'ce20', 'cb25', 'cn18', 'ca10', 'cg06', 'cl04', 'cc14', 'cj78'])
If we take a look at one of the documents, we see the following:
processed_documents["cj68"]
[['recent',
'criticism',
'great',
'expectations',
'tended',
...
]]
We store a list of all the unique tokens.
processed_sentences = list(itertools.chain(*list(processed_documents.values())))
unique_tokens = list(set(itertools.chain(*processed_sentences)))
n_tokens = len(unique_tokens)
As we can see, there are roughly six thousand distinct words in the corpus.
n_tokens
6078
The algorithm requires that the tokens be mapped to integers.
token2int = dict(zip(unique_tokens, range(n_tokens)))
As we can see, every word is associated with a given number.
token2int
{'reiterating': 0,
'his': 1,
'unhappiness': 2,
'topic': 3,
'courage': 4,
'felicity': 5,
...
}
The same goes for the documents.
document2int = dict(zip(documents, range(n_documents)))
As we can see, every document is associated with a given number.
document2int
{'cj68': 0, 'cc03': 1, 'ck01': 2, 'cf15': 3, 'cd17': 4, 'ck21': 5, 'ck04': 6, 'ce20': 7, 'cb25': 8, 'cn18': 9, 'ca10': 10, 'cg06': 11, 'cl04': 12, 'cc14': 13, 'cj78': 14}
We define a function to randomly assign a topic to a given word in a document.
def get_topic_assignment(processed_documents, n_topics):
    np.random.seed(42)
    topic_assignment = {}
    for document, sentences in processed_documents.items():
        words = list(itertools.chain(*sentences))
        topics_assigned = np.random.choice(range(n_topics), len(words), replace=True)
        topic_assignment[document] = list(zip(words, topics_assigned))
    return topic_assignment
We call the function and save the result to a variable.
initial_topic_assignment = get_topic_assignment(processed_documents, n_topics)
As we can see, every word in the document was assigned one of the five topics.
initial_topic_assignment
{'cj68': [('recent', 3),
('criticism', 4),
('great', 2),
('expectations', 4),
('tended', 4),
...
}
We define a function that computes the matrix containing the number of times word w is assigned to topic j.
def get_CWT_matrix(topic_assignment):
    word_topics = topic_assignment.values()
    count_word_topics_dict = Counter(list(itertools.chain(*word_topics)))
    CWT_words = np.array(list(count_word_topics_dict.keys()))[:, 0]
    CWT_words = list(map(lambda x: token2int[x], CWT_words))
    CWT_topics = list(map(int, np.array(list(count_word_topics_dict.keys()))[:, 1]))
    counts = list(count_word_topics_dict.values())
    CWT = np.zeros((n_tokens, n_topics), dtype='int')
    CWT[CWT_words, CWT_topics] = counts
    return CWT
We compute the matrix given our initial random topic assignments.
CWT = get_CWT_matrix(initial_topic_assignment)
As we can see, there are 5 columns (one for each topic) and 6078 rows (one for each word).
CWT
array([[0, 1, 0, 0, 0],
[7, 8, 5, 4, 6],
[0, 1, 0, 0, 0],
...,
[0, 0, 0, 1, 1],
[0, 0, 0, 1, 0],
[0, 1, 1, 0, 0]])
We define a function to calculate the matrix that contains the number of times topic j is assigned to some word token in document d.
def get_CDT_matrix(topic_assignment):
    CDT = np.zeros((n_documents, n_topics), dtype='int')
    for document, word_topics in topic_assignment.items():
        count_topic_dict = Counter(list(map(int, np.array(word_topics)[:, 1])))
        topic_indices = list(count_topic_dict.keys())
        topic_counts = list(count_topic_dict.values())
        CDT[document2int[document], topic_indices] = topic_counts
    return CDT
Just like we did before, we compute the matrix given our initial random topic assignments.
CDT = get_CDT_matrix(initial_topic_assignment)
As we can see, there are 5 columns (one for each topic) and 15 rows (one for each document).
CDT
array([[218, 194, 193, 213, 211],
[229, 245, 215, 224, 231],
[220, 197, 223, 239, 233],
[236, 213, 240, 214, 230],
[208, 219, 223, 245, 225],
[239, 269, 210, 226, 248],
[201, 251, 216, 203, 216],
[248, 230, 206, 223, 206],
[248, 197, 215, 187, 219],
[179, 203, 208, 210, 222],
[213, 226, 214, 229, 256],
[247, 233, 216, 264, 221],
[200, 240, 208, 218, 219],
[260, 259, 229, 251, 255],
[253, 278, 245, 246, 256]])
We define a function to compute the probability using the formula we saw earlier.
def get_sampled_topic(word_index, document_index, CWT, CDT):
    distribution = []
    for topic_index in range(n_topics):
        # probability of the word given the topic; the -1 discounts the current
        # assignment (applied uniformly to all topics in this simplified version)
        pwt = (CWT[word_index, topic_index] - 1 + eta) / (np.sum(CWT[:, topic_index]) + n_tokens * eta)
        # probability of the topic given the document
        pdt = (CDT[document_index, topic_index] - 1 + alpha) / (np.sum(CDT[document_index, :]) + n_topics * alpha)
        p = pwt * pdt
        distribution.append(p)
    # normalize probabilities such that they sum to 1
    distribution = distribution / sum(distribution)
    # draw one topic from the resulting distribution
    outcome = np.random.multinomial(1, distribution)
    sampled_topic_index = np.where(outcome == 1)[0][0]
    return sampled_topic_index
We select the first pair as arguments.
dict_pairs = initial_topic_assignment.items()
pairs_iterator = iter(dict_pairs)
document, word_topics = next(pairs_iterator)
document
'cj68'
word_topics
[('recent', 3),
('criticism', 4),
('great', 2),
('expectations', 4),
('tended', 4),
...
]
As we can see, the function returns topic 4.
document_index = document2int[document]
word, assigned_topic = word_topics[0]
word_index = token2int[word]
get_sampled_topic(word_index, document_index, CWT, CDT)
4
We define the function that performs Gibbs sampling.
def get_gibbs_sample(input_topic_assignment):
    CWT = get_CWT_matrix(input_topic_assignment)
    CDT = get_CDT_matrix(input_topic_assignment)
    new_topic_assignment = {}
    for document, word_topics in input_topic_assignment.items():
        document_index = document2int[document]
        new_topic_assignment[document] = []
        for word, _ in word_topics:
            word_index = token2int[word]
            sampled_topic = get_sampled_topic(word_index, document_index, CWT, CDT)
            new_topic_assignment[document].append((word, sampled_topic))
    return new_topic_assignment
get_gibbs_sample(initial_topic_assignment)
{'cj68': [('recent', 1),
('criticism', 2),
('great', 3),
('expectations', 4),
('tended', 0),
...
}
Finally, we define a function to execute the LDA algorithm.
def run_lda(processed_documents, n_topics, n_iters):
    topic_assignment = get_topic_assignment(processed_documents, n_topics)
    for n_iter in range(n_iters):
        new_topic_assignment = get_gibbs_sample(topic_assignment)
        topic_assignment = new_topic_assignment
        if (n_iter + 1) % 10 == 0:
            print(f'Iteration {n_iter + 1}')
    return topic_assignment
We pass the preprocessed documents, the number of topics and the number of iterations and store the result in a variable.
topic_assignment = run_lda(processed_documents, n_topics, n_iters)
As we can see, in the final result, the word “recent” in document “cj68” was assigned to topic 0.
topic_assignment
{'cj68': [('recent', 0),
('criticism', 0),
('great', 0),
('expectations', 0),
('tended', 0),
('emphasize', 0),
...
}
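To inspect what the learned topics look like, we can, for instance, count how often each word ended up in each topic in the final assignment and print the most frequent words per topic (a minimal sketch reusing the topic_assignment variable from above):

# count how many times each word was assigned to each topic
topic_word_counts = {topic: Counter() for topic in range(n_topics)}
for word_topics in topic_assignment.values():
    for word, topic in word_topics:
        topic_word_counts[topic][word] += 1

# print the 10 most frequent words per topic
for topic in range(n_topics):
    top_words = [word for word, _ in topic_word_counts[topic].most_common(10)]
    print(f'Topic {topic}: {top_words}')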
Obviously, we wouldn’t want to implement the algorithm from scratch every time. Fortunately for us, the scikit-learn library provides an optimized version of the LDA algorithm. We can import the required libraries as follows:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
This time around, we will be using a dataset of NPR (National Public Radio) articles.
df = pd.read_csv("npr.csv")
As we can see, the dataset has a single column named “Article” of type string.
df.head()
We will initialize an instance of the CountVectorizer class.
We indicate that we would like to exclude English stopwords, words that appear in fewer than 2 documents, and words that appear in more than 90% of the documents in the corpus, since such words would not help distinguish the documents.
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words="english")
We represent the corpus as a Document Term Matrix, or DTM for short. In the process, CountVectorizer assigns a unique id to every word.
dtm = cv.fit_transform(df["Article"])
We initialize an instance of the LatentDirichletAllocation class. It’s important to note that the class has two parameters doc_topic_prior and topic_word_prior which represent alpha and eta, respectively.
lda = LatentDirichletAllocation(n_components=7, random_state=42)
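If we wanted to set the priors explicitly rather than relying on the defaults, we could pass them when constructing the model (the values below are arbitrary and only meant to illustrate the parameters):

# explicitly setting alpha (doc_topic_prior) and eta (topic_word_prior);
# the values here are arbitrary
lda_custom = LatentDirichletAllocation(
    n_components=7,
    doc_topic_prior=0.5,   # alpha
    topic_word_prior=0.1,  # eta
    random_state=42,
)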
We train the model.
lda.fit(dtm)
We can obtain the number of tokens in our corpus as follows:
len(cv.get_feature_names())
54777
The components_ attribute is an array with k rows, where k is the number of topics. Every row is another array that contains, for each word in the vocabulary, an (unnormalized) weight indicating how strongly the word is associated with that topic.
len(lda.components_[0])
54777
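If we want actual probabilities, we can normalize each row so that it sums to 1 (a small sketch; the column order matches cv.get_feature_names()):

# normalize each topic's word weights into a probability distribution
topic_word_probs = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# each row now sums to 1
print(topic_word_probs.sum(axis=1).round(2))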
We print the top 15 words in each of the topics.
n = 15
for index, topic in enumerate(lda.components_):
    print(f'The top {n} words for topic #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-n:]])
As we can see, topic 1 appears to be related to war, topic 4 appears to be about an election, and topic 6 appears to be about education.
To determine which topics each document belongs to, we can call the transform method and provide the Document Term Matrix.
topic_results = lda.transform(dtm)
As we can see, the result is a 2-dimensional array in which the number of rows corresponds to the number of documents and the number of columns corresponds to the number of topics.
topic_results.shape
(11992, 7)
If we look at the value of the first element, we can see that it’s an array where every element is the probability that the document belongs to the topic at that index.
topic_results[0].round(2)
array([0.02, 0.68, 0. , 0. , 0.3 , 0. , 0. ])
We will select the index with the highest probability and assign that to a column named “Topic”.
df["Topic"] = topic_results.argmax(axis=1)
As we can see, the DataFrame now has a Topic column; the first document, for instance, is assigned to topic 1.
df.head()
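As an optional final step, we could map the numeric topic indices to human-readable labels based on our inspection of the top words; the labels below are only guesses for illustration and cover just the topics we discussed above:

# hypothetical labels based on the top words we inspected earlier;
# topics not listed here fall back to a generic placeholder
topic_labels = {1: "war", 4: "election", 6: "education"}
df["Topic Label"] = df["Topic"].map(topic_labels).fillna("other")
df.head()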