Word Embeddings Python Example - Sentiment Analysis
One of the primary applications of machine learning is sentiment analysis: judging the tone of a document. The output of a sentiment analysis model is typically a score between zero and one, where one means the tone is very positive and zero means it is very negative. Sentiment analysis is frequently used in trading; for example, it can be applied to traders’ tweets to estimate the overall market mood.
As one might expect, sentiment analysis is a Natural Language Processing (NLP) problem. NLP is a field of artificial intelligence concerned with understanding and processing language. The goal of this article is to construct a model that derives the semantic meaning of words from the documents in a corpus. At a high level, one might imagine classifying documents that contain the word good as positive and those that contain the word bad as negative. Unfortunately, the problem isn’t that simple, since words can be preceded by not, as in not good.
Code
That’s enough ranting; let’s dive into some code.
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('dark_background')
from keras.datasets import imdb
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, GlobalAveragePooling1D, Dense
The argument num_words=10000
ensures we only keep the top 10,000 most frequently occurring words in the training set. The rare words are discarded to keep the size of the data manageable.
num_words = 10000
We’ll use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. The reviews are split into 25,000 for training and 25,000 for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.
# Temporary workaround: newer versions of NumPy default to allow_pickle=False,
# which breaks imdb.load_data() in older Keras releases.
old = np.load
np.load = lambda *a, **k: old(*a, **k, allow_pickle=True)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)
np.load = old
del old
print("Training entries: {}, labels: {}".format(len(X_train), len(y_train)))
When we use keras.datasets.imdb
to import the dataset into our program, it comes already preprocessed. In other words, every example is a list of integers, where each integer represents a specific word in a dictionary, and each label is an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review. Let’s take a peek at the first review.
print(X_train[0])
To get a better grasp of what we’re working with, we’ll create a helper function that maps the integers in each training example back to the words in the index.
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
Now, we can use the decode_review
function to display the text of the first review.
decode_review(X_train[0])
Given that every word in a document will be interpreted as a feature, we must ensure the movie reviews are the same length before attempting to feed them into a neural network.
len(X_train[0]), len(X_train[1])
We will use the pad_sequences
function to standardize the lengths.
X_train = pad_sequences(
    X_train,
    value=word_index["<PAD>"],
    padding='post',
    maxlen=256
)
X_test = pad_sequences(
    X_test,
    value=word_index["<PAD>"],
    padding='post',
    maxlen=256
)
Let’s look at the length of the first couple of samples. Both should now be exactly 256: shorter reviews were padded with the <PAD> token and longer ones were truncated.
len(X_train[0]), len(X_train[1])
As the attuned reader might have already guessed, words are categorical features, so we cannot feed them directly into the neural network. Although they’re already encoded as integers, if we left them as they are, the model would interpret integers with higher values as having higher priority than ones with lower values. Normally, you’d get around this problem by converting the arrays into vectors of 0s and 1s indicating word occurrence, similar to one-hot encoding, but for words this is memory intensive. Given a vocabulary of 10,000 words, we’d need to store a num_words * num_reviews matrix in RAM.
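To make that memory cost concrete, here is a rough sketch of what the multi-hot encoding would look like (for illustration only; we won’t actually use it).
# Illustrative multi-hot encoding: one row per review, one column per word.
# With 25,000 reviews and a 10,000-word vocabulary this is a 25,000 x 10,000
# matrix of float32 values, roughly 1 GB of RAM, almost all of it zeros.
def multi_hot(sequences, dimension=num_words):
    results = np.zeros((len(sequences), dimension), dtype=np.float32)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set the columns of the words that occur
    return results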
Embeddings
This is where embeddings come into play. Embeddings solve the core problem of sparse input data (very large vectors with relatively few non-zero values) by mapping our high-dimensional data into a lower-dimensional space (similar to PCA).
For example, suppose we had a corpus composed of the following two sentences.
- Hope to see you soon
- Nice to see you again
Just as with the IMDB dataset, we can assign each word a unique integer.
[0, 1, 2, 3, 4]
[5, 1, 2, 3, 6]
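As a quick sketch of how such a mapping could be built by hand (the integers are simply the order in which new words appear), consider the following.
# Assign each new word the next available integer and encode both sentences.
sentences = ["Hope to see you soon", "Nice to see you again"]
toy_index = {}
encoded = []
for sentence in sentences:
    tokens = []
    for word in sentence.lower().split():
        if word not in toy_index:
            toy_index[word] = len(toy_index)
        tokens.append(toy_index[word])
    encoded.append(tokens)
print(encoded)  # [[0, 1, 2, 3, 4], [5, 1, 2, 3, 6]]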
Next, we can define an embedding layer.
Embedding(input_dim=7, output_dim=2, input_length=5)
- input_dim: The size of the vocabulary (i.e. number of distinct words) in the training set
- output_dim: The size of the embedding vectors
- input_length: The number of features in a sample (i.e. number of words in each document). For example, if all of our documents are comprised of 1000 words, the input length would be 1000.
Embeddings work like a lookup table. Every token (i.e. word) acts as an index under which a vector is stored. When a token is given to the embedding layer, it looks up the vector associated with that token and passes it on through the network. As the network trains, the embeddings are optimized as well.
+------------+------------+
| index | Embedding |
+------------+------------+
| 0 | [1.2, 3.1] |
| 1 | [0.1, 4.2] |
| 2 | [1.0, 3.1] |
| 3 | [0.3, 2.1] |
| 4 | [2.2, 1.4] |
| 5 | [0.7, 1.7] |
| 6 | [4.1, 2.0] |
+------------+------------+
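To see the lookup behaviour in isolation, we could push the two toy sentences through a standalone embedding layer (a sketch only; the vector values are random until the layer is trained).
# A small embedding layer acting as a lookup table for the toy corpus.
toy_model = Sequential()
toy_model.add(Embedding(input_dim=7, output_dim=2, input_length=5))
vectors = toy_model.predict(np.array([[0, 1, 2, 3, 4], [5, 1, 2, 3, 6]]))
print(vectors.shape)  # (2, 5, 2): 2 sentences, 5 tokens each, 2 dimensions per token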
Say we had a two-dimensional embedding vector for the word teacher. We could imagine a two-dimensional space in which similar words (e.g. school, tutor) are clustered near it.
In our example, we use embedding vectors with 16 dimensions. Thus, we might find that the words enjoyed, liked and fantastic are in close proximity to one another. Our model can then learn to classify as positive the reviews whose words map to embedding vectors that sit close together in the 16-dimensional space.
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=16, input_length=256))
model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
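The summary should confirm that almost all of the trainable parameters sit in the embedding layer: 10,000 words × 16 dimensions = 160,000 weights, compared with 16 × 16 + 16 = 272 for the hidden Dense layer and 16 + 1 = 17 for the output layer, roughly 160,289 in total.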
We use adam as our optimizer and binary crossentropy as our loss function since we’re trying to choose between two classes.
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
We set ten percent of our data aside for validation. The reviews pass through the network in batches of 512, with the weights updated after each batch; one full pass over the training data is an epoch, and we train for 20 epochs.
history = model.fit(
    X_train,
    y_train,
    epochs=20,
    batch_size=512,
    validation_split=0.1,
    shuffle=True
)
We can plot the training and validation accuracy and loss at each epoch by using the history variable returned by the fit function.
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'y', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
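We can make the same plot for accuracy. Note that, depending on the Keras version, the metric is stored under either 'acc' or 'accuracy', so the sketch below looks the key up first.
# Plot training and validation accuracy (the history key name varies by Keras version).
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
acc = history.history[acc_key]
val_acc = history.history['val_' + acc_key]
plt.plot(epochs, acc, 'y', label='Training accuracy')
plt.plot(epochs, val_acc, 'r', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()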
Finally, let’s see how well our model performs on the testing set.
test_loss, test_acc = model.evaluate(X_test, y_test)
print(test_acc)
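To tie this back to the earlier claim that similar words end up close together in the 16-dimensional space, we could pull the learned weights out of the embedding layer and compare a few words by cosine similarity. This is only a sketch, and the exact neighbours will depend on how training went.
# The learned embedding matrix has shape (num_words, 16).
embedding_weights = model.layers[0].get_weights()[0]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def word_vector(word):
    # word_index already includes the +3 offset applied above.
    return embedding_weights[word_index[word]]

print(cosine_similarity(word_vector('good'), word_vector('great')))
print(cosine_similarity(word_vector('good'), word_vector('terrible')))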
Final Thoughts
When we’re working with categorical features that have a large number of categories (i.e. words), we want to avoid one-hot encoding, since it requires storing a large matrix in memory and training a large number of parameters. Instead, we can map each category to an n-dimensional embedding vector and train our machine learning model using the embedding vectors as input.