LSTM Recurrent Neural Network Keras Example
Recurrent neural networks have a wide array of applications. These include time series analysis, document classification, and speech and voice recognition. In contrast to feedforward artificial neural networks, the predictions made by recurrent neural networks depend on previous predictions.
To elaborate, imagine we decided to follow an exercise routine where, every day, we alternate between lifting weights, swimming and yoga. We could then build a recurrent neural network to predict today’s workout given what we did yesterday. For example, if we lifted weights yesterday then we’d go swimming today.
More often than not, the problems you’ll be tackling in the real world will be a function of the current state as well as other inputs. For instance, suppose we signed up for hockey once a week. If we’re playing hockey on the same day that we’re supposed to lift weights, we might decide to skip the gym. Thus, our model now has to differentiate between two cases: we attended a yoga class yesterday and we’re not playing hockey today, in which case we lift weights as usual, versus we attended a yoga class yesterday and we are playing hockey today, in which case we’d jump directly to swimming.
Long Short-Term Memory (LSTM)
In practice, we rarely see plain recurrent neural networks being used, because they have a few shortcomings that render them impractical. For instance, say we added in a rest day that should only be taken after two days of exercise. If we used a plain recurrent neural network to predict what activity we’ll do tomorrow, it’s possible that it gets trapped in a loop.
Suppose we had the following scenario.
- Day 1: Lift Weights
- Day 2: Swimming
- Day 3: At this point, our model must decide whether we should take a rest day or do yoga. Unfortunately, it only has access to the previous day. In other words, it knows we swam yesterday, but it doesn’t know whether we had taken a break the day before. Therefore, it can end up predicting yoga.
LSTMs were invented to get around this problem. As the name implies, LSTMs have memory. Just as humans can hold roughly seven items in short-term memory, LSTMs can in theory remember information going back several states. However, this raises the question of how far back they should remember. At what point does information become irrelevant? For instance, in our exercise example, we shouldn’t need to go back more than two days to figure out whether we should take a break.
Without delving into too much detail, LSTMs combat this problem by using dedicated neural network layers for forgetting and selecting information. A single LSTM layer is composed of four neural network layers interacting in a special way.
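To make those four interacting layers a little more concrete, here is a rough NumPy sketch of a single LSTM step. It is purely illustrative: the weight dictionaries W, U and b are placeholders for parameters that Keras learns for you, but the gate computations are the standard ones.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def lstm_step(x, h_prev, c_prev, W, U, b):
    # the four internal layers of a single LSTM cell
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate: what to erase from memory
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate: what new information to let in
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate values to write to memory
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate: what to expose to the next step
    c = f * c_prev + i * g    # updated cell state (the long-term memory)
    h = o * np.tanh(c)        # updated hidden state (the short-term output)
    return h, c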
If you’re interested in finding out more about the internals of LSTM networks, I highly recommend you check out the following link.
Understanding LSTM Networks — colah’s blog
colah.github.io
One of the tricky things about natural language processing is that the meaning of words can change depending on their context. In the case of sentiment analysis, we can’t just go by the occurrence of a word like good, because its meaning changes completely if it’s preceded by the word not, as in not good. It’s also extremely difficult for computers to recognize things like sarcasm, since that requires reading between the lines. LSTM networks turn out to be particularly well suited for these kinds of problems because they can remember all the words that led up to the one in question.
Code
In the following section, we go over my solution to a Kaggle competition whose goal is to perform sentiment analysis on a corpus of movie reviews. We’re asked to label each phrase on a scale of zero to four. The sentiments corresponding to the labels are:
- 0: negative
- 1: somewhat negative
- 2: neutral
- 3: somewhat positive
- 4: positive
If you’d like to follow along, you can obtain the dataset from the following link.
Movie Review Sentiment Analysis (Kernels Only)
Classify the sentiment of sentences from the Rotten Tomatoes dataset — www.kaggle.com
We’re going to be using the following libraries.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('dark_background')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, GlobalMaxPooling1D, SpatialDropout1D
The corpus contains over 150,000 training samples. As you can see, some phrases are incomplete and some repeat.
df_train = pd.read_csv('train.tsv', sep='\t')
print('train set: {0}'.format(df_train.shape))
df_train.head(10)
The testing set includes over 60,000 samples. Each row is given a PhraseId and a SentenceId, which will be used by Kaggle to evaluate the submission file with the predictions made by the model.
df_test = pd.read_csv('test.tsv', sep='\t')
print('test set: {0}'.format(df_test.shape))
df_test.head(10)
ASCII characters are ultimately represented by the computer as numbers. In consequence, to a computer, ‘A’ is not the same as ‘a’. Therefore, we’ll want to change all characters to lowercase. Since we’re going to split the sentences into individual words based on whitespace, a word followed by a period is not treated as equivalent to the same word without one (happy. != happy). In addition, contractions are interpreted differently from their expanded forms, which has repercussions for the model (I’m != I am). Thus, we replace all such occurrences using the following dictionary and function.
# contraction and punctuation substitutions; clean_text applies these as plain string replacements
replace_list = {r"i'm": 'i am',
                r"'re": ' are',
                r"let's": 'let us',
                r"'s": ' is',
                r"'ve": ' have',
                r"can't": 'can not',
                r"cannot": 'can not',
                r"shan't": 'shall not',
                r"n't": ' not',
                r"'d": ' would',
                r"'ll": ' will',
                r"'scuse": 'excuse',
                ',': ' ,',
                '.': ' .',
                '!': ' !',
                '?': ' ?',
                '\s+': ' '}
def clean_text(text):
    # lowercase, expand contractions and split punctuation off the words
    text = text.lower()
    for s in replace_list:
        text = text.replace(s, replace_list[s])
    # collapse any repeated whitespace into single spaces
    text = ' '.join(text.split())
    return text
We can use apply to apply the function to every row in the series.
X_train = df_train['Phrase'].apply(lambda p: clean_text(p))
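As a quick sanity check, applying the function to a made-up phrase (not from the dataset) shows the lowercasing, contraction expansion and punctuation splitting in action.
print(clean_text("I'm not sure this movie can't be saved."))
# i am not sure this movie can not be saved .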
Let’s look at the individual length of each phrase in the corpus.
phrase_len = X_train.apply(lambda p: len(p.split(' ')))
max_phrase_len = phrase_len.max()
print('max phrase len: {0}'.format(max_phrase_len))
plt.figure(figsize = (10, 8))
plt.hist(phrase_len, alpha = 0.2, density = True)
plt.xlabel('phrase len')
plt.ylabel('probability')
plt.grid(alpha = 0.25)
All the sequences we feed into the network must be the same length. Therefore, we store the length of the longest phrase as a variable, which we’ll use later to define the input length of our model.
Next, we create a separate dataframe for the target labels.
y_train = df_train['Sentiment']
Computers don’t understand words, let alone sentences; therefore, we use the tokenizer to parse the phrases. In specifying num_words, only the most common num_words-1 words will be kept. We use a filter to remove special characters. By default, all punctuation is removed, turning the text into a space-separated sequence of words. The tokens are then vectorized. By vectorized we mean that they are mapped to integers. 0 is a reserved index that won’t be assigned to any word.
pad_sequences is used to ensure that all the phrases are the same length. Sequences that are shorter than maxlen are padded with value (0 by default) at the beginning; pass padding='post' if you want the padding at the end instead.
Whenever we’re working with categorical data, we don’t want to leave it as integers because the model would interpret samples with a higher number as having more significance. to_categorical is a quick and dirty way of one-hot encoding the data.
max_words = 8192
tokenizer = Tokenizer(
num_words = max_words,
filters = '"#$%&()*+-/:;<=>@[\]^_`{|}~'
)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen = max_phrase_len)
y_train = to_categorical(y_train)
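Before building the model, it can be worth sanity-checking the processed arrays. This quick inspection isn’t part of the original pipeline, but it confirms the padding and one-hot encoding did what we expect.
print(X_train.shape)   # (number of phrases, max_phrase_len)
print(y_train.shape)   # (number of phrases, 5)
print(X_train[0])      # word indices, padded with zeros at the front by default
print(y_train[0])      # one-hot vector with a single 1 marking the sentiment class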
We define variables for the hyperparameters.
batch_size = 512
epochs = 8
Then, we build our model using an LSTM layer.
model_lstm = Sequential()
model_lstm.add(Embedding(input_dim = max_words, output_dim = 256, input_length = max_phrase_len))
model_lstm.add(SpatialDropout1D(0.3))
model_lstm.add(LSTM(256, dropout = 0.3, recurrent_dropout = 0.3))
model_lstm.add(Dense(256, activation = 'relu'))
model_lstm.add(Dropout(0.3))
model_lstm.add(Dense(5, activation = 'softmax'))
model_lstm.compile(
loss='categorical_crossentropy',
optimizer='Adam',
metrics=['accuracy']
)
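At this point we can print a summary of the architecture to check that the input and output shapes line up and to see the number of trainable parameters at each layer.
model_lstm.summary()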
If you don’t understand what the Embedding layer is doing, I suggest you check out an article I wrote on the subject.
Machine Learning Sentiment Analysis And Word Embeddings Python Keras Example
towardsdatascience.com
We use dropout to prevent overfitting.
We set ten percent of our data aside for validation. At every training step, 512 reviews pass through the network before we take a step towards the solution, and the entire training set is seen once per epoch.
history = model_lstm.fit(
    X_train,
    y_train,
    validation_split = 0.1,
    epochs = epochs,
    batch_size = batch_size
)
We can plot the training and validation accuracy and loss at each epoch by using the history variable returned by the fit function.
plt.clf()
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'y', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()
acc = history.history['acc']          # newer versions of Keras store this under 'accuracy'
val_acc = history.history['val_acc']  # newer versions of Keras store this under 'val_accuracy'
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'y', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Since the dataset was obtained as part of a Kaggle competition, we aren’t given the sentiments corresponding to the phrases in the testing set; instead, predictions have to be submitted to Kaggle for scoring.
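For completeness, here is a rough sketch of how that submission could be generated. It isn’t part of my original notebook, and it assumes the submission file expects PhraseId and Sentiment columns.
X_test = df_test['Phrase'].apply(lambda p: clean_text(p))
X_test = tokenizer.texts_to_sequences(X_test)            # reuse the tokenizer fitted on the training phrases
X_test = pad_sequences(X_test, maxlen = max_phrase_len)  # pad to the same length as the training data
y_pred = model_lstm.predict(X_test).argmax(axis = 1)     # most probable class for each phrase
df_sub = pd.DataFrame({'PhraseId': df_test['PhraseId'], 'Sentiment': y_pred})
df_sub.to_csv('submission.csv', index = False)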
Final Thoughts
Recurrent neural networks can be used to model any phenomenon that depends on its preceding states. The example we covered in this article is that of semantics: the meaning of a sentence changes as it progresses. Rather than attempting to classify documents based on the occurrence of some word (i.e. good), we can use a more sophisticated approach that captures the interplay between words (i.e. movie was not good).