Transformers Explained
Since their introduction in 2017, transformers have revolutionized the world of natural language processing. Prior to transformers, LSTMs and RNNs were the state of the art. The reason transformers consistently outperform LSTMs and RNNs is that the latter process a sentence sequentially, one word at a time, from left to right. For example, suppose we had the following sentences:
- On the river bank
- On the bank of the river
An LSTM or RNN wouldn’t realize that, in the context of the second sentence, the word “bank” refers to the land alongside a river and not a financial institution. In contrast, a transformer is able to handle this scenario because it doesn’t read the words one after the other. Rather, it accepts the entire sentence at once.
The architecture described in the paper Attention Is All You Need consists of an encoder and decoder.
Input Embeddings
Transformers do not accept raw text as input. Thus, as we do for other models, we generate word embeddings for the input sequence.
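As a minimal sketch (the vocabulary size, embedding dimension, and token ids below are illustrative assumptions, not values from the models used later), a Keras Embedding layer maps token ids to dense vectors:
import tensorflow as tf

# Illustrative sizes: a vocabulary of 8,000 tokens embedded in 512 dimensions.
embedding = tf.keras.layers.Embedding(input_dim=8000, output_dim=512)

token_ids = tf.constant([[12, 57, 904, 3]])  # a toy tokenized sentence
embedded = embedding(token_ids)              # shape: (1, 4, 512)
print(embedded.shape)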
Positional Encoding
Embeddings represent a token in a d-dimensional space where tokens with similar meaning are closer to one another. However, the embeddings do not encode the relative position of the tokens in a sentence.
As the name implies, positional encoding encodes the position of the words in the sequence.
The formula for calculating the positional encoding is:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos is the position of the token in the sequence, i indexes a pair of embedding dimensions, and d_model is the size of the embedding space.
Positional encoding works because absolute position is less important than relative position. For instance, we don’t need to know that the word “good” is at index 6 and the word “looks” is at index 5. It’s sufficient to remember that the word “good” tends to follow the word “looks”.
Here’s a plot generated using a sequence length of 100 and embedding space of 512 dimensions:
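A minimal sketch of how such a plot can be generated, assuming the standard sinusoidal formulation above (the helper name is ours):
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(length, depth):
    positions = np.arange(length)[:, np.newaxis]   # (length, 1)
    dims = np.arange(depth // 2)[np.newaxis, :]    # (1, depth / 2)
    angle_rates = 1 / np.power(10000, (2 * dims) / depth)
    angles = positions * angle_rates
    pe = np.zeros((length, depth))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = positional_encoding(length=100, depth=512)
plt.pcolormesh(pe.T, cmap='RdBu')
plt.xlabel('Position in sequence')
plt.ylabel('Embedding dimension')
plt.colorbar()
plt.show()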
For the first dimension, a value of 1 means the word is at an odd position, and a value of 0 means it is at an even position. For the d/2-th dimension, a value of 1 tells us the word is in the second half of the sentence, and a value of 0 tells us it is in the first half. The model can use this information to determine the relative position of the tokens.
Encoder Input
After adding the positional encoding to the embedding vector, tokens will be closer to each other based on the similarity of their meaning and their position in the sentence.
Encoder
The Encoder’s job is to map all input sequences into an abstract continuous representation that holds the learned information (i.e. how words relate to one another).
Scaled Dot-Product Attention
After feeding the query, key, and value vectors through a linear layer, we calculate the dot product of the query and key vectors. The values in the resulting matrix determine how much attention should be paid to the other words in the sequence given the current word. In other words, each word (row) will have an attention score for every other word (column) in the sequence.
The dot product is scaled by a factor of the square root of the depth. This is done because, for large values of depth, the dot product grows large in magnitude, pushing the softmax function into regions where it has extremely small gradients, which makes learning difficult.
Once the values have been scaled, we apply a softmax function to obtain values between 0 and 1.
Finally, we multiply the resulting matrix by the value vector.
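Putting those three steps together, here is a minimal sketch of scaled dot-product attention in TensorFlow, following the formulation in Attention Is All You Need (the mask argument is explained in the Masking section below):
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention scores: how strongly each query position attends to each key position.
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    # Scale by the square root of the depth to keep the softmax in a well-behaved range.
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # Masked positions receive a large negative value so softmax sends them to ~0.
        scaled_logits += (mask * -1e9)
    # Softmax turns the scores into weights between 0 and 1 that sum to 1 per row.
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    # Finally, weight the value vectors by the attention weights.
    return tf.matmul(attention_weights, v), attention_weights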
Multi-Headed Attention
Instead of a single attention head, Q, K, and V are split into multiple heads because this allows the model to jointly attend to information from different representation subspaces at different positions.
For example, given the word “the”, the first head will give more attention to the word “bank” whereas the second head will give more attention to the word “river”.
It’s important to note that after the split each head has a reduced dimensionality. Thus, the total computation cost is the same as a single head attention with full dimensionality.
The attention output for each head is concatenated and put through a Dense layer.
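One common way to implement the split-and-concatenate step, assuming d_model = 512 and 8 heads (so each head works with a depth of 64), is sketched below:
import tensorflow as tf

num_heads, d_model = 8, 512
depth = d_model // num_heads  # 64 dimensions per head

def split_heads(x, batch_size):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
    x = tf.reshape(x, (batch_size, -1, num_heads, depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

q = tf.random.uniform((2, 10, d_model))   # (batch, seq_len, d_model)
q_heads = split_heads(q, batch_size=2)    # (2, 8, 10, 64)

# After attention, undo the split, concatenate the heads,
# and project back to d_model with a Dense layer.
concat = tf.reshape(tf.transpose(q_heads, perm=[0, 2, 1, 3]), (2, -1, d_model))
output = tf.keras.layers.Dense(d_model)(concat)   # (2, 10, 512)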
The Residual Connections, Layer Normalization, and Feed Forward Network
The original positional input embedding is added to the multi-headed attention output vector. This is known as a residual connection. Each hidden layer has a residual connection around it followed by a layer normalization. Residual connections help in avoiding the vanishing gradient problem in deep networks.
The output then passes through a point-wise feed-forward network.
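A minimal sketch of this sub-block, assuming d_model = 512 and an inner feed-forward dimension of 2048 (the sizes used in the original paper):
import tensorflow as tf

d_model, dff = 512, 2048

ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(dff, activation='relu'),  # point-wise expansion
    tf.keras.layers.Dense(d_model),                 # project back down to d_model
])
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

def encoder_sublayers(x, attention_output):
    # Residual connection around the attention block, followed by layer normalization.
    out1 = layernorm1(x + attention_output)
    # Residual connection around the feed-forward network, followed by layer normalization.
    return layernorm2(out1 + ffn(out1))

x = tf.random.uniform((2, 10, d_model))          # token representations
attn_out = tf.random.uniform((2, 10, d_model))   # stand-in for the attention output
print(encoder_sublayers(x, attn_out).shape)      # (2, 10, 512)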
Decoder
The decoder’s job is to generate text. The decoder has similar hidden layers to the encoder. However, unlike the encoder, the decoder’s output is sent to a softmax layer in order to compute the probability of the next word in the sequence.
Decoder Input Embeddings & Positional Encoding
The decoder is autoregressive meaning that it predicts future values based on previous values. To be exact, the decoder predicts the next token in the sequence by looking at the encoder’s output and self-attending to its own previous output. Just like we did with the encoder, we add the positional encodings to the word embedding to capture the position of the tokens in the sentence.
Masking
Since the decoder is trying to generate the sequence word by word, a look-ahead mask is used to indicate which entries should not be used. For example, when predicting the third token in the sentence, only the previous tokens, that is, the first and second tokens, should be used.
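A short sketch of how such a mask can be built in TensorFlow, where a 1 marks a position that must be hidden (the same convention as the mask argument in the attention sketch above):
import tensorflow as tf

def create_look_ahead_mask(size):
    # Strictly upper-triangular ones: position i may not attend to positions j > i.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(create_look_ahead_mask(4))
# The resulting mask (values only):
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]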
Output
As we mentioned previously, the output of the hidden layers goes through a final softmax layer. If we have a vocabulary of 10,000 words, then the output of the classifier will be a vector of length 10,000 where the value at each index is the probability that the word associated with that index is the next word in the sequence.
We take the word with the highest probability and append it to the sequence fed to the decoder in the next decoding step.
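As a rough sketch of this greedy step, with dummy tensors standing in for the decoder’s real output:
import tensorflow as tf

vocab_size = 10000
# Dummy stand-ins: logits from the final layer, shape (batch, seq_len, vocab_size),
# and the token ids generated so far.
logits = tf.random.uniform((1, 5, vocab_size))
decoder_input = tf.constant([[1, 42, 7, 99, 3]], dtype=tf.int64)

# Only the last position predicts the next token; take its argmax (greedy decoding).
predicted_id = tf.argmax(logits[:, -1, :], axis=-1)

# Append the prediction to the running sequence for the next decoding step.
decoder_input = tf.concat([decoder_input, predicted_id[:, tf.newaxis]], axis=-1)
print(decoder_input)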
Python
Google’s Transformer model for language understanding tutorial already does an excellent job of demonstrating how to code a transformer from scratch using TensorFlow and Keras. Thus, we will instead see how to download and make use of one of the pre-trained models.
To begin, we install and import the required libraries.
! pip install -q -U "tensorflow-text==2.8.*" tf-models-official==2.7.0
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization # to create AdamW optimizer
import matplotlib.pyplot as plt
import os
import shutil
We download the IMDB dataset using the Keras utility function.
dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz',
                                  'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
                                  untar=True, cache_dir='.',
                                  cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_dir = os.path.join(dataset_dir, 'train')
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
We create training, validation and testing datasets from the input data.
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
We print a few records to get a better sense of what we’re working with.
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(f'Review: {text_batch.numpy()[i]}')
        label = label_batch.numpy()[i]
        print(f'Label : {label} ({class_names[label]})')
Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs....'
Label : 0 (neg)
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into complicated situations, and so does the perspective of the viewer...."
Label : 0 (neg)
Review: b'Great documentary about the lives of NY firefighters during the worst terrorist attack of all time....'
Label : 1 (pos)
We will download and use the pre-trained BERT models from TensorFlow Hub.
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
The pre-processing model takes a sentence and tokenizes it. Notice how it also adds padding to ensure the sequence is of length 128 (required by the BERT model).
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
text_test = ['what a great movie!']
text_preprocessed = bert_preprocess_model(text_test)
print(f'Keys : {list(text_preprocessed.keys())}')
print(f'Shape : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids : {text_preprocessed["input_type_ids"][0, :12]}')
Keys : ['input_mask', 'input_type_ids', 'input_word_ids']
Shape : (1, 128)
Word Ids : [ 101 2054 1037 2307 3185 999 102 0 0 0 0 0]
Input Mask : [1 1 1 1 1 1 1 0 0 0 0 0]
Type Ids : [0 0 0 0 0 0 0 0 0 0 0 0]
We define a function to build our classifier model. We add a dense layer on top of BERT’s pooled output to produce a single score; after applying a sigmoid, a value close to 1 implies that the review is positive and a value close to 0 implies that it is negative.
def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    return tf.keras.Model(text_input, net)
We call the function and examine the model layers in closer detail. We ensure the parameters are trainable since we want to fine-tune the model.
classifier_model = build_classifier_model()
classifier_model.summary()
We define a number of hyperparameters such as the number of epochs, steps and the learning rate.
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)
init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr, num_train_steps=num_train_steps, num_warmup_steps=num_warmup_steps, optimizer_type='adamw')
We compile the model using binary crossentropy for the loss function and AdamW for the optimizer.
classifier_model.compile(optimizer=optimizer,
                         loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                         metrics=tf.metrics.BinaryAccuracy())
We train the model.
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)
Epoch 1/5
625/625 [==============================] - 137s 203ms/step - loss: 0.5083 - binary_accuracy: 0.7452 - val_loss: 0.3831 - val_binary_accuracy: 0.8364
Epoch 2/5
625/625 [==============================] - 122s 195ms/step - loss: 0.3284 - binary_accuracy: 0.8520 - val_loss: 0.3700 - val_binary_accuracy: 0.8450
Epoch 3/5
625/625 [==============================] - 121s 194ms/step - loss: 0.2530 - binary_accuracy: 0.8949 - val_loss: 0.3833 - val_binary_accuracy: 0.8522
Epoch 4/5
625/625 [==============================] - 121s 194ms/step - loss: 0.1967 - binary_accuracy: 0.9232 - val_loss: 0.4424 - val_binary_accuracy: 0.8534
Epoch 5/5
625/625 [==============================] - 121s 193ms/step - loss: 0.1612 - binary_accuracy: 0.9385 - val_loss: 0.4716 - val_binary_accuracy: 0.8504
We evaluate the accuracy of our model on the testing dataset.
loss, accuracy = classifier_model.evaluate(test_ds)
print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
Loss: 0.4483765959739685
Accuracy: 0.8543199896812439
We plot the loss and accuracy of our model over time.
history_dict = history.history
acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))
fig.tight_layout()
plt.subplot(2, 1, 1)
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.ylabel('Loss')
plt.legend()
plt.subplot(2, 1, 2)
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
To get a better sense of how the model behaves, we perform inference on a few examples.
examples = [
    'this is such an amazing movie!',
    'The movie was great!',
    'The movie was meh.',
    'The movie was okish.',
    'The movie was terrible...'
]
results = tf.sigmoid(classifier_model(tf.constant(examples)))
result_for_printing = \
    [f'input: {examples[i]:<30} : score: {results[i][0]:.6f}'
     for i in range(len(examples))]
print(*result_for_printing, sep='\n')
input: this is such an amazing movie! : score: 0.999392
input: The movie was great! : score: 0.991764
input: The movie was meh. : score: 0.515988
input: The movie was okish. : score: 0.009715
input: The movie was terrible... : score: 0.001295
As we can see, the model does a pretty good job of classifying the sentences as either positive or negative. That being said, “okish” probably should have scored closer to “meh” than to “terrible”.