Word2Vec — Skip-Gram
With a few exceptions, machine learning models do not accept raw text as input. The sequences of words must first be encoded in some fashion. We could represent each sentence as a Bag of Words (BOW). First, we find all the unique words in the text corpus. Then, we map every sentence to a vector whose length is equal to the length of the vocabulary (i.e. number of unique words) such that the values at the indices corresponding to words present in the sentence are set to 1, and the values at all other indices are left as 0.
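As a rough illustration (the helper below is hypothetical, not part of the Word2Vec code later in this post), a BOW encoding of a tiny two-sentence corpus might look like this:
# Hypothetical BOW sketch: build the vocabulary, then encode each sentence
# as a 0/1 vector over that vocabulary.
corpus = ["the wide road shimmered", "the hot sun"]
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

def bag_of_words(sentence):
  words = set(sentence.split())
  return [1 if word in words else 0 for word in vocabulary]

print(vocabulary)                    # ['hot', 'road', 'shimmered', 'sun', 'the', 'wide']
print(bag_of_words("the hot sun"))   # [1, 0, 0, 1, 1, 0]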
There are two problems with this approach:
- The vector is very sparse (i.e. most values are 0)
- We lose information relating the context (i.e. order of the words in the sentence)
Alternatively, we could represent the text using word embeddings. A word embedding is a learned representation for text where related words will be closer to one another in feature space.
Word embeddings can be computed by training a machine learning model named Word2Vec. There are two variants of Word2Vec — skip-gram and CBOW. The skip-gram variant takes a target word and tries to predict the surrounding context words, whereas the CBOW (continuous bag of words) variant takes a set of context words and tries to predict a target word. In this post, we will cover the skip-gram variant.
Suppose we had the following sentence:
The wide road shimmered in the hot sun.
The window size determines the span of words on either side of a target_word that can be considered a context word, as opposed to the number of context words. For example, with a window size of 2, the context words for the target word road are the, wide, shimmered, and in.
Algorithm
For each word t = 1, …, T, we predict the surrounding words in a window of “radius” m. We train a machine learning model to maximize the probability of any context word given the current centre word:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)$$

As with other probabilistic models, we try to minimize the (average) negative log likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

where P(w_{t+j} | w_t) can be formulated as a Softmax function:

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

where v_c and u_o are the “centre” and “outside” vector representations of words c and o, and V is the vocabulary. The latter reads as the probability of the output word o given the centre word c. Recall that the denominator, in a Softmax function, is used to normalize the result to give a probability (i.e. a number that ranges from 0 to 1).
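To make the Softmax concrete, here is a toy numeric sketch; the vectors are made up for illustration (in practice the u and v vectors are learned):
import numpy as np

# Toy illustration of P(o | c) for a three-word vocabulary.
v_c = np.array([0.2, -0.1, 0.4])        # "centre" vector for word c
U = np.array([[0.3, 0.0, 0.5],          # one "outside" vector u_w per vocabulary word
              [-0.2, 0.1, 0.1],
              [0.0, 0.4, -0.3]])

scores = U @ v_c                                # u_w . v_c for every word w
probs = np.exp(scores) / np.exp(scores).sum()   # Softmax normalization
print(probs, probs.sum())                       # probabilities over the vocabulary; they sum to 1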
Architecture
The skip-gram neural network is composed of a single hidden layer. The input is a one-hot vector, i.e. a Bag of Words (BOW) with a value of 1 only at the position of the centre word. The output is the probability of finding a specific word at each position in the context window. In the following example, we assume we’re using a window size of 1. As we can see, initially, the model predicts that the word shimmered follows the word wide in a sentence and that the word road precedes the word wide in a sentence.
It’s important to note that we didn’t lowercase the words; thus, we have both The and the. In practice, however, you would lowercase them.
Using backpropagation, the model adjusts the weights until the error is minimized. As we can see, after a few iterations, it correctly predicts that the word following wide is road.
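The forward pass can be sketched as two matrix products followed by a Softmax. The snippet below is only a schematic illustration with random weights, not the tutorial’s implementation:
import numpy as np

vocab_size, embedding_dim = 8, 3
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # input -> hidden weights (the embeddings)
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output weights

centre = np.zeros(vocab_size)
centre[2] = 1.0                      # one-hot input for the centre word

hidden = centre @ W_in               # hidden layer = the centre word's embedding
logits = hidden @ W_out              # one score per vocabulary word
probs = np.exp(logits) / np.exp(logits).sum()   # Softmax over the vocabulary
print(probs.round(3))                # predicted probability of each word appearing in the context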
Python
We will analyze individual sections of the code from Google’s detailed word2vec tutorial.
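The snippets below assume the imports and constants used throughout the tutorial (TensorFlow, NumPy, tqdm, and a fixed random seed), roughly:
import re
import string
import tqdm
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

SEED = 42
AUTOTUNE = tf.data.AUTOTUNE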
Suppose we had the same sentence as before.
sentence = "The wide road shimmered in the hot sun"
We start off by splitting the sentence into individual tokens (i.e. words).
tokens = list(sentence.lower().split())
print(len(tokens))
8
Next, we map the words to numbers.
vocab, index = {}, 1 # start indexing from 1
vocab['<pad>'] = 0 # add a padding token
for token in tokens:
  if token not in vocab:
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)
{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}
Then, we create a one dimensional vector.
example_sequence = [vocab[word] for word in tokens]
print(example_sequence)
[1, 2, 3, 4, 5, 1, 6, 7]
Using a window size of 2, we generate the list of all possible positive training samples given the example sentence.
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    example_sequence,
    vocabulary_size=vocab_size,
    window_size=window_size,
    negative_samples=0)

for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")
(5, 4): (in, shimmered)
(1, 6): (the, hot)
(1, 5): (the, in)
(1, 3): (the, road)
(6, 5): (hot, in)
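Under the hood, skipgrams simply pairs each word with every word inside its window. A rough pure-Python equivalent (ignoring the shuffling and optional subsampling the Keras helper performs) looks like this:
def naive_skipgrams(sequence, window_size):
  pairs = []
  for i, target in enumerate(sequence):
    window_start = max(0, i - window_size)
    window_end = min(len(sequence), i + window_size + 1)
    for j in range(window_start, window_end):
      if j != i:
        pairs.append((target, sequence[j]))
  return pairs

print(naive_skipgrams(example_sequence, window_size=2)[:5])
# [(1, 2), (1, 3), (2, 1), (2, 3), (2, 4)]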
Using the first pair of target and context words, we generate num_ns = 4 negative training samples. A sample is negative (i.e. assigned a label of 0) when the context word is not found within the context window.
# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]
# Set the number of negative samples per positive context.
num_ns = 4
context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])
# Add a dimension so you can use concatenation (in the next step).
negative_sampling_candidates = tf.expand_dims(negative_sampling_candidates, 1)
# Concatenate a positive context word with negative sampled words.
context = tf.concat([context_class, negative_sampling_candidates], 0)
# Label the first context word as `1` (positive) followed by `num_ns` `0`s (negative).
label = tf.constant([1] + [0]*num_ns, dtype="int64")
# Reshape the target to shape `(1,)` and context and label to `(num_ns+1,)`.
target = tf.squeeze(target_word)
context = tf.squeeze(context)
label = tf.squeeze(label)
print(f"target_index : {target}")
print(f"target_word : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label : {label}")
target_index : 6
target_word : hot
context_indices : [7 2 1 4 3]
context_words : ['sun', 'wide', 'the', 'shimmered', 'road']
label : [1 0 0 0 0]
The following diagram summarizes the procedure of generating a training example:
We define a function that will dynamically generate training examples given a list of sentences, the window size, the number of negative samples, the vocabulary size and a random seed.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for `vocab_size` tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
        sequence,
        vocabulary_size=vocab_size,
        sampling_table=sampling_table,
        window_size=window_size,
        negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      negative_sampling_candidates = tf.expand_dims(
          negative_sampling_candidates, 1)

      context = tf.concat([context_class, negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels
Now, we can proceed to generate training examples from a larger list of sentences, namely a text file of Shakespeare’s writing.
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')
vocab_size = 4096
sequence_length = 10
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
vectorize_layer.adapt(text_ds.batch(1024))
inverse_vocab = vectorize_layer.get_vocabulary()
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()
sequences = list(text_vector_ds.as_numpy_iterator())
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")
[ 89 270 0 0 0 0 0 0 0 0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[138 36 982 144 673 125 16 106 0 0] => ['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34 0 0 0 0 0 0 0 0 0] => ['all', '', '', '', '', '', '', '', '', '']
[106 106 0 0 0 0 0 0 0 0] => ['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270 0 0 0 0 0 0 0 0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
We configure the dataset that will be used to train the model.
targets, contexts, labels = generate_training_data(sequences=sequences, window_size=2, num_ns=4, vocab_size=vocab_size, seed=SEED)
targets = np.array(targets)
contexts = np.array(contexts)[:,:,0]
labels = np.array(labels)
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
We define a class for the Word2Vec model.
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = layers.Embedding(vocab_size,
                                             embedding_dim,
                                             input_length=1,
                                             name="w2v_embedding")
    self.context_embedding = layers.Embedding(vocab_size,
                                              embedding_dim,
                                              input_length=num_ns+1)

  def call(self, pair):
    target, context = pair
    # target: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
    # context: (batch, context)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)
    # dots: (batch, context)
    return dots
We will represent every word in the vocabulary using 128 dimensions. We instantiate our Word2Vec class and compile the model using categorical cross-entropy as the loss function.
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam', loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
Finally, we train the model.
word2vec.fit(dataset, epochs=20)
We obtain the embeddings (i.e. weights) between the input layer and the hidden layer.
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
As we can see, we represent each of the 4096 words in the vocabulary using an embedding vector of length 128.
weights.shape
(4096, 128)
We can examine the embedding vector corresponding to the first word in the vocabulary.
weights[0]
array([ 0.02083263, 0.00343355, 0.03133059, 0.04064811, 0.02139286, 0.01668987, -0.01700681, 0.03104338, 0.00513292, 0.01149722, 0.00156037, 0.04110433, -0.02908002, -0.02072917, -0.04493903, -0.03360658, 0.02354895, 0.02986685, 0.01450031, -0.00434611, 0.02604233, 0.00688297, -0.00568321, 0.02448267, -0.04282743, 0.01752845, 0.02333864, -0.03737045, -0.03860588, 0.03164918, -0.03887875, 0.03344462, -0.04599243, -0.00912831, -0.03298129, -0.02165511, 0.00222781, -0.01334076, 0.03560077, -0.01657902, -0.04948949, 0.00923187, 0.03645227, 0.00624547, -0.00375736, 0.03080207, -0.03460135, 0.00123183, 0.0317348 , -0.03172968, -0.01598473, 0.03343581, 0.03939797, 0.01271281, 0.01737561, -0.04787338, 0.03081578, 0.02194339, 0.00668417, 0.0198779 , -0.03545182, 0.03608498, 0.03983852, 0.01381046, 0.02620314, -0.01378284, 0.04695277, -0.0301432 , -0.01917797, 0.03523597, 0.03922388, 0.02773141, 0.00329931, 0.02588192, 0.03493189, -0.02089679, 0.04374716, -0.03882134, -0.02024856, 0.04483554, -0.03621026, -0.04145117, -0.03030737, -0.02996567, -0.00220994, 0.0392569 , 0.03163559, -0.02619413, 0.04448912, -0.01938783, 0.02185104, 0.01294803, -0.01223926, -0.02752018, 0.02359452, 0.01469387, 0.01765844, -0.00813044, -0.04376047, -0.01028157, 0.00078993, -0.01525372, -0.0381612 , -0.00429031, 0.01438124, 0.03173996, 0.02320362, -0.03639726, -0.01158337, 0.04985858, 0.03488507, 0.0025389 , 0.03290978, 0.02607682, 0.04781124, 0.00342916, -0.03108559, 0.0361053 , 0.02612146, -0.00554097, -0.03796817, 0.03855484, -0.03623279, 0.0217861 , -0.01969334, -0.03057173, 0.03088465, -0.02974273], dtype=float32)
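As a quick (hypothetical) sanity check, we could compare two embedding vectors with cosine similarity; the word pair below is only an example and assumes both words made it into the 4096-token vocabulary:
def cosine_similarity(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 'king' and 'queen' are assumed to be in the learned vocabulary; swap in any two words.
i, j = inverse_vocab.index('king'), inverse_vocab.index('queen')
print(cosine_similarity(weights[i], weights[j]))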
Conclusion
Word embeddings are a widely used representation of words that captures the similarity between a given word and the other words in the corpus. In contrast to Bag of Words (BOW), word embeddings have the added advantage that they are dense. Word embeddings can be obtained by training a Word2Vec model and looking at the weights. There are two variations of the Word2Vec model: skip-gram and CBOW.