[TensorFlow 2.0] Word Embeddings in TensorFlow

A Ydobon
4 min read · Jan 8, 2020


Hello, everyone! I hope that Australia recovers from the natural catastrophe. 🙏 Now that we are in a new decade, it is time we considered our impact on nature and its inhabitants other than ourselves.


Photo by Adrian Pereira on Unsplash

For now, let's come back to word embeddings.
In this posting, we will see how word embeddings work in TensorFlow 2.0, and we will build a slightly more refined model.

As usual, I hereby acknowledge that the following post is based on the TensorFlow tutorial provided in:

I. Technical Setup

from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  !pip install tf-nightly
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_datasets as tfds
tfds.disable_progress_bar()
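
If you want to confirm that a 2.x build is actually active before moving on, a quick sanity check (my own addition, not part of the tutorial) looks like this:

# Confirm that an eager-execution TensorFlow 2.x build is installed.
print(tf.__version__)
print(tf.executing_eagerly())  # True by default in TF 2.x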

II. Load IMDB data set

As in the previous posting, we will be using a preprocessed dataset.

(train_data, test_data), info = tfds.load(
'imdb_reviews/subwords8k',
split = (tfds.Split.TRAIN, tfds.Split.TEST),
with_info=True, as_supervised=True)
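
Because we passed with_info=True, the returned info object also describes the dataset. A quick look (these inspection calls are standard tensorflow_datasets, shown here only as a sketch):

# DatasetInfo describes the splits and the encoder attached to the 'text' feature.
print(info.splits['train'].num_examples)  # 25000
print(info.splits['test'].num_examples)   # 25000
print(info.features['text'])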

III. Standardize the length of the reviews

Previously, we simply cut off the reviews after 20 words. Here, however, we will use the padded_batch method, which adds 0s to each review so that it matches the length of the longest review in its batch.

padded_shapes = ([None],())
train_batches = train_data.shuffle(1000).padded_batch(10, padded_shapes = padded_shapes)
test_batches = test_data.shuffle(1000).padded_batch(10, padded_shapes = padded_shapes)

Alert! For those of you using Colab to practice this tutorial, note that this part of the Colab notebook has some errors; copy the code from the tutorial or from this posting.

Now, let’s see what we did with padded_batch.

train_batch, train_labels = next(iter(train_batches))
train_batch.numpy()

array([[6739, 7961, 499, …, 0, 0, 0],
[ 62, 27, 180, …, 0, 0, 0],
[3502, 112, 48, …, 0, 0, 0],
…,
[ 62, 9, 4, …, 0, 0, 0],
[ 19, 1807, 7, …, 0, 0, 0],
[3742, 2128, 3268, …, 0, 0, 0]])

We can see the zeros!
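
One detail worth noticing: padded_batch pads each batch only up to the longest review inside that batch, so the padded length changes from batch to batch. A quick check (just a sketch, reusing the train_batches defined above):

# Each batch is padded to its own longest review, so the second axis varies.
train_iter = iter(train_batches)
print(next(train_iter)[0].shape)  # e.g. (10, 1021)
print(next(train_iter)[0].shape)  # e.g. (10, 879), a different padded length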

Some Python basics: iter() and next(). A for loop is the more elegant method. 😎

my_birthday = [12, 30, 'cecil', 'kim']
my_birthday = iter(my_birthday)

# iterate through it using next()
print(next(my_birthday))       # 12
print(next(my_birthday))       # 30
print(my_birthday.__next__())  # cecil
print(my_birthday.__next__())  # kim

# next(object) is the same as object.__next__()
next(my_birthday)  # <-- since there is nothing left, this raises StopIteration!

As mentioned, a for loop is simpler.

my_birthday = [12, 30, 'cecil', 'kim']

for i in my_birthday:
    print(i)

which will output
12
30
cecil
kim
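
This digression matters because a tf.data.Dataset follows the same iterator protocol, which is exactly what made next(iter(train_batches)) work above. A for loop is again the cleaner way to peek at it (a sketch, reusing train_batches from earlier):

# take(1) restricts the dataset to a single batch, so the loop runs once.
for batch, labels in train_batches.take(1):
    print(batch.shape)
    print(labels.numpy())  # the 10 labels in this batch (0 = negative, 1 = positive)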

IV. Modelling

1. Encoder.

The text encoder converts any string into a sequence of integers.

encoder = info.features['text'].encoder
encoder.subwords[5:20]
encoder.vocab_size #8185

This will return [‘of_’, ‘to_’, ‘s_’, ‘is_’, ‘br’, ‘in_’, ‘I_’, ‘that_’, ‘this_’, ‘it_’, ‘ /><’, ‘ />’, ‘was_’, ‘The_’, ‘as_’] where _ represents a space.

Before we apply it to our imdb dataset, let’s practice this on a simple example.

sample_string = 'love your neighbor as yourself.'
encoded_string = encoder.encode(sample_string)
print ('Encoded string is {}'.format(encoded_string))

This will return:
Encoded string is [174, 155, 2955, 7961, 20, 4381, 7975]

Then we want to see the original text from the numbers, so we will decode it.

original_string = encoder.decode(encoded_string)
print ('The original string: "{}"'.format(original_string))

This will return:
The original string: “love your neighbor as yourself.”

However, why do we have 7 integers while our sample_string consists of only 5 words? And how are the words assigned to numbers?

To answer that, let's try the following:

for ts in encoded_string:
    print('{} ----> {}'.format(ts, encoder.decode([ts])))

174 ----> love
155 ----> your
2955 ----> neighbor
7961 ---->
20 ----> as
4381 ----> yourself
7975 ----> .

encoder.subwords[173] #'love_'
encoder.subwords[154] #'your_'
encoder.subwords[19] #'as_'

and so on.
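
This also answers the earlier question: the encoder is a subword encoder, so punctuation and less common words are split into several pieces instead of getting a single id of their own. A rough sketch of how to see the splits (the exact ids and pieces depend on the learned vocabulary, so no output is shown here):

# Unfamiliar words are broken into smaller subwords (down to single characters if needed),
# so any string can be encoded without an out-of-vocabulary token.
for word in ['love', 'neighbor', 'TensorFlow']:
    ids = encoder.encode(word)
    print(word, '->', [encoder.decode([i]) for i in ids])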

2. Model Definition.

embedding_dim=16

model = keras.Sequential([
layers.Embedding(encoder.vocab_size, embedding_dim),
layers.GlobalAveragePooling1D(),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='sigmoid')
])

model.summary()
  • encoder.vocab_size was 8185 and we are working with 16-dimensional word embeddings, so the Embedding layer has 8185*16 = 130,960 parameters.
  • GlobalAveragePooling1D takes a 3D tensor with shape (batch_size, steps, features), in our case (samples, input_length, embedding_dim), and outputs a 2D tensor with shape (batch_size, features), i.e. (samples, embedding_dim). It has no trainable parameters.
  • The first dense layer will take the output from the previous pooling layer.
    Each node will have 16 weights and 1 bias so, (16+1)*16=272 parameters.
  • Similarly, the second dense layer will have (16+1)*1=17 parameters.

In total, 130,960 + 272 + 17 = 131,249 parameters.
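
If you want to double-check this arithmetic against the model itself, Keras can report the counts directly (a small sketch using the model defined above; the numbers should agree with the hand calculation):

# Per-layer parameter counts: 130,960 for the Embedding layer, 0 for the pooling layer,
# then 272 and 17 for the two Dense layers.
for layer in model.layers:
    print(layer.name, layer.count_params())
print('total:', model.count_params())  # 131,249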

3. Compile and train the model.

model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])

history = model.fit(
train_batches,
epochs=10,
validation_data=test_batches, validation_steps=20)
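
Once training finishes, the same test_batches pipeline can be reused to score the model on the whole test set (a sketch; model.evaluate is standard Keras, and the exact numbers will vary from run to run):

# Returns the loss and the metrics listed in compile(), here loss and accuracy.
loss, accuracy = model.evaluate(test_batches)
print('Test loss: {:.3f}, test accuracy: {:.3f}'.format(loss, accuracy))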

4. Plotting

import matplotlib.pyplot as plt

history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(4,3))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(4,3))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()

The model still overfits; however, padding was real progress in some sense, and adding dense layers and more nodes can increase validation accuracy.
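
For the curious, one way to try the "more layers and nodes" idea is a slightly wider variant of the model above. This is only a sketch of a possible experiment, not part of the original tutorial, and whether it actually helps will depend on the run:

# A hypothetical wider/deeper variant, for experimentation only.
model_wide = keras.Sequential([
    layers.Embedding(encoder.vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model_wide.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])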

So, this has been all for word embeddings.
What is missing from these postings is tokenization: in real life, we have to import a raw text file and index each word ourselves. If time permits, I will cover that part as well.

Thank you all for reading through this post, and I hope you enjoy the winter. Sadly, we are missing the snow this winter in Seoul, Korea.

I will be back soon! See you then.
