[TensorFlow 2.0] Word Embeddings in Keras — Part 2

A Ydobon
Jan 7, 2020


Hello, everyone! How was your holiday? What are your New Year’s resolutions? One of mine is to write postings regularly. :)


I. Introduction

In our last posting, we practiced one of the strategies of vectorization: one-hot encoding. Although one-hot encoding is a very intuitive approach to expressing words as numbers/integers, it is destined to be inefficient.

Previously, we talked about the classic example of ‘The cat sat on the mat.’ and ‘The dog ate my homework.’ The result was a sparse matrix with mostly 0's and a few 1's as its elements, which requires a very high dimension (equal to the number of words in the vocabulary).

As a solution, today we will look at word embeddings.
Their power lies in the fact that
1. they use far fewer dimensions, in a so-called dense matrix/vector (see the quick comparison below), and
2. they can express relations between words.
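
To put rough numbers on the first point, here is a tiny sketch with my own toy figures (a 10,000-word vocabulary and an 8-dimensional embedding, the sizes we will actually use later in this posting):

vocabulary_size = 10000                # number of distinct words
one_hot_dim = vocabulary_size          # one-hot: each word needs a 10,000-dim vector that is almost all zeros
embedding_dim = 8                      # embedding: each word gets a dense 8-dim vector of floats
print(one_hot_dim / embedding_dim)     # 1250.0, i.e. 1,250x fewer numbers per word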

Similar to a dense layer, which trains and learns its own weights, an embedding layer is self-learned. The only thing a human has to do is specify the dimension of the word embeddings, usually 8 for small datasets and up to 1,024 (2¹⁰) dimensions. Intuitively, a higher-dimensional embedding captures more accurate relationships between words but requires more data.

For example, if we specify a 4-dimensional word-embedding layer, each word, e.g., ‘cat’, will be assigned a vector of size 4 with floating-point values, e.g., [1.2, -0.1, 4.3, 3.2]. Here I said ‘cat’; however, technically speaking, the embedding layer maps the indices that represent each word to the dense vectors.

II. Embedding layer

From this section on, I will introduce a very simple example from Deep Learning with Python by François Chollet, pp. 186–187. He mainly uses Keras, so later on I will also cover how things work in the TensorFlow tutorial.
As a beginner in this area, I always find the small differences very challenging; very ironic, but it is what it is 😅

  1. Embedding layer dimension
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)
  • The Embedding layer takes tokenized word indices as inputs, and 1000 is the number of possible tokens (indices start from 0, so technically the indices run from 0 to 999).
  • 64 is the dimension of the embedding.
  • It turns positive integers (indices) into dense vectors of fixed size (the embedding dimension), e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`; in this case there were 2 words and the dimension was 2.
  • The Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers of the same length.
  • It returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality), as the short shape check below confirms.
  • This layer can only be used as the first layer in a model.
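
To make these shape rules concrete, here is a minimal sketch with a made-up 2-sample batch (using tf.keras rather than standalone Keras, but the layer is the same Embedding(1000, 64) as above):

import tensorflow as tf
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(1000, 64)
batch = tf.constant([[4, 20, 7], [8, 1, 999]])  # 2D input of shape (samples=2, sequence_length=3)
output = embedding_layer(batch)
print(output.shape)  # (2, 3, 64): (samples, sequence_length, embedding_dimensionality)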

  2. Some ideas with practice

As with other layers, the weights of the embedding layer are randomly initialized. A little practice with the TensorFlow version might make the dimensions and mechanics of the embedding layer easier to understand.

Note that I am citing the TensorFlow tutorial on word embeddings, which I will elaborate on in the following posting. Anyway, if you want to run these lines of code, remember to import what’s necessary, i.e., tensorflow, etc.
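
For reference, a minimal sketch of the imports the following snippets assume:

import tensorflow as tf
from tensorflow.keras import layers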

embedding_layer = layers.Embedding(1000, 5)

For simplicity, we will use a 5-dimensional embedding layer.

result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

array([[ 0.03213355, -0.0173107 , -0.03180752,  0.04842926, -0.04584675],
       [ 0.02202057, -0.01880321,  0.03329623, -0.04096957,  0.00134622],
       [ 0.01395286,  0.04063762,  0.0090096 ,  0.02181033,  0.00713902]],
      dtype=float32)

result.shape

TensorShape([3, 5])

result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.numpy()

array([[[-0.0318125 , -0.04601121,  0.03894596, -0.04053012,  0.02418424],   ← index 0
        [ 0.03213355, -0.0173107 , -0.03180752,  0.04842926, -0.04584675],   ← index 1
        [ 0.02202057, -0.01880321,  0.03329623, -0.04096957,  0.00134622]],  ← index 2

       [[ 0.01395286,  0.04063762,  0.0090096 ,  0.02181033,  0.00713902],   ← index 3
        [ 0.01617518, -0.02878059, -0.04374051, -0.04474043,  0.03428983],   ← index 4
        [ 0.03450178, -0.04077119, -0.03103538,  0.00089379, -0.02062824]]], ← index 5
      dtype=float32)

result.shape

TensorShape([2, 3, 5])

III. Practice with IMDB data (the movie review data)

i) Load IMDB data to feed our embedding layer

from keras.datasets import imdb
from keras import preprocessing
max_features = 10000
maxlen = 20
  • max_features restricts the vocabulary to the top 10,000 most common words in the movie reviews.
  • maxlen specifies the number of words we are going to keep from each movie review.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features) 
  • The IMDB dataset comes in NumPy format (an npz file), and we will load the data as lists of integers (at this point we are dealing with text that has already been tokenized).
  • x_train.shape and y_train.shape both output (25000,)

We might want to look into how x_train is constructed, for example:

array([list([1, 14, 22, 16, 43, 530,…,32]),…,list([1, 17, 6,… ,9])], dtype=object)
It contains 25,000 lists; for example, the 1st list has a length of 218, whereas the 5th list has a length of 147.

len(x_train[0]) #218
len(x_train[4]) #147
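
If you are curious what those integers stand for, here is a small sketch (my own addition, following the usual recipe for this dataset; it assumes imdb and x_train from the lines above). Indices 0–2 are reserved for padding/start/unknown, so we offset by 3 when decoding:

word_index = imdb.get_word_index()  # dictionary mapping words to integer indices
reverse_word_index = {index: word for word, index in word_index.items()}
decoded_review = ' '.join(reverse_word_index.get(i - 3, '?') for i in x_train[0])
print(decoded_review[:60])  # the first few decoded words of review 0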

Next, we have to convert the shape of our inputs; (25000,) is not appropriate for the embedding layer.

x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
  • We started with lists of integers, and preprocessing.sequence.pad_sequences turns those lists into a 2D integer tensor of shape (samples, maxlen), which is what the embedding layer requires (a toy example follows below).
x_train.shape #(25000,20)
x_test.shape #(25000,20)

Now let’s check that every review was indeed cut down to maxlen = 20 words!

len(x_train[0]) #20
len(x_train[4]) #20
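
To see exactly what pad_sequences does, here is a toy sketch with made-up lists (it reuses the preprocessing module imported above; note that by default Keras both pads and truncates at the beginning of each sequence):

toy = [[1, 2, 3], [4, 5, 6, 7, 8]]
preprocessing.sequence.pad_sequences(toy, maxlen=4)
# array([[0, 1, 2, 3],
#        [5, 6, 7, 8]], dtype=int32)
# the short list is left-padded with 0; the long list keeps only its last 4 entries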

ii) Model Definition

Now that we have prepared our data appropriately, we are ready for the real fun!

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

Our plan is as follows.

  1. The network will learn 8-dimensional embeddings for each of the 10,000 words.
  2. Turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor)
  3. Flatten the 3D tensor of embeddings to 2D and train a single Dense layer on top for classification.
model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.summary()

Let’s look at how the number of parameters is counted (a quick check follows the list below).

  1. The Embedding layer takes 10,000 possible words and maps each of the 10,000 indices to an 8-dimensional dense vector, so it has 10,000 * 8 = 80,000 parameters.
  2. We keep 20 words from every review, and each word is assigned an 8-dimensional word embedding, so the Flatten layer returns an output of shape (None, 160).
  3. A single dense layer learns one weight for every input plus one bias, so in total (8*20 + 1) * 1 = 161 parameters. The 1’s are for 1 bias and 1 output node, respectively.
  4. None stands for the batch dimension, which is not fixed in advance.
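
As a quick sanity check on this arithmetic, we can ask the model itself (assuming the model defined in ii) above):

model.count_params()  # 80,000 (Embedding) + 0 (Flatten) + 161 (Dense) = 80,161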

iii) Model compiling and fitting

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

iv) Plotting

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

We can note the presence of overfitting, and we all know that this model is too simple and was given too little data to work with.

In the next posting, I will introduce the TensorFlow way of word embedding with a more complex model.
