[TensorFlow 2.0] Word Embeddings — Part 1
Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing we must do is to come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.
Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric vectors. Vectorizing text is the process of transforming text into numeric vectors. This can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.
The different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens.
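As a minimal illustration (not production tokenization), the three schemes above can be sketched in a few lines of Python:

```python
sentence = "the cat sat"

# Word-level tokens
words = sentence.split()                   # ['the', 'cat', 'sat']

# Character-level tokens (spaces dropped for brevity)
chars = [c for c in sentence if c != ' ']  # 't', 'h', 'e', 'c', ...

# Word bigrams: overlapping groups of 2 consecutive words
bigrams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
# ['the cat', 'cat sat']
```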
As a first idea, we might “one-hot” encode each word in our vocabulary. Consider the sentence “The cat sat on the mat”. The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector with length equal to the vocabulary, then place a one in the index that corresponds to the word.
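As a minimal sketch of this idea (the `vocab` and `one_hot` names are our own, and we lowercase the sentence so that “The” and “the” share a vector):

```python
sentence = "The cat sat on the mat"
vocab = sorted(set(sentence.lower().split()))  # ['cat', 'mat', 'on', 'sat', 'the']

def one_hot(word):
    """Return a zero vector with a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot('cat'))  # [1, 0, 0, 0, 0]
print(one_hot('the'))  # [0, 0, 0, 0, 1]
```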
To understand this idea, consider the following toy example in detail. The toy example code does not distinguish “The” from “the”, and it does not handle punctuation marks. That’s why we call it a toy.
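A sketch of such a toy example is given below; the variable names `token_index` and `results`, and the array shape (2, 10, 11), follow the discussion in this section:

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build a word -> index mapping, starting from index 1 (index 0 stays unused).
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# One-hot encode: one vector of length (number of tokens + 1) per word, per sample.
max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.0
```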
The first thing to notice in this code is that we build an empty dictionary, token_index, in which we will record our tokens. Let’s dive into the code line by line.
Since we have two sentences in the data, the
split() method produces the following lists. Note that we do not strip the punctuation marks; in a real-life example, we should also strip punctuation and special characters from the samples.
['The', 'cat', 'sat', 'on', 'the', 'mat.']
['The', 'dog', 'ate', 'my', 'homework.']
The nested for loop prints out all the words in the sentences. Note that there are 11 words in total across the two sentences.
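This printing step can be reproduced with a self-contained snippet (repeating the sample data so it runs on its own):

```python
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Print every word and count them as we go.
count = 0
for sample in samples:
    for word in sample.split():
        print(word)
        count += 1

print(count)  # 11
```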
Now the final pieces.
We have the following dictionary.
Note that both ‘The’ and ‘the’ exist as separate tokens because we did not normalize capitalization. This is the end of the tokenization process.
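For reference, rebuilding the dictionary as described above and printing it gives the following (on Python 3.7+, where dicts preserve insertion order):

```python
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Record each new word under the next free index, starting from 1.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

print(token_index)
# {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6,
#  'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
```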
Now that we have finished the tokenization, we need to “vectorize” the tokens, which first requires allocating space for the vectors. We create the results variable for this, the shape of which is (2, 10, 11):
- 2 is the number of sentences in the data.
- 10 is the maximum number of words in a sentence that are to be processed. We can change this number as we wish, so we will play with it later.
- 11 is the number of tokens + 1. The reason to add 1 is that we do not want to use index 0.
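The allocation and filling steps can be sketched with NumPy as follows (repeating the token_index mapping from the tokenization step so the snippet is self-contained):

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index = {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6,
               'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}

max_length = 10  # maximum number of words per sentence to process
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
print(results.shape)  # (2, 10, 11)

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.0

# The first word of the first sentence ('The', index 1) is one-hot at position 1.
print(results[0, 0])  # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```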
References
- François Chollet, Deep Learning with Python, Manning Publications, 2017.
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, O’Reilly Media Inc., 2019.