[TensorFlow 2.0] Word Embeddings — Part 1
Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing we must do is to come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.
Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric vectors. Vectorizing text is the process of transforming text into numeric vectors. This can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.
The different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens.
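As a minimal illustration (not production tokenization), the three schemes above can be sketched in a few lines of Python:

```python
sentence = "the cat sat"

# Word-level tokens
words = sentence.split()                   # ['the', 'cat', 'sat']

# Character-level tokens (spaces dropped for brevity)
chars = [c for c in sentence if c != ' ']  # 't', 'h', 'e', 'c', ...

# Word bigrams: overlapping groups of 2 consecutive words
bigrams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
# ['the cat', 'cat sat']
```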
As a first idea, we might “one-hot” encode each word in our vocabulary. Consider the sentence “The cat sat on the mat”. The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector with length equal to the vocabulary, then place a one in the index that corresponds to the word.
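As a minimal sketch of this idea (the `vocab` and `one_hot` names are our own, and we lowercase the sentence so that “The” and “the” share a vector):

```python
sentence = "The cat sat on the mat"
vocab = sorted(set(sentence.lower().split()))  # ['cat', 'mat', 'on', 'sat', 'the']

def one_hot(word):
    """Return a zero vector with a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot('cat'))  # [1, 0, 0, 0, 0]
print(one_hot('the'))  # [0, 0, 0, 0, 1]
```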
To understand this idea, consider the following toy example in detail. The toy example code does not distinguish “The” from “the”, and it does not handle punctuation marks. That’s why we call it a toy.
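A sketch of such a toy example is given below; the variable names `token_index` and `results`, and the array shape (2, 10, 11), follow the discussion in this section:

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build a word -> index mapping, starting from index 1 (index 0 stays unused).
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# One-hot encode: one vector of length (number of tokens + 1) per word, per sample.
max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.0
```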
The first thing to notice in this code is that we build an empty dictionary, token_index, in which we will record our tokens. Let’s dive into the code line by line.
Since we have two sentences in the data, the
split() method produces the following lists. Note that we do not strip the punctuation marks; in a real-life example, we should also strip punctuation and special characters from the samples.
['The', 'cat', 'sat', 'on', 'the', 'mat.']
['The', 'dog', 'ate', 'my', 'homework.']
The nested for loop prints out all the words in the sentences. Note that there are 11 words in total across the two sentences.
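This printing step can be reproduced with a self-contained snippet (repeating the sample data so it runs on its own):

```python
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Print every word and count them as we go.
count = 0
for sample in samples:
    for word in sample.split():
        print(word)
        count += 1

print(count)  # 11
```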
Now the final pieces.
We have the following dictionary.
Note that both ‘The’ and ‘the’ exist as separate tokens because we did not normalize capitalization. This is the end of the tokenization process.
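For reference, rebuilding the dictionary as described above and printing it gives the following (on Python 3.7+, where dicts preserve insertion order):

```python
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Record each new word under the next free index, starting from 1.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

print(token_index)
# {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6,
#  'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
```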
Now that we have finished the tokenization, we need to “vectorize” the tokens, which first requires allocating space for the vectors. We create the results variable for this, the shape of which is (2, 10, 11):
- 2 is the number of sentences in the data.
- 10 is the maximum number of words in a sentence that are to be processed. We can change this number as we wish, so we will play with it later.
- 11 is the number of tokens + 1. The reason to add 1 is that we do not want to use index 0.
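The allocation and filling steps can be sketched with NumPy as follows (repeating the token_index mapping from the tokenization step so the snippet is self-contained):

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index = {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6,
               'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}

max_length = 10  # maximum number of words per sentence to process
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
print(results.shape)  # (2, 10, 11)

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.0

# The first word of the first sentence ('The', index 1) is one-hot at position 1.
print(results[0, 0])  # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```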
References
- François Chollet, Deep Learning with Python, Manning Publications, 2017.
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, O’Reilly Media Inc., 2019.