January 2020: the best time to learn new things and boost our skill set. So, I decided to study PyTorch together with you.
For people who are somewhat unfamiliar with PyTorch, the link below leads to the official website for the library.
PyTorch is an open-source deep learning platform primarily developed by Facebook Artificial Intelligence Research Lab, FAIR.
Based on personal experience, I find that PyTorch offers somewhat easier and more intuitive code for implementing machine learning models.
In this post, we will take a quick look at what recurrent neural networks are, and then study two widely used text preprocessing techniques: one-hot encoding and word embedding.
Content
1. Brief introduction: Recurrent Neural Networks
2. Text preprocessing: One-hot encoding & Word Embedding
— 2.1 One-hot encoding
— 2.2 Word Embedding with PyTorch
1. Brief introduction: Recurrent Neural Networks
What is a Recurrent Neural Network (RNN)?
I think the easiest way to learn something new is to step up from something we already know. So, let's think of a regular, simple neural network that classifies images into certain labels.
Let’s say we are watching a documentary film about the sea world. And we want our neural network to classify each object in the film into certain categories.
As we can see from the above drawing (I drew it myself!), there is no connection between the different images fed into the neural network model. So, even though the sequence of the data itself contains clues or additional information, the network has no way to use them to boost its performance.
In a recurrent neural network, or RNN for short, however, it is possible to take a hint from previously fed data and use that additional information in the current task.
Basically, an RNN differs from a regular neural network in that it keeps a memory of the previous step's outcome. In other words, an RNN incorporates the dependency between inputs into its model.
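To make that idea of "memory" a bit more concrete, here is a minimal sketch (not from the original post; the sizes are picked arbitrarily) of the recurrence a single RNN cell performs in PyTorch:
import torch
import torch.nn as nn

# One RNN cell: the hidden state h carries information from earlier inputs
# into the current step, which is exactly the "memory" described above.
rnn_cell = nn.RNNCell(input_size=4, hidden_size=3)  # arbitrary toy sizes

inputs = torch.randn(6, 4)   # a toy sequence of 6 steps, each a 4-dim vector
h = torch.zeros(1, 3)        # initial hidden state: no memory yet

for x_t in inputs:
    # the new hidden state depends on the current input AND the previous hidden state
    h = rnn_cell(x_t.unsqueeze(0), h)

print(h)  # a summary of everything the cell has seen so far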
2. Text preprocessing:
One-hot encoding & Word Embedding
Text data, such as newspaper articles, movie reviews, or a Medium post like this one, is perhaps the most suitable kind of data for testing whether incorporating the dependency between units of data really boosts a model's performance.
However, since computers only understand numeric information such as scalars, vectors, or tensors, we need to transform human text into a tensor format so that the computer can follow our commands.
Let’s say we have a sentence to transform. “The cat sat on the mat.”
There are two different starting points: the character level or the word level. Here, I will start at the word level. The above sentence can be split as follows:
a = “The cat sat on the mat.”
list(a.split())
# ['The', 'cat', 'sat', 'on', 'the', 'mat.']
We just broke a sentence down into individual words. In general, this process is called 'tokenization', and each constituent word is called a 'token'.
Let’s give each word an index so that we can indicate each word simply with numbers.
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1
The resulting dictionary, stored in the variable "token_index", looks like this:
{'The': 1, 'ate': 8, 'cat': 2, 'dog': 7, 'homework.': 10, 'mat.': 6, 'my': 9, 'on': 4, 'sat': 3, 'the': 5}
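For comparison, the other starting point mentioned earlier, character-level tokenization, works in the same spirit, just with characters as tokens. This is a small illustrative sketch, not code from the original post:
# Character-level tokenization of the first sentence (illustrative sketch)
a = 'The cat sat on the mat.'

char_index = {}
for ch in a:
    if ch not in char_index:
        char_index[ch] = len(char_index) + 1

print(char_index)
# {'T': 1, 'h': 2, 'e': 3, ' ': 4, 'c': 5, 'a': 6, 't': 7, 's': 8, 'o': 9, 'n': 10, 'm': 11, '.': 12}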
2.1 One-hot encoding
Thanks to tokenization, we can replace each word with its own index number. The integer index can also be represented in a vector format, specifically as a one-hot vector. This kind of vector contains only 0s and 1s, with a single 1 marking the word's index. For example, the one-hot vector for the word 'cat', whose index number is 2, would be the following.
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]
In our toy example, we will only consider the first 10 words of each sample, a limit we fix with the variable max_length.
max_length = 10

results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
And results looks like the following.
# 'The cat sat on the mat.'
array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
# 'The dog ate my homework.'
[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
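As a side note (this is an addition, not part of the original code), PyTorch itself can produce the same kind of one-hot vectors with torch.nn.functional.one_hot, which is a handy cross-check of the NumPy version above:
import torch
import torch.nn.functional as F

# One-hot encode the word indices of the first sample with PyTorch
indices = torch.tensor([token_index[w] for w in samples[0].split()])  # tensor([1, 2, 3, 4, 5, 6])
one_hot = F.one_hot(indices, num_classes=max(token_index.values()) + 1)
print(one_hot)
# Each row has a single 1 at the word's index, matching the first block of `results` above
# (apart from the zero-padding rows and the float dtype).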
2.2 Word Embedding with PyTorch
In the one-hot encoding part above (2.1), our toy dataset contained only 10 different words. But what if we want to tokenize the works of William Shakespeare? The number of distinct words in his writing would obviously be far more than 10. Maybe 8,000? Or 12,000? I have no idea.
In that circumstance, each one-hot vector would become extremely long. To address this issue, the widely used and efficient method called 'word embedding' comes in.
As its name suggests, word embedding embeds a word into a vector of a fixed size. The size is usually a power of 2, such as 32, 64, or 256.
Using the token_index variable from the one-hot encoding section (2.1), let's see how word embedding works in PyTorch. The code is as follows.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)  # random seed

word_to_ix = token_index
embeds = nn.Embedding(10, 8)
# 10 different words are fed into the Embedding layer
# and 8-dimensional embeddings come out.
# (Note: token_index values run from 1 to 10, so index 10 would fall outside
#  a 10-row table; nn.Embedding(len(token_index) + 1, 8) would be safer.)

lookup_tensor = torch.tensor([word_to_ix["cat"]], dtype=torch.long)
cat_embed = embeds(lookup_tensor)
print(cat_embed)
And the word-embedded result of the word ‘cat’ is the following.
tensor([[-0.6970, -1.1608, 0.6995, 0.1991, 0.8657, 0.2444, -0.6629, 0.8073]], grad_fn=<EmbeddingBackward>)
In the above example, we set the embedding dimension to 8, but any number can be used.
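As a further illustrative step (not in the original post), the same embedding layer can map every word of a sentence at once, which is how it is typically used in practice:
# Embed all the words of the first sample in one call (illustrative sketch)
sentence_ix = torch.tensor([word_to_ix[w] for w in samples[0].split()], dtype=torch.long)
sentence_embed = embeds(sentence_ix)
print(sentence_embed.shape)  # torch.Size([6, 8]): one 8-dimensional vector per word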
So, in this post, we have studied two different text preprocessing techniques: one-hot encoding and word embedding. I hope this was helpful and worth your time. Thank you! 🚀 😺 🐶 ⛄️
Oh, I almost forgot, happy new year! 🙌