[EasyPeasyPyTorch] 02. Recurrent Neural Networks (Concepts and deployment)

A Ydobon
Jan 16, 2020 · 6 min read


Now that we have studied how to transform words into computable tensors, let’s dive deeper into deploying what we learned in real data analysis.

Previously we have checked how to transform words into tensors. (The previous post [EasyPeasyPyTorch] 01. Word Embedding)

We will soon put that method into practice by conducting sentiment analysis on the IMDB dataset using PyTorch. In particular, this posting uses one-hot encoding, so if that does not ring a bell, it would be a good idea to take a few minutes to click the link above and review the idea.

Ready? Then, let’s briefly check what a Recurrent Neural Network (RNN) is and walk through its short history.

According to the Google dictionary, the word ‘recurrent’ means ‘occurring often or repeatedly’.

<Figure 01 — Search result of ‘recurrent meaning’>

Unlike a Feed Forward Neural Network (FFNN), an RNN performs the same task repeatedly on each element of a sequence.

<Figure 02 — FFNN and RNN>

As a result, an RNN performs better when analyzing sequential data such as text or speech, because repeating the same task helps the algorithm capture the temporal dependency within the dataset. This temporal dependency, in other words the time-varying structure of the data, is missed entirely by a regular Feed Forward Neural Network.

Before checking out the details of capturing temporal dependency, let’s briefly look over the history of the RNN. It won’t take long, and it will be refreshing!

Before RNNs existed, people tried to incorporate time-varying features into their models. The most straightforward way to do so is to put the time-lagged values directly into the feature set. This model is called the Time Delay Neural Network, TDNN for short.

In a normal Feed Forward Neural Network, x(t) alone is used to generate the outcome y(t). In a Time Delay Neural Network, however, not only x(t) but also x(t-1), x(t-2), and so forth, up to a pre-determined window size, work together to produce a better-fitted outcome y(t). This algorithm was published in 1989.

<Figure 03 — Time Delay Neural Network, TDNN>
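To make the windowing idea concrete, here is a minimal sketch (a hypothetical toy example, not the original 1989 architecture): we stack each x(t) with its lagged values and feed the flattened window to an ordinary feed-forward network.

import torch
import torch.nn as nn

WINDOW = 3       # size of the time-lag window: x(t), x(t-1), x(t-2)
N_FEATURES = 8   # dimension of each x(t)

mlp = nn.Sequential(
    nn.Linear(WINDOW * N_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

x = torch.randn(100, N_FEATURES)                # a toy sequence of 100 steps
windows = x.unfold(0, WINDOW, 1)                # (98, N_FEATURES, WINDOW)
windows = windows.reshape(windows.size(0), -1)  # flatten each window
y_hat = mlp(windows)                            # one prediction per window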

This model has clear advantages and disadvantages.

  • Advantage: it succeeded in incorporating previous time features, capturing part of the time-varying properties of the data.
  • Disadvantage: it can only capture a limited amount of time dependency, bounded by the size of the time-lag window.

Then, in the 1990s, the algorithm of our interest, the RNN, came out. It successfully dealt with the disadvantage of the TDNN. However, another obstacle appeared: the ‘vanishing gradient’, which refers to the decaying contribution of gradients as the sequence gets longer. Because of this, the RNN fell short of expectations, as it could not properly process long sequences of data.

Then a new algorithm came out. Long Short Term Memory, LSTM for short, is an extension of the RNN: each of its cells is equipped with extra gates that allow it to process longer sequences of data.

(You can search ‘LSTM’ on Google, and you will find tons of pictures of these unique computation cells.)

Thanks to these gates, some signals can be kept inside as a state vector and reintroduced at later time steps to boost the performance of the algorithm.
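To see that extra state in action, here is a quick, purely illustrative look at PyTorch’s built-in nn.LSTM (we will not use it in this posting): alongside the per-step outputs, it returns both a final hidden state and the cell state that the gates maintain across time steps.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20)
seq = torch.randn(5, 3, 10)                 # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(seq)              # c_n is the gated cell state
print(output.shape, h_n.shape, c_n.shape)   # [5, 3, 20] [1, 3, 20] [1, 3, 20]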

So far, we have traced a brief history from Feed Forward Neural Networks to Long Short Term Memory; a detailed explanation of each algorithm will follow in a separate posting.

So, enough appetizer; now it is time for the main dish: deploying an RNN on the IMDB text data for sentiment analysis.

Here, we focus on building the RNN itself, so the performance will be somewhat mediocre in this part. However, higher performance will be achieved in the next posting with only slight changes. So, let’s get started!

  1. Loading the libraries and the dataset
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext import data
from torchtext import datasets
import random
SEED = 1  # any number you favor can be assigned here
torch.manual_seed(SEED)
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

We imported the required libraries: torch as the main tool and torchtext to load the text dataset. We also set a ‘SEED’ value so that results can be replicated while tuning the model for better performance. Finally, ‘TEXT’ and ‘LABEL’ fields are prepared to store the text and the label of the dataset.

print(len(train_data), len(test_data))

And we find out each train and test dataset contain 25,000 units of examples.
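If you are curious what a single example looks like, you can peek at its parsed fields:

print(vars(train_data.examples[0]))  # e.g. {'text': ['this', 'film', ...], 'label': 'pos'}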

To prevent overfitting to our training data, it is always a good idea to split off a validation set and periodically check the performance of our model on it. Out of the whole training dataset, we will pull out 20% as a validation set and use only the remaining 80% to train our model.

train_data, valid_data = train_data.split(random_state = random.seed(SEED), split_ratio = 0.8)
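A quick sanity check on the sizes (with split_ratio = 0.8, this should print 20000 and 5000):

print(len(train_data), len(valid_data))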

So, we have prepared three different datasets: train, validation, and test. However, as you already know, our smart machines cannot operate on raw text. Therefore, we will build a table that the machine can look up whenever it faces a word; in this table, each word is assigned a unique index.

MAX_VOCAB_SIZE = 25000
TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

We will only use the 25,000 most frequent words. Since we limited the number of unique words, the dimension of a one-hot vector is bounded accordingly. (To be precise, torchtext also adds two special tokens, <unk> for out-of-vocabulary words and <pad> for padding, so the vocabulary size is 25,002.)
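You can inspect the finished lookup table directly through the vocab attribute:

print(len(TEXT.vocab))         # 25002: the 25,000 words plus <unk> and <pad>
print(TEXT.vocab.itos[:4])     # index -> string, starts with the special tokens
print(TEXT.vocab.stoi['the'])  # string -> index
print(LABEL.vocab.stoi)        # e.g. {'neg': 0, 'pos': 1}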

BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

The benefit of using ‘BucketIterator’ is that it groups examples of similar length into the same batch. By using it, we can minimize the amount of zero padding.
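To verify, you can pull a single batch out of the iterator and inspect its shape; the sequence length differs from batch to batch because of the bucketing:

batch = next(iter(train_iterator))
print(batch.text.shape)   # (seq_len, 64): seq_len varies per batch
print(batch.label.shape)  # (64,)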

Now, let’s build our model!

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)  # word index -> dense vector
        self.rnn = nn.RNN(embedding_dim, hidden_dim)             # the recurrent layer
        self.fc = nn.Linear(hidden_dim, output_dim)              # final hidden state -> prediction

    def forward(self, text):
        # text: (seq_len, batch_size)
        embedded = self.embedding(text)      # (seq_len, batch_size, embedding_dim)
        output, hidden = self.rnn(embedded)  # hidden: (1, batch_size, hidden_dim)
        return self.fc(hidden.squeeze(0))    # (batch_size, output_dim)

And now, let’s specify parameters.

INPUT_DIM = len(TEXT.vocab) 
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
OUTPUT_DIM = 1

We will plug each of these parameters into our model.

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
model = model.to(device)  # move the model to the same device as the batches
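As an optional sanity check, you can count the model’s trainable parameters (with the settings above this comes to roughly 3.3 million, dominated by the embedding layer):

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {n_params:,} trainable parameters')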

Time to select an optimizer!

optimizer = optim.Adam(model.parameters(), lr = 1e-3)

As the loss function (criterion), Binary Cross Entropy With Logits will be used. It applies a sigmoid to the model’s raw outputs (the logits), squashing them into the [0, 1] range, and then computes binary cross entropy.

criterion = nn.BCEWithLogitsLoss()
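As a small numeric illustration (toy numbers, separate from the training code), BCEWithLogitsLoss is equivalent to applying a sigmoid and then ordinary binary cross entropy, only more numerically stable:

logits = torch.tensor([2.0, -1.0])   # raw model outputs
targets = torch.tensor([1.0, 0.0])   # true labels
loss_a = nn.BCEWithLogitsLoss()(logits, targets)
loss_b = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss_a.item(), loss_b.item())  # (near-)identical values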

To train the model, we will use the following function, which runs one full epoch over the training data.

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    model.train()  # switch to training mode
    for batch in iterator:
        optimizer.zero_grad()                       # reset gradients
        predictions = model(batch.text).squeeze(1)  # (batch_size,)
        loss = criterion(predictions, batch.label)
        loss.backward()                             # backpropagate
        optimizer.step()                            # update parameters
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

To evaluate the model, we will use the following function. Since we do not want to update the parameters while measuring performance, there are a few changes compared with the train function above. It will run on the validation and test data.

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    model.eval()           # switch to evaluation mode
    with torch.no_grad():  # disable gradient tracking: no parameter updates
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

Let’s run the model!

N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    train_loss = train(model, train_iterator, optimizer, criterion)
    valid_loss = evaluate(model, valid_iterator, criterion)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best_param.pt')  # keep the best parameters so far
    print("Epoch: ", epoch + 1)
    print("Train loss: ", train_loss)
    print("Validation loss: ", valid_loss)

We constantly check the evaluation result on the validation dataset and store the best-performing set of parameters under the name ‘best_param.pt’. The saved model will then be used to make the final predictions on the test dataset.

Now, the final step: making predictions!

model.load_state_dict(torch.load('best_param.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print('Test loss: ', test_loss)
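Finally, if you would like to try the trained model on a sentence of your own, a small helper like the one below works. (This predict_sentiment function is a hypothetical addition, not part of the posting’s pipeline, and the spaCy model name may differ across versions.)

import spacy
nlp = spacy.load('en_core_web_sm')  # or 'en', depending on your spaCy version

def predict_sentiment(model, sentence):
    model.eval()
    tokens = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokens]       # words -> indices
    tensor = torch.LongTensor(indexed).unsqueeze(1)      # shape (seq_len, 1)
    prob = torch.sigmoid(model(tensor.to(device)))       # logit -> probability
    return prob.item()                                   # close to 1 = positive

print(predict_sentiment(model, "This film is great"))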

My test loss was 0.6939689666413895, around 0.69. For binary cross entropy, random guessing gives a loss of ln(2) ≈ 0.693, so our model has barely learned anything; perhaps not as much as you expected. Therefore, we will try to improve this model in the next posting. See you soon!!
