[TensorFlow 2.0] Text Classification with an RNN in TensorFlow
Continuing from the last post, which covered how an RNN works and how to implement one in a Keras environment, this one will focus on TensorFlow with some advanced techniques.
Then, as promised, I think it is time for us to go back and see how to preprocess raw text data. When that’s all done, my plan is to dive into advanced RNNs before the Lunar New Year holiday starts. Or maybe after 🌝.
I’m heading to Rome this holiday and I’m full of excitement and joy!!!!! Can’t wait!
This post is based on:
[1] Deep Learning with Python by François Chollet, pp. 207–224.
[2] the TensorFlow tutorial, so if you need more information, consult this link!
I. Baseline
1. Technical Setup
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow_datasets as tfds
import tensorflow as tf
2. Load IMDB data and preprocess
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
As we saw in word embeddings part 3, in TensorFlow we do not truncate the text. Rather, padded_batch zero-pads each batch up to the length of its longest text, so we shuffle and batch the datasets before training:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
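The dataset also ships with a subword encoder, which we will reuse when building the model. As a quick sanity check (a minimal sketch of my own; the sample sentence is the one from word embeddings part 3), you can inspect it like this:
encoder = info.features['text'].encoder
print(encoder.vocab_size)          # 8185 subword tokens

sample = 'Love your neighbor as yourself'
ids = encoder.encode(sample)       # list of integer token ids
print(ids)
print(encoder.decode(ids))         # decodes back to the original string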
3. Modeling
encoder = info.features['text'].encoder

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
4. Compiling and Fitting
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, validation_steps=30)
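Optionally (this step is not part of the original walkthrough), you can also score the trained model on the held-out test set:
test_loss, test_acc = model.evaluate(test_dataset)
print('Test loss: {:.3f}, test accuracy: {:.3f}'.format(test_loss, test_acc))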
5. Plotting
import matplotlib.pyplot as plt

history_dict = history.history
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)
plt.figure(figsize=(4,3))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(4,3))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()
From now on, Steps 4 and 5 (training and plotting) will be used for every model but will not be spelled out again in this post, to save space.
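One convenient way to avoid that repetition (my own refactor; the helper names compile_and_fit and plot_graphs are made up for this post) is to wrap Steps 4 and 5 in small functions:
import matplotlib.pyplot as plt

def compile_and_fit(model, epochs=10):
    # Step 4: same loss, optimizer, and metric as above.
    model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(1e-4),
                  metrics=['accuracy'])
    return model.fit(train_dataset, epochs=epochs,
                     validation_data=test_dataset, validation_steps=30)

def plot_graphs(history, metric):
    # Step 5: plot a training curve together with its validation counterpart.
    plt.plot(history.history[metric], 'bo', label='Training ' + metric)
    plt.plot(history.history['val_' + metric], 'b', label='Validation ' + metric)
    plt.xlabel('Epochs')
    plt.ylabel(metric)
    plt.legend()
    plt.show()
Each experiment below then boils down to history = compile_and_fit(model) followed by plot_graphs(history, 'accuracy') and plot_graphs(history, 'loss').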
The result above is a real mess 😫.
Then what can we do to improve this network?
II. Advancements Proposed
We have made quite a bit of progress, starting from word embeddings with Dense layers and moving on to LSTM layers. However, the problems left unsolved are overfitting and low validation accuracy/high validation loss. Also, although the LSTM did a good job of keeping track of state information across time steps, let’s not assume everything is settled.
Some useful improvement techniques are the following:
- Bidirectional recurrent layer — it presents the same information to a recurrent network in two different ways (two directions), so it increases accuracy and soothes the short-term memory loss :).
- Recurrent dropout — a classic way to fight overfitting that can also be adopted in RNNs.
- Stacking recurrent layers — the typical way to increase network capacity is to add more units to the layers and to add more layers (once you think your network is no longer overfitting).
III. Bidirectional Recurrent layer
A bidirectional recurrent layer, or BRNN, is a very convenient strategy when it comes to natural language analysis.
First, since a bidirectional RNN wraps two regular RNNs, it has double the number of parameters of the corresponding unidirectional layer. Each of the two underlying layers works separately according to its own mechanism. By doing so, a bidirectional RNN catches more details than a unidirectional RNN can.
Also, an RNN is very order-sensitive, and a bidirectional RNN exploits exactly that: it looks at its input sequence both in the normal/chronological order and in reversed order.
A brief sketch of the mechanism of a bidirectional RNN is as follows.
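To make that concrete, here is a toy example (the shapes are my own choice, not from the tutorial): tf.keras.layers.Bidirectional runs two copies of the wrapped layer, one over the sequence as given and one over the reversed sequence, and by default concatenates their outputs.
import tensorflow as tf

inputs = tf.random.normal((2, 10, 8))   # (batch, time steps, features)

forward_only = tf.keras.layers.LSTM(16)
bidirectional = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16))

print(forward_only(inputs).shape)    # (2, 16): one direction only
print(bidirectional(inputs).shape)   # (2, 32): forward and backward outputs concatenated
This doubling of the output dimension (and of the parameter count) is exactly what we will see in the model summary below.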
3. Modeling
encoder = info.features['text'].encoder

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
We practiced encoding in word embeddings part 3. Remember the “Love your neighbor as yourself” example? There were 8,185 vocabulary entries in this encoder.
- Embedding: 8185*64 = 523,840
- Remember how the parameters of an LSTM layer are calculated? 4*(64+64+1)*64 = 33,024. See the comments for the details.
- In a bidirectional RNN, double that number: 33,024*2 = 66,048, and note that the output shape is (None, 128).
- Dense: (128+1)*64 = 8,256
- Output Dense: (64+1)*1 = 65
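Putting the arithmetic together (a quick sanity check of my own, using the LSTM parameter formula 4*(input_dim + units + 1)*units):
embedding = 8185 * 64                        # 523,840
lstm_one_direction = 4 * (64 + 64 + 1) * 64  # 33,024 per direction
bidirectional = 2 * lstm_one_direction       # 66,048
dense_hidden = (128 + 1) * 64                # 8,256: the concatenated 128-dim output, plus bias
dense_output = (64 + 1) * 1                  # 65
print(embedding + bidirectional + dense_hidden + dense_output)  # 598,209, which model.summary() should report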
Compared to the simple LSTM example, this is a tremendous improvement. However, we can observe severe overfitting. As mentioned earlier, dropout can soothe the overfitting. Let’s see, then.
IV. Dropout
Dropout is a classic solution to overfitting in machine learning: it randomly drops out input units of a layer to prevent coincidental correlations within the training data. According to some studies, it can also be applied to recurrent layers if the same dropout mask (the pattern of dropped units) is applied at every time step.
To see what dropout can achieve, let’s go back to the previous example of an LSTM on the IMDB data from the last post.
1. Baseline model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(max_features, 32))  # max_features as defined in the last post
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
2. With dropout
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
- dropout specifies the dropout rate for the inputs of the layer.
- recurrent_dropout specifies the dropout rate of the recurrent units.
Can you note the difference?
Note that dropout is handled a bit differently in TensorFlow.
We’ll see it below.
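One more detail worth keeping in mind (my own aside, with toy shapes): both kinds of dropout are only active during training. When the layer is called in inference mode, the output is deterministic:
import tensorflow as tf

lstm = tf.keras.layers.LSTM(4, dropout=0.2, recurrent_dropout=0.2)
x = tf.random.normal((1, 5, 8))   # (batch, time steps, features)

print(lstm(x, training=True))     # stochastic: a fresh dropout mask is drawn
print(lstm(x, training=False))    # deterministic: dropout is switched off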
V. Stacking two LSTM layers — all together
Now that we have mildly tamed the overfitting problem with dropout, we have our model under control. Next, we want to increase the capacity of our model. Even in a feedforward network with Dense layers, we add layers and units to improve accuracy. One thing we have to take care of is that the intermediate RNN layers must return the full sequence of outputs (a 3D tensor) by specifying return_sequences=True.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
- First, the intermediate LSTM layer returns a 3D output (one vector per time step).
- Embedding: 8185*64 = 523,840
- First bidirectional LSTM: {4*(64+64+1)*64}*2 = 66,048
- Second bidirectional LSTM: {4*(128+32+1)*32}*2 = 41,216. Here, the output of the previous bidirectional layer becomes the input of this layer, and it is 128-dimensional. Also, this second layer has 32 units, so its state is 32-dimensional.
- Dense: (64+1)*64 = 4,160 (the second bidirectional layer outputs 2*32 = 64 dimensions)
- Output Dense: (64+1)*1 = 65
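Extending the earlier sanity check to this stacked model (again my own arithmetic, using the same LSTM formula):
second_lstm_one_direction = 4 * (128 + 32 + 1) * 32   # 20,608: its input is the 128-dim output of the first bidirectional layer
second_bidirectional = 2 * second_lstm_one_direction  # 41,216
dense_hidden = (64 + 1) * 64                          # 4,160: the second layer's concatenated output is 64-dimensional
print(523840 + 66048 + second_bidirectional + dense_hidden + 65)  # 635,329 trainable parameters in total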
So, is this better than the first example, the one with a single bidirectional LSTM layer and no dropout? Let’s look at the plots.
1. Bidirectional layer w/o dropout
2. Two bidirectional LSTM layers w/ dropout
Do you see the difference? Depending on your perspective, and considering the extra computation time for the additional bidirectional layer, it looks like only a modest improvement.
- The validation accuracy increased from around 86% to 88%.
- It seems like we have mitigated the overfitting problem, but not enough.
That is all for this post; the following one will be on how to preprocess raw text data. See you then!