For those of you who have not subscribed to Medium, use our Friend’s Link!!
How was your weekend? My cousin got married this Sunday!!! 👰 🎉 🎉 I am still stuck in that emotional moment 😿 and under the side effects of a bit too much wine 🍷.
Anyway, we have to learn RNNs this week. My plan is to start with RNN basics and some explanations in this post, and then cover more advanced topics in the following posts.
I hereby acknowledge that this post is based on
Deep Learning with Python by François Chollet pp.196–206
I. What is RNN?
First of all, what does ‘recurrent’ mean? The word reminds me of a musical notation: the repeat sign. As in music, parts of the network are repeated in loops in an RNN. The networks we have dealt with so far are called feedforward networks; they are like a song without any repeat sign, so all you have to do is go forward.
An RNN, on the other hand, has an internal loop and maintains a state, which holds information about what the sequence has been through during the loop. There are only 2 concepts we have to clarify to understand RNNs: loop and state. It is fairly simple if we put an RNN this way: a for loop that reuses quantities computed during the previous iteration of the loop, in other words, the state.
Now, we have to mention how the output is computed. The input and the state are parameterized by two matrices, W and U respectively, plus a bias vector. Then the result goes through an activation function, whichever one you choose.
output_t = activation(W*input_t + U*state_t + bias)
- t is for time
- The initial state can be any vector, even a zero vector.
- The RNN layers we will look at very soon, i.e., the SimpleRNN, LSTM and GRU layers, all follow a very similar mechanism in the sense that they learn the most adequate W’s and U’s; the weights.
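To make the loop-and-state idea concrete, here is a minimal NumPy sketch of such a forward pass. The sizes, the random weights and the tanh activation are my own picks for illustration; it is very much in the spirit of the toy example in the book.
import numpy as np
timesteps = 100          # number of time steps in the input sequence
input_features = 32      # dimensionality of the input at each time step
output_features = 64     # dimensionality of the output/state
inputs = np.random.random((timesteps, input_features))  # dummy input data
state_t = np.zeros((output_features,))                  # initial state: a zero vector
W = np.random.random((output_features, input_features)) # randomly initialized weights,
U = np.random.random((output_features, output_features))# just for illustration
b = np.random.random((output_features,))
successive_outputs = []
for input_t in inputs:                                   # the loop over time steps
    # output_t = activation(W*input_t + U*state_t + bias)
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t                                   # the output becomes the state for the next step
final_output_sequence = np.stack(successive_outputs, axis=0)  # shape (timesteps, output_features)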
Enough of the brief overview, let’s go deeper into the details.
II. SimpleRNN in Keras
Let’s start with the simplest RNN. In this section we will see the basics of RNNs in Keras. The concept is very simple: the output of the previous time step is used as the state, and this is repeated for a certain number of iterations.
1. SimpleRNN has 2 output modes
- It takes as input a 3D tensor of shape (batch_size, time_steps, input_features).
- Then, it can return a 2D tensor of shape (batch_size, output_features), which is just the last output for each input sequence.
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.summary()
- Or it can return a 3D tensor of shape (batch_size, time_steps, output_features), which is the full sequence of successive outputs for each time step, by adding return_sequences=True.
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32,return_sequences=True))
model.summary()
Can you see the difference in the output shape? Without return_sequences the SimpleRNN output shape is (None, 32); with return_sequences=True it becomes (None, None, 32).
2. As usual, let’s check Param #; while doing so, we will get used to the mechanism of the SimpleRNN
- 10,000*32=320,000 ←we have done this in word-embeddings
- (32+32+1)*32=2080.
- The first 32 is the dimensionality of the word embedding, which is the input to the RNN layer at each time step. (W)
- The second 32 is the dimensionality of the output (the state) from the previous time step, which is set by SimpleRNN(32). (U)
- The 1 is for the bias.
- Lastly, the final *32 is the number of units defined in SimpleRNN(32); each unit has its own copy of these weights.
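If you would like to double-check this arithmetic, a tiny helper of my own (not a Keras function) reproduces the count:
def simple_rnn_params(input_dim, units):
    # each of the `units` outputs has `input_dim` weights for W,
    # `units` weights for U, and 1 bias
    return (input_dim + units + 1) * units
print(simple_rnn_params(32, 32))  # 2080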
Then, can you see how the number of parameters in the following model is computed?
model = Sequential()
model.add(Embedding(10000, 32)) #32*10,000
model.add(SimpleRNN(64,return_sequences=True)) #(32+64+1)*64=6208
model.add(SimpleRNN(32,return_sequences=True)) #(64+32+1)*32=3104
model.add(SimpleRNN(32,return_sequences=True)) #(32+32+1)*32=2080
model.summary()
Pop quiz! Experiment yourself then check the comments 🤞
model = Sequential()
model.add(Embedding(10000, 1))
model.add(SimpleRNN(64,return_sequences=True))
model.add(SimpleRNN(3,return_sequences=True))
model.add(SimpleRNN(2))
model.summary()
3. Implement on IMDB data
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.layers import Dense
max_features = 10000
maxlen = 500
batch_size = 32
- Importing the data
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
Loading data…
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 500)
input_test shape: (25000, 500)
- Modeling
model = Sequential()
model.add(Embedding(max_features, 32)) #max_features=10,000 so 10,000*32=320,000
model.add(SimpleRNN(32)) #(32+32+1)*32=2080
model.add(Dense(1, activation='sigmoid'))#(32+1)*1=33
model.summary()
- Compiling and fitting
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
# 25,000*0.8=20,000 → train on 20,000 samples, 5,000 left for validation
- Plotting
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The validation accuracy goes up to about 86%, which is not very outstanding. One drawback is that we only considered 500 words per review (maxlen = 500). Also, the SimpleRNN isn’t well suited to processing long sequences, and texts are exactly that!
III. LSTM in Keras
LSTM is an abbreviation of Long Short-Term Memory. It is one of the solutions to the vanishing gradient problem of our SimpleRNN. That does sound rather vague, though.
Let me put it this way.
Have you seen the movie Finding Dory? It is a lovely movie 🐠
In that movie Dory suffers from short-term memory loss; she suddenly recalls her parents, and that is how the journey begins.
Our poor Dory is the SimpleRNN, which fails to retain, at time t, information about inputs seen many time steps before (although it should theoretically be able to do so). This problem is called the vanishing gradient problem, or, between you and me, short-term memory loss.
And, as you know, Dory is accompanied by Nemo and his father. Dory alone would never have been able to find her parents (spoiler alert!). Whenever Dory is lost and forgets why she is away from home, her friends are there to remind her.
So it is with the SimpleRNN. It needs another data flow that carries information across time steps (or across the journey). So, in LSTM we add a carry track.
Then, how is the carry track computed?
output_t = activation(Uo*state_t + Wo*input_t + Vo*C_t + bo)
i_t = activation(Ui*state_t + Wi*input_t + bi)
f_t = activation(Uf*state_t + Wf*input_t + bf)
k_t = activation(Uk*state_t + Wk*input_t + bk)
c_t+1 = i_t * k_t + c_t * f_t
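Written out as a single time step, the pseudo-code above looks roughly like the NumPy sketch below. The weights are random placeholders and the sigmoid/tanh activations are my own choices, so treat it as an illustration of the idea rather than of Keras’ exact implementation. (i, f and k play the roles usually called the input gate, forget gate and candidate values.)
import numpy as np
units, input_features = 32, 32
rng = np.random.default_rng(0)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
# placeholder weights, one set per transformation in the equations above
Uo, Wo, Vo, bo = rng.random((units, units)), rng.random((units, input_features)), rng.random((units, units)), rng.random(units)
Ui, Wi, bi = rng.random((units, units)), rng.random((units, input_features)), rng.random(units)
Uf, Wf, bf = rng.random((units, units)), rng.random((units, input_features)), rng.random(units)
Uk, Wk, bk = rng.random((units, units)), rng.random((units, input_features)), rng.random(units)
def lstm_step(input_t, state_t, c_t):
    i_t = sigmoid(Ui @ state_t + Wi @ input_t + bi)                  # how much new information to let in
    f_t = sigmoid(Uf @ state_t + Wf @ input_t + bf)                  # how much of the old carry to keep
    k_t = np.tanh(Uk @ state_t + Wk @ input_t + bk)                  # candidate values
    output_t = sigmoid(Uo @ state_t + Wo @ input_t + Vo @ c_t + bo)  # output, which also becomes state_t+1
    c_t_next = i_t * k_t + c_t * f_t                                 # the updated carry track
    return output_t, c_t_next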
Now, let’s have some work done.
- Modeling
We have already imported the dataset; if you are starting from this line, go up and copy those parts first.
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32)) #max_features=10,000 so 320,000
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Where did 8320 come from?
output_t = activation(Uo*state_t + Wo*input_t + Vo*C_t + bo).
Here, Uo and Wo each contribute 32 parameters per unit, plus 1 for the bias bo. Then, what about C_t?
i_t = activation(Ui*state_t + Wi*input_t + bi)
f_t = activation(Uf*state_t + Wf*input_t + bf)
k_t = activation(Uk*state_t + Wk*input_t + bk)
These 3 transformations each have 32+32+1=65 parameters per unit, so C_t, which is computed from i_t, f_t and k_t, accounts for 65*3=195 parameters per unit.
So, (32+32+195+1)*32=8320.
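Equivalently, Keras stores the LSTM weights as four groups (for i, f, k and the output transformation), each with (32+32+1)*32=2080 parameters, which gives the same total. A quick sanity check with a small helper of my own:
def lstm_params(input_dim, units):
    # four transformations, each with W (input_dim x units),
    # U (units x units) and a bias vector (units)
    return 4 * (input_dim + units + 1) * units
print(lstm_params(32, 32))  # 4*2080 = 8320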
- Compiling and fitting
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
- Plotting (the plotting code is the same as in the SimpleRNN section above)
This time, the validation accuracy goes up to about 88%, better than the SimpleRNN but still not perfect. This is mainly because LSTM shines at capturing the global, long-term structure of texts, which is not what simple sentiment analysis needs most. That is why LSTM shows its real power on tasks like machine translation.
So this is it for this post, and I will soon be back with RNNs in TensorFlow 2.0. See you then!