
Reading Time: 8 Minutes
Join me in this post to decode the concept of RNNs (Recurrent Neural Networks) – a neural network that is specifically built for prediction and classification on data that changes over time.
Recurrent Neural Networks are a fairly dated concept and are not used directly in modern transformer-based applications. But learning them will help you build a strong foundation in AI. Many of the concepts discussed in this post, such as processing input data, choosing activation and loss functions, and back-propagation, are still relevant in modern architectures.
I started my research to learn RNNs and found some important information in this paper on using RNNs for sequence learning, which discusses the basics, the popular variants, and the advancements of RNNs. However, I couldn't fully understand the internal working and back-propagation, so after a significant amount of internet surfing and research I found a simple way to learn RNNs. This post presents my condensed understanding of RNNs with easy-to-understand visuals, math and Python code.
If you’re already a reader of sapiencespace, you might have guessed the structure of this post. Yes, we will uncover the math behind the nets, visualize them and implement them in Python. This concept is very close to me now, because I always wanted to learn it; there were aspects I couldn't understand at first, and it took me a lot of time and resources to learn it properly. But here it is, enjoy yourselves.
I highly recommend checking out my previous posts on neural networks to get a firm grasp of the basic concepts. Here are some helpful links:
- Basics of Neural Networks: sapiencespace.com/neural-networks/
- Forward Propagation: sapiencespace.com/mastering-neural-networks-1-a-comprehensive-guide-to-forward-propagation/
- Backward Propagation: sapiencespace.com/mastering-neural-network-2-back-propagation/
- Gradient Descent: sapiencespace.com/breaking-down-gradient-descent-in-4-minutes/
- Tips to Train High Performance Neural Networks: sapiencespace.com/art-of-training-neural-networks-18-secrets/
However, if you are comfortable with the fundamentals, let's get started.
Recurrent Neural Networks, as the name suggests, use the concept of recurring units over time to understand and make predictions on sequential data. They are used in tasks where context is important, such as Natural Language Processing, Time Series Forecasting, Speech Recognition, etc.
Working Principle of RNNs
- Input Sequence Processing: RNNs take sequential input one time step at a time.
- Hidden State: The network maintains a hidden state, which captures information about previous time steps.
- Shared Weights: The same set of weights is used across all time steps, making the model efficient for sequences of varying lengths.

By carrying information forward from previous time steps, the network gains the ability to store contextual information. For this post, I will take the example of predicting the next character (letter) in a word; we will explore the details soon after understanding the architecture's underlying logic.
The functioning of Recurrent Neural Networks is very simple: we initialize a bunch of parameters, pass the inputs forward through some calculations and activations, and finally use a function like softmax to normalize the predicted outcomes. Then we deploy back-propagation to tune the parameters to solve our intended task.
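To make this concrete, here is a minimal sketch of a single recurrent step written as a standalone helper; the function name rnn_step and its signature are my own choices for illustration and are not part of the code we build later.
import numpy as np
def rnn_step(x_t, h_prev, W_xh, W_hh, W_ho, b_h, b_o):
    # Mix the current input with the previous context to get the new hidden state
    h_t = np.tanh(np.dot(W_xh, x_t) + np.dot(W_hh, h_prev) + b_h)
    # Raw (unnormalized) scores over the vocabulary
    y_t = np.dot(W_ho, h_t) + b_o
    return h_t, y_t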
In the given diagram, let's say we have 3 sequential units (left to right). Each unit is connected through the hidden layer, where the previous hidden state is used when computing the next unit's output values.
I have chosen a simple task: predicting the next letter based on the training data 'CAT'. It is not an abbreviation for some complex dataset; it is just the three letters 'C', 'A', 'T', and our task is to predict the next letter given a letter from the data. I know this sounds very childish and has no practicality, but trust me, starting with a toy example is the best way to understand a complicated concept easily. We'll go through a real-world application in the next post.
But before starting with the logic, we need to know our end goal. Of course we want to predict 'a' when 'c' is given as input and 't' when 'a' is given, but from a technical standpoint we should also know what loss value to expect. Having this insight helps us train our model efficiently without under- or over-fitting it. Here the negative log-likelihood gives us a reference point: with a vocabulary of size 3, a model that guesses uniformly at random has a per-character loss of -np.log(1/3) = 1.099 (approx., using the natural log that the code below uses), so the training loss should start near this baseline and then fall steadily towards zero as the model learns the toy mapping.
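As a quick sanity check, here is the arithmetic behind that reference value, using the same natural-log loss that the code later in this post uses:
import numpy as np
# Per-character loss of a model that guesses uniformly over a 3-character vocabulary
print(-np.log(1/3))      # ~1.0986
# Summed over the 3 time steps of one pass through 'CAT', before any training
print(3 * -np.log(1/3))  # ~3.2958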
The Forward Propagation Phase
As usual in neural networks, we initialize the weights and biases for each layer and pass the computed values through an activation function.
Here the weights at the input layer are stored in W_xh with dimensions hidden_size X vocab_size. Each column possibly represents the character we would like to process at a given point. I say possibly because of the varying nature of inputs: since we will be one-hot encoding the input, we end up extracting only the column corresponding to the current input character.
To make things easier to understand, I have labelled the columns with the input characters, as these are the values that will be extracted at each step.

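Here is a small sketch of what that column extraction looks like in practice; the toy values inside W_xh below are made up purely for illustration:
import numpy as np
vocab = {'c': 0, 'a': 1, 't': 2}
# One-hot vector for 'a' -> index 1
x_t = np.zeros((len(vocab), 1))
x_t[vocab['a']] = 1
# Toy weight matrix: rows = hidden units, columns = the characters 'c', 'a', 't'
W_xh = np.arange(9).reshape(3, 3)
# Multiplying by a one-hot vector simply pulls out the column labelled 'a'
print(np.dot(W_xh, x_t))  # same as W_xh[:, [vocab['a']]]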
Similarly, we initialize weights and biases for the hidden and output layer.
import numpy as np
# Vocabulary (a full alphabet would map a -> 0, ..., z -> 25; here we only need c, a, t)
vocab = {'c': 0, 'a': 1, 't': 2}
vocab_size = len(vocab)
# Sequence to learn ("c", "a", "t")
sequence = ['c', 'a', 't']
hidden_size = 3 # Number of hidden units
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01 # Input to hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01 # Hidden to hidden
W_ho = np.random.randn(vocab_size, hidden_size) * 0.01 # Hidden to output
b_h = np.zeros((hidden_size, 1)) # Hidden bias
b_o = np.zeros((vocab_size, 1)) # Output bias
Some of you might argue that the ordering of dimensions should be changed, since it feels more natural to think of the flow as vocab_size -> hidden_size -> vocab_size. But the calculations that follow are easier when the weights for each element at a given time step are arranged column-wise. This is especially helpful in forward propagation, which focuses on using the parameter matrices to weigh the inputs and the computed hidden states.
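If you would like to convince yourself that the shapes line up with this columnar arrangement, here is a quick check; it assumes the W_xh, W_hh, W_ho, vocab_size and hidden_size defined in the snippet above:
x_t = np.zeros((vocab_size, 1))      # one-hot input, vocab_size x 1
h_prev = np.zeros((hidden_size, 1))  # hidden state, hidden_size x 1
print(np.dot(W_xh, x_t).shape)       # (hidden_size, 1) -- input contribution to the hidden state
print(np.dot(W_hh, h_prev).shape)    # (hidden_size, 1) -- previous hidden state's contribution
print(np.dot(W_ho, h_prev).shape)    # (vocab_size, 1)  -- raw output scores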

The first propagation is through the hidden layer, which uses the tanh activation function. Since this is an RNN, we perform two matrix multiplications:
- W_xh X x_t: computing the weighted outcome of the input weights and the input values
- W_hh X h_prev: accounting for the influence of the previous hidden state on the current hidden unit.
Then the bias term b_h (the bias for the hidden layer) is added, and the whole sum is passed through the tanh function.
Next, the output value is computed by multiplying the computed h_t with the initialized W_ho weight parameters and adding b_o (the bias for the output). We will then use a softmax function to normalize the raw scores before computing the loss for back-propagation.
Each column of the weight matrix holds the weights for one character of the vocabulary. During forward propagation, the input can be seen as three rows of one-hot data, one row per time step, and at each successive time step we compute the prediction for the next character.
h_prev = np.zeros((hidden_size, 1)) # Initial hidden state (t = -1)
for char in sequence:
    x_t = one_hot_encode(char, vocab)  # One-hot encoded input (helper defined in the complete code below)
    h_t = np.tanh(np.dot(W_xh, x_t) + np.dot(W_hh, h_prev) + b_h)  # Hidden state
    y_t = np.dot(W_ho, h_t) + b_o  # Output (raw scores)
    h_prev = h_t  # Update hidden state for the next time step
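To see how these raw scores are turned into probabilities and a loss, here is a short sketch continuing from the loop above; softmax is written exactly as in the complete code below, and picking 'c' as the correct next character simply reflects that the last input was 't', which wraps back to 'c' in our toy data:
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp_x / np.sum(exp_x)
y_pred = softmax(y_t)               # probabilities over {'c', 'a', 't'} for the last time step
target_idx = vocab['c']             # the last input was 't', and 't' wraps back to 'c'
loss = -np.log(y_pred[target_idx])  # cross-entropy for that single step
print(y_pred.ravel(), float(loss))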
In this example, we will be using cross-entropy as the loss function. The three 3 X 1 output vectors (one per time step) are now prepared for back-propagation. Since this is an RNN, we will use a concept called BPTT (Back-Propagation Through Time), which accounts for the influence of previous hidden states on the present loss.
The rest of the process is the same as any other back-propagation: we fix a loss function and take partial derivatives of that loss with respect to the parameters. For example, in the explanation of back-propagation in neural networks we took the derivative of ReLU and multiplied it with the delta of the error.
Similarly, we take the derivative of the cross-entropy function and use it to calculate the rest of the gradient values. To frame it better, this base derivative is the starting point from which the gradient flows back through the network, while also accounting for the impact of the previous hidden state.

By this point the raw scores have passed through the softmax function to give y_pred, the target is one-hot encoded, and the derivative of the loss is taken with respect to the output. Because the combination of softmax and cross-entropy gives a very simple gradient, we start by taking dy = y_pred.
We then subtract 1 from dy at the index of the correct target, which singles out the correct class in the gradient. Note that the targets list is simply the input sequence shifted ahead by one step (so 'c' maps to 'a', 'a' to 't', and 't' wraps back to 'c').
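In code, that combined softmax-plus-cross-entropy gradient for a single time step looks like the sketch below; y_pred and target_idx are assumed to come from the forward-pass sketch earlier:
dy = np.copy(y_pred)  # start from the predicted probabilities
dy[target_idx] -= 1   # subtract 1 only at the correct class
# dy is now the gradient of the cross-entropy loss with respect to the raw scores y_t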

The back-propagation up to the output layer will look like this:

Once the output layer's gradients are computed, it is time to back-propagate through the hidden state. Here it is crucial to use the preceding time step's hidden values when calculating the gradients.
The tanh activation also plays a role during back-propagation: instead of passing dh back directly, we multiply it by the derivative of tanh before propagating to the earlier parameters.
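As a sketch, the gradient arriving at the hidden state is gated by the tanh derivative before it flows back any further; dy and h_t are assumed from the steps above, and dh_next plays the same role as in the complete code below:
dh_next = np.zeros((hidden_size, 1))  # gradient carried over from the neighbouring time step (zero to begin with)
dh = np.dot(W_ho.T, dy) + dh_next     # gradient reaching the hidden state
dh_raw = (1 - h_t ** 2) * dh          # tanh'(z) = 1 - tanh(z)^2, and h_t = tanh(z)
# dh_raw is what flows into the gradients of W_xh, W_hh and b_h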




Complete Code for RNN
import numpy as np
# Vocabulary and data setup
vocab = {'c': 0, 'a': 1, 't': 2}
vocab_size = len(vocab)
sequence = ['c', 'a', 't']
targets = [vocab['a'], vocab['t'], vocab['c']] # Shifted targets
Note: W_xh: columns correspond to input characters (vocab_size); rows correspond to hidden units. W_ho: columns correspond to hidden units; rows correspond to output classes.
# Hyperparameters
hidden_size = 3 # Number of hidden units
learning_rate = 0.03 # Step size for the gradient descent updates
# Weight initialization
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_ho = np.random.randn(vocab_size, hidden_size) * 0.01
b_h = np.zeros((hidden_size, 1))
b_o = np.zeros((vocab_size, 1))
def one_hot_encode(char, vocab):
    vec = np.zeros((len(vocab), 1))
    vec[vocab[char]] = 1
    return vec
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Subtract the max for numerical stability
    return exp_x / np.sum(exp_x)
# Track the average per-character loss of each epoch so it can be plotted later
ll = []
# Training loop
# Number of epochs: you can find this number by experimenting; in practical real-life models an approximate estimate is used or early stopping is implemented
for epoch in range(220):
    loss = 0  # Reset loss for this epoch so that it is not accumulated across epochs
    # Initialize gradients
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_ho = np.zeros_like(W_ho)
    db_h = np.zeros_like(b_h)
    db_o = np.zeros_like(b_o)
    dh_next = np.zeros((hidden_size, 1))
    # Reset hidden state
    h_prev = np.zeros((hidden_size, 1))
    # Forward and backward pass over the sequence
    for t in range(len(sequence)):
        # Forward pass
        x_t = one_hot_encode(sequence[t], vocab)
        h_t = np.tanh(np.dot(W_xh, x_t) + np.dot(W_hh, h_prev) + b_h)
        y_t = np.dot(W_ho, h_t) + b_o
        y_pred = softmax(y_t)
        # Loss calculation (cross-entropy for the correct next character)
        loss += -np.log(y_pred[targets[t]])
        # Gradients for the output layer (softmax + cross-entropy)
        dy = y_pred
        dy[targets[t]] -= 1
        dW_ho += np.dot(dy, h_t.T)
        db_o += dy
        # Back-propagation to the hidden layer
        dh = np.dot(W_ho.T, dy) + dh_next
        dh_raw = (1 - h_t ** 2) * dh  # Derivative of tanh
        dW_xh += np.dot(dh_raw, x_t.T)
        dW_hh += np.dot(dh_raw, h_prev.T)
        db_h += dh_raw
        dh_next = np.dot(W_hh.T, dh_raw)
        h_prev = h_t  # Update hidden state
    # Update weights and biases once per epoch
    W_xh -= learning_rate * dW_xh
    W_hh -= learning_rate * dW_hh
    W_ho -= learning_rate * dW_ho
    b_h -= learning_rate * db_h
    b_o -= learning_rate * db_o
    # Store the average per-character loss of this epoch for plotting
    ll.append(float(loss) / len(sequence))
The per-epoch loss values stored in the list ll can now be visualized to observe how the loss decreases over training.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
plt.plot(ll, label='Training Loss')
plt.axhline(-np.log(1/3), color='red', label='Uniform-guess baseline')  # ~1.099 per character
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Analysis')
plt.legend()
plt.show()

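Once training has finished, you can sanity-check the network by feeding it a character and reading off its prediction. Below is a minimal inference sketch; predict_next and idx_to_char are my own helper names, and it assumes the trained weights and helper functions from the complete code above:
def predict_next(char, h_prev=None):
    # Return the most probable next character and the updated hidden state
    if h_prev is None:
        h_prev = np.zeros((hidden_size, 1))
    x_t = one_hot_encode(char, vocab)
    h_t = np.tanh(np.dot(W_xh, x_t) + np.dot(W_hh, h_prev) + b_h)
    y_pred = softmax(np.dot(W_ho, h_t) + b_o)
    idx_to_char = {i: c for c, i in vocab.items()}
    return idx_to_char[int(np.argmax(y_pred))], h_t
print(predict_next('c')[0])  # should print 'a' once training has converged
print(predict_next('a')[0])  # should print 't'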
Thank you for reading all along; in the next part of this post we'll build a Recurrent Neural Network that solves a real-world task. I highly encourage you to try other activation functions and tweak the hyper-parameters to experiment with the results. Do post your comments and feedback in the comment section below. Subscribe to sapiencespace and enable notifications to get regular insights.
Title image and cover picture credits – unsplash content creators
Click here to view similar insights.