What is an RNN?
RNN = Recurrent Neural Network
A neural network designed to process sequential data — text, time series, speech, any data where order matters.
Key difference from CNN/DNN: RNN has memory — it processes data one step at a time and remembers what it saw before.
Main Things I Learned
1. Sequential Data Has Order
Unlike images (handled by CNNs), sequences have temporal order.
An RNN processes the sequence left to right, and its understanding of the context builds as it goes. Order matters. Context matters.
2. What is Hidden State (Memory)?
The RNN’s secret sauce is its hidden state.
At each time step, the network:
- Takes the current input (the current character)
- Takes the previous hidden state (memory from before)
- Produces a new hidden state (updated memory)
The hidden state is computed as:
$$h_t = \tanh(W_x \cdot x_t + W_h \cdot h_{t-1} + b)$$
This is memory: The network carries information from previous steps forward. Each step captures more context as it processes the sequence.
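A minimal numpy sketch of that single update (the vocabulary size of 5 and hidden size of 8 are made-up values for illustration, not taken from the project):

```python
import numpy as np

vocab_size, hidden_size = 5, 8   # assumed toy sizes
rng = np.random.default_rng(0)

W_x = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

x_t = np.eye(vocab_size)[2]      # one-hot vector for the "current character" (index 2)
h_prev = np.zeros(hidden_size)   # previous hidden state (zeros at the first step)

# h_t = tanh(W_x · x_t + W_h · h_{t-1} + b)
h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
print(h_t.shape)                 # (8,)
```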
3. What Are Time Steps?
Time step = processing one element in sequence
Each step processes one character/word and passes hidden state to next step.
This is why it’s called “recurrent” — the same computation happens at each step, with a recurrent connection to the previous hidden state.
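A sketch of that loop over time steps, with the same assumed toy sizes as above: the weights are shared across all steps, and only `h` carries information forward.

```python
import numpy as np

vocab_size, hidden_size = 5, 8   # assumed toy sizes
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

sequence = [0, 3, 1, 4, 2]       # toy sequence of character indices
h = np.zeros(hidden_size)        # initial hidden state

# Same computation at every time step; h is the recurrent connection.
for t, idx in enumerate(sequence):
    x_t = np.eye(vocab_size)[idx]          # one-hot input for this step
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # updated memory
    print(f"step {t}: h[:3] = {h[:3]}")
```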
4. Input Sequences and Targets
Model learns patterns from sequences and their targets.
Training: For each sequence, the model learns “if you see this sequence, predict that next element”
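A minimal sketch of how such (sequence, target) pairs could be built for next-character prediction; the window length of 4 is an arbitrary choice, not the project’s setting:

```python
text = "hello world"
seq_len = 4                               # arbitrary context window

pairs = []
for i in range(len(text) - seq_len):
    context = text[i:i + seq_len]         # input sequence
    target = text[i + seq_len]            # the next character to predict
    pairs.append((context, target))

print(pairs[:3])
# [('hell', 'o'), ('ello', ' '), ('llo ', 'w')]
```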
5. One-Hot Encoding
RNN doesn’t understand characters — only numbers.
Characters are mapped to integers, then converted to one-hot vectors for processing.
Why? Neural networks work with numerical representations. This encoding turns every character into a fixed-size numeric vector the network can process.
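A short sketch of that mapping (the tiny string is just for illustration):

```python
import numpy as np

text = "hello"
chars = sorted(set(text))                          # unique characters in the data
char_to_idx = {c: i for i, c in enumerate(chars)}  # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
indices = [char_to_idx[c] for c in text]           # "hello" -> [1, 0, 2, 2, 3]

one_hot = np.eye(len(chars))[indices]              # shape (5, 4): one row per character
print(one_hot)
```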
6. The Vanishing Gradient Problem (CRITICAL)
Problem: On long sequences, the gradient is multiplied by a factor at every time step during backpropagation through time, and the product can become vanishingly small.
When computing gradients backwards through time:
$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$
If each factor $$\frac{\partial h_t}{\partial h_{t-1}}$$ has magnitude less than 1, the product becomes exponentially small.
Result: Network forgets early elements. Long-range dependencies break.
Symptoms:
- Can’t remember context from many steps ago
- Fails on long sequences
- Only learns local patterns
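A quick numeric sketch of why the product shrinks: tanh saturates, so its derivative $1 - \tanh(z)^2$ is at most 1 and usually well below it. The factor of 0.5 and the 50 steps below are assumed values, only meant to show the exponential decay:

```python
factor = 0.5            # assumed typical magnitude of d h_t / d h_{t-1}
steps = 50              # length of the sequence we backpropagate through

gradient_scale = factor ** steps
print(gradient_scale)   # ~8.9e-16: the early steps receive essentially no gradient
```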
7. Why LSTM and GRU Exist
Solution: Add gates to control information flow.
LSTM (Long Short-Term Memory) uses multiple gates to control what gets forgotten, remembered, and output.
GRU (Gated Recurrent Unit):
- Simpler than LSTM
- Fewer gates, faster training
- Similar performance
Result: Gradients flow better. Network remembers longer context.
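A hedged PyTorch sketch (sizes are arbitrary) showing that LSTM and GRU are drop-in recurrent layers with almost the same interface; the LSTM just carries an extra cell state:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 2, 10, 16, 32   # arbitrary sizes
x = torch.randn(batch, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

lstm_out, (h_n, c_n) = lstm(x)   # LSTM keeps a hidden state AND a cell state
gru_out, h_n_gru = gru(x)        # GRU keeps only a hidden state (fewer gates)

print(lstm_out.shape, gru_out.shape)   # both: torch.Size([2, 10, 32])
```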
8. How Embedding Layer Works
Instead of one-hot encoding (sparse and inefficient), use an Embedding layer.
Character indices are converted to dense vector representations learned during training.
Why better?
- Compact representation
- Learned during training
- Captures semantic meaning
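A minimal PyTorch sketch (vocabulary and embedding sizes assumed for illustration): instead of a one-hot vector of length `vocab_size`, each index is looked up in a dense table that is learned during training:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50, 16                  # assumed sizes
embedding = nn.Embedding(vocab_size, embed_dim)

char_indices = torch.tensor([[1, 0, 2, 2, 3]])  # a batch with one sequence of indices
dense = embedding(char_indices)                 # shape (1, 5, 16): dense, learned vectors

print(dense.shape)
```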
9. Sequence Padding
Different sequences have different lengths.
Solution: Pad shorter sequences to the same length so all sequences are uniform. This enables batch processing.
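A sketch using PyTorch’s `pad_sequence` (Keras has an equivalent `pad_sequences` helper); padding with 0 is a common convention, assumed here:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences of different lengths (character indices)
seqs = [torch.tensor([1, 0, 2, 2, 3]),
        torch.tensor([4, 1]),
        torch.tensor([3, 3, 3])]

padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)
# tensor([[1, 0, 2, 2, 3],
#         [4, 1, 0, 0, 0],
#         [3, 3, 3, 0, 0]])
```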
10. Training on Sequences
The model learns from many examples: “given this context, predict the next element.”
Over many examples, it learns language patterns and builds understanding of the data.
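A compressed PyTorch sketch of one training step for next-character prediction; the model shape, sizes, and dummy batch here are assumptions for illustration, not the repository’s exact code:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 50, 16, 32    # assumed sizes

class CharRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.head(out[:, -1])               # predict the element after the sequence

model = CharRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch: contexts -> next-character targets
contexts = torch.randint(0, vocab_size, (8, 10))   # 8 sequences of 10 indices
targets = torch.randint(0, vocab_size, (8,))       # the "next element" for each

optimizer.zero_grad()
logits = model(contexts)
loss = loss_fn(logits, targets)
loss.backward()
optimizer.step()
print(loss.item())
```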
11. Text Generation
After training, you can generate new sequences (see the sketch after this list) by:
- Starting with seed text
- Predicting next element
- Feeding prediction back as input
- Repeating to generate long sequences
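A hedged sketch of that loop. The embedding/GRU/linear stack below is an untrained stand-in for a trained model, and the greedy argmax pick is one choice; real generation often samples from the softmax instead:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 50, 16, 32   # assumed sizes
# Stand-in for a trained model: embedding -> GRU -> linear head over the vocabulary
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)

seed = [1, 0, 2, 2, 3]                            # seed text as character indices
generated = list(seed)

for _ in range(20):                               # generate 20 more characters
    x = torch.tensor([generated])                 # feed everything generated so far
    out, _ = rnn(embed(x))
    logits = head(out[:, -1])                     # prediction for the next character
    next_idx = int(logits.argmax(dim=-1))         # greedy pick (sampling is also common)
    generated.append(next_idx)                    # feed the prediction back as input

print(generated)
```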
12. Why RNN is Limited Today
RNNs have problems:
- Vanishing gradient (even with LSTM)
- Can’t parallelize (must process sequentially)
- Slow training on large datasets
- Long-range dependencies still hard
Solution: Transformers (newer, better architecture)
- No recurrence, processes entire sequence at once
- Attention mechanism > RNN memory
- Parallelizable = much faster
Key Takeaways
✅ RNN has memory (hidden state) — carries context forward
✅ Time steps = processing one element at a time
✅ Sequential data has order — order matters
✅ Hidden state is updated at each time step with recurrence
✅ Vanishing gradient limits long-range memory
✅ LSTM/GRU solve vanishing gradient with gates
✅ Embedding layers compress one-hot encoding
✅ Padding makes sequences uniform length
✅ RNNs learn language patterns through training
✅ Text generation works by predicting one step at a time
✅ Transformers replaced RNNs for most NLP tasks today
Full Implementation
🔗 GitHub: RNN-Project-Next-Character-Prediction
See the repository for implementation details.
RNNs taught me that neural networks can have memory. That’s powerful. 🚀