What is an RNN?
RNN = Recurrent Neural Network
A neural network designed to process sequential data — text, time series, speech, any data where order matters.
Key difference from CNN/DNN: RNN has memory — it processes data one step at a time and remembers what it saw before.
Main Things I Learned
1. Sequential Data Has Order
Unlike images (handled by CNNs), sequences have temporal order.
An RNN processes the sequence left to right, and its understanding of the context builds as it goes. Order matters. Context matters.
2. What is Hidden State (Memory)?
The RNN’s secret sauce is its hidden state.
At each time step, the network:
- Takes the current input (the current character)
- Takes the previous hidden state (memory from before)
- Produces a new hidden state (updated memory)
The hidden state is computed as:
$$h_t = \tanh(W_x \cdot x_t + W_h \cdot h_{t-1} + b)$$
This is memory: The network carries information from previous steps forward. Each step captures more context as it processes the sequence.
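A minimal numpy sketch of that single update (the vocabulary size of 5 and hidden size of 8 are made-up values for illustration, not taken from the project):

```python
import numpy as np

vocab_size, hidden_size = 5, 8   # assumed toy sizes
rng = np.random.default_rng(0)

W_x = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

x_t = np.eye(vocab_size)[2]      # one-hot vector for the "current character" (index 2)
h_prev = np.zeros(hidden_size)   # previous hidden state (zeros at the first step)

# h_t = tanh(W_x · x_t + W_h · h_{t-1} + b)
h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
print(h_t.shape)                 # (8,)
```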
3. What Are Time Steps?
Time step = processing one element in sequence
Each step processes one character/word and passes hidden state to next step.
This is why it’s called “recurrent” — the same computation happens at each step, with a recurrent connection to the previous hidden state.
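A sketch of that loop over time steps, with the same assumed toy sizes as above: the weights are shared across all steps, and only `h` carries information forward.

```python
import numpy as np

vocab_size, hidden_size = 5, 8   # assumed toy sizes
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

sequence = [0, 3, 1, 4, 2]       # toy sequence of character indices
h = np.zeros(hidden_size)        # initial hidden state

# Same computation at every time step; h is the recurrent connection.
for t, idx in enumerate(sequence):
    x_t = np.eye(vocab_size)[idx]          # one-hot input for this step
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # updated memory
    print(f"step {t}: h[:3] = {h[:3]}")
```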
4. Input Sequences and Targets
Model learns patterns from sequences and their targets.
Training: For each sequence, the model learns “if you see this sequence, predict that next element”
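A minimal sketch of how such (sequence, target) pairs could be built for next-character prediction; the window length of 4 is an arbitrary choice, not the project’s setting:

```python
text = "hello world"
seq_len = 4                               # arbitrary context window

pairs = []
for i in range(len(text) - seq_len):
    context = text[i:i + seq_len]         # input sequence
    target = text[i + seq_len]            # the next character to predict
    pairs.append((context, target))

print(pairs[:3])
# [('hell', 'o'), ('ello', ' '), ('llo ', 'w')]
```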
5. One-Hot Encoding
RNN doesn’t understand characters — only numbers.
Characters are mapped to integers, then converted to one-hot vectors for processing.
Why? Neural networks work with numerical representations. This encoding turns every character into a fixed-size numeric vector the network can process.
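A short sketch of that mapping (the tiny string is just for illustration):

```python
import numpy as np

text = "hello"
chars = sorted(set(text))                          # unique characters in the data
char_to_idx = {c: i for i, c in enumerate(chars)}  # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
indices = [char_to_idx[c] for c in text]           # "hello" -> [1, 0, 2, 2, 3]

one_hot = np.eye(len(chars))[indices]              # shape (5, 4): one row per character
print(one_hot)
```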
6. The Vanishing Gradient Problem (CRITICAL)
Problem: On long sequences, the gradient is multiplied by a factor at every time step during backpropagation through time, and the product can become vanishingly small.
When computing gradients backwards through time:
$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$
If each factor $$\frac{\partial h_t}{\partial h_{t-1}}$$ has magnitude less than 1, the product becomes exponentially small.
Result: Network forgets early elements. Long-range dependencies break.
Symptoms:
- Can’t remember context from many steps ago
- Fails on long sequences
- Only learns local patterns
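A quick numeric sketch of why the product shrinks: tanh saturates, so its derivative $1 - \tanh(z)^2$ is at most 1 and usually well below it. The factor of 0.5 and the 50 steps below are assumed values, only meant to show the exponential decay:

```python
factor = 0.5            # assumed typical magnitude of d h_t / d h_{t-1}
steps = 50              # length of the sequence we backpropagate through

gradient_scale = factor ** steps
print(gradient_scale)   # ~8.9e-16: the early steps receive essentially no gradient
```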
7. Why LSTM and GRU Exist
Solution: Add gates to control information flow.
LSTM (Long Short-Term Memory) uses multiple gates to control what gets forgotten, remembered, and output.
GRU (Gated Recurrent Unit):
- Simpler than LSTM
- Fewer gates, faster training
- Similar performance
Result: Gradients flow better. Network remembers longer context.
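A hedged PyTorch sketch (sizes are arbitrary) showing that LSTM and GRU are drop-in recurrent layers with almost the same interface; the LSTM just carries an extra cell state:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 2, 10, 16, 32   # arbitrary sizes
x = torch.randn(batch, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

lstm_out, (h_n, c_n) = lstm(x)   # LSTM keeps a hidden state AND a cell state
gru_out, h_n_gru = gru(x)        # GRU keeps only a hidden state (fewer gates)

print(lstm_out.shape, gru_out.shape)   # both: torch.Size([2, 10, 32])
```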
8. How Embedding Layer Works
Instead of one-hot encoding (sparse and inefficient), use an Embedding layer.
Character indices are converted to dense vector representations learned during training.
Why better?
- Compact representation
- Learned during training
- Captures semantic meaning
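A minimal PyTorch sketch (vocabulary and embedding sizes assumed for illustration): instead of a one-hot vector of length `vocab_size`, each index is looked up in a dense table that is learned during training:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50, 16                  # assumed sizes
embedding = nn.Embedding(vocab_size, embed_dim)

char_indices = torch.tensor([[1, 0, 2, 2, 3]])  # a batch with one sequence of indices
dense = embedding(char_indices)                 # shape (1, 5, 16): dense, learned vectors

print(dense.shape)
```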
9. Sequence Padding
Different sequences have different lengths.
Solution: Pad shorter sequences to the same length so all sequences are uniform. This enables batch processing.
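A sketch using PyTorch’s `pad_sequence` (Keras has an equivalent `pad_sequences` helper); padding with 0 is a common convention, assumed here:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences of different lengths (character indices)
seqs = [torch.tensor([1, 0, 2, 2, 3]),
        torch.tensor([4, 1]),
        torch.tensor([3, 3, 3])]

padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)
# tensor([[1, 0, 2, 2, 3],
#         [4, 1, 0, 0, 0],
#         [3, 3, 3, 0, 0]])
```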
10. Training on Sequences
The model learns from many examples: “given this context, predict the next element.”
Over many examples, it learns language patterns and builds understanding of the data.
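A compressed PyTorch sketch of one training step for next-character prediction; the model shape, sizes, and dummy batch here are assumptions for illustration, not the repository’s exact code:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 50, 16, 32    # assumed sizes

class CharRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.head(out[:, -1])               # predict the element after the sequence

model = CharRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch: contexts -> next-character targets
contexts = torch.randint(0, vocab_size, (8, 10))   # 8 sequences of 10 indices
targets = torch.randint(0, vocab_size, (8,))       # the "next element" for each

optimizer.zero_grad()
logits = model(contexts)
loss = loss_fn(logits, targets)
loss.backward()
optimizer.step()
print(loss.item())
```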
11. Text Generation
After training, you can generate new sequences (see the sketch after this list) by:
- Starting with seed text
- Predicting next element
- Feeding prediction back as input
- Repeating to generate long sequences
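A hedged sketch of that loop. The embedding/GRU/linear stack below is an untrained stand-in for a trained model, and the greedy argmax pick is one choice; real generation often samples from the softmax instead:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 50, 16, 32   # assumed sizes
# Stand-in for a trained model: embedding -> GRU -> linear head over the vocabulary
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)

seed = [1, 0, 2, 2, 3]                            # seed text as character indices
generated = list(seed)

for _ in range(20):                               # generate 20 more characters
    x = torch.tensor([generated])                 # feed everything generated so far
    out, _ = rnn(embed(x))
    logits = head(out[:, -1])                     # prediction for the next character
    next_idx = int(logits.argmax(dim=-1))         # greedy pick (sampling is also common)
    generated.append(next_idx)                    # feed the prediction back as input

print(generated)
```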
12. Why RNN is Limited Today
RNNs have problems:
- Vanishing gradient (even with LSTM)
- Can’t parallelize (must process sequentially)
- Slow training on large datasets
- Long-range dependencies still hard
Solution: Transformers (newer, better architecture)
- No recurrence, processes entire sequence at once
- Attention mechanism > RNN memory
- Parallelizable = much faster
Key Takeaways
✅ RNN has memory (hidden state) — carries context forward
✅ Time steps = processing one element at a time
✅ Sequential data has order — order matters
✅ Hidden state is updated at each time step with recurrence
✅ Vanishing gradient limits long-range memory
✅ LSTM/GRU solve vanishing gradient with gates
✅ Embedding layers compress one-hot encoding
✅ Padding makes sequences uniform length
✅ RNNs learn language patterns through training
✅ Text generation works by predicting one step at a time
✅ Transformers replaced RNNs for most NLP tasks today
Full Implementation
🔗 GitHub: RNN-Project-Next-Character-Prediction
See the repository for implementation details.
RNNs taught me that neural networks can have memory. That’s powerful. 🚀