After building my first ANN to predict exam scores, I was ready for the next challenge: Deep Neural Networks. This time, instead of predicting one number, the network would recognize handwritten digits (0-9) from images!
🎯 What Makes This Different from ANN?
| ANN (Previous Project) | DNN (This Project) |
|---|---|
| Input: 1 number (hours) | Input: 784 numbers (28×28 image) |
| Output: 1 number (score) | Output: 10 classes (digits 0-9) |
| 1 layer | Multiple hidden layers |
| Linear problem | Non-linear, complex patterns |
Key insight: When problems get complex, we need depth — multiple layers that learn hierarchical features.
🖼 What is MNIST?
MNIST is the “Hello World” of machine learning — a dataset of 70,000 handwritten digit images:
- Image size: 28 × 28 pixels
- Color: Grayscale (0-255)
- Labels: Digits 0 through 9
- Training set: 60,000 images
- Test set: 10,000 images
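Those numbers are easy to verify yourself. A quick sketch, assuming TensorFlow is installed:

import tensorflow as tf

# Load MNIST and inspect the shapes
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)                  # (60000, 28, 28): 60,000 training images of 28×28 pixels
print(x_test.shape)                   # (10000, 28, 28): 10,000 test images
print(x_train.min(), x_train.max())   # 0 255: grayscale pixel intensities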
🧠 The DNN Architecture
Here’s the model that achieves 97%+ accuracy:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
Let me break down each layer:
🔹 Layer 1: Flatten()
What it does: Converts the 2D image into a 1D vector.
Before: 28 × 28 image (matrix)
After: 784 numbers → [x1, x2, x3, ... x784]
Why needed? Dense layers expect a flat vector, not a 2D grid.
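To see what that reshape looks like, here's a tiny sketch (the Flatten layer does this for you inside the model; this is just for illustration):

import numpy as np

image = np.zeros((28, 28))       # one 28×28 grayscale image
flat = image.reshape(-1)         # same pixels, laid out as a 1D vector
print(image.shape, flat.shape)   # (28, 28) (784,)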
🔹 Layer 2: Dense(128, activation='relu')
What it does:
- 128 neurons, each connected to ALL 784 input pixels
- Learns simple patterns (edges, curves)
ReLU activation:
output = max(0, x)
Why ReLU? Adds non-linearity. Without it, stacking layers would just be linear math — no real “depth.”
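ReLU is simple enough to compute by hand. A minimal sketch:

import numpy as np

def relu(x):
    # Keep positive values, clamp everything negative to zero
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# [0.  0.  0.  1.5 3. ]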
🔹 Layer 3: Dense(64, activation='relu')
What it does:
- 64 neurons receiving input from the 128 neurons above
- Combines simple features into complex patterns
- Learns digit-specific shapes
This is hierarchical learning — building complex understanding from simple parts!
🔹 Layer 4: Dense(10, activation='softmax')
What it does:
- 10 neurons — one for each digit (0-9)
- Outputs probabilities that sum to 1.0
Example output:
[0.01, 0.02, 0.90, 0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.00]
               ↑
Digit 2 has 90% probability → Prediction: 2
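Turning that probability vector into a concrete prediction is just picking the index of the largest value. A small sketch:

import numpy as np

probs = np.array([0.01, 0.02, 0.90, 0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.00])
print(np.argmax(probs))   # 2: the index with the highest probability
print(probs.sum())        # ≈ 1.0: softmax outputs always sum to one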
📊 Complete Training Code
import tensorflow as tf
# 1. Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# 2. Normalize pixel values (VERY IMPORTANT!)
x_train = x_train / 255.0
x_test = x_test / 255.0
# 3. Build the DNN model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
# 4. Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# 5. Train
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
# 6. Evaluate
model.evaluate(x_test, y_test)
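Once training finishes, you can classify a single test image. A minimal sketch that continues from the code above:

import numpy as np

# model.predict expects a batch, so pass a slice of one image
probs = model.predict(x_test[:1])       # shape (1, 10): one row of class probabilities
print(np.argmax(probs[0]), y_test[0])   # predicted digit vs. true label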
🤔 Questions I Asked (And Finally Understood)
Q: Why normalize pixels by dividing by 255?
Answer: Raw pixel values (0-255) are on a much larger scale than the network's small initial weights. Scaling them down to 0-1:
- Makes training more stable
- Helps gradients flow better
- Speeds up convergence
x_train = x_train / 255.0 # Now values are 0.0 to 1.0
Q: Why use Adam instead of SGD?
Answer: Adam is “smarter” than basic SGD:
- Adaptive Moment Estimation
- Automatically adjusts learning rate per parameter
- Works well out-of-the-box
For most deep learning, Adam is the default choice!
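If you want to compare the two yourself, both are available in tf.keras.optimizers. A quick sketch (the learning rates here are just the usual Keras defaults, not tuned values):

import tensorflow as tf

adam = tf.keras.optimizers.Adam(learning_rate=0.001)   # adaptive per-parameter updates
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)      # plain gradient descent, usually needs more tuning

# Passing the optimizer object instead of the string 'adam' works the same way
model.compile(optimizer=adam,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])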
Q: What is sparse_categorical_crossentropy?
Answer: It’s the loss function for multi-class classification with integer labels.
- Sparse = labels are integers (0, 1, 2, … 9)
- Categorical = multiple classes
- Crossentropy = measures how wrong the probability distribution is
If labels were one-hot encoded, we’d use categorical_crossentropy instead.
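A small sketch of the two label formats side by side:

import tensorflow as tf

# Integer labels → sparse_categorical_crossentropy
y_sparse = [5, 0, 4]

# One-hot labels → categorical_crossentropy
y_onehot = tf.keras.utils.to_categorical(y_sparse, num_classes=10)
print(y_onehot[0])
# [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]  ← the 1 sits at index 5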
Q: What’s the difference between ANN and DNN?
Answer: DNN = ANN with multiple hidden layers.
| Feature | ANN | DNN |
|---|---|---|
| Hidden layers | 0-1 | 2+ ✅ |
| Neurons | Few | Many ✅ |
| Learning | Simple patterns | Complex, hierarchical ✅ |
| Use case | Linear problems | Images, text, complex data |
Key insight: All DNNs are ANNs, but not all ANNs are “deep.”
🧠 What Each Layer Learns (Hierarchical Features)
| Layer | What It Learns |
|---|---|
| Flatten | Nothing (it just reshapes the raw pixels) |
| Dense 128 | Edges, simple curves |
| Dense 64 | Digit parts (loops, lines) |
| Dense 10 | Final digit classification |
This is why deep learning works — it builds understanding layer by layer!
⚠️ Overfitting: The DNN Trap
Problem: If you add too many layers/neurons:
- Training accuracy goes UP ↑
- Test accuracy goes DOWN ↓
The model memorizes training data instead of learning patterns!
Solutions (see the sketch after this list):
- Dropout: Randomly disable neurons during training
- Regularization: Penalize large weights
- Early stopping: Stop training when validation accuracy plateaus
- Less depth: Sometimes simpler is better
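Here's how the first and third fixes look in Keras. This is a sketch, not the exact model from above; the dropout rate and patience are just example values:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),   # randomly zero out 20% of activations during training
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Stop when validation loss stops improving for 2 epochs, and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2,
                                              restore_best_weights=True)
model.fit(x_train, y_train, epochs=20, validation_split=0.1, callbacks=[early_stop])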
📈 Results
After just 5 epochs:
- Training accuracy: ~98%
- Validation accuracy: ~97%
- Test accuracy: ~97-98%
The DNN correctly classifies handwritten digits with near-human accuracy! 🎉
🚀 What I Learned (Key Takeaways)
✅ Depth matters — multiple layers learn complex patterns
✅ Flatten converts images to vectors for Dense layers
✅ ReLU adds non-linearity (essential for learning)
✅ Softmax outputs probabilities for multi-class problems
✅ Normalization is crucial for stable training
✅ Adam optimizer works better than basic SGD
✅ Overfitting is the enemy — regularize!
📁 Try It Yourself
I’ve open-sourced this project with full code and explanations:
🔗 GitHub: DNN Handwritten Digit Classification
🎯 What’s Next?
This DNN works great for simple images, but for complex images (cats, dogs, faces), we need something better: Convolutional Neural Networks (CNNs).
Stay tuned for my next project where I build a CNN to classify cats vs dogs! 🐱🐶
Questions about DNNs? Feel free to reach out — explaining concepts helps me learn too! 🧠✨