MNIST and Fashion-MNIST Classification #
Introduction #
In this lab session, we explore handwritten digit recognition using the MNIST dataset, progressing from classical machine learning approaches to modern deep learning techniques. We then apply transfer learning to the more challenging Fashion-MNIST dataset. This progression mirrors the historical development of the field while providing hands-on experience with key optimization concepts.
The session is divided into three parts:
- Part I: Classical approaches using Multi-Layer Perceptrons (MLP) and Support Vector Machines (SVM)
- Part II: Deep learning with Convolutional Neural Networks (CNN)
- Part III: Transfer learning applied to Fashion-MNIST
Note: For this lab you will need a running Python environment with the following packages: `numpy`, `matplotlib`, `scikit-learn`, `pandas`, `torch`, `torchvision`.
Learning objectives #
By the end of this session, you should be able to:
- Implement backpropagation and stochastic gradient descent from scratch
- Design proper validation strategies to avoid overfitting
- Build and train CNNs using PyTorch
- Apply transfer learning to improve model performance
- Understand the trade-offs between different optimization strategies
You need to implement a backpropagation procedure for the MLP. Since we didn't have time to cover this in class, we refer to this resource.
A solution to this lab is available here, but don't just look at it! Try to implement the code yourself first, and then compare your solution with the provided one.
Part I: Classical Machine Learning Approaches #
1. Introduction and Data Exploration #
The MNIST dataset consists of 70,000 grayscale images of handwritten digits (0-9), each of size 28×28 pixels. We begin by loading and exploring this dataset. This initial setup is provided for you.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# Load MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False, parser='auto')
# Convert to appropriate types
X = X.astype(np.float32)
y = y.astype(np.int64)
print(f"Data shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Pixel value range: [{X.min()}, {X.max()}]")
# Visualize some examples
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(28, 28), cmap='gray')
    ax.set_title(f'Label: {y[i]}')
    ax.axis('off')
plt.tight_layout()
plt.show()
For classical machine learning approaches, we work with flattened vectors rather than 2D images. Each image becomes a 784-dimensional vector where spatial structure is implicit in the feature ordering.
2. Data Preprocessing and Visualization #
Preprocessing is crucial for optimization convergence. We’ll normalize the data and use dimensionality reduction for visualization.
Normalization #
Normalization ensures that all features contribute equally to the optimization process. For pixel values in range [0, 255], we have two main approaches:
- Min-Max scaling: $x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} = \frac{x}{255}$
- Standardization: $x_{\text{std}} = \frac{x - \mu}{\sigma}$
where $\mu$ is the mean and $\sigma$ is the standard deviation across the training set. We will use Min-Max scaling. The three-way data split (train, validation, test) is also performed here.
from sklearn.preprocessing import MinMaxScaler
# We'll use MinMaxScaler for pixel data as it preserves the 0 boundary
scaler = MinMaxScaler()
# Create a three-way split for robust evaluation
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
# Fit scaler on training data only to avoid data leakage
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"Training set: {X_train_scaled.shape}")
print(f"Validation set: {X_val_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")
Dimensionality Reduction for Visualization #
To understand the data structure, we apply PCA and t-SNE. PCA finds the directions of maximum variance through eigendecomposition of the covariance matrix of the mean-centered data: $\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}$. The following code will help you visualize the high-dimensional data in 2D.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Apply PCA for visualization
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train_scaled)
# t-SNE for 2D visualization (on a subset for speed)
subset_size = 5000
indices = np.random.choice(len(X_train_scaled), subset_size, replace=False)
X_subset = X_train_scaled[indices]
y_subset = y_train[indices]
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_subset)
# Visualize t-SNE embedding
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset, cmap='tab10', alpha=0.6)
plt.colorbar(scatter)
plt.title('t-SNE visualization of MNIST digits')
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.show()
3. Experimental Methodology #
A rigorous experimental protocol is key to reliable results.
Three-way Data Splitting #
We use a three-way split:
- Training set (49,000 samples): For model parameter learning.
- Validation set (10,500 samples): For hyperparameter tuning and early stopping.
- Test set (10,500 samples): For final, unbiased model evaluation.
K-Fold Cross-Validation #
For hyperparameter optimization, we employ K-fold cross-validation on the training set. The cross-validation error is: $$\text{CV}(k) = \frac{1}{k} \sum_{i=1}^{k} L(\mathbf{w}_{-i}, \mathcal{D}_i)$$ where $\mathbf{w}_{-i}$ is the model trained on all folds except fold $i$.
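As a concrete instance of this formula, scikit-learn's `cross_val_score` trains on $k-1$ folds and evaluates on the held-out fold. A minimal sketch, using the small built-in digits dataset as a fast stand-in for MNIST and accuracy as the evaluation metric:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = load_digits(return_X_y=True)   # small 8x8 digits, for speed
clf = LogisticRegression(max_iter=500)

# Each entry is the score of the model trained without fold i,
# evaluated on fold i; their mean plays the role of CV(k).
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

In the lab itself, you would run this on the training split only, keeping the test set untouched.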
4. Multi-Layer Perceptron from Scratch #
Now for the main challenge: implementing a two-layer neural network from scratch to understand backpropagation and stochastic optimization.
Network Architecture #
- Input layer: 784 neurons
- Hidden layer: 128 neurons with ReLU activation
- Output layer: 10 neurons with softmax activation
The forward propagation equations are: \begin{align} \mathbf{z}^{(1)} &= \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} \\ \mathbf{a}^{(1)} &= \text{ReLU}(\mathbf{z}^{(1)}) \\ \mathbf{z}^{(2)} &= \mathbf{W}^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)} \\ \hat{\mathbf{y}} &= \text{softmax}(\mathbf{z}^{(2)}) \end{align}
Loss Function #
We use the cross-entropy loss: $$L(\mathbf{W}, \mathbf{b}) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{10} y_{ij}\log(\hat{y}_{ij})$$
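To make the equations concrete, here is a NumPy sketch of the forward pass and loss for one small batch; the dimensions and variable names are illustrative, not prescribed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, k = 4, 784, 128, 10                     # batch, input, hidden, classes

W1 = rng.standard_normal((h, d)) * np.sqrt(2.0 / d)   # He-style init
b1 = np.zeros((h, 1))
W2 = rng.standard_normal((k, h)) * np.sqrt(2.0 / h)
b2 = np.zeros((k, 1))

X = rng.random((d, n))                            # columns are samples
z1 = W1 @ X + b1
a1 = np.maximum(0, z1)                            # ReLU
z2 = W2 @ a1 + b2
z2 -= z2.max(axis=0, keepdims=True)               # stabilized softmax
y_hat = np.exp(z2) / np.exp(z2).sum(axis=0, keepdims=True)

Y = np.eye(k)[:, rng.integers(0, k, n)]           # one-hot targets, (k, n)
loss = -np.mean((Y * np.log(y_hat + 1e-8)).sum(axis=0))
print(loss)                                       # roughly log(10) at random init
```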
Backpropagation Derivation #
Here are the crucial gradients you’ll need for implementation:
- Output layer gradients: $\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$
- Hidden layer gradients: $\frac{\partial L}{\partial \mathbf{z}^{(1)}} = (\mathbf{W}^{(2)})^T \frac{\partial L}{\partial \mathbf{z}^{(2)}} \odot \mathbf{1}[\mathbf{z}^{(1)} > 0]$
- Weight gradients: $\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L}{\partial \mathbf{z}^{(2)}_i} (\mathbf{a}^{(1)}_i)^T$ and $\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L}{\partial \mathbf{z}^{(1)}_i} \mathbf{x}_i^T$
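A compact NumPy sketch of these gradients, together with a finite-difference check on one weight entry (small hypothetical dimensions, not the MNIST ones):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h, k = 8, 20, 16, 10

W1 = rng.standard_normal((h, d)) * 0.1; b1 = np.zeros((h, 1))
W2 = rng.standard_normal((k, h)) * 0.1; b2 = np.zeros((k, 1))
X = rng.random((d, n))
Y = np.eye(k)[:, rng.integers(0, k, n)]           # one-hot targets, (k, n)

# Forward
z1 = W1 @ X + b1; a1 = np.maximum(0, z1)
z2 = W2 @ a1 + b2
p = np.exp(z2 - z2.max(axis=0, keepdims=True))
p /= p.sum(axis=0, keepdims=True)

# Backward: gradients of the mean cross-entropy, as in the equations above
dz2 = p - Y                                       # per-sample dL/dz2
dW2 = dz2 @ a1.T / n
db2 = dz2.mean(axis=1, keepdims=True)
dz1 = (W2.T @ dz2) * (z1 > 0)                     # ReLU mask
dW1 = dz1 @ X.T / n
db1 = dz1.mean(axis=1, keepdims=True)

# Finite-difference check on a single entry of W1
eps = 1e-5
def loss_at(W1_):
    a = np.maximum(0, W1_ @ X + b1)
    z = W2 @ a + b2
    q = np.exp(z - z.max(axis=0, keepdims=True)); q /= q.sum(axis=0, keepdims=True)
    return -np.mean((Y * np.log(q + 1e-12)).sum(axis=0))

W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss_at(W1p) - loss_at(W1m)) / (2 * eps)
print(abs(numeric - dW1[0, 0]))                   # should be tiny
```

A gradient check like this is the single most useful debugging tool for your implementation: if the analytic and numeric gradients disagree, the bug is in `backward`.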
Implementation #
Task: Implement the `MLPFromScratch` Class
Your task is to create a Python class `MLPFromScratch` that builds and trains our two-layer neural network. Use the information below to guide you through implementing each method.
- `__init__(self, ...)`
  - Initialize weights and biases for both layers. Use He initialization for the weights (e.g., `np.random.randn(...) * np.sqrt(2.0 / n_input)`), which suits ReLU layers and aids convergence. Initialize biases to zero.
  - Initialize velocity terms for momentum-based gradient descent (e.g., `self.vW1`, `self.vb1`) as zero arrays with the same shape as the corresponding parameters.
- Helper functions
  - `relu(self, x)`: implement the ReLU activation function, $\max(0, x)$.
  - `relu_derivative(self, x)`: return 1 for positive inputs, 0 otherwise.
  - `softmax(self, x)`: implement the softmax function. Remember to subtract the max value from `x` before exponentiating for numerical stability: $\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i - \max(\mathbf{z}))}{\sum_j \exp(z_j - \max(\mathbf{z}))}$.
- Forward pass
  - `forward(self, X)`
    - Implement the forward pass using the equations above.
    - The input `X` has shape `(n_samples, n_features)`. It is often easier to work with column vectors, so you may need to transpose it.
    - Store intermediate values (`self.z1`, `self.a1`, `self.z2`, `self.a2`), as they are needed for backpropagation.
    - Return the final predictions, `a2`, transposed back to shape `(n_samples, n_classes)`.
  - `compute_loss(self, y_pred, y_true)`
    - Implement the cross-entropy loss.
    - Convert the true labels `y_true` (e.g., `[5, 0, 4, ...]`) into one-hot encoded vectors.
    - Add a small epsilon (e.g., `1e-8`) to `y_pred` before taking the logarithm to avoid `log(0)`.
- Backward pass
  - `backward(self, X, y_true, momentum=0.9)`
    - This is the core of the learning process: implement backpropagation using the gradient equations provided above.
    - Compute the gradients for `W2`, `b2`, `W1`, and `b1`.
    - Update the velocity term of each parameter with the momentum formula: $v_t = \text{momentum} \cdot v_{t-1} - \text{lr} \cdot \nabla L$.
    - Update the weights and biases with their velocity terms: $W \leftarrow W + v_W$.
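Before wiring the momentum update into the network, it can be sanity-checked on a toy quadratic loss; everything in this sketch is illustrative:

```python
import numpy as np

lr, momentum = 0.01, 0.9
W = np.zeros(2)                 # hypothetical parameter vector
vW = np.zeros_like(W)           # velocity, initialized to zero

def grad(W):
    # gradient of the toy loss 0.5 * ||W - target||^2
    return W - np.array([1.0, -2.0])

for _ in range(500):
    vW = momentum * vW - lr * grad(W)   # v_t = momentum * v_{t-1} - lr * grad
    W = W + vW                          # W <- W + v_W
print(W)                                # approaches the target [1, -2]
```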
- Train & predict
  - `train(self, X_train, y_train, ...)`
    - Implement the main training loop.
    - Iterate for a given number of `epochs`.
    - In each epoch, shuffle the training data so the mini-batches are random.
    - Implement mini-batch gradient descent: loop through the training data in batches of a given `batch_size`.
    - For each batch, perform a `forward` pass, `compute_loss`, and a `backward` pass.
    - After each epoch, compute and store the training loss and the validation accuracy, so you can monitor for overfitting.
  - `predict(self, X)`
    - Perform a forward pass and return the index of the highest-scoring class for each input sample using `np.argmax`.
Training the MLP #
Once your class is implemented, use the following code to train it and visualize the results. Experiment with different learning rates.
# NOTE: This code assumes you have created the MLPFromScratch class.
# mlp = MLPFromScratch(learning_rate=0.01)
# train_losses, val_accuracies = mlp.train(
# X_train_scaled, y_train,
# X_val_scaled, y_val,
# epochs=50, batch_size=128
# )
# Plot training curves
# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# ax1.plot(train_losses)
# ax1.set_xlabel('Epoch'); ax1.set_ylabel('Training Loss'); ax1.set_title('Training Loss over Epochs')
# ax2.plot(val_accuracies)
# ax2.set_xlabel('Epoch'); ax2.set_ylabel('Validation Accuracy'); ax2.set_title('Validation Accuracy over Epochs')
# plt.tight_layout(); plt.show()
Now, experiment with different batch sizes. How does batch size affect:
- Convergence speed?
- Final accuracy?
- Computational efficiency?
5. Support Vector Machines #
Now we apply SVMs to the same problem using scikit-learn's optimized implementation. This serves as a powerful baseline to compare against our MLP. We use `GridSearchCV` to find the best hyperparameters.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import time
svm_model = SVC(kernel='rbf', random_state=42)
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto', 0.01]}
# Use a subset for faster grid search
X_train_subset = X_train_scaled[:5000]
y_train_subset = y_train[:5000]
print("Starting Grid Search for SVM...")
start_time = time.time()
grid_search = GridSearchCV(svm_model, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_subset, y_train_subset)
print(f"Grid search completed in {time.time() - start_time:.2f} seconds")
best_svm = grid_search.best_estimator_
best_svm.fit(X_train_scaled, y_train)
val_acc_svm = best_svm.score(X_val_scaled, y_val)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Validation accuracy: {val_acc_svm:.4f}")
6. Comparative Analysis #
Let’s compare the performance of your custom MLP with the tuned SVM on the final test set.
Task: Create an sklearn-compatible Estimator
To easily compare your MLP with sklearn models, create a wrapper class `MLPClassifier` that inherits from `sklearn.base.BaseEstimator` and `sklearn.base.ClassifierMixin`.
- The `__init__` method should store hyperparameters like `learning_rate`, `epochs`, etc.
- The `fit(self, X, y)` method should initialize and train your `MLPFromScratch` instance.
- The `predict(self, X)` method should call the `predict` method of your trained MLP.
- The `score(self, X, y)` method should predict on `X` and return the accuracy against `y`.
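A sketch of the wrapper pattern follows. Since `MLPFromScratch` is your own class, a trivial nearest-centroid model stands in for it here so the sketch runs; the commented lines show where your real class plugs in:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MLPClassifier(BaseEstimator, ClassifierMixin):
    """Wrapper sketch; a nearest-centroid stand-in plays the role of
    MLPFromScratch so the example is self-contained."""

    def __init__(self, hidden_size=128, learning_rate=0.01, epochs=30):
        # sklearn convention: store hyperparameters under the same names
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        self.epochs = epochs

    def fit(self, X, y):
        # Real version: self.model_ = MLPFromScratch(...); self.model_.train(X, y, ...)
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Real version: return self.model_.predict(X)
        d2 = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[np.argmin(d2, axis=1)]

    # score(X, y) -> accuracy is inherited from ClassifierMixin; you may
    # also implement it explicitly as np.mean(self.predict(X) == y).
```

Because the class follows the `BaseEstimator` conventions, it works with sklearn utilities such as `cross_val_score` out of the box.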
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# NOTE: This assumes you have created the MLPClassifier wrapper.
# mlp_sklearn = MLPClassifier(hidden_size=128, learning_rate=0.01, epochs=30)
# mlp_sklearn.fit(X_train_scaled, y_train)
# Predictions on the test set
# mlp_pred = mlp_sklearn.predict(X_test_scaled)
# svm_pred = best_svm.predict(X_test_scaled)
# ...
Analyze the computational costs of the two approaches.
Part II: Deep Learning with CNNs #
1. Transition to PyTorch #
We now move to PyTorch for implementing Convolutional Neural Networks (CNNs). PyTorch provides automatic differentiation and GPU acceleration, which are essential for deep learning. The following boilerplate code prepares the data for PyTorch.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Convert numpy arrays to PyTorch tensors and reshape for CNNs (N, C, H, W)
X_train_tensor = torch.FloatTensor(X_train_scaled.reshape(-1, 1, 28, 28))
y_train_tensor = torch.LongTensor(y_train)
X_val_tensor = torch.FloatTensor(X_val_scaled.reshape(-1, 1, 28, 28))
y_val_tensor = torch.LongTensor(y_val)
X_test_tensor = torch.FloatTensor(X_test_scaled.reshape(-1, 1, 28, 28))
y_test_tensor = torch.LongTensor(y_test)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
2. Convolutional Neural Network Design #
CNNs exploit the spatial structure of images through local connectivity and parameter sharing.
CNN Architecture Implementation #
Task: Implement a Simple CNN in PyTorch
Create a `SimpleCNN` class that inherits from `nn.Module`.
- `__init__(self)`:
  - Define the layers of the network, using the following architecture:
    - `nn.Conv2d`: 1 input channel, 32 output channels, kernel size of 3, padding of 1.
    - `nn.MaxPool2d`: kernel size of 2, stride of 2 (reduces 28×28 to 14×14).
    - `nn.Conv2d`: 32 input channels, 64 output channels, kernel size of 3, padding of 1.
    - `nn.MaxPool2d`: kernel size of 2, stride of 2 (reduces 14×14 to 7×7).
    - `nn.Linear`: input size is the flattened output of the previous layer (`64 * 7 * 7`), output size of 128.
    - `nn.Dropout`: with a probability of 0.5 for regularization.
    - `nn.Linear`: 128 inputs, 10 outputs (for the 10 digit classes).
- `forward(self, x)`:
  - Define the data flow through the network.
  - Pass the input `x` through `conv1`, apply a `F.relu` activation, then pass through the first `pool` layer.
  - Repeat for the second convolutional block (`conv2`, `relu`, `pool`).
  - Before the fully connected layers, you must flatten the tensor: `x = x.view(-1, 64 * 7 * 7)`.
  - Pass the flattened tensor through `fc1`, `relu`, `dropout`, and finally `fc2`.
  - Return the raw output scores (logits) from `fc2`; the loss function will handle the softmax.
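One possible realization of this architecture is sketched below; try writing your own version first and only then compare:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Sketch following the architecture described in the task."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # 28x28 -> 14x14
        x = self.pool(F.relu(self.conv2(x)))   # 14x14 -> 7x7
        x = x.view(-1, 64 * 7 * 7)             # flatten
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)                     # raw logits

# Quick shape check on a dummy batch
logits = SimpleCNN()(torch.zeros(4, 1, 28, 28))
print(logits.shape)                            # torch.Size([4, 10])
```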
3. Training Methodology #
Training Loop Implementation #
Task: Implement a PyTorch Training Loop
Write a function `train_model(model, train_loader, val_loader, ...)` that trains your `SimpleCNN`.
- Setup:
  - Move the `model` to the selected `device`.
  - Define the `criterion` (loss function): `nn.CrossEntropyLoss()`.
  - Define the `optimizer`: `optim.Adam(model.parameters(), lr=...)`.
- Outer loop: iterate through `epochs`.
- Training phase (inner loop):
  - Set the model to training mode: `model.train()`.
  - Iterate through the `train_loader` to get batches of `data` and `target`.
  - Move `data` and `target` to the `device`.
  - Crucially, zero the gradients: `optimizer.zero_grad()`.
  - Perform a forward pass: `output = model(data)`.
  - Calculate the loss: `loss = criterion(output, target)`.
  - Perform backpropagation: `loss.backward()`.
  - Update the model weights: `optimizer.step()`.
  - Keep track of the running loss and accuracy.
- Validation phase (inner loop):
  - Set the model to evaluation mode: `model.eval()`.
  - Disable gradient calculation with `with torch.no_grad():`.
  - Iterate through the `val_loader`.
  - Calculate the validation loss and accuracy for the epoch.
- Logging & return:
  - Print the training and validation statistics at the end of each epoch.
  - Return a history dictionary containing lists of training losses, validation losses, etc., for later plotting.
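The steps above can be sketched as follows; argument names beyond those in the task (`epochs`, `lr`, `device`) are assumptions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_model(model, train_loader, val_loader, epochs=5, lr=1e-3,
                device=torch.device('cpu')):
    """Sketch of the training loop described in the task."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    history = {'train_loss': [], 'val_loss': [], 'val_acc': []}

    for epoch in range(epochs):
        # --- training phase ---
        model.train()
        running_loss = 0.0
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()                  # crucial: clear old gradients
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * data.size(0)
        train_loss = running_loss / len(train_loader.dataset)

        # --- validation phase ---
        model.eval()
        val_loss, correct = 0.0, 0
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                val_loss += criterion(output, target).item() * data.size(0)
                correct += (output.argmax(dim=1) == target).sum().item()
        n_val = len(val_loader.dataset)
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss / n_val)
        history['val_acc'].append(correct / n_val)
        print(f"epoch {epoch + 1}: train loss {train_loss:.4f}, "
              f"val loss {val_loss / n_val:.4f}, val acc {correct / n_val:.3f}")
    return history
```

With this in place, `train_model(SimpleCNN(), train_loader, val_loader, device=device)` returns a history you can plot directly.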
4. Advanced Optimization Strategies #
Comparing Optimizers #
Task: Compare Optimizers
Adapt your training loop into a new function, `compare_optimizers`. This function should:
- Take the model class as an argument.
- Hold a dictionary of optimizers to test (e.g., `'SGD': optim.SGD`, `'Adam': optim.Adam`).
- Loop through this dictionary; in each iteration:
  - Instantiate a new model and the corresponding optimizer.
  - Run the training for a fixed number of epochs.
  - Store the validation accuracy history for each optimizer.
- Return a dictionary of results. Finally, plot the validation accuracy curves for all optimizers on a single graph to compare their convergence behavior.
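A compact sketch of this comparison; the learning rates chosen per optimizer are assumptions you should tune:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def compare_optimizers(model_class, train_loader, val_loader, epochs=3,
                       device=torch.device('cpu')):
    """Retrain a fresh model per optimizer and record validation accuracy."""
    optimizers = {
        'SGD': lambda params: optim.SGD(params, lr=0.01, momentum=0.9),
        'Adam': lambda params: optim.Adam(params, lr=1e-3),
    }
    criterion = nn.CrossEntropyLoss()
    results = {}
    for name, make_optimizer in optimizers.items():
        model = model_class().to(device)          # fresh model each time
        optimizer = make_optimizer(model.parameters())
        accuracies = []
        for _ in range(epochs):
            model.train()
            for data, target in train_loader:
                data, target = data.to(device), target.to(device)
                optimizer.zero_grad()
                criterion(model(data), target).backward()
                optimizer.step()
            model.eval()
            correct = 0
            with torch.no_grad():
                for data, target in val_loader:
                    preds = model(data.to(device)).argmax(dim=1)
                    correct += (preds == target.to(device)).sum().item()
            accuracies.append(correct / len(val_loader.dataset))
        results[name] = accuracies
    return results
```

Call it as `compare_optimizers(SimpleCNN, train_loader, val_loader)` and plot each `results[name]` curve on one axis with matplotlib.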
Learning Rate Scheduling #
Task: Implement Learning Rate Scheduling
Modify your training function to include a learning rate scheduler.
- After defining the optimizer, create a scheduler instance: `scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3)`. This scheduler reduces the LR when the validation loss (`'min'`) stops improving for 3 (`patience`) epochs.
- At the end of each epoch, after the validation loop, call the scheduler's step function with the current validation loss: `scheduler.step(val_loss)`.
- Log the learning rate at each epoch to see how it changes over time; you can read it from `optimizer.param_groups[0]['lr']`.
- Plot the loss curves and the learning rate over epochs.
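The scheduler mechanics can be seen in isolation with made-up validation losses instead of a real training loop; the stand-in model is purely illustrative:

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(10, 2)                      # stand-in model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3)

# In a real run you would call scheduler.step(val_loss) once per epoch,
# after validation; here a simulated plateau triggers the reduction.
for epoch, val_loss in enumerate([1.0, 0.9, 0.9, 0.9, 0.9, 0.9]):
    scheduler.step(val_loss)
    lr = optimizer.param_groups[0]['lr']
    print(f"epoch {epoch}: lr = {lr:.6f}")
# Once the loss has failed to improve for more than `patience` epochs,
# the learning rate is multiplied by the default factor of 0.1.
```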
Part III: Transfer Learning with Fashion-MNIST #
1. Fashion-MNIST Introduction #
Fashion-MNIST is a drop-in replacement for MNIST but is more challenging. We will use it to demonstrate the power of transfer learning.
from torchvision import datasets, transforms
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
fashion_train = datasets.FashionMNIST('./data', train=True, download=True, transform=transform)
fashion_test = datasets.FashionMNIST('./data', train=False, download=True, transform=transform)
fashion_train_loader = DataLoader(fashion_train, batch_size=64, shuffle=True)
fashion_test_loader = DataLoader(fashion_test, batch_size=64, shuffle=False)
2. Practical Implementation #
We will adapt a model pre-trained on ImageNet for our grayscale clothing classification task.
Task: Implement a Transfer Learning Model
Your goal is to build, train, and fine-tune a transfer learning model.
- Model definition: create a `TransferLearningModel` class that inherits from `nn.Module`.
  - `__init__(self)`:
    - Load a pre-trained model from `torchvision.models`, for example `models.mobilenet_v2(pretrained=True)`.
    - Freeze the backbone: iterate through the parameters of the loaded model (`self.model.parameters()`) and set `param.requires_grad = False`. This prevents them from being updated during training.
    - Handle the channel mismatch: Fashion-MNIST is grayscale (1 channel), but MobileNetV2 expects RGB (3 channels). Add a `nn.Conv2d(1, 3, kernel_size=1)` layer to convert the input.
    - Replace the classifier: the final layer of the pre-trained model must be replaced with one suited to our 10-class problem. For MobileNetV2, this is `self.model.classifier`; replace it with a `nn.Linear` layer with the correct number of input features (1280 for MobileNetV2) and 10 output features.
  - `forward(self, x)`:
    - First, pass the input `x` through your 1-to-3 channel conversion layer.
    - Then, pass the result through the main pre-trained model.
- Stage 1: feature extraction. Train the model with the frozen backbone; only the new classifier layer you added is being trained.
  - Write a training loop for this stage.
  - Important: the optimizer should only receive the parameters that have `requires_grad` set to `True`. Use `optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)`.
  - Train for a few epochs (e.g., 5) with a standard learning rate like `0.001`.
- Stage 2: fine-tuning. After the initial training, unfreeze some of the later layers of the pre-trained model to adapt them to the new dataset.
  - Add a method `unfreeze_layers(self, num_layers)` to your model class. It should set `requires_grad = True` for the parameters of the last `num_layers` of the feature extractor (`self.model.features`).
  - Call this method to unfreeze the last few layers (e.g., `model.unfreeze_layers(3)`).
  - Train the model again for a few more epochs.
  - Crucially, use a much lower learning rate (e.g., `1e-4`) to avoid corrupting the pre-trained weights.
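The key mechanics of the two stages, the `requires_grad` filter and the optimizer rebuild, can be seen on a tiny stand-in model (the real one is your `TransferLearningModel`; `TinyTransfer` here is purely illustrative):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in: a frozen "backbone" plus a new trainable head.
class TinyTransfer(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        for p in self.backbone.parameters():
            p.requires_grad = False            # frozen, as after __init__
        self.head = nn.Linear(8, 10)           # new classifier, trainable

    def forward(self, x):
        return self.head(self.backbone(x))

    def unfreeze_layers(self, num_layers):
        for p in self.backbone.parameters():   # the real version unfreezes
            p.requires_grad = True             # only the last `num_layers`

model = TinyTransfer()

# Stage 1: the optimizer sees only the trainable parameters (the new head).
opt1 = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
print(len(opt1.param_groups[0]['params']))     # 2 tensors: head weight + bias

# Stage 2: unfreeze, then rebuild the optimizer with a much lower learning rate.
model.unfreeze_layers(3)
opt2 = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
print(len(opt2.param_groups[0]['params']))     # 4 tensors: head + backbone
```

Rebuilding the optimizer after unfreezing matters: parameters that were frozen at construction time are not in the old optimizer's parameter groups.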
3. Comparative Analysis #
Finally, train your `SimpleCNN` from scratch on Fashion-MNIST and compare its test accuracy curve against your two-stage transfer learning model. Also consider the difference in training time.
Exercises for Further Practice #
- Implement data augmentation (`torchvision.transforms`) for the CNN and measure its impact on generalization.
- Try different CNN architectures (e.g., add more layers, use different filter sizes).
- Implement early stopping in your training loop based on validation performance.
- Explore other pre-trained models (ResNet, EfficientNet) for transfer learning.