Build Your Own VGG16: A Step-by-Step PyTorch Tutorial
Want to understand deep learning architectures from the ground up? This article walks you through building VGG16, a powerful convolutional neural network, from scratch using PyTorch. Learn the concepts, code, and tricks to implement your own image classifier on the CIFAR-100 dataset.
What is VGG16 and why build it in PyTorch?
VGG16 is a convolutional neural network known for its depth and uniform architecture. It was a milestone in deep learning, demonstrating the power of increasing network depth for image recognition. Building VGG16 from scratch lets you master PyTorch and gives you a deep understanding of CNN architecture.
- Gain hands-on experience: Move beyond theory and implement cutting-edge models yourself.
- Understand inner workings: See how convolutional layers, pooling, and fully connected layers come together.
- Customize and experiment: Modify the architecture and hyperparameters to optimize performance.
Preparing Your Data: Loading and Preprocessing CIFAR-100
Before building the model, you need a dataset. We'll use CIFAR-100, a dataset of 60,000 32x32 color images divided into 100 classes. Learn how to load, normalize, and split the dataset for training and validation.
- Efficient Data Loading: Use torchvision for easy dataset access and transformations.
- Normalization: Improve model performance by normalizing image pixel values.
- Train/Validation Split: Create separate datasets to monitor training progress and prevent overfitting.
Building Blocks: Understanding PyTorch Layers
Before diving into the code, let's review the essential PyTorch layers used in the VGG16 implementation:
- nn.Conv2d: Applies a convolutional filter to extract features from the image.
- nn.BatchNorm2d: Improves training stability and speed with batch normalization.
- nn.ReLU: Introduces non-linearity for learning complex patterns.
- nn.MaxPool2d: Downsamples feature maps, reducing computational load.
- nn.Linear: Implements fully connected layers for classification.
- nn.Sequential: Simplifies model definition by creating a container for sequential layers.
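To see how these pieces compose, here is a minimal sketch of one VGG-style block wrapped in nn.Sequential (the shapes assume a 32x32 CIFAR input):

```python
import torch
import torch.nn as nn

# One VGG-style block: convolution, batch norm, activation, then pooling.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # 3 input channels -> 64 maps
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),       # halves spatial resolution
)

x = torch.randn(1, 3, 32, 32)   # a dummy CIFAR-sized image batch
out = block(x)
print(out.shape)                # torch.Size([1, 64, 16, 16])
```

Note how `padding=1` with a 3x3 kernel preserves the spatial size, so only the pooling layer shrinks the feature map.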
VGG16 Architecture in PyTorch: Code and Explanation
Now, let's translate the VGG16 architecture into PyTorch code. This section provides a detailed breakdown of each layer and its purpose.
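As one possible sketch (not necessarily the exact code the article presents), the 13 convolutional layers and 3 fully connected layers can be built compactly by iterating over a configuration list. Batch normalization is included after each convolution, and the classifier assumes 32x32 CIFAR-100 inputs, which leave a 1x1x512 feature map after five pooling stages:

```python
import torch
import torch.nn as nn

# VGG16 configuration: numbers are conv output channels, "M" marks max-pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

class VGG16(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        layers, in_ch = [], 3
        for v in VGG16_CFG:
            if v == "M":
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [
                    nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                    nn.BatchNorm2d(v),
                    nn.ReLU(inplace=True),
                ]
                in_ch = v
        self.features = nn.Sequential(*layers)
        # For 32x32 inputs, five pooling stages leave a 1x1x512 feature map.
        self.classifier = nn.Sequential(
            nn.Linear(512, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)   # flatten all dims except the batch dim
        return self.classifier(x)
```

The configuration-list idiom mirrors how torchvision defines its own VGG variants: swapping in a different list (e.g. VGG19's) reuses the same class unchanged.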
Training Your VGG16 Model: Hyperparameters and Optimization
Training involves setting hyperparameters and using an optimization algorithm to adjust the model's weights. We'll use cross-entropy loss and stochastic gradient descent (SGD) for the VGG16 training process.
- Hyperparameter Tuning: Learn the impact of learning rate, batch size, and number of epochs.
- Loss Function: Use cross-entropy loss, suited for multi-class classification.
- Optimizer: Understand the role of SGD in minimizing loss and improving accuracy.
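A single training epoch with cross-entropy loss and SGD might look like the sketch below. The helper name `train_one_epoch` and the commented-out hyperparameter values are illustrative, not taken from the article:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, criterion, device):
    """Run one pass over the training data and return the average loss."""
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()                       # clear stale gradients
        loss = criterion(model(images), labels)
        loss.backward()                             # backpropagate
        optimizer.step()                            # update weights
        running_loss += loss.item() * images.size(0)
    return running_loss / len(loader.dataset)

# Typical setup (learning rate, momentum, weight decay are example values):
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
#                             momentum=0.9, weight_decay=5e-4)
```

Calling `model.train()` matters here: it enables dropout and switches batch normalization to use batch statistics rather than running averages.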
Evaluating Your Model: Testing on Unseen Data
After training, evaluate your model on the test set to assess its generalization ability. This step will give you an idea of accuracy and performance in real-world scenarios.
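An evaluation pass can be sketched as below; `evaluate` is an illustrative helper name, and top-1 accuracy is the metric assumed here:

```python
import torch

@torch.no_grad()           # no gradients needed at evaluation time
def evaluate(model, loader, device):
    """Return top-1 accuracy of the model over the given data loader."""
    model.eval()           # disable dropout, use BN running statistics
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total
```

For reference, random guessing on CIFAR-100 yields about 1% accuracy, so even early epochs should clear that bar by a wide margin.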
Beyond the Basics: Future Work and Customization
Congratulations on building VGG16 from scratch! Now, challenge yourself with these extensions:
- Explore different datasets: Try CIFAR-10 or a subset of ImageNet to test the model's scalability.
- Tune Hyperparameters: Experiment with different learning rates, batch sizes, and optimizers for better performance.
- Modify the Architecture: Add or remove layers, or even try building the deeper VGG19 variant from scratch.
- Check out implementations of other convolutional neural networks built from scratch in PyTorch.
Resources to Deepen Your Understanding
- Original VGG Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
- PyTorch Documentation: PyTorch nn.Module (Useful for defining the VGG architecture)
- Torchvision VGG Implementation: torchvision.models.vgg
This comprehensive guide empowers you to build and understand VGG16, taking your deep learning skills to the next level.