Created by Zalando to provide a more realistic and challenging image dataset while maintaining the convenient format of the original MNIST.
🔗 SourceCreated by Zalando to provide a more realistic and challenging image dataset while maintaining the convenient format of the original MNIST.
🔗 Source
Original Format: All images in the dataset are in grayscale format and sized exactly 28x28 pixels. They represent 10 distinct types of clothing and accessories.
* Hover over the bars to see exact sample counts.
mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225] for stability and convergence speed.# Define transformations for the dataset transform = transforms.Compose([ transforms.Resize((224, 224)), transforms.Grayscale(num_output_channels=3), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ]) # Load Training and Testing Datasets train_dataset = datasets.FashionMNIST( root="./data", train=True, download=True, transform=transform ) test_dataset = datasets.FashionMNIST( root="./data", train=False, download=True, transform=transform ) # Create DataLoaders train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) test_loader = DataLoader(test_dataset, batch_size=batch_size)
Extracts features hierarchically using dense connectivity patterns, where each layer receives inputs from all preceding layers.
Splits the image into fixed-size patches, linearly embeds them with position encodings, and feeds them into a standard Transformer encoder.
Below is the data pipeline through the model's stages, detailing the Tensor dimensions (Batch Size is denoted as B) and the semantic meaning of the data at each step.
(B, 1, 28, 28)Raw data from the Fashion-MNIST dataset. Each image is a grayscale pixel matrix (1 color channel) with an original size of 28x28 pixels.
(B, 3, 224, 224)Images are resized to 224x224 and duplicated into 3 color channels (RGB) to match the input structure of models pre-trained on ImageNet. Pixel values are normalized to a range of [-1, 1].
(B, Features)The CNN or ViT model acts as a Feature Extractor. The output is a compressed vector containing the high-level semantics of the image.
*For DenseNet121: Features = 1024
*For ViT-Base: Features = 768
(B, 10)The original classification head (typically 1000 classes) is replaced with a new Fully Connected Layer. It maps the feature vector (1024 or 768 dimensions) down to exactly 10 dimensions corresponding to the 10 clothing/footwear labels. The output values (Logits) are then passed through the Cross-Entropy Loss function to calculate the error during training.
Note: Performance metrics evaluated on the 10,000 unseen test samples.
| Model | Accuracy | Precision | Recall | F1-Score | Params | Inference |
|---|---|---|---|---|---|---|
| DenseNet121 | 0.9411 | 0.9411 | 0.9411 | 0.9410 | ~ 6.96M | ~ 91.63 ms |
| ViT-based | 0.9064 | 0.9069 | 0.9064 | 0.9054 | 🏆 ~ 1.32M | ⚡ ~ 6.63 ms |
| Ensemble (Soft Voting) | 🏆 0.9454 | 🏆 0.9456 | 🏆 0.9454 | 🏆 0.9454 | - | - |
To achieve the highest performance of 94.54% accuracy, we implemented an Ensemble model using a Soft Voting mechanism. Instead of just taking the final predicted class, we extract the output probability distributions (logits passed through softmax) from both the DenseNet121 and ViT-based models. By averaging these probabilities, the ensemble model effectively combines the hierarchical feature extraction strengths of CNNs with the global context understanding of Vision Transformers, significantly reducing confusion between visually similar classes like Pullover, Coat, and Shirt.
* Click on any matrix to view in full screen.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 - T-shirt/top | 0.86 | 0.92 | 0.88 | 1000 |
| 1 - Trouser | 0.99 | 0.99 | 0.99 | 1000 |
| 2 - Pullover | 0.89 | 0.94 | 0.91 | 1000 |
| 3 - Dress | 0.91 | 0.96 | 0.94 | 1000 |
| 4 - Coat | 0.95 | 0.89 | 0.92 | 1000 |
| 5 - Sandal | 0.99 | 0.99 | 0.99 | 1000 |
| 6 - Shirt | 0.87 | 0.76 | 0.81 | 1000 |
| 7 - Sneaker | 0.97 | 0.98 | 0.97 | 1000 |
| 8 - Bag | 0.99 | 0.99 | 0.99 | 1000 |
| 9 - Ankle boot | 0.98 | 0.97 | 0.97 | 1000 |
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 - T-shirt/top | 0.83 | 0.87 | 0.85 | 1000 |
| 1 - Trouser | 0.98 | 0.97 | 0.98 | 1000 |
| 2 - Pullover | 0.84 | 0.88 | 0.86 | 1000 |
| 3 - Dress | 0.89 | 0.92 | 0.90 | 1000 |
| 4 - Coat | 0.88 | 0.85 | 0.86 | 1000 |
| 5 - Sandal | 0.96 | 0.96 | 0.96 | 1000 |
| 6 - Shirt | 0.78 | 0.68 | 0.72 | 1000 |
| 7 - Sneaker | 0.94 | 0.95 | 0.95 | 1000 |
| 8 - Bag | 0.97 | 0.98 | 0.97 | 1000 |
| 9 - Ankle boot | 0.96 | 0.94 | 0.95 | 1000 |
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 - T-shirt/top | 0.88 | 0.92 | 0.90 | 1000 |
| 1 - Trouser | 0.99 | 0.99 | 0.99 | 1000 |
| 2 - Pullover | 0.90 | 0.95 | 0.92 | 1000 |
| 3 - Dress | 0.92 | 0.97 | 0.94 | 1000 |
| 4 - Coat | 0.96 | 0.89 | 0.93 | 1000 |
| 5 - Sandal | 0.99 | 0.99 | 0.99 | 1000 |
| 6 - Shirt | 0.89 | 0.78 | 0.83 | 1000 |
| 7 - Sneaker | 0.97 | 0.98 | 0.98 | 1000 |
| 8 - Bag | 0.99 | 0.99 | 0.99 | 1000 |
| 9 - Ankle boot | 0.98 | 0.97 | 0.98 | 1000 |
We reviewed images that were frequently mispredicted across the models. We noticed several classes sharing similar visual appearances:
Using Grad-CAM to visualize which regions of the input image contribute most to the prediction, we found that DenseNet121 mainly focuses on the important shape and contour areas of each clothing item instead of irrelevant background information:
Although not explicitly detailed in our final report, we conducted robustness experiments by injecting Gaussian noise into the Fashion-MNIST test images. This allowed us to observe how Convolutional architectures (DenseNet) compare to Transformer architectures (ViT) under visual distortion.
import torch def add_noise(images, std=0.1): # Injecting Gaussian noise directly into normalized tensors noise = torch.randn_like(images) * std noisy_images = images + noise # Clamp to ensure valid pixel distribution ranges [-1, 1] return torch.clamp(noisy_images, -1, 1)
| Model | Clean Acc (%) | Noisy Acc (%) | Drop (%) |
|---|---|---|---|
| DenseNet121 | 94.11 | ~ 84.20 | -9.91 |
| ViT-based | 90.64 | 90.03 | -0.61 |
| Ensemble | 94.54 | ~ 88.72 | -5.82 |