
🧩 Multimodal Classification Report

MSCOCO Subset Analysis & CLIP Modeling

📚 01. Dataset Overview

MSCOCO is a large-scale object detection, segmentation, and captioning dataset providing highly authentic Image-Text pairs describing complex everyday scenes.

  • 300K+ original labeled samples
  • 80 original categories
  • 5 selected target classes: 🐈 cat, 🐄 cow, 🐕 dog, 🐎 horse, 🐑 sheep
  • Data mapping structure: 1 image = 5 captions

📸 Sample Data Point: 1 Image ➔ 5 Captions

MSCOCO Sample Image
Corresponding Captions:
  • "A small dog carrying a large Frisbee down a sidewalk."
  • "there is a small white dog that is carrying a frisbee"
  • "a small dog carries a frisbe in its mouth "
  • "Small toy sized dog carrying a Frisbee down the street."
  • "A white fluffy dog carries around a red Frisbee. "

🔍 02. Exploratory Data Analysis (EDA)

Data Insights

Unique Images per Class
Long-tail Distribution: The number of images per class varies significantly. Cat (3421) and Dog (2336) dominate, while Cow (1353), Sheep (1161), and Horse (933) have fewer representations.


Image Dimensions Distribution
Diverse Resolutions: As observed in the scatter plot, MSCOCO images vary widely in both height and width (from ~100 to 640 pixels). This variation necessitates strict image resizing during preprocessing before feeding images into the models.
Caption Length Distribution
Normal Distribution: The character length of the human-written captions roughly follows a normal distribution, peaking at around 40-50 characters per caption: long enough to be descriptive without being overly simplistic.
Top 20 Common Words
Vocabulary Focus: The most frequent words heavily reflect the rural and animal context of the selected classes, with keywords such as 'sheep', 'field', 'cows', 'cat', 'grass', and 'standing'.
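Both distributions are easy to reproduce. The sketch below uses a hypothetical three-caption stand-in for the real caption column; it computes caption lengths and the top-N word counts with `collections.Counter` (the stopword list is our own simplification):

```python
from collections import Counter
import re

# Hypothetical stand-in for the real caption column (df_animal['caption'])
captions = [
    "A small dog carrying a large Frisbee down a sidewalk.",
    "there is a small white dog that is carrying a frisbee",
    "Two cows standing in a grassy field near a fence.",
]

# Character-length distribution: the report's histogram peaks around 40-50 chars
lengths = [len(c) for c in captions]
print("min/mean/max length:", min(lengths), sum(lengths) / len(lengths), max(lengths))

# Top-N most common words (lowercased, stopwords removed for clarity)
stopwords = {"a", "the", "is", "in", "of", "there", "that", "near", "down", "two"}
tokens = [w for c in captions for w in re.findall(r"[a-z]+", c.lower())
          if w not in stopwords]
top20 = Counter(tokens).most_common(20)
print(top20)
```

On the full dataset, the same two lines of `Counter` logic produce the word-frequency bar chart shown above.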

⚙️ 03. Setup & Preprocessing Pipeline

  • Class Filtering: Extract only annotations related to the 5 target animals to maintain domain focus.
  • Human & Redundancy Exclusion: Explicitly scan for and drop any image containing the 'person' category to prevent the model from learning unintended correlations. We also deduplicate rows where a single image carries multiple annotations of the same class.
  • Evaluation Sampling: To prevent data explosion, we randomly selected exactly 200 images per class (1,000 in total) to perform our zero-shot evaluation. For the few-shot evaluation, we used 10 samples per class to build the prototype, and tested on the remaining 190 samples per class.
# Extract annotations & exclude 'person' class
for annotation in coco_annotations['annotations']:
    if annotation['category_id'] in target_category_ids:
        img_id = annotation['image_id']
        if img_id not in image_ids_with_person_list:
            extracted_data.append({
                'image_id': img_id,
                'class': category_id_to_name[annotation['category_id']]
            })

# Drop duplicates to prevent data explosion
df_animal = df_animal.drop_duplicates(subset=['image_id', 'class', 'caption'])

# We decided to take 1000 images (200 each class) for further evaluation
sample_df = df_animal.groupby('class').apply(lambda x: x.sample(n=200, random_state=42)).reset_index(drop=True)
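The few-shot split described above (10 support images per class for the prototype, 190 held out for testing) can be sketched in pure Python; the real pipeline would do this with a pandas `groupby`/`sample` as in the snippet above, and the dictionary layout here is our own:

```python
import random

# Hypothetical stand-in for sample_df: 200 image ids per class
classes = ["cat", "cow", "dog", "horse", "sheep"]
samples = {c: [f"{c}_{i}" for i in range(200)] for c in classes}

rng = random.Random(42)
support, query = {}, {}
for c in classes:
    # 10 support images per class feed the text-prototype construction ...
    support[c] = rng.sample(samples[c], k=10)
    # ... and the remaining 190 per class are held out for evaluation
    query[c] = [img for img in samples[c] if img not in support[c]]

print({c: (len(support[c]), len(query[c])) for c in classes})  # (10, 190) each
```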

🏗️ 04. Model Building Pipeline

💡 Core Approach: We utilize Contrastive Language-Image Pre-training (CLIP) to perform classification without traditional retraining, applying an aggregated Text Prototype method for Few-shot capabilities.
CLIP Architecture

CLIP uses dual encoders (a Vision Transformer for images and a Text Transformer for captions) to map both modalities into a shared, highly semantic multi-dimensional embedding space.

🔄 End-to-End Forward Pass & Prototype Creation

Below is the specialized data pipeline showing how raw data transforms through each stage. Note that tensor dimensions heavily depend on the specific CLIP backbone used (denoted as Res for input resolution and D_feat for feature dimension).

Stage         | Image Branch             | Text Branch
Raw Data      | Image (H, W, 3)          | Captions (strings)
Preprocessed  | Img (1, 3, Res, Res)     | Tokens (N_caps, 77)
Encoders      | ViT / ResNet             | Text Transformer
Embeddings    | Img Feat (1, D_feat)     | Mean(Txt) ➔ (5, D_feat)
Similarity    | Logits (1, 5)            |

1. Preprocessing (Raw Data ➔ Tensors)

Original images with highly variable sizes (H, W) are first resized and center-cropped by the image preprocessor to a fixed resolution tensor (1, 3, Res, Res) specific to the chosen backbone. Concurrently, raw text captions are tokenized, padded, or truncated to a strictly fixed sequence length of 77 tokens.
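The fixed 77-token context is worth illustrating. Below is a minimal pure-Python sketch of the pad/truncate behavior; the token ids are fake stand-ins (the real tokenizer encodes text with a BPE vocabulary via `clip.tokenize`), but the resulting (N_caps, 77) shape matches the pipeline above:

```python
# Minimal sketch of CLIP-style tokenization: every caption becomes exactly
# CONTEXT_LEN token ids (start token, text, end token, zero padding).
CONTEXT_LEN = 77
SOT, EOT, PAD = 1, 2, 0  # fake special-token ids for illustration

def tokenize(caption: str, context_len: int = CONTEXT_LEN) -> list[int]:
    # Stand-in "encoding": one fake id per whitespace word (real CLIP uses BPE)
    word_ids = [hash(w) % 50000 + 3 for w in caption.split()]
    ids = [SOT] + word_ids[: context_len - 2] + [EOT]  # truncate if too long
    return ids + [PAD] * (context_len - len(ids))      # pad to fixed length

tokens = [tokenize(c) for c in ["A small dog carrying a large Frisbee.",
                                "there is a small white dog"]]
print(len(tokens), len(tokens[0]))  # (N_caps, 77), as in the pipeline above
```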

| Backbone Architecture | Input Resolution (Res) | Feature Dimension (D_feat) |
|-----------------------|------------------------|----------------------------|
| ViT-B/32 & ViT-B/16   | 224 x 224              | 512                        |
| ViT-L/14              | 224 x 224              | 768                        |
| ViT-L/14@336px        | 336 x 336              | 768                        |
| RN50                  | 224 x 224              | 1024                       |
| RN50x4                | 288 x 288              | 640                        |
| RN50x16               | 384 x 384              | 768                        |

2. Text Prototype Aggregation (The Few-Shot Core)

After encoding, instead of matching against a single zero-shot prompt ("a photo of a {animal}"), the few-shot model processes all available support captions for a class. The resulting text features are L2-normalized, averaged together (Mean Aggregation), and normalized again to form a single, robust Class Prototype per class; stacking the 5 class prototypes yields a (5, D_feat) matrix.

🔥 Core Logic: Zero-Shot Similarity
clip_inference.py
# 1. Extract features from the image and the text prompts
image_features = model.encode_image(image_input)  # Shape: (1, D_feat)
text_features = model.encode_text(text_inputs)    # Shape: (5, D_feat)

# 2. L2-normalize so both modalities live on the same unit hypersphere
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# 3. Core step: cosine similarity via matrix multiplication
# similarity = (1, D_feat) @ (D_feat, 5) -> (1, 5) logits
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Pick the class with the highest similarity score
predicted_class = similarity.argmax(dim=-1)
# Create an aggregated representative vector (prototype) for each class
for c in classes:
    # `inputs` holds the tokenized support captions of class c
    # (prepared beforehand with the CLIP processor)
    outputs = model.get_text_features(**inputs)
    all_text_features = outputs / outputs.norm(dim=-1, keepdim=True)

    # Aggregate the N caption vectors into a single prototype via the mean
    class_vector_aggregated = torch.mean(all_text_features, dim=0)

    # Re-normalize the prototype back onto the unit hypersphere
    class_vector_final = class_vector_aggregated / class_vector_aggregated.norm(dim=-1, keepdim=True)
    class_vectors[c] = class_vector_final

3. Classification via Similarity

When predicting a new image, its extracted feature vector (1, D_feat) is compared against the 5 pre-computed text prototypes via dot product (equivalent to cosine similarity, since all vectors are L2-normalized). The class with the highest similarity score is chosen.
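A toy numeric sketch of this final step, with 3-D vectors standing in for the real (D_feat,) features (all values made up, and only two of the five classes shown):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot product == cosine similarity
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy 3-D stand-ins for the real class prototypes
prototypes = {
    "cat": l2_normalize([0.9, 0.1, 0.0]),
    "dog": l2_normalize([0.1, 0.9, 0.2]),
}
image_feat = l2_normalize([0.2, 0.8, 0.1])

# Dot product against every prototype, then take the argmax
scores = {c: sum(a * b for a, b in zip(image_feat, p))
          for c, p in prototypes.items()}
predicted = max(scores, key=scores.get)
print(predicted)  # "dog"
```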

🏆 05. Evaluation & Results

CLIP Backbone Benchmarking (Zero-Shot)

We evaluated multiple CLIP variations to find the optimal architecture balancing Accuracy, Model Size, and Inference Speed.

| Model             | Accuracy | Parameters | Prediction Time |
|-------------------|----------|------------|-----------------|
| ViT-L/14@336px 🏆 | 0.978    | ~427.9M    | 31.30 s         |
| ViT-L/14 🥈       | 0.973    | ~427.6M    | 26.01 s         |
| RN50x16 🥉        | 0.964    | ~291.0M    | 29.71 s         |
| ViT-B/16          | 0.954    | ~149.6M    | 17.00 s         |
| RN50x4            | 0.950    | ~178.3M    | 22.14 s         |
| ViT-B/32          | 0.930    | ~151.3M    | ⚡ 17.66 s      |
| RN50              | 0.920    | ⚡ ~102.0M | ⚡ 16.57 s      |
💡 Benchmark Insights: ViT-L/14@336px emerges as the best performer, achieving the highest accuracy (~97.8%) at the cost of the longest inference time. ViT-B/16 is a strong lightweight alternative: roughly half the parameters of the large models, noticeably faster inference, and accuracy only about 2 percentage points behind.

CLIP Backbone Benchmarking (Few-Shot & Ensemble)

We also evaluated the models using the aggregated Text Prototype (Few-shot) approach, including an Ensemble method combining all backbone predictions via Majority Vote.

| Model                       | Accuracy | Parameters | Prediction Time |
|-----------------------------|----------|------------|-----------------|
| ViT-L/14@336px 🏆           | 0.982    | ~427.94M   | 24.74 s         |
| RN50x16 🥈                  | 0.975    | ~290.98M   | 18.57 s         |
| ViT-L/14 🥉                 | 0.974    | ~427.62M   | 15.98 s         |
| 🌟 Ensemble (Majority Vote) | 0.969    | ~1727.75M  | ~96.50 s        |
| ViT-B/16                    | 0.956    | ~149.62M   | 8.44 s          |
| RN50x4                      | 0.944    | ~178.30M   | 11.96 s         |
| RN50                        | 0.919    | ⚡ 102.01M | ⚡ 8.01 s       |
| ViT-B/32                    | 0.918    | ~151.28M   | 8.80 s          |
💡 Few-Shot Insights: Aggregating text prototypes improves accuracy for the larger backbones (ViT-L/14@336px, ViT-L/14, RN50x16) compared to Zero-Shot, while the smaller backbones (ViT-B/32, RN50x4, RN50) dip slightly. ViT-L/14@336px remains the top single model (98.2%). Interestingly, the Ensemble model (96.9%) does not outperform the best individual ViT model, while requiring vastly more parameters (~1.7B) and inference time.

🤝 The Ensemble Strategy (Majority Vote)

The Ensemble's 96.9% accuracy in the Few-Shot evaluation comes from aggregating the predictions of all 7 distinct CLIP backbones rather than relying on a single architecture. For each image, every model casts a "vote" for a specific class, and a Majority Vote algorithm selects the final label. This cancels out the individual biases and weaknesses of specific backbones.
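A minimal sketch of the Majority Vote (the per-model predictions here are toy stand-ins, not the actual benchmark outputs):

```python
from collections import Counter

# Hypothetical per-model predictions for three images (7 backbones, as in the report)
model_preds = {
    "ViT-L/14@336px": ["cow", "sheep", "dog"],
    "ViT-L/14":       ["cow", "sheep", "dog"],
    "RN50x16":        ["cow", "horse", "dog"],
    "ViT-B/16":       ["sheep", "sheep", "dog"],
    "RN50x4":         ["cow", "sheep", "cat"],
    "ViT-B/32":       ["cow", "cow", "dog"],
    "RN50":           ["cow", "sheep", "dog"],
}

# For each image, every backbone casts one vote; the most common label wins
ensemble_preds = [
    Counter(votes).most_common(1)[0][0]
    for votes in zip(*model_preds.values())
]
print(ensemble_preds)  # ["cow", "sheep", "dog"]
```

Note that the vote here is unweighted; a weighted variant (e.g. weighting each backbone by its standalone accuracy) would be a natural extension.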

Confusion Matrices

🔥 Core Logic: Visualizing Multiple Confusion Matrices
evaluation.py
# 1. Set up a subplot grid covering every individual model plus the Ensemble
fig, axes = plt.subplots(rows, cols, figsize=(18, 5 * rows))
axes = axes.flatten()  # index the 2-D grid linearly

# 2. Plot each individual model's confusion matrix (blue palette)
for idx, (model_name, data) in enumerate(results.items()):
    cm = confusion_matrix(true_labels, data["Predictions"])
    sns.heatmap(cm, annot=True, cmap='Blues', ax=axes[idx])

# 3. Plot the Ensemble's confusion matrix (orange palette for emphasis)
cm_ensemble = confusion_matrix(true_labels, ensemble_preds)
sns.heatmap(cm_ensemble, annot=True, cmap='Oranges', ax=axes[len(results)])

plt.tight_layout()
plt.show()


Zero-Shot confusion matrices: ViT-L/14@336px, ViT-L/14, RN50x16, ViT-B/16, RN50x4, ViT-B/32, RN50

Few-Shot confusion matrices: ViT-L/14@336px, ViT-L/14, 🌟 Ensemble, RN50x16, ViT-B/16, RN50x4, ViT-B/32, RN50

🔬 06. Error Analysis & Interpretability

Common Wrong Predictions

We reviewed the images that were mispredicted by the models. ViT-B/32 initially produced 70 mispredictions; switching to the more powerful RN50x16 backbone reduced the errors to just 8 images.

  • True: Cow ➔ Predict: Sheep / Horse
  • True: Sheep ➔ Predict: Cow / Horse
  • True: Dog ➔ Predict: Cat / Cow / Horse
  • True: Horse ➔ Predict: Cow
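The misprediction review above can be reproduced with a simple filter over the predictions (the helper name and toy data below are our own):

```python
# Collect the ids of images a model got wrong (toy data for illustration)
def mispredicted(image_ids, true_labels, predictions):
    return [img for img, t, p in zip(image_ids, true_labels, predictions)
            if t != p]

image_ids = ["img_001", "img_002", "img_003"]
true_labels = ["cow", "dog", "sheep"]
predictions = ["sheep", "dog", "sheep"]  # hypothetical model outputs
print(mispredicted(image_ids, true_labels, predictions))  # ["img_001"]
```

Running this for two backbones and intersecting the resulting sets separates errors the stronger model fixed from errors shared by both.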

🔍 Error Observations

Looking at the mispredicted images, we identified several common factors causing confusion:

  • 📐 Shape Similarity: The animal's physical shape strongly resembles another class.
  • 🔍 Size: The animal is too small in the frame, lacking distinct features.
  • 🍃 Camouflage: The animal blends into the background.
  • ⚠️ Dataset Noise: The image was mislabeled in the original COCO annotations.

Interpretability with Grad-CAM

We used Grad-CAM to visualize what the RN50x16 model focuses on when making predictions.
🔥 Core Logic: Grad-CAM Extraction
interpretability.py
# 1. Hook into the last convolutional layer of RN50x16's visual backbone
target_layers = [model.visual.layer4[-1]]

# 2. Initialize Grad-CAM on CLIP's architecture
cam = GradCAM(model=model, target_layers=target_layers, use_cuda=True)

# 3. Build the target for the predicted class (e.g. 'sheep' = index 3):
# gradients flow backward from the logits to the target layer,
# revealing which regions influence the prediction most
targets = [ClassifierOutputTarget(predicted_class_idx)]

# 4. Generate the heatmap
grayscale_cam = cam(input_tensor=image_tensor, targets=targets)[0]

# Overlay the heatmap on the original image to visualize the focus region
cam_image = show_cam_on_image(original_image, grayscale_cam, use_rgb=True)

Original image and Grad-CAM overlay (RN50x16). True: cow | Pred: sheep
📌 Insight (Data Mismatch): In cases like confusing a Cow for a Sheep, Grad-CAM shows that the strongest activations do not fall on the animal itself. Instead, the model relies on the grassy background context, highlighting a classic data mismatch problem in contrastive pre-training.