
🧩 Multimodal Classification Report

MSCOCO Subset Analysis & CLIP Modeling

📚 01. Dataset Overview

MSCOCO is a large-scale object detection, segmentation, and captioning dataset providing highly authentic Image-Text pairs describing complex everyday scenes.

  • 300K+ original labeled samples
  • 80 original categories
  • 5 selected target classes: 🐈 cat, 🐄 cow, 🐕 dog, 🐎 horse, 🐑 sheep
  • Data mapping structure: 1 image = 5 captions

📸 Sample Data Point: 1 Image ➔ 5 Captions

MSCOCO Sample Image
Corresponding Captions:
  • "A small dog carrying a large Frisbee down a sidewalk."
  • "there is a small white dog that is carrying a frisbee"
  • "a small dog carries a frisbe in its mouth "
  • "Small toy sized dog carrying a Frisbee down the street."
  • "A white fluffy dog carries around a red Frisbee. "

🔍 02. Exploratory Data Analysis (EDA)

Data Insights

Unique Images per Class
Long-tail Distribution: The number of images per class varies significantly. Cat (3421) and Dog (2336) dominate, while Cow (1353), Sheep (1161), and Horse (933) have fewer representations.


Image Dimensions Distribution
Diverse Resolutions: As observed in the scatter plot, MSCOCO images vary widely in both height and width (from ~100 to 640 pixels). This variation necessitates strict image resizing during preprocessing before feeding images into the models.
Caption Length Distribution
Normal Distribution: The character length of the human-written captions roughly follows a normal distribution, peaking at around 40-50 characters per caption: long enough to be descriptive without being overly simplistic.
Top 20 Common Words
Vocabulary Focus: The most frequent words heavily reflect the rural and animal context of the selected classes, with keywords such as 'sheep', 'field', 'cows', 'cat', 'grass', and 'standing'.
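Both distributions are easy to reproduce. The sketch below uses a hypothetical three-caption stand-in for the real caption column; it computes caption lengths and the top-N word counts with `collections.Counter` (the stopword list is our own simplification):

```python
from collections import Counter
import re

# Hypothetical stand-in for the real caption column (df_animal['caption'])
captions = [
    "A small dog carrying a large Frisbee down a sidewalk.",
    "there is a small white dog that is carrying a frisbee",
    "Two cows standing in a grassy field near a fence.",
]

# Character-length distribution: the report's histogram peaks around 40-50 chars
lengths = [len(c) for c in captions]
print("min/mean/max length:", min(lengths), sum(lengths) / len(lengths), max(lengths))

# Top-N most common words (lowercased, stopwords removed for clarity)
stopwords = {"a", "the", "is", "in", "of", "there", "that", "near", "down", "two"}
tokens = [w for c in captions for w in re.findall(r"[a-z]+", c.lower())
          if w not in stopwords]
top20 = Counter(tokens).most_common(20)
print(top20)
```

On the full dataset, the same two lines of `Counter` logic produce the word-frequency bar chart shown above.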

⚙️ 03. Setup & Preprocessing Pipeline

  • Class Filtering: Extract only annotations related to the 5 target animals to maintain domain focus.
  • Human & Redundancy Exclusion: Explicitly scan for and drop any image containing the 'person' category to prevent the model from learning unintended correlations. We also deduplicate rows where a single image carries multiple annotations of the same class.
  • Evaluation Sampling: To prevent data explosion, we randomly selected exactly 200 images per class (1,000 in total) to perform our zero-shot evaluation. For the few-shot evaluation, we used 10 samples per class to build the prototype, and tested on the remaining 190 samples per class.
# Extract annotations & exclude 'person' class
for annotation in coco_annotations['annotations']:
    if annotation['category_id'] in target_category_ids:
        img_id = annotation['image_id']
        if img_id not in image_ids_with_person_list:
            extracted_data.append({
                'image_id': img_id,
                'class': category_id_to_name[annotation['category_id']]
            })

# Drop duplicates to prevent data explosion
df_animal = df_animal.drop_duplicates(subset=['image_id', 'class', 'caption'])

# We decided to take 1000 images (200 each class) for further evaluation
sample_df = df_animal.groupby('class').apply(lambda x: x.sample(n=200, random_state=42)).reset_index(drop=True)
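The few-shot split described above (10 support images per class for the prototype, 190 held out for testing) can be sketched in pure Python; the real pipeline would do this with a pandas `groupby`/`sample` as in the snippet above, and the dictionary layout here is our own:

```python
import random

# Hypothetical stand-in for sample_df: 200 image ids per class
classes = ["cat", "cow", "dog", "horse", "sheep"]
samples = {c: [f"{c}_{i}" for i in range(200)] for c in classes}

rng = random.Random(42)
support, query = {}, {}
for c in classes:
    # 10 support images per class feed the text-prototype construction ...
    support[c] = rng.sample(samples[c], k=10)
    # ... and the remaining 190 per class are held out for evaluation
    query[c] = [img for img in samples[c] if img not in support[c]]

print({c: (len(support[c]), len(query[c])) for c in classes})  # (10, 190) each
```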

🏗️ 04. Model Building Pipeline

💡 Core Approach: We utilize Contrastive Language-Image Pre-training (CLIP) to perform classification without traditional retraining, applying an aggregated Text Prototype method for Few-shot capabilities.
CLIP Architecture

CLIP uses dual encoders (a Vision Transformer for images and a Text Transformer for captions) to map both modalities into a shared, highly semantic multi-dimensional embedding space.

🔄 End-to-End Forward Pass & Prototype Creation

Below is the specialized data pipeline showing how raw data transforms through each stage. Note that tensor dimensions heavily depend on the specific CLIP backbone used (denoted as Res for input resolution and D_feat for feature dimension).

Stage         | Image Branch             | Text Branch
Raw Data      | Image (H, W, 3)          | Captions (strings)
Preprocessed  | Img (1, 3, Res, Res)     | Tokens (N_caps, 77)
Encoders      | ViT / ResNet             | Text Transformer
Embeddings    | Img Feat (1, D_feat)     | Mean(Txt) ➔ (5, D_feat)
Similarity    | Logits (1, 5)            |

1. Preprocessing (Raw Data ➔ Tensors)

Original images with highly variable sizes (H, W) are first resized and center-cropped by the image preprocessor to a fixed resolution tensor (1, 3, Res, Res) specific to the chosen backbone. Concurrently, raw text captions are tokenized, padded, or truncated to a strictly fixed sequence length of 77 tokens.
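The fixed 77-token context is worth illustrating. Below is a minimal pure-Python sketch of the pad/truncate behavior; the token ids are fake stand-ins (the real tokenizer encodes text with a BPE vocabulary via `clip.tokenize`), but the resulting (N_caps, 77) shape matches the pipeline above:

```python
# Minimal sketch of CLIP-style tokenization: every caption becomes exactly
# CONTEXT_LEN token ids (start token, text, end token, zero padding).
CONTEXT_LEN = 77
SOT, EOT, PAD = 1, 2, 0  # fake special-token ids for illustration

def tokenize(caption: str, context_len: int = CONTEXT_LEN) -> list[int]:
    # Stand-in "encoding": one fake id per whitespace word (real CLIP uses BPE)
    word_ids = [hash(w) % 50000 + 3 for w in caption.split()]
    ids = [SOT] + word_ids[: context_len - 2] + [EOT]  # truncate if too long
    return ids + [PAD] * (context_len - len(ids))      # pad to fixed length

tokens = [tokenize(c) for c in ["A small dog carrying a large Frisbee.",
                                "there is a small white dog"]]
print(len(tokens), len(tokens[0]))  # (N_caps, 77), as in the pipeline above
```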

| Backbone Architecture | Input Resolution (Res) | Feature Dimension (D_feat) |
|-----------------------|------------------------|----------------------------|
| ViT-B/32 & ViT-B/16   | 224 x 224              | 512                        |
| ViT-L/14              | 224 x 224              | 768                        |
| ViT-L/14@336px        | 336 x 336              | 768                        |
| RN50                  | 224 x 224              | 1024                       |
| RN50x4                | 288 x 288              | 640                        |
| RN50x16               | 384 x 384              | 768                        |

2. Text Prototype Aggregation (The Few-Shot Core)

After encoding, instead of matching against a single zero-shot prompt ("a photo of a {animal}"), the few-shot model processes all available support captions for a class. The resulting text features are L2-normalized, averaged together (Mean Aggregation), and normalized again to form a single, robust Class Prototype per class; stacking the 5 class prototypes yields a (5, D_feat) matrix.

🔥 Core Logic: Zero-Shot Similarity
clip_inference.py
# 1. Extract features from the image and the text prompts
image_features = model.encode_image(image_input)  # Shape: (1, D_feat)
text_features = model.encode_text(text_inputs)    # Shape: (5, D_feat)

# 2. L2-normalize so both modalities live on the same unit hypersphere
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# 3. Core step: cosine similarity via matrix multiplication
# similarity = (1, D_feat) @ (D_feat, 5) -> (1, 5) logits
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Pick the class with the highest similarity score
predicted_class = similarity.argmax(dim=-1)
# Create an aggregated representative vector (prototype) for each class
for c in classes:
    # `inputs` holds the tokenized support captions of class c
    # (prepared beforehand with the CLIP processor)
    outputs = model.get_text_features(**inputs)
    all_text_features = outputs / outputs.norm(dim=-1, keepdim=True)

    # Aggregate the N caption vectors into a single prototype via the mean
    class_vector_aggregated = torch.mean(all_text_features, dim=0)

    # Re-normalize the prototype back onto the unit hypersphere
    class_vector_final = class_vector_aggregated / class_vector_aggregated.norm(dim=-1, keepdim=True)
    class_vectors[c] = class_vector_final

3. Classification via Similarity

When predicting a new image, its extracted feature vector (1, D_feat) is compared against the 5 pre-computed text prototypes via dot product (equivalent to cosine similarity, since all vectors are L2-normalized). The class with the highest similarity score is chosen.
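A toy numeric sketch of this final step, with 3-D vectors standing in for the real (D_feat,) features (all values made up, and only two of the five classes shown):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot product == cosine similarity
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy 3-D stand-ins for the real class prototypes
prototypes = {
    "cat": l2_normalize([0.9, 0.1, 0.0]),
    "dog": l2_normalize([0.1, 0.9, 0.2]),
}
image_feat = l2_normalize([0.2, 0.8, 0.1])

# Dot product against every prototype, then take the argmax
scores = {c: sum(a * b for a, b in zip(image_feat, p))
          for c, p in prototypes.items()}
predicted = max(scores, key=scores.get)
print(predicted)  # "dog"
```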

🏆 05. Evaluation & Results

CLIP Backbone Benchmarking (Zero-Shot)

We evaluated multiple CLIP variations to find the optimal architecture balancing Accuracy, Model Size, and Inference Speed.

| Model             | Accuracy | Parameters | Prediction Time |
|-------------------|----------|------------|-----------------|
| ViT-L/14@336px 🏆 | 0.978    | ~427.9M    | 31.30 s         |
| ViT-L/14 🥈       | 0.973    | ~427.6M    | 26.01 s         |
| RN50x16 🥉        | 0.964    | ~291.0M    | 29.71 s         |
| ViT-B/16          | 0.954    | ~149.6M    | 17.00 s         |
| RN50x4            | 0.950    | ~178.3M    | 22.14 s         |
| ViT-B/32          | 0.930    | ~151.3M    | ⚡ 17.66 s      |
| RN50              | 0.920    | ⚡ ~102.0M | ⚡ 16.57 s      |
💡 Benchmark Insights: ViT-L/14@336px emerges as the best performer, achieving the highest accuracy (~97.8%) at the cost of the longest inference time. ViT-B/16 is a strong lightweight alternative: roughly half the parameters of the large models, noticeably faster inference, and accuracy only about 2 percentage points behind.

CLIP Backbone Benchmarking (Few-Shot & Ensemble)

We also evaluated the models using the aggregated Text Prototype (Few-shot) approach, including an Ensemble method combining all backbone predictions via Majority Vote.

| Model                       | Accuracy | Parameters | Prediction Time |
|-----------------------------|----------|------------|-----------------|
| ViT-L/14@336px 🏆           | 0.982    | ~427.94M   | 24.74 s         |
| RN50x16 🥈                  | 0.975    | ~290.98M   | 18.57 s         |
| ViT-L/14 🥉                 | 0.974    | ~427.62M   | 15.98 s         |
| 🌟 Ensemble (Majority Vote) | 0.969    | ~1727.75M  | ~96.50 s        |
| ViT-B/16                    | 0.956    | ~149.62M   | 8.44 s          |
| RN50x4                      | 0.944    | ~178.30M   | 11.96 s         |
| RN50                        | 0.919    | ⚡ 102.01M | ⚡ 8.01 s       |
| ViT-B/32                    | 0.918    | ~151.28M   | 8.80 s          |
💡 Few-Shot Insights: Aggregating text prototypes improves accuracy for the larger backbones (ViT-L/14@336px, ViT-L/14, RN50x16) compared to Zero-Shot, while the smaller backbones (ViT-B/32, RN50x4, RN50) dip slightly. ViT-L/14@336px remains the top single model (98.2%). Interestingly, the Ensemble model (96.9%) does not outperform the best individual ViT model, while requiring vastly more parameters (~1.7B) and inference time.

🤝 The Ensemble Strategy (Majority Vote)

The Ensemble's 96.9% accuracy in the Few-Shot evaluation comes from aggregating the predictions of all 7 distinct CLIP backbones rather than relying on a single architecture. For each image, every model casts a "vote" for a specific class, and a Majority Vote algorithm selects the final label. This cancels out the individual biases and weaknesses of specific backbones.
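A minimal sketch of the Majority Vote (the per-model predictions here are toy stand-ins, not the actual benchmark outputs):

```python
from collections import Counter

# Hypothetical per-model predictions for three images (7 backbones, as in the report)
model_preds = {
    "ViT-L/14@336px": ["cow", "sheep", "dog"],
    "ViT-L/14":       ["cow", "sheep", "dog"],
    "RN50x16":        ["cow", "horse", "dog"],
    "ViT-B/16":       ["sheep", "sheep", "dog"],
    "RN50x4":         ["cow", "sheep", "cat"],
    "ViT-B/32":       ["cow", "cow", "dog"],
    "RN50":           ["cow", "sheep", "dog"],
}

# For each image, every backbone casts one vote; the most common label wins
ensemble_preds = [
    Counter(votes).most_common(1)[0][0]
    for votes in zip(*model_preds.values())
]
print(ensemble_preds)  # ["cow", "sheep", "dog"]
```

Note that the vote here is unweighted; a weighted variant (e.g. weighting each backbone by its standalone accuracy) would be a natural extension.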

Confusion Matrices

🔥 Core Logic: Visualizing Multiple Confusion Matrices
evaluation.py
# 1. Set up a subplot grid covering every individual model plus the Ensemble
fig, axes = plt.subplots(rows, cols, figsize=(18, 5 * rows))
axes = axes.flatten()  # index the 2-D grid linearly

# 2. Plot each individual model's confusion matrix (blue palette)
for idx, (model_name, data) in enumerate(results.items()):
    cm = confusion_matrix(true_labels, data["Predictions"])
    sns.heatmap(cm, annot=True, cmap='Blues', ax=axes[idx])

# 3. Plot the Ensemble's confusion matrix (orange palette for emphasis)
cm_ensemble = confusion_matrix(true_labels, ensemble_preds)
sns.heatmap(cm_ensemble, annot=True, cmap='Oranges', ax=axes[len(results)])

plt.tight_layout()
plt.show()


Zero-Shot confusion matrices: ViT-L/14@336px, ViT-L/14, RN50x16, ViT-B/16, RN50x4, ViT-B/32, RN50

Few-Shot confusion matrices: ViT-L/14@336px, ViT-L/14, 🌟 Ensemble, RN50x16, ViT-B/16, RN50x4, ViT-B/32, RN50

🔬 06. Error Analysis & Interpretability

Common Wrong Predictions

We reviewed the images that were mispredicted by the models. ViT-B/32 initially produced 70 mispredictions; switching to the more powerful RN50x16 backbone reduced the errors to just 8 images.

  • True: Cow ➔ Predict: Sheep / Horse
  • True: Sheep ➔ Predict: Cow / Horse
  • True: Dog ➔ Predict: Cat / Cow / Horse
  • True: Horse ➔ Predict: Cow
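The misprediction review above can be reproduced with a simple filter over the predictions (the helper name and toy data below are our own):

```python
# Collect the ids of images a model got wrong (toy data for illustration)
def mispredicted(image_ids, true_labels, predictions):
    return [img for img, t, p in zip(image_ids, true_labels, predictions)
            if t != p]

image_ids = ["img_001", "img_002", "img_003"]
true_labels = ["cow", "dog", "sheep"]
predictions = ["sheep", "dog", "sheep"]  # hypothetical model outputs
print(mispredicted(image_ids, true_labels, predictions))  # ["img_001"]
```

Running this for two backbones and intersecting the resulting sets separates errors the stronger model fixed from errors shared by both.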

🔍 Error Observations

Looking at the mispredicted images, we identified several common factors causing confusion:

  • 📐 Shape Similarity: The animal's physical shape strongly resembles another class.
  • 🔍 Size: The animal is too small in the frame, lacking distinct features.
  • 🍃 Camouflage: The animal blends into the background.
  • ⚠️ Dataset Noise: The image was mislabeled in the original COCO annotations.

Interpretability with Grad-CAM

We used Grad-CAM to visualize what the RN50x16 model focuses on when making predictions.
🔥 Core Logic: Grad-CAM Extraction
interpretability.py
# 1. Hook into the last convolutional layer of RN50x16's visual backbone
target_layers = [model.visual.layer4[-1]]

# 2. Initialize Grad-CAM on CLIP's architecture
cam = GradCAM(model=model, target_layers=target_layers, use_cuda=True)

# 3. Build the target for the predicted class (e.g. 'sheep' = index 3):
# gradients flow backward from the logits to the target layer,
# revealing which regions influence the prediction most
targets = [ClassifierOutputTarget(predicted_class_idx)]

# 4. Generate the heatmap
grayscale_cam = cam(input_tensor=image_tensor, targets=targets)[0]

# Overlay the heatmap on the original image to visualize the focus region
cam_image = show_cam_on_image(original_image, grayscale_cam, use_rgb=True)

Original image and Grad-CAM overlay (RN50x16). True: cow | Pred: sheep
📌 Insight (Data Mismatch): In cases like confusing a Cow for a Sheep, Grad-CAM shows that the strongest activations do not fall on the animal itself. Instead, the model relies on the grassy background context, highlighting a classic data mismatch problem in contrastive pre-training.