PaliGemma2-3b-224-COCO

Thai Image Captioning Model Overview

PaliGemma2-3b-224-COCO is a Thai-language image captioning model fine-tuned from google/paligemma2-3b-pt-224. It delivers substantially better Thai captioning quality than our baseline model, MagiBoss/Blip2-Typhoon1.5-COCO, across every metric reported below.

Comparative Analysis

A comprehensive evaluation was conducted against MagiBoss/Blip2-Typhoon1.5-COCO under identical test conditions and metrics. Both models were run with max_new_tokens=64 to ensure a fair comparison.

Key Innovations

  • Native Thai language support optimized for cultural context
  • Enhanced performance metrics across all evaluation criteria
  • Efficient 4-bit quantization for practical deployment
  • Comprehensive COCO dataset coverage with Thai annotations

Technical Specifications

Model Architecture

  • Base Model: google/paligemma2-3b-pt-224
  • Fine-tuning Approach: Parameter-Efficient Fine-Tuning (PEFT)
  • Quantization: 4-bit precision with double quantization
  • Token Generation: max_new_tokens=64
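
The exact PEFT method and adapter hyperparameters for this checkpoint are not published, so the sketch below is only one plausible setup: it pairs the 4-bit NF4 double-quantization configuration listed above with a hypothetical LoRA adapter (rank, alpha, and target modules are illustrative assumptions, not the values used for this model).

import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# 4-bit NF4 quantization with double quantization, as specified above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)

# Hypothetical LoRA settings -- the actual adapter configuration is not stated
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable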

Training Dataset

  • Source: MagiBoss/COCO-Image-Captioning
  • Annotations: Thai-translated captions
  • Training Set Size: 142,291 images
  • Validation Set Size: 9,036 images
  • Data Processing: Manual quality assurance for Thai translations
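
A minimal sketch of loading this dataset with the datasets library; the split names follow the sizes listed above, while the column names are assumptions about the dataset schema rather than documented fields.

from datasets import load_dataset

dataset = load_dataset("MagiBoss/COCO-Image-Captioning")
print(dataset)  # expect a ~142k-image train split and a ~9k-image validation split

# Inspect one record; an image field plus a Thai caption field is assumed here
example = dataset["train"][0]
print(example.keys())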

Training Configuration

  • Learning Rate: 1e-5
  • Train Batch Size: 32
  • Eval Batch Size: 16
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9, 0.98), epsilon=1e-8)
  • LR Scheduler: linear, with a warm-up ratio of 0.1
  • Training Duration: 3 epochs
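
These hyperparameters map directly onto transformers.TrainingArguments; the reconstruction below is illustrative (output_dir is hypothetical, and the listed batch sizes are treated as per-device values).

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="paligemma2-3b-224-coco-th",  # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,  # linear warm-up over the first 10% of training steps
    bf16=True,         # matches the bfloat16 compute dtype used at inference
)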

Training Progress

| Training Loss | Epoch | Step  | Validation Loss |
|---------------|-------|-------|-----------------|
| 1.2182        | 1.0   | 4447  | 1.1568          |
| 1.1318        | 2.0   | 8894  | 1.0910          |
| 1.0837        | 3.0   | 13341 | 1.0724          |

Performance Evaluation

Evaluation Methodology

  • BERTScore Implementation: bert-base-multilingual-cased model
  • Thai Text Tokenization: newmm tokenizer (PyThaiNLP)
    • Chosen for accurate Thai word segmentation
    • Thai is written without spaces between words, so segmentation is required before n-gram metrics can be computed
  • Token Length: capped at max_new_tokens=64 for both models
  • Evaluation Environment: identical conditions for both models
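
The setup above can be reproduced along these lines, using PyThaiNLP's newmm engine for word segmentation and the evaluate library for BERTScore; the caption pair is an illustrative example, not drawn from the evaluation set.

# pip install pythainlp evaluate bert-score
import evaluate
from pythainlp.tokenize import word_tokenize

# Thai is written without spaces between words, so n-gram metrics need
# explicit word segmentation first
prediction = "สุนัขกำลังวิ่งอยู่บนชายหาด"  # "a dog is running on the beach"
reference = "สุนัขวิ่งเล่นริมทะเล"          # "a dog runs and plays by the sea"

print(" ".join(word_tokenize(prediction, engine="newmm")))
print(" ".join(word_tokenize(reference, engine="newmm")))

# BERTScore works on raw text with its own subword tokenizer, so the
# segmented words are only needed for the n-gram metrics (BLEU/ROUGE)
bertscore = evaluate.load("bertscore")
scores = bertscore.compute(
    predictions=[prediction],
    references=[reference],
    model_type="bert-base-multilingual-cased",
)
print(scores["f1"])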

Comparative Metrics

BLEU Score Analysis

| Model                  | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | BLEU (Cumulative) |
|------------------------|--------|--------|--------|--------|-------------------|
| Blip2-Typhoon1.5-COCO  | 7.49   | 1.48   | 0.26   | 0.05   | 0.62              |
| PaliGemma2-3b-224-COCO | 24.42  | 11.03  | 5.06   | 2.44   | 7.59              |

Key Achievement: a ~3.3x improvement in BLEU-1, indicating far better word-level agreement with the reference captions.
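
For context, the BLEU-1 through BLEU-4 columns above are consistent with individual n-gram precisions, with the cumulative score being their brevity-penalized geometric mean; this is an assumption about how the numbers were produced, but it matches them. A sketch of computing both with the evaluate library and newmm segmentation (the caption pair is again illustrative) follows.

import evaluate
from pythainlp.tokenize import word_tokenize

bleu = evaluate.load("bleu")

predictions = ["สุนัขกำลังวิ่งอยู่บนชายหาด"]  # illustrative model output
references = [["สุนัขวิ่งเล่นริมทะเล"]]        # illustrative reference caption

result = bleu.compute(
    predictions=predictions,
    references=references,
    tokenizer=lambda text: word_tokenize(text, engine="newmm"),
)

# result["precisions"] holds the 1- to 4-gram precisions;
# result["bleu"] is the cumulative score
for n, p in enumerate(result["precisions"], start=1):
    print(f"BLEU-{n}: {p * 100:.2f}")
print(f"Cumulative BLEU: {result['bleu'] * 100:.2f}")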

ROUGE Score Performance

| Model                  | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|------------------------|---------|---------|---------|------------|
| Blip2-Typhoon1.5-COCO  | 1.28    | 0.02    | 1.27    | 1.27       |
| PaliGemma2-3b-224-COCO | 3.17    | 0.04    | 3.17    | 3.16       |

Notable Improvement: a ~2.5x gain in ROUGE-1, showing better content overlap with the references.

BERTScore Evaluation

| Model                  | BERTScore (F1) |
|------------------------|----------------|
| Blip2-Typhoon1.5-COCO  | 66.53          |
| PaliGemma2-3b-224-COCO | 75.78          |

Significant Gain: a 9.25-point improvement in BERTScore F1, computed with the bert-base-multilingual-cased model.

METEOR Score Comparison

| Model                  | METEOR |
|------------------------|--------|
| Blip2-Typhoon1.5-COCO  | 8.97   |
| PaliGemma2-3b-224-COCO | 37.70  |

Outstanding Result: a 4.2x improvement in METEOR, demonstrating much stronger semantic alignment with the reference captions.

Implementation Guide

import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration, BitsAndBytesConfig
from PIL import Image

# Optimized quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize model components
processor = PaliGemmaProcessor.from_pretrained("MagiBoss/PaliGemma2-3b-224-COCO")
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "MagiBoss/PaliGemma2-3b-224-COCO", 
    quantization_config=quantization_config, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

model.eval()

def generate_caption(image_path):
    """
    Generate Thai caption for input image with optimized parameters.
    
    Args:
        image_path (str): Path to input image
        
    Returns:
        str: Generated Thai caption
    """
    image = Image.open(image_path).convert("RGB")
    # Prompt (Thai): "Describe the image below with a clear and detailed caption in Thai"
    question = "อธิบายภาพด้านล่างนี้ด้วยคำบรรยายที่ชัดเจนและละเอียดเป็นภาษาไทย"
    prompt = f"<image> {question}"
    
    model_inputs = processor(
        text=[prompt], 
        images=[image], 
        padding="max_length", 
        return_tensors="pt"
    ).to(device)
    
    # generate() returns the prompt tokens followed by the new tokens,
    # so record the prompt length to strip it from the decoded output
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            input_ids=model_inputs["input_ids"],
            pixel_values=model_inputs["pixel_values"],
            attention_mask=model_inputs["attention_mask"],
            max_new_tokens=64,  # matches the evaluation setting
            do_sample=False     # greedy decoding for reproducible captions
        )

    return processor.decode(output[0][input_len:], skip_special_tokens=True)

# Example usage
image_path = "example.jpg"
caption = generate_caption(image_path)
print("Generated Caption:", caption)

Environment Requirements

  • PEFT 0.14.0
  • Transformers 4.49.0.dev0
  • PyTorch 2.5.1+cu124
  • Datasets 3.1.0
  • Tokenizers 0.21.0

Current Limitations

  • Complex cultural context interpretation
  • Handling of Thai-specific idiomatic expressions
  • Regional dialect variations
  • Specialized domain vocabulary

Future Development Roadmap

  1. Dataset Expansion
    • Integration of region-specific Thai content
    • Domain-specific terminology enhancement
    • Cultural context enrichment
  2. Model Improvements
    • Attention mechanism optimization
    • Context understanding enhancement
    • Inference speed optimization
  3. Evaluation Methods
    • Thai-specific metrics development
    • Cultural accuracy assessment
    • Regional relevance validation

Acknowledgements

Special thanks to:

  • Google for the PaliGemma2-3b-pt-224 base model
  • The open-source community for development tools
  • Contributors to the Thai translation effort

Citation

@misc{PaliGemma2-3b-224-COCO,
  author = {MagiBoss},
  title = {PaliGemma2-3b-224-COCO},
  year = {2025},
  publisher = {Hugging Face},
  note = {https://huggingface.co/MagiBoss/PaliGemma2-3b-224-COCO}
}