PaliGemma2-3b-224-COCO
Thai Image Captioning Model Overview
PaliGemma2-3b-224-COCO is a Thai-language image captioning model fine-tuned from google/paligemma2-3b-pt-224. It substantially outperforms our baseline model, MagiBoss/Blip2-Typhoon1.5-COCO, on every reported captioning metric for Thai.
Comparative Analysis
A comprehensive evaluation was conducted against MagiBoss/Blip2-Typhoon1.5-COCO under identical test conditions and metrics. Both models were evaluated with max_new_tokens=64 to ensure a fair comparison.
Key Innovations
- Native Thai language support optimized for cultural context
- Enhanced performance metrics across all evaluation criteria
- Efficient 4-bit quantization for practical deployment
- Comprehensive COCO dataset coverage with Thai annotations
Technical Specifications
Model Architecture
- Base Model: google/paligemma2-3b-pt-224
- Fine-tuning Approach: Parameter-Efficient Fine-Tuning (PEFT)
- Quantization: 4-bit precision with double quantization
- Token Generation: max_new_tokens=64
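The card does not publish the exact PEFT configuration, but a minimal sketch of how the 4-bit quantized base model and a LoRA adapter might be combined is shown below. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values taken from the actual training run.

```python
import torch
from transformers import PaliGemmaForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# Illustrative LoRA hyperparameters (assumed; not specified in this card)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```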
Training Dataset
- Source: MagiBoss/COCO-Image-Captioning
- Annotations: Thai-translated captions
- Training Set Size: 142,291 images
- Validation Set Size: 9,036 images
- Data Processing: Manual quality assurance for Thai translations
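A minimal sketch for loading the training data with the `datasets` library; the split and column names below are assumptions and should be checked against the dataset card.

```python
from datasets import load_dataset

# Split names ("train"/"validation") are assumed; verify them on the dataset page
dataset = load_dataset("MagiBoss/COCO-Image-Captioning")
print(dataset)

# Each record is expected to pair a COCO image with its Thai-translated caption
example = dataset["train"][0]
print(example.keys())
```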
Training Configuration
- Learning Rate: 1e-5
- Train Batch Size: 32
- Eval Batch Size: 16
- Seed: 42
- Optimizer: AdamW (betas=(0.9, 0.98), epsilon=1e-8)
- LR Scheduler: Linear, with warm-up ratio 0.1
- Training Duration: 3 epochs
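The hyperparameters above map onto a Hugging Face `TrainingArguments` object roughly as follows; the output directory, precision flag, and evaluation cadence are assumptions not stated in the card.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; other settings are illustrative
training_args = TrainingArguments(
    output_dir="paligemma2-3b-224-coco-th",   # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    num_train_epochs=3,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    bf16=True,                  # assumed, to match the bfloat16 compute dtype
    eval_strategy="epoch",      # assumed, matching the per-epoch validation losses below
)
```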
Training Progress
| Training Loss | Epoch | Step  | Validation Loss |
|---------------|-------|-------|-----------------|
| 1.2182        | 1.0   | 4447  | 1.1568          |
| 1.1318        | 2.0   | 8894  | 1.0910          |
| 1.0837        | 3.0   | 13341 | 1.0724          |
Performance Evaluation
Evaluation Methodology
- BERTScore Implementation: bert-base-multilingual-cased model
- Thai Text Tokenization: newmm tokenizer
  - Selected for accurate Thai word segmentation
  - Optimized for evaluation metric calculation
- Token Length: Standardized at 64 tokens maximum
- Evaluation Environment: Controlled testing setup
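The exact scoring pipeline is not published, but a rough sketch of the newmm-based tokenization and metric computation, assuming `pythainlp` and `evaluate` are installed, could look like this (the prediction and reference lists are placeholders):

```python
from pythainlp.tokenize import word_tokenize
import evaluate

# Word-segment Thai text with the newmm tokenizer before computing n-gram metrics
def thai_tokenize(text):
    return " ".join(word_tokenize(text, engine="newmm"))

predictions = ["..."]   # model-generated Thai captions (placeholders)
references  = ["..."]   # ground-truth Thai captions (placeholders)

tok_preds = [thai_tokenize(p) for p in predictions]
tok_refs  = [[thai_tokenize(r)] for r in references]

bleu = evaluate.load("bleu")
print(bleu.compute(predictions=tok_preds, references=tok_refs))

# BERTScore over the raw text with the multilingual encoder named above
bertscore = evaluate.load("bertscore")
print(bertscore.compute(
    predictions=predictions,
    references=references,
    model_type="bert-base-multilingual-cased",
))
```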
Comparative Metrics
BLEU Score Analysis
| Model                  | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | BLEU (Cumulative) |
|------------------------|--------|--------|--------|--------|-------------------|
| Blip2-Typhoon1.5-COCO  | 7.49   | 1.48   | 0.26   | 0.05   | 0.62              |
| PaliGemma2-3b-224-COCO | 24.42  | 11.03  | 5.06   | 2.44   | 7.59              |
Key Achievement: ~3.3x improvement in BLEU-1 score, indicating substantially higher unigram overlap with the reference captions.
ROUGE Score Performance
| Model                  | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|------------------------|---------|---------|---------|------------|
| Blip2-Typhoon1.5-COCO  | 1.28    | 0.02    | 1.27    | 1.27       |
| PaliGemma2-3b-224-COCO | 3.17    | 0.04    | 3.17    | 3.16       |
Notable Improvement: 2.5x enhancement in ROUGE-1 score, showing better content matching.
BERTScore Evaluation
| Model                  | BERTScore (F1) |
|------------------------|----------------|
| Blip2-Typhoon1.5-COCO  | 66.53          |
| PaliGemma2-3b-224-COCO | 75.78          |
Significant Gain: 9.25 point improvement in BERTScore, computed with the bert-base-multilingual-cased model.
METEOR Score Comparison
| Model                  | METEOR |
|------------------------|--------|
| Blip2-Typhoon1.5-COCO  | 8.97   |
| PaliGemma2-3b-224-COCO | 37.70  |
Outstanding Result: 4.2x improvement in METEOR score, demonstrating superior semantic accuracy.
Implementation Guide
```python
import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration, BitsAndBytesConfig
from PIL import Image

# Optimized quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize model components
processor = PaliGemmaProcessor.from_pretrained("MagiBoss/PaliGemma2-3b-224-COCO")
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "MagiBoss/PaliGemma2-3b-224-COCO",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

def generate_caption(image_path):
    """
    Generate a Thai caption for an input image.

    Args:
        image_path (str): Path to the input image

    Returns:
        str: Generated Thai caption
    """
    image = Image.open(image_path).convert("RGB")
    # Prompt: "Describe the image below with a clear and detailed caption in Thai"
    question = "อธิบายภาพด้านล่างนี้ด้วยคำบรรยายที่ชัดเจนและละเอียดเป็นภาษาไทย"
    prompt = f"<image> {question}"

    model_inputs = processor(
        text=[prompt],
        images=[image],
        padding="max_length",
        return_tensors="pt"
    ).to(device)
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            input_ids=model_inputs["input_ids"],
            pixel_values=model_inputs["pixel_values"],
            attention_mask=model_inputs["attention_mask"],
            max_new_tokens=64,  # Matches the evaluation setting
            do_sample=False
        )

    # Strip the prompt tokens so only the generated caption is returned
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

# Example usage
image_path = "example.jpg"
caption = generate_caption(image_path)
print("Generated Caption:", caption)
```
Environment Requirements
- PEFT 0.14.0
- Transformers 4.49.0.dev0
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.21.0
Current Limitations
- Complex cultural context interpretation
- Handling of Thai-specific idiomatic expressions
- Regional dialect variations
- Specialized domain vocabulary
Future Development Roadmap
Dataset Expansion
- Integration of region-specific Thai content
- Domain-specific terminology enhancement
- Cultural context enrichment
Model Improvements
- Attention mechanism optimization
- Context understanding enhancement
- Inference speed optimization
Evaluation Methods
- Thai-specific metrics development
- Cultural accuracy assessment
- Regional relevance validation
Acknowledgements
Special thanks to:
- Google for the PaliGemma2-3b-pt-224 base model
- The open-source community for development tools
- Contributors to the Thai translation effort
Citation
```bibtex
@misc{PaliGemma2-3b-224-COCO,
  author    = {MagiBoss},
  title     = {PaliGemma2-3b-224-COCO},
  year      = {2025},
  publisher = {Hugging Face},
  note      = {https://huggingface.co/MagiBoss/PaliGemma2-3b-224-COCO}
}
```