Model Card for PaliGemma Fine-Tuned Model

This model is a fine-tuned version of Google’s PaliGemma-3B (google/paligemma-3b-pt-224) for vision-language tasks, particularly image-based question answering and multimodal reasoning. It was fine-tuned with Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA and QLoRA, which reduce the compute and memory cost of fine-tuning while preserving the base model’s performance.
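
For reference, the snippet below is a minimal sketch of how a QLoRA setup of this kind is typically assembled with transformers and peft. The rank, alpha, and target modules shown are illustrative assumptions, not the exact hyperparameters used to train this checkpoint.

# Illustrative QLoRA setup (hyperparameters are assumptions, not this checkpoint's values)
import torch
from transformers import PaliGemmaForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                    # illustrative rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only the LoRA adapter weights are trainable

Only the low-rank adapter weights are updated during training, which is what keeps the memory footprint small compared with full fine-tuning of the 3B-parameter base model.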

Model Details

Model Description

  • Developed by: Taha Majlesi
  • Funded by: [More Information Needed]
  • Model Type: Vision-Language Model (VLM)
  • Language(s): English
  • License: MIT
  • Finetuned from model: google/paligemma-3b-pt-224

Model Sources

  • Repository: https://huggingface.co/tahamajs/plamma
  • Paper (if available): [More Information Needed]
  • Demo: [More Information Needed]

Uses

Direct Use

  • Visual Question Answering (VQA)
  • Multimodal reasoning on image-text pairs
  • Image captioning with contextual understanding

Downstream Use

  • Custom fine-tuning for domain-specific multimodal datasets
  • Integration into AI assistants for visual understanding
  • Enhancements in image-text search systems

Out-of-Scope Use

  • This model is not designed for pure NLP tasks without visual inputs.
  • The model may not perform well on low-resource languages.
  • Not intended for real-time inference on edge devices due to model size constraints.

Bias, Risks, and Limitations

  • Bias: The model may reflect biases present in the training data, especially in image-text relationships.
  • Limitations: Performance may degrade on unseen, highly abstract, or domain-specific images.
  • Risks: Misinterpretation of ambiguous images and hallucination of non-existent details.

Recommendations

  • Use dataset-specific fine-tuning to mitigate biases.
  • Evaluate performance on diverse benchmarks before deployment.
  • Implement human-in-the-loop validation in sensitive applications.

How to Get Started with the Model

To use the fine-tuned model, install the required libraries:

pip install transformers peft accelerate bitsandbytes
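
Once the dependencies are installed, the adapter can be loaded on top of the base model. The following is a minimal usage sketch: it assumes the adapter weights for this model are published as tahamajs/plamma, and the image path and question are placeholders.

# Minimal inference sketch (adapter id, image path, and prompt are placeholders/assumptions)
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel

base_id = "google/paligemma-3b-pt-224"
adapter_id = "tahamajs/plamma"              # assumed adapter repository for this model

processor = AutoProcessor.from_pretrained(base_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)   # attach the fine-tuned LoRA adapter
model.eval()

# Visual question answering on a local image (placeholder path)
image = Image.open("example.jpg")
prompt = "answer en What is shown in the image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.decode(output[0][input_len:], skip_special_tokens=True))

If the adapter has been merged into the base weights instead of being kept as a separate PEFT adapter, it can be loaded directly with PaliGemmaForConditionalGeneration.from_pretrained and the PeftModel step can be skipped.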