---
license: mit
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- Metric-AI/rag_docmatix_100k
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
- Metric-AI/tabfquad_train_set
language:
- en
- fr
- es
- it
- de
base_model:
- Metric-AI/ColQwenStella-base-2b
- Qwen/Qwen2-VL-2B
- NovaSearch/stella_en_1.5B_v5
tags:
- vidore
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
library_name: peft
pipeline_tag: visual-document-retrieval
---
# ColQwenStella-2b-multilingual: Multilingual Visual Retriever combining the Qwen2 vision component with the stella_en_1.5B_v5 embedding model

## Ranked #1 among models <= 2B parameters and #8 overall on the Vidore benchmark (as of February 11, 2025). The reported scores on the [Vidore Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard) correspond to checkpoint-1800.

### This is the base version trained on 4xA100 80GB with per_device_batch_size=128 for 5 epochs.

The ColQwenStella-2b-multilingual architecture combines the Vision component of the Qwen2 model with stella_en_1.5B_v5 as its embedding model. Training is done following the [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) recipe.
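For reference, the ColPali recipe scores a query against a page with ColBERT-style late interaction (MaxSim) over the multi-vector embeddings. The sketch below is illustrative only; in practice the released processor's `score_multi_vector` (shown in the Usage section) handles this for batches.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one page.

    query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim);
    token embeddings are assumed to be L2-normalized.
    """
    sim = query_emb @ page_emb.T           # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()     # best patch per query token, summed
```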


## Data
- **Synthetic data**: Selected and preprocessed from the `openbmb/VisRAG-Ret-Train-Synthetic-data` dataset.  
- **In-domain VQA dataset**: Drawn from `openbmb/VisRAG-Ret-Train-In-domain-data`.  
- **Docmatix dataset**: Extracted from the `Metric-AI/rag_docmatix_100k` dataset.  
- **Colpali dataset**: Taken from `vidore/colpali_train_set`.
- **Multilingual dataset**: Taken from `llamaindex/vdr-multilingual-train`.


## Model Training

### Parameters
We train the model using low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685))
with `alpha=128` and `r=128` on the transformer layers of the language model, the `mlp` layers of `vision_model.merger`,
and the final randomly initialized projection layer, with the `adamw` optimizer.
Training runs on a 4xA100 GPU setup with distributed data parallelism (via `accelerate`), a learning rate of 5e-4 with cosine decay and 100 warmup steps, a per-device batch size of 128, and `bfloat16` precision.
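A rough `peft` configuration matching the parameters above might look like the sketch below; the `target_modules` list is an assumption, since the exact module names depend on the custom ColQwenStella implementation.

```python
from peft import LoraConfig

# Illustrative sketch of the adapter settings described above.
# target_modules is an assumption; actual names depend on the model code.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="FEATURE_EXTRACTION",
)
```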

## Installation

```bash
pip install transformers>=4.46.3
```

## Usage

```python
import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
        "Metric-AI/ColQwenStella-2b-multilingual",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",  # or "mps" if on Apple Silicon
        trust_remote_code=True
    ).eval()
processor = AutoProcessor.from_pretrained("Metric-AI/ColQwenStella-2b-multilingual", trust_remote_code=True)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
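`score_multi_vector` returns a matrix of late-interaction scores with shape `(num_queries, num_images)`. For example, the best-matching image for each query can be picked as follows (illustrative):

```python
# scores: (num_queries, num_images); higher means more relevant.
best_image_per_query = scores.argmax(dim=1)
print(best_image_per_query.tolist())
```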

## License

The adapters attached to the model are released under the MIT license.


- **Developed by:** [Metric AI Research Lab](https://metric.am/)