---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- mistralai/Pixtral-12B-2409
pipeline_tag: image-to-text
---
# Pixtral-12B-Captioner-Relaxed
## Introduction
Pixtral-12B-Captioner-Relaxed is an instruction-tuned version of [Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409), an advanced multimodal large language model. This fine-tuned version is based on a hand-curated dataset for text-to-image models, providing significantly more detailed descriptions of given images.
### Key Features:
* **Enhanced Detail:** Generates more comprehensive and nuanced image descriptions.
* **Relaxed Constraints:** Offers less restrictive image descriptions compared to the base model.
* **Natural Language Output:** Describes different subjects in the image while specifying their locations using natural language.
* **Optimized for Image Generation:** Produces captions in formats compatible with state-of-the-art text-to-image generation models.
**Note:** This fine-tuned model is optimized for creating text-to-image datasets. As a result, performance on other complex tasks may be lower compared to the original model.
## Requirements
The 12B model needs 24 GB of VRAM at half precision. The model can also be loaded with 8-bit or 4-bit quantization, but expect degraded performance.
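The 24 GB figure is consistent with a back-of-the-envelope estimate of weight storage alone. The sketch below assumes exactly 12 billion parameters (the real count may differ slightly) and ignores activations, the KV cache, and any quantization overhead, so treat the numbers as rough lower bounds rather than measured requirements. The helper name `estimate_weight_memory_gb` is made up for this illustration.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Assumption: "12B" is taken as exactly 12 billion parameters.
NUM_PARAMS = 12e9

for label, bits in [("half precision (16-bit)", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = estimate_weight_memory_gb(NUM_PARAMS, bits)
    print(f"{label}: about {gb:.0f} GB for weights")
```

Actual usage will be higher once the image tokens, generated tokens, and framework overhead are included.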
## Quickstart
```python
from PIL import Image
from transformers import LlavaForConditionalGeneration, AutoProcessor
from transformers import BitsAndBytesConfig
import torch
# example quantization config, add it to model load parameters to use 4bit quantization
quantization_config = BitsAndBytesConfig(
    # load_in_8bit=True,
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
model_id = "Ertugrul/Pixtral-12B-Captioner-Relaxed"
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
# for quantization just use this instead of previous load
# model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image.\n"},
            {
                "type": "image",
            }
        ],
    }
]
PROMPT = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open(r"PATH_TO_YOUR_IMAGE")
def resize_image(image, target_size=768):
    """Resize the image to have the target size on the shortest side."""
    width, height = image.size
    if width < height:
        new_width = target_size
        new_height = int(height * (new_width / width))
    else:
        new_height = target_size
        new_width = int(width * (new_height / height))
    return image.resize((new_width, new_height), Image.LANCZOS)
# you can try different resolutions or disable it completely
image = resize_image(image, 768)
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        generate_ids = model.generate(**inputs, max_new_tokens=384, do_sample=True, temperature=0.3, use_cache=True, top_k=20)
output_text = processor.batch_decode(generate_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]
print(output_text)
```
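Since the model is aimed at building text-to-image datasets, the Quickstart steps are often run over a whole folder. The following is a minimal sketch of that loop, not part of the model's own tooling. It assumes a hypothetical `caption_fn` callable that takes an image path and returns a caption (for example, a wrapper around the Quickstart code above), and it assumes the common convention of storing each caption in a `.txt` file next to its image. The name `caption_folder` and the suffix list are illustrative choices.

```python
from pathlib import Path
from typing import Callable

# Assumption: extend or trim this set to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_folder(image_dir: str, caption_fn: Callable[[Path], str], overwrite: bool = False) -> int:
    """Write one `<image stem>.txt` caption file next to each image.

    `caption_fn` is a hypothetical callable: image path in, caption string out.
    Returns the number of caption files written.
    """
    written = 0
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = path.with_suffix(".txt")
        if target.exists() and not overwrite:
            continue  # keep existing captions unless asked to overwrite
        target.write_text(caption_fn(path).strip() + "\n", encoding="utf-8")
        written += 1
    return written
```

Skipping existing caption files by default lets an interrupted run be resumed without regenerating finished captions. Note that two images sharing a stem (such as `a.png` and `a.jpg`) would map to the same caption file in this sketch.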
## Acknowledgements
For more detailed options, refer to the [Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409) or [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b) documentation.
You can also try [Qwen2-VL-7B-Captioner-Relaxed](https://huggingface.co/Ertugrul/Qwen2-VL-7B-Captioner-Relaxed) as a smaller alternative. It was trained in a similar manner.