Update README.md #3
by RaushanTurganbay (HF staff) - opened

README.md CHANGED
@@ -46,10 +46,48 @@ English
 The model is intended to be used in enterprise applications that involve processing visual and text data. In particular, the model is well-suited for a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Additionally, its capabilities extend to general image understanding, enabling it to be applied to a broader range of business applications. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.
 
 
-
-This is a simple example of how to use the granite-vision-3.1-2b-preview model.
-
-
+## Generation:
+
+The Granite Vision model is supported natively in `transformers>=4.48`. Below is a simple example of how to use the `granite-vision-3.1-2b-preview` model.
+
+### Usage with `transformers`
+
+```python
+from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
+
+model_path = "ibm-granite/granite-vision-3.1-2b-preview"
+processor = LlavaNextProcessor.from_pretrained(model_path)
+model = LlavaNextForConditionalGeneration.from_pretrained(model_path, device_map="cuda:0")
+
+# prepare image and text prompt, using the appropriate prompt template
+url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
+
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": url},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to("cuda:0")
+
+
+# autoregressively complete prompt
+output = model.generate(**inputs, max_new_tokens=100)
+print(processor.decode(output[0], skip_special_tokens=True))
+```
+
+### Usage with vLLM
+
+The model can also be loaded with `vLLM`. First make sure to install the following libraries:
 
 ```shell
 pip install torch torchvision torchaudio
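
The hunk's context window ends here, so the PR's actual vLLM example is not visible (note that `vllm` itself would also need to be installed, beyond the `torch` packages shown). As a rough illustration only, below is a minimal sketch of offline multimodal inference, assuming vLLM's standard `LLM.generate` API with `multi_modal_data`; the image URL is reused from the `transformers` example above, and the sampling settings are illustrative rather than taken from the PR:

```python
# Hypothetical sketch, not part of this PR's diff: offline inference with vLLM.
# Assumes vLLM's multimodal prompt format ({"prompt": ..., "multi_modal_data": ...}).
import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "ibm-granite/granite-vision-3.1-2b-preview"

# Render the model's own chat template to a plain prompt string,
# rather than hardcoding the prompt format.
processor = AutoProcessor.from_pretrained(model_path)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)

# Image reused from the transformers example above.
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

llm = LLM(model=model_path)
sampling_params = SamplingParams(max_tokens=100)  # illustrative settings
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

The `transformers` path above is the simplest way to try the model; vLLM is typically preferable when throughput matters, since it adds continuous batching and paged attention.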