Update README.md #3
by RaushanTurganbay (HF staff) - opened

README.md CHANGED
@@ -46,10 +46,48 @@ English
 The model is intended to be used in enterprise applications that involve processing visual and text data. In particular, the model is well-suited for a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Additionally, its capabilities extend to general image understanding, enabling it to be applied to a broader range of business applications. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.
 
 
-
-This is a simple example of how to use the granite-vision-3.1-2b-preview model.
-
-
+## Generation:
+
+The Granite Vision model is supported natively in `transformers>=4.48`. Below is a simple example of how to use the `granite-vision-3.1-2b-preview` model.
+
+### Usage with `transformers`
+
+```python
+from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
+
+model_path = "ibm-granite/granite-vision-3.1-2b-preview"
+processor = LlavaNextProcessor.from_pretrained(model_path)
+model = LlavaNextForConditionalGeneration.from_pretrained(model_path, device_map="cuda:0")
+
+# prepare image and text prompt, using the appropriate prompt template
+url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
+
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": url},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to("cuda:0")
+
+
+# autoregressively complete prompt
+output = model.generate(**inputs, max_new_tokens=100)
+print(processor.decode(output[0], skip_special_tokens=True))
+```
+
+### Usage with vLLM
+
+The model can also be loaded with `vLLM`. First make sure to install the following libraries:
 
 ```shell
 pip install torch torchvision torchaudio
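
The hunk's context window ends here, so the PR's actual vLLM example is not visible (note that `vllm` itself would also need to be installed, beyond the `torch` packages shown). As a rough illustration only, below is a minimal sketch of offline multimodal inference, assuming vLLM's standard `LLM.generate` API with `multi_modal_data`; the image URL is reused from the `transformers` example above, and the sampling settings are illustrative rather than taken from the PR:

```python
# Hypothetical sketch, not part of this PR's diff: offline inference with vLLM.
# Assumes vLLM's multimodal prompt format ({"prompt": ..., "multi_modal_data": ...}).
import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "ibm-granite/granite-vision-3.1-2b-preview"

# Render the model's own chat template to a plain prompt string,
# rather than hardcoding the prompt format.
processor = AutoProcessor.from_pretrained(model_path)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)

# Image reused from the transformers example above.
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

llm = LLM(model=model_path)
sampling_params = SamplingParams(max_tokens=100)  # illustrative settings
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

The `transformers` path above is the simplest way to try the model; vLLM is typically preferable when throughput matters, since it adds continuous batching and paged attention.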