---
library_name: transformers
license: apache-2.0
pipeline_tag: image-to-text
---

# rmfg

A tiny vision-language (image-to-text) model; the first release in a series of tiny VLMs.
<img src="https://i.pinimg.com/736x/7e/46/a6/7e46a6881623dfd3e1a2a5a2ae692374.jpg" width="300">



## Example

**Image**

<img src="https://media-cldnry.s-nbcnews.com/image/upload/t_fit-760w,f_auto,q_auto:best/rockcms/2023-12/231202-elon-musk-mjf-1715-fc0be2.jpg" width="300">

**Output**

> A man in a black cowboy hat and sunglasses stands in front of a white car, holding a microphone and speaking into it.

-----------------------------------------------------------------------------------

- Underfit; the model doesn't perform well yet.
- This marks the beginning of my tiny vision-language model series, with this model serving as a prelude to what's coming over the next few days.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "aloobun/rmfg"

# trust_remote_code is required: the model ships custom code that exposes
# encode_image() and answer_question().
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open('692374.jpg')  # path to a local image file
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
```
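
If the example image is not saved locally, one common pattern is to load it directly from the URL shown in the Example section above. A minimal sketch, assuming `requests` is installed and `model`/`tokenizer` are already loaded as in the snippet above:

```python
import requests
from PIL import Image

# Fetch the example image from the URL used in the Example section.
url = "https://media-cldnry.s-nbcnews.com/image/upload/t_fit-760w,f_auto,q_auto:best/rockcms/2023-12/231202-elon-musk-mjf-1715-fc0be2.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Same inference path as above: encode the image, then ask a question.
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
```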