---
license: apache-2.0
datasets:
- damerajee/clean_vqa_prt2
language:
- en
library_name: transformers
tags:
- 'small vision language model'
---
# GPT-Vision-1-ft
This model is a fine-tuned version of [GPT-Vision-1](https://huggingface.co/damerajee/GPT-Vision).
You can find out more about this model in the [GitHub repository](https://github.com/dame-cell/GPT-Vision-1).

# Training Details 
* The model was trained for 1 epoch on 200k rows
* Fine-tuning took about 2.5 hours

# Inference 
```python
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained("damerajee/GPTVision-1-ft", trust_remote_code=True)

# Replace with the path to your own image
image_path = "your_image_path"
image = Image.open(image_path).convert("RGB")

# The model generates better answers with these sampling parameters
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.6,
    "repetition_penalty": 1.6,
}

# Replace with your own question about the image
question = "your_question"
answer = model.generate(image=image, question=question, max_new_tokens=80, **gen_kwargs)
print(answer)
```
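
To get a feel for what the sampling parameters above do, here is a minimal, dependency-free sketch on a toy set of logits. It assumes the model's custom `generate` forwards `temperature`, `top_p`, and `repetition_penalty` with the usual Transformers semantics, which this card does not confirm. The helper names and the toy logits are hypothetical and are not part of the model's code.

```python
import math

def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the score of tokens that were already generated."""
    out = list(logits)
    for i in set(seen_ids):
        # Positive logits are divided and negative ones multiplied,
        # so a penalty > 1 always pushes the score down.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalise over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept_ids, cumulative = [], 0.0
    for i in order:
        kept_ids.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept_ids)
    return {i: probs[i] / total for i in kept_ids}

# Toy example: four candidate tokens, token 0 was already generated once.
toy_logits = [2.0, 1.0, 0.5, -1.0]
penalised = apply_repetition_penalty(toy_logits, seen_ids=[0], penalty=1.6)
probs = softmax(penalised, temperature=0.8)
kept = top_p_filter(probs, top_p=0.6)
print(kept)  # token id -> renormalised probability of the surviving candidates
```

With a fairly low `top_p` of 0.6, only the head of the distribution survives, while the high `repetition_penalty` of 1.6 discourages the model from looping on tokens it has already produced.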

# Examples 
| Image | Question | Response |
|-------|----------|----------|
| <img src="desert.jpg" alt="dessert" width="400"> | Why is this dessert appealing to someone with a sweet tooth? | The cake is a must-try and the person with it may be enjoying their dessert.It might also serve as an appetizer, offering options for those who prefer to try more than just one of these options. This could include food or drinks they are not having yet consumed before. A plate can provide them from each side without additional options: some people would rather eat this dessert |
| <img src="tennis.jpg" alt="tennis player" width="400"> | What factors contribute to the player's success in this particular shot? | 1. Positioning and positioning: The ball is placed in a position that allows for the player to hit it with both hands, making contact on either side of his body or face down at an angle that requires them not only to reach the tennis court but also needs their feet to touch him. This can lead into different directions. For example if they are playing close-up during practice |

# Limitations 
Despite fine-tuning, the model struggles to generate coherent text for more complex or nuanced queries. It performs well with generic questions such as 'Describe the scenery in this image' or 'What is this player doing in this image?', but it falls short when asked about deeper or more intricate aspects, which is a bummer, but I tried.