---
license: apache-2.0
datasets:
- damerajee/clean_vqa_prt2
language:
- en
library_name: transformers
tags:
- 'small vision language model '
---

# GPT-Vision-1-ft

This model is a fine-tuned version of [GPT-Vision-1](https://huggingface.co/damerajee/GPT-Vision). You can find out more about this model in the [GitHub repository](https://github.com/dame-cell/GPT-Vision-1).

# Training Details

* The model was trained for 1 epoch on 200k rows
* Fine-tuning took about 2.5 hours

# Inference

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code=True is required because the model ships its own modeling code
model = AutoModelForCausalLM.from_pretrained("damerajee/GPTVision-1-ft", trust_remote_code=True)

image_path = "Your_image_path"
image = Image.open(image_path)
image = image.convert('RGB')

# The model generates better with these parameters
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.6,
    "repetition_penalty": 1.6,
}

question = "your_question"
answer = model.generate(image=image, question=question, max_new_tokens=80, **gen_kwargs)
print(answer)
```

# Examples

| Image | Question | Response |
|-------|----------|----------|
| barbie | Why is this dessert appealing to someone with a sweet tooth? | The cake is a must-try and the person with it may be enjoying their dessert.It might also serve as an appetizer, offering options for those who prefer to try more than just one of these options. This could include food or drinks they are not having yet consumed before. A plate can provide them from each side without additional options: some people would rather eat this dessert |
| pc | What factors contribute to the player's success in this particular shot? | 1. Positioning and positioning: The ball is placed in a position that allows for the player to hit it with both hands, making contact on either side of his body or face down at an angle that requires them not only to reach the tennis court but also needs their feet to touch him. This can lead into different directions. For example if they are playing close-up during practice |

# Limitations

Despite fine-tuning, the model struggles to generate coherent text for more complex or nuanced queries. It performs well with generic questions like 'Describe the scenery in this image' or 'What is this player doing in this image?', but it falls short when asked about deeper or more intricate aspects, which is a bummer, but I tried.
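
Since simple, generic questions work best, it can help to try a few of them on the same image and compare the answers. The sketch below wraps the inference call from this card in a small helper. `build_gen_kwargs` and `ask_many` are hypothetical names introduced here for illustration, not part of the model's code, and the helper assumes `model.generate` accepts `image=`, `question=`, `max_new_tokens=` and the sampling arguments exactly as in the inference snippet.

```python
from typing import Any, Dict, List

# Sampling defaults copied from the inference snippet in this card.
DEFAULT_GEN_KWARGS: Dict[str, Any] = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.6,
    "repetition_penalty": 1.6,
}


def build_gen_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Return a copy of the card's sampling defaults with overrides applied."""
    merged = dict(DEFAULT_GEN_KWARGS)
    merged.update(overrides)
    return merged


def ask_many(
    model: Any,
    image: Any,
    questions: List[str],
    max_new_tokens: int = 80,
    **overrides: Any,
) -> Dict[str, str]:
    """Ask several questions about one image and return {question: answer}.

    Assumption: model.generate takes image=, question=, max_new_tokens= and
    sampling kwargs, as shown in the inference snippet of this card.
    """
    gen_kwargs = build_gen_kwargs(**overrides)
    answers: Dict[str, str] = {}
    for question in questions:
        answers[question] = model.generate(
            image=image,
            question=question,
            max_new_tokens=max_new_tokens,
            **gen_kwargs,
        )
    return answers
```

With `model` and `image` loaded as in the inference snippet, a call such as `ask_many(model, image, ["Describe the scenery in this image", "What is this player doing in this image?"])` returns one answer per question. Passing, for example, `temperature=0.5` overrides a single default for that call only.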