---
library_name: transformers
tags: []
---

# SoM-LLaVA Model Card

LLaVA-v1.5 mix-trained with SoM-style data (QA + listing).

The model can understand tag-style visual prompts on an image (e.g., "What is the object tagged with id 9?"), and it also achieves improved performance on MLLM benchmarks (POPE, MME, SEED-Bench, MM-Vet, and LLaVA-Wild), even when the input test images carry no tags.
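
For example, a tag-grounded query can be issued with the same chat template used in the Getting Started script below. This is a minimal sketch rather than the official SoM pipeline: `som_tagged_image.jpg` is a placeholder for an image that already has numbered Set-of-Mark tags overlaid on it.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

# Placeholder file: an image with numbered SoM tags already drawn on it.
image = Image.open("som_tagged_image.jpg")

# Ask about a specific tagged region by its numeric id.
prompt = "USER: <image>\nWhat is the object tagged with id 9? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```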

**For more information about SoM-LLaVA, check our [GitHub page](https://github.com/zzxslp/SoM-LLaVA) and [paper](https://arxiv.org/abs/2404.16375)!**

## Getting Started

If you would like to load our model with Hugging Face Transformers, here is an example script:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
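
The 13B checkpoint is large, so in practice you will likely want to run it in half precision on a GPU. Below is a minimal sketch of the same pipeline, assuming a CUDA device with sufficient memory is available:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

# Load the weights in float16 on the GPU to reduce memory use and latency.
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_path)

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Cast floating-point inputs to float16 and move them to the GPU as well.
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```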

Our original model weights are available at [SoM-LLaVA-v1.5-13B](https://huggingface.co/zzxslp/som-llava-v1.5-13b), to be used with the [official LLaVA repo](https://github.com/haotian-liu/LLaVA).

## Citation

If you find our data or model useful for your research and applications, please cite our paper:

```
@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}
```
|