
Mel Massadian

melmass

AI & ML interests

Building tools on top of Generative AI & LLM models

Recent Activity

liked a model 4 days ago
Alpha-VLLM/Lumina-Video-f24R960
liked a model 4 days ago
m-a-p/YuE-upsampler
liked a model 5 days ago
ZhengPeng7/BiRefNet_HR

Organizations

MLX Community

melmass's activity

reacted to merve's post with 🔥 about 1 month ago
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with an MIT license 💗 ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of vision-language understanding and visual referrals (referring segmentation) for both images and videos ⏯️

> The models come in 1B, 4B, and 8B sizes, and are based on InternVL2.5 for the base architecture with Qwen2, Qwen2.5, or InternLM2 for the language model part (depending on the checkpoint)

> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image, and video), then it concatenates their outputs and feeds them into the LLM 💬

the output segmentation tokens are then passed to SAM2, to match text (captions or semantic classes) to masks ⤵️

> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and use different levels of description to provide consistency.
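The multi-encoder flow described above can be sketched schematically. This is a minimal illustration of the data flow only, not the actual SA2VA implementation: all shapes, token counts, and the shared embedding width are invented for the example, and the encoders/LLM/SAM2 stages are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding width (assumed for illustration)

def encode(num_tokens: int) -> np.ndarray:
    """Stand-in for a modality-specific encoder producing token embeddings."""
    return rng.standard_normal((num_tokens, D))

# One encoder per modality (token counts are made up):
visual_prompt = encode(4)
text_prompt = encode(8)
image_tokens = encode(256)
video_tokens = encode(512)

# The per-modality embeddings are concatenated along the sequence axis
# into a single token stream for the LLM.
llm_input = np.concatenate(
    [visual_prompt, text_prompt, image_tokens, video_tokens], axis=0
)
print(llm_input.shape)  # (780, 64)

# The LLM's output segmentation tokens would then condition SAM2,
# matching text (captions or semantic classes) to masks.
seg_tokens = llm_input[-2:]  # placeholder for "[SEG]" token embeddings
```

The point is only that each modality gets its own encoder into a common embedding space, so the LLM sees one concatenated sequence regardless of which inputs are present.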