Vision-language model series based on Qwen2
Engage in multimodal conversations about images and videos