SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published 2 days ago • 10
Scaling Pre-training to One Hundred Billion Data for Vision Language Models Paper • 2502.07617 • Published 3 days ago • 22
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents Paper • 2502.04223 • Published 8 days ago • 9
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Paper • 2502.06788 • Published 4 days ago • 11
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering Paper • 2502.03628 • Published 9 days ago • 11
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation Paper • 2502.05415 • Published 6 days ago • 16
VideoRoPE: What Makes for Good Video Rotary Position Embedding? Paper • 2502.05173 • Published 7 days ago • 60
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Paper • 2502.04328 • Published 8 days ago • 21
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking Paper • 2502.02339 • Published 10 days ago • 19
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning Paper • 2412.14164 • Published Dec 18, 2024 • 4
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models Paper • 2502.00698 • Published 12 days ago • 22
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding Paper • 2502.01341 • Published 11 days ago • 33
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding Paper • 2501.16411 • Published 18 days ago • 17
Qwen2.5-VL Collection Vision-language model series based on Qwen2.5 • 3 items • Updated 18 days ago • 340
SmolVLM 256M & 500M Collection Collection for models & demos for even smoller SmolVLM release • 12 items • Updated 22 days ago • 68
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents Paper • 2501.11858 • Published 24 days ago • 5
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published 22 days ago • 24