SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published 2 days ago • 10
Scaling Pre-training to One Hundred Billion Data for Vision Language Models Paper • 2502.07617 • Published 3 days ago • 22
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents Paper • 2502.04223 • Published 8 days ago • 9
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Paper • 2502.06788 • Published 4 days ago • 11
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering Paper • 2502.03628 • Published 9 days ago • 11
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation Paper • 2502.05415 • Published 6 days ago • 16
VideoRoPE: What Makes for Good Video Rotary Position Embedding? Paper • 2502.05173 • Published 7 days ago • 60
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Paper • 2502.04328 • Published 8 days ago • 21
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking Paper • 2502.02339 • Published 10 days ago • 19