Submitted by akhaliq 126 MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training · 31 authors 12
Submitted by akhaliq 76 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking · 6 authors 7
Submitted by akhaliq 55 Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset · 3 authors 4
Submitted by akhaliq 26 GiT: Towards Generalist Vision Transformer through Universal Language Interface · 8 authors 11
Submitted by akhaliq 25 StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control · 4 authors 3
Submitted by akhaliq 21 BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences · 9 authors 2
Submitted by akhaliq 17 Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering · 7 authors 1
Submitted by akhaliq 15 Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring · 6 authors 3
Submitted by akhaliq 14 Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding · 10 authors 1
Submitted by akhaliq 9 VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding · 10 authors 1
Submitted by akhaliq 8 LocalMamba: Visual State Space Model with Windowed Selective Scan · 6 authors 1