Submitted by orrzohar 139 Apollo: An Exploration of Video Understanding in Large Multimodal Models · 12 authors 12
Submitted by wzk1015 35 SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding · 11 authors 4
Submitted by sahalshajim 26 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities · 11 authors 2
Submitted by MoonQiu 20 FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion · 8 authors 2
Submitted by jaywalnut310 19 Efficient Generative Modeling with Residual Vector Quantization-Based Tokens · 4 authors 2
Submitted by AnonMegumi 19 InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption · 9 authors 3
Submitted by yedid 11 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation · 7 authors 2
Submitted by MagicBag 11 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing · 5 authors 3
Submitted by hongjiewang 10 LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity · 13 authors 4
Submitted by ydalva 9 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers · 3 authors 2
Submitted by JackyZhuo 7 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation · 10 authors 4
Submitted by sarathismg 5 GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers · 6 authors 3
Submitted by SultanR 4 SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs · 1 author 2
Submitted by rzheng12 2 TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies · 8 authors 2
Submitted by moein99 1 Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images · 5 authors 2