EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Abstract
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promise of unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs that use pre-trained vision encoders, discrete tokenizers, and minimalist visual layers trained from scratch, and closely examine the under-explored characteristics of encoder-free VLMs. We develop efficient strategies that enable encoder-free VLMs to rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities; (ii) a well-designed training strategy enables effective optimization of encoder-free VLMs. Through extensive evaluation, EVEv2.0 represents a thorough study toward developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
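To make finding (i) concrete, here is a minimal PyTorch sketch of one way to decompose modalities inside a decoder-only block: vision and text tokens share a single self-attention path but are routed through modality-specific normalization and feed-forward layers. Everything below (the `ModalityDecoupledBlock` class, the `is_vision` routing mask, and the hyperparameters) is an illustrative assumption for exposition, not the released EVEv2.0 implementation.

```python
# Sketch: a decoder block with shared attention but modality-specific
# norms and FFNs, so vision tokens adapt without disturbing text statistics.
import torch
import torch.nn as nn


class ModalityDecoupledBlock(nn.Module):
    def __init__(self, dim: int = 1024, n_heads: int = 16, ffn_mult: int = 4):
        super().__init__()
        # Shared self-attention lets vision and text tokens interact globally
        # (pass a causal mask via attn_mask for autoregressive decoding).
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Modality-specific norms and FFNs: index 0 = text, index 1 = vision.
        self.norm1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                          nn.Linear(ffn_mult * dim, dim))
            for _ in range(2)
        ])

    def _apply_per_modality(self, modules, x, modality_id):
        # Route each token through the module belonging to its own modality.
        out = torch.empty_like(x)
        for m_id, module in enumerate(modules):
            mask = (modality_id == m_id)
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, is_vision, attn_mask=None):
        # x: (B, L, D); is_vision: (B, L) tensor marking vision tokens with 1.
        m_id = is_vision.long()
        h = self._apply_per_modality(self.norm1, x, m_id)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        h = self._apply_per_modality(self.norm2, x, m_id)
        return x + self._apply_per_modality(self.ffn, h, m_id)
```

Sharing the attention path preserves full cross-modal interaction, while the per-modality norms and FFNs let visual representations evolve without overwriting the pre-trained language pathway, which is the interference-reduction effect the abstract refers to.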
Community
Highlights:
- Superior Capability: A from-scratch encoder-free LVLM with a minimalist patch embedding layer and support for arbitrary image aspect ratios, steadily approaching several modular encoder-based LVLMs (see the sketch after this list).
- Data Efficiency: Pre-training uses only 92M publicly available samples filtered from OpenImages, SAM, LAION, and Datacomp; supervised fine-tuning of EVE-7B-HD-v2.0 uses 7.3M SFT samples from Infinity-MM and LLaVA-OneVision.
- Pioneering Route: We aim to provide an efficient, transparent, and practical training strategy and procedure for developing a pure decoder-only architecture across modalities.
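To illustrate the "minimalist patch embedding layer" and "arbitrary image aspect ratio" points above, the sketch below patchifies an image at its native resolution with a single strided convolution and appends a learnable end-of-row token so the 2D layout survives flattening into a 1D sequence. The class `MinimalPatchEmbed`, the `row_end` parameter, and the patch size are hypothetical choices for illustration, not the released EVE code.

```python
# Sketch: patch embedding that preserves the input aspect ratio
# (no square resize) and marks row boundaries for the LLM.
import torch
import torch.nn as nn


class MinimalPatchEmbed(nn.Module):
    def __init__(self, dim: int = 1024, patch: int = 14):
        super().__init__()
        # One strided convolution turns non-overlapping patches into tokens.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learnable marker appended after each row of patches so the model can
        # recover the 2D layout from the flattened token sequence.
        self.row_end = nn.Parameter(torch.zeros(dim))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); H and W are assumed to be multiples of `patch`,
        # but no fixed square resolution is required.
        feat = self.proj(image)                              # (B, D, H/p, W/p)
        b, d, h, w = feat.shape
        feat = feat.permute(0, 2, 3, 1)                      # (B, h, w, D)
        marker = self.row_end.view(1, 1, 1, d).expand(b, h, 1, d)
        feat = torch.cat([feat, marker], dim=2)              # end-of-row tokens
        return feat.reshape(b, h * (w + 1), d)               # (B, h*(w+1), D)


# Example: a 448x224 input gives (448/14) * (224/14 + 1) = 32 * 17 = 544 tokens.
tokens = MinimalPatchEmbed()(torch.randn(1, 3, 448, 224))
print(tokens.shape)  # torch.Size([1, 544, 1024])
```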
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts (2025)
- Optimizing Vision-Language Interactions Through Decoder-Only Models (2024)
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models (2024)
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders (2025)
- Return of the Encoder: Maximizing Parameter Efficiency for SLMs (2025)
- Unifying Specialized Visual Encoders for Video Language Models (2025)
- MBQ: Modality-Balanced Quantization for Large Vision-Language Models (2024)