Happy to announce AlignVLM, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs)!
Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)
What's the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance.
Our Solution: the ALIGN Connector
We propose AlignVLM, a method that maps vision features to a weighted average of LLM text embeddings, ensuring they remain in a space the LLM can effectively interpret.
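The core idea of mapping a vision feature to a weighted average of the LLM's text embeddings can be sketched as follows. This is a minimal PyTorch illustration, not the paper's implementation: the class name `AlignConnector`, the single linear projection, and the plain softmax weighting are all assumptions for clarity.

```python
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    """Hypothetical sketch of the ALIGN idea: express each vision feature
    as a convex combination of the (frozen) LLM vocabulary embeddings,
    so the connector's output always lies in the LLM's embedding space."""

    def __init__(self, vision_dim: int, llm_embed: torch.Tensor):
        # llm_embed: (V, d) matrix of the LLM's text-embedding vectors
        super().__init__()
        d = llm_embed.shape[1]
        self.proj = nn.Linear(vision_dim, d)      # project vision feature to LLM dim
        self.register_buffer("embed", llm_embed)  # frozen vocabulary embeddings

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (..., vision_dim) vision-encoder features
        logits = self.proj(v) @ self.embed.T      # similarity to each vocab embedding
        weights = logits.softmax(dim=-1)          # (..., V) non-negative, sums to 1
        return weights @ self.embed               # weighted average of text embeddings
```

Because the output is a convex combination of real text embeddings, it can never leave the region of the embedding space the LLM was trained on, which is one intuition for the noise robustness reported below.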
How does it perform?
We compared ALIGN against common connectors such as MLPs, the Perceiver Resampler, and Ovis, all trained under similar configurations. The result? ALIGN outperforms them all on diverse document-understanding tasks.
Meet the AlignVLM Model Family!
We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all base VLMs trained under similar configurations.
✅ Our models also perform competitively against instruct VLMs such as Qwen2-VL and InternVL-2.5.
What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
✅ ALIGN Connector: minimal drop (−1.67%), proving its high robustness!
❌ MLP Connector: severe degradation (−25.54%), struggling with noisy inputs.
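The robustness probe described above can be reproduced in spirit with a few lines. This is a sketch of the noise-injection step only, with the function name and defaults chosen here for illustration; the σ=3 value comes from the post.

```python
import torch

def add_gaussian_noise(features: torch.Tensor,
                       mu: float = 0.0,
                       sigma: float = 3.0) -> torch.Tensor:
    """Inject N(mu, sigma^2) noise into vision-encoder outputs before they
    reach the connector, mirroring the robustness test described above."""
    return features + torch.randn_like(features) * sigma + mu
```

One would then run the full benchmark on the noisy features and compare scores against the clean run to measure the performance drop for each connector.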
Code & model weights coming soon! Stay tuned!