Ahmed Masry
PRO
ahmed-masry
59 followers · 4 following
https://ahmedmasryku.github.io/
Ahmed_Masry97
AI & ML interests
Multimodal Chart Understanding, Multimodal Document AI, Multimodal Vision-Language Models
Recent Activity
posted an update 1 day ago
Happy to announce AlignVLM, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs).

Read the paper: https://huggingface.co/papers/2502.01341

What's the challenge? Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance.

Our solution: the ALIGN connector. We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space the LLM can effectively interpret.

How does it perform? We compared ALIGN against common connectors such as MLPs, the Perceiver Resampler, and Ovis, trained under similar configurations. The result: ALIGN outperforms them all on diverse document understanding tasks.

Meet the AlignVLM model family! We trained Llama 3.1 (1B, 3B, 8B) with our connector and benchmarked the models against various baselines:
- AlignVLM surpasses all base VLMs trained under similar configurations.
- Our models also perform competitively against instruct VLMs such as Qwen2-VL and InternVL-2.5.

What about robustness to noise? We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
- ALIGN connector: minimal drop (−1.67%), demonstrating high robustness.
- MLP connector: severe degradation (−25.54%), struggling with noisy inputs.

Code and model weights coming soon. Stay tuned!
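The post describes the ALIGN connector only at a high level: vision features are mapped to a weighted average of the LLM's text embeddings. Below is a minimal sketch of that stated idea, assuming the weights come from a softmax over a learned vision-to-vocabulary projection; the class name, layer shapes, and the toy noise check are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignConnectorSketch(nn.Module):
    """Illustrative sketch (hypothetical name/shapes): project vision features
    to logits over the LLM vocabulary, then take a softmax-weighted average
    of the LLM's text embedding table, so outputs stay in a space the LLM
    can already interpret."""

    def __init__(self, vision_dim: int, llm_embed_weight: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = llm_embed_weight.shape
        # Learned map from vision features to per-token weights (logits).
        self.proj = nn.Linear(vision_dim, vocab_size)
        # Frozen copy of the LLM's input embedding table (vocab_size x llm_dim).
        self.register_buffer("text_embeds", llm_embed_weight.detach())

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        weights = F.softmax(self.proj(vision_feats), dim=-1)  # convex weights over vocab
        # Weighted average of text embeddings -> (batch, num_patches, llm_dim)
        return weights @ self.text_embeds


# Toy usage mirroring the post's noise experiment (all shapes are assumptions).
if __name__ == "__main__":
    vocab, llm_dim, vision_dim = 32000, 4096, 1024
    embed_table = torch.randn(vocab, llm_dim)
    connector = AlignConnectorSketch(vision_dim, embed_table)

    feats = torch.randn(2, 196, vision_dim)            # clean vision features
    noisy = feats + torch.randn_like(feats) * 3.0      # Gaussian noise, mu=0, sigma=3
    print(connector(feats).shape, connector(noisy).shape)  # both (2, 196, 4096)
```

One plausible reading of the robustness result: because the softmax yields a convex combination, the connector's outputs always lie within the convex hull of the text embedding table, so even heavily perturbed vision features cannot be mapped outside the region the LLM was trained on.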
authored a paper 2 days ago
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
upvoted a paper 2 days ago
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
ahmed-masry's activity
liked 2 datasets about 1 month ago
MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
Viewer · Updated Jan 5 · 37M · 4.51k · 44
ServiceNow/BigDocs-Bench
Updated about 9 hours ago · 337 · 11
liked a dataset 4 months ago
lmms-lab/LLaVA-OneVision-Data
Viewer · Updated Oct 22, 2024 · 3.72M · 11k · 162
liked a Space 6 months ago
Run Chats Pi (Sleeping)