Ahmed Masry
PRO
ahmed-masry
59 followers · 4 following
https://ahmedmasryku.github.io/
Ahmed_Masry97
AI & ML interests
Multimodal Chart Understanding, Multimodal Document AI, Multimodal Vision-Language Models
Recent Activity
posted an update 1 day ago
Happy to announce AlignVLM, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs).

Read the paper: https://huggingface.co/papers/2502.01341

What's the challenge? Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance.

Our solution: the ALIGN connector. We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space the LLM can effectively interpret.

How does it perform? We compared ALIGN against common connectors such as MLPs, the Perceiver Resampler, and Ovis, trained under similar configurations. The result: ALIGN outperforms them all on diverse document understanding tasks.

Meet the AlignVLM model family! We trained Llama 3.1 (1B, 3B, 8B) with our connector and benchmarked the models against various baselines:
- AlignVLM surpasses all base VLMs trained under similar configurations.
- Our models also perform competitively against instruct VLMs such as Qwen2-VL and InternVL-2.5.

What about robustness to noise? We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
- ALIGN connector: minimal drop (−1.67%), demonstrating high robustness.
- MLP connector: severe degradation (−25.54%), struggling with noisy inputs.

Code and model weights coming soon. Stay tuned!
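The post describes the ALIGN connector only at a high level: vision features are mapped to a weighted average of the LLM's text embeddings. Below is a minimal sketch of that stated idea, assuming the weights come from a softmax over a learned vision-to-vocabulary projection; the class name, layer shapes, and the toy noise check are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignConnectorSketch(nn.Module):
    """Illustrative sketch (hypothetical name/shapes): project vision features
    to logits over the LLM vocabulary, then take a softmax-weighted average
    of the LLM's text embedding table, so outputs stay in a space the LLM
    can already interpret."""

    def __init__(self, vision_dim: int, llm_embed_weight: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = llm_embed_weight.shape
        # Learned map from vision features to per-token weights (logits).
        self.proj = nn.Linear(vision_dim, vocab_size)
        # Frozen copy of the LLM's input embedding table (vocab_size x llm_dim).
        self.register_buffer("text_embeds", llm_embed_weight.detach())

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        weights = F.softmax(self.proj(vision_feats), dim=-1)  # convex weights over vocab
        # Weighted average of text embeddings -> (batch, num_patches, llm_dim)
        return weights @ self.text_embeds


# Toy usage mirroring the post's noise experiment (all shapes are assumptions).
if __name__ == "__main__":
    vocab, llm_dim, vision_dim = 32000, 4096, 1024
    embed_table = torch.randn(vocab, llm_dim)
    connector = AlignConnectorSketch(vision_dim, embed_table)

    feats = torch.randn(2, 196, vision_dim)            # clean vision features
    noisy = feats + torch.randn_like(feats) * 3.0      # Gaussian noise, mu=0, sigma=3
    print(connector(feats).shape, connector(noisy).shape)  # both (2, 196, 4096)
```

One plausible reading of the robustness result: because the softmax yields a convex combination, the connector's outputs always lie within the convex hull of the text embedding table, so even heavily perturbed vision features cannot be mapped outside the region the LLM was trained on.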
authored a paper 2 days ago
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
upvoted a paper 2 days ago
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
ahmed-masry's activity
liked 2 datasets about 1 month ago
MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
Viewer · Updated Jan 5 · 37M · 4.51k · 44
ServiceNow/BigDocs-Bench
Updated about 9 hours ago · 337 · 11
liked a dataset 4 months ago
lmms-lab/LLaVA-OneVision-Data
Viewer · Updated Oct 22, 2024 · 3.72M · 11k · 162
liked a Space 6 months ago
Run Chats Pi (Sleeping)