RAHUL YASHWANTKUMAR GUPTA

ryg81

AI & ML interests

None yet

Recent Activity

reacted to ahmed-masry's post with 👍 about 23 hours ago
updated a collection 2 days ago
Other Models

Organizations

None yet

ryg81's activity

reacted to ahmed-masry's post with 👍 about 23 hours ago
Happy to announce AlignVLM 📏 – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What's the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅
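
As a rough illustration of that idea, here is a minimal PyTorch sketch (not the authors' implementation; the class name, the single linear projection to vocabulary logits, and all sizes are assumptions for illustration): each vision feature is softmax-mapped to convex weights over the LLM vocabulary, and the output is the corresponding weighted average of the LLM's text-embedding rows, so it lands inside the LLM's embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignConnectorSketch(nn.Module):
    """Illustrative sketch (not the paper's code): project vision features to
    vocabulary logits, softmax them, and take the resulting weighted average
    of the LLM's text-embedding rows."""

    def __init__(self, vision_dim: int, llm_embed: torch.Tensor):
        super().__init__()
        vocab_size, _ = llm_embed.shape
        # Frozen LLM input-embedding matrix, shape (vocab_size, llm_dim).
        self.register_buffer("llm_embed", llm_embed)
        # Assumed single linear projection from vision features to vocab logits.
        self.to_vocab = nn.Linear(vision_dim, vocab_size)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        weights = F.softmax(self.to_vocab(vision_feats), dim=-1)  # convex weights over vocab
        return weights @ self.llm_embed  # (batch, num_patches, llm_dim)

# Toy usage with stand-in sizes: 256-dim vision features into a 512-dim LLM space.
connector = AlignConnectorSketch(vision_dim=256, llm_embed=torch.randn(1000, 512))
out = connector(torch.randn(2, 32, 256))  # -> (2, 32, 512)
```

Because the softmax weights are non-negative and sum to one, the output is always a convex combination of real text embeddings, which is one plausible reading of why such a mapping stays interpretable to the LLM.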

🔬 How does it perform?
We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector (see the sketch after these results):
✅ ALIGN Connector: Minimal drop (↓1.67%) – proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.

Code & model weights coming soon! Stay tuned! 🔥
replied to Jaward's post 1 day ago

Looks good, but will they release it for us to use, or are they just showing their development and teasing us? :)

New activity in lmstudio-community/MiniCPM-o-2_6-GGUF 3 days ago

getting error

#1 opened 7 days ago by ryg81
reacted to fuzzy-mittenz's post with 🔥 5 days ago
Not many seemed to notice, but what was probably meant to be a WIN for artists' rights from the US Copyright Office has solved some fundamental issues for the community.
In our recent article I outline how companies like Suno, OpenAI, and Midjourney can no longer claim any right to copy the work you create with their platforms.
We also look at other ways this study and the new rules for AI will fundamentally affect creators who use it, and how companies' incentives to give them control over certain aspects might change because of this. It's broken down pretty well here: https://huggingface.co/blog/fuzzy-mittenz/copyright-in-ai

Can something similar apply to image generation models? (I am not a programmer or an AI expert.)

replied to m-ric's post 12 days ago
view reply

Looking for some tutorials on this topic.

reacted to AdinaY's post with ❤️ 15 days ago
What happened yesterday in the Chinese AI community? 🚀

T2A-01-HD 👉 https://hailuo.ai/audio
MiniMax's Text-to-Audio model, now in Hailuo AI, offers 300+ voices in 17+ languages and instant emotional voice cloning.

Trae 👉 https://www.trae.ai/
A new coding tool by ByteDance for professional developers, supporting English & Chinese, with free access to Claude 3.5 and GPT-4 for a limited time.

DeepSeek-R1 Series 👉 deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
Open-source reasoning models with MIT license by DeepSeek.

Kimi k1.5 👉 https://github.com/MoonshotAI/Kimi-k1.5 | https://kimi.ai/
An o1-level multi-modal model by Moonshot AI, using reinforcement learning with long and short chain-of-thought and supporting up to 128k tokens.

And todayโ€ฆ

Hunyuan 3D-2.0 👉 tencent/Hunyuan3D-2
A SoTA 3D synthesis system for high-res textured assets by Tencent Hunyuan, with open weights and code!

Stay tuned for more updates 👉 https://huggingface.co/zh-ai-community