RAHUL YASHWANTKUMAR GUPTA

ryg81

AI & ML interests

None yet

Recent Activity

reacted to ahmed-masry's post with 👍 about 23 hours ago
updated a collection 2 days ago
Other Models

Organizations

None yet

ryg81's activity

reacted to ahmed-masry's post with 👍 about 23 hours ago
Happy to announce AlignVLM 📏 – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What's the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅
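
As a rough illustration of that idea, here is a minimal PyTorch sketch (not the authors' implementation; the class name, the single linear projection to vocabulary logits, and all sizes are assumptions for illustration): each vision feature is softmax-mapped to convex weights over the LLM vocabulary, and the output is the corresponding weighted average of the LLM's text-embedding rows, so it lands inside the LLM's embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignConnectorSketch(nn.Module):
    """Illustrative sketch (not the paper's code): project vision features to
    vocabulary logits, softmax them, and take the resulting weighted average
    of the LLM's text-embedding rows."""

    def __init__(self, vision_dim: int, llm_embed: torch.Tensor):
        super().__init__()
        vocab_size, _ = llm_embed.shape
        # Frozen LLM input-embedding matrix, shape (vocab_size, llm_dim).
        self.register_buffer("llm_embed", llm_embed)
        # Assumed single linear projection from vision features to vocab logits.
        self.to_vocab = nn.Linear(vision_dim, vocab_size)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        weights = F.softmax(self.to_vocab(vision_feats), dim=-1)  # convex weights over vocab
        return weights @ self.llm_embed  # (batch, num_patches, llm_dim)

# Toy usage with stand-in sizes: 256-dim vision features into a 512-dim LLM space.
connector = AlignConnectorSketch(vision_dim=256, llm_embed=torch.randn(1000, 512))
out = connector(torch.randn(2, 32, 256))  # -> (2, 32, 512)
```

Because the softmax weights are non-negative and sum to one, the output is always a convex combination of real text embeddings, which is one plausible reading of why such a mapping stays interpretable to the LLM.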

🔬 How does it perform?
We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector (see the sketch after these results):
✅ ALIGN Connector: Minimal drop (↓1.67%) – proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.

Code & model weights coming soon! Stay tuned! 🔥
replied to Jaward's post 1 day ago

Looks good, but will they release it for us to use, or are they just showing their development and teasing us? :)

New activity in lmstudio-community/MiniCPM-o-2_6-GGUF 3 days ago

getting error

#1 opened 7 days ago by ryg81
reacted to fuzzy-mittenz's post with 🔥 5 days ago
Not many seemed to notice, but what was probably meant to be a WIN for artists' rights from the US Copyright Office has solved some fundamental issues for the community.
In our recent article I outline how companies like Suno, OpenAI, and Midjourney can no longer claim any right to copy the work you create with their platforms.
We also look at other ways this study and the new rules for AI will fundamentally affect creators who use it, and how companies' incentives to give them control over certain aspects might change because of this. It's broken down pretty well here: https://huggingface.co/blog/fuzzy-mittenz/copyright-in-ai

Can something similar apply to image generation models? (I am not a programmer or an AI expert.)

replied to m-ric's post 12 days ago
view reply

Looking for some tutorials on this topic.

reacted to AdinaY's post with ❤️ 15 days ago
What happened yesterday in the Chinese AI community? 🚀

T2A-01-HD 👉 https://hailuo.ai/audio
MiniMax's Text-to-Audio model, now in Hailuo AI, offers 300+ voices in 17+ languages and instant emotional voice cloning.

Trae 👉 https://www.trae.ai/
A new coding tool by ByteDance for professional developers, supporting English & Chinese, with free access to Claude 3.5 and GPT-4 for a limited time.

DeepSeek-R1 Series 👉 deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
Open-source reasoning models with MIT license by DeepSeek.

Kimi k1.5 👉 https://github.com/MoonshotAI/Kimi-k1.5 | https://kimi.ai/
An o1-level multi-modal model by Moonshot AI, using reinforcement learning with long and short chain-of-thought and supporting up to 128k tokens.

And todayโ€ฆ

Hunyuan 3D-2.0 👉 tencent/Hunyuan3D-2
A SoTA 3D synthesis system for high-res textured assets by Tencent Hunyuan, with open weights and code!

Stay tuned for more updates 👉 https://huggingface.co/zh-ai-community