sometimesanotion

AI & ML interests

Agentic LLM services, model merging, finetunes, distillation

Organizations

Hugging Face Discord Community

sometimesanotion's activity

reacted to sequelbox's post with ➕ 1 day ago
New sneak preview of my next release! Raiden is a deepseek-ai/DeepSeek-R1 synthetic dataset that uses creative-reasoning and analytic-reasoning prompts!

This preview release has the first 5.8k rows, all responses generated using DeepSeek's 685b parameter R1 model: sequelbox/Raiden-DSR1-PREVIEW

Enjoy this look at R1's reasoning skills! Full dataset coming soon.
posted an update 2 days ago
"And don't even get me started on the '-v6' tacked onto the end. That's like when your grandma names her new cat 'Whiskers II.' We all know Whiskers I was the real deal."

- sometimesanotion/Qwenvergence-14B-v13-Prose-DS critiquing my model naming conventions
reacted to CultriX's post with 🔥 2 days ago
# Multi-Agent Collaboration for Coding Tasks - Updated Space!

This version does not rely on AutoGen.
The user simply enters their OPENAI_API_KEY and a task, and the Space goes to work, employing:
1. a prompt-enhancer agent,
2. an orchestrator agent,
3. a coder agent,
4. a code-reviewing agent, and
5. a code documentation generator agent.

See below image for an example workflow:

CultriX/MultiAgent-CodeTask
  • 1 reply
Β·
replied to their post 2 days ago

Okay, this has become a major component of how I build model_stocks that keep IFEVAL high even while merging distantly related models, and it's the reason for some of the TIES merges you might have seen that "qwenvergify" models.

Here's the basic idea:
https://www.arcee.ai/blog/use-mergekit-to-extract-lora-adapters-from-any-fine-tuned-model

But not as many models are inter-compatible for LoRAs as you'd expect, because there are minor variations in size among some important finetunes. I get the train tracks to a standard width, as it were, with the "qwenvergify" TIES merges between two models: weight 1.0 for the model of interest, and weight 0.0 for any Qwenvergence or Lamarck model to supply the tiny bit of infill. Once all the models are intercompatible, you can do what is akin to a super-high-precision DELLA merge of the most significant, most IFEVAL-preserving parts of each model. A rank 512 adapter extracts around 30% of a model's most defining weights but captures around 90% of its distinct performance; a rank 128 adapter captures around 8% of the model, but about 70% of its distinct performance.
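
To make that concrete, here's a rough sketch of what one "qwenvergify" pass can look like as a mergekit TIES config. The model names and the base are placeholders, not a recipe:

```yaml
# Rough sketch of a "qwenvergify" TIES pass (placeholder model names).
# Weight 1.0 keeps the model of interest as-is; weight 0.0 on a Qwenvergence
# model contributes nothing except the infill needed to standardize tensor
# sizes, so LoRA extraction works across the whole collection.
merge_method: ties
base_model: Qwen/Qwen2.5-14B                                # assumed common ancestor
models:
  - model: example-org/interesting-14B-finetune             # hypothetical model of interest
    parameters:
      weight: 1.0
      density: 1.0
  - model: sometimesanotion/Qwenvergence-14B-v13-Prose-DS   # infill donor
    parameters:
      weight: 0.0
      density: 1.0
dtype: bfloat16
```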

I arrived at this while thinking about the implications of @rombodawg's "Continuous Fine Tuning" strategy, and reading an arXiv paper I've since lost track of - I really need to find it again. It's like the opposite side of the coin from how rombodawg uses it: I use LoRA extraction at the beginning to get a large model_stock started, while he uses it at the end to extract most of the merge and apply it to a target model to avoid catastrophic forgetting.

There. Now you know the methodology behind my merge YAML that produced https://huggingface.co/sometimesanotion/Qwenvergence-14B-v13-Prose-DS - or, the model that calls itself "Qwenconceited-14B-v13-DeepSuffering". 😆

Adapters from a strong IFEVAL+BBH model, applied to the majority of the models in the model_stock merge in a mixture of rank sizes between 32 and 128, get them on the same page for core operation. Applying a Virtuoso or Chocolatine-based LoRA to just any model out there could cause instability, but the model_stock smooths out the varying levels of adapter merges.
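
For the shape of that stage, a model_stock config is about this simple - illustrative names only, and assume each contributor already has its adapter merged in:

```yaml
# Illustrative model_stock stage (hypothetical model names). Each contributor
# is assumed to already carry a rank 32-128 adapter from a strong IFEVAL+BBH
# model; model_stock then averages the contributors around the base, which is
# what smooths out the differing adapter strengths.
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B              # assumed common base
models:
  - model: example-org/prose-14B          # hypothetical
  - model: example-org/reasoning-14B      # hypothetical
  - model: example-org/multilingual-14B   # hypothetical
dtype: bfloat16
```

Note that model_stock takes no per-model weights; the smoothing comes from how it places the merge between the base and the contributors.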

That's enough for you to digest for now, and @rombodawg might be interested to know he inspired such a different strategy from anything he's shared.

replied to their post 2 days ago

You can reach me on Discord, my username is as you'd expect.

Once I show you how Qwentinuum broke the barrier and finally got stabilized, and made Vimarckoso v3, you'll see why I'm being a little careful. It takes multiple steps to reliably tame weighty breadcrumbs merges, and I'm using Makefiles to make sure nothing gets skipped. That's not so easily posted to a modelcard! If people misuse parts of my recipe, especially with more CoT models out there, we'll get spammed with a lot of unstable models.

But the rewards of getting it right!
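
For a general flavor only - placeholder models and values, not the Lamarck recipe - a single breadcrumbs step in mergekit looks roughly like this, and it's the ordering of passes around steps like it that the Makefiles keep honest:

```yaml
# Generic breadcrumbs step, illustrative values only (not my recipe).
# gamma trims a sliver of the largest-magnitude deltas and density keeps a
# fraction of the rest, which is part of why heavier breadcrumbs merges need
# careful, ordered follow-up steps to stay stable.
merge_method: breadcrumbs_ties
base_model: Qwen/Qwen2.5-14B              # assumed base
models:
  - model: example-org/reasoning-14B      # hypothetical
    parameters:
      weight: 0.5
      density: 0.9
      gamma: 0.01
  - model: example-org/prose-14B          # hypothetical
    parameters:
      weight: 0.5
      density: 0.9
      gamma: 0.01
dtype: bfloat16
```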

replied to their post 3 days ago

I've really been pondering that, and it's almost certainly because of the blend of R1 and Krystalan/DRT-o1-14B. We have two different CoT lineages feeding into one model - wonderful, until it's not! DRT is a bit hard to give up. I think this is where we've finally done all we can with merging, however fancy, and need to get down to fine-tuning, because if DRT's and DS's influences sync up, it'll be magic.

replied to their post 3 days ago

I've spilled some of the beans in little separate doses, because I've hoped to prompt people to fill in the blanks with unique ideas rather than inspire a lot of copypasta. There's a lot of stuff that is just unique to my own workflow, but there's also some reaaaaally long and detailed YAML.

I do feel that what happens between the model_stock + LoRAs and the SLERP+TIES has been loosely described. It really is just a bit of general info about which layers influence which metric, like multiple gradients overlaid. I tend to keep densities under 0.40 or even 0.30, because if there's a strong core model, each extra model needs to leave headroom for the others.
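
As a generic illustration of the "gradients overlaid" idea - placeholder models and values, nothing from Lamarck's actual YAML - mergekit treats list-valued parameters as gradients across layers, so each model's influence can rise or fall through the depth of the network while densities stay low:

```yaml
# Generic illustration only (placeholder models and values).
# List-valued weights act as layer-wise gradients; densities stay around
# 0.30-0.40 so each contributor leaves headroom for the others.
merge_method: ties
base_model: sometimesanotion/Qwenvergence-14B-v13-Prose-DS   # illustrative core model
models:
  - model: example-org/instruction-following-14B   # hypothetical
    parameters:
      weight: [0.8, 0.6, 0.4, 0.2, 0.1]            # strongest in early layers
      density: 0.35
  - model: example-org/reasoning-14B               # hypothetical
    parameters:
      weight: [0.1, 0.2, 0.4, 0.6, 0.8]            # strongest in later layers
      density: 0.30
dtype: bfloat16
```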

Hit me up, though, I'm particularly grateful for your contribution!

replied to their post 3 days ago

While Arcee beats Lamarck 0.7 and tempesthenno-ppo-ckpt40 for IFEVAL, BBH, and MATH, you score 23.55% higher on GPQA, 1.96% higher on MUSR, and 2.49% higher on MUSR than Virtuoso Small v2.

Plus, I'm thinking you fine-tune for use cases Arcee and I don't.

replied to their post 3 days ago

@Inschrift-Spruch-Raum is right. My estimates based on the comparator were only partially correct. Are percentile calculations on the leaderboard changing? Regardless, this graph shows why nobody needs to give up on their models, especially when each one is making a specialized contribution. Diversity is a benefit to us all.

I really like how this class of models proves that MUSR isn't just what you get when you throw IFEVAL into a blender. 😆
[Graph: newplot.png]

replied to their post 3 days ago

Wow! I have to check what I saw in the comparator and based my estimates on. There's no doubt that Virtuoso Small v2 is a great model, and I'm already working on a Qwenvergence based on it. It's as awesome at IFEVAL and BBH as I'd thought.

Qwenvergence is the model_stock that produces the bases blended in varying proportions across Lamarck's layers. Yet it's not mere raw material: I'm getting really outstanding results from the successor to Qwenvergence-14B-v12-Prose-DS, which includes Virtuoso Small v2. It's playing very nicely with the other components!

posted an update 4 days ago
I'm just saving today's 14B parameter chart, because big things are about to hit. Lamarck v0.7 has been surpassed by at least two models I know of, and in ways that promise good things to come for the whole scene. I am taking my time to enjoy the progress, and Lamarck v0.8 will come when it's clearly keeping up and keeping its flavor.

There is no one best model for everyone, regardless of these rankings. I aim to make Lamarck good at coding, translating, and rigorously critiquing rhetoric and logic. Always check out the authors' notes on models to see if their intent is close to your use case!
replied to their post 4 days ago

My high-benchmarking merges have included Virtuoso v1 at nearly every stage, and I am now creating a new generation switching in V2 where apt.

Feedback from finetuners suggests my minimal compute and Arcee's MergeKit have given them a shortcut to great results. Smart merging really is energy efficient. Thank you for helping us push the limits!

replied to their post 4 days ago

Any model of yours made for a purpose beyond benchmarks has a reason unto itself. Your tempesthenno-ppo-ckpt40 does neat things. I've also found surprise pops in benchmarks for merges when two models with similar scores arrive at them in different and complementary ways.

Not gonna lie, my merge strategy for Lamarck v0.8 was made with the expectation of 3-4 models with different strengths, and the combination of IFEVAL, BBH, MATH, and CoT in Virtuoso-Small-v2 is forcing me to look hard at that.

posted an update 5 days ago
**Update** Either I had some wrong numbers plugged in to estimate benchmark numbers from comparator, or the benchmark changed. Virtuoso Small v2 at 41.07 average is still very impressive, especially for writing draft copy for business purposes, while Lamarck remains a chatty generalist-reasoning model.

I've felt confident that 14B Qwen finetunes and merges could break the 42.0 average, and Arcee **came close** with https://huggingface.co/arcee-ai/Virtuoso-Small-2. Congratulations to @arcee-ai !

Just two months ago, it was easy to think that 14B had plateaued, that you could have high IFEVAL or high MUSR/MATH/GPQA at 14B, but not both. That barrier is completely shattered. I see a pathway to even better, and Virtuoso Small 2 is a big part of why. Very impressive work. This community would expect no less from Arcee.

Just look at this graph! Keep in mind, my merges here build on the first Virtuoso Small, and *-DS merges build on DeepSeek R1. There are some impressive merges in the pipe!
replied to their post 5 days ago

You helped this project get started, validating merge methods. Thank you!

reacted to onekq's post with 🚀 5 days ago
Mistral Small 3 is SUPER fast and has the highest score for a 20+B model, but it's still 11 points below Qwen 2.5 Coder 32B.

I believe specialty models are the future. The more you know what to do with the model, the better bang you can get for your buck. If Mistral scopes this small model to coding only, I'm confident they can beat Qwen.

One day my leaderboard will be dominated by smol models excellent on one thing, not monolithic ones costing $$$. And I'm looking forward to that.

onekq-ai/WebApp1K-models-leaderboard
  • 1 reply
Β·
reacted to merve's post with 👍 5 days ago
This week in open AI was 🔥 Let's recap! 🤗 merve/january-31-releases-679a10669bd4030090c5de4d
LLMs 💬
> Huge: AllenAI released new Tülu models that outperform DeepSeek R1 using Reinforcement Learning with Verifiable Rewards (RLVR) based on Llama 3.1 405B 🔥
> Mistral AI is back to open-source with their "small" 24B models (base & SFT), with Apache 2.0 license 😱
> Alibaba Qwen released their 1M context length models Qwen2.5-Instruct-1M, great for agentic use with Apache 2.0 license 🔥
> Arcee AI released Virtuoso-medium, a 32.8B LLM distilled from DeepSeek V3 with a dataset of 5B+ tokens
> Velvet-14B is a new family of 14B Italian LLMs trained on 10T tokens in six languages
> OpenThinker-7B is a fine-tuned version of Qwen2.5-7B-Instruct on the OpenThoughts dataset

VLMs & vision 👀
> Alibaba Qwen is back with Qwen2.5VL, amazing new capabilities ranging from agentic computer use to zero-shot localization 🔥
> NVIDIA released a new series of Eagle2 models with 1B and 9B sizes
> DeepSeek released Janus-Pro, a new any-to-any model (image-text generation from image-text input) with MIT license
> BEN2 is a new background removal model with MIT license!

Audio 🗣️
> YuE is a new open-source music generation foundation model for lyrics-to-song generation

Codebase 👩🏻‍💻
> We are open-sourcing our SmolVLM training and eval codebase! https://github.com/huggingface/smollm/tree/main/vision
> Open-R1 is an open-source reproduction of R1 by the @huggingface science team https://huggingface.co/blog/open-r1
  • 1 reply
Β·
reacted to davanstrien's post with 👀🔥 10 days ago
🌍 Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
• Japanese
• Italian
• Old High German

Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.
  • 1 reply
Β·