iopo-exp (IOPO Experiments)

lewtun

posted an update about 23 hours ago

Post

1919

Introducing OpenR1-Math-220k!

open-r1/OpenR1-Math-220k

The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch 💪

What’s new compared to existing reasoning datasets?

♾ Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.

🐳 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.

📀 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.

⏳ Automated filtering: We apply Math Verify to only retain problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to retrieve more correct examples (e.g for cases with malformed answers that can’t be verified with a rules-based parser)

📊 We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.

🔎 Read our blog post for all the nitty gritty details: https://huggingface.co/blog/open-r1/update-2

lewtun

authored a paper 5 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 153

gabrielmbmb

authored a paper 5 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 153

lewtun

posted an update 17 days ago

Post

10034

We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret we can do it together in the open!

🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.

🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.

🔥 Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1

5 replies

·

lewtun

posted an update about 1 month ago

Post

3830

I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!

https://x.com/casper_hansen_/status/1875872309996855343

Together with the recent PRIME method [2] for scaling RL, reasoning for open models is looking pretty exciting for 2025!

[1] Training Large Language Models to Reason in a Continuous Latent Space (2412.06769)
[2] https://huggingface.co/blog/ganqu/prime

lewtun

posted an update about 1 month ago

Post

2279

This paper ( HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine where there are lots of aliases)
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc that one sees in o1
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks and SFT on this data already gives a strong improvement.

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!

1 reply

·

lewtun

posted an update about 2 months ago

Post

6852

We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

📈 Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

🎄 Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

🧭 Search and Learn: A lightweight toolkit for implementing search strategies with LLMs and built for speed with vLLM

Here's the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!

2 replies

·

JW17

authored 2 papers 3 months ago

Stable Language Model Pre-training by Reducing Embedding Variability

Paper • 2409.07787 • Published Sep 12, 2024

Cross-lingual Transfer of Reward Models in Multilingual Alignment

Paper • 2410.18027 • Published Oct 23, 2024

nlee-208

authored a paper 3 months ago

Cross-lingual Transfer of Reward Models in Multilingual Alignment

Paper • 2410.18027 • Published Oct 23, 2024

nlee-208

updated a model 5 months ago

iopo-exp/uf-qwen2-dpo-tgf_o50-iter1

Text Generation • Updated Sep 26, 2024 • 12

nlee-208

authored 2 papers 5 months ago

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Paper • 2406.06424 • Published Jun 10, 2024 • 13

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Paper • 2406.05761 • Published Jun 9, 2024 • 2

gabrielmbmb

posted an update 5 months ago

Post

1865

Yesterday @mattshumer released mattshumer/Reflection-Llama-3.1-70B, an impressive model that achieved incredible results in benchmarks like MMLU. The model was fine-tuned using Reflection-Tuning and the dataset used wasn't released, but I created a small recipe with distilabel that allows generating a dataset with a similar output format:

1. We use MagPie 🐦 in combination with https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct to generate reasoning instructions.
2. We generate a response again using https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct, but we steer the LLM to generate an specific output format using a custom system prompt. In the system prompt, we instruct the LLM that it will have first to think 💭 and have reflections that will help resolving ambiguities. After that, we instruct the LLM to generate an output based on the previous thinking

In this dataset gabrielmbmb/distilabel-reflection-tuning you can found 5 rows that I generated with this recipe. You can also found the code of the pipeline in the file called reflection.py.

nlee-208

updated a model 6 months ago

iopo-exp/uf-l31-orpo-base-armo-iter1

Updated Aug 30, 2024 • 8

alvarobartt

posted an update 6 months ago

Post

2976

🤗 Serving Meta Llama 3.1 405B on Google Cloud is now possible via the Hugging Face Deep Learning Containers (DLCs) for Text Generation Inference (TGI)

In this post, we showcase how to deploy https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an A3 instance with 8 x H100 GPUs on Vertex AI

Thanks to the Hugging Face DLCs for TGI and Google Cloud Vertex AI, deploying a high-performance text generation container for serving Large Language Models (LLMs) has never been easier. And we’re not going to stop here – stay tuned as we enable more experiences to build AI with open models on Google Cloud!

Read the full post at https://huggingface.co/blog/llama31-on-vertex-ai

gabrielmbmb

posted an update 6 months ago

Post

2928

distilabel 1.3.0 is out! This release contains many core improvements and new tasks that help us building argilla/magpie-ultra-v0.1!

Distributed pipeline execution with Ray, new Magpie tasks, reward models, components for dataset diversity based on sentence embeddings, Argilla 2.0 compatibility and many more features!

Check the new release in GitHub: https://github.com/argilla-io/distilabel

gabrielmbmb

posted an update 6 months ago

Post

3577

Just dropped magpie-ultra-v0.1! The first open synthetic dataset generated with Llama 3.1 405B. Created with distilabel, it's our most advanced and compute-intensive pipeline to date. We made the GPUs of the cluster go brrrrr 🚀

argilla/magpie-ultra-v0.1

Take it a look and tell us what you think! Probably, the models taking the most out of it are smol models 🤗 We will be improving the dataset in upcoming iterations!

gabrielmbmb

posted an update 8 months ago

Post

2507

⚗️ distilabel 1.2.0 is out and it comes with improved support for structured generation, new tasks for generating datasets for training embedding models, new steps for loading data, MixtureOfAgentsLLM and improved docs.

We would love to see a few new datasets for training embedding models built with distilabel on the Hub! ❤️

kashif

authored a paper 8 months ago

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Paper • 2406.06424 • Published Jun 10, 2024 • 13

IOPO Experiments

AI & ML interests

Recent Activity

iopo-exp's activity

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Stable Language Model Pre-training by Reducing Embedding Variability

Cross-lingual Transfer of Reward Models in Multilingual Alignment

Cross-lingual Transfer of Reward Models in Multilingual Alignment

iopo-exp/uf-qwen2-dpo-tgf_o50-iter1

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

iopo-exp/uf-l31-orpo-base-armo-iter1

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

AI & ML interests

Recent Activity

Team members 7

iopo-exp's activity