Anton Lozhkov

anton-l

AI & ML interests

Generative Models, Distributed Training, Photo and Video Enhancement

Organizations

Hugging Face, 🧨Diffusers, Hugging Face Internal Testing Organization, superb, Anton's SUPERB Test Org, Util scripts for speech recognition, Speech Recognition Community Event Version 2, Internal Data & Models for Speech Recognition Event, OpenSLR, (De)fusing, HuggingFaceGECLM, BigCode, CompVis, Hugging Face H4, CompVis Community, BigCode Data, Hugging Face TB Research, huggingPartyParis, HuggingFaceFW, Cosmopedia Stories Collab, StarCoder2 Data, Data Agents, Argilla Warehouse, smol-explorers, swissai-hf-data, Hugging Face Science, Open R1

anton-l's activity

posted an update about 2 months ago
Introducing πŸ“π…π’π§πžπŒπšπ­π‘: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.
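For a quick look at the data, here is a minimal loading sketch; the "finemath-4plus" config name and the "text" column are assumptions, so double-check them against the dataset card:

```python
from datasets import load_dataset

# Stream the dataset to peek at a few pages without downloading everything.
# NOTE: the "finemath-4plus" config and the "text" column are assumptions;
# check the dataset card for the exact names.
ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                  split="train", streaming=True)
for example in ds.take(2):
    print(example["text"][:200])
```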

We build the dataset by:
πŸ› οΈ carefully extracting math data from Common Crawl;
πŸ”Ž iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and other public math datasets.
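In the same spirit, here is a minimal continued pre-training sketch with the Trainer API; the hyperparameters, config name, and tiny data slice are illustrative, not the ablation recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Illustrative settings only; not the released ablation setup.
model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# "finemath-4plus" config and "text" column are assumptions;
# a tiny slice keeps the demo cheap.
ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                  split="train").select(range(1000))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finemath-ablation",
                           per_device_train_batch_size=1,
                           max_steps=100),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```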

We hope this helps advance the performance of LLMs on math and reasoning! πŸš€
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
reacted to loubnabnl's post with πŸ€—β€οΈπŸ”₯ 11 months ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
πŸ“š You can leverage various resources for diversity: using different seed data, generation formats, and target audiences.
βš™οΈ The importance of a good technical stack: for scalable generations with tools like llm-swarm and fast model training and evaluation.

Have a good read!