HuggingFaceBR4

company

Activity Feed

AI & ML interests

Removing dependencies, building everything from scratch

Recent Activity

craffel authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

hlarcher authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

hynky authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

View all activity

HuggingFaceBR4's activity

craffel

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

hlarcher

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

hynky

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

eliebak

authored a paper about 1 hour ago

INTELLECT-1 Technical Report

Paper • 2412.01152 • Published Dec 2, 2024

loubnabnl

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

clefourrier

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

guipenedo

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

eliebak

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

thomwolf

authored a paper about 1 hour ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 1 day ago • 60

hynky

authored a paper 21 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 23 days ago • 53

guipenedo

authored a paper 21 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 23 days ago • 53

thomwolf

authored a paper 21 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 23 days ago • 53

hlarcher

posted an update 21 days ago

Post

1063

We are introducing multi-backend support in Hugging Face Text Generation Inference!
With new TGI architecture we are now able to plug new modeling backends to get best performances according to selected model and available hardware. This first step will very soon be followed by the integration of new backends (TRT-LLM, llama.cpp, vLLM, Neuron and TPU).

We are polishing the TensorRT-LLM backend which achieves impressive performances on NVIDIA GPUs, stay tuned 🤗 !

Check out the details: https://huggingface.co/blog/tgi-multi-backend

thomwolf

posted an update about 2 months ago

Post

5266

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

clefourrier

authored a paper 2 months ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 17

thomwolf

posted an update 2 months ago

Post

1542

Exponentially growing number of open-source AI models over the course of the past 30 months – from a few thousands to over 1 million and more

Interactive data viz: huggingface/open-source-ai-year-in-review-2024

thomwolf

posted an update 2 months ago

Post

1488

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: https://aiworld.eu/embed/model/model/treemap
Discussion: huggingface/open-source-ai-year-in-review-2024

loubnabnl

posted an update 2 months ago

Post

2231

Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?

thomwolf

posted an update 2 months ago

Post

1716

Interesting long read from @evanmiller-anthropic on having a better founded statistical approach to Language Model Evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

thomwolf

posted an update 3 months ago

Post

1453

Very exciting new mistralai/Pixtral-Large-Instruct-2411 model from Mistral-AI

Impressive performances, huge congrats @patrickvonplaten @sgvaze @pandora-s @devendrachaplot @sophiamyang and team!

Very nice to have SOTA Multilingual OCR and Chart understanding in an open-weights model

AI & ML interests

Recent Activity

Team members 12

HuggingFaceBR4's activity