Guilherme Penedo

guipenedo

Recent Activity

authored a paper about 3 hours ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

upvoted a paper about 6 hours ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

commented on an article 4 days ago

Open-R1: Update #1

guipenedo's activity

New activity in huggingface-legal/takedown-notices 6 days ago

Update 2025/2025-01-22-Torstar.md

#4 opened 6 days ago by

New activity in HuggingFaceFW/fineweb-edu 14 days ago

New update returns a 500 server error using the datasets-server API

#18 opened about 1 month ago by

New activity in HuggingFaceFW/fineweb-2 17 days ago

Synthetic Data Generator

#5 opened 26 days ago by

New activity in HuggingFaceFW/fineweb-2 29 days ago

Cannot load with datasets

#4 opened 29 days ago by

New activity in HuggingFaceFW/fineweb-edu about 1 month ago

A lot of load errors after new update

#19 opened about 1 month ago by

Add "date" column to "default" subset

#20 opened about 1 month ago by

New activity in HuggingFaceFW/fineweb about 2 months ago

Simple exact deduplication removes 2/3 of data.

#49 opened 6 months ago by

Torrent?

#4 opened 10 months ago by

Any plan to train models on larger subset of dataset?

#8 opened 10 months ago by

Are copyrighted works included in this dataset?

#9 opened 10 months ago by

Reprocessing for a new language

#12 opened 10 months ago by

Training configs for data ablation study

#14 opened 10 months ago by

tiny-fineweb

#19 opened 9 months ago by

Unsafe files

#25 opened 9 months ago by

"Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20" using fineweb by Karpathy

#28 opened 8 months ago by

Regarding the newly updated indexes (written as deduplication issues)

#29 opened 8 months ago by

Dedup

#32 opened 8 months ago by

Language subset

#33 opened 8 months ago by

How to compute the aggregate score?

#35 opened 8 months ago by

why do you apply "All filters except the (very destructive) terminal_punct"

#36 opened 8 months ago by