Guilherme Penedo
guipenedo
Recent Activity
authored a paper about 3 hours ago
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
upvoted a paper about 6 hours ago
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
commented on an article 4 days ago
Open-R1: Update #1
guipenedo's activity
Update 2025/2025-01-22-Torstar.md
#4 opened 6 days ago by guipenedo
New update returns a 500 server error using the datasets-server API
6
#18 opened about 1 month ago by jonna32
Synthetic Data Generator
1
#5 opened 26 days ago by kishorekashyap
Cannot load with datasets
3
#4 opened 29 days ago by mbanon
A lot of load errors after new update
14
#19 opened about 1 month ago by yzhangcs
Add "date" column to "default" subset
#20 opened about 1 month ago by lhoestq
Simple exact deduplication removes 2/3 of data.
4
#49 opened 6 months ago by egor-pakhomov
Torrent?
3
#4 opened 10 months ago by emilss
Any plan to train models on larger subset of dataset?
1
#8 opened 10 months ago by mrfakename
Are copyrighted works included in this dataset?
4
#9 opened 10 months ago by umm-maybe
Reprocessing for a new language
14
#12 opened 10 months ago by pere
Training configs for data ablation study
2
#14 opened 10 months ago by jimmyhbx
tiny-fineweb
3
#19 opened 9 months ago by 3thn
Unsafe files
1
#25 opened 9 months ago by alielfilali01
"Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20" using fineweb by Karpathy
#28 opened 8 months ago by clem
Regarding the newly updated indexes (written as deduplication issues)
5
#29 opened 8 months ago by kimcando
Language subset
3
#33 opened 8 months ago by talmor
How to compute the aggregate score?
1
#35 opened 8 months ago by mornmirror
Why do you apply "All filters except the (very destructive) terminal_punct"?
3
#36 opened 8 months ago by bpwl0121