Maurice Weber
mauriceweber
mauriceweber's activity
Add paper citation
1
#30 opened 3 months ago
by
davanstrien
RPV2 ccnet preprocessing
1
#29 opened 6 months ago
by
bpwl0121
sample split details
3
#4 opened about 1 year ago
by
sujantkumarkv
How can I download the sample-10B as quickly as possible?
1
#28 opened 8 months ago
by
zgxiao
defunct book subset
4
#28 opened about 1 year ago
by
polinaeterna
How much disk space would the whole HF dataset take?
1
#27 opened 10 months ago
by
protossw512
rpv2-subsamples
1
#26 opened about 1 year ago
by
mauriceweber
The doc_id in duplicates is should contain?
3
#24 opened about 1 year ago
by
newbietuan
Deduplication steps
23
#15 opened about 1 year ago
by
ilyayudkovich
Here's a download script parallelized using Spark
1
#22 opened about 1 year ago
by
srowen
what is the meaning of snapshots in redpajama-data-v2?
2
#21 opened about 1 year ago
by
choidonghun
How to join documents and quality signals when downloading directly
3
#19 opened about 1 year ago
by
tgshdyfuhuf
Missing duplicates parquet files
5
#18 opened about 1 year ago
by
bebensee
Script to download all files of 1B sample data locally
2
#13 opened about 1 year ago
by
ivanzhouyq
What is the total size of the entirety of this dataset in TB?
1
#10 opened about 1 year ago
by
Bayaz
What's the concept on partitions
2
#5 opened over 1 year ago
by
SwatCat
quality_signals, minhash and duplicates missing
2
#3 opened over 1 year ago
by
sheshanshag
Request to add retries into RedPajama-Data-V2.py script
1
#16 opened about 1 year ago
by
yura38
How to obtain duplicates from minhash?
1
#8 opened over 1 year ago
by
cq
Obtaining Filtered Samples
4
#12 opened about 1 year ago
by
ssingh22