AI & ML interests

Training DALL-E with volunteers from all over the Internet using hivemind and dalle-pytorch (NeurIPS 2021 demo)

Recent Activity

training-transformers-together's activity

lhoestq 
posted an update 2 months ago
view post
Post
1829
Made a HF Dataset editor a la gg sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
julien-c 
posted an update 2 months ago
view post
Post
9457
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
·
thomwolf 
posted an update 2 months ago
view post
Post
5408
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
  • 2 replies
·
thomwolf 
posted an update 2 months ago
thomwolf 
posted an update 2 months ago
julien-c 
posted an update 2 months ago
view post
Post
3170
wow 😮

INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.

PrimeIntellect/INTELLECT-1-Instruct
thomwolf 
posted an update 3 months ago
thomwolf 
posted an update 3 months ago
thomwolf 
posted an update 4 months ago
view post
Post
4193
Parents in the 1990: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨
  • 2 replies
·
blanchon 
posted an update 5 months ago
lhoestq 
posted an update 7 months ago
view post
Post
4184
Hey ! I'm working on a 100% synthetic Dataset Hub here (you can search for any kind of datasets an the app invents them). The link is here: infinite-dataset-hub/infinite-dataset-hub

Question for the Community:

Which models should I use to generate images and audio samples for those datasets ? 🤗
  • 4 replies
·