Natalia Elvira
nataliaElv
AI & ML interests
Data curation, high-quality data, multilinguality, NLP & computational linguistics
Recent Activity
posted an update 20 days ago
New chapter in the Hugging Face NLP course! 🤗 🚀
We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.
Any feedback or suggestions for improvement are welcome!
https://huggingface.co/learn/nlp-course/chapter10
reacted to davanstrien's post with 🚀 27 days ago
The https://huggingface.co/datasets/data-is-better-together/fineweb-c dataset is growing!
This week, a few more languages reached 1,000 annotations for the educational quality of data from https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.
Why should you care?
The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pretraining.
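A minimal sketch of that filtering step, assuming each document already carries a numeric educational-quality score. The `edu_score` field, the example scores, and the threshold of 3.0 are hypothetical and are not taken from fineweb-c or fineweb-2.

```python
from typing import Iterable


def filter_by_quality(docs: Iterable[dict], min_score: float = 3.0) -> list[dict]:
    """Keep only documents whose (hypothetical) quality score meets the threshold."""
    return [doc for doc in docs if doc["edu_score"] >= min_score]


# Toy documents with made-up scores, purely for illustration.
docs = [
    {"text": "Photosynthesis converts light into chemical energy.", "edu_score": 4.2},
    {"text": "Click here to win a prize!!!", "edu_score": 0.3},
    {"text": "A short history of the printing press.", "edu_score": 3.1},
]

kept = filter_by_quality(docs)
print(f"kept {len(kept)} of {len(docs)} documents")
```

Raising the threshold trades dataset size for quality, which is where the potential reduction in pretraining data comes from.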
Why not use an LLM?
LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.
The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:
- Evaluate whether an LLM can label the educational quality of texts in that language well
- Train quality classifiers directly on the annotations
- Help discover other rules and heuristics for refining fineweb2 further for different languages.
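As a rough illustration of the first point, one way to check how well an LLM labels a language is to compare its labels against the community annotations on the same texts. The sketch below computes Cohen's kappa in plain Python. The label names and the example lists are hypothetical, and this is not necessarily the evaluation the fineweb-c project uses.

```python
from collections import Counter


def cohen_kappa(human: list[str], llm: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items where both annotators chose the same label.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    human_counts, llm_counts = Counter(human), Counter(llm)
    expected = sum(human_counts[label] * llm_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four texts in one language.
human_labels = ["good", "good", "bad", "bad"]
llm_labels = ["good", "bad", "bad", "bad"]

kappa = cohen_kappa(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa near 1 suggests the LLM could stand in for human annotators in that language, while a value near 0 means its agreement is no better than chance.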
This week the following languages were completed:
Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod
Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate
Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap
Want to learn more? https://huggingface.co/blog/davanstrien/fineweb2-community
Contribute yourself here: https://huggingface.co/spaces/data-is-better-together/fineweb-c
posted an update 28 days ago
Do you want to easily save annotations to a Dataset in the Hub?
In the latest version of Argilla (v2.6.0), you can export your data directly from the UI to the Hub.
Check all the changes and update to the latest version: https://github.com/argilla-io/argilla/releases/tag/v2.6.0
Organizations
nataliaElv's activity
Update app.py
#1 opened 2 months ago by davidberenstein1957
More Argilla screenshots
#4 opened 3 months ago by nataliaElv
argilla-chapter-images
#3 opened 3 months ago by nataliaElv
Chapter 10 images
#2 opened 3 months ago by nataliaElv
Test
#1 opened 3 months ago by nataliaElv
Reconstruct tree of labels in the dataset
#2 opened 10 months ago by nataliaElv
Librarian Bot: Add language metadata for dataset
#2 opened 11 months ago by librarian-bot
Create guia-de-anotacion.md
2
#3 opened almost 2 years ago by nataliaElv
Draft: create guia-de-anotacion.md
#2 opened almost 2 years ago by nataliaElv
Draft: create guia-de-anotacion.md
1
#1 opened almost 2 years ago by nataliaElv
`DatasetGenerationError` when loading the dataset
4
#1 opened almost 2 years ago by nataliaElv