Amir Hossein Kargaran

kargaranamir

AI & ML interests

#NLP, check out https://huggingface.co/cis-lmu

Recent Activity

liked a model 1 day ago
mann-e/Hormoz-8B
liked a Space 3 days ago
huggingface-projects/repo_duplicator
liked a model 6 days ago
deepseek-ai/DeepSeek-V3

Organizations

CIS, LMU Munich · DH and NLP Lab · Blog-explorers · Balochi Machine Learning · Social Post Explorers · Hugging Face Discord Community · SIG on Iranian languages

kargaranamir's activity

replied to Smoke666's post 8 months ago
reacted to louisbrulenaudet's post with 👍 8 months ago
Mixtral or Llama 70B on Google Spreadsheet thanks to Hugging Face's Serverless Inference API 🤗

The Add-on is now available on the HF repo "Journalists on Hugging Face" and allows rapid generation of synthetic data, automatic translation, question answering, and more from simple spreadsheet cells 🖥️

Link to the 🤗 Space: JournalistsonHF/huggingface-on-sheets

Although this tool was initially developed for journalists, it has found a much wider audience among everyday users of the Google suite, and there are still many more use cases to explore.

Only a free Hugging Face API key is required to start using this no-code extension.

Do not hesitate to submit ideas for features that we could add!

Thanks to @fdaudens for initiating this development.
  • 4 replies
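For context, here is a minimal Python sketch of the kind of call the add-on makes behind a spreadsheet formula, using the Serverless Inference API via huggingface_hub; the model ID, prompt, and parameters are illustrative assumptions, not the add-on's actual code.

from huggingface_hub import InferenceClient

# Illustrative model choice; any model hosted on the Serverless Inference API works.
client = InferenceClient(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token="hf_xxx",  # your free Hugging Face API key
)

# The add-on exposes calls like this as spreadsheet formulas.
prompt = "Translate to French: The report is due on Friday."
print(client.text_generation(prompt, max_new_tokens=128))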
reacted to their post with 👍 8 months ago
posted an update 8 months ago
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less often than others, some very low-resource languages have only a few sentences available. However, English alone accounts for 750GB in this version, and the top 200 languages still have a strong presence in our data (see the attached plot; the x-axis is labeled every 20 languages, so the 10th label marks the 200th language).
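As a rough sketch of how the corpus can be consumed with the datasets library (the per-language subset name below is an assumed placeholder; see the dataset card for the actual configurations):

from itertools import islice
from datasets import load_dataset

# Stream a single language subset rather than downloading the full 2TB corpus.
# "bal-Arab" (Balochi, Arabic script) is an assumed config name, for illustration only.
glotcc = load_dataset("cis-lmu/GlotCC-V1", "bal-Arab", split="train", streaming=True)
for row in islice(glotcc, 3):
    print(row)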
reacted to akhaliq's post with ❤️ 12 months ago
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning (2402.06619)

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
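A minimal loading sketch with the datasets library; the Hub repo ID and column names below are assumptions, so check the dataset card for the exact identifiers:

from datasets import load_dataset

# Assumed repo ID for the human-curated Aya Dataset.
aya = load_dataset("CohereForAI/aya_dataset", split="train")
example = aya[0]
print(example["inputs"], "->", example["targets"])  # assumed field names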
posted an update 12 months ago