Please use discussions for this kind of thing: https://huggingface.co/spaces/Be-Bo/llama-3-chatbot_70b/discussions, not a post.
Amir Hossein Kargaran
kargaranamir
AI & ML interests
#NLP, check out https://huggingface.co/cis-lmu
Recent Activity
liked
a model
1 day ago
mann-e/Hormoz-8B
liked
a Space
3 days ago
huggingface-projects/repo_duplicator
liked
a model
6 days ago
deepseek-ai/DeepSeek-V3
Organizations
kargaranamir's activity
reacted to
louisbrulenaudet's
post
8 months ago
Post
4067
Mixtral or Llama 70B on Google Spreadsheet thanks to Hugging Face's Serverless Inference API 🤗
The Add-on is now available on the HF repo "Journalists on Hugging Face" and allows rapid generation of synthetic data, automatic translation, question answering, and more from simple spreadsheet cells 🖥️
Link to the 🤗 Space: JournalistsonHF/huggingface-on-sheets
Although this tool was initially developed for journalists, it has found a much wider audience among daily users of the Google suite, and many use cases remain to be explored.
Only a free Hugging Face API key is required to start using this no-code extension.
Do not hesitate to submit ideas for features that we could add!
Thanks to @fdaudens for initiating this development.
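Under the hood, each spreadsheet cell boils down to one HTTP call to the Serverless Inference API. A minimal sketch of that request, assuming the standard `api-inference.huggingface.co` endpoint and `{"inputs": ...}` payload shape; the model name and token are illustrative:

```python
import json
import urllib.request

# Illustrative model; the add-on lets you pick Mixtral or Llama 70B.
API_URL = "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1"

def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Build the HTTP request one spreadsheet cell would trigger."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 128}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # your free HF API key
            "Content-Type": "application/json",
        },
    )

req = build_request("Translate to French: good morning", "hf_xxx")
# urllib.request.urlopen(req) would then return the generated text.
```

This is a sketch, not the add-on's actual Apps Script code; it only shows why a free API key is the single prerequisite.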
posted
an
update
8 months ago
Post
1218
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.
🤗 corpus v1: cis-lmu/GlotCC-V1
pipeline v3: https://github.com/cisnlp/GlotCC
More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.
Limitation: Because low-resource languages occur far less frequently than others, some very low-resource languages have only a few sentences available. However, the English data in this version stands at 750GB, and the top 200 languages still have a strong presence (see the attached plot; the axis is labeled every 20 languages, so the 10th tick marks the 200th language).
reacted to
akhaliq's
post with ❤️
12 months ago
Post
Aya Dataset
An Open-Access Collection for Multilingual Instruction Tuning
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning (2402.06619)
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
posted
an
update
12 months ago
Post
A Text Language Identification Model with Support for 2000+ Labels:
space: cis-lmu/glotlid-space
model: cis-lmu/glotlid
github: https://github.com/cisnlp/GlotLID
paper: GlotLID: Language Identification for Low-Resource Languages (2310.16248)
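GlotLID is a fastText-based classifier, so predictions come back as fastText labels. A minimal sketch of turning a prediction into a language code and script, assuming the `__label__eng_Latn` label format (fastText's `__label__` prefix plus an ISO 639-3 code and script); loading the real model would use `huggingface_hub` and the `fasttext` package:

```python
def parse_glotlid_label(label: str) -> tuple[str, str]:
    """Split a fastText-style label into (ISO 639-3 code, script)."""
    code = label.removeprefix("__label__")  # e.g. "eng_Latn"
    lang, script = code.split("_", 1)
    return lang, script

parse_glotlid_label("__label__eng_Latn")  # ("eng", "Latn")
```

With 2000+ labels, splitting out the script this way makes it easy to group predictions by language regardless of writing system.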
space: cis-lmu/glotlid-space
model: cis-lmu/glotlid
github: https://github.com/cisnlp/GlotLID
paper: GlotLID: Language Identification for Low-Resource Languages (2310.16248)