Amir Hossein Kargaran

kargaranamir

AI & ML interests

#NLP, check out https://huggingface.co/cis-lmu

Recent Activity

liked a model 1 day ago
mann-e/Hormoz-8B
liked a Space 3 days ago
huggingface-projects/repo_duplicator
liked a model 6 days ago
deepseek-ai/DeepSeek-V3

Organizations

CIS, LMU Munich · DH and NLP Lab · Blog-explorers · Balochi Machine Learning · Social Post Explorers · Hugging Face Discord Community · SIG on Iranian languages

kargaranamir's activity

replied to Smoke666's post 8 months ago
reacted to louisbrulenaudet's post with 👍 8 months ago
Mixtral or Llama 70B on Google Spreadsheet thanks to Hugging Face's Serverless Inference API 🤗

The Add-on is now available on the HF repo "Journalists on Hugging Face" and allows rapid generation of synthetic data, automatic translation, question answering, and more from simple spreadsheet cells 🖥️

Link to the 🤗 Space: JournalistsonHF/huggingface-on-sheets

Although this tool was initially developed for journalists, it has found a much wider audience among everyday users of the Google suite, and there are still many more use cases to explore.

Only a free Hugging Face API key is required to start using this no-code extension.

Do not hesitate to submit ideas for features that we could add!

Thanks to @fdaudens for initiating this development.
  • 4 replies
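For context, here is a minimal Python sketch of the kind of call the add-on makes behind a spreadsheet formula, using the Serverless Inference API via huggingface_hub; the model ID, prompt, and parameters are illustrative assumptions, not the add-on's actual code.

from huggingface_hub import InferenceClient

# Illustrative model choice; any model hosted on the Serverless Inference API works.
client = InferenceClient(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token="hf_xxx",  # your free Hugging Face API key
)

# The add-on exposes calls like this as spreadsheet formulas.
prompt = "Translate to French: The report is due on Friday."
print(client.text_generation(prompt, max_new_tokens=128))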
reacted to their post with 👍 8 months ago
posted an update 8 months ago
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less often than others, some very low-resource languages have only a few sentences available. However, English alone accounts for 750GB in this version, and the top 200 languages still have a strong presence in our data (see the attached plot; the x-axis is labeled every 20 languages, so the 10th label marks the 200th language).
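As a rough sketch of how the corpus can be consumed with the datasets library (the per-language subset name below is an assumed placeholder; see the dataset card for the actual configurations):

from itertools import islice
from datasets import load_dataset

# Stream a single language subset rather than downloading the full 2TB corpus.
# "bal-Arab" (Balochi, Arabic script) is an assumed config name, for illustration only.
glotcc = load_dataset("cis-lmu/GlotCC-V1", "bal-Arab", split="train", streaming=True)
for row in islice(glotcc, 3):
    print(row)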
reacted to akhaliq's post with ❤️ 12 months ago
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning (2402.06619)

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
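A minimal loading sketch with the datasets library; the Hub repo ID and column names below are assumptions, so check the dataset card for the exact identifiers:

from datasets import load_dataset

# Assumed repo ID for the human-curated Aya Dataset.
aya = load_dataset("CohereForAI/aya_dataset", split="train")
example = aya[0]
print(example["inputs"], "->", example["targets"])  # assumed field names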
posted an update 12 months ago