Flores101: Large-Scale Multilingual Machine Translation
flores101_mm100_175M is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. It was released in this repository.
The model architecture and config are the same as the M2M100 implementation, but the tokenizer's language-code mappings must be rebuilt after loading so that they match this model's language codes.
Demo: https://huggingface.co/spaces/seyoungsong/flores101_mm100_175M
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"
model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")
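# FIX TOKENIZER! Rebuild the language mappings from the special tokens;
# special tokens with id > 5 are treated as language tokens.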
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}
# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
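The tokenizer fix in the snippet above can be factored into a small helper so that it is easier to reuse and to check in isolation. The following is a minimal sketch, not part of the released code: the function name `build_lang_maps` and the parameter `min_lang_id` are hypothetical, and the toy token list is made up for illustration. It assumes, as the snippet does, that language tokens look like `__xx__` and have ids greater than 5.

```python
from typing import Dict, List


def build_lang_maps(
    special_tokens: List[str],
    special_ids: List[int],
    min_lang_id: int = 6,  # assumption: mirrors the `i > 5` filter in the snippet above
) -> Dict[str, dict]:
    """Rebuild the four language mappings from special tokens and their ids."""
    # Keep only the special tokens whose id marks them as language tokens.
    lang_token_to_id = {
        tok: idx
        for tok, idx in zip(special_tokens, special_ids)
        if idx >= min_lang_id
    }
    # "__fr__" -> "fr": strip the surrounding underscores to get the language code.
    lang_code_to_token = {tok.strip("_"): tok for tok in lang_token_to_id}
    lang_code_to_id = {tok.strip("_"): idx for tok, idx in lang_token_to_id.items()}
    id_to_lang_token = {idx: tok for tok, idx in lang_token_to_id.items()}
    return {
        "lang_token_to_id": lang_token_to_id,
        "lang_code_to_token": lang_code_to_token,
        "lang_code_to_id": lang_code_to_id,
        "id_to_lang_token": id_to_lang_token,
    }


# Toy data (hypothetical): six non-language special tokens, then two language tokens.
toy_tokens = [f"<special_{i}>" for i in range(6)] + ["__af__", "__fr__"]
toy_ids = list(range(8))

maps = build_lang_maps(toy_tokens, toy_ids)
print(maps["lang_code_to_id"])   # {'af': 6, 'fr': 7}
print(maps["id_to_lang_token"])  # {6: '__af__', 7: '__fr__'}
```

With the real tokenizer, one would presumably pass `tokenizer.all_special_tokens` and `tokenizer.all_special_ids` and assign each returned mapping to the tokenizer attribute of the same name, which is what the four assignment lines in the snippet do directly.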
Languages covered
Language | lang code |
---|---|
Afrikaans | af |
Amharic | am |
Arabic | ar |
Assamese | as |
Asturian | ast |
Aymara | ay |
Azerbaijani | az |
Bashkir | ba |
Belarusian | be |
Bulgarian | bg |
Bengali | bn |
Breton | br |
Bosnian | bs |
Catalan | ca |
Cebuano | ceb |
Chokwe | cjk |
Czech | cs |
Welsh | cy |
Danish | da |
German | de |
Dyula | dyu |
Greek | el |
English | en |
Spanish | es |
Estonian | et |
Persian | fa |
Fulah | ff |
Finnish | fi |
French | fr |
Western Frisian | fy |
Irish | ga |
Scottish Gaelic | gd |
Galician | gl |
Gujarati | gu |
Hausa | ha |
Hebrew | he |
Hindi | hi |
Croatian | hr |
Haitian Creole | ht |
Hungarian | hu |
Armenian | hy |
Indonesian | id |
Igbo | ig |
Iloko | ilo |
Icelandic | is |
Italian | it |
Japanese | ja |
Javanese | jv |
Georgian | ka |
Kachin | kac |
Kamba | kam |
Kabuverdianu | kea |
Kongo | kg |
Kazakh | kk |
Central Khmer | km |
Kimbundu | kmb |
Northern Kurdish | kmr |
Kannada | kn |
Korean | ko |
Kurdish | ku |
Kyrgyz | ky |
Luxembourgish | lb |
Ganda | lg |
Lingala | ln |
Lao | lo |
Lithuanian | lt |
Luo | luo |
Latvian | lv |
Malagasy | mg |
Maori | mi |
Macedonian | mk |
Malayalam | ml |
Mongolian | mn |
Marathi | mr |
Malay | ms |
Maltese | mt |
Burmese | my |
Nepali | ne |
Dutch | nl |
Norwegian | no |
Northern Sotho | ns |
Nyanja | ny |
Occitan | oc |
Oromo | om |
Oriya | or |
Punjabi | pa |
Polish | pl |
Pashto | ps |
Portuguese | pt |
Quechua | qu |
Romanian | ro |
Russian | ru |
Sindhi | sd |
Shan | shn |
Sinhala | si |
Slovak | sk |
Slovenian | sl |
Shona | sn |
Somali | so |
Albanian | sq |
Serbian | sr |
Swati | ss |
Sundanese | su |
Swedish | sv |
Swahili | sw |
Tamil | ta |
Telugu | te |
Tajik | tg |
Thai | th |
Tigrinya | ti |
Tagalog | tl |
Tswana | tn |
Turkish | tr |
Ukrainian | uk |
Umbundu | umb |
Urdu | ur |
Uzbek | uz |
Vietnamese | vi |
Wolof | wo |
Xhosa | xh |
Yiddish | yi |
Yoruba | yo |
Chinese | zh |
Zulu | zu |
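Some codes in the table may not be the ones a user expects (for example, Northern Sotho is listed as `ns`), so it can help to validate a code before using it as a source or target language. The following is a minimal sketch under stated assumptions: `resolve_lang_id` is a hypothetical helper, and the mapping passed to it stands in for the rebuilt `tokenizer.lang_code_to_id`; the ids in the toy mapping are invented for illustration.

```python
import difflib
from typing import Dict


def resolve_lang_id(code: str, lang_code_to_id: Dict[str, int]) -> int:
    """Return the token id for a language code, or raise with suggestions."""
    if code in lang_code_to_id:
        return lang_code_to_id[code]
    # Suggest similar known codes to make typos easier to spot.
    suggestions = difflib.get_close_matches(
        code, list(lang_code_to_id), n=3, cutoff=0.5
    )
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unsupported language code {code!r}.{hint}")


# Toy mapping (hypothetical ids); in practice, use tokenizer.lang_code_to_id.
toy_lang_code_to_id = {"fr": 7, "no": 8, "ns": 9}

print(resolve_lang_id("fr", toy_lang_code_to_id))  # 7

try:
    resolve_lang_id("nso", toy_lang_code_to_id)
except ValueError as err:
    print(err)
```

Failing early with a readable message is a design choice here; the alternative is to let an unknown code surface later as a lookup error during tokenization or generation.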