---
license: apache-2.0
language:
- zh
- en
pipeline_tag: translation
tags:
- text2text-generation
---

# Zero-mt

[https://github.com/zape-aat/zero-mt](https://github.com/zape-aat/zero-mt)

## Metrics

|Testset|BLEU|chrF++|COMET-22|
|:-------------:|:---------------:|:---------:|:---------:|
|flores200-dev|41.37|65.13|0.867|
|flores200-devtest|63.06|53.57|0.868|
|newstest2019|14.96|36.16|0.843|
|wmt-22|?|?|0.775|
|wmt-23|22.65|41.22|0.777|

## How to use

Download the model with Git LFS:

```
git lfs install
git clone https://huggingface.co/aarontseng/zero-mt-zh_hant-en
```

Install the dependencies:

```
pip install ctranslate2
pip install sentencepiece
```

## Basic usage

```
import ctranslate2
import sentencepiece

# Load the source and target SentencePiece models.
src_model = sentencepiece.SentencePieceProcessor()
src_model.load("zero-mt-zh_hant-en/source.model")
tgt_model = sentencepiece.SentencePieceProcessor()
tgt_model.load("zero-mt-zh_hant-en/target.model")

translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")  # "cpu" or "cuda"

# Tokenize the input, translate it, then detokenize the best hypothesis.
encoded_line = src_model.encode_as_pieces("在世界上的許多地方,揮手都是一種表示「你好」的友善手勢」。")
results = translator.translate_batch([encoded_line], batch_type="tokens", max_batch_size=1024)
decoded_line = tgt_model.decode(results[0].hypotheses[0])

print(decoded_line)  # In many places around the world, waving is a friendly gesture of "hello".
```

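The example above hard-codes `device="cuda"`, which fails on machines without a GPU. A minimal sketch of a fallback follows. `pick_device` is a hypothetical helper, not part of this model or of CTranslate2; it relies only on `ctranslate2.get_cuda_device_count()` and returns `"cpu"` if the library cannot be imported at all.

```python
def pick_device(preferred="cuda"):
    """Return "cuda" only when CTranslate2 reports at least one GPU, else "cpu".

    Hypothetical helper for illustration; not part of the model repository.
    """
    try:
        import ctranslate2
    except ImportError:
        # Without the library we cannot query GPUs, so fall back to CPU.
        return "cpu"
    if preferred == "cuda" and ctranslate2.get_cuda_device_count() > 0:
        return "cuda"
    return "cpu"
```

The result can be passed straight to the constructor, for example `ctranslate2.Translator("zero-mt-zh_hant-en", device=pick_device())`.
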
## Batch translation

```
import ctranslate2
import sentencepiece

src_path = "dev.cmn_Hant"
tgt_path = "translated.txt"

src_model = sentencepiece.SentencePieceProcessor()
src_model.load("zero-mt-zh_hant-en/source.model")
tgt_model = sentencepiece.SentencePieceProcessor()
tgt_model.load("zero-mt-zh_hant-en/target.model")

translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")  # "cpu" or "cuda"

# Read the source sentences, dropping the trailing newline from each line.
with open(src_path, "r", encoding="utf-8") as src_file:
    src_lines = [line.rstrip("\n") for line in src_file]

encoded_lines = src_model.encode_as_pieces(src_lines)

results = translator.translate_batch(encoded_lines, batch_type="tokens", max_batch_size=1024)
translations = [translation.hypotheses[0] for translation in results]

decoded_lines = tgt_model.decode(translations)

# Write one translation per line; the context manager closes the file.
with open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:
    for line in decoded_lines:
        tgt_file.write(line)
        tgt_file.write("\n")
```
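
The batch example reads the whole source file into memory before translating. For larger files, one option is to process the input in fixed-size chunks. The sketch below is an assumption about how that could be arranged, not the author's method: `read_chunks` and its `chunk_size` parameter are hypothetical names, and the commented usage reuses `src_model`, `tgt_model`, and `translator` from the example above.

```python
from itertools import islice


def read_chunks(path, chunk_size=1000):
    """Yield lists of up to `chunk_size` lines from `path`, newlines stripped.

    Hypothetical helper for illustration; not part of the model repository.
    """
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = [line.rstrip("\n") for line in islice(f, chunk_size)]
            if not chunk:
                return
            yield chunk


# Possible usage with the objects defined in the batch example:
#
# with open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:
#     for src_lines in read_chunks(src_path):
#         encoded = src_model.encode_as_pieces(src_lines)
#         results = translator.translate_batch(
#             encoded, batch_type="tokens", max_batch_size=1024)
#         for result in results:
#             tgt_file.write(tgt_model.decode(result.hypotheses[0]) + "\n")
```

Only one chunk of source lines and translations is held in memory at a time, at the cost of one `translate_batch` call per chunk.
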