aarontseng/zero-mt-zh-en


aarontseng committed on Nov 14, 2024

Commit 4519f43 (verified) · 1 parent: a418fea

Update README.md

Files changed (1)

README.md +88 -3


@@ -1,3 +1,88 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- zh
+- en
+pipeline_tag: translation
+tags:
+- text2text-generation
+---
+# Zero-mt
+[https://github.com/zape-aat/zero-mt](https://github.com/zape-aat/zero-mt)
+## Metrics
+|Test set|BLEU|chrF++|COMET-22|
+|:-------------:|:---------------:|:---------:|:---------:|
+|flores200-dev|41.37|65.13|0.867|
+|flores200-devtest|63.06|53.57|0.868|
+|newstest2019|14.96|36.16|0.843|
+|wmt-22|?|?|0.775|
+|wmt-23|22.65|41.22|0.777|
+## How to use
+```
+git lfs install
+git clone https://huggingface.co/aarontseng/zero-mt-zh_hant-en
+```
+```
+pip install ctranslate2
+pip install sentencepiece
+```
+## Basic usage
+```
+import ctranslate2
+import sentencepiece
+src_model = sentencepiece.SentencePieceProcessor()
+src_model.load("zero-mt-zh_hant-en/source.model")
+tgt_model = sentencepiece.SentencePieceProcessor()
+tgt_model.load("zero-mt-zh_hant-en/target.model")
+translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")  # "cpu" or "cuda"
+encoded_line = src_model.encode_as_pieces("在世界上的許多地方，揮手都是一種表示「你好」的友善手勢。")
+results = translator.translate_batch([encoded_line], batch_type="tokens", max_batch_size=1024)
+decoded_line = tgt_model.decode(results[0].hypotheses[0])
+print(decoded_line) # In many places around the world, waving is a friendly gesture of "hello".
+```
+## Batch translation
+```
+import ctranslate2
+import sentencepiece
+src_path = "dev.cmn_Hant"
+tgt_path = "translated.txt"
+src_model = sentencepiece.SentencePieceProcessor()
+src_model.load("zero-mt-zh_hant-en/source.model")
+tgt_model = sentencepiece.SentencePieceProcessor()
+tgt_model.load("zero-mt-zh_hant-en/target.model")
+translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")  # "cpu" or "cuda"
+with open(src_path, "r", encoding="utf-8") as src_file:
+    src_lines = [line.rstrip("\n") for line in src_file]  # one sentence per line, newline removed
+encoded_lines = src_model.encode_as_pieces(src_lines)
+results = translator.translate_batch(encoded_lines, batch_type="tokens", max_batch_size=1024)
+translations = [translation.hypotheses[0] for translation in results]
+decoded_lines = tgt_model.decode(translations)
+with open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:  # file is closed on exit
+    for line in decoded_lines:
+        tgt_file.write(line)
+        tgt_file.write("\n")
+```
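
Note on the batch translation example: it reads the whole source file into memory before translating. For large files, one possible variation is to stream the input in fixed-size chunks. The sketch below is not part of the README or of CTranslate2. The names `iter_chunks`, `translate_file`, `translate_fn`, and `chunk_size` are hypothetical, and `translate_fn` stands in for the encode, `translate_batch`, and decode steps shown in the README.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size lines with the trailing newline removed."""
    iterator = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


def translate_file(
    src_path: str,
    tgt_path: str,
    translate_fn: Callable[[List[str]], List[str]],  # hypothetical hook: sentences in, translations out
    chunk_size: int = 1000,
) -> int:
    """Translate src_path chunk by chunk into tgt_path; return the number of lines written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src_file, \
            open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:
        for chunk in iter_chunks(src_file, chunk_size):
            outputs = translate_fn(chunk)
            if len(outputs) != len(chunk):
                # Keep source and target files aligned line by line.
                raise ValueError("translate_fn must return one output per input line")
            for line in outputs:
                tgt_file.write(line + "\n")
                written += 1
    return written
```

With the objects from the README, `translate_fn` could plausibly be a small function that calls `src_model.encode_as_pieces(lines)`, passes the result to `translator.translate_batch(..., batch_type="tokens", max_batch_size=1024)`, and returns `tgt_model.decode([r.hypotheses[0] for r in results])`. Empty input lines are passed through unchanged so that output line numbers match the source; how the model handles them is not covered here.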