---
license: apache-2.0
language:
- zh
- en
pipeline_tag: translation
tags:
- text2text-generation
---

# Zero-mt

[https://github.com/zape-aat/zero-mt](https://github.com/zape-aat/zero-mt)

## Metrics

|Test set|BLEU|chrF++|COMET-22|
|:-------------:|:---------------:|:---------:|:---------:|
|flores200-dev|41.37|65.13|0.867|
|flores200-devtest|63.06|53.57|0.868|
|newstest2019|14.96|36.16|0.843|
|wmt-22|?|?|0.775|
|wmt-23|22.65|41.22|0.777|

## How to use

Clone the model repository (Git LFS is needed to fetch the model files):

```
git lfs install
git clone https://huggingface.co/aarontseng/zero-mt-zh_hant-en
```

Install the runtime dependencies:

```
pip install ctranslate2
pip install sentencepiece
```

## Basic usage

```
import ctranslate2
import sentencepiece

# Load the SentencePiece tokenizers shipped with the model.
src_model = sentencepiece.SentencePieceProcessor()
src_model.load("zero-mt-zh_hant-en/source.model")
tgt_model = sentencepiece.SentencePieceProcessor()
tgt_model.load("zero-mt-zh_hant-en/target.model")

# Load the CTranslate2 model from the cloned directory.
translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")  # "cpu" or "cuda"

# Tokenize the Traditional Chinese source sentence into subword pieces.
encoded_line = src_model.encode_as_pieces("在世界上的許多地方,揮手都是一種表示「你好」的友善手勢。")

# translate_batch expects a list of tokenized sentences.
results = translator.translate_batch([encoded_line], batch_type="tokens", max_batch_size=1024)

# Detokenize the best hypothesis back into plain text.
decoded_line = tgt_model.decode(results[0].hypotheses[0])

print(decoded_line) # In many places around the world, waving is a friendly gesture of "hello".
```
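
The `device` argument above is hard-coded. As a minimal sketch, a small helper could pick the device at runtime instead. `pick_device` is a hypothetical name introduced here for illustration and is not part of zero-mt or CTranslate2; it relies on `ctranslate2.get_cuda_device_count()` and falls back to `"cpu"` when no CUDA device is visible.

```python
def pick_device() -> str:
    """Return "cuda" if CTranslate2 reports a usable GPU, otherwise "cpu".

    Hypothetical helper for illustration only; not part of zero-mt.
    """
    try:
        import ctranslate2
    except ImportError:
        # Assumption: default to "cpu" so this helper can still be imported
        # on machines where ctranslate2 is not installed yet.
        return "cpu"
    return "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"


device = pick_device()
print(device)
```

The result could then be passed as `ctranslate2.Translator("zero-mt-zh_hant-en", device=pick_device())`, so the same script runs on machines with and without a GPU.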

## Batch translation

```
import ctranslate2
import sentencepiece

src_path = "dev.cmn_Hant"
tgt_path = "translated.txt"

# Load the SentencePiece tokenizers shipped with the model.
src_model = sentencepiece.SentencePieceProcessor()
src_model.load("zero-mt-zh_hant-en/source.model")
tgt_model = sentencepiece.SentencePieceProcessor()
tgt_model.load("zero-mt-zh_hant-en/target.model")

translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")  # "cpu" or "cuda"

# Read the source sentences, one per line, dropping the trailing newlines.
with open(src_path, "r", encoding="utf-8") as src_file:
    src_lines = [line.rstrip("\n") for line in src_file]

# Tokenize every line, translate the whole list, and keep the best hypothesis per sentence.
encoded_lines = src_model.encode_as_pieces(src_lines)

results = translator.translate_batch(encoded_lines, batch_type="tokens", max_batch_size=1024)
translations = [translation.hypotheses[0] for translation in results]

decoded_lines = tgt_model.decode(translations)

# Write one translation per line; the with-block closes and flushes the file.
with open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:
    for line in decoded_lines:
        tgt_file.write(line + "\n")
```
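
The script above loads the whole source file into memory before translating. For very large files, one option is to stream the input in fixed-size chunks. The sketch below uses only the standard library; `read_chunks`, `translate_file`, and the `translate_fn` callable are hypothetical names introduced for illustration, and the chunk size of 1000 lines is an arbitrary assumption rather than a tuned value.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def read_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size lines with trailing newlines removed."""
    iterator = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


def translate_file(
    src_path: str,
    tgt_path: str,
    translate_fn: Callable[[List[str]], List[str]],
    chunk_size: int = 1000,  # assumption: arbitrary default, not a tuned value
) -> int:
    """Translate src_path into tgt_path chunk by chunk; return the line count.

    translate_fn maps a list of source sentences to a list of translations.
    With the objects from the batch example it could be built as (assumption):
        lambda lines: tgt_model.decode(
            [r.hypotheses[0] for r in translator.translate_batch(
                src_model.encode_as_pieces(lines),
                batch_type="tokens", max_batch_size=1024)])
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as src_file, \
            open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:
        for chunk in read_chunks(src_file, chunk_size):
            for line in translate_fn(chunk):
                tgt_file.write(line + "\n")
                written += 1
    return written
```

Passing the translation step in as a callable keeps the file handling independent of the model, so the chunking logic can be checked with a stand-in function before running it on a GPU.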