aarontseng/zero-mt-zh-en

	---
	license: apache-2.0
	language:
	- zh
	- en
	pipeline_tag: translation
	tags:
	- text2text-generation
	---

	# Zero-mt

	[https://github.com/zape-aat/zero-mt](https://github.com/zape-aat/zero-mt)

	## Metrics

	| Test set | BLEU | chrF++ | COMET-22 |
	|:-----------------:|:-----:|:------:|:--------:|
	| flores200-dev | 41.37 | 65.13 | 0.867 |
	| flores200-devtest | 63.06 | 53.57 | 0.868 |
	| newstest2019 | 14.96 | 36.16 | 0.843 |
	| wmt-22 | ? | ? | 0.775 |
	| wmt-23 | 22.65 | 41.22 | 0.777 |

	## How to use

	```
	git lfs install
	git clone https://huggingface.co/aarontseng/zero-mt-zh_hant-en
	```

	```
	pip install ctranslate2
	pip install sentencepiece
	```

	## Basic usage

	```
	import ctranslate2
	import sentencepiece

	# Load the SentencePiece models for the source and target languages.
	src_model = sentencepiece.SentencePieceProcessor()
	src_model.load("zero-mt-zh_hant-en/source.model")
	tgt_model = sentencepiece.SentencePieceProcessor()
	tgt_model.load("zero-mt-zh_hant-en/target.model")

	# Load the CTranslate2 model; set device to "cpu" or "cuda".
	translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")

	# Tokenize the source sentence into subword pieces.
	encoded_line = src_model.encode_as_pieces("在世界上的許多地方，揮手都是一種表示「你好」的友善手勢。")

	# Translate a batch containing the single tokenized sentence.
	results = translator.translate_batch([encoded_line], batch_type="tokens", max_batch_size=1024)

	# Detokenize the best hypothesis back into plain text.
	decoded_line = tgt_model.decode(results[0].hypotheses[0])

	print(decoded_line) # In many places around the world, waving is a friendly gesture of "hello".
	```
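
The `device` argument above is hard-coded to `"cuda"`, which fails on machines without a GPU. The sketch below shows one way to choose the device at runtime. `pick_device` is a hypothetical helper, not part of this repository; it relies on `ctranslate2.get_cuda_device_count()` and falls back to `"cpu"` when CTranslate2 is not installed or reports no GPU.

```python
def pick_device() -> str:
    """Return "cuda" if CTranslate2 reports at least one GPU, otherwise "cpu"."""
    try:
        import ctranslate2
    except ImportError:
        # CTranslate2 is not installed; "cpu" is only a safe default here.
        return "cpu"
    return "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed as `device=device` when constructing `ctranslate2.Translator`.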

	## Batch translation
	```
	import ctranslate2
	import sentencepiece

	src_path = "dev.cmn_Hant"
	tgt_path = "translated.txt"

	# Load the SentencePiece models for the source and target languages.
	src_model = sentencepiece.SentencePieceProcessor()
	src_model.load("zero-mt-zh_hant-en/source.model")
	tgt_model = sentencepiece.SentencePieceProcessor()
	tgt_model.load("zero-mt-zh_hant-en/target.model")

	# Load the CTranslate2 model; set device to "cpu" or "cuda".
	translator = ctranslate2.Translator("zero-mt-zh_hant-en", device="cuda")

	# Read one sentence per line, dropping the trailing newline.
	with open(src_path, "r", encoding="utf-8") as src_file:
	    src_lines = [line.rstrip("\n") for line in src_file]

	# Tokenize every line into subword pieces.
	encoded_lines = src_model.encode_as_pieces(src_lines)

	# Translate all lines and keep the best hypothesis for each.
	results = translator.translate_batch(encoded_lines, batch_type="tokens", max_batch_size=1024)
	translations = [translation.hypotheses[0] for translation in results]

	# Detokenize the hypotheses back into plain text.
	decoded_lines = tgt_model.decode(translations)

	# Write one translation per line.
	with open(tgt_path, "w", encoding="utf-8", newline="") as tgt_file:
	    for line in decoded_lines:
	        tgt_file.write(line)
	        tgt_file.write("\n")
	```
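
The batch script above reads the whole source file into memory before translating. For very large files, a chunked variant may be preferable. The following is a minimal sketch under assumptions: `iter_chunks` and `translate_stream` are hypothetical helpers, and `translate_fn` stands in for the encode, `translate_batch`, decode steps shown above. A trivial uppercase function replaces the real model so the example runs on its own.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size lines with trailing newlines removed."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, chunk_size)]
        if not chunk:
            return
        yield chunk


def translate_stream(
    lines: Iterable[str],
    translate_fn: Callable[[List[str]], List[str]],
    chunk_size: int = 1000,
) -> Iterator[str]:
    """Translate lines chunk by chunk, yielding one output string per input line."""
    for chunk in iter_chunks(lines, chunk_size):
        yield from translate_fn(chunk)


# Stand-in for the real model: uppercases each line instead of translating it.
demo = list(
    translate_stream(
        ["a\n", "b\n", "c\n"],
        lambda batch: [s.upper() for s in batch],
        chunk_size=2,
    )
)
print(demo)
```

Because an open file object is itself an iterable of lines, it can be passed directly as `lines`, so only one chunk is held in memory at a time.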