CantoneseLLM
This model is further pre-trained from 01-ai/Yi-6B on 800M tokens of Cantonese text compiled from various sources, including translated zh-yue Wikipedia, translated RTHK news (datasets/jed351/rthk_news), Cantonese-filtered CC100, and Cantonese textbooks generated by Gemini Pro.
This is a preview version for experimental use only; we will fine-tune it on downstream tasks and evaluate its performance.
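Most of the source corpora are publicly available. As a rough illustration (not part of the training code), the referenced RTHK news dataset can be inspected with the datasets library, assuming it loads with its default configuration:

```python
from datasets import load_dataset

# Sketch only: peek at the original (untranslated) RTHK news corpus referenced above.
# Assumes the jed351/rthk_news dataset on the Hugging Face Hub loads with its default config.
rthk = load_dataset("jed351/rthk_news")
print(rthk)  # shows the available splits and column names
```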
Open LLM Leaderboard Evaluation Results
Detailed results can be found on the Open LLM Leaderboard.
| Metric | Value |
|---|---|
| Avg. | 56.93 |
| ARC (25-shot) | 55.63 |
| HellaSwag (10-shot) | 75.8 |
| MMLU (5-shot) | 63.07 |
| TruthfulQA (0-shot) | 42.26 |
| Winogrande (5-shot) | 74.11 |
| GSM8K (5-shot) | 30.71 |
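Numbers of this kind can be approximated locally with EleutherAI's lm-evaluation-harness. The sketch below assumes lm-eval v0.4+ is installed; the leaderboard pins its own harness version and prompts, so local scores will not match exactly.

```python
import lm_eval  # EleutherAI lm-evaluation-harness (v0.4+), `pip install lm-eval`

# Rough sketch: run one leaderboard-style task (25-shot ARC) against this model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=hon9kon9ize/CantoneseLLM-6B-preview202402,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```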
Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("hon9kon9ize/CantoneseLLM-6B-preview202402")
model = AutoModelForCausalLM.from_pretrained(
    "hon9kon9ize/CantoneseLLM-6B-preview202402", torch_dtype=torch.bfloat16
).to("cuda:0")

# Cantonese news-style prompt (Hong Kong's return to normal after the pandemic) for the model to continue
prompt = "歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda:0")
# generation settings; the max_length and temperature values here are illustrative
output = model.generate(input_ids, max_length=200, num_return_sequences=1, repetition_penalty=1.1, do_sample=True, temperature=0.9, top_k=50, top_p=0.95)
output = tokenizer.decode(output[0], skip_special_tokens=True)
# output: 歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港旅遊業可謂「起死回生」。...
```
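Alternatively, generation can go through the transformers text-generation pipeline; this is a minimal sketch with illustrative sampling settings rather than a recommended recipe.

```python
from transformers import pipeline

# Minimal sketch: same model via the text-generation pipeline; sampling settings are illustrative.
generator = pipeline(
    "text-generation",
    model="hon9kon9ize/CantoneseLLM-6B-preview202402",
    device_map="auto",
)
result = generator(
    "歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(result[0]["generated_text"])
```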
Limitations and Bias
The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other varieties of Chinese. Although it is trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, it may not handle informal or dialectal Cantonese well, and it may contain bias and misinformation. Please use it with caution.
We also found that the model is not well trained on up-to-date Hong Kong knowledge, likely because the corpus is not large enough to override the original model's knowledge. We will continue to improve the model and the corpus in the future.