CantoneseLLM

This model is a continually pre-trained model based on 01-ai/Yi-6B, trained on 800M tokens of Cantonese text compiled from various sources, including translated zh-yue Wikipedia, the translated RTHK news dataset (jed351/rthk_news), Cantonese-filtered CC100, and Cantonese textbooks generated by Gemini Pro.
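
For reference, continued pre-training of this kind can be sketched as a standard causal-LM run over the compiled corpus using the Hugging Face Trainer. This is only a minimal sketch: the corpus file name, sequence length, and all hyperparameters below are illustrative assumptions, not the actual training configuration.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Minimal sketch of causal-LM continued pre-training from the Yi-6B base;
# "cantonese_corpus.jsonl" and all hyperparameters are placeholders.
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-6B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B", torch_dtype=torch.bfloat16)

dataset = load_dataset("json", data_files="cantonese_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cantonesellm-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=100,
        save_strategy="epoch",
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM objective
)
trainer.train()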

This is a preview version intended for experimental use only. We will fine-tune it on downstream tasks and evaluate its performance.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 56.93
ARC (25-shot) 55.63
HellaSwag (10-shot) 75.8
MMLU (5-shot) 63.07
TruthfulQA (0-shot) 42.26
Winogrande (5-shot) 74.11
GSM8K (5-shot) 30.71

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("hon9kon9ize/CantoneseLLM-6B-preview202402")
model = AutoModelForCausalLM.from_pretrained(
    "hon9kon9ize/CantoneseLLM-6B-preview202402", torch_dtype=torch.bfloat16
).to('cuda:0')

# Cantonese news-style prompt: "After three years of the pandemic, the long-awaited full
# return to normal has finally arrived; as the various anti-epidemic measures were
# gradually relaxed and then lifted, Hong Kong ..."
prompt = "歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cuda:0')
# max_length and temperature were left undefined in the original snippet; 200 and 0.9 are example values
output = model.generate(input_ids, max_length=200, num_return_sequences=1, repetition_penalty=1.1, do_sample=True, temperature=0.9, top_k=50, top_p=0.95)
output = tokenizer.decode(output[0], skip_special_tokens=True)

# output: 歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港旅遊業可謂「起死回生」。
# 不過，旅遊業嘅復蘇之路並唔順利，香港遊客數量仍然遠低於疫前水平，而海外旅客亦只係恢復到疫情前約一半。有業界人士認為，當局需要進一步放寬入境檢疫措施，吸引更多國際旅客來港，令旅遊業得以真正復甦。
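
The same checkpoint can also be driven through the transformers text-generation pipeline, which simply wraps the tokenization and generate call shown above; the sampling parameters here are illustrative, not recommended settings.

import torch
from transformers import pipeline

# Convenience wrapper: pipeline handles tokenization, generation and decoding.
generator = pipeline(
    "text-generation",
    model="hon9kon9ize/CantoneseLLM-6B-preview202402",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator(
    "歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港",
    max_new_tokens=200,
    do_sample=True,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(result[0]["generated_text"])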

Limitations and Bias

The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other Chinese languages. Although it is trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, it may not handle informal or dialectal Cantonese well, and it may contain bias and misinformation, so please use it with caution.

We also found that the model is not well trained on up-to-date Hong Kong knowledge; this may be because the corpus is not large enough to overwrite knowledge in the original base model. We will continue to improve the model and corpus in the future.
