---
library_name: transformers
license: llama3.1
language:
- ko
- vi
- id
- km
- th
metrics:
- bleu
- rouge
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Model Card for llama-3.1-Asian-Bllossom-8B-Translator

This is a multilingual translation model fine-tuned from the Llama 3.1 8B Instruct base model (`meta-llama/Llama-3.1-8B-Instruct`). It translates in both directions between any two of the following Asian languages:

- Korean
- Vietnamese
- Indonesian
- Cambodian (Khmer)
- Thai

## Acknowledgements
AICA  <img src="https://aica-gj.kr/images/logo.png" width="20%" height="20%">

## Model Details
The model is designed for translating short text segments between any pair of the supported languages.

Supported language pairs:

- Korean ↔ Vietnamese
- Korean ↔ Indonesian
- Korean ↔ Cambodian
- Korean ↔ Thai
- Vietnamese ↔ Indonesian
- Vietnamese ↔ Cambodian
- Vietnamese ↔ Thai
- Indonesian ↔ Cambodian
- Indonesian ↔ Thai
- Cambodian ↔ Thai

### Model Description

This model is optimized for translation among Korean and four Southeast Asian languages, with the aim of supporting communication between these language communities.

The training data comprises 20M examples (1M for each of the 20 translation directions) and covers common expressions and basic conversations across these languages.
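
The count of 20 directions follows from taking ordered pairs of the five supported languages (5 × 4 = 20), while the 10 bidirectional pairs listed above are the unordered pairs. A minimal sketch of that arithmetic; the variable names are illustrative and not part of the model's code:

```python
from itertools import combinations, permutations

LANGUAGES = ["Korean", "Vietnamese", "Indonesian", "Cambodian", "Thai"]

# Ordered pairs: one entry per translation direction (source -> target).
directions = list(permutations(LANGUAGES, 2))

# Unordered pairs: one entry per bidirectional language pair.
pairs = list(combinations(LANGUAGES, 2))

# Per-direction figure as stated in this model card.
EXAMPLES_PER_DIRECTION = 1_000_000
total_examples = len(directions) * EXAMPLES_PER_DIRECTION

print(len(pairs), len(directions), total_examples)
```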

### Model Architecture

Base Model: meta-llama/Llama-3.1-8B-Instruct


## Bias, Risks, and Limitations

- Performance is limited to short sentences and phrases
- May not handle complex or lengthy text effectively
- Translation quality may vary depending on language pair and content complexity
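
Given the first limitation above, one possible workaround is to split longer input into short segments and translate each one separately. The helper below is only a sketch under stated assumptions: `split_into_segments` and the `max_chars=200` default are hypothetical, and splitting on `.`, `!`, `?` and the Khmer `។` will not find sentence boundaries in text that lacks such punctuation (as is common in Thai).

```python
import re


def split_into_segments(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence-ending punctuation, then pack whole
    sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    # Break after ., !, ? or the Khmer full stop, and at line breaks.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?។])\s+|\n+", text)
        if s and s.strip()
    ]

    segments: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Each returned segment could then be passed as `message` to the translation example later in this card. Whether this improves output quality for a given language pair has not been evaluated here.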

## Evaluation results

| Source Language | Target Language | BLEU Score | ROUGE-1 | ROUGE-L |
|----------------|-----------------|------------|---------|---------|
| Korean         | Vietnamese      | 56.70      | 81.64   | 76.66   |
| Korean         | Cambodian       | 71.69      | 89.26   | 88.20   |
| Korean         | Indonesian      | 58.32      | 80.39   | 76.63   |
| Korean         | Thai            | 63.26      | 78.88   | 72.29   |
| Vietnamese     | Korean          | 49.01      | 75.57   | 72.74   |
| Vietnamese     | Cambodian       | 78.26      | 90.74   | 90.32   |
| Vietnamese     | Indonesian      | 65.96      | 83.08   | 81.46   |
| Vietnamese     | Thai            | 65.93      | 81.09   | 76.57   |
| Cambodian      | Korean          | 49.10      | 72.67   | 69.75   |
| Cambodian      | Vietnamese      | 63.42      | 81.56   | 79.09   |
| Cambodian      | Indonesian      | 61.41      | 79.67   | 77.75   |
| Cambodian      | Thai            | 70.91      | 81.85   | 77.66   |
| Indonesian     | Korean          | 53.61      | 77.14   | 74.29   |
| Indonesian     | Vietnamese      | 68.21      | 85.41   | 83.10   |
| Indonesian     | Cambodian       | 78.84      | 90.81   | 90.35   |
| Indonesian     | Thai            | 67.12      | 81.54   | 77.19   |
| Thai           | Korean          | 45.59      | 72.48   | 69.46   |
| Thai           | Vietnamese      | 61.55      | 81.01   | 78.24   |
| Thai           | Cambodian       | 78.52      | 91.47   | 91.16   |
| Thai           | Indonesian      | 58.99      | 78.56   | 76.40   |
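
To see how scores vary by target language, the BLEU column can be averaged per target. The snippet below only aggregates the values already listed in the table; it is a reading aid, not part of the original evaluation, and BLEU is not strictly comparable across target languages because tokenization differs.

```python
from collections import defaultdict
from statistics import mean

# (source, target, BLEU) rows copied from the table above.
BLEU = [
    ("Korean", "Vietnamese", 56.70), ("Korean", "Cambodian", 71.69),
    ("Korean", "Indonesian", 58.32), ("Korean", "Thai", 63.26),
    ("Vietnamese", "Korean", 49.01), ("Vietnamese", "Cambodian", 78.26),
    ("Vietnamese", "Indonesian", 65.96), ("Vietnamese", "Thai", 65.93),
    ("Cambodian", "Korean", 49.10), ("Cambodian", "Vietnamese", 63.42),
    ("Cambodian", "Indonesian", 61.41), ("Cambodian", "Thai", 70.91),
    ("Indonesian", "Korean", 53.61), ("Indonesian", "Vietnamese", 68.21),
    ("Indonesian", "Cambodian", 78.84), ("Indonesian", "Thai", 67.12),
    ("Thai", "Korean", 45.59), ("Thai", "Vietnamese", 61.55),
    ("Thai", "Cambodian", 78.52), ("Thai", "Indonesian", 58.99),
]

# Group BLEU scores by target language.
by_target = defaultdict(list)
for _source, target, score in BLEU:
    by_target[target].append(score)

# Mean BLEU per target language, printed from lowest to highest.
mean_bleu = {target: mean(scores) for target, scores in by_target.items()}
for target, score in sorted(mean_bleu.items(), key=lambda kv: kv[1]):
    print(f"{target:<11} {score:.2f}")
```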

## Example

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator",
    torch_dtype="auto",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator",
)

# Korean for: "Hello? This is an Asian language translation model."
input_text = "안녕하세요? 아시아 언어 번역 모델 입니다."

def get_input_ids(source_lang, target_lang, message):
    assert source_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    assert target_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    
    input_ids = tokenizer.apply_chat_template(
        conversation=[
            {"role": "system", "content": f"You are a useful translation AI. Please translate the sentence given in {source_lang} into {target_lang}."},
            {"role": "user", "content": message},
        ],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    )
    return input_ids

input_ids = get_input_ids(
    source_lang="Korean",
    target_lang="Vietnamese",
    message=input_text,
)

output = model.generate(
    input_ids.to(model.device),
    max_new_tokens=128,
)

print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
```
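
The `assert` checks in the example above are skipped when Python runs with `-O`, so an application may prefer explicit exceptions. The following is a minimal sketch of such a variant; `build_messages` and `SUPPORTED_LANGUAGES` are hypothetical names, and the check that source and target differ is an added assumption rather than a documented requirement of the model.

```python
SUPPORTED_LANGUAGES = ("Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian")


def build_messages(source_lang: str, target_lang: str, message: str) -> list[dict[str, str]]:
    """Validate the language names and build the chat messages
    using the same system prompt as the example above."""
    for name, lang in (("source_lang", source_lang), ("target_lang", target_lang)):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"{name} must be one of {SUPPORTED_LANGUAGES}, got {lang!r}")
    # Assumption: translating a language into itself is treated as a caller error.
    if source_lang == target_lang:
        raise ValueError("source_lang and target_lang must differ")

    return [
        {
            "role": "system",
            "content": (
                "You are a useful translation AI. Please translate the sentence "
                f"given in {source_lang} into {target_lang}."
            ),
        },
        {"role": "user", "content": message},
    ]


messages = build_messages("Korean", "Vietnamese", "안녕하세요?")
```

The returned list can be passed as `conversation=` to `tokenizer.apply_chat_template` in place of the inline list used in the example.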


## Contributor
- 원인호 ([email protected])
- 김민준 ([email protected])