mrapacz committed on
Commit f4be03c · verified · 1 Parent(s): 10e06e0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +64 -2
README.md CHANGED
@@ -12,14 +12,16 @@ datasets:
  ---
  # Model Card for Ancient Greek to Polish Interlinear Translation Model

- This model performs interlinear translation from Ancient Greek to {Language}, maintaining word-level alignment between source and target texts.

  ## Model Details

  ### Model Description

  - **Developed By:** Maciej Rapacz, AGH University of Kraków
- - **Model Type:** Neural machine translation (T5-based)
  - **Base Model:** mT5-base
  - **Tokenizer:** mT5
  - **Language(s):** Ancient Greek (source) → Polish (target)
@@ -37,3 +39,63 @@ This model performs interlinear translation from Ancient Greek to {Language}, ma

  - **Repository:** https://github.com/mrapacz/loreslm-interlinear-translation
  - **Paper:** https://aclanthology.org/2025.loreslm-1.11/
  ---
  # Model Card for Ancient Greek to Polish Interlinear Translation Model

+ This model performs interlinear translation from Ancient Greek to Polish, maintaining word-level alignment between source and target texts.
+
+ The source code used to train this model and the other models in this project is available in the [GitHub repository](https://github.com/mrapacz/loreslm-interlinear-translation).

  ## Model Details

  ### Model Description

  - **Developed By:** Maciej Rapacz, AGH University of Kraków
+ - **Model Type:** MorphT5AutoForConditionalGeneration
  - **Base Model:** mT5-base
  - **Tokenizer:** mT5
  - **Language(s):** Ancient Greek (source) → Polish (target)

  - **Repository:** https://github.com/mrapacz/loreslm-interlinear-translation
  - **Paper:** https://aclanthology.org/2025.loreslm-1.11/
+
+ ## Usage Example
+
+ > **Note**: This model uses a modified T5-family architecture with dedicated embedding layers for encoding morphological information. To load it, install the [morpht5](https://github.com/mrapacz/loreslm-interlinear-translation/blob/master/morpht5/README.md) package:
+ > ```bash
+ > pip install morpht5
+ > ```
+
+ ```python
+ >>> from morpht5 import MorphT5AutoForConditionalGeneration, MorphT5Tokenizer
+ >>> # Source words and their morphological tags, one tag per word
+ >>> text = ['Λέγει', 'αὐτῷ', 'ὁ', 'Ἰησοῦς', 'Ἔγειρε', 'ἆρον', 'τὸν', 'κράβαττόν', 'σου', 'καὶ', 'περιπάτει']
+ >>> tags = ['V-PIA-3S', 'PPro-DM3S', 'Art-NMS', 'N-NMS', 'V-PMA-2S', 'V-AMA-2S', 'Art-AMS', 'N-AMS', 'PPro-G2S', 'Conj', 'V-PMA-2S']
+ >>> tokenizer = MorphT5Tokenizer.from_pretrained("mrapacz/interlinear-pl-mt5-base-emb-auto-diacritics-bh")
+ >>> inputs = tokenizer(
+ ...     text=text,
+ ...     morph_tags=tags,
+ ...     return_tensors="pt"
+ ... )
+ >>> model = MorphT5AutoForConditionalGeneration.from_pretrained("mrapacz/interlinear-pl-mt5-base-emb-auto-diacritics-bh")
+ >>> outputs = model.generate(
+ ...     **inputs,
+ ...     max_new_tokens=100,
+ ...     early_stopping=True,
+ ... )
+ >>> # Keep the block separators, then render them as " | " between glosses
+ >>> decoded = tokenizer.decode(outputs[0], skip_special_tokens=True, keep_block_separator=True)
+ >>> decoded = decoded.replace(tokenizer.target_block_separator_token, " | ")
+ >>> decoded
+ 'Mówi | mu | - | Jezus | wstawaj | weź | - | matę | swoją | i | chodź'
+ ```
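One possible follow-up to the usage example above is to pair each Greek word with its Polish gloss, which is the point of an interlinear layout. The sketch below is not part of the README or of the `morpht5` package: `align_glosses` is a hypothetical helper, and it assumes the decoded string uses `" | "` as the block separator, as in the example output. It runs on plain strings, so it does not need the model to be loaded.

```python
def align_glosses(source_words, decoded, separator=" | "):
    """Pair each source word with its gloss (hypothetical helper, not in morpht5)."""
    # Split on the block separator; a single gloss may contain several words.
    glosses = [gloss.strip() for gloss in decoded.split(separator)]
    if len(glosses) != len(source_words):
        raise ValueError(
            f"expected {len(source_words)} glosses, got {len(glosses)}"
        )
    return list(zip(source_words, glosses))


# Inputs copied from the usage example above.
text = ['Λέγει', 'αὐτῷ', 'ὁ', 'Ἰησοῦς', 'Ἔγειρε', 'ἆρον', 'τὸν', 'κράβαττόν', 'σου', 'καὶ', 'περιπάτει']
decoded = 'Mówi | mu | - | Jezus | wstawaj | weź | - | matę | swoją | i | chodź'

pairs = align_glosses(text, decoded)
for greek, polish in pairs:
    # One source word per line, with its gloss after a tab.
    print(f"{greek}\t{polish}")
```

Raising on a length mismatch is a design choice for this sketch: a generated sequence with too few or too many blocks would otherwise be silently truncated by `zip`.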
+
+ ## Citation
+
+ If you use this model, please cite the following paper:
+
+ ```
+ @inproceedings{rapacz-smywinski-pohl-2025-low,
+     title = "Low-Resource Interlinear Translation: Morphology-Enhanced Neural Models for {A}ncient {G}reek",
+     author = "Rapacz, Maciej and
+       Smywi{\'n}ski-Pohl, Aleksander",
+     editor = "Hettiarachchi, Hansi and
+       Ranasinghe, Tharindu and
+       Rayson, Paul and
+       Mitkov, Ruslan and
+       Gaber, Mohamed and
+       Premasiri, Damith and
+       Tan, Fiona Anting and
+       Uyangodage, Lasitha",
+     booktitle = "Proceedings of the First Workshop on Language Models for Low-Resource Languages",
+     month = jan,
+     year = "2025",
+     address = "Abu Dhabi, United Arab Emirates",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2025.loreslm-1.11/",
+     pages = "145--165",
+     abstract = "Contemporary machine translation systems prioritize fluent, natural-sounding output with flexible word ordering. In contrast, interlinear translation maintains the source text's syntactic structure by aligning target language words directly beneath their source counterparts. Despite its importance in classical scholarship, automated approaches to interlinear translation remain understudied. We evaluated neural interlinear translation from Ancient Greek to English and Polish using four transformer-based models: two Ancient Greek-specialized (GreTa and PhilTa) and two general-purpose multilingual models (mT5-base and mT5-large). Our approach introduces novel morphological embedding layers and evaluates text preprocessing and tag set selection across 144 experimental configurations using a word-aligned parallel corpus of the Greek New Testament. Results show that morphological features through dedicated embedding layers significantly enhance translation quality, improving BLEU scores by 35{\%} (44.67 {\textrightarrow} 60.40) for English and 38{\%} (42.92 {\textrightarrow} 59.33) for Polish compared to baseline models. PhilTa achieves state-of-the-art performance for English, while mT5-large does so for Polish. Notably, PhilTa maintains stable performance using only 10{\%} of training data. Our findings challenge the assumption that modern neural architectures cannot benefit from explicit morphological annotations. While preprocessing strategies and tag set selection show minimal impact, the substantial gains from morphological embeddings demonstrate their value in low-resource scenarios."
+ }
+ ```