---
license: apache-2.0
base_model: google/byt5-small
tags:
- generated_from_trainer
language: de
model-index:
- name: ybracke/transnormer-19c-beta-v02
  results:
  - task:
      name: Historic Text Normalization
      type: translation
    dataset:
      name: DTA EvalCorpus
      type: text
      split: test
    metrics:
    - name: Word Accuracy
      type: accuracy
      value: 0.98878
    - name: Word Accuracy (case insensitive)
      type: accuracy
      value: 0.99343
pipeline_tag: text2text-generation
library_name: transformers
---



# Transnormer 19th century (beta v02)

This model generates a normalized version of historical German input text from the 19th (and late 18th) century.
The base model [google/byt5-small](https://huggingface.co/google/byt5-small) was fine-tuned on a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).

## Model description

### Demo Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
inputs = tokenizer(sentence, return_tensors="pt",)
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# >>> ['Die Königin saß auf des Palastes mittlerer Tribüne.']
```

Or use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html) like this:

```python
from transformers import pipeline
transnormer = pipeline('text2text-generation', model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
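
Several sentences can also be normalized in one batched call. The following is a minimal sketch using the same model and generation settings as above; the second example sentence is invented purely for illustration:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")

# The second sentence is a made-up example; real documents would be split into sentences first.
sentences = [
    "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune.",
    "Auch dieſer Satz wird in die heutige Orthographie überführt.",
]
# Pad the batch so that inputs of different lengths can be processed together.
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```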

## Training and evaluation data

The model was fine-tuned and evaluated on splits derived from the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/), a parallel corpus containing 121 texts from the Deutsches Textarchiv (German Text Archive). The corpus was originally created by aligning historic prints in their original spelling with an edition in contemporary orthography.

The original corpus creators applied some corrections to the modern versions (see Jurish et al. 2013). For our use of the corpus, we further improved the quality of the normalized part by enforcing spellings that conform to the German orthography reform (post-1996) and by applying selected [LanguageTool](https://pypi.org/project/language-tool-python/) rules and custom replacements to remove some errors and inconsistencies. We plan to publish the corpus as a dataset on the Huggingface Hub in the future.
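
The exact rule selection and replacement tables used for this cleanup are not reproduced in this card. As a rough sketch only, selected LanguageTool rules and custom replacements could be applied with the `language-tool-python` package along the following lines; the rule IDs and replacement pairs below are hypothetical examples, not the ones used for the corpus:

```python
import language_tool_python

tool = language_tool_python.LanguageTool("de-DE")

# Hypothetical rule IDs and replacement pairs for illustration only.
SELECTED_RULES = {"OLD_SPELLING", "DE_AGREEMENT"}
CUSTOM_REPLACEMENTS = {"Thür": "Tür", "giebt": "gibt"}

def clean_normalized_text(text: str) -> str:
    # Keep only the matches produced by the selected rules and apply their corrections.
    matches = [m for m in tool.check(text) if m.ruleId in SELECTED_RULES]
    text = language_tool_python.utils.correct(text, matches)
    # Apply custom string replacements to remove known inconsistencies.
    for old, new in CUSTOM_REPLACEMENTS.items():
        text = text.replace(old, new)
    return text
```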

The training set contains 96 documents with 4.6M source tokens; the dev and test sets contain 13 documents (405K tokens) and 12 documents (381K tokens), respectively.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of a corresponding `Seq2SeqTrainingArguments` configuration follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10 (published model: 8 epochs)
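
As a sketch only, these values roughly correspond to a `Seq2SeqTrainingArguments` configuration like the one below; the output directory and the evaluation/saving strategy are assumptions not stated in this card:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above; output_dir and the
# evaluation/saving strategy are illustrative assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="transnormer-19c-beta-v02",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10,  # the published checkpoint corresponds to epoch 8
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)
```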

### Framework versions

- Transformers 4.31.0
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.13.3