---
license: apache-2.0
base_model: google/byt5-small
tags:
- generated_from_trainer
language: de
model-index:
- name: ybracke/transnormer-19c-beta-v02
results:
- task:
name: Historic Text Normalization
type: translation
dataset:
name: DTA reviEvalCorpus v1
url: ybracke/dta-reviEvalCorpus-v1
type: text
split: test
metrics:
- name: Word Accuracy
type: accuracy
value: 0.98878
- name: Word Accuracy (case insensitive)
type: accuracy
value: 0.99343
pipeline_tag: text2text-generation
library_name: transformers
datasets:
- ybracke/dta-reviEvalCorpus-v1
---
# Transnormer 19th century (beta v02)
This model can normalize historical German spellings from the 19th century.
## Model description
`Transnormer` is a byte-level sequence-to-sequence model for normalizing historical German text.
This model was trained on text from the late 18th and 19th centuries
by fine-tuning [google/byt5-small](https://huggingface.co/google/byt5-small) on the [DTA reviEvalCorpus](https://huggingface.co/datasets/ybracke/dta-reviEvalCorpus-v1), a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).
## Uses
This model is intended for users who work with historical text and need a normalized version, i.e. a version that comes closer to modern spelling.
Historical text typically contains spelling variations and extinct spellings that differ from contemporary text.
This can be a drawback when working with historical text: the variation can impair the performance of NLP tools (POS tagging, etc.) that were trained on contemporary language,
and full-text search becomes more tedious because the same search term may appear in numerous spellings.
Historical text normalization, as offered by this model, can mitigate these problems to some extent.
Note that this model is intended for the normalization of *historical German text from a specific time period*.
It is *not intended* for other types of text that may require normalization (e.g. computer-mediated communication), languages other than German, or other time periods.
There may be other models available for that on the [Hub](https://huggingface.co/models).
The model can be further fine-tuned to adapt or improve it, e.g. as described in the [Transformers](https://huggingface.co/docs/transformers/training) tutorials.
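The following is a minimal sketch of such a fine-tuning run, assuming a JSONL file with hypothetical `orig`/`norm` columns; it is not the original training script (see [Training hyperparameters](#training-hyperparameters) below for the settings that were actually used).
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_name = "ybracke/transnormer-19c-beta-v02"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical parallel data with "orig" (historical) and "norm" (normalized) columns
dataset = load_dataset("json", data_files={"train": "my_train.jsonl"})

def preprocess(examples):
    # Tokenize historical input and normalized target at the byte level
    model_inputs = tokenizer(examples["orig"], max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["norm"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

training_args = Seq2SeqTrainingArguments(
    output_dir="transnormer-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```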
### Demo Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# >>> ['Die Königin saß auf des Palastes mittlerer Tribüne.']
```
Or use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html) like this:
```python
from transformers import pipeline
transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
### Recommendations
The model was trained with a maximum input length of 512 bytes (roughly 70 words).
Inference on longer sequences is generally possible, but the output quality may degrade.
Shorter sequences also make inference faster and less computationally expensive.
Consider splitting long sequences and processing the parts separately, e.g. at sentence boundaries, as sketched below.
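A minimal sketch of this approach, using a naive regex-based sentence split (a proper sentence splitter would be more robust) and normalizing each sentence separately:
```python
import re
from transformers import pipeline

transnormer = pipeline(model="ybracke/transnormer-19c-beta-v02")

long_text = (
    "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune. "
    "Sie ſah hinab auf die Straße."
)

# Naive split on sentence-final punctuation; replace with a proper
# sentence splitter (e.g. spaCy, NLTK) for real documents.
sentences = re.split(r"(?<=[.!?])\s+", long_text.strip())

normalized = []
for sentence in sentences:
    output = transnormer(sentence, num_beams=4, max_length=128)
    normalized.append(output[0]["generated_text"])

print(" ".join(normalized))
```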
## Training and evaluation data
The model was fine-tuned and evaluated on the [DTA reviEvalCorpus](https://huggingface.co/datasets/ybracke/dta-reviEvalCorpus-v1).
*DTA reviEvalCorpus* is a parallel corpus of German texts from the period between 1780 and 1899 that aligns sentences in historical spelling with their normalizations.
The training set contains 96 documents with 4.6M source tokens; the dev and test sets contain 13 documents (405K tokens) and 12 documents (381K tokens), respectively.
For more information, see the [dataset card](https://huggingface.co/datasets/ybracke/dta-reviEvalCorpus-v1) of the corpus.
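A quick way to inspect the corpus with the `datasets` library (the split names below are assumed from the description above; see the dataset card for the exact split and column names):
```python
from datasets import load_dataset

dataset = load_dataset("ybracke/dta-reviEvalCorpus-v1")
print(dataset)             # available splits and their sizes
print(dataset["test"][0])  # one aligned sentence pair from the test set
```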
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10 (published model: 8 epochs)
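For illustration, these settings roughly correspond to the following `Seq2SeqTrainingArguments`; this is a reconstruction from the list above, not the original training configuration:
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="transnormer-19c",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10,  # the published checkpoint was taken after 8 epochs
)
```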
### Framework versions
- Transformers 4.31.0
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.13.3 |