metadata
license: apache-2.0
base_model: google/byt5-small
language: de
model-index:
- name: ybracke/transnormer-19c-beta-v02
  results:
  - task:
      name: Historic Text Normalization
      type: translation
    dataset:
      name: DTA EvalCorpus
      type: N/A
      split: test
    metrics:
    - name: Word Accuracy
      type: accuracy
      value: 0.98878
    - name: Word Accuracy (case insensitive)
      type: accuracy
      value: 0.99343
Transnormer 19th century (beta v02)
This model normalizes historical German spelling variants to their modern spelling. We fine-tuned google/byt5-small on a modified version of the DTA EvalCorpus, which covers texts from 1780 to 1901.
Model description
Demo Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the byte-level tokenizer and the fine-tuned seq2seq model
tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")
# Historical spelling to be normalized
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
inputs = tokenizer(sentence, return_tensors="pt")
# max_length caps the number of generated tokens (bytes, for ByT5)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# >>> ['Die Königin saß auf des Palastes mittlerer Tribüne.']
Alternatively, use the model through the pipeline API:
from transformers import pipeline
transnormer = pipeline('text2text-generation', model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
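The first example caps generation at `max_length=128`. Because ByT5 operates on UTF-8 bytes rather than subword tokens, a sentence's length in model tokens is roughly its byte count, and non-ASCII characters such as `ſ`, `ö` or `ß` take two bytes each. The following is a minimal sketch, not part of the model's own code, for checking whether a sentence fits that limit and for splitting longer text at whitespace. The helper names `byte_length`, `fits_budget` and `chunk_text` are hypothetical, and reserving two positions for special tokens is an assumption.

```python
def byte_length(text: str) -> int:
    """Return the number of UTF-8 bytes in text (one ByT5 token per byte)."""
    return len(text.encode("utf-8"))


def fits_budget(text: str, max_length: int = 128, reserved: int = 2) -> bool:
    """Check whether text fits into max_length tokens.

    `reserved` leaves room for special tokens; the value 2 is an assumption.
    """
    return byte_length(text) + reserved <= max_length


def chunk_text(text: str, max_length: int = 128, reserved: int = 2) -> list[str]:
    """Greedily group whitespace-separated words into chunks within the budget.

    A single word longer than the budget still becomes its own chunk.
    """
    budget = max_length - reserved
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and byte_length(candidate) > budget:
            # Adding this word would overflow, so close the current chunk
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk returned by `chunk_text` can then be passed to the tokenizer or the pipeline in place of `sentence`. Splitting in the middle of a sentence removes context the model might use, so splitting at sentence boundaries first is likely the better choice where possible.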
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
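
With `lr_scheduler_type: linear`, the learning rate decays linearly from its initial value to zero over the course of training. The sketch below illustrates that schedule in plain Python. It assumes no warmup steps, a single device and no gradient accumulation, none of which the list above states. The functions `total_steps` and `linear_lr` are hypothetical helpers, and the dataset size used in the example is a made-up placeholder, not the size of the actual training set.

```python
import math


def total_steps(num_examples: int, batch_size: int = 8, num_epochs: int = 10) -> int:
    """Optimizer steps for the whole run (single device, no accumulation assumed)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_lr(step: int, num_steps: int, base_lr: float = 5e-05,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup (unused when warmup_steps=0)
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, num_steps - step)
    return base_lr * remaining / max(1, num_steps - warmup_steps)


# Placeholder dataset size, chosen only for illustration
steps = total_steps(num_examples=80_000)
for step in (0, steps // 2, steps):
    print(step, linear_lr(step, steps))
```

The rate starts at the configured `learning_rate`, reaches half of it at the midpoint of training, and is zero at the final step.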
Framework versions
- Transformers 4.31.0
- PyTorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.13.3