ybracke committed · verified
Commit c4f2c13 · 1 Parent(s): 0632c77

Update README.md

Update data section

Files changed (1):
  1. README.md +8 -8
README.md CHANGED
@@ -29,8 +29,8 @@ library_name: transformers
 
 # Transnormer 19th century (beta v02)
 
-This model normalizes spelling variants in historical German text to the modern spelling.
-We fine-tuned [google/byt5-small](https://huggingface.co/google/byt5-small) on a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (1780-1901).
+This model generates a normalized version of historical input text for German from the 19th (and late 18th) century.
+The base model [google/byt5-small](https://huggingface.co/google/byt5-small) was fine-tuned on a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).
 
 ## Model description
 
@@ -43,7 +43,7 @@ tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
 model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")
 sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
 inputs = tokenizer(sentence, return_tensors="pt",)
-outputs = model.generate(**inputs, max_length=128)
+outputs = model.generate(**inputs, num_beams=4, max_length=128)
 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 # >>> ['Die Königin saß auf des Palastes mittlerer Tribüne.']
 ```
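The change above switches the README's quickstart to beam search. The same decoding options can also be passed through the pipeline interface the README shows next; a minimal sketch (the pipeline call is standard transformers API, and the expected output is the one quoted in the README):

```python
from transformers import pipeline

# Load the model through the text2text-generation pipeline
transnormer = pipeline("text2text-generation", model="ybracke/transnormer-19c-beta-v02")

sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
# Generation kwargs such as num_beams and max_length are forwarded to model.generate()
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```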
@@ -58,13 +58,13 @@ print(transnormer(sentence))
 # >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
 ```
 
-## Intended uses & limitations
+## Training and evaluation data
 
-More information needed
+The model was fine-tuned and evaluated on splits derived from the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/), a parallel corpus of 121 texts from the Deutsches Textarchiv (German Text Archive). The corpus was originally created by aligning historical prints in their original spelling with an edition in contemporary orthography.
 
-## Training and evaluation data
+The original corpus creators applied some corrections to the modern versions (see Jurish et al. 2013). For our use of the corpus, we further improved the quality of the normalized part by enforcing spellings that conform to the post-1996 German orthography reform and by applying selected [LanguageTool](https://pypi.org/project/language-tool-python/) rules and custom replacements to remove errors and inconsistencies. We plan to publish the corpus as a dataset on the Hugging Face Hub in the future.
 
-More information needed
+The training set contains 96 documents with 4.6M source tokens; the dev and test sets contain 13 documents (405K tokens) and 12 documents (381K tokens), respectively.
 
 ## Training procedure
 
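The README does not spell out which LanguageTool rules or custom replacements were used, so the following is only a rough illustration of that cleanup step, written against the language-tool-python package with hypothetical rule IDs standing in for the actual selection:

```python
import language_tool_python

# Hypothetical whitelist; the rule selection actually used for the corpus is not published
SELECTED_RULES = {"OLD_SPELLING", "DE_AGREEMENT"}

tool = language_tool_python.LanguageTool("de-DE")

def apply_selected_rules(text: str) -> str:
    """Apply only whitelisted LanguageTool rules to a normalized sentence."""
    matches = [m for m in tool.check(text) if m.ruleId in SELECTED_RULES]
    return language_tool_python.utils.correct(text, matches)
```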
@@ -77,7 +77,7 @@ The following hyperparameters were used during training:
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
-- num_epochs: 10
+- num_epochs: 10 (published model: 8 epochs)
 
 ### Framework versions
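For anyone reproducing the setup, the hyperparameters listed in this hunk map directly onto transformers training arguments; a minimal sketch, where output_dir and any values outside this hunk (learning rate, batch sizes) are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters shown above; values not listed in this hunk
# (learning rate, batch sizes, etc.) would still need to be filled in.
training_args = Seq2SeqTrainingArguments(
    output_dir="./transnormer-19c-beta-v02",  # hypothetical output path
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10,  # the published checkpoint corresponds to epoch 8
)
```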
 