# Transnormer 19th century (beta v02)

This model generates a normalized version of historical input text for German from the 19th century.

## Model description

`Transnormer` is a byte-level sequence-to-sequence model for normalizing historical German text.
This model was trained on text from the 19th and late 18th century by fine-tuning [google/byt5-small](https://huggingface.co/google/byt5-small).
The fine-tuning data was a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).
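Byte-level here means that the model reads and writes raw UTF-8 bytes instead of subword tokens, which makes it robust to historical characters such as the long s (ſ). A minimal sketch of what this looks like, assuming the checkpoint bundles the standard ByT5 tokenizer:

```python
from transformers import AutoTokenizer

# Load the byte-level (ByT5) tokenizer shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")

# Each UTF-8 byte becomes one ID (byte value + 3, to make room for special
# tokens), so "ſaß" (5 bytes: ſ and ß take 2 bytes each) yields 5 IDs plus </s>
print(tokenizer("ſaß")["input_ids"])
# >>> [200, 194, 100, 198, 162, 1]
```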
## Uses

This model is intended for users who work with historical text and need a normalized version, i.e. a version that comes closer to modern spelling.
Historical text typically contains spelling variations and extinct spellings that differ from contemporary text.
This can be a drawback when working with historical text: the variation can impair the performance of NLP tools (POS tagging, etc.) that were trained on contemporary language,
and full-text search becomes more tedious due to the numerous spelling options for the same search term.
Historical text normalization, as offered by this model, can mitigate these problems to some extent.

Note that this model is intended for the normalization of *historical German text from a specific time period*.
It is *not intended* for other types of text that may require normalization (e.g. computer-mediated communication), for languages other than German, or for other time frames.
There may be other models available for that on the [Hub](https://huggingface.co/models).

The model can be further fine-tuned to be adapted or improved, e.g. as described in the [Transformers](https://huggingface.co/docs/transformers/training) tutorials.
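A minimal sketch of such a fine-tuning run with the `Seq2SeqTrainer` API; the toy data, hyperparameters, and output directory below are illustrative assumptions, not recommended settings:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "ybracke/transnormer-19c-beta-v02"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy parallel data: historical spelling (input) -> normalized spelling (target)
pairs = [
    {"orig": "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune.",
     "norm": "Die Königin saß auf des Palastes mittlerer Tribüne."},
]
dataset = Dataset.from_list(pairs)

def preprocess(example):
    # Truncate to the 512-byte input length the model was trained with
    model_inputs = tokenizer(example["orig"], truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(example["norm"], truncation=True, max_length=512)["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, remove_columns=["orig", "norm"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="transnormer-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```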
### Demo Usage
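A minimal sketch of direct usage via the `generate` API; the generation parameters mirror the pipeline call below and are illustrative rather than official settings:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")

sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
inputs = tokenizer(sentence, return_tensors="pt")

# Beam search over byte-level output; max_length counts bytes, not words
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# >>> Die Königin saß auf des Palastes mittlerer Tribüne.
```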
Or use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html):

```python
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
### Recommendations

The model was trained using a maximum input length of 512 bytes (~70 words).
Inference is generally possible for longer sequences, but the results may be worse than for shorter sequences.
Generally, shorter sequences make inference faster and less computationally expensive.
Consider splitting long sequences into shorter ones and processing them separately, as sketched below.
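For instance, a longer text could be segmented into sentences and normalized piece by piece. A sketch with a naive regex split (a proper sentence segmenter would be preferable, and the input text is invented for illustration):

```python
import re
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')

long_text = (
    "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune. "
    "Das Volk ſtand unten und ſchaute zu ihr hinauf."
)

# Naive segmentation at sentence-final punctuation (illustrative only)
sentences = re.split(r"(?<=[.!?])\s+", long_text)

# Normalize each piece separately to stay within the trained input length
normalized = [transnormer(s, num_beams=4, max_length=128)[0]["generated_text"]
              for s in sentences]
print(" ".join(normalized))
```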
## Training and evaluation data

The model was fine-tuned and evaluated on splits derived from the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/), a parallel corpus of 121 texts from the Deutsches Textarchiv (German Text Archive). The corpus was originally created by aligning historic prints in original spelling with an edition in contemporary orthography.