# Transnormer 19th century (beta v02)

This model generates a normalized version of historical input text for German from the 19th century.

## Model description

`Transnormer` is a byte-level sequence-to-sequence model for normalizing historical German text.
This model was trained on text from the 19th and late 18th century by fine-tuning [google/byt5-small](https://huggingface.co/google/byt5-small).
The fine-tuning data was a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).
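Byte-level here means that the model reads and writes raw UTF-8 bytes instead of subword tokens, which makes it robust to historical characters such as the long s (ſ). A minimal sketch of what this looks like, assuming the checkpoint bundles the standard ByT5 tokenizer:

```python
from transformers import AutoTokenizer

# Load the byte-level (ByT5) tokenizer shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")

# Each UTF-8 byte becomes one ID (byte value + 3, to make room for special
# tokens), so "ſaß" (5 bytes: ſ and ß take 2 bytes each) yields 5 IDs plus </s>
print(tokenizer("ſaß")["input_ids"])
# >>> [200, 194, 100, 198, 162, 1]
```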
## Uses

This model is intended for users who work with historical text and need a normalized version, i.e. a version that comes closer to modern spelling.
Historical text typically contains spelling variations and extinct spellings that differ from contemporary text.
This can be a drawback when working with historical text: the variation can impair the performance of NLP tools (POS tagging, etc.) that were trained on contemporary language,
and full-text search becomes more tedious due to the numerous spelling options for the same search term.
Historical text normalization, as offered by this model, can mitigate these problems to some extent.

Note that this model is intended for the normalization of *historical German text from a specific time period*.
It is *not intended* for other types of text that may require normalization (e.g. computer-mediated communication), for languages other than German, or for other time frames.
There may be other models available for that on the [Hub](https://huggingface.co/models).

The model can be further fine-tuned to be adapted or improved, e.g. as described in the [Transformers](https://huggingface.co/docs/transformers/training) tutorials.
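A minimal sketch of such a fine-tuning run with the `Seq2SeqTrainer` API; the toy data, hyperparameters, and output directory below are illustrative assumptions, not recommended settings:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "ybracke/transnormer-19c-beta-v02"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy parallel data: historical spelling (input) -> normalized spelling (target)
pairs = [
    {"orig": "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune.",
     "norm": "Die Königin saß auf des Palastes mittlerer Tribüne."},
]
dataset = Dataset.from_list(pairs)

def preprocess(example):
    # Truncate to the 512-byte input length the model was trained with
    model_inputs = tokenizer(example["orig"], truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(example["norm"], truncation=True, max_length=512)["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, remove_columns=["orig", "norm"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="transnormer-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```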
### Demo Usage
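A minimal sketch of direct usage via the `generate` API; the generation parameters mirror the pipeline call below and are illustrative rather than official settings:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")

sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
inputs = tokenizer(sentence, return_tensors="pt")

# Beam search over byte-level output; max_length counts bytes, not words
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# >>> Die Königin saß auf des Palastes mittlerer Tribüne.
```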
Or use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html):

```python
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
### Recommendations

The model was trained using a maximum input length of 512 bytes (~70 words).
Inference is generally possible for longer sequences, but the results may be worse than for shorter sequences.
Generally, shorter sequences make inference faster and less computationally expensive.
Consider splitting long sequences into shorter ones and processing them separately, as sketched below.
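For instance, a longer text could be segmented into sentences and normalized piece by piece. A sketch with a naive regex split (a proper sentence segmenter would be preferable, and the input text is invented for illustration):

```python
import re
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')

long_text = (
    "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune. "
    "Das Volk ſtand unten und ſchaute zu ihr hinauf."
)

# Naive segmentation at sentence-final punctuation (illustrative only)
sentences = re.split(r"(?<=[.!?])\s+", long_text)

# Normalize each piece separately to stay within the trained input length
normalized = [transnormer(s, num_beams=4, max_length=128)[0]["generated_text"]
              for s in sentences]
print(" ".join(normalized))
```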
## Training and evaluation data

The model was fine-tuned and evaluated on splits derived from the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/), a parallel corpus of 121 texts from the Deutsches Textarchiv (German Text Archive). The corpus was originally created by aligning historic prints in original spelling with an edition in contemporary orthography.