Commit 71f9b03 by ctsoukala (verified) · 1 Parent(s): 69a251a

Update README.md

Files changed (1)
  1. README.md +35 -5
README.md CHANGED
@@ -11,11 +11,13 @@ tags:
11
 
12
  # wav2vec2-xls-r-slavic-pomak
13
 
14
- To train a Pomak ASR model, we fine-tuned a Slavic model ([classla/wav2vec2-large-slavic-parlaspeech-hr](https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr)) on 11h of recorded Pomak speech.
 
 
15
 
16
  ## Recordings
17
 
18
- Fours native Pomak speakers (2 female and 2 male) agreed to read Pomak texts at the ILSP audio-visual studio in Xanthi, Greece, resulting in a total of 14h.
19
 
20
  |Speaker|Gender|Total recorded hours|
21
  |---|---|---|
@@ -29,7 +31,7 @@ This removed the majority of pauses and resulted in a total dataset duration of
29
 
30
  ## Metrics
31
 
32
- The test set consists of 10% of the dataset recordings.
33
 
34
  |Model|CER|WER|
35
  |---|---|---|
@@ -38,7 +40,7 @@ The test set consists of 10% of the dataset recordings.
38
 
39
  ## Training hyperparameters
40
 
41
- To fine-tune the wav2vec2-large-slavic-parlaspeech-hr model, we used the following hyperparameters:
42
 
43
  | arg | value |
44
  |-------------------------------|-------|
@@ -46,4 +48,32 @@ To fine-tune the wav2vec2-large-slavic-parlaspeech-hr model, we used the followi
46
  | `gradient_accumulation_steps` | 2 |
47
  | `num_train_epochs` | 35 |
48
  | `learning_rate` | 3e-4 |
49
- | `warmup_steps` | 500 |
 
11
 
12
  # wav2vec2-xls-r-slavic-pomak
13
 
14
+ Pomak is an endangered South East Slavic language variety spoken in Northern Greece.
15
+ This is the first automatic speech recognition (ASR) model for Pomak.
16
+ To train the model, we fine-tuned a Slavic model ([classla/wav2vec2-large-slavic-parlaspeech-hr](https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr)) on 11h of recorded Pomak speech.
17
 
18
  ## Recordings
19
 
20
+ Four native Pomak speakers (2 female and 2 male) agreed to read Pomak texts at the ILSP audio-visual studio in Xanthi, Greece, resulting in a corpus of 14h.
21
 
22
  |Speaker|Gender|Total recorded hours|
23
  |---|---|---|
 
31
 
32
  ## Metrics
33
 
34
+ We evaluated the model on the test split, which consists of 10% of the dataset recordings.
35
 
36
  |Model|CER|WER|
37
  |---|---|---|
 
40
 
41
  ## Training hyperparameters
42
 
43
+ We fine-tuned the baseline model (`wav2vec2-large-slavic-parlaspeech-hr`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:
44
 
45
  | arg | value |
46
  |-------------------------------|-------|
 
48
  | `gradient_accumulation_steps` | 2 |
49
  | `num_train_epochs` | 35 |
50
  | `learning_rate` | 3e-4 |
51
+ | `warmup_steps` | 500 |
52
+
53
+ ## Citation
54
+
55
+ To cite this work or read more about the training pipeline, see [this paper](https://aclanthology.org/2023.fieldmatters-1.5/).
56
+
57
+ ```
58
+ @inproceedings{tsoukala-etal-2023-asr,
59
+ title = "{ASR} pipeline for low-resourced languages: A case study on Pomak",
60
+ author = "Tsoukala, Chara and
61
+ Kritsis, Kosmas and
62
+ Douros, Ioannis and
63
+ Katsamanis, Athanasios and
64
+ Kokkas, Nikolaos and
65
+ Arampatzakis, Vasileios and
66
+ Sevetlidis, Vasileios and
67
+ Markantonatou, Stella and
68
+ Pavlidis, George",
69
+ booktitle = "Proceedings of the Second Workshop on NLP Applications to Field Linguistics",
70
+ month = may,
71
+ year = "2023",
72
+ address = "Dubrovnik, Croatia",
73
+ publisher = "Association for Computational Linguistics",
74
+ url = "https://aclanthology.org/2023.fieldmatters-1.5",
75
+ doi = "10.18653/v1/2023.fieldmatters-1.5",
76
+ pages = "40--45",
77
+ abstract = "Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resource languages can be a challenging task not only due to lack of resources but also due to the work required for the preparation of a training dataset. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.",
78
+ }
79
+ ```
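
The Metrics section of the README reports CER and WER without showing how such scores are computed. The following is a minimal sketch of the standard definitions (edit distance divided by reference length), not the evaluation code used for this model. The function names `edit_distance`, `wer` and `cer` are illustrative, and the choices to split words on whitespace and to count spaces as characters in CER are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; words are whitespace-separated (assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces count as characters (assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Placeholder strings, not Pomak data.
    print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 words
    print(cer("abcd", "abxd"))      # one substitution over 4 characters
```

Corpus-level scores are usually obtained by summing edit distances and reference lengths over all test utterances before dividing, rather than by averaging per-utterance rates.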