oovword committed a741cff (verified · 1 parent: 9621acd)

Upload README.md with huggingface_hub

Files changed (1): README.md added (+149 lines)

---
base_model: openai/whisper-small
language:
- uk
- en
datasets:
- oovword/speech-translation-uk-en
pipeline_tag: speech-translation
license: apache-2.0
metrics:
- bleu
- chrf
inference: true
library_name: transformers
model-index:
- name: uk2en-speech-translation
  results:
  - task:
      type: speech-translation
    dataset:
      name: Half-Synthetic Speech Dataset for Ukrainian-to-English Translation
      type: oovword/speech-translation-uk-en
    metrics:
    - type: bleu
      value: 22.34
      name: BLEU
    - type: chrf
      value: 48.1
      name: ChrF++
---

# Model Card

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Training](#training)
4. [License](#license)
5. [Citations](#citations)

## Model Summary

This model is a fine-tuned version of `openai/whisper-small` for Ukrainian-to-English speech translation. It was fine-tuned as part of the 3-week Speech Translation Mentorship by Yasmin Moslem.

## Use

### Intended use

The model was trained on Ukrainian speech (source) and English text (target) data and can be used for speech-to-text translation from Ukrainian into English.

### Generation

The model accepts mono-channel audio sampled at 16 kHz.

```python
import torch
import torchaudio

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained('whisper-uk2en-speech-translation')
processor = WhisperProcessor.from_pretrained('whisper-uk2en-speech-translation')

# Audio files in `datasets` format: `sample` is a single dataset row with an `audio` column
inputs = processor(sample['audio']['array'].squeeze(), sampling_rate=16000, return_tensors='pt', return_attention_mask=True)
with torch.inference_mode():
    predictions = model.generate(**inputs)
sample['translation'] = processor.batch_decode(predictions, skip_special_tokens=True)[0].strip()

# Standalone audio files (already mono, 16 kHz)
waveform, _ = torchaudio.load('ukrainian_speech.wav')
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors='pt', return_attention_mask=True)
with torch.inference_mode():
    predictions = model.generate(**inputs)
print(processor.batch_decode(predictions, skip_special_tokens=True)[0].strip())
```
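
If a recording is not already mono 16 kHz audio, it can be converted before calling the processor. Below is a minimal sketch with `torchaudio`; the file name and the stereo source are illustrative, and `processor` is the one loaded above.

```python
import torchaudio
import torchaudio.functional as F

TARGET_SR = 16000  # sampling rate expected by the model

# Hypothetical input file: possibly stereo, arbitrary sampling rate
waveform, orig_sr = torchaudio.load('ukrainian_speech_stereo.wav')

# Downmix to mono by averaging channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if needed
if orig_sr != TARGET_SR:
    waveform = F.resample(waveform, orig_freq=orig_sr, new_freq=TARGET_SR)

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=TARGET_SR, return_tensors='pt', return_attention_mask=True)
```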

### Attribution & Other Requirements

The following datasets, all licensed under CC-BY-4.0, were used to fine-tune the model:

- [`google/fleurs`](https://huggingface.co/datasets/google/fleurs) (fully authentic)
- [`skypro1111/elevenlabs_dataset`](https://huggingface.co/datasets/skypro1111/elevenlabs_dataset) (fully synthetic)
- [`MLCommons/ml_spoken_words`](https://huggingface.co/datasets/MLCommons/ml_spoken_words) (authentic + synthetic)

The FLEURS dataset contains only authentic human speech and translations.
For the `elevenlabs` dataset, the Ukrainian text was generated by ChatGPT and then voiced with the ElevenLabs TTS model; the transcripts were machine-translated into English with Azure Translator.
The speech and Ukrainian transcripts in the ML Spoken Words dataset are authentic human data; the English text was machine-translated from Ukrainian with Azure Translator.

**NOTE:** The English translations were not human-verified or proofread due to time constraints and, as such, may contain mistakes and inaccuracies.

| Split | Samples | Total duration |
| --- | --- | --- |
| train | 10,390 | 10 h 45 min 12 s |
| dev | 2,058 | 1 h 36 min 7 s |
| test | 2,828 | 3 h 1 min 28 s |
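
The BLEU and ChrF++ values in the metadata (22.34 and 48.1) would typically be computed on the test split with `sacrebleu`; a minimal sketch is shown below. It reuses `model` and `processor` from the Generation section and assumes the split is named `test` with `audio` and `translation` columns; those split and column names are assumptions, not guaranteed by this card.

```python
import torch
import sacrebleu
from datasets import load_dataset

# Assumed split and column names for the evaluation data
test_set = load_dataset('oovword/speech-translation-uk-en', split='test')

hypotheses, references = [], []
for sample in test_set:
    inputs = processor(sample['audio']['array'], sampling_rate=16000,
                       return_tensors='pt', return_attention_mask=True)
    with torch.inference_mode():
        predictions = model.generate(**inputs)
    hypotheses.append(processor.batch_decode(predictions, skip_special_tokens=True)[0].strip())
    references.append(sample['translation'])

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)                # BLEU
print(sacrebleu.corpus_chrf(hypotheses, [references], word_order=2).score)  # chrF++ (word_order=2)
```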

## Training

The model was fine-tuned on a mix of authentic human and synthetic speech and text translations on a T4 GPU in Google Colab with the following training parameters (a configuration sketch follows the list):

- learning_rate: 1e-6
- batch_size: 32
- num_train_epochs: 3 (975 training steps)
- warmup_steps: 0
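
One way these hyperparameters might map onto the `transformers` `Seq2SeqTrainingArguments` API is sketched below; the output directory, evaluation batch size, mixed precision, and `predict_with_generate` settings are illustrative assumptions, not taken from the original training setup.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='whisper-uk2en-speech-translation',  # hypothetical output path
    learning_rate=1e-6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,   # assumption: mirror the training batch size
    num_train_epochs=3,              # ~975 training steps at batch size 32
    warmup_steps=0,
    fp16=True,                       # assumption: mixed precision to fit a T4 GPU
    predict_with_generate=True,      # assumption: generate translations during evaluation
)
```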

## License

The fine-tuned model is released under the same Apache-2.0 license as the original `openai/whisper-small` checkpoint.

## Citations

```bibtex
@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@article{fleurs2022arxiv,
  title   = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author  = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal = {arXiv preprint arXiv:2205.12446},
  url     = {https://arxiv.org/abs/2205.12446},
  year    = {2022}
}

@misc{synthetic_tts_dataset,
  author    = {skypro1111},
  title     = {Synthetic TTS Dataset for Training Models},
  year      = {2024},
  publisher = {GitHub},
  journal   = {GitHub repository},
  url       = {https://github.com/skypro1111/pflowtts_pytorch_uk}
}

@inproceedings{mazumder2021multilingual,
  title     = {Multilingual Spoken Words Corpus},
  author    = {Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle = {Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year      = {2021}
}
```