Commit ba2232b (verified), committed by mzboito
1 Parent(s): b7e1e47

Update README.md

Files changed (1)
  1. README.md +182 -0
README.md CHANGED
@@ -1,3 +1,185 @@
  ---
  license: cc-by-nc-4.0
+ language:
+ - ab
+ - af
+ - am
+ - ar
+ - as
+ - az
+ - ba
+ - be
+ - bn
+ - bo
+ - bs
+ - br
+ - bg
+ - ca
+ - cs
+ - cv
+ - cy
+ - da
+ - de
+ - dv
+ - el
+ - en
+ - eo
+ - et
+ - eu
+ - ee
+ - fo
+ - fa
+ - tl
+ - fi
+ - fr
+ - fy
+ - ga
+ - gl
+ - gv
+ - gn
+ - gu
+ - ht
+ - ha
+ - he
+ - hi
+ - hr
+ - hu
+ - hy
+ - ig
+ - ia
+ - id
+ - is
+ - it
+ - jv
+ - ja
+ - kn
+ - ka
+ - kk
+ - km
+ - rw
+ - ky
+ - ku
+ - ko
+ - lo
+ - la
+ - lv
+ - ln
+ - lt
+ - lb
+ - lg
+ - ml
+ - mr
+ - mk
+ - mg
+ - mt
+ - mn
+ - mi
+ - ms
+ - my
+ - ne
+ - nl
+ - nn
+ - no
+ - oc
+ - or
+ - pa
+ - pl
+ - pt
+ - ps
+ - ro
+ - ru
+ - sa
+ - si
+ - sl
+ - sk
+ - sn
+ - sd
+ - so
+ - st
+ - es
+ - sq
+ - sc
+ - sr
+ - su
+ - sw
+ - sv
+ - ta
+ - tt
+ - te
+ - tg
+ - th
+ - tn
+ - tk
+ - tr
+ - tw
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - xh
+ - yi
+ - yo
+ - zh
  ---
+
+ ## mHuBERT-147 models
+
+ mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
+
+ This repository contains:
+ * Fairseq checkpoint (original);
+ * HuggingFace checkpoint;
+ * Faiss index for continuous pre-training (`OPQ16_64,IVF1000_HNSW32,PQ16x4fsr`).
+
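The Faiss index string in the list above follows Faiss's index-factory syntax, in which comma-separated components describe a pre-transform, a coarse quantizer, and a vector encoding. As a rough reading aid, the sketch below pulls the numeric settings out of that string using only the standard library. The helper `describe_factory_string` and its key names are hypothetical, and the interpretation is based on Faiss's documented factory conventions rather than on anything stated in this repository.

```python
import re

# Factory string copied from the README. The interpretation below follows
# Faiss's documented index-factory conventions; it is not taken from this repo.
FACTORY = "OPQ16_64,IVF1000_HNSW32,PQ16x4fsr"


def describe_factory_string(factory: str) -> dict:
    """Hypothetical helper: extract the numeric settings of each component."""
    pre, coarse, code = factory.split(",")

    # OPQ<M>_<d_out>: OPQ rotation with M blocks, reducing to d_out dimensions.
    opq = re.fullmatch(r"OPQ(\d+)_(\d+)", pre)
    # IVF<nlist>_HNSW<M>: inverted file with nlist centroids, HNSW coarse quantizer.
    ivf = re.fullmatch(r"IVF(\d+)_HNSW(\d+)", coarse)
    # PQ<M>x<bits>[fs|fsr]: product quantizer; "fs" = fast scan, "r" = residual.
    pq = re.fullmatch(r"PQ(\d+)x(\d+)(fsr|fs)?", code)
    if not (opq and ivf and pq):
        raise ValueError(f"unsupported factory string: {factory!r}")

    return {
        "opq_blocks": int(opq.group(1)),
        "opq_output_dim": int(opq.group(2)),
        "ivf_centroids": int(ivf.group(1)),
        "hnsw_links": int(ivf.group(2)),
        "pq_subquantizers": int(pq.group(1)),
        "pq_bits": int(pq.group(2)),
        "fast_scan": pq.group(3) in ("fs", "fsr"),
        "by_residual": pq.group(3) == "fsr",
    }


if __name__ == "__main__":
    for key, value in describe_factory_string(FACTORY).items():
        print(f"{key}: {value}")
```

With Faiss installed, the same string can be passed to `faiss.index_factory(d, FACTORY)`, where `d` is the dimension of the input features.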
+
+ # Additional Information
+
+
+ **Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
+
+ Please note that CommonVoice removal requests have been made since training, so some of the listed files are no longer available.
+
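Because some listed files may no longer be obtainable, it can help to prune a manifest against a local copy of the data before reusing it. The sketch below assumes the usual fairseq manifest layout (the first line is a root directory, and each following line is `relative_path<TAB>num_samples`). Whether the manifests in this repository follow exactly that layout is an assumption, and `prune_manifest` is a hypothetical helper, not part of the linked scripts.

```python
import os
import tempfile
from typing import Tuple


def prune_manifest(src_path: str, dst_path: str) -> Tuple[int, int]:
    """Copy a manifest, keeping only entries whose audio file exists on disk.

    Assumes line 1 is the root directory and every other line is
    "<relative_path>\t<num_samples>". Returns (kept, dropped).
    """
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        root = src.readline().rstrip("\n")
        dst.write(root + "\n")
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            rel_path = line.split("\t", 1)[0]
            if os.path.isfile(os.path.join(root, rel_path)):
                dst.write(line + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped


def demo() -> Tuple[int, int]:
    """Build a toy manifest with one present file and one missing file."""
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "a.wav"), "wb").close()
        src = os.path.join(root, "train.tsv")
        dst = os.path.join(root, "train.pruned.tsv")
        with open(src, "w", encoding="utf-8") as f:
            f.write(f"{root}\na.wav\t16000\nremoved.wav\t32000\n")
        return prune_manifest(src, dst)


if __name__ == "__main__":
    print(demo())
```

Writing a new file instead of editing in place keeps the original manifest available as a record of what was used for training.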
+ **Fairseq fork:** https://github.com/utter-project/fairseq
+
+ **Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts
+
+ **Languages present not indexed by Hugging Face:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
+
+
+ # Datasets Included
+
+ For ASR/ST/TTS datasets, only the train set is used.
+ * [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
+ * [BibleTTS](https://www.openslr.org/129/)
+ * [ClovaCall](https://github.com/clovaai/ClovaCall)
+ * [CommonVoice v11](https://commonvoice.mozilla.org/en/datasets)
+ * Google TTS data: [Javanese](https://www.openslr.org/41/), [Khmer](https://www.openslr.org/42/), [Nepali](https://www.openslr.org/43/), [Sundanese](https://www.openslr.org/44/), [South African Languages](https://www.openslr.org/32/), [Bengali Languages](https://www.openslr.org/37/)
+ * IISc-MILE: [Tamil](https://www.openslr.org/127/), [Kannada](https://www.openslr.org/126/)
+ * [Japanese Versatile Speech](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus)
+ * [Kokoro](https://github.com/kaiidams/Kokoro-Speech-Dataset)
+ * [Kosp2e](https://github.com/warnikchow/kosp2e)
+ * Media Speech: [Turkish Only](https://www.openslr.org/108/)
+ * [Multilingual LibriSpeech](https://www.openslr.org/94/)
+ * [Samrómur](https://www.openslr.org/128/)
+ * [THCHS-30](https://www.openslr.org/18/) and [THUYG-20](https://www.openslr.org/22/)
+ * [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
+ * [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)
+
+
+ # Citing
+
+ ```
+ @inproceedings{boito2024mhubert,
+ author={Marcely Zanon Boito and Vivek Iyer and Nikolaos Lagos and Laurent Besacier and Ioan Calapodescu},
+ title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
+ year=2024,
+ booktitle={Interspeech 2024},
+ }
+ ```
+
+
+ # Funding
+
+ This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631.
+ For more information go to https://he-utter.eu/