Update README.md
README.md (changed)
…we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model: the student keeps the teacher's full encoder, while its decoder has two layers initialized from the first and last layer of the large-v3 model.
Kotoba-Whisper is **6.3x faster than large-v3**, while retaining error rates as low as large-v3's.
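As a concrete picture of that initialization, here is a minimal sketch, not the project's actual training code (see the kotoba-whisper repository for that), of copying the teacher's encoder and seeding a two-layer decoder from its first and last decoder layers:

```python
import copy
from transformers import WhisperForConditionalGeneration

# Load the teacher and build a student whose decoder has only two layers.
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# The student keeps the teacher's full encoder.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Its two decoder layers are seeded from the teacher's first and last layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

# Shared decoder weights (token embeddings here; positional embeddings and
# layer norms analogously) are copied before distillation training starts.
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
```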
As the successor to our first model, [kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), we release ***kotoba-whisper-v2.0***, trained on the `all` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) (the largest speech-transcription paired dataset in Japanese, extracted from Japanese TV audio recordings), which amounts to 7,203,957 audio clips (5 seconds of audio with 18 text tokens on average) after transcriptions with more than 10 WER are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
The model was trained for 8 epochs with batch size 256 at a 16kHz sampling rate, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
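To make the WER filter concrete, here is a hedged sketch of the filtering step using the `evaluate` library; the toy sample pairs are illustrative, not drawn from ReazonSpeech:

```python
from evaluate import load

wer_metric = load("wer")

# Toy stand-ins for (ground-truth transcription, Whisper pseudo-label) pairs.
samples = [
    {"reference": "kyou wa ii tenki desu", "prediction": "kyou wa ii tenki desu"},
    {"reference": "ashita wa ame ga furu", "prediction": "asa wa furu"},
]

# Keep only pairs whose pseudo-label scores 10 WER (percent) or lower.
filtered = [
    s for s in samples
    if 100 * wer_metric.compute(predictions=[s["prediction"]],
                                references=[s["reference"]]) <= 10
]
print(f"kept {len(filtered)} of {len(samples)} samples")
```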
Kotoba-whisper-v2.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set from ReazonSpeech, and competitive CER and WER on out-of-domain test sets including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for details).
- ***CER***

| Model | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|:---------------------------------------------------------------------------------------------|-------------------:|-----------------:|--------------------:|
| [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 9.20 | 8.40 | **11.63** |
| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | 12.60 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
- ***WER***

| Model | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
| [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 58.8 | 63.7 | **55.6** |
| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | 56.62 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
It inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3):

| Model | Params / M | Rel. Latency |
|----------------------------------------------------------------------------------------------|------------|--------------|
| **[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)** | **756** | **6.3** |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
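For a rough sanity check of the relative-latency figure, a hedged timing sketch (the audio path is a placeholder, and the exact speedup depends on hardware, dtype, and batch size):

```python
import time
import torch
from transformers import pipeline

def mean_latency(model_id: str, audio_path: str, runs: int = 5) -> float:
    """Average wall-clock seconds to transcribe one file with `model_id`."""
    pipe = pipeline("automatic-speech-recognition", model=model_id,
                    torch_dtype=torch.bfloat16, device="cuda:0")
    pipe(audio_path)  # warm-up run, excluded from timing
    start = time.time()
    for _ in range(runs):
        pipe(audio_path)
    return (time.time() - start) / runs

fast = mean_latency("kotoba-tech/kotoba-whisper-v2.0", "sample_ja.wav")
slow = mean_latency("openai/whisper-large-v3", "sample_ja.wav")
print(f"relative latency: {slow / fast:.1f}x")
```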
All three code snippets in the README now load the v2.0 checkpoint. The first usage snippet's config:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
```
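A hedged sketch of how this config is typically wired up; the dataset split and the `audio` column below are assumptions following common Hugging Face audio examples, not taken from this diff:

```python
# Build the ASR pipeline from the config above.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)

# Transcribe one clip from the ReazonSpeech held-out test set.
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]
result = pipe(sample, generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```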
The second usage snippet's config is updated the same way:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
```
And the evaluation snippet's model config:

```python
import torch
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# model config
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
```
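A hedged sketch of a CER computation built on this config; the pipeline wiring, dataset choice, and the `transcription` column name are assumptions, while the metric and normalizer follow the imports above:

```python
from transformers import pipeline
from datasets import load_dataset

pipe = pipeline("automatic-speech-recognition", model=model_id,
                torch_dtype=torch_dtype, device=device, model_kwargs=model_kwargs)
dataset = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")

# Normalize both hypotheses and references before scoring.
normalizer = BasicTextNormalizer()
predictions, references = [], []
for sample in dataset:
    out = pipe(sample["audio"], generate_kwargs={"language": "japanese", "task": "transcribe"})
    predictions.append(normalizer(out["text"]))
    references.append(normalizer(sample["transcription"]))

cer_metric = load("cer")
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {cer:.2f}")
```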