from textwrap import dedent

BANNER_TEXT = """
<div style="text-align: center;">
    <h1><a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android Benchmarks</a></h1>
</div>
"""


INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit Android, our on-device ASR solution for Android devices, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""


INTRO_TEXT = """
<h3 style="display: flex;
  justify-content: center;
  align-items: center;
"></h3>
\n📈 Key Metrics:  
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.  
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit Android performs no worse than the reference model. Higher is better.  
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.  
Speed (⬆️): Input audio seconds transcribed per second. Higher is better.
🎯 WhisperKit Android is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our benchmarks include:  
Reference: <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> (OpenAI's Whisper API)  
On-device: <a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android</a> (various versions and optimizations)  
ℹ️ Reference Implementation:  
<a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:  
<a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips  
<a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls  
🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use <a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android</a> to reproduce these results or run evaluations on their own custom datasets.
🔗 Links:
- <a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android</a>
- <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
- <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>
- <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>
- <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
"""


METHODOLOGY_TEXT = dedent(
    """
    # Methodology
    ## Overview
    WhisperKit Android Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit Android models across supported devices, OS versions and audio datasets.
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit Android latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
        - This metric varies with input data because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit Android performs no worse than the reference model.
        - This metric does not capture improvements to the reference. It only measures potential regressions.
    
    ## Data
    - **Short-form**: 10 minutes of English audiobook clips with 30s/clip comprising a subset of the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
    - **Long-form**: 10 minutes of earnings call recordings in English. Built from the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
    - Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
    ## Performance Measurement
    1. On-device testing is conducted with [WhisperKit Android Tests](https://github.com/argmaxinc/WhisperKitAndroid) on Android devices, across different Android versions.
    2. Performance is recorded on the 10-minute datasets described above for short- and long-form audio.
    3. Quality metrics are recorded on 10-minute datasets using an Apple M2 Pro CPU on a Linux host, to allow fast processing of many configurations and to provide a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
    4. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.
    ## Dashboard Features
    - Performance: Interactive filtering by model, device, OS, and performance metrics
    - Timeline: Visualizations of performance trends
    - English Quality: English transcription quality on short- and long-form audio
    - Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.

    This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit Android across a wide range of scenarios and use cases.
"""
)

PERFORMANCE_TEXT = dedent(
    """
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit Android latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
    ## Data
    - **Short-form**: 10 minutes of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
    - **Long-form**: 10 minutes of earnings call recordings in English with various accents. Built from the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
"""
)
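

# Illustrative sketch only, not part of the benchmark pipeline: one way the two
# performance metrics described in PERFORMANCE_TEXT could be computed from raw
# timings. The function and parameter names are hypothetical.
def example_speed_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Seconds of input audio transcribed per second of end-to-end latency."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return audio_seconds / latency_seconds


def example_tokens_per_second(decoder_passes: int, latency_seconds: float) -> float:
    """Text decoder forward passes divided by end-to-end processing time."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return decoder_passes / latency_seconds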

QUALITY_TEXT = dedent(
    """
    ## Metrics
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit Android performs no worse than the reference model.
        - This metric does not capture improvements to the reference. It only measures potential regressions.
"""
)
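

# Illustrative sketch only, not part of the benchmark pipeline: a minimal way
# to compute the two quality metrics described in QUALITY_TEXT. Assumptions:
# words are split on whitespace with no text normalization, and QoI is returned
# as a fraction in [0, 1]. The function names are hypothetical.
def example_wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def example_qoi(candidate_wers, reference_wers) -> float:
    """Fraction of examples where the candidate WER is no worse than the reference WER."""
    pairs = list(zip(candidate_wers, reference_wers))
    if not pairs:
        raise ValueError("at least one example is required")
    return sum(1 for cand, ref in pairs if cand <= ref) / len(pairs)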

COL_NAMES = {
    "model.model_version": "Model",
    "device.product_name": "Device",
    "device.os": "OS",
    "average_wer": "Average WER",
    "qoi": "QoI",
    "speed": "Speed",
    "tokens_per_second": "Tok / s",
    "model": "Model",
    "device": "Device",
    "os": "OS",
    "english_wer": "English WER",
    "multilingual_wer": "Multilingual WER",
}


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"


CITATION_BUTTON_TEXT = r"""@misc{whisperkit-android-argmax,
   title = {WhisperKit Android},
   author = {Argmax, Inc.},
   year = {2024},
   URL = {https://github.com/argmaxinc/WhisperKitAndroid}
}"""


HEADER = """<div align="center">
        <div style="position: relative;">
        <img
            src=""
            style="display:block;width:7%;height:auto;"
        />
        </div>
</div>"""


EARNINGS22_URL = (
    "https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
    "https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)

AUDIO_URL = (
    "https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)

WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"

BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"