from textwrap import dedent

BANNER_TEXT = """

WhisperKit Android Benchmarks

""" INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit Android, our on-device ASR solution for Android devices, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more.""" INTRO_TEXT = """

📈 Key Metrics:
- Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.
- Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit Android performs no worse than the reference model. Higher is better.
- Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.
- Speed (⬆️): Input audio seconds transcribed per second. Higher is better.

🎯 WhisperKit Android is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).

💻 Our benchmarks include:
- Reference: WhisperOpenAIAPI (OpenAI's Whisper API)
- On-device: WhisperKit Android (various versions and optimizations)

ℹ️ Reference Implementation: WhisperOpenAIAPI sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25 MB file size limit per request.

🔍 We use two primary datasets:
- LibriSpeech: ~5 hours of short English audio clips
- Earnings22: ~120 hours of English audio from earnings calls

🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.

🛠️ Developers can use WhisperKit Android to reproduce these results or run evaluations on their own custom datasets.

🔗 Links:
- WhisperKit Android
- whisperkittools
- LibriSpeech
- Earnings22
- WhisperOpenAIAPI
"""

METHODOLOGY_TEXT = dedent(
    """
# Methodology

## Overview
WhisperKit Android Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit Android models across supported devices, OS versions and audio datasets.

## Metrics
- **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit Android latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
- **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
    - This metric varies with input data, given that the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
- **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
- **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit Android performs no worse than the reference model.
    - This metric does not capture improvements to the reference. It only measures potential regressions.

## Data
- **Short-form**: 10 minutes of English audiobook clips with 30s/clip, comprising a subset of the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
- **Long-form**: 10 minutes of earnings call recordings in English. Built from the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
- Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.

## Performance Measurement
1. On-device testing is conducted with [WhisperKit Android Tests](https://github.com/argmaxinc/WhisperKitAndroid) on Android devices, across different Android versions.
2. Performance is recorded on the 10-minute datasets described above for short- and long-form audio.
3. Quality metrics are recorded on the 10-minute datasets using an Apple M2 Pro CPU on a Linux host, which allows fast processing of many configurations and provides a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
4. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.
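
For illustration, the sketch below shows how the two performance metrics defined above are derived from raw measurements. All numbers are hypothetical, not actual benchmark results:

```python
# Hypothetical measurements for a single transcription run (illustrative only).
audio_seconds = 600.0          # 10 minutes of input audio
total_latency_seconds = 25.0   # end-to-end WhisperKit Android latency
decoder_forward_passes = 2400  # total text decoder forward passes

# Speed factor: input audio seconds transcribed per second of processing.
speed_factor = audio_seconds / total_latency_seconds

# Tok/s: decoder forward passes over end-to-end time (not the reciprocal
# of per-token decoder latency, which is constant across input files).
tokens_per_second = decoder_forward_passes / total_latency_seconds

print(speed_factor, tokens_per_second)  # 24.0 96.0
```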
## Dashboard Features
- Performance: Interactive filtering by model, device, OS, and performance metrics
- Timeline: Visualizations of performance trends
- English Quality: English transcription quality on short- and long-form audio
- Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.

This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit Android across a wide range of scenarios and use cases.
"""
)

PERFORMANCE_TEXT = dedent(
    """
## Metrics
- **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit Android latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
- **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.

## Data
- **Short-form**: 10 minutes of English audiobook clips with 30s/clip, comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
- **Long-form**: 10 minutes of earnings call recordings in English with various accents. Built from the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
"""
)

QUALITY_TEXT = dedent(
    """
## Metrics
- **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
- **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit Android performs no worse than the reference model.
    - This metric does not capture improvements to the reference. It only measures potential regressions.
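
As a simplified illustration of these two metrics (not the actual evaluation code, which also normalizes text before scoring), WER can be computed as a word-level edit distance and QoI as a per-example comparison against the reference:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

def qoi(reference_wers, candidate_wers):
    """Fraction of examples where the candidate does no worse than the reference."""
    pairs = zip(reference_wers, candidate_wers)
    return sum(c <= r for r, c in pairs) / len(reference_wers)

# One substitution over four reference words:
print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```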
""" ) COL_NAMES = { "model.model_version": "Model", "device.product_name": "Device", "device.os": "OS", "average_wer": "Average WER", "qoi": "QoI", "speed": "Speed", "tokens_per_second": "Tok / s", "model": "Model", "device": "Device", "os": "OS", "english_wer": "English WER", "multilingual_wer": "Multilingual WER", } CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" CITATION_BUTTON_TEXT = r"""@misc{whisperkit-android-argmax, title = {WhisperKit Android}, author = {Argmax, Inc.}, year = {2024}, URL = {https://github.com/argmaxinc/WhisperKitAndroid} }""" HEADER = """
""" EARNINGS22_URL = ( "https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}" ) LIBRISPEECH_URL = ( "https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}" ) AUDIO_URL = ( "https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/" ) WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}" BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"