---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
license: cc-by-4.0
language:
- fr
- en
library_name: hibiki
tags:
- speech
- translation
- streaming
metrics:
- bleu
---

# Model Card for Hibiki

[Hibiki](https://github.com/kyutai-labs/hibiki) is a model for streaming speech translation (also known as *simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation.

Hibiki currently only supports French-to-English translation.

## Model Details

This is the model simply referred to as *Hibiki* in our [paper](https://arxiv.org/abs/2502.03382): a 2.7B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio generated at a 2.2kbps bitrate.

### Model Description

Hibiki is a decoder-only model for simultaneous speech translation. It leverages the multistream architecture of [Moshi](https://arxiv.org/abs/2410.00037) to model source and target speech jointly, which allows it to continuously process the input stream while generating the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz, yielding a continuous output audio stream along with timestamped text translation. Since Hibiki relies on simple temperature sampling, it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's voice transfer can be controlled by changing the coefficient of the classifier-free guidance: a larger coefficient increases voice similarity, but an excessive coefficient can degrade translation quality. A generic sketch of this sampling scheme is given after the getting-started section below.

- **Developed by:** Kyutai
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation.
- **Language(s) (NLP):** French-to-English
- **License:** CC-BY

### Model Sources

- **Repository:** [repo](https://github.com/kyutai-labs/hibiki)
- **Paper:** [paper](https://arxiv.org/abs/2502.03382)
- **Examples:** [demo](https://hf.co/spaces/kyutai/hibiki-samples)

## Uses

### Direct Use

The model can be used for streaming translation from French to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.

### Downstream Use

Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems; an illustrative snippet is included after the getting-started section below. As for the main Hibiki architecture, supporting other language pairs would require finetuning.

### Out-of-Scope Use

The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.

## How to Get Started with the Model

See the main [README](https://github.com/kyutai-labs/hibiki) file.
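
The snippet below is only a minimal, unofficial sketch: it fetches a released checkpoint from the Hugging Face Hub with `huggingface_hub`, while the actual streaming-inference entry points (PyTorch, MLX, Rust) are described in the repository README. The repository id used here is a placeholder and should be replaced with the id of the checkpoint you intend to run.

```python
# Minimal sketch: fetch a Hibiki checkpoint locally before following the
# inference instructions in https://github.com/kyutai-labs/hibiki.
from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the checkpoint listed in the Hibiki README.
REPO_ID = "kyutai/hibiki-2b-pytorch-bf16"

local_dir = snapshot_download(repo_id=REPO_ID)
print(f"Checkpoint files downloaded to: {local_dir}")
```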
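
As a generic illustration of the sampling scheme described in the model description (not Hibiki's actual implementation), the sketch below combines classifier-free guidance with temperature sampling over a batch of logits; the `cfg_coef` and `temperature` values are arbitrary.

```python
import torch

def cfg_temperature_sample(cond_logits: torch.Tensor,
                           uncond_logits: torch.Tensor,
                           cfg_coef: float = 3.0,
                           temperature: float = 0.8) -> torch.Tensor:
    """Sample one token per batch element with classifier-free guidance.

    A larger ``cfg_coef`` pushes the distribution further away from the
    unconditional prediction (stronger voice transfer in Hibiki's case),
    at the risk of degrading output quality when taken too far.
    """
    # Guided logits: move away from the unconditional prediction.
    logits = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Batched usage with random logits (batch of 4, vocabulary of 2048).
cond = torch.randn(4, 2048)
uncond = torch.randn(4, 2048)
tokens = cfg_temperature_sample(cond, uncond)  # shape: (4, 1)
```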
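
The following sketch shows the Mimi codec mentioned under Downstream Use, loaded through the `transformers` integration of the publicly released `kyutai/mimi` checkpoint; whether that public checkpoint matches the exact codec weights shipped with Hibiki is an assumption, and a sine wave stands in for real speech.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Publicly released Mimi codec weights (assumed equivalent to Hibiki's codec).
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of a 440 Hz sine wave as a stand-in for real speech (24 kHz).
sr = feature_extractor.sampling_rate
t = np.linspace(0.0, 1.0, num=sr, endpoint=False)
audio = 0.1 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

inputs = feature_extractor(raw_audio=audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes   # discrete audio tokens
    reconstruction = model.decode(codes).audio_values          # reconstructed waveform

print(codes.shape, reconstruction.shape)
```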
## Training Details

### Training Data

- Textual data: the underlying [Helium](https://huggingface.co/kyutai/helium-1-preview-2b) model is trained on a mix of data including: Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
- Audio data
  - **Unsupervised audio dataset:** used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037).
  - **Synthetic translation dataset:** around 40k hours of parallel French-English data synthesized with *contextual alignment* (see [Section 3.2](https://arxiv.org/pdf/2502.03382)) with various levels of speaker similarity.
  - **Translation finetuning:** a 900-hour mixture of a resynthesized version of [CVSS-T](https://github.com/google-research-datasets/cvss) and synthetic long-form utterances.

### Training Procedure and Hyper-Parameters

The different stages of the training procedure are detailed in the paper, along with the hyper-parameters.

### Compute Infrastructure

The final model was trained on 48 Nvidia H100 GPUs.

## Citation

```
@misc{labiausse2025hibiki,
      title={High-Fidelity Simultaneous Speech-To-Speech Translation},
      author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
      year={2025},
      eprint={2502.03382},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.03382},
}
```

## Model Card Authors

Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour