Kokoro TTS

Kokoro is a frontier TTS model for its size: just 82 million parameters (text in, audio out). This repository contains ONNX weights exported from hexgrad/Kokoro-82M.

Table of contents

- Usage
  - JavaScript
  - Python
- Voices/Samples
- Quantizations

Usage

JavaScript

First, install the kokoro-js library from NPM using:

npm i kokoro-js

You can then generate speech as follows:

import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");

Python

import os
import numpy as np
from onnxruntime import InferenceSession

# You can generate token ids as follows:
#   1. Convert input text to phonemes using https://github.com/hexgrad/misaki
#   2. Map phonemes to ids using https://huggingface.co/hexgrad/Kokoro-82M/blob/785407d1adfa7ae8fbef8ffd85f34ca127da3039/config.json#L34-L148
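# For example (a rough sketch, assuming misaki's en.G2P interface; verify the exact
# API against the repositories linked above before relying on it):
#   import json
#   from misaki import en
#   g2p = en.G2P(trf=False, british=False, fallback=None)
#   phonemes, _ = g2p("Life is like a box of chocolates. You never know what you're gonna get.")
#   vocab = json.load(open('config.json'))['vocab']  # phoneme -> id map from the Kokoro config
#   tokens = [vocab[p] for p in phonemes if p in vocab]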
tokens = [50, 157, 43, 135, 16, 53, 135, 46, 16, 43, 102, 16, 56, 156, 57, 135, 6, 16, 102, 62, 61, 16, 70, 56, 16, 138, 56, 156, 72, 56, 61, 85, 123, 83, 44, 83, 54, 16, 53, 65, 156, 86, 61, 62, 131, 83, 56, 4, 16, 54, 156, 43, 102, 53, 16, 156, 72, 61, 53, 102, 112, 16, 70, 56, 16, 138, 56, 44, 156, 76, 158, 123, 56, 16, 62, 131, 156, 43, 102, 54, 46, 16, 102, 48, 16, 81, 47, 102, 54, 16, 54, 156, 51, 158, 46, 16, 70, 16, 92, 156, 135, 46, 16, 54, 156, 43, 102, 48, 4, 16, 81, 47, 102, 16, 50, 156, 72, 64, 83, 56, 62, 16, 156, 51, 158, 64, 83, 56, 16, 44, 157, 102, 56, 16, 44, 156, 76, 158, 123, 56, 4]

# Context length is 512, but leave room for the pad token 0 at the start & end
assert len(tokens) <= 510, len(tokens)

# Style vector based on len(tokens), ref_s has shape (1, 256)
voices = np.fromfile('./voices/af.bin', dtype=np.float32).reshape(-1, 1, 256)
ref_s = voices[len(tokens)]

# Add the pad id 0 at the start & end; shape is now (1, <=512)
tokens = np.array([[0, *tokens, 0]], dtype=np.int64)

model_name = 'model.onnx' # Options: model.onnx, model_fp16.onnx, model_quantized.onnx, model_q8f16.onnx, model_uint8.onnx, model_uint8f16.onnx, model_q4.onnx, model_q4f16.onnx
sess = InferenceSession(os.path.join('onnx', model_name))

audio = sess.run(None, dict(
    input_ids=tokens,
    style=ref_s,
    speed=np.ones(1, dtype=np.float32),
))[0]

Optionally, save the audio to a file:

import scipy.io.wavfile as wavfile
wavfile.write('audio.wav', 24000, audio[0])
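
You can also play the waveform directly, for example with the third-party sounddevice package (a sketch, assuming sounddevice is installed and system audio is available):

import sounddevice as sd

sd.play(audio[0], samplerate=24000)  # audio[0] is a 1-D float32 waveform at 24 kHz
sd.wait()                            # block until playback finishes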

Voices/Samples

Sample text: "Life is like a box of chocolates. You never know what you're gonna get."

Name Nationality Gender
af_heart American Female
af_alloy American Female
af_aoede American Female
af_bella American Female
af_jessica American Female
af_kore American Female
af_nicole American Female
af_nova American Female
af_river American Female
af_sarah American Female
af_sky American Female
am_adam American Male
am_echo American Male
am_eric American Male
am_fenrir American Male
am_liam American Male
am_michael American Male
am_onyx American Male
am_puck American Male
am_santa American Male
bf_alice British Female
bf_emma British Female
bf_isabella British Female
bf_lily British Female
bm_daniel British Male
bm_fable British Male
bm_george British Male
bm_lewis British Male

Quantizations

The model is resilient to quantization, enabling efficient high-quality speech synthesis at a fraction of the original model size.

Sample text: "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."

Model Size (MB)
model.onnx (fp32) 326
model_fp16.onnx (fp16) 163
model_quantized.onnx (8-bit) 92.4
model_q8f16.onnx (Mixed precision) 86
model_uint8.onnx (8-bit & mixed precision) 177
model_uint8f16.onnx (Mixed precision) 114
model_q4.onnx (4-bit matmul) 305
model_q4f16.onnx (4-bit matmul & fp16 weights) 154
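
For example, to run the mixed-precision 8-bit variant with the Python snippet above, only the model filename changes; the rest of the pipeline stays the same:

model_name = 'model_q8f16.onnx'  # ~86 MB, versus 326 MB for fp32
sess = InferenceSession(os.path.join('onnx', model_name))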