
Post-Training Thoughts

#92
by hexgrad - opened

The main feeling is relief that the model turned out reasonably well, especially after making a relatively brazen call for training data in #21.

At the time, I had strong conviction (but not 100% certainty) that the model would converge to a better spot: the big risks were new languages being introduced in various quantities. In line with that call, voicepacks have been delivered back to their respective contributors and/or open-sourced. 🤝

Huge thanks to everyone who contributed training data and sponsored compute. ❤️

The next training run is not yet scheduled, because the field moves insanely fast and things could change:

  • Word on the street is that Llama 4 🦙4️⃣ will be multimodal. If it's good, and there are small enough versions, it could render Kokoro obsolete, but we will have to see.
  • Amodei says Anthropic might ship Voice. (Fair to assume closed-source.) Depending on quality, that could be added to the list of trainable targets, right next to OpenAI.
  • Who knows, maybe DeepSeek will add Voice to their dartboard.

Now more than ever in AI, there seems to be a Sword of Damocles hanging over every model: a new model can always drop, rendering that one obsolete. That observation should not paralyze development per se, but I think you should have a very clear thesis about what value you are delivering, both in general and in your model's size class, and not just mindlessly set money on fire.

Separately, the buzz surrounding DeepSeek might complicate the calculus for Kokoro. As you may or may not know, Kokoro's training mix relies heavily on synthetic training data as described in the Training Details. At the time, my reasoning chain went something like this:

  • Synthetic data lacks US copyright protection thanks to the monkey selfie and can be mass-produced.
  • Protecting synthetic data and restricting model distillation favors incumbents.
  • With Trump in office and Musk in his corner, Musk and Altman beefing, xAI the challenger to OpenAI the incumbent, I anticipated that synthetic data and model distillation would remain relatively unprotected.

Now, with Stargate announced and DeepSeek making waves, I am less certain about the future of synthetic data and model distillation. (By the way, if it wasn't obvious, I am a US citizen.)

Assuming Kokoro is not made obsolete—in general or in its 82M size class—and OpenAI does not succeed at regulatory capture re: synthetic/distillation, I see at least three potential directions in which Kokoro could improve.

Better G2P

As you may or may not know, Kokoro is relatively small because it relies on external G2P, which can be made relatively fast. The code for Kokoro's G2P can be found here: https://github.com/hexgrad/misaki

For English, the G2P is still good but not great. It is currently a somewhat naive word-based dictionary lookup with espeak-ng fallback. Some phrases might sound less natural, in part because the stress is improperly placed or the phonemes are simply wrong. In large speech models, those stress patterns are learned from thousands, sometimes millions of hours of audio, but Kokoro simply has not been trained on that scale of audio.
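To make the "dictionary lookup with espeak-ng fallback" shape concrete, here is a minimal sketch of that pipeline. The lexicon entries, phoneme strings, and the fallback function are all toy placeholders, not misaki's actual API or data:

```python
# Minimal sketch of a word-level G2P pipeline: look each word up in a
# lexicon first, and fall back to a guesser for out-of-vocabulary words.
# In misaki the fallback role is played by espeak-ng; here it is stubbed.

LEXICON = {
    "kokoro": "kˈOkəɹO",
    "speech": "spˈiʧ",
    "model": "mˈɑdᵊl",
}

def fallback_g2p(word: str) -> str:
    # Stand-in for an espeak-ng call: just mark the word as a guess.
    return f"?{word}?"

def g2p(text: str) -> list[str]:
    phonemes = []
    for word in text.lower().split():
        # Dictionary hit wins; otherwise defer to the fallback.
        phonemes.append(LEXICON.get(word, fallback_g2p(word)))
    return phonemes

print(g2p("Kokoro speech banana"))
```

The weakness described above falls out of this shape directly: a per-word lookup has no view of the surrounding phrase, so sentence-level stress cannot be placed correctly no matter how good the dictionary is.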

Nevertheless, it might be possible to build better G2P systems from text alone. The easiest route would be to pull great open-source G2P solutions off the shelf, where they exist. For data mining, Wiktionary is a so-far-untapped source of G2P data as it relates to misaki. Also, I think I saw somewhere that DeepSeek R1 was trained on over 14 trillion tokens, so the odds are high it knows how to do G2P for many languages. Smarter minds than mine could use structured outputs to build large, high-quality G2P dictionaries in various languages, possibly for phrases in addition to words.
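Any LLM-mined dictionary would need a validation pass before it could be trusted. A hedged sketch of one possible post-processing step — the phoneme inventory, entries, and acceptance rule are all illustrative assumptions, not misaki's:

```python
# Hypothetical cleanup for LLM-generated G2P entries: accept a word only
# if every sampled candidate pronunciation is well-formed (all symbols in
# an allowed phoneme inventory) and all candidates agree exactly.

ALLOWED = set("abdefhijklmnopstuvwzæðŋɑɔəɛɜɡɪɹʃʊʌʒθˈˌ")

def is_valid(pron: str) -> bool:
    return all(ch in ALLOWED for ch in pron)

def build_dictionary(candidates: dict[str, list[str]]) -> dict[str, str]:
    out = {}
    for word, prons in candidates.items():
        valid = [p for p in prons if is_valid(p)]
        # Only accept unanimous, fully well-formed candidate sets.
        if valid and len(set(valid)) == 1 and len(valid) == len(prons):
            out[word] = valid[0]
    return out

raw = {
    "cat": ["kˈæt", "kˈæt"],    # agreement  -> accepted
    "dog": ["dˈɔɡ", "dˈoʊɡ"],   # disagreement -> dropped
    "x?!": ["123"],             # bad symbols  -> dropped
}
print(build_dictionary(raw))
```

Sampling each word several times and keeping only unanimous answers is a crude but cheap consistency filter; disagreements could instead be routed to human review rather than dropped.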

By the way, for those asking "Can you add [this language]": For Kokoro at least, the answer starts with G2P, then training data. G2P and training data are two legs, and you need both to start walking.

Better Training Data

One common criticism of Kokoro is that it sounds flat and boring, or more generously "synthetic and neutral". This is fair and somewhat expected, because most of Kokoro's training data sounds the same. If you scan the "Target Quality" grades in VOICES.md, you will find that almost all are B and below (and some in incredibly small quantities as well). Informally, none of the synthetic data passes my own "audio Turing test", i.e. I can tell it's synthetic, but then again I've listened to a lot of synthetic data at this point.

This might be wishful thinking, but for a next training run I'd like to see A-tier or even S-tier training data. To do that in the realm of synthetics, we would want to juice the next mix with more "reinforcement learned" training data. For English synthetics, that means:

  • More ChatGPT Advanced Voice Mode, especially from $20 and $200 users since they will be using the full undistilled/unquantized AVM
  • Less OpenAI TTS API and less Realtime API, because I believe they are sandbagging that one
  • Less ElevenLabs because I am not sure about their RL regime
  • Maybe some Anthropic if they drop, because I generally respect Anthropic's ability to do RL as a lab

For non-English languages, these large providers often coat outputs with an English accent, which is really undesirable. It is probably best to find native providers, such as grabbing Hailuo AI voices for Chinese. Also, the quantity of non-English training data remains a significant obstacle (in addition to G2P in some cases).

Getting access to exceptional and permissive human training data is not impossible, but probably not feasible in the realm of what can be Apache open-sourced.

Better Architecture

If you don't already know, Kokoro uses a StyleTTS2 architecture: https://github.com/yl4579/StyleTTS2

That paper was published back in 2023, which is ancient in AI timelines. Two key design principles underpinning Kokoro are somewhat generic and could in theory be applied to other architectures:

  1. Use external per-language G2P systems to massively shrink the size of the model
  2. Use synthetic training data to effectively distill larger models

Even if sticking with StyleTTS2, there are a number of improvements that could be made, starting with figuring out how to train in FP16/BF16 instead of FP32 to cut costs.
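The main thing that makes FP16 training tricky is gradient underflow, and loss scaling is the standard workaround. A tiny NumPy demonstration of the numeric issue (illustrative values only, not StyleTTS2 code):

```python
import numpy as np

# Why naive FP16 training loses small gradients, and how loss scaling
# recovers them. float16's smallest subnormal is ~6e-8, so anything
# meaningfully below that silently flushes to zero.

grad = 1e-8  # a small but meaningful gradient value

# Cast directly to float16: the value underflows to zero,
# and the weight update is silently lost.
assert np.float16(grad) == 0.0

# Loss scaling: multiply the loss (and hence the gradients) by a large
# constant before casting, then divide the final update by that constant.
scale = 2.0 ** 14
scaled = np.float16(grad * scale)
assert scaled > 0.0
recovered = float(scaled) / scale
print(recovered)  # close to the original 1e-8
```

BF16 sidesteps this particular issue (it keeps FP32's exponent range at the cost of mantissa precision), which is one reason it is often the easier target for a first mixed-precision port.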

In Conclusion

There is no immediate next training run scheduled yet, as we all collectively wait for the geopolitical dust to settle and, in the meantime, build better G2P systems, collect exceptional training data, and explore better architectures.

Kokoro is far from perfect, but I do think it is a step-function improvement in at least one of {cost, speed, quality, license} over any previously available TTS solution.


I just want to say thank you for your work and contribution here; it's great to see your post-training thoughts. TTS is a fast-moving field where each improvement and new tool is a step in the right direction.
Good luck with the future of the project.

You have made a remarkable contribution to the TTS realm! Thank you so much!
