Can I replace the Inference Widget backend on this model page with a Gradio Spaces API?
While I fully support Replicate's right to monetize Kokoro under the Apache 2.0 license, I take issue when it's stuck back onto the original model page with no option to modify that Inference Widget on my end.
If the point is a widget for people to play with, I'm happy to hook it into a Gradio Spaces API endpoint, if only I were allowed to.
Maybe I missed it, but I don't think Microsoft forces "deploy this Docker container on Azure" onto GitHub project pages.
cc @julien-c @reach-vb @celinah because of https://huggingface.co/posts/hexgrad/219731859025675#679b52dab5bb9d891de1b4d4
hey @hexgrad, if you're strongly opposed to this, we can hardcode and turn off the Replicate widget on your model page.
But I do think it's an easy way for community members to try out the model, and our intent is for it to be a net positive for the community.
> I'm happy to hook it into a Gradio Spaces API endpoint, if only I were allowed to.
Interesting idea, we can look into it in the future 💡
The current Inference Widget appears to be pointed at the old v0.19 model with the voice set to `af_bella`, which is not the same as the v1.0 model described to the left of the widget. We can tell the Replicate endpoint uses v0.19 because that version uses `espeak-ng` directly, which I know fails on the words wisest, shogun, and matcha (among others):
The wisest shogun loves to drink matcha.
The v1.0 model + pipeline with `kokoro==0.7.8` still fails on wisest, because that word is OOD and falls back to `espeak-ng`. But shogun and matcha are correct. Here is `af_heart`:
ðə wˈIsɪst ʃˈOɡən lˈʌvz tə dɹˈɪŋk mˈɑʧə.
I'm about to go push out a quick `s => z` fix for wisest, so in `kokoro>=0.7.9` it should get everything correct:
ðə wˈIzɪst ʃˈOɡən lˈʌvz tə dɹˈɪŋk mˈɑʧə.
And that fix can be immediately deployed to the Gradio Space at https://hf.co/spaces/hexgrad/Kokoro-TTS with a 1-character diff like: https://hf.co/spaces/hexgrad/Kokoro-TTS/commit/ebacc6134ef372337bc8ed4d0b94b7f073a4a8b2
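For anyone who wants to sanity-check this locally, here is a minimal sketch of driving the v1.0 pipeline from Python; it assumes `kokoro>=0.7.9` and `soundfile` are installed, and the exact generator signature may differ slightly between versions:

```python
# Minimal sketch: reproduce the corrected phoneme output above with the v1.0 pipeline.
# Assumes kokoro>=0.7.9 and soundfile are installed.
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English
text = "The wisest shogun loves to drink matcha."

# The pipeline yields (graphemes, phonemes, audio) per segment,
# so the G2P output can be inspected directly.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    print(phonemes)                              # expect the corrected phonemes above
    sf.write(f'segment_{i}.wav', audio, 24000)   # 24 kHz output
```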
This is not the first G2P error and it won't be the last, but I can't bubble these patches to the Inference Widget, because it points to a third-party API that isn't pinned to the latest `kokoro` and `misaki` packages.
As for speed and price, Replicate has historically done a poor job of serving TTS models:
- Third-party benchmarks by ArtificialAnalysis show a dismal speed factor for StyleTTS 2, the worst among analyzed TTS models: https://artificialanalysis.ai/text-to-speech/model-family/styletts#speed
- This isn't just one "bad day": if you scroll down to "Characters Per Second, Variance" and "Characters per Second, Over Time" you'll note that all the speed numbers have been rock bottom for months. STTS2 models have known and reproducible real-time factors (RTFs) that Replicate is falling way short of.
- Price for STTS2 is computed at $2.84 per 1M characters. Although this is the lowest among analyzed TTS models, the price makes no sense given the parameter count of STTS2 (it should easily be sub-$1, maybe an order of magnitude (OOM) lower), and suggests they are either gouging or running an unoptimized stack. Maybe you could justify a premium like that by delivering blazing-fast speed, but they are not: https://artificialanalysis.ai/text-to-speech/model-family/styletts#price
- It's not just STTS2; if you look at the same pages, you can see they have been serving OpenVoice just as badly on speed and price (again for months, not just one bad day): https://artificialanalysis.ai/text-to-speech/model-family/openvoice#speed
Replicate's listed pricing for TTS models is also opaque, so ArtificialAnalysis likely had to compute that price per 1M characters by spending money. The industry standard for TTS is to list a fixed rate per 1M (or 1K) input characters, which Replicate does not do.
Just now, I went and looked at my Billing: what I believe was far less than 1,000 characters of playing around with the Kokoro inference widget is already showing $0.33 worth of inference credits used, a rate of over $330 per 1M characters. Absolutely ridiculous. 33 cents should easily get you hundreds of thousands of characters from Kokoro, so this is multiple OOMs off. Again, I'll assume incompetence, not malice, is the more likely explanation for these numbers.
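For clarity, the implied rate above is just this arithmetic (the ~1,000-character figure is my own generous upper bound, not an exact count):

```python
# Rough implied rate from my own billing: $0.33 of credits for what was,
# at most, ~1,000 characters of widget usage (my estimate, not an exact count).
credits_used_usd = 0.33
chars_used = 1_000  # upper bound; the real number was lower

implied_rate_per_1m = credits_used_usd / chars_used * 1_000_000
print(f"~${implied_rate_per_1m:.0f} per 1M characters")  # ~$330, and that's the floor
```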
Given all this, I have no trust in Replicate to serve this class of TTS models even remotely well. I realize there is value in hosted inference APIs, but I don't think it's good for anyone if the performance and price are fumbled this badly. People who are able & willing would likely be far better served either running Kokoro locally or deploying a Docker container to bare metal, such as @Remsky's https://github.com/remsky/Kokoro-FastAPI
I understand what the Apache 2.0 license entails in terms of commercial use, but I think there is a difference between:
- A 3rd-party paid API hosted in their own corner of the internet, totally fair game and they can do what they want
- Having it forced back onto your model page, where their shortcomings now reflect poorly on you. You hope things are done well; otherwise people assume a 45+ second cold start is your model's generation time, and that dollars per million characters is your model's cost to run. For the record, the bare-metal cost of Kokoro is likely on the order of cents per million characters at current market rates. With overhead, margin, and expensive GPUs, you could say at most double-digit cents. Happy to provide more evidence to support that claim if needed (a rough back-of-envelope is sketched below), and technical folks can already verify it.
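For the curious, here is one way to run that back-of-envelope; the GPU price, real-time factor, and speaking rate below are rough assumptions of mine for illustration, not measurements:

```python
# Back-of-envelope for bare-metal serving cost per 1M characters.
# All three inputs are rough assumptions for illustration, not measurements:
gpu_usd_per_hour = 0.60        # assumed on-demand price for a modest inference GPU
realtime_factor = 50           # assumed: Kokoro generates ~50x faster than real time
chars_per_audio_minute = 900   # assumed average speaking rate for English TTS

audio_minutes = 1_000_000 / chars_per_audio_minute   # ~1,111 minutes of audio
gpu_hours = (audio_minutes / realtime_factor) / 60   # ~0.37 GPU-hours
cost_usd = gpu_hours * gpu_usd_per_hour
print(f"~${cost_usd:.2f} per 1M characters")         # ~$0.22 with these assumptions
```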
I agree it's nice to have a way for people to try the model, hence the Gradio API suggestion. Assuming that isn't possible for now, I will probably update the README to more prominently link to the public Gradio Space which people can freely run (and duplicate) to use the model: https://hf.co/spaces/hexgrad/Kokoro-TTS
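For what it's worth, anyone can already hit that Space programmatically with `gradio_client`; the endpoint name and parameters in the commented call below are placeholders, not the Space's confirmed signature, so check `view_api()` first:

```python
# Sketch: calling the public Gradio Space from Python via gradio_client.
# The endpoint name and parameters are NOT the Space's confirmed signature;
# run view_api() first to see what it actually exposes.
from gradio_client import Client

client = Client("hexgrad/Kokoro-TTS")
client.view_api()  # prints the Space's callable endpoints and their arguments

# Hypothetical call shape once the real endpoint is known:
# result = client.predict("The wisest shogun loves to drink matcha.",
#                         api_name="/generate")  # placeholder endpoint name
# print(result)
```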
DeepInfra currently lists a price of $0.80 per 1M characters. I assume and expect there is margin baked in there, since owning and operating a GPU fleet isn't free: totally understandable, fair, and in the ballpark of what science tells us an 82M-parameter TTS model should roughly cost to serve.
Meanwhile, the top Kokoro endpoint on Replicate says "approximately $0.00034 per run," which is (A) totally inaccurate, as shown by the ArtificialAnalysis benchmarks and my informal button-mashing, and (B) vague language that departs from the industry TTS standard of "price per million characters," likely giving them leeway to charge more.