---
license: apache-2.0
language:
- en
base_model:
- yl4579/StyleTTS2-LJSpeech
pipeline_tag: text-to-speech
---

# This repository is a clone of the original Kokoro v0.19 repository with the following modifications:

1. Removed the `munch` dependency.
2. Removed the `phonemizer` dependency; the code now calls `espeak` directly.
   * Phonemization behaves identically to the original.
   * The espeak files must be available on the system PATH or in the same directory as `kokoro.py`:
     * The files Windows users need are included in this repository (`espeak-ng.exe`, `libespeak-ng.dll`, and `espeak-ng-data`); [users on other platforms can get equivalent files here](https://github.com/espeak-ng/espeak-ng).
3. Added an `expand_acronym()` function to `kokoro.py` to improve pronunciation (example: "NASA" → "N. A. S. A.").
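
Modifications 2 and 3 can be pictured with a minimal sketch. This is not the code in `kokoro.py`: the helper names (`find_espeak`, `build_espeak_command`, `phonemize_sketch`, `expand_acronym_sketch`), the lookup order, and the acronym rule (any run of two or more capital letters) are assumptions made for illustration. The `-q`, `--ipa`, and `-v` options are standard `espeak-ng` command-line flags.

```python
import re
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def find_espeak(script_dir: Path) -> Optional[str]:
    """Look for espeak-ng next to the script first, then on the system PATH."""
    for name in ("espeak-ng.exe", "espeak-ng"):
        candidate = script_dir / name
        if candidate.is_file():
            return str(candidate)
    return shutil.which("espeak-ng")


def build_espeak_command(executable: str, text: str, voice: str = "en-us") -> List[str]:
    """Build the argument list: -q (no audio), --ipa (print IPA), -v (voice)."""
    return [executable, "-q", "--ipa", "-v", voice, text]


def phonemize_sketch(text: str, script_dir: Path, voice: str = "en-us") -> str:
    """Run espeak-ng and return its IPA output (hypothetical helper)."""
    executable = find_espeak(script_dir)
    if executable is None:
        raise FileNotFoundError("espeak-ng not found next to the script or on PATH")
    result = subprocess.run(
        build_espeak_command(executable, text, voice),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()


def expand_acronym_sketch(text: str) -> str:
    """Spell out runs of 2+ capital letters, e.g. "NASA" becomes "N. A. S. A."."""
    def spell(match: re.Match) -> str:
        return " ".join(f"{letter}." for letter in match.group(0))
    return re.sub(r"\b[A-Z]{2,}\b", spell, text)
```

Keeping the command construction separate from the subprocess call makes the flag list easy to inspect without espeak installed. The real `expand_acronym()` may use different rules, for example an exception list for acronyms that are pronounced as words.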

# Reduction of Dependencies

The original v0.19 repository required roughly 10 or more dependencies.<br>
Kokoro [Version 1.0](https://huggingface.co/hexgrad/Kokoro-82M) now additionally requires the custom `misaki` dependency, which pulls in approximately <u><strong><span style="font-size:120%">80 additional dependencies</span></strong></u>.
  * The authors are clearly working to perfect phonemization and preparing to support numerous languages; both are great goals.
  * IMHO, however, if the v1.0 model is taken as the "gold standard" at 100% quality, the v0.19 model sits at about 98%.

# The difference of 2% does not justify 80+ dependencies; therefore, this repository exists.

| Version | Additional Dependencies |
|---------|-------------|
| This Repository (based on Kokoro v0.19) | - |
| [Original Kokoro v0.19](https://huggingface.co/hexgrad/kLegacy) | ~10+ additional |
| [Kokoro v1.0](https://huggingface.co/hexgrad/Kokoro-82M) | ~80 additional |

A side effect is that this repository still supports only American English and British English, but if that is all you need, avoiding ~80 additional dependencies is worth it.

# Installation Instructions
1. Download this repository
2. Create a virtual environment, activate it, and pip install a `torch` version for either [CPU](https://download.pytorch.org/whl/torch/) or [CUDA](https://download.pytorch.org/whl/cu124/torch/).
  * Example (CPU wheel for Python 3.11 on 64-bit Windows):
```shell
pip install https://download.pytorch.org/whl/cpu/torch-2.5.1%2Bcpu-cp311-cp311-win_amd64.whl#sha256=81531d4d5ca74163dc9574b87396531e546a60cceb6253303c7db6a21e867fdf
```
3. ```pip install scipy numpy==1.26.4 transformers fsspec==2024.9.0```
4. ```pip install sounddevice``` (if you intend to use my example script below; otherwise, install a similar library)
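
After the steps above, a quick check can confirm that the packages landed in the active virtual environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the function name `installed_versions` is hypothetical, and the package list is copied from the installation steps.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Package names taken from the installation steps above.
REQUIRED = ("torch", "scipy", "numpy", "transformers", "fsspec", "sounddevice")


def installed_versions(packages: Iterable[str] = REQUIRED) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

This only reports what is installed; it does not verify that the pinned versions (`numpy==1.26.4`, `fsspec==2024.9.0`) match.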

<details><summary>GRAND TOTAL OF DEPENDENCIES SHOULD LOOK SOMETHING LIKE THIS</summary>


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64e28ea5c21dd0c666d7a25f/8z5cZC4B3YYQEhw4DU47Y.png)

</details><br>

# Basic Usage

<details><summary>EXAMPLE SCRIPT USING CPU</summary>

```python
import sys
import os
from pathlib import Path
import queue
import threading
import re
import logging

REPO_PATH = r"D:\Scripts\bench_tts\hexgrad--Kokoro-82M_original"

sys.path.append(REPO_PATH)

import torch
import warnings
from models import build_model
from kokoro import generate, generate_full, phonemize
import sounddevice as sd

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

VOICES = [
   'af',        # Default voice (50-50 mix of Bella & Sarah)
   'af_bella',  # Female voice "Bella"
   'af_sarah',  # Female voice "Sarah"
   'am_adam',   # Male voice "Adam"
   'am_michael',# Male voice "Michael"
   'bf_emma',   # British Female "Emma"
   'bf_isabella',# British Female "Isabella"
   'bm_george', # British Male "George"
   'bm_lewis',  # British Male "Lewis"
   'af_nicole', # Female voice "Nicole"
   'af_sky'     # Female voice "Sky"
]

class KokoroProcessor:
   def __init__(self):
       self.sentence_queue = queue.Queue()
       self.audio_queue = queue.Queue()
       self.stop_event = threading.Event()
       self.model = None
       self.voicepack = None
       self.voice_name = None

   def setup_kokoro(self, selected_voice):
       device = 'cpu'
       # device = 'cuda' if torch.cuda.is_available() else 'cpu'
       print(f"Using device: {device}")

       model_path = os.path.join(REPO_PATH, 'kokoro-v0_19.pth')
       voices_path = os.path.join(REPO_PATH, 'voices')

       try:
           if not os.path.exists(model_path):
               raise FileNotFoundError(f"Model file not found at {model_path}")
           if not os.path.exists(voices_path):
               raise FileNotFoundError(f"Voices directory not found at {voices_path}")
           
           self.model = build_model(model_path, device)
           
           voicepack_path = os.path.join(voices_path, f'{selected_voice}.pt')
           self.voicepack = torch.load(voicepack_path, weights_only=True).to(device)
           self.voice_name = selected_voice
           print(f'Loaded voice: {selected_voice}')
           
           return True
           
       except Exception as e:
           print(f"Error during setup: {str(e)}")
           return False

   def generate_speech_for_sentence(self, sentence):
       try:
           # Basic generation (default settings)
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0])

           # Speed modifications (uncomment to test)
           # Slower speech
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.8)

           # Faster speech
           audio, phonemes = generate_full(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.3)

           # Very slow speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.5)

           # Very fast speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.8)

           # Force American accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='a', speed=1.0)

           # Force British accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='b', speed=1.0)

           return audio

       except Exception as e:
           print(f"Error generating speech for sentence: {str(e)}")
           print(f"Error type: {type(e)}")
           import traceback
           traceback.print_exc()
           return None

   def process_sentences(self):
       while not self.stop_event.is_set():
           try:
               sentence = self.sentence_queue.get(timeout=1)
               if sentence is None:
                   self.audio_queue.put(None)
                   break

               print(f"Processing sentence: {sentence}")
               audio = self.generate_speech_for_sentence(sentence)
               if audio is not None:
                   self.audio_queue.put(audio)

           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in process_sentences: {str(e)}")
               continue

   def play_audio(self):
       while not self.stop_event.is_set():
           try:
               audio = self.audio_queue.get(timeout=1)
               if audio is None:
                   break
                   
               sd.play(audio, 24000)
               sd.wait()
               
           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in play_audio: {str(e)}")
               continue

   def process_and_play(self, text):
       sentences = [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]

       process_thread = threading.Thread(target=self.process_sentences)
       playback_thread = threading.Thread(target=self.play_audio)
       
       process_thread.daemon = True
       playback_thread.daemon = True
       
       process_thread.start()
       playback_thread.start()

       for sentence in sentences:
           self.sentence_queue.put(sentence)

       self.sentence_queue.put(None)
       # Wait for both worker threads to finish before signalling stop.
       process_thread.join()
       playback_thread.join()

       self.stop_event.set()

def main():
   # Default voice selection
   VOICE_NAME = VOICES[0]  # 'af' - Default voice (Bella & Sarah mix)
   
   # Alternative voice selections (uncomment to test)
   #VOICE_NAME = VOICES[1]  # 'af_bella' - Female American
   #VOICE_NAME = VOICES[2]  # 'af_sarah' - Female American
   #VOICE_NAME = VOICES[3]  # 'am_adam' - Male American
   #VOICE_NAME = VOICES[4]  # 'am_michael' - Male American
   #VOICE_NAME = VOICES[5]  # 'bf_emma' - Female British
   #VOICE_NAME = VOICES[6]  # 'bf_isabella' - Female British
   VOICE_NAME = VOICES[7]  # 'bm_george' - Male British (active selection; overrides the default above)
   # VOICE_NAME = VOICES[8]  # 'bm_lewis' - Male British
   #VOICE_NAME = VOICES[9]  # 'af_nicole' - Female American
   #VOICE_NAME = VOICES[10] # 'af_sky' - Female American

   processor = KokoroProcessor()
   if not processor.setup_kokoro(VOICE_NAME):
       return
   
   # test_text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
   # test_text = "This 2022 Edition of Georgia Juvenile Practice and Procedure is a complete guide to handling cases in the juvenile courts of Georgia. This handy, yet thorough, manual incorporates the revised Juvenile Code and makes all Georgia statutes and major cases regarding juvenile proceedings quickly accessible. Since last year's edition, new material has been added and/or existing material updated on the following subjects, among others:"
   # test_text = "See Ga. Code § 3925 (1863), now O.C.G.A. § 9-14-2; Ga. Code § 1744 (1863), now O.C.G.A. § 19-7-1; Ga. Code § 1745 (1863), now O.C.G.A. § 19-9-2; Ga. Code § 1746 (1863), now O.C.G.A. § 19-7-4; and Ga. Code § 3024 (1863), now O.C.G.A. § 19-7-4. For a full discussion of these provisions, see 27 Emory L. J. 195, 225–230, 232–233, 236–238 (1978). Note, however, that the journal article refers to the section numbers of the Code of 1910."

   # test_text = "It is impossible to understand modern juvenile procedure law without an appreciation of some fundamentals of historical development. The beginning point for study is around the beginning of the seventeenth century, when the pater patriae concept first appeared in English jurisprudence. As 'father of the country,' the Crown undertook the duty of caring for those citizens who were unable to care for themselves—lunatics, idiots, and, ultimately, infants. This concept, which evolved into the parens patriae doctrine, presupposed the Crown's power to intervene in the parent-child relationship in custody disputes in order to protect the child's welfare and, ultimately, to deflect a delinquent child from a life of crime. The earliest statutes premised upon the parens patriae doctrine concerned child custody matters. In 1863, when the first comprehensive Code of Georgia was enacted, two courts exercised some jurisdiction over questions of child custody: the superior court and the court of the ordinary (now probate court). In essence, the draftsmen of the Code simply compiled what was then the law as a result of judicial decisions and statutes. The Code of 1863 contained five provisions concerning the parent-child relationship: Two concerned the jurisdiction of the superior court and courts of ordinary in habeas corpus and forfeiture of parental rights actions, and the remaining three concerned the guardianship jurisdiction of the court of the ordinary"

   # test_text = "You are a helpful British butler who clearly and directly answers questions in a succinct fashion based on contexts provided to you. If you cannot find the answer within the contexts simply tell me that the contexts do not provide an answer. However, if the contexts partially address a question you answer based on what the contexts say and then briefly summarize the parts of the question that the contexts didn't provide an answer to.  Also, you should be very respectful to the person asking the question and frequently offer traditional butler services like various fancy drinks, snacks, various butler services like shining of shoes, pressing of suits, and stuff like that. Also, if you can't answer the question at all based on the provided contexts, you should apologize profusely and beg to keep your job.  Lastly, it is essential that if there are no contexts actually provided it means that a user's question wasn't relevant and you should state that you can't answer based off of the contexts because there are none.  And it goes without saying you should refuse to answer any questions that are not directly answerable by the provided contexts.  Moreover, some of the contexts might not have relevant information and you should simply ignore them and focus on only answering a user's question.  I cannot emphasize enough that you must gear your answer towards using this program and base your response off of the contexts you receive."
   test_text = "According to OCGA § 15-11-145(a), the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care. However, if the 72-hour time frame expires on a weekend or legal holiday, the hearing should be held on the next business day that is not a weekend or holiday."

   processor.process_and_play(test_text)

if __name__ == "__main__":
   main()
```
</details>
<br>

# Below is the original model card.

<details><summary>ORIGINAL MODEL CARD</summary>

🚨 **This repository is undergoing maintenance.**

✨ Model v1.0 release is underway! Things are not yet finalized, but you can start [using v1.0 now](https://huggingface.co/hexgrad/Kokoro-82M#usage).

✨ You can now [`pip install kokoro`](https://pypi.org/project/kokoro/), a dedicated inference library: https://github.com/hexgrad/kokoro

✨ You can also [`pip install misaki`](https://pypi.org/project/misaki/), a G2P library designed for Kokoro: https://github.com/hexgrad/misaki

♻️ You can access old files for v0.19 at https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19

❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy

### Kokoro is getting an upgrade!

| Model | Date | Training Data | A100 80GB vRAM | GPU Cost | Released Voices | Released Langs |
| ----- | ---- | ------------- | -------------- | -------- | --------------- | -------------- |
| v0.19 | 2024 Dec 25 | <100h | 500 hrs | $400 | 10 | 1 |
| v1.0 | 2025 Jan 27 | Few hundred hrs | 1000 hrs | $1000 | [26+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | ? |

### Usage

The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
```py
# 1️⃣ Install kokoro
!pip install -q kokoro soundfile
# 2️⃣ Install espeak, used for out-of-dictionary fallback
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip espeak installation, but OOD words will be skipped unless you provide a fallback

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # make sure lang_code matches voice

# The following text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''

# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(
    text, voice='af_bella',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
```
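
The comment "make sure lang_code matches voice" relies on a naming pattern that can be sketched as below. The pattern is inferred from the English voice names listed in this README (`af_bella`, `am_adam`, `bf_emma`, `bm_george`, and the bare `af`), where the first letter is the language code and the second is the gender. The helper `describe_voice` is hypothetical and handles only the `a` and `b` codes.

```python
LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_name: str) -> dict:
    """Split a voice name such as 'bf_emma' into language code, gender, and speaker."""
    prefix, _, speaker = voice_name.partition("_")
    if len(prefix) != 2 or prefix[0] not in LANGUAGES or prefix[1] not in GENDERS:
        raise ValueError(f"Unrecognized voice name: {voice_name!r}")
    return {
        "lang_code": prefix[0],
        "language": LANGUAGES[prefix[0]],
        "gender": GENDERS[prefix[1]],
        "speaker": speaker or None,  # None for a bare prefix such as 'af'
    }
```

For example, `describe_voice("bf_emma")["lang_code"]` gives the value to pass as `lang_code`, which is the same idea as `lang=self.voice_name[0]` in the v0.19 example script.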

### Model Facts

**Architecture:**
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release

**Architected by:** Li et al @ https://github.com/yl4579/StyleTTS2

**Trained by**: `@rzvzn` on Discord

**Supported Languages:** American English, British English

**Model SHA256 Hash:** `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`

### Training Details

**Compute:** About $1000 for 1000 hours of A100 80GB vRAM

**Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc
- Synthetic audio<sup>[1]</sup> generated by closed<sup>[2]</sup> TTS models from large providers<br/>
[1] https://copyright.gov/ai/ai_policy_guidance.pdf<br/>
[2] No synthetic audio from open TTS models or "custom voice clones"

**Total Dataset Size:** A few hundred hours of audio

### Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| ---------- | ------------- | ------- | --------------------------- |
| [Koniwa](https://github.com/koniwa/koniwa) `tnc` | <1h | [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/deed.ja) | v0.19 / 22 Nov 2024 |
| [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) | <11h | [CC BY 4.0](https://datashare.ed.ac.uk/bitstream/handle/10283/2353/license_text) | v0.19 / 22 Nov 2024 |

<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />

</details>