Adding defined period pauses to the input text file

#61
by vijay120 - opened

Is there a way to add pauses to the text file so that the TTS will pause for "x" seconds before continuing with the next sentence? Currently I need to create separate text files and merge the TTS output manually with an "x"-second pause.

I second this. It would be awesome to:

  1. have the ability to add pauses in a format like [BREAK=3] (for 3 seconds) or something like that
  2. control the delay between sentences; right now it's too fast
  3. add other tags for filler words, etc.

I had the same issue, and my approach was to handle it manually. It works with either torch tensors or numpy arrays, though the syntax differs slightly:

  1. I generate the speech segment. Kokoro returns segments as numpy arrays, which I convert to torch tensors, but this conversion isn't necessary.
  2. I manually create a silence. Since Kokoro's audio sample rate is 24000 Hz, a 3-second silence can be generated with
    silence = torch.zeros(1, 3*24000). With numpy it is the same, using np.zeros instead of torch.zeros, except that the shape must be passed as a tuple: np.zeros((1, 3*24000)).
  3. Then I continue generating all the segments I need, and all these segments, both speech and silence, are appended to a Python list.
  4. When generation is finished, I concatenate all segments along the time axis with torch.cat() or np.concatenate() (dim=1 / axis=1 for arrays shaped (1, N)).
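For anyone who wants to try this, here is a minimal numpy sketch of the steps above. It assumes each speech segment is a 1-D mono float array at 24000 Hz; the helper names make_silence and join_with_pauses are made up for illustration, and the placeholder arrays stand in for real model output.

```python
import numpy as np

SAMPLE_RATE = 24000  # Kokoro's output sample rate, as mentioned above


def make_silence(seconds, sample_rate=SAMPLE_RATE):
    """Return a 1-D float32 array of zeros lasting `seconds` seconds."""
    return np.zeros(int(round(seconds * sample_rate)), dtype=np.float32)


def join_with_pauses(segments, pause_seconds, sample_rate=SAMPLE_RATE):
    """Concatenate 1-D audio arrays, inserting silence between consecutive ones."""
    pieces = []
    for i, segment in enumerate(segments):
        if i > 0:
            pieces.append(make_silence(pause_seconds, sample_rate))
        pieces.append(np.asarray(segment, dtype=np.float32))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)


# Placeholder "speech" segments; real ones would come from the TTS model.
first = np.full(SAMPLE_RATE, 0.1, dtype=np.float32)        # 1 second
second = np.full(SAMPLE_RATE // 2, 0.1, dtype=np.float32)  # 0.5 seconds
audio = join_with_pauses([first, second], pause_seconds=3)
```

The resulting array can then be written to disk with whatever audio writer you already use, at the same 24000 Hz sample rate.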

However, it would be great to have a list of special tokens to do this kind of thing with the model itself: not only pauses but also laughs, emotion, etc.

Anyway, I hope this is useful for the task you're describing :)

I found that a series of punctuation marks inserts a pause. However, if you use too many, there is a weird noise like a breath, which sounded very unnatural for the voice I was using. Insert the following in your text:
, . , . , . , .

I implemented defined period pause functionality here: https://github.com/vijay120/kokoro-tts?tab=readme-ov-file#input-file-formatting

You just have to provide an input file like this:

Welcome to the presentation
PAUSE_2.5
This text comes after a 2.5 second pause
PAUSE_10
And this comes after a 10 second pause

And it will output an audio file with the appropriate pauses between the audio for the defined number of seconds. Let me know if this works for y'all.
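In case it helps to see how a file in this format could be read, here is a rough sketch of a parser. It is not the code from the linked repository; the function name parse_script and the exact matching rule (a line consisting only of PAUSE_ followed by a number) are assumptions made for illustration.

```python
import re

# Assumed rule: a pause marker is a whole line of the form PAUSE_<number>.
PAUSE_LINE = re.compile(r"^PAUSE_(\d+(?:\.\d+)?)$")


def parse_script(text):
    """Split input text into ("speech", str) and ("pause", float) items."""
    items = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        match = PAUSE_LINE.match(line)
        if match:
            items.append(("pause", float(match.group(1))))
        else:
            items.append(("speech", line))
    return items


script = """Welcome to the presentation
PAUSE_2.5
This text comes after a 2.5 second pause
PAUSE_10
And this comes after a 10 second pause"""

items = parse_script(script)
```

Each "speech" item would then be sent to the TTS model, and each "pause" item turned into a block of silence of that many seconds before everything is concatenated.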
