FineTuning for Single Speaker

#6
by skjdhuhsnjd - opened

Hi, I'm new to IndicParler TTS. I'm trying to fine-tune it for a single speaker, but I'm encountering this error: TypeError: 'NoneType' object is not subscriptable.

I suspect the issue might be related to using --feature_extractor_name "parler-tts/dac_44khZ_8kbps" because I couldn't find a feature extractor specifically for IndicParler. I'm a beginner and would appreciate some guidance.

AI4Bharat org

Hi,

We do not train or finetune DAC on Indic Parler TTS data; rather, we use the pretrained one from ylacombe/dac_44khz, so you should be able to use that. That said, AutoProcessor.from_pretrained("ai4bharat/indic-parler-tts", trust_remote_code=True) should also work. I would be able to look into it if you can share a code snippet.
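A minimal sketch of both loading options (assuming a recent transformers install; repo IDs as mentioned above):

import json

from transformers import AutoFeatureExtractor, AutoProcessor

# Option 1: the pretrained DAC feature extractor used with Indic Parler TTS
feature_extractor = AutoFeatureExtractor.from_pretrained("ylacombe/dac_44khz")

# Option 2: the processor bundled with the released checkpoint
processor = AutoProcessor.from_pretrained(
    "ai4bharat/indic-parler-tts", trust_remote_code=True
)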

Thank you for showing interest in Indic Parler TTS.

First of all, thank you so much for your time. I'm using the following script:

!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "ai4bharat/indic-parler-tts-pretrained" \
    --feature_extractor_name "ylacombe/dac_44khz" \
    --description_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --prompt_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --train_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --eval_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true

However, I keep getting this error:
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', ...]' returned non-zero exit status 1

When I use the tokenizer ylacombe/parler-tts-mini-v1-Jenny-colab for both description and prompt, the process completes without errors, but the output audio quality is terrible. You can check the audio samples here: https://wandb.ai/sjahk-/parler-speech/reports/Speech-samples-24-12-20-19-31-39---VmlldzoxMDY3NzI5Mw?accessToken=lmtsm2zj12qoc0nl8os0dgpdgyorvbufbgrqjnzfb1bqmfxmnak35cnxspoo6pgc

Could you please guide me on the appropriate description and prompt tokenizer to use for fine-tuning in Hindi? Thanks in advance!

Any help would mean a lot! I believe the issue might be with the prompt or description tokenizer.

AI4Bharat org

Hi @skjdhuhsnjd ,

Please use the flan-t5-large tokenizer, as that is also our description encoder. It works well for our use case since the descriptions are still in English, and Flan-T5 is instruction-tuned, which means better representations even without training it.
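For reference, a minimal sketch of loading that tokenizer (the example description is just illustrative):

from transformers import AutoTokenizer

# Tokenizer matching the Flan-T5 description encoder
description_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

# Descriptions stay in English, so this tokenizes them directly
print(description_tokenizer("A clear female voice with a moderate pace.").input_ids)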

AI4Bharat org

For any clarification on which models were used, please look at the config: https://huggingface.co/ai4bharat/indic-parler-tts/blob/main/config.json
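A quick way to inspect that config programmatically (a sketch using huggingface_hub, which is installed alongside transformers):

import json
from huggingface_hub import hf_hub_download

# Download config.json from the model repo and list its sub-model entries
path = hf_hub_download("ai4bharat/indic-parler-tts", "config.json")
with open(path) as f:
    config = json.load(f)
print(list(config.keys()))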

Hi @AshwinSankar

First of all, thank you so much for your time. I’m really sorry to bother you, but as a beginner, your help means a lot to me. I was using this notebook:

https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb

to fine-tune the Indic Parler pretrained model.

I replaced the model path with "ai4bharat/indic-parler-tts-pretrained", the prompt and description tokenizers with "google/flan-t5-large", and the feature extractor with "ylacombe/dac_44khz".

However, I’m still encountering this error:
TypeError: DacModel.encode() got an unexpected keyword argument 'bandwidth'

I’d be incredibly grateful if you could take some time from your busy schedule to guide me through this issue. Thank you so much in advance!

AI4Bharat org

Which version of transformers are you using?

I'm using Google Colab with Transformers version 4.46.1.
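For what it's worth, DacModel.encode() in recent transformers releases has no bandwidth parameter (unlike EncodecModel.encode()), which would explain the error. A quick check, sketched under the assumption that ylacombe/dac_44khz loads as a transformers DacModel:

import inspect
from transformers import DacModel

dac = DacModel.from_pretrained("ylacombe/dac_44khz")
# Prints the accepted keyword arguments; `bandwidth` is not among them
print(inspect.signature(dac.encode))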

Hi @AshwinSankar

I've tried my best, but I haven't been able to resolve the problem. Could you please take a look at it?

Thank you!

Was anyone able to resolve this issue?

AI4Bharat org

I will post a detailed tutorial notebook for doing this after Feb 20. Thank you for your patience.

I removed the bandwidth parameter, since the model was not accepting it, and training started. However, I'm not sure whether this is the correct approach:

def apply_audio_decoder(batch):
    len_audio = batch.pop("len_audio")
    audio_decoder.to(batch["input_values"].device).eval()

    # Removed: encode() does not accept a `bandwidth` argument for this model.
    # if bandwidth is not None:
    #     batch["bandwidth"] = bandwidth

    # Pass the codebook count under whichever name the encoder expects.
    if "num_quantizers" in encoder_signature:
        batch["num_quantizers"] = num_codebooks
    elif "num_codebooks" in encoder_signature:
        batch["num_codebooks"] = num_codebooks
    elif "n_quantizers" in encoder_signature:
        batch["n_quantizers"] = num_codebooks

    with torch.no_grad():
        labels = audio_decoder.encode(**batch)["audio_codes"]
    output = {}
    output["len_audio"] = len_audio
    # (1, bsz, codebooks, seq_len) -> (bsz, seq_len, codebooks)
    output["labels"] = labels.squeeze(0).transpose(1, 2)

    # With `pad_to_max_length`, the maximum corresponding audio length of the
    # current batch is max_duration * sampling_rate.
    max_length = len_audio.max() if padding != "max_length" else max_target_length
    output["ratio"] = torch.ones_like(len_audio) * labels.shape[-1] / max_length
    return output

AI4Bharat org

This is indeed correct. Instead, you can check if "bandwidth" is in encoder_signature, like the rest of the if conditions.
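Concretely, that suggestion might look like this (a sketch reusing the names from the snippet above):

import inspect

# Parameter names that the audio encoder's encode() actually accepts
encoder_signature = set(inspect.signature(audio_decoder.encode).parameters)

# Only forward `bandwidth` when the encoder supports it, mirroring the
# num_quantizers / num_codebooks / n_quantizers checks
if bandwidth is not None and "bandwidth" in encoder_signature:
    batch["bandwidth"] = bandwidth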

Thanks for your help.
