Internal model alias name:
v6-relPosAttDef-noBias-aedLoss-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2-lrlin1e_5_295k-featBN-speedpertV2-spm10k-bpeSample001
Last epoch (subepoch 500), greedy decoding (without LM), WERs on Librispeech:
{"dev-clean": 2.38, "dev-other": 5.67, "test-clean": 2.63, "test-other": 5.93}
(Note: together with a good LM trained on the Librispeech LM text data, first-pass recognition
(output/ctc_recog_ext/ctc+lm/opt-beam128-fp128-lm_n32-d1024-labelprior/recog-1stpass-res.txt)
gives:
{"dev-clean": 2.04, "dev-other": 4.06, "test-clean": 2.08, "test-other": 4.36})
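For reference, the greedy decoding above is standard CTC best-path decoding: framewise argmax over the model outputs, then collapsing repeated labels and removing the blank. A minimal sketch in plain PyTorch (the blank index of this model is not spelled out here, so it is left as a parameter):

import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank_idx: int) -> list[int]:
    # log_probs: [T, V] framewise CTC log-probabilities (probabilities work too; argmax is the same).
    best_path = log_probs.argmax(dim=-1).tolist()
    labels = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != blank_idx:  # collapse repeats, drop blank
            labels.append(idx)
        prev = idx
    return labels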
Usage example: https://github.com/rwth-i6/i6_experiments/blob/main/users/zeyer/experiments/exp2024_04_23_baselines/standalone/model_2024_ctc_spm10k.py
Example:
pip install torch
pip install returnn
wget https://raw.githubusercontent.com/rwth-i6/i6_experiments/refs/heads/main/users/zeyer/experiments/exp2024_04_23_baselines/standalone/model_2024_ctc_spm10k.py
wget https://huggingface.co/rwth-i6/2024-zeyer-ctc-librispeech-spm10k/resolve/main/data/epoch.500.pt
wget https://huggingface.co/rwth-i6/2024-zeyer-ctc-librispeech-spm10k/resolve/main/deps/spm.vocab
python model_2024_ctc_spm10k.py example_audio.ogg
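The checkpoint file is saved via torch.save, so you can sanity-check the download before running the full script. A small inspection sketch (the exact dict layout is RETURNN-specific and not documented here, so the printed keys are just whatever is stored):

import torch

ckpt = torch.load("epoch.500.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # e.g. model params / epoch / step, depending on the layout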
This Sisyphus config code snippet was used to set up the Sisyphus training job:
# v6-relPosAttDef-noBias-aedLoss-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2-lrlin1e_5_295k-featBN-speedpertV2-spm10k-bpeSample001
# noBias. (Baseline: 5.77)
train_exp(  # 5.65 (!!!)
    "v6-relPosAttDef-noBias-aedLoss-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2"
    "-lrlin1e_5_295k-featBN-speedpertV2-spm10k-bpeSample001",
    config_11gb_v6_f32_accgrad1_mgpu4_pavg100_wd1e_4,
    model_config={
        "enc_conformer_layer": rf.build_dict(
            rf.encoder.conformer.ConformerEncoderLayer,
            ff=rf.build_dict(
                rf.encoder.conformer.ConformerPositionwiseFeedForward,
                activation=rf.build_dict(rf.relu_square),
                with_bias=False,
            ),
            num_heads=8,
        ),
        "feature_batch_norm": True,
    },
    config_updates={
        **_get_cfg_lrlin_oclr_by_bs_nep(15_000, 500),
        "optimizer.weight_decay": 1e-2,
        "__train_audio_preprocess": speed_pert_librosa_config,
        "speed_pert_discrete_values": [0.7, 0.8, 0.9, 1.0, 1.1],
        "aux_attention_decoder": rf.build_dict(TransformerDecoder, num_layers=6),  # purely used for training
    },
    vocab="spm10k",
    train_vocab_opts={"other_opts": {"class": "SamplingBytePairEncoding", "breadth_prob": 0.01}},
)
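Side note on the activation: rf.relu_square in the feed-forward module above is squared ReLU. The RETURNN frontend (rf) version operates on its own tensor type, but the math is simply (a plain PyTorch sketch):

import torch

def relu_square(x: torch.Tensor) -> torch.Tensor:
    # Squared ReLU: relu(x) ** 2.
    return torch.relu(x) ** 2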
I uploaded the info and output files from the Sisyphus RETURNN training job to the trainjob directory, except for the model checkpoint, which I uploaded to the data directory. From the train job info file, I checked the dependencies; specifically, there is the SPM vocab. I uploaded those to the deps directory.