# Speech2C

> [**Speech2C**](https://arxiv.org/abs/2203.17113) (```INTERSPEECH 2022```): **Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data**

## Pre-Trained and Fine-Tuned Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Checkpoint |
| :------: | :------: | :------: | :------: |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | - | [Google Drive](https://drive.google.com/file/d/1nGZ0LWEwlLq2pz7o805YALsMr9irV0Za/view?usp=sharing) |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | [10 hrs LibriSpeech](http://www.openslr.org/12) | [Google Drive](https://drive.google.com/file/d/1nWSAc-33LmcDQHzH8IjXVJsuk0JZTWgN/view?usp=sharing) |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | [100 hrs LibriSpeech](http://www.openslr.org/12) | [Google Drive](https://drive.google.com/file/d/1LwbQ5Y3tKZoK3s1ayLQgsfLTFnmkKNZs/view?usp=sharing) |

## Language Model and Vocabulary

| Model | Dataset | Checkpoint | Vocabulary |
| :------: | :------: | :------: | :------: |
| LM | [LibriSpeech LM Dataset](https://www.openslr.org/11/) | [Model](https://drive.google.com/file/d/1UDCcNJT1DlquSRw0iRAXH6GHlf6zK6-8/view?usp=sharing) | [Vocabulary](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt) |

## Setup

```
git submodule update --init Speech2C/fairseq
cd Speech2C/
pip install --editable fairseq/
```

## Data Preparation

Please follow the data preparation steps for HuBERT described [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation), which produce the `.tsv` wave manifests and the frame-level label files used in the commands below.
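For reference, the manifest step of that recipe uses fairseq's `wav2vec_manifest.py` script. The following is a minimal sketch, assuming LibriSpeech has been extracted to a placeholder path; `DATA_DIR` is the manifest directory referenced in the pre-training and fine-tuning commands below.

```
# Sketch: build the .tsv manifest over a LibriSpeech split (path is a placeholder).
# --valid-percent 0 keeps all files in one split, since LibriSpeech ships
# separate dev/test sets rather than requiring a held-out portion.
python ${FAIRSEQ_PATH}/examples/wav2vec/wav2vec_manifest.py \
  /path/to/LibriSpeech/train-960 \
  --dest ${DATA_DIR} --ext flac --valid-percent 0
```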
## Pre-Training

```
DATA_DIR=
LABEL_DIR=
FAIRSEQ_PATH=

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c
```

## Fine-Tuning

```
DATA_DIR=
LABEL_DIR=
FAIRSEQ_PATH=
W2V_PATH=
CONFIG_NAME=

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name ${CONFIG_NAME} \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} \
  model.w2v_path=${W2V_PATH} common.user_dir=SpeechT5/Speech2C/speech2c
```

## Inference

Note that joint CTC and decoder inference is only supported with a batch size of 1.

```
FAIRSEQ_PATH=
DATA_DIR=
LABEL_DIR=
BEAM_SIZE=
CTC_WEIGHT=
TEST_SET=
CHECKPOINT_PATH=
W2V_PATH=

python ${FAIRSEQ_PATH}/fairseq_cli/generate.py ${DATA_DIR} \
  --label-dir ${LABEL_DIR} \
  --path ${CHECKPOINT_PATH} \
  --user-dir SpeechT5/Speech2C/speech2c \
  --model-overrides "{'w2v_path': '${W2V_PATH}'}" \
  --gen-subset ${TEST_SET} \
  --task speech2c_pretraining \
  --post-process letter \
  --add-decoder \
  --labels '["ltr"]' \
  --fine-tuning \
  --scoring wer \
  --max-len-a 0 \
  --max-len-b 620 \
  --pad-audio \
  --random-crop \
  --ctc-weight ${CTC_WEIGHT} \
  --max-tokens 8000000 \
  --beam ${BEAM_SIZE} \
  --single-target
```

## Results on LibriSpeech

All numbers are word error rates (WER, %).

### Evaluation on the [LibriSpeech](http://www.openslr.org/12) 10 hr subset

| Model | LM | test-clean | test-other |
| :------ | :------ | :----: | :----: |
| wav2vec 2.0 Base | - | 11.1 | 17.6 |
| HuBERT Base | - | 10.1 | 16.8 |
| **Speech2C** | - | **7.8** | **13.1** |
| wav2vec 2.0 Base | 4-gram | 4.3 | 9.5 |
| wav2vec 2.0 Base | Transf. | 3.2 | 7.8 |
| HuBERT Base | 4-gram | 4.3 | 9.4 |
| **Speech2C** | **Transf.** | **3.1** | **7.0** |

### Evaluation on the [LibriSpeech](http://www.openslr.org/12) 100 hr subset

| Model | LM | test-clean | test-other |
| :------ | :------ | :----: | :----: |
| wav2vec 2.0 Base | - | 6.1 | 13.3 |
| wav2vec 2.0 Large | - | 4.7 | 9.0 |
| HuBERT Base | - | 6.3 | 13.2 |
| SpeechT5 | - | 4.4 | 10.4 |
| Baseline | - | 5.0 | 11.9 |
| **Speech2C** | - | **4.3** | **9.0** |
| wav2vec 2.0 Base | 4-gram | 3.4 | 8.0 |
| wav2vec 2.0 Base | Transf. | 2.6 | 6.3 |
| HuBERT Base | 4-gram | 3.4 | 8.1 |
| SpeechT5 | Transf. | 2.4 | 5.8 |
| Baseline | Transf. | 2.5 | 6.3 |
| **Speech2C** | **Transf.** | **2.4** | **5.2** |

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on [FAIRSEQ](https://github.com/pytorch/fairseq).

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@article{Ao2022Speech2C,
  title         = {Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author        = {Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  eprint        = {2203.17113},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  year          = {2022}
}
```