|
Datasets |
|
======== |
|
|
|
NeMo has scripts to convert several common ASR datasets into the format expected by the `nemo_asr` collection. |
|
You can get started with these datasets by following the instructions to run the scripts in the section for each dataset below.
|
|
|
If you have your own data and want to preprocess it to use with NeMo ASR models, check out the `Preparing Custom Speech Classification Data`_ section at the bottom of the page. |
|
|
|
.. _Freesound-dataset: |
|
|
|
Freesound |
|
----------- |
|
|
|
`Freesound <http://www.freesound.org/>`_ is a website that aims to create a huge open collaborative database of audio snippets, samples, recordings, and bleeps.
|
Most audio samples are released under Creative Commons licenses that allow their reuse. |
|
Researchers and developers can access Freesound content using the Freesound API to retrieve meaningful sound information such as metadata, analysis files, and the sounds themselves. |
|
|
|
**Instructions** |
|
|
|
Go to ``<NeMo_git_root>/scripts/freesound_download_resample`` and follow the steps below to download and convert Freesound data into a format expected by the `nemo_asr` collection.
|
|
|
1. We will need several libraries, including freesound, requests, requests_oauthlib, joblib, librosa, and sox. If they are not installed, please run `pip install -r freesound_requirements.txt`
|
2. Create an API key for freesound.org at https://freesound.org/help/developers/
|
3. Create a python file called `freesound_private_apikey.py` and add the lines `api_key = <your Freesound api key>` and `client_id = <your Freesound client id>` (a minimal sketch of this file appears at the end of this section)
|
4. Authorize by running `python freesound_download.py --authorize`, then visit the website it prints and paste the response code
|
5. Feel free to change any arguments in `download_resample_freesound.sh`, such as `max_samples` and `max_filesize`
|
6. Run `bash download_resample_freesound.sh <number of files you want> <download data directory> <resampled data directory>`. For example:
|
|
|
.. code-block:: bash |
|
|
|
bash download_resample_freesound.sh 4000 ./freesound ./freesound_resampled_background |
|
|
|
Note that downloading this dataset may take hours. Change the categories in `download_resample_freesound.sh` to include audio files from other (e.g., speech) categories.

Then, you should have 16 kHz mono wav files in `<resampled data directory>`.
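
For reference, the `freesound_private_apikey.py` file from step 3 contains just two assignments; a minimal sketch, with placeholder values that you must replace with your own credentials:

.. code-block:: python

    # freesound_private_apikey.py -- placeholder values, replace with your own
    api_key = "YOUR_FREESOUND_API_KEY"
    client_id = "YOUR_FREESOUND_CLIENT_ID"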
|
|
|
|
|
.. _Google-Speech-Commands-Dataset: |
|
|
|
Google Speech Commands Dataset |
|
------------------------------ |
|
|
|
Google released two versions of the Speech Commands dataset: the first contains 65k samples over 30 classes, and the second contains 110k samples over 35 classes.
|
We refer to these datasets as `v1` and `v2` respectively. |
|
|
|
Run the script `process_speech_commands_data.py`, which can be found in ``<NeMo_git_root>/scripts/dataset_processing/``, to process the Google Speech Commands dataset and generate files in the format expected by `nemo_asr`.

You should set the data folder of Speech Commands using :code:`--data_root` and the version of the dataset using :code:`--data_version` as an int.
|
|
|
You can further rebalance the train set by randomly oversampling files inside the manifest by passing the `--rebalance` flag. |
|
|
|
.. code-block:: bash |
|
|
|
python process_speech_commands_data.py --data_root=<data directory> --data_version=<1 or 2> {--rebalance} |
|
|
|
|
|
Then, you should have `train_manifest.json`, `validation_manifest.json` and `test_manifest.json` |
|
in the directory `{data_root}/google_speech_recognition_v{1/2}`. |
|
|
|
.. note::
    You should have at least 4 GB (for v1) or 6 GB (for v2) of disk space available.
    Also, it will take some time to download and process, so go grab a coffee.
|
|
|
Each line is a training example. |
|
|
|
.. code-block:: bash |
|
|
|
{"audio_filepath": "<absolute path to dataset>/two/8aa35b0c_nohash_0.wav", "duration": 1.0, "label": "two"} |
|
{"audio_filepath": "<absolute path to dataset>/two/ec5ab5d5_nohash_2.wav", "duration": 1.0, "label": "two"} |
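
A quick way to sanity-check a generated manifest is to count the examples per class; a minimal sketch, assuming the `train_manifest.json` produced above:

.. code-block:: python

    import json
    from collections import Counter

    # Each manifest line is a standalone JSON object describing one example.
    label_counts = Counter()
    with open("train_manifest.json") as f:
        for line in f:
            entry = json.loads(line)
            label_counts[entry["label"]] += 1

    # With --rebalance, the training classes should have roughly equal counts.
    print(label_counts.most_common())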
|
|
|
|
|
|
|
Speech Command & Freesound for VAD |
|
------------------------------------ |
|
The Speech Command & Freesound (SCF) dataset was used to train MarbleNet in the `paper <https://arxiv.org/pdf/2010.13886.pdf>`_. Here we show how to download and process it.
|
This script assumes that you already have the Freesound dataset; if not, have a look at :ref:`Freesound-dataset`.

We will use the open-source :ref:`Google-Speech-Commands-Dataset` as our speech data (we use v2 of the dataset for the SCF dataset; supporting v1 requires only very minor changes).
|
|
|
The script below will download the Google Speech Commands v2 dataset and convert the speech and background data to a format suitable for use with `nemo_asr`.
|
|
|
.. note::
    You may additionally pass the :code:`--test_size` or :code:`--val_size` flag to control how the data is split into train, validation, and test sets.

    You may additionally pass the :code:`--window_length_in_sec` flag to set the segment/window length. The default is 0.63 seconds.

    You may additionally pass :code:`--rebalance_method='fixed|over|under'` at the end of the script to rebalance the class samples in the manifest.
|
|
|
|
|
|
|
* `'fixed'`: Fixed number of samples for each class: 5000 for train, 1000 for val, and 1000 for test (change the numbers in the script if you want).
|
* `'over'`: Oversampling rebalance method |
|
* `'under'`: Undersampling rebalance method |
|
|
|
|
|
.. code-block:: bash

    mkdir './google_dataset_v2'
    python process_vad_data.py --out_dir='./manifest/' --speech_data_root='./google_dataset_v2' --background_data_root=<resampled freesound data directory> --log --rebalance_method='fixed'
|
|
|
|
|
After download and conversion, your `manifest` folder should contain a few JSON manifest files:
|
|
|
* `(balanced_)background_testing_manifest.json` |
|
* `(balanced_)background_training_manifest.json` |
|
* `(balanced_)background_validation_manifest.json` |
|
* `(balanced_)speech_testing_manifest.json` |
|
* `(balanced_)speech_training_manifest.json` |
|
* `(balanced_)speech_validation_manifest.json` |
|
|
|
Each line is a training example. `audio_filepath` contains the path to the wav file, `duration` is the duration in seconds, `offset` is the offset into the file in seconds, and `label` is the class label:
|
|
|
.. code-block:: bash |
|
|
|
{"audio_filepath": "<absolute path to dataset>/two/8aa35b0c_nohash_0.wav", "duration": 0.63, "label": "speech", "offset": 0.0} |
|
{"audio_filepath": "<absolute path to dataset>/Emergency_vehicle/id_58368 simambulance.wav", "duration": 0.63, "label": "background", "offset": 4.0} |
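
Because each entry points to a window inside a longer recording, loading an example means honoring both `offset` and `duration`; a minimal sketch using librosa (one of the libraries installed for Freesound above), assuming the `./manifest/` output directory and the balanced manifests produced by the command above:

.. code-block:: python

    import json
    import librosa

    # Read the first entry of a VAD manifest and load exactly the window
    # it describes (offset and duration are given in seconds).
    with open("manifest/balanced_speech_training_manifest.json") as f:
        entry = json.loads(f.readline())

    audio, sr = librosa.load(
        entry["audio_filepath"],
        sr=16000,                    # the data was resampled to 16 kHz mono
        offset=entry["offset"],      # start of the segment
        duration=entry["duration"],  # window length, 0.63 s by default
    )
    print(audio.shape, sr)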
|
|
|
|
|
.. _Voxlingua107: |
|
|
|
VoxLingua107
|
------------------------------ |
|
|
|
VoxLingua107 consists of short speech segments automatically extracted from YouTube videos.

It covers 107 languages. The total amount of speech in the training set is 6628 hours, an average of 62 hours per language, though the data is highly imbalanced across languages.

It also includes a separate evaluation set containing 1609 speech segments from 33 languages, validated by at least two volunteers.
|
|
|
You can download the dataset from its `official website <http://bark.phon.ioc.ee/voxlingua107/>`_.
|
|
|
Each line is a training example. |
|
|
|
.. code-block:: bash |
|
|
|
{"audio_filepath": "<absolute path to dataset>/ln/lFpWXQYseo4__U__S113---0400.650-0410.420.wav", "offset": 0, "duration": 3.0, "label": "ln"} |
|
{"audio_filepath": "<absolute path to dataset>/lt/w0lp3mGUN8s__U__S28---0352.170-0364.770.wav", "offset": 8, "duration": 4.0, "label": "lt"} |
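
No processing script is provided for VoxLingua107, but building a manifest like the one above is straightforward once the archives are extracted into one directory per language code (as the paths above suggest). A minimal sketch, assuming a hypothetical local `voxlingua107/` root and using the `soundfile` library to read durations:

.. code-block:: python

    import json
    import os
    import soundfile as sf

    data_root = "voxlingua107"  # hypothetical path, one subfolder per language code

    with open("train_manifest.json", "w") as out:
        for lang in sorted(os.listdir(data_root)):  # e.g. "ln", "lt"
            lang_dir = os.path.join(data_root, lang)
            for name in sorted(os.listdir(lang_dir)):
                path = os.path.abspath(os.path.join(lang_dir, name))
                entry = {
                    "audio_filepath": path,
                    "offset": 0,
                    "duration": sf.info(path).duration,
                    "label": lang,  # the language code is the class label
                }
                out.write(json.dumps(entry) + "\n")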
|
|
|
|
|
Preparing Custom Speech Classification Data |
|
-------------------------------------------- |
|
|
|
Preparing Custom Speech Classification Data is almost identical to `Preparing Custom ASR Data <../datasets.html#preparing-custom-asr-data>`__. |
|
|
|
Instead of the :code:`text` entry in the manifest, you need a :code:`label` entry to specify the class of each sample, as in the sketch below.
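
For example, a minimal sketch that converts an ASR-style manifest into a classification manifest (the file names and the single `speech` class here are hypothetical):

.. code-block:: python

    import json

    # Replace the ASR "text" field with a classification "label" field.
    with open("asr_manifest.json") as src, open("cls_manifest.json", "w") as dst:
        for line in src:
            entry = json.loads(line)
            entry.pop("text", None)    # drop the transcript
            entry["label"] = "speech"  # hypothetical class for every sample
            dst.write(json.dumps(entry) + "\n")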
|
|
|
|
|
Tarred Datasets |
|
--------------- |
|
|
|
Similarly to ASR, you can tar your audio files and use the dataset class ``TarredAudioToClassificationLabelDataset`` (corresponding to ``AudioToClassificationLabelDataset``) for this case.
|
|
|
If you would like to use a tarred dataset, have a look at `ASR Tarred Datasets <../datasets.html#tarred-datasets>`__.