OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM) project.

This model is initialized with OWSM-CTC v3.1 and then fine-tuned on v3.2 data for 225k steps.

To use the pre-trained model, please install espnet and espnet_model_zoo. The requirements are:

librosa
torch
espnet
espnet_model_zoo
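
For example, these can be installed with pip (assuming a recent Python environment; pin versions as needed):

pip install torch librosa espnet espnet_model_zoo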

The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

Example script for batched inference

Speech2TextGreedySearch now provides a unified batched inference method batch_decode. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise, it will be split into overlapping segments (the same approach as the "long-form ASR/ST" method below).

from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",    # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"], # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a list of str

# Please check the code of `batch_decode` for all supported inputs

Example script for short-form ASR/ST/LID

Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.

import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
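
The same interface can be used for speech translation by changing the task token. Below is a minimal sketch: the token <st_zho> (English-to-Chinese) is assumed to follow the OWSM <st_{target_lang}> convention; please verify it against the model's vocabulary before use.

# Hypothetical ST example: translate English speech into Chinese text.
# <st_zho> is assumed to follow the OWSM task-token convention <st_{target_lang}>;
# check the model's token list before relying on it.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',      # language of the input speech
    task_sym='<st_zho>',   # assumed target-language task token
)

res = s2t_st(speech)[0]   # reuses the padded 16kHz `speech` from above
print(res)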

Example script for long-form ASR/ST

import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32   # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read(
    "xxx.wav"
)
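
# NOTE (assumption): as in the short-form example, the model expects 16kHz input.
# If the file has a different sample rate, resample it first, e.g. with librosa:
if rate != 16000:
    import librosa
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000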

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)

Example of CTC forced alignment using ctc-segmentation

CTC segmentation can be applied efficiently to audio of arbitrary length.

import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",     # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
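
The printed segments are expected to follow the ctc-segmentation convention: one line per utterance containing the utterance id, the estimated start and end times in seconds, a confidence score, and the text. The exact layout may vary across ESPnet versions, so treat this as a rough guide.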