OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, which follows the design of the project, Open Whisper-style Speech Model (OWSM).
This model is initialized with OWSM-CTC v3.1 and then fine-tuned on v3.2 data for 225k steps.
To use the pre-trained model, please install espnet
and espnet_model_zoo
. The requirements are:
librosa
torch
espnet
espnet_model_zoo
The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
Example script for batched inference
Speech2TextGreedySearch
now provides a unified batched inference method batch_decode
. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapped segments (same as the "long-form ASR/ST" method below).
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
s2t = Speech2TextGreedySearch.from_pretrained(
"espnet/owsm_ctc_v3.2_ft_1B",
device="cuda",
use_flash_attn=False, # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
lang_sym='<eng>',
task_sym='<asr>',
)
res = s2t.batch_decode(
"audio.wav", # a single audio (path or 1-D array/tensor) as input
batch_size=16,
context_len_in_secs=4,
) # res is a single str, i.e., the predicted text without special tokens
res = s2t.batch_decode(
["audio1.wav", "audio2.wav", "audio3.wav"], # a list of audios as input
batch_size=16,
context_len_in_secs=4,
) # res is a list of str
# Please check the code of `batch_decode` for all supported inputs
Example script for short-form ASR/ST/LID
Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
s2t = Speech2TextGreedySearch.from_pretrained(
"espnet/owsm_ctc_v3.2_ft_1B",
device="cuda",
generate_interctc_outputs=False,
lang_sym='<eng>',
task_sym='<asr>',
)
# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))
res = s2t(speech)[0]
print(res)
Example script for long-form ASR/ST
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
context_len_in_secs = 4 # left and right context when doing buffered inference
batch_size = 32 # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
"espnet/owsm_ctc_v3.2_ft_1B",
device='cuda' if torch.cuda.is_available() else 'cpu',
generate_interctc_outputs=False,
lang_sym='<eng>',
task_sym='<asr>',
)
speech, rate = sf.read(
"xxx.wav"
)
text = s2t.decode_long_batched_buffered(
speech,
batch_size=batch_size,
context_len_in_secs=context_len_in_secs,
)
print(text)
Example of CTC forced alignment using ctc-segmentation
CTC segmentation can be efficiently applied to audio of an arbitrary length.
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader
# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")
aligner = CTCSegmentation(
**downloaded,
fs=16000,
ngpu=1,
batch_size=32, # batched parallel decoding; reduce it if your GPU memory is smaller
kaldi_style_text=True,
time_stamps="auto", # "auto" can be more accurate than "fixed" when converting token index to timestamp
lang_sym="<eng>",
task_sym="<asr>",
context_len_in_secs=2, # left and right context in buffered decoding
)
speech, rate = sf.read(
"./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
- Downloads last month
- 85
Model tree for espnet/owsm_ctc_v3.2_ft_1B
Unable to build the model tree, the base model loops to the model itself. Learn more.