Model Information

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on facebook/wav2vec2-xls-r-300m and was fine-tuned on a subset of the Common Voice Corpus 10.0 transcribed with eSpeak NG.

| Model Name       | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
|------------------|----------------------------|----------------------------|--------------------|--------------------|
| Multitask        | 45.62%                     | 19.44%                     | 34.34%             | 8.36%              |
| Hierarchical     | 46.09%                     | 19.18%                     | 34.35%             | 8.56%              |
| Multitask Shared | 46.05%                     | 19.52%                     | 41.20%             | 8.88%              |
| Baseline Shared  | 48.25%                     | -                          | 45.35%             | -                  |
| Baseline         | 57.01%                     | -                          | 46.95%             | -                  |

PER: phoneme error rate; AER: attribute error rate.

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.

Usage

Install the allophant package:

pip install allophant

A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-shared", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g., in code-switching scenarios:
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database:
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
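
Since an inventory is ultimately a sequence of IPA phone strings (as in option 3), these approaches can also be combined. A minimal sketch, assuming phoneme_inventory returns a list of such strings:

# Extend a language inventory with additional custom phones
base_inventory = attribute_indexer.phoneme_inventory("es")
inventory = sorted(set(base_inventory) | {"t̠ʃ", "ʎ"})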

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of zero-padded, single-channel audio, lengths, and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
  batch.to(device),
  attribute_indexer.composition_feature_matrix(inventory).to(device)
)
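
The same Batch construction extends to multiple utterances by padding them to a common length. A minimal sketch, assuming Batch accepts a (batch_size, samples) tensor as in the single-file case; audio_a and audio_b are hypothetical waveforms that were already loaded and resampled as above:

# Zero-pad two single-channel utterances to the length of the longest one
waveforms = [audio_a[0], audio_b[0]]
lengths = torch.tensor([len(waveform) for waveform in waveforms])
padded = torch.zeros(len(waveforms), int(lengths.max()))
for i, waveform in enumerate(waveforms):
    padded[i, : len(waveform)] = waveform
# As above, language IDs can be 0 for inference
batch = Batch(padded, lengths, torch.zeros(len(waveforms)))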

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
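
If only the phoneme transcription is needed, the "phonemes" decoder can be used on its own, since "phonemes" is among the supported feature names. A short sketch building on the objects above, assuming feature_values yields plain IPA strings:

# Decode only the phoneme classifier outputs into an IPA transcription
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    # Offset token indices by one to account for the <BLANK> token
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    print(" ".join(phonemes))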

Citation

@inproceedings{glocker2023allophant,
    title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
    author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
    booktitle={{Proc. Interspeech 2023}},
    year={2023},
    month={8}
}
