wav2vec2-large-SLURP

This model is a fine-tuned version of facebook/wav2vec2-large on the SLURP dataset for the intent classification task.

Model description

The underlying model is Facebook's large Wav2Vec2 model, pretrained on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.

Task and dataset description

Intent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers. The dataset used here is SLURP, where each utterance is tagged with two intent labels: action and scenario.
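In the SLURP release, the full intent is usually written as a single `scenario_action` string (for example `alarm_set`). Assuming this model's labels follow that convention (check `model.config.id2label` to confirm), a small hypothetical helper can recover the two parts. Scenario names contain no underscore while some action names do, so the string is split only at the first underscore.

```python
def split_slurp_intent(intent: str) -> tuple[str, str]:
    """Hypothetical helper: split a 'scenario_action' label into (scenario, action).

    Splits at the first underscore only, so actions that themselves
    contain underscores stay intact.
    """
    scenario, sep, action = intent.partition("_")
    if not sep or not scenario or not action:
        raise ValueError(f"not a scenario_action label: {intent!r}")
    return scenario, action


print(split_slurp_intent("alarm_set"))         # ('alarm', 'set')
print(split_slurp_intent("iot_hue_lightoff"))  # ('iot', 'hue_lightoff')
```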

Usage examples

You can use the model directly in the following manner:

import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

## Load an audio file
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

## Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/wav2vec2-large-slurp")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-large")

## Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

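## Compute logits (gradients are not needed at inference time)
with torch.no_grad():
    logits = model(**inputs).logits

## Map the highest-scoring class index to its label name
predicted_id = logits.argmax(dim=-1).item()
predicted_label = model.config.id2label[predicted_id]
print(predicted_label)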
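To inspect more than the single best class, the logits can be turned into probabilities and ranked. The following is a minimal sketch that operates on the `logits` tensor of one utterance (shape `(1, num_labels)`) together with `model.config.id2label`; the function name `top_k_intents` is hypothetical.

```python
import torch


def top_k_intents(logits: torch.Tensor, id2label: dict, k: int = 3) -> list:
    """Hypothetical helper: return the k most probable (label, probability) pairs."""
    # Softmax over the class dimension converts raw logits into probabilities.
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    if probs.dim() != 1:
        raise ValueError("expected logits for a single utterance, shape (1, num_labels)")
    values, indices = torch.topk(probs, k=min(k, probs.numel()))
    return [(id2label[i], p) for i, p in zip(indices.tolist(), values.tolist())]


# Usage with dummy logits; with the real model, pass
# top_k_intents(logits, model.config.id2label)
dummy = torch.tensor([[0.0, 2.0, 1.0]])
print(top_k_intents(dummy, {0: "a", 1: "b", 2: "c"}, k=2))
```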

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-04
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • warmup_steps: 3000
  • num_steps: 30000
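Several of these values are linked: 32 × 4 accumulated micro-batches gives the total batch of 128, and a warmup ratio of 0.1 over 30,000 steps gives the 3,000 warmup steps. The card only says the scheduler is "linear", so the sketch below assumes the common shape of linear warmup followed by linear decay to zero (as in `transformers.get_linear_schedule_with_warmup`) and that `num_steps` counts optimizer updates. The toy model, dummy data, and constant names are hypothetical; weight decay is not listed in the card, so PyTorch's AdamW default is left in place.

```python
import torch

# Values taken from the hyperparameter list above.
PEAK_LR = 5e-4
PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 4
NUM_STEPS = 30_000
WARMUP_RATIO = 0.1

# Effective batch per optimizer update: 32 x 4 = 128 examples.
TOTAL_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Warmup length implied by the ratio: 0.1 x 30,000 = 3,000 steps.
WARMUP_STEPS = round(WARMUP_RATIO * NUM_STEPS)


def linear_warmup_decay(step: int) -> float:
    """LR multiplier (assumed shape): ramps 0 -> 1 over warmup, then decays 1 -> 0."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (NUM_STEPS - step) / max(1, NUM_STEPS - WARMUP_STEPS))


# Hypothetical toy model standing in for the wav2vec2 classifier.
toy_model = torch.nn.Linear(16, 4)
# Weight decay is not given in the card; PyTorch's default is used here.
optimizer = torch.optim.AdamW(
    toy_model.parameters(), lr=PEAK_LR, betas=(0.9, 0.999), eps=1e-8
)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

# One optimizer update with gradient accumulation: 4 micro-batches of 32.
loss_fn = torch.nn.CrossEntropyLoss()
optimizer.zero_grad()
for _ in range(GRAD_ACCUM_STEPS):
    x = torch.randn(PER_DEVICE_BATCH, 16)         # dummy features
    y = torch.randint(0, 4, (PER_DEVICE_BATCH,))  # dummy class ids
    # Scaling by 1/4 makes the accumulated gradient the mean over 128 examples.
    loss = loss_fn(toy_model(x), y) / GRAD_ACCUM_STEPS
    loss.backward()
optimizer.step()
scheduler.step()
```

Under this assumed schedule the learning rate reaches its 5e-4 peak at step 3,000 and falls back to zero at step 30,000.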

Framework versions

  • Datasets 3.2.0
  • Pytorch 2.1.2
  • Tokenizers 0.20.3
  • Transformers 4.45.2

BibTeX entry and citation info

@ARTICLE{koudounas2024taslp,
  author={Koudounas, Alkis and Pastor, Eliana and Attanasio, Giuseppe and Mazzia, Vittorio and Giollo, Manuel and Gueudre, Thomas and Reale, Elisa and Cagliero, Luca and Cumani, Sandro and de Alfaro, Luca and Baralis, Elena and Amberti, Daniele},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={Towards Comprehensive Subgroup Performance Analysis in Speech Models}, 
  year={2024},
  volume={32},
  number={},
  pages={1468-1480},
  keywords={Analytical models;Task analysis;Metadata;Speech processing;Behavioral sciences;Itemsets;Speech;Speech representation;E2E-SLU models;subgroup identification;model bias analysis;divergence},
  doi={10.1109/TASLP.2024.3363447}}

Model size

  • 316M params (Safetensors, tensor type F32)