---
license: apache-2.0
datasets:
- agkphysics/AudioSet
pipeline_tag: audio-classification
---
|
# Model Details |
|
This is a CRNN sound event detection model, pre-trained on [AudioSet](https://research.google.com/audioset/download.html) and then fine-tuned on [AudioSet-strong](https://research.google.com/audioset/download_strong.html).
|
It consists of 8 convolutional layers followed by a GRU, operates at a time resolution of 40 ms, and has about 6.4 million parameters in total.
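
The parameter count can be verified directly after loading the checkpoint. This is a minimal sketch, assuming the model loads via `AutoModel` with `trust_remote_code=True` as shown in the Usage section below:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-audioset-sed",
    trust_remote_code=True
)
# Sum the element counts of all parameter tensors
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # should print roughly 6.4M
```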
|
|
|
# Usage |
|
```python
import torch
import torchaudio
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-audioset-sed",
    trust_remote_code=True
).to(device)

# Load each clip, resample it to the model's sample rate, and downmix to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Zero-pad the shorter clip so the two waveforms form one batch
wav_batch = torch.nn.utils.rnn.pad_sequence(
    [wav1, wav2], batch_first=True
).to(device)

with torch.no_grad():
    output = model(waveform=wav_batch)
    # output: {
    #     "framewise_output": (2, 447, n_frames),
    #     "clipwise_output": (2, 447)
    # }

# The class names are listed in `model.classes`.
# For example, the framewise probability sequence of male speech is:
male_speech_prob = output["framewise_output"][:, model.classes.index("Male speech, man speaking"), :]
```
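
To turn the framewise probabilities into timestamped event segments, a common approach is to threshold each frame and merge consecutive active frames, using the 40 ms time resolution to convert frame indices to seconds. The sketch below continues from the snippet above (it reuses `output` and `model`); `probs_to_segments` and the 0.5 threshold are illustrative choices, not part of the model's API:

```python
def probs_to_segments(frame_probs, threshold=0.5, time_resolution=0.04):
    """Convert a (n_frames,) probability tensor into (onset, offset) pairs in seconds."""
    active = (frame_probs > threshold).tolist()
    segments, onset = [], None
    for i, is_active in enumerate(active):
        if is_active and onset is None:
            onset = i  # event starts at this frame
        elif not is_active and onset is not None:
            segments.append((onset * time_resolution, i * time_resolution))
            onset = None
    if onset is not None:  # close an event that runs to the last frame
        segments.append((onset * time_resolution, len(active) * time_resolution))
    return segments

cls_idx = model.classes.index("Male speech, man speaking")
for start, end in probs_to_segments(output["framewise_output"][0, cls_idx]):
    print(f"Male speech: {start:.2f}s - {end:.2f}s")
```

In practice, sound event detection pipelines often also apply smoothing such as median filtering and per-class thresholds before merging frames into segments.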