---
license: apache-2.0
datasets:
- agkphysics/AudioSet
pipeline_tag: audio-classification
---
|
# Model Details |
|
This is a CRNN sound event detection model, pre-trained on [AudioSet](https://research.google.com/audioset/download.html) and then fine-tuned on [AudioSet-strong](https://research.google.com/audioset/download_strong.html).
|
It consists of 8 convolutional layers followed by a GRU, operates at a time resolution of 40 ms, and has about 6.4 million parameters in total.
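
The parameter count can be verified directly after loading the checkpoint. This is a minimal sketch, assuming the model loads via `AutoModel` with `trust_remote_code=True` as shown in the Usage section below:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-audioset-sed",
    trust_remote_code=True
)
# Sum the element counts of all parameter tensors
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # should print roughly 6.4M
```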
|
|
|
# Usage |
|
```python
import torch
import torchaudio
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-audioset-sed",
    trust_remote_code=True
).to(device)

# Load each clip, resample it to the model's sample rate, and downmix to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Zero-pad the shorter clip so the two waveforms form one batch
wav_batch = torch.nn.utils.rnn.pad_sequence(
    [wav1, wav2], batch_first=True
).to(device)

with torch.no_grad():
    output = model(waveform=wav_batch)
    # output: {
    #     "framewise_output": (2, 447, n_frames),
    #     "clipwise_output": (2, 447)
    # }

# The class names are listed in `model.classes`.
# For example, the framewise probability sequence of male speech is:
male_speech_prob = output["framewise_output"][:, model.classes.index("Male speech, man speaking"), :]
```
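
To turn the framewise probabilities into timestamped event segments, a common approach is to threshold each frame and merge consecutive active frames, using the 40 ms time resolution to convert frame indices to seconds. The sketch below continues from the snippet above (it reuses `output` and `model`); `probs_to_segments` and the 0.5 threshold are illustrative choices, not part of the model's API:

```python
def probs_to_segments(frame_probs, threshold=0.5, time_resolution=0.04):
    """Convert a (n_frames,) probability tensor into (onset, offset) pairs in seconds."""
    active = (frame_probs > threshold).tolist()
    segments, onset = [], None
    for i, is_active in enumerate(active):
        if is_active and onset is None:
            onset = i  # event starts at this frame
        elif not is_active and onset is not None:
            segments.append((onset * time_resolution, i * time_resolution))
            onset = None
    if onset is not None:  # close an event that runs to the last frame
        segments.append((onset * time_resolution, len(active) * time_resolution))
    return segments

cls_idx = model.classes.index("Male speech, man speaking")
for start, end in probs_to_segments(output["framewise_output"][0, cls_idx]):
    print(f"Male speech: {start:.2f}s - {end:.2f}s")
```

In practice, sound event detection pipelines often also apply smoothing such as median filtering and per-class thresholds before merging frames into segments.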