---
license: cc-by-4.0
language:
- es
base_model:
- pyannote/segmentation-3.0
library_name: pyannote-audio
---
# pyannote-segmentation-3.0-RTVE-primary

## Model Details

This system combines three fine-tuned models, one per monitored metric (False Alarm, Missed Detection, and Speaker Confusion), whose outputs are fused with [DOVER-Lap](https://github.com/desh2608/dover-lap).

Each model is a fine-tuned version of [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) on [the RTVE database](https://catedrartve.unizar.es/rtvedatabase.html) used for Albayzin Evaluations of IberSPEECH 2024.

On the RTVE2024 test set, the system achieves the following results (rounded to two decimals):

- Diarization Error Rate (DER): 14.98%
- False Alarm: 2.64%
- Missed Detection: 4.54%
- Speaker Confusion: 7.80%
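
As a quick consistency check, DER is conventionally the sum of its three components, and the figures reported above agree with that. The snippet below only redoes the arithmetic with those numbers; the variable names are illustrative.

```python
# Reported error components on the RTVE2024 test set, in percent.
false_alarm = 2.64
missed_detection = 4.54
speaker_confusion = 7.80

# DER is the sum of the three components (rounded to two decimals).
der = round(false_alarm + missed_detection + speaker_confusion, 2)
print(f"DER = {der:.2f}%")  # prints "DER = 14.98%"
```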


## Uses

This system is intended for speaker diarization of TV shows.

## Usage

To obtain the RTTM output of each model, follow the instructions in the [pyannote/speaker-diarization-3.1 model card](https://huggingface.co/pyannote/speaker-diarization-3.1), using [this configuration file](config_diarization-3.1.yaml).
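
For orientation, a minimal sketch of that step could look like the following. It assumes pyannote.audio 3.1 (as in the linked instructions), where `Pipeline.from_pretrained` accepts a local YAML path and the pipeline returns an annotation with a `write_rttm` method. It also assumes the YAML file already points at one of the fine-tuned segmentation checkpoints. The helper names and file paths are hypothetical, and any Hugging Face access token handling is omitted.

```python
from pathlib import Path


def rttm_path_for(audio_path: str, output_dir: str) -> Path:
    """Return <output_dir>/<audio stem>.rttm for a given audio file."""
    return Path(output_dir) / (Path(audio_path).stem + ".rttm")


def diarize_to_rttm(config_yaml: str, audio_path: str, output_dir: str) -> Path:
    """Run a pyannote pipeline built from a local YAML config and save RTTM."""
    # Imported lazily so the path helper works without pyannote.audio installed.
    from pyannote.audio import Pipeline

    # Assumption: config_yaml references a fine-tuned segmentation checkpoint.
    pipeline = Pipeline.from_pretrained(config_yaml)
    diarization = pipeline(audio_path)

    out_path = rttm_path_for(audio_path, output_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as rttm:
        diarization.write_rttm(rttm)
    return out_path
```

Calling `diarize_to_rttm` once per model, each time with a different output directory, would yield one RTTM file per model for the same recording.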

Once the RTTM files are obtained, [this script](https://huggingface.co/chsougan/pyannote-segmentation-3.0-RTVE-primary/blob/main/primary_fusion.py) can be adapted to fuse the models' outputs.
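
Independently of the linked script, the fusion step can be sketched by calling the DOVER-Lap command-line tool, whose basic form is `dover-lap OUTPUT_RTTM INPUT_RTTM...`. The wrapper below is an illustration with default options, not the contents of `primary_fusion.py`; the function names and RTTM file names are hypothetical.

```python
import subprocess
from typing import List, Sequence


def build_doverlap_command(output_rttm: str, input_rttms: Sequence[str]) -> List[str]:
    """Build the CLI call: dover-lap OUTPUT_RTTM INPUT_RTTM [INPUT_RTTM ...]."""
    if len(input_rttms) < 2:
        raise ValueError("fusion needs at least two input RTTM files")
    return ["dover-lap", output_rttm, *input_rttms]


def fuse(output_rttm: str, input_rttms: Sequence[str]) -> None:
    """Run DOVER-Lap; requires the dover-lap package to be installed."""
    subprocess.run(build_doverlap_command(output_rttm, input_rttms), check=True)


# Hypothetical file names, one RTTM per fine-tuned model.
command = build_doverlap_command(
    "fused.rttm", ["model_fa.rttm", "model_md.rttm", "model_sc.rttm"]
)
print(" ".join(command))
```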

## Training Details

### Training Data

The [train.lst](https://huggingface.co/chsougan/pyannote-segmentation-3.0-RTVE-primary/blob/main/train.lst) file includes the URIs of the training data.



### Training Hyperparameters

**Model:**

  - duration: 10.0
  - max_speakers_per_chunk: 3
  - max_speakers_per_frame: 2
  - train_batch_size: 32
  - powerset_max_classes: 2

**Adam Optimizer:**
  - lr: 0.0001

**Early Stopping:**
  
  - direction: 'min'
  - max_epochs: 20
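
To make the speaker settings above concrete: pyannote/segmentation-3.0 uses a powerset output, in which each class is a set of simultaneously active speakers. With up to 3 speakers per chunk and at most 2 active per frame, this gives 7 classes (non-speech, 3 single speakers, 3 speaker pairs). The sketch below enumerates them; the function name is illustrative and not part of pyannote.

```python
from itertools import combinations


def powerset_classes(max_speakers_per_chunk: int, max_speakers_per_frame: int):
    """List every set of at most max_speakers_per_frame active speakers."""
    speakers = range(max_speakers_per_chunk)
    classes = []
    for size in range(max_speakers_per_frame + 1):
        # size 0 is the single "no speaker" class, i.e. the empty tuple.
        classes.extend(combinations(speakers, size))
    return classes


classes = powerset_classes(max_speakers_per_chunk=3, max_speakers_per_frame=2)
print(len(classes))  # prints 7
print(classes)
```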

### Development Data

The [development.lst](https://huggingface.co/chsougan/pyannote-segmentation-3.0-RTVE-primary/blob/main/development.lst) file includes the URIs of the development data.

## Evaluation

- Forgiveness collar: 250ms
- Skip overlap: False
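
As an illustration of what the collar does, the sketch below computes the time regions excluded from scoring around reference segment boundaries. It assumes the collar is applied as 250 ms on each side of every boundary; some toolkits instead take the total collar width as their parameter, so the value passed to a given scorer may differ. With skip overlap disabled, overlapped speech outside these regions is still scored. The function name and example segments are hypothetical.

```python
def collar_regions(reference_segments, collar=0.25):
    """Return merged no-score intervals around each reference boundary.

    Assumption: `collar` seconds are forgiven on each side of a boundary.
    """
    raw = []
    for start, end in reference_segments:
        for boundary in (start, end):
            raw.append((max(0.0, boundary - collar), boundary + collar))
    raw.sort()

    # Merge intervals that touch or overlap.
    merged = []
    for lo, hi in raw:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Two reference segments (in seconds) separated by a 0.2 s pause.
regions = collar_regions([(1.0, 3.0), (3.2, 5.0)])
print(regions)
```

In this example the collars around 3.0 s and 3.2 s overlap, so they merge into a single unscored region.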

### Testing Data & Metrics

#### Testing Data

The [test.lst](https://huggingface.co/chsougan/pyannote-segmentation-3.0-RTVE-primary/blob/main/test.lst) file includes the URIs of the testing data. 


#### Metrics

Diarization Error Rate, False Alarm, Missed Detection, Speaker Confusion.


## Citation

If you use these models, please cite:


**BibTeX:**
```bibtex
@inproceedings{souganidis24_iberspeech,
  title     = {HiTZ-Aholab Speaker Diarization System for Albayzin Evaluations of IberSPEECH 2024},
  author    = {Christoforos Souganidis and Gemma Meseguer and Asier Herranz and Inma {Hernáez Rioja} and Eva Navas and Ibon Saratxaga},
  year      = {2024},
  booktitle = {IberSPEECH 2024},
  pages     = {327--330},
  doi       = {10.21437/IberSPEECH.2024-68},
}
```