---
license: apache-2.0
language:
- es
base_model:
- pyannote/segmentation-3.0
library_name: pyannote-audio
tags:
- pyannote
- pyannote-audio
- audio
- voice
- speech
- speaker
- speaker-diarization
- segmentation
pipeline_tag: automatic-speech-recognition
---
# pyannote-segmentation-3.0-RTVE-primary

## Model Details

This system is a collection of three fine-tuned models whose outputs are fused with [DOVER-Lap](https://github.com/desh2608/dover-lap).
Each model is fine-tuned while monitoring a different component of the Diarization Error Rate (i.e., False Alarm, Missed Detection, and Speaker Confusion).
More information about the fusion of these models can be found in this [paper](https://www.isca-archive.org/iberspeech_2024/souganidis24_iberspeech.html).

Each model is a fine-tuned version of [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) on [the RTVE database](https://catedrartve.unizar.es/rtvedatabase.html) used for Albayzin Evaluations of IberSPEECH 2024.

On the RTVE2024 test set, it achieves the following results (rounded to two decimals), making it the best-performing system of the Albayzin Evaluations 2024:

- Diarization Error Rate (DER): 14.98%
- False Alarm: 2.64%
- Missed Detection: 4.54%
- Speaker Confusion: 7.80%
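As a sanity check, the reported DER is simply the sum of its three components (up to rounding), which the figures above satisfy:

```python
# DER decomposes as False Alarm + Missed Detection + Speaker Confusion.
false_alarm = 2.64
missed_detection = 4.54
speaker_confusion = 7.80

der = false_alarm + missed_detection + speaker_confusion
print(f"DER = {der:.2f}%")  # → DER = 14.98%
```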


## Uses

This system is intended to be used for speaker diarization of TV shows.

## Usage

The instructions to obtain the RTTM output of each model can be found [here](https://huggingface.co/pyannote/speaker-diarization-3.1), using this [configuration file](config.yaml).

Once the outputs are obtained, [this script](primary_fusion.py) can be adapted to fuse the three models' outputs.
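Each model emits its hypothesis in the standard RTTM format (one `SPEAKER` record per speaker turn), which is what the fusion step consumes. A minimal sketch of reading such records is shown below; the `Turn` class and `parse_rttm` helper are illustrative names, not part of pyannote or DOVER-Lap:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    uri: str        # recording identifier
    onset: float    # turn start time, in seconds
    duration: float # turn duration, in seconds
    speaker: str    # speaker label

def parse_rttm(lines):
    """Parse the SPEAKER records of an RTTM file into Turn objects."""
    turns = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # RTTM: SPEAKER <uri> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        turns.append(Turn(fields[1], float(fields[3]), float(fields[4]), fields[7]))
    return turns

example = ["SPEAKER show01 1 0.50 3.20 <NA> <NA> spk00 <NA> <NA>"]
print(parse_rttm(example))
```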

## Training Details

### Training Data


The [train.lst](train.lst) file includes the URIs of the training data.



#### Training Hyperparameters

**Model:**

  - duration: 10.0
  - max_speakers_per_chunk: 3
  - max_speakers_per_frame: 2
  - train_batch_size: 32
  - powerset_max_classes: 2

**Adam Optimizer:**
  - lr: 0.0001

**Early Stopping:**
  
  - direction: 'min'
  - max_epochs: 20
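The `max_speakers_per_chunk` and `powerset_max_classes` values determine the size of the model's powerset output layer: one class per subset of at most `powerset_max_classes` of the local speakers (including the empty "no speech" subset). A minimal sketch of this count, assuming the standard powerset formulation of pyannote's segmentation models (the helper name is illustrative, not a library function):

```python
from math import comb

def powerset_size(max_speakers: int, max_classes: int) -> int:
    """Number of powerset classes: all speaker subsets of size 0..max_classes."""
    return sum(comb(max_speakers, k) for k in range(max_classes + 1))

# 1 "no speech" class + 3 single-speaker classes + 3 two-speaker-overlap classes
print(powerset_size(3, 2))  # → 7
```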

### Development Data

The [development.lst](development.lst) file includes the URIs of the development data.

## Evaluation


- Forgiveness collar: 250ms
- Skip overlap: False

### Testing Data & Metrics

#### Testing Data


The [test.lst](test.lst) file includes the URIs of the testing data. 


#### Metrics

Diarization Error Rate, False Alarm, Missed Detection, Speaker Confusion.


## Citation

If you use these models, please cite:


**BibTeX:**
```bibtex
@inproceedings{souganidis24_iberspeech,
  title     = {HiTZ-Aholab Speaker Diarization System for Albayzin Evaluations of IberSPEECH 2024},
  author    = {Christoforos Souganidis and Gemma Meseguer and Asier Herranz and Inma {Hernáez Rioja} and Eva Navas and Ibon Saratxaga},
  year      = {2024},
  booktitle = {IberSPEECH 2024},
  pages     = {327--330},
  doi       = {10.21437/IberSPEECH.2024-68},
}
```

## Acknowledgments 

This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU, through [ILENIA](https://proyectoilenia.es/), and by the [IkerGaitu](https://www.hitz.eus/iker-gaitu/) project funded by the Basque Government.