lvwerra (HF staff) committed
Commit 9fb7d26 · 1 Parent(s): 40d62f1

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +131 -5
  2. app.py +6 -0
  3. requirements.txt +4 -0
  4. xtreme_s.py +271 -0
README.md CHANGED
@@ -1,12 +1,138 @@
  ---
- title: Xtreme_s
- emoji: 😻
- colorFrom: pink
- colorTo: gray
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: XTREME-S
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+ # Metric Card for XTREME-S
+
+ ## Metric Description
+
+ The XTREME-S metric evaluates model performance on the Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark.
+
+ This benchmark was designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval.
+
+ ## How to Use
+
+ There are two steps: (1) loading the XTREME-S metric relevant to the subset of the benchmark being used for evaluation; and (2) calculating the metric.
+
+ 1. **Loading the relevant XTREME-S metric**: the subsets of XTREME-S are the following: `mls`, `voxpopuli`, `covost2`, `fleurs-asr`, `fleurs-lang_id`, `minds14` and `babel`. More information about the different subsets can be found on the [XTREME-S benchmark page](https://huggingface.co/datasets/google/xtreme_s).
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
+ ```
+
+ 2. **Calculating the metric**: the metric takes two inputs:
+
+ - `predictions`: a list of predictions to score, with each prediction a `str`.
+
+ - `references`: a list of references, one per prediction, with each reference a `str`.
+
+ ```python
+ >>> references = ["it is sunny here", "paper and pen are essentials"]
+ >>> predictions = ["it's sunny", "paper pen are essential"]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ ```
+
+ It also has two optional arguments, illustrated in the sketch after this list:
+
+ - `bleu_kwargs`: a `dict` of keywords to be passed when computing the `bleu` metric for the `covost2` subset. Keywords can be one of `smooth_method`, `smooth_value`, `force`, `lowercase`, `tokenize`, `use_effective_order`.
+
+ - `wer_kwargs`: a `dict` of keywords to be passed when computing `wer` and `cer`, which are computed for the `mls`, `fleurs-asr`, `voxpopuli`, and `babel` subsets. The only keyword is `concatenate_texts`.
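+
+ As a minimal sketch (reusing the `references` and `predictions` lists from above), both dictionaries are simply forwarded by `compute`, so for example `concatenate_texts` can be set for an ASR subset and `lowercase` for `covost2`:
+
+ ```python
+ >>> # ASR subset: pass jiwer options through `wer_kwargs`
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references, wer_kwargs={"concatenate_texts": True})
+
+ >>> # translation subset: pass sacrebleu options through `bleu_kwargs`
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references, bleu_kwargs={"lowercase": True})
+ ```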
+
+ ## Output values
+
+ The output of the metric depends on the XTREME-S subset chosen, and consists of a dictionary containing one or several of the following metrics:
+
+ - `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information). This is returned for the `fleurs-lang_id` and `minds14` subsets.
+
+ - `f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall. It is returned for the `minds14` subset.
+
+ - `wer`: Word error rate (WER) is a common metric for the performance of an automatic speech recognition (ASR) system. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score (see [WER score](https://huggingface.co/metrics/wer) for more information). It is returned for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets of the benchmark.
+
+ - `cer`: Character error rate (CER) is similar to WER, but operates on characters instead of words. The lower the CER value, the better the performance of the ASR system, with a CER of 0 being a perfect score (see [CER score](https://huggingface.co/metrics/cer) for more information). It is returned for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets of the benchmark.
+
+ - `bleu`: the BLEU score, calculated according to the SacreBLEU metric approach. It can take any value between 0.0 and 100.0, inclusive, with higher values being better (see [SacreBLEU](https://huggingface.co/metrics/sacrebleu) for more details). This is returned for the `covost2` subset.
+
+
+ ### Values from popular papers
+ The [original XTREME-S paper](https://arxiv.org/pdf/2203.10752.pdf) reported average WERs ranging from 9.2 to 14.6, a BLEU score of 20.6, an accuracy of 73.3 and an F1 score of 86.9, depending on the subset of the benchmark evaluated.
+
+ ## Examples
+
+ For the `mls` subset (which outputs `wer` and `cer`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
+ >>> references = ["it is sunny here", "paper and pen are essentials"]
+ >>> predictions = ["it's sunny", "paper pen are essential"]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'wer': 0.56, 'cer': 0.27}
+ ```
+
+ For the `covost2` subset (which outputs `bleu`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
+ >>> references = ["bonjour paris", "il est necessaire de faire du sport de temps en temp"]
+ >>> predictions = ["bonjour paris", "il est important de faire du sport souvent"]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'bleu': 31.65}
+ ```
+
+ For the `fleurs-lang_id` subset (which outputs `accuracy`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'fleurs-lang_id')
+ >>> references = [0, 1, 0, 0, 1]
+ >>> predictions = [0, 1, 1, 0, 0]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'accuracy': 0.6}
+ ```
+
+ For the `minds14` subset (which outputs `f1` and `accuracy`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'minds14')
+ >>> references = [0, 1, 0, 0, 1]
+ >>> predictions = [0, 1, 1, 0, 0]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'f1': 0.58, 'accuracy': 0.6}
+ ```
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [XTREME-S dataset](https://huggingface.co/datasets/google/xtreme_s).
+
+ While the XTREME-S dataset is meant to represent a variety of languages and tasks, it has inherent biases: it is missing many languages that are important and under-represented in NLP datasets.
+
+ It also has a particular focus on read speech, because common evaluation benchmarks such as CoVoST-2 or LibriSpeech evaluate on this type of speech; this can result in a mismatch between performance measured in a read-speech setting and performance in noisier settings (in production or live deployment, for instance).
+
+ ## Citation
+
+ ```bibtex
+ @article{conneau2022xtreme,
+     title={XTREME-S: Evaluating Cross-lingual Speech Representations},
+     author={Conneau, Alexis and Bapna, Ankur and Zhang, Yu and Ma, Min and von Platen, Patrick and Lozhkov, Anton and Cherry, Colin and Jia, Ye and Rivera, Clara and Kale, Mihir and others},
+     journal={arXiv preprint arXiv:2203.10752},
+     year={2022}
+ }
+ ```
+
+ ## Further References
+
+ - [XTREME-S dataset](https://huggingface.co/datasets/google/xtreme_s)
+ - [XTREME-S GitHub repository](https://github.com/google-research/xtreme)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("xtreme_s")
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ sklearn
xtreme_s.py ADDED
@@ -0,0 +1,271 @@
+ # Copyright 2022 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ XTREME-S benchmark metric. """
+
+ from typing import List
+
+ import datasets
+ from datasets.config import PY_VERSION
+ from packaging import version
+ from sklearn.metrics import f1_score
+
+ import evaluate
+
+
+ if PY_VERSION < version.parse("3.8"):
+     import importlib_metadata
+ else:
+     import importlib.metadata as importlib_metadata
+
+
+ # TODO(Patrick/Anton)
+ _CITATION = """\
+ """
+
+ _DESCRIPTION = """\
+ XTREME-S is a benchmark to evaluate universal cross-lingual speech representations in many languages.
+ XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Compute the XTREME-S evaluation metric associated with each XTREME-S dataset.
+ Args:
+     predictions: list of predictions to score.
+         Each prediction should be a string for the ASR and translation subsets, or an integer label for the classification subsets.
+     references: list of references, one per prediction.
+         Each reference should be a string for the ASR and translation subsets, or an integer label for the classification subsets.
+     bleu_kwargs: optional dict of keywords to be passed when computing 'bleu'.
+         Keywords can be one of 'smooth_method', 'smooth_value', 'force', 'lowercase',
+         'tokenize', 'use_effective_order'.
+     wer_kwargs: optional dict of keywords to be passed when computing 'wer' and 'cer'.
+         Keywords include 'concatenate_texts'.
+ Returns: depending on the XTREME-S task, one or several of:
+     "accuracy": Accuracy - for 'fleurs-lang_id', 'minds14'
+     "f1": F1 score - for 'minds14'
+     "wer": Word error rate - for 'mls', 'fleurs-asr', 'voxpopuli', 'babel'
+     "cer": Character error rate - for 'mls', 'fleurs-asr', 'voxpopuli', 'babel'
+     "bleu": BLEU score according to the `sacrebleu` metric - for 'covost2'
+ Examples:
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')  # 'mls', 'voxpopuli', 'fleurs-asr' or 'babel'
+     >>> references = ["it is sunny here", "paper and pen are essentials"]
+     >>> predictions = ["it's sunny", "paper pen are essential"]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'wer': 0.56, 'cer': 0.27}
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
+     >>> references = ["bonjour paris", "il est necessaire de faire du sport de temps en temp"]
+     >>> predictions = ["bonjour paris", "il est important de faire du sport souvent"]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'bleu': 31.65}
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'fleurs-lang_id')
+     >>> references = [0, 1, 0, 0, 1]
+     >>> predictions = [0, 1, 1, 0, 0]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'accuracy': 0.6}
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'minds14')
+     >>> references = [0, 1, 0, 0, 1]
+     >>> predictions = [0, 1, 1, 0, 0]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'f1': 0.58, 'accuracy': 0.6}
+ """
+
+ _CONFIG_NAMES = ["fleurs-asr", "mls", "voxpopuli", "babel", "covost2", "fleurs-lang_id", "minds14"]
+ SENTENCE_DELIMITER = ""
+
+ try:
+     from jiwer import transforms as tr
+
+     _jiwer_available = True
+ except ImportError:
+     _jiwer_available = False
+
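+ # jiwer >= 2.3.0 ships the character-level transforms needed for CER; for older
+ # versions an equivalent custom transform is defined below, and if jiwer is not
+ # installed `cer_transform` is set to None (an error is raised later when WER/CER is computed).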
+ if _jiwer_available and version.parse(importlib_metadata.version("jiwer")) < version.parse("2.3.0"):
+
+     class SentencesToListOfCharacters(tr.AbstractTransform):
+         def __init__(self, sentence_delimiter: str = " "):
+             self.sentence_delimiter = sentence_delimiter
+
+         def process_string(self, s: str):
+             return list(s)
+
+         def process_list(self, inp: List[str]):
+             chars = []
+             for sent_idx, sentence in enumerate(inp):
+                 chars.extend(self.process_string(sentence))
+                 if self.sentence_delimiter is not None and self.sentence_delimiter != "" and sent_idx < len(inp) - 1:
+                     chars.append(self.sentence_delimiter)
+             return chars
+
+     cer_transform = tr.Compose(
+         [tr.RemoveMultipleSpaces(), tr.Strip(), SentencesToListOfCharacters(SENTENCE_DELIMITER)]
+     )
+ elif _jiwer_available:
+     cer_transform = tr.Compose(
+         [
+             tr.RemoveMultipleSpaces(),
+             tr.Strip(),
+             tr.ReduceToSingleSentence(SENTENCE_DELIMITER),
+             tr.ReduceToListOfListOfChars(),
+         ]
+     )
+ else:
+     cer_transform = None
+
+
+ def simple_accuracy(preds, labels):
+     return float((preds == labels).mean())
+
+
+ def f1_and_simple_accuracy(preds, labels):
+     return {
+         "f1": float(f1_score(y_true=labels, y_pred=preds, average="macro")),
+         "accuracy": simple_accuracy(preds, labels),
+     }
+
+
+ def bleu(
+     preds,
+     labels,
+     smooth_method="exp",
+     smooth_value=None,
+     force=False,
+     lowercase=False,
+     tokenize=None,
+     use_effective_order=False,
+ ):
+     # xtreme-s can only have one label
+     labels = [[label] for label in labels]
+     preds = list(preds)
+     try:
+         import sacrebleu as scb
+     except ImportError:
+         raise ValueError(
+             "sacrebleu has to be installed in order to apply the bleu metric for covost2. "
+             "You can install it via `pip install sacrebleu`."
+         )
+
+     if version.parse(scb.__version__) < version.parse("1.4.12"):
+         raise ImportWarning(
+             "To use `sacrebleu`, the module `sacrebleu>=1.4.12` is required, and the current version of `sacrebleu` doesn't match this condition.\n"
+             'You can install it with `pip install "sacrebleu>=1.4.12"`.'
+         )
+
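+     # sacrebleu's corpus_bleu expects one stream per reference position, i.e. the
+     # transpose of the per-prediction `labels` lists, built as `transformed_references` below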
+     references_per_prediction = len(labels[0])
+     if any(len(refs) != references_per_prediction for refs in labels):
+         raise ValueError("Sacrebleu requires the same number of references for each prediction")
+     transformed_references = [[refs[i] for refs in labels] for i in range(references_per_prediction)]
+     output = scb.corpus_bleu(
+         preds,
+         transformed_references,
+         smooth_method=smooth_method,
+         smooth_value=smooth_value,
+         force=force,
+         lowercase=lowercase,
+         use_effective_order=use_effective_order,
+         **(dict(tokenize=tokenize) if tokenize else {}),
+     )
+     return {"bleu": output.score}
+
+
+ def wer_and_cer(preds, labels, concatenate_texts, config_name):
+     try:
+         from jiwer import compute_measures
+     except ImportError:
+         raise ValueError(
+             f"jiwer has to be installed in order to apply the wer metric for {config_name}. "
+             "You can install it via `pip install jiwer`."
+         )
+
+     if concatenate_texts:
+         wer = compute_measures(labels, preds)["wer"]
+
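+         # with the character-level `cer_transform` applied to both truth and hypothesis,
+         # jiwer's "wer" entry is effectively the character error rate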
+         cer = compute_measures(labels, preds, truth_transform=cer_transform, hypothesis_transform=cer_transform)["wer"]
+         return {"wer": wer, "cer": cer}
+     else:
+
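+         # score each utterance separately, then aggregate edit operations over the corpus:
+         # errors = substitutions + deletions + insertions, reference length = substitutions + deletions + hits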
+         def compute_score(preds, labels, score_type="wer"):
+             incorrect = 0
+             total = 0
+             for prediction, reference in zip(preds, labels):
+                 if score_type == "wer":
+                     measures = compute_measures(reference, prediction)
+                 elif score_type == "cer":
+                     measures = compute_measures(
+                         reference, prediction, truth_transform=cer_transform, hypothesis_transform=cer_transform
+                     )
+                 incorrect += measures["substitutions"] + measures["deletions"] + measures["insertions"]
+                 total += measures["substitutions"] + measures["deletions"] + measures["hits"]
+             return incorrect / total
+
+         return {"wer": compute_score(preds, labels, "wer"), "cer": compute_score(preds, labels, "cer")}
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class XtremeS(evaluate.EvaluationModule):
+     def _info(self):
+         if self.config_name not in _CONFIG_NAMES:
+             raise KeyError(f"You should supply a configuration name selected in {_CONFIG_NAMES}")
+
+         pred_type = "int64" if self.config_name in ["fleurs-lang_id", "minds14"] else "string"
+
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {"predictions": datasets.Value(pred_type), "references": datasets.Value(pred_type)}
+             ),
+             codebase_urls=[],
+             reference_urls=[],
+             format="numpy",
+         )
+
+     def _compute(self, predictions, references, bleu_kwargs=None, wer_kwargs=None):
+
+         bleu_kwargs = bleu_kwargs if bleu_kwargs is not None else {}
+         wer_kwargs = wer_kwargs if wer_kwargs is not None else {}
+
+         if self.config_name == "fleurs-lang_id":
+             return {"accuracy": simple_accuracy(predictions, references)}
+         elif self.config_name == "minds14":
+             return f1_and_simple_accuracy(predictions, references)
+         elif self.config_name == "covost2":
+             smooth_method = bleu_kwargs.pop("smooth_method", "exp")
+             smooth_value = bleu_kwargs.pop("smooth_value", None)
+             force = bleu_kwargs.pop("force", False)
+             lowercase = bleu_kwargs.pop("lowercase", False)
+             tokenize = bleu_kwargs.pop("tokenize", None)
+             use_effective_order = bleu_kwargs.pop("use_effective_order", False)
+             return bleu(
+                 preds=predictions,
+                 labels=references,
+                 smooth_method=smooth_method,
+                 smooth_value=smooth_value,
+                 force=force,
+                 lowercase=lowercase,
+                 tokenize=tokenize,
+                 use_effective_order=use_effective_order,
+             )
+         elif self.config_name in ["fleurs-asr", "mls", "voxpopuli", "babel"]:
+             concatenate_texts = wer_kwargs.pop("concatenate_texts", False)
+             return wer_and_cer(predictions, references, concatenate_texts, self.config_name)
+         else:
+             raise KeyError(f"You should supply a configuration name selected in {_CONFIG_NAMES}")