lvwerra (HF staff) committed
Commit 9fb7d26 · 1 Parent(s): 40d62f1

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +131 -5
  2. app.py +6 -0
  3. requirements.txt +4 -0
  4. xtreme_s.py +271 -0
README.md CHANGED
@@ -1,12 +1,138 @@
  ---
- title: Xtreme_s
- emoji: 😻
- colorFrom: pink
- colorTo: gray
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: XTREME-S
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+ # Metric Card for XTREME-S
+
+ ## Metric Description
+
+ The XTREME-S metric evaluates model performance on the Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark.
+
+ This benchmark was designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval.
+
+ ## How to Use
+
+ There are two steps: (1) loading the XTREME-S metric relevant to the subset of the benchmark being used for evaluation; and (2) calculating the metric.
+
+ 1. **Loading the relevant XTREME-S metric**: the subsets of XTREME-S are the following: `mls`, `voxpopuli`, `covost2`, `fleurs-asr`, `fleurs-lang_id`, `minds14` and `babel`. More information about the different subsets can be found on the [XTREME-S benchmark page](https://huggingface.co/datasets/google/xtreme_s).
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
+ ```
+
+ 2. **Calculating the metric**: the metric takes two inputs:
+
+ - `predictions`: a list of predictions to score, with each prediction a `str`.
+
+ - `references`: a list of references, one per prediction, with each reference a `str`.
+
+ ```python
+ >>> references = ["it is sunny here", "paper and pen are essentials"]
+ >>> predictions = ["it's sunny", "paper pen are essential"]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ ```
+
+ It also has two optional arguments, illustrated in the sketch after this list:
+
+ - `bleu_kwargs`: a `dict` of keywords to be passed when computing the `bleu` metric for the `covost2` subset. Keywords can be one of `smooth_method`, `smooth_value`, `force`, `lowercase`, `tokenize`, `use_effective_order`.
+
+ - `wer_kwargs`: a `dict` of keywords to be passed when computing `wer` and `cer`, which are computed for the `mls`, `fleurs-asr`, `voxpopuli`, and `babel` subsets. The only keyword is `concatenate_texts`.
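+
+ As a minimal sketch (reusing the `references` and `predictions` lists from above), both dictionaries are simply forwarded by `compute`, so for example `concatenate_texts` can be set for an ASR subset and `lowercase` for `covost2`:
+
+ ```python
+ >>> # ASR subset: pass jiwer options through `wer_kwargs`
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references, wer_kwargs={"concatenate_texts": True})
+
+ >>> # translation subset: pass sacrebleu options through `bleu_kwargs`
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references, bleu_kwargs={"lowercase": True})
+ ```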
+
+ ## Output values
+
+ The output of the metric depends on the XTREME-S subset chosen, and consists of a dictionary containing one or several of the following metrics:
+
+ - `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information). This is returned for the `fleurs-lang_id` and `minds14` subsets.
+
+ - `f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall. It is returned for the `minds14` subset.
+
+ - `wer`: Word error rate (WER) is a common metric for the performance of an automatic speech recognition (ASR) system. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score (see [WER score](https://huggingface.co/metrics/wer) for more information). It is returned for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets of the benchmark.
+
+ - `cer`: Character error rate (CER) is similar to WER, but operates on characters instead of words. The lower the CER value, the better the performance of the ASR system, with a CER of 0 being a perfect score (see [CER score](https://huggingface.co/metrics/cer) for more information). It is returned for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets of the benchmark.
+
+ - `bleu`: the BLEU score, calculated according to the SacreBLEU metric approach. It can take any value between 0.0 and 100.0, inclusive, with higher values being better (see [SacreBLEU](https://huggingface.co/metrics/sacrebleu) for more details). This is returned for the `covost2` subset.
+
+
+ ### Values from popular papers
+ The [original XTREME-S paper](https://arxiv.org/pdf/2203.10752.pdf) reported average WERs ranging from 9.2 to 14.6, a BLEU score of 20.6, an accuracy of 73.3 and an F1 score of 86.9, depending on the subset of the benchmark evaluated.
+
+ ## Examples
+
+ For the `mls` subset (which outputs `wer` and `cer`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
+ >>> references = ["it is sunny here", "paper and pen are essentials"]
+ >>> predictions = ["it's sunny", "paper pen are essential"]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'wer': 0.56, 'cer': 0.27}
+ ```
+
+ For the `covost2` subset (which outputs `bleu`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
+ >>> references = ["bonjour paris", "il est necessaire de faire du sport de temps en temp"]
+ >>> predictions = ["bonjour paris", "il est important de faire du sport souvent"]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'bleu': 31.65}
+ ```
+
+ For the `fleurs-lang_id` subset (which outputs `accuracy`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'fleurs-lang_id')
+ >>> references = [0, 1, 0, 0, 1]
+ >>> predictions = [0, 1, 1, 0, 0]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'accuracy': 0.6}
+ ```
+
+ For the `minds14` subset (which outputs `f1` and `accuracy`):
+
+ ```python
+ >>> xtreme_s_metric = evaluate.load('xtreme_s', 'minds14')
+ >>> references = [0, 1, 0, 0, 1]
+ >>> predictions = [0, 1, 1, 0, 0]
+ >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+ >>> print({k: round(v, 2) for k, v in results.items()})
+ {'f1': 0.58, 'accuracy': 0.6}
+ ```
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [XTREME-S dataset](https://huggingface.co/datasets/google/xtreme_s).
+
+ While the XTREME-S dataset is meant to represent a variety of languages and tasks, it has inherent biases: it is missing many languages that are important and under-represented in NLP datasets.
+
+ It also has a particular focus on read speech, because common evaluation benchmarks such as CoVoST-2 or LibriSpeech evaluate on this type of speech; this can result in a mismatch between performance measured in a read-speech setting and performance in noisier settings (in production or live deployment, for instance).
+
+ ## Citation
+
+ ```bibtex
+ @article{conneau2022xtreme,
+     title={XTREME-S: Evaluating Cross-lingual Speech Representations},
+     author={Conneau, Alexis and Bapna, Ankur and Zhang, Yu and Ma, Min and von Platen, Patrick and Lozhkov, Anton and Cherry, Colin and Jia, Ye and Rivera, Clara and Kale, Mihir and others},
+     journal={arXiv preprint arXiv:2203.10752},
+     year={2022}
+ }
+ ```
+
+ ## Further References
+
+ - [XTREME-S dataset](https://huggingface.co/datasets/google/xtreme_s)
+ - [XTREME-S GitHub repository](https://github.com/google-research/xtreme)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("xtreme_s")
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ sklearn
xtreme_s.py ADDED
@@ -0,0 +1,271 @@
+ # Copyright 2022 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ XTREME-S benchmark metric. """
+
+ from typing import List
+
+ import datasets
+ from datasets.config import PY_VERSION
+ from packaging import version
+ from sklearn.metrics import f1_score
+
+ import evaluate
+
+
+ if PY_VERSION < version.parse("3.8"):
+     import importlib_metadata
+ else:
+     import importlib.metadata as importlib_metadata
+
+
+ # TODO(Patrick/Anton)
+ _CITATION = """\
+ """
+
+ _DESCRIPTION = """\
+ XTREME-S is a benchmark to evaluate universal cross-lingual speech representations in many languages.
+ XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Compute the XTREME-S evaluation metric associated with each XTREME-S dataset.
+ Args:
+     predictions: list of predictions to score.
+         Each prediction should be a string for the ASR and translation subsets, or an integer label for the classification subsets.
+     references: list of references, one per prediction.
+         Each reference should be a string for the ASR and translation subsets, or an integer label for the classification subsets.
+     bleu_kwargs: optional dict of keywords to be passed when computing 'bleu'.
+         Keywords can be one of 'smooth_method', 'smooth_value', 'force', 'lowercase',
+         'tokenize', 'use_effective_order'.
+     wer_kwargs: optional dict of keywords to be passed when computing 'wer' and 'cer'.
+         Keywords include 'concatenate_texts'.
+ Returns: depending on the XTREME-S task, one or several of:
+     "accuracy": Accuracy - for 'fleurs-lang_id', 'minds14'
+     "f1": F1 score - for 'minds14'
+     "wer": Word error rate - for 'mls', 'fleurs-asr', 'voxpopuli', 'babel'
+     "cer": Character error rate - for 'mls', 'fleurs-asr', 'voxpopuli', 'babel'
+     "bleu": BLEU score according to the `sacrebleu` metric - for 'covost2'
+ Examples:
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')  # 'mls', 'voxpopuli', 'fleurs-asr' or 'babel'
+     >>> references = ["it is sunny here", "paper and pen are essentials"]
+     >>> predictions = ["it's sunny", "paper pen are essential"]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'wer': 0.56, 'cer': 0.27}
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
+     >>> references = ["bonjour paris", "il est necessaire de faire du sport de temps en temp"]
+     >>> predictions = ["bonjour paris", "il est important de faire du sport souvent"]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'bleu': 31.65}
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'fleurs-lang_id')
+     >>> references = [0, 1, 0, 0, 1]
+     >>> predictions = [0, 1, 1, 0, 0]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'accuracy': 0.6}
+
+     >>> xtreme_s_metric = evaluate.load('xtreme_s', 'minds14')
+     >>> references = [0, 1, 0, 0, 1]
+     >>> predictions = [0, 1, 1, 0, 0]
+     >>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
+     >>> print({k: round(v, 2) for k, v in results.items()})
+     {'f1': 0.58, 'accuracy': 0.6}
+ """
+
+ _CONFIG_NAMES = ["fleurs-asr", "mls", "voxpopuli", "babel", "covost2", "fleurs-lang_id", "minds14"]
+ SENTENCE_DELIMITER = ""
+
+ try:
+     from jiwer import transforms as tr
+
+     _jiwer_available = True
+ except ImportError:
+     _jiwer_available = False
+
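+ # jiwer >= 2.3.0 ships the character-level transforms needed for CER; for older
+ # versions an equivalent custom transform is defined below, and if jiwer is not
+ # installed `cer_transform` is set to None (an error is raised later when WER/CER is computed).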
+ if _jiwer_available and version.parse(importlib_metadata.version("jiwer")) < version.parse("2.3.0"):
+
+     class SentencesToListOfCharacters(tr.AbstractTransform):
+         def __init__(self, sentence_delimiter: str = " "):
+             self.sentence_delimiter = sentence_delimiter
+
+         def process_string(self, s: str):
+             return list(s)
+
+         def process_list(self, inp: List[str]):
+             chars = []
+             for sent_idx, sentence in enumerate(inp):
+                 chars.extend(self.process_string(sentence))
+                 if self.sentence_delimiter is not None and self.sentence_delimiter != "" and sent_idx < len(inp) - 1:
+                     chars.append(self.sentence_delimiter)
+             return chars
+
+     cer_transform = tr.Compose(
+         [tr.RemoveMultipleSpaces(), tr.Strip(), SentencesToListOfCharacters(SENTENCE_DELIMITER)]
+     )
+ elif _jiwer_available:
+     cer_transform = tr.Compose(
+         [
+             tr.RemoveMultipleSpaces(),
+             tr.Strip(),
+             tr.ReduceToSingleSentence(SENTENCE_DELIMITER),
+             tr.ReduceToListOfListOfChars(),
+         ]
+     )
+ else:
+     cer_transform = None
+
+
+ def simple_accuracy(preds, labels):
+     return float((preds == labels).mean())
+
+
+ def f1_and_simple_accuracy(preds, labels):
+     return {
+         "f1": float(f1_score(y_true=labels, y_pred=preds, average="macro")),
+         "accuracy": simple_accuracy(preds, labels),
+     }
+
+
+ def bleu(
+     preds,
+     labels,
+     smooth_method="exp",
+     smooth_value=None,
+     force=False,
+     lowercase=False,
+     tokenize=None,
+     use_effective_order=False,
+ ):
+     # xtreme-s can only have one label
+     labels = [[label] for label in labels]
+     preds = list(preds)
+     try:
+         import sacrebleu as scb
+     except ImportError:
+         raise ValueError(
+             "sacrebleu has to be installed in order to apply the bleu metric for covost2. "
+             "You can install it via `pip install sacrebleu`."
+         )
+
+     if version.parse(scb.__version__) < version.parse("1.4.12"):
+         raise ImportWarning(
+             "To use `sacrebleu`, the module `sacrebleu>=1.4.12` is required, and the current version of `sacrebleu` doesn't match this condition.\n"
+             'You can install it with `pip install "sacrebleu>=1.4.12"`.'
+         )
+
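+     # sacrebleu's corpus_bleu expects one stream per reference position, i.e. the
+     # transpose of the per-prediction `labels` lists, built as `transformed_references` below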
+     references_per_prediction = len(labels[0])
+     if any(len(refs) != references_per_prediction for refs in labels):
+         raise ValueError("Sacrebleu requires the same number of references for each prediction")
+     transformed_references = [[refs[i] for refs in labels] for i in range(references_per_prediction)]
+     output = scb.corpus_bleu(
+         preds,
+         transformed_references,
+         smooth_method=smooth_method,
+         smooth_value=smooth_value,
+         force=force,
+         lowercase=lowercase,
+         use_effective_order=use_effective_order,
+         **(dict(tokenize=tokenize) if tokenize else {}),
+     )
+     return {"bleu": output.score}
+
+
+ def wer_and_cer(preds, labels, concatenate_texts, config_name):
+     try:
+         from jiwer import compute_measures
+     except ImportError:
+         raise ValueError(
+             f"jiwer has to be installed in order to apply the wer metric for {config_name}. "
+             "You can install it via `pip install jiwer`."
+         )
+
+     if concatenate_texts:
+         wer = compute_measures(labels, preds)["wer"]
+
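+         # with the character-level `cer_transform` applied to both truth and hypothesis,
+         # jiwer's "wer" entry is effectively the character error rate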
+         cer = compute_measures(labels, preds, truth_transform=cer_transform, hypothesis_transform=cer_transform)["wer"]
+         return {"wer": wer, "cer": cer}
+     else:
+
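+         # score each utterance separately, then aggregate edit operations over the corpus:
+         # errors = substitutions + deletions + insertions, reference length = substitutions + deletions + hits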
+         def compute_score(preds, labels, score_type="wer"):
+             incorrect = 0
+             total = 0
+             for prediction, reference in zip(preds, labels):
+                 if score_type == "wer":
+                     measures = compute_measures(reference, prediction)
+                 elif score_type == "cer":
+                     measures = compute_measures(
+                         reference, prediction, truth_transform=cer_transform, hypothesis_transform=cer_transform
+                     )
+                 incorrect += measures["substitutions"] + measures["deletions"] + measures["insertions"]
+                 total += measures["substitutions"] + measures["deletions"] + measures["hits"]
+             return incorrect / total
+
+         return {"wer": compute_score(preds, labels, "wer"), "cer": compute_score(preds, labels, "cer")}
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class XtremeS(evaluate.EvaluationModule):
+     def _info(self):
+         if self.config_name not in _CONFIG_NAMES:
+             raise KeyError(f"You should supply a configuration name selected in {_CONFIG_NAMES}")
+
+         pred_type = "int64" if self.config_name in ["fleurs-lang_id", "minds14"] else "string"
+
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {"predictions": datasets.Value(pred_type), "references": datasets.Value(pred_type)}
+             ),
+             codebase_urls=[],
+             reference_urls=[],
+             format="numpy",
+         )
+
+     def _compute(self, predictions, references, bleu_kwargs=None, wer_kwargs=None):
+
+         bleu_kwargs = bleu_kwargs if bleu_kwargs is not None else {}
+         wer_kwargs = wer_kwargs if wer_kwargs is not None else {}
+
+         if self.config_name == "fleurs-lang_id":
+             return {"accuracy": simple_accuracy(predictions, references)}
+         elif self.config_name == "minds14":
+             return f1_and_simple_accuracy(predictions, references)
+         elif self.config_name == "covost2":
+             smooth_method = bleu_kwargs.pop("smooth_method", "exp")
+             smooth_value = bleu_kwargs.pop("smooth_value", None)
+             force = bleu_kwargs.pop("force", False)
+             lowercase = bleu_kwargs.pop("lowercase", False)
+             tokenize = bleu_kwargs.pop("tokenize", None)
+             use_effective_order = bleu_kwargs.pop("use_effective_order", False)
+             return bleu(
+                 preds=predictions,
+                 labels=references,
+                 smooth_method=smooth_method,
+                 smooth_value=smooth_value,
+                 force=force,
+                 lowercase=lowercase,
+                 tokenize=tokenize,
+                 use_effective_order=use_effective_order,
+             )
+         elif self.config_name in ["fleurs-asr", "mls", "voxpopuli", "babel"]:
+             concatenate_texts = wer_kwargs.pop("concatenate_texts", False)
+             return wer_and_cer(predictions, references, concatenate_texts, self.config_name)
+         else:
+             raise KeyError(f"You should supply a configuration name selected in {_CONFIG_NAMES}")