Icelandic LLM leaderboard

from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # (task key in the results JSON, metric key in the results JSON, column name shown on the leaderboard)
    task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS (3-shot)")
    task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
    task2 = Task("icelandic_inflection_all", "exact_match,get-answer", "Inflection (1-shot)")
    task5 = Task("icelandic_belebele", "exact_match,get-answer", "Belebele (IS)")
    task6 = Task("icelandic_arc_challenge", "exact_match,get-answer", "ARC-Challenge-IS")
    task7 = Task("icelandic_wiki_qa", "lm_judge_score,get-answer", "WikiQA-IS")

# ---------------------------------------------------


# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Icelandic LLM leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
"""

# Which evaluations are you running? How can people reproduce your results?
LLM_BENCHMARKS_TEXT = f"""
## New submissions
Do you want your model to be included on the leaderboard? Open a discussion on this repository with the details of your model and we will get back to you.

## Benchmark tasks
The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
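
As a rough illustration of what generation-based scoring implies, the sketch below compares a generated answer to the expected one after light normalization. The function names and the normalization (collapsing whitespace, ignoring case) are assumptions made for illustration, not the exact matching logic used by the leaderboard.

```python
def normalize(text):
    # Assumed normalization: collapse whitespace and ignore case.
    return " ".join(text.split()).lower()


def exact_match(generated, expected):
    # 1.0 if the normalized strings are identical, otherwise 0.0.
    return 1.0 if normalize(generated) == normalize(expected) else 0.0


def accuracy(generations, references):
    # Mean exact-match score over paired generations and references.
    scores = [exact_match(g, r) for g, r in zip(generations, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, a model that wraps the right answer in extra text scores 0 for that example, which is one way a model without instruction fine-tuning can lose points.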

The following tasks are evaluated:

### WinoGrande-IS
The Icelandic WinoGrande task is a human-translated and localized version of the roughly 1,000 test-set examples from the English WinoGrande task.
Each example consists of a sentence with a blank, and two answer choices for the blank. The task is to choose the correct answer choice using coreference resolution.
The benchmark is designed to test the model's ability to use knowledge and common sense reasoning in Icelandic. For this benchmark, we use 3-shot evaluation.
The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande

### GED
This is a benchmark for binary sentence-level grammatical error detection in Icelandic. It is adapted from the Icelandic Error Corpus (IEC) and contains 200 examples.
Each example consists of a sentence that may contain one or more grammatical errors, and the task is to predict whether the sentence contains an error.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-sentences-gec

### Inflection benchmark
The inflection benchmark tests models' ability to generate inflected forms of 300 Icelandic adjective-noun pairs for all four cases, singular and plural.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-inflection-all-flat

### Belebele (IS)
This is the Icelandic subset (900 examples) of the Belebele benchmark, a multiple-choice reading comprehension task. The task is to answer questions about a given passage.
- Link to dataset: https://huggingface.co/datasets/facebook/belebele

### ARC-Challenge-IS
A machine-translated version of the ARC-Challenge multiple-choice question-answering dataset. For this benchmark, we use the test set, which contains 1.23k examples.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-arc-challenge

### WikiQA-IS
The Icelandic WikiQA dataset is a collection of 1.9k question-answer pairs from the Icelandic Wikipedia, meant to evaluate models' knowledge of Icelandic culture and history.
The pairs were collected by having GPT-4o generate questions and answers given Icelandic Wikipedia articles as context. All examples were then manually verified and corrected where necessary.
For evaluation, we prompt GPT-4o to compare the generated answer to the original answer for semantic similarity and rate it on the following scale: (0, "poor"), (1, "fair"), (2, "excellent").
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic_wiki_qa
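
To make the rating scale concrete, here is a minimal sketch of how per-example judge ratings on the 0-2 scale could be combined into a single score. The function name, the averaging, and the division by the maximum rating are assumptions for illustration; the aggregation actually used for the leaderboard is not specified here.

```python
MAX_RATING = 2  # "excellent" on the scale above


def wiki_qa_score(ratings):
    # ratings: judge ratings, each 0 (poor), 1 (fair) or 2 (excellent).
    if not ratings:
        return 0.0
    for rating in ratings:
        if rating not in (0, 1, 2):
            raise ValueError("rating must be 0, 1 or 2")
    # Assumed aggregation: mean rating scaled to the range 0 to 1.
    return sum(ratings) / (MAX_RATING * len(ratings))
```

With this scaling, a model rated "fair" on every question would land at 0.5, halfway between all "poor" and all "excellent".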
"""