Open LLM Leaderboard
A cool collection of leaderboard Spaces for models across modalities! Text, vision, audio, ...
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Submit code models for evaluation on benchmarks
Note Specialized leaderboard for models with coding capabilities 🖥️ (Evaluates on HumanEval and MultiPL-E)
Note Pits chatbots against one another to compare their output quality (Evaluates on MT-Bench, an Elo score, and MMLU)
Explore hardware performance for language models
Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
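To give a feel for the harness, here is a minimal sketch of scoring a model with it — assuming the lm-eval package is installed (pip install lm-eval); the model id and task names below are illustrative placeholders, not recommendations:

```python
# Minimal sketch: scoring a model with EleutherAI's lm-evaluation-harness.
# Assumes `pip install lm-eval`; model id and tasks are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                       # Hugging Face transformers backend
    model_args="pretrained=gpt2",     # any model id from the HF Hub
    tasks=["hellaswag", "arc_easy"],  # benchmark tasks to run
    num_fewshot=0,                    # zero-shot evaluation
)

# Per-task metrics (accuracy, normalized accuracy, ...) live under "results".
print(results["results"])
```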
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Select and filter benchmarks for text embedding tasks
Note Text Embeddings benchmark across 58 tasks and 112 languages!
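For illustration, the companion mteb Python package can score any embedding model on these tasks; a minimal sketch, assuming mteb and sentence-transformers are installed, with placeholder model and task names:

```python
# Minimal sketch: running one MTEB task on a sentence-embedding model.
# Assumes `pip install mteb sentence-transformers`; names are illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")       # any embedding model
evaluation = MTEB(tasks=["Banking77Classification"])  # one of the benchmark tasks
results = evaluation.run(model, output_folder="results/minilm")
print(results)
```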
Submit models for evaluation and view leaderboard results
Note A leaderboard for tool-augmented LLMs!
Note An LLM leaderboard for Chinese models on many metric axes - super complete
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specifically for Korean models, by our friends at Upstage!
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Visualize model performance on function calling tasks
Note Tests LLM API usage and calls (few models at the moment)
Evaluate LLM cybersecurity risks
Note How likely is your LLM to help with cyberattacks?
Note An aggregation of benchmarks well correlated with human preferences
View and submit machine learning model evaluations
Note Bias, safety, toxicity: all the things that are important to test when your chatbot actually interacts with users
Display leaderboard data for video generation models
Note Text to video generation leaderboard
Note Coding benchmark
Display OCRBench leaderboard for model evaluations
Note An OCR benchmark
Explore and compare LLMs through a leaderboard
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Display model benchmark results
Note Success rates of red-teaming datasets against models
Submit and filter LLMs for evaluation
Note The Open LLM Leaderboard, but for structured state-space models!
Analyze images to detect and label objects
Note A multimodal arena!
Upload and evaluate video models
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in Italian!
Note An LLM leaderboard for Italian
Display Malay LLM leaderboard scores
Note An LLM leaderboard for Malay
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
Browse Q-Bench leaderboard for vision model performance
Display leaderboard for text-to-image model evaluations
Explore and submit LLM benchmark evaluations
Note A hallucination leaderboard, focused on a different set of tasks
Display and filter a leaderboard of language models
Browse and filter leaderboard of language models
Display and explore model leaderboards and chat history
Request evaluation results for a speech model
VLMEvalKit Evaluation Results Collection
Explore and analyze RewardBench leaderboard data
Vote on the latest TTS models!
Detect prompt injection risks
Display leaderboard results for coding tasks
Explore GenAI model efficiency on ML.ENERGY leaderboard
Display UGI leaderboard data in an interactive grid
Track, rank and evaluate open LLMs' CoT quality
Display a leaderboard of models
Browse and compare Indic language LLMs on a leaderboard
Leaderboard for LLMs on science reasoning
Browse and submit LLM evaluations
Browse leaderboard of language models
Visualize LLM progress with interactive filters
Track, rank and evaluate open LLMs and chatbots
Explore benchmark results for QA and long doc models
Track, rank and evaluate open Arabic LLMs and chatbots
Display and filter LLM benchmark results
Generate a 3D leaderboard by voting
Explore and analyze code evaluation data
Browse and submit LLM evaluations
Display and explore zebra puzzle leaderboard
Benchmark LLMs in accuracy and translation across languages
Submit a machine learning model for ranking evaluation
View and submit evaluations for benchmarks
Compare LLMs on role consistency across contexts
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and compare LLM benchmarks and submit models for evaluation
GIFT-Eval: A Benchmark for General Time Series Forecasting
Compare AI models by voting on responses
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLMs through interactive leaderboards and submissions
Submit protein prediction models to MLSB 2024 leaderboard
Explore toxicity scores of models
Compare image backgrounds and vote for the best
Display benchmark results for time series models
AI Phone Leaderboard
Find and filter models on the leaderboard
Display benchmark leaderboard for model evaluation
Display and analyze Polish text understanding benchmark results
Explore and compare model performance on Polish MT-Bench
DABstep Reasoning Benchmark Leaderboard