📊 Benchmarks and Leaderboards - a society-ethics Collection

updated Sep 26, 2024

Running on CPU Upgrade

12.4k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots
Runtime error

5

Zeno Evals Hub

🏃
Running on CPU Upgrade

4.7k

MTEB Leaderboard

🥇

Select and filter benchmarks for text embedding tasks
Running

417

LLM-Perf Leaderboard

🏆

Explore hardware performance for language models
Runtime error

136

Leaderboards

📈
Running on CPU Upgrade

609

Open ASR Leaderboard

🏆

Request evaluation results for a speech model
Running

1.11k

Big Code Models Leaderboard

📈

Submit code models for evaluation on benchmarks
Running

3.96k

Chatbot Arena Leaderboard

🏆
Running

152

Open Object Detection Leaderboard

🏆

Request model evaluation on COCO val 2017 dataset
Running

65

Toolbench Leaderboard

⚡

Display ToolBench model performance results
Running

81

SEED-Bench Leaderboard

🏆
Running

89

OpenCompass LLM Leaderboard

🚀

Display a web page
nguha/legalbench

Updated Sep 30, 2024 • 18.7k • 98
Running

6

Skillmix

🚀

Browse and compare AI model evaluations
Running on CPU Upgrade

128

Hallucinations Leaderboard

🔥

View and submit LLM evaluations
Running

33

MVBench Leaderboard

🐨

Submit model evaluation and view leaderboard
Sleeping

3

Mt Bench French Browser

📊
Running

8

ML.ENERGY Leaderboard

⚡

Explore GenAI model efficiency on ML.ENERGY leaderboard
Running

52

NPHardEval Leaderboard

🥇

Explore and compare LLM models through a leaderboard
Running

164

VBench Leaderboard

📊

Upload and evaluate video models
Runtime error

104

Enterprise Scenarios Leaderboard

🥇
Running

185

Yet Another LLM Leaderboard

🌖

Run a Streamlit web app
Running

59

CyberSecEvalTest

📈

Evaluate LLM cybersecurity risks
Runtime error

30

Contextual Leaderboard

🐨
Running

52

Open Multilingual LLM Leaderboard

🐨

Search for model performance across languages and benchmarks
Running on CPU Upgrade

88

OpenLLM Turkish leaderboard

🥇

Browse and filter leaderboard of language models
Running on CPU Upgrade

595

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

326

Reward Bench Leaderboard

📐

Explore and analyze RewardBench leaderboard data
Runtime error

63

Guardrails Arena

⚔

Jailbreak the LLM and privacy guardrails
Running on CPU Upgrade

62

LeaderboardExplorer

🔎

Filter and display leaderboards based on selected criteria
Running

16

🐍💨 Data Contamination Database

🏭

Filter data for contamination in datasets or models
Running on CPU Upgrade

123

Open Arabic LLM Leaderboard

🏆

Track, rank and evaluate open Arabic LLMs and chatbots
Running on CPU Upgrade

66

AIR-Bench Leaderboard

🥇

Explore benchmark results for QA and long doc models
Running

23

MM-UPD Leaderboard

🥇

Submit and evaluate model results for the MM-AAD leaderboard
Running

177

BigCodeBench Leaderboard

🥇

Explore and analyze code evaluation data
Running on CPU Upgrade

67

La Leaderboard

🌸

Evaluate open LLMs in the languages of LATAM and Spain.