Open LLM Leaderboard
A cool collection of leaderboard Spaces for models across modalities! Text, vision, audio, ...
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Submit code models for evaluation on benchmarks
Note Specialized leaderboard for models with coding capabilities 🖥️ (Evaluates on HumanEval and MultiPL-E)
Note Pits chatbots against one another to compare their output quality (Evaluates on MT-Bench, an Elo score, and MMLU)
Explore hardware performance for language models
Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
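To give a feel for the harness, here is a minimal sketch of scoring a model with it — assuming the lm-eval package is installed (pip install lm-eval); the model id and task names below are illustrative placeholders, not recommendations:

```python
# Minimal sketch: scoring a model with EleutherAI's lm-evaluation-harness.
# Assumes `pip install lm-eval`; model id and tasks are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                       # Hugging Face transformers backend
    model_args="pretrained=gpt2",     # any model id from the HF Hub
    tasks=["hellaswag", "arc_easy"],  # benchmark tasks to run
    num_fewshot=0,                    # zero-shot evaluation
)

# Per-task metrics (accuracy, normalized accuracy, ...) live under "results".
print(results["results"])
```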
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Select and filter benchmarks for text embedding tasks
Note Text Embeddings benchmark across 58 tasks and 112 languages!
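For illustration, the companion mteb Python package can score any embedding model on these tasks; a minimal sketch, assuming mteb and sentence-transformers are installed, with placeholder model and task names:

```python
# Minimal sketch: running one MTEB task on a sentence-embedding model.
# Assumes `pip install mteb sentence-transformers`; names are illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")       # any embedding model
evaluation = MTEB(tasks=["Banking77Classification"])  # one of the benchmark tasks
results = evaluation.run(model, output_folder="results/minilm")
print(results)
```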
Submit models for evaluation and view leaderboard results
Note A leaderboard for tool-augmented LLMs!
Note An LLM leaderboard for Chinese models on many metric axes - super complete
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specifically for Korean models, by our friends at Upstage!
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Visualize model performance on function calling tasks
Note Tests LLM API usage and calls (few models at the moment)
Evaluate LLM cybersecurity risks
Note How likely is your LLM to help with cyberattacks?
Note An aggregation of benchmarks well correlated with human preferences
View and submit machine learning model evaluations
Note Bias, safety, toxicity: all the things that are important to test when your chatbot actually interacts with users
Display leaderboard data for video generation models
Note Text to video generation leaderboard
Note Coding benchmark
Display OCRBench leaderboard for model evaluations
Note An OCR benchmark
Explore and compare LLMs through a leaderboard
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Display model benchmark results
Note Success rates of red-teaming datasets against models
Submit and filter LLMs for evaluation
Note The Open LLM Leaderboard, but for structured state-space models!
Analyze images to detect and label objects
Note A multimodal arena!
Upload and evaluate video models
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in Italian!
Note An LLM leaderboard for Italian
Display Malay LLM leaderboard scores
Note An LLM leaderboard for Malay
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
Browse Q-Bench leaderboard for vision model performance
Display leaderboard for text-to-image model evaluations
Explore and submit LLM benchmark evaluations
Note A hallucination leaderboard, focused on a different set of tasks
Display and filter a leaderboard of language models
Browse and filter leaderboard of language models
Display and explore model leaderboards and chat history
Request evaluation results for a speech model
VLMEvalKit Evaluation Results Collection
Explore and analyze RewardBench leaderboard data
Vote on the latest TTS models!
Detect prompt injection risks
Display leaderboard results for coding tasks
Explore GenAI model efficiency on ML.ENERGY leaderboard
Display UGI leaderboard data in an interactive grid
Track, rank and evaluate open LLMs' CoT quality
Display a leaderboard of models
Browse and compare Indic language LLMs on a leaderboard
Leaderboard for LLMs on science reasoning
Browse and submit LLM evaluations
Browse leaderboard of language models
Visualize LLM progress with interactive filters
Track, rank and evaluate open LLMs and chatbots
Explore benchmark results for QA and long doc models
Track, rank and evaluate open Arabic LLMs and chatbots
Display and filter LLM benchmark results
Generate a 3D leaderboard by voting
Explore and analyze code evaluation data
Browse and submit LLM evaluations
Display and explore zebra puzzle leaderboard
Benchmark LLMs in accuracy and translation across languages
Submit a machine learning model for ranking evaluation
View and submit evaluations for benchmarks
Compare LLMs on role consistency across contexts
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and compare LLM benchmarks and submit models for evaluation
GIFT-Eval: A Benchmark for General Time Series Forecasting
Compare AI models by voting on responses
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLMs through interactive leaderboards and submissions
Submit protein prediction models to MLSB 2024 leaderboard
Explore toxicity scores of models
Compare image backgrounds and vote for the best
Display benchmark results for time series models
AI Phone Leaderboard
Find and filter models on the leaderboard
Display benchmark leaderboard for model evaluation
Display and analyze Polish text understanding benchmark results
Explore and compare model performance on Polish MT-Bench
DABstep Reasoning Benchmark Leaderboard