Gathering benchmark spaces on the hub (beyond the Open LLM Leaderboard)

Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Note 🏆 The 🤗 Open LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots. 🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
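The evaluations behind this leaderboard are run with EleutherAI's lm-evaluation-harness; below is a minimal sketch of scoring one benchmark locally through the harness's documented `simple_evaluate` entry point. The model name, task name, and few-shot count are illustrative assumptions, not the leaderboard's exact configuration.

```python
import lm_eval

# Illustrative: evaluate a small Hub model on one few-shot task.
# Task choice and num_fewshot are assumptions, not the leaderboard's
# exact settings.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```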
Select and filter benchmarks for text embedding tasks
Note Massive Text Embedding Benchmark (MTEB) Leaderboard.
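To get a feel for how results on this leaderboard are produced, here is a minimal sketch of evaluating an embedding model on a single MTEB task with the mteb package, following its documented usage pattern; the model and task names are illustrative choices, not a prescribed setup.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Illustrative model and task; MTEB covers many more tasks and languages
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```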
Note 🏆 This leaderboard is based on the following three benchmarks:
- Chatbot Arena - a crowdsourced, randomized battle platform. We use 70K+ user votes to compute Elo ratings (sketched below).
- MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
- MMLU (5-shot) - a test to measure a model's multitask accuracy on 57 tasks.
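For context, here is a minimal sketch of how pairwise battle votes can be turned into ratings using the standard Elo update rule; the K-factor, initial rating, and toy battle log are illustrative assumptions, not the Arena's exact parameters.

```python
from collections import defaultdict

def update_elo(ratings, model_a, model_b, winner, k=32, base=400):
    """Standard Elo update for one battle; k and base are illustrative."""
    ra, rb = ratings[model_a], ratings[model_b]
    ea = 1 / (1 + 10 ** ((rb - ra) / base))  # expected score for model_a
    sa = 1.0 if winner == model_a else 0.5 if winner == "tie" else 0.0
    ratings[model_a] = ra + k * (sa - ea)
    ratings[model_b] = rb + k * ((1 - sa) - (1 - ea))

# Toy battle log: (model_a, model_b, winner)
battles = [
    ("vicuna-13b", "alpaca-13b", "vicuna-13b"),
    ("vicuna-13b", "llama-13b", "tie"),
    ("alpaca-13b", "llama-13b", "alpaca-13b"),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000
for a, b, w in battles:
    update_elo(ratings, a, b, w)
print(dict(ratings))
```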
Explore hardware performance for language models
Note The 🤗 LLM-Perf Leaderboard 🏋️ aims to benchmark the performance (latency, throughput & memory) of Large Language Models (LLMs) across different hardware, backends and optimizations using Optimum-Benchmark and Optimum flavors. Anyone from the community can request a model or a hardware/backend/optimization configuration for automated benchmarking.
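Optimum-Benchmark has its own configuration-driven API; as a rough illustration of what the measured quantities mean, here is a minimal latency/throughput sketch in plain transformers. The model name and generation length are arbitrary choices, and this is not how the leaderboard itself runs its measurements.

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative; any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Benchmarking is", return_tensors="pt")
new_tokens = 64

# Warmup run so one-time initialization doesn't pollute the timing
model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)

start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
latency = time.perf_counter() - start

print(f"latency: {latency:.2f} s")
print(f"decode throughput: {new_tokens / latency:.1f} tokens/s")
```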
Submit code models for evaluation on benchmarks
Note Compare the performance of base multilingual code generation models on the HumanEval and MultiPL-E benchmarks. We also measure throughput and provide information about the models. We only compare open pre-trained multilingual code models that people can use as base models for their own training.
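HumanEval-style benchmarks are typically scored with pass@k; here is a minimal sketch of the standard unbiased estimator from the Codex paper, where n samples are drawn per problem and c of them pass the unit tests. The example numbers are illustrative.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a k-sample draw
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 37 pass -> estimated pass@10
print(pass_at_k(n=200, c=37, k=10))
```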
Request evaluation results for a speech model
Note The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. We report the Average WER (⬇️) and RTF (⬇️) - lower is better. Models are ranked based on their Average WER, from lowest to highest.
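For reference, WER is the word-level edit distance between the reference and hypothesis transcripts, normalized by reference length, and RTF is processing time divided by audio duration. A minimal WER sketch follows; the leaderboard itself relies on established tooling rather than a hand-rolled implementation like this one.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```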
Compare model answers to questions
Note The MT-Bench Browser (see Chatbot Arena).
Display ToolBench model performance results
Display a web page
View and filter MMBench leaderboard data
Explore and filter language model benchmark results
Submit and filter LLM models for evaluation
Open Persian LLM Leaderboard