LLM Leaderboard
28 AI tools in the LLM Leaderboard category
OlympicArena
A benchmark for evaluating AI models across multiple academic disciplines, including math, physics, chemistry, and biology.
InfiBench
A benchmark designed to evaluate large language models (LLMs) on their ability to answer real-world coding questions.
BeHonest
A pioneering benchmark designed to comprehensively assess honesty in LLMs.
MMedBench
A benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
LLMEval
Focuses on understanding how LLMs perform in various scenarios and on analyzing results from an interpretability perspective.
DreamBench++
A benchmark for evaluating the performance of large language models (LLMs) on tasks involving both textual and visual imagination.
PubMedQA
A biomedical question-answering benchmark for research questions grounded in PubMed abstracts.
Open LLM Leaderboard
Aims to track, rank, and evaluate LLMs and chatbots as they are released.
LawBench
A benchmark designed to evaluate large language models in the legal domain.
Berkeley Function-Calling Leaderboard
Evaluates LLMs' ability to call external functions and tools.
MixEval
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. It produces a model ranking that correlates highly with Chatbot Arena (0.96) while running locally and quickly (6% of the time and cost of running MMLU).
M3CoT
A benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
CompMix
A benchmark for evaluating QA methods that operate over a mixture of heterogeneous input sources (knowledge bases, text, tables, and infoboxes).
MMToM-QA
A multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.
ACLUE
An evaluation benchmark focused on ancient Chinese language comprehension.
FELM
A meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
Chatbot Arena Leaderboard
A benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner.
AlpacaEval
An automatic evaluator for instruction-following language models, using the Nous benchmark suite.
CompassRank
Dedicated to exploring the most advanced language and vision models, offering a comprehensive, objective, and neutral evaluation reference for industry and research.
MathEval
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.