LLM Leaderboard
28 AI tools in the LLM Leaderboard category
OlympicArena
A benchmark for evaluating AI models across multiple academic disciplines, including math, physics, chemistry, and biology.
InfiBench
A benchmark designed to evaluate large language models (LLMs) on their ability to answer real-world coding questions.
BeHonest
A pioneering benchmark designed to comprehensively assess honesty in LLMs.
MMedBench
A benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
LLMEval
Focuses on understanding how LLMs perform in various scenarios and on analyzing results from an interpretability perspective.
DreamBench++
A benchmark for evaluating the performance of large language models (LLMs) on tasks involving both textual and visual imagination.
PubMedQA
A biomedical question-answering benchmark for research questions grounded in PubMed abstracts.
Open LLM Leaderboard
Aims to track, rank, and evaluate LLMs and chatbots as they are released.
LawBench
A benchmark designed to evaluate large language models in the legal domain.
Berkeley Function-Calling Leaderboard
Evaluates LLMs' ability to call external functions and tools.
MixEval
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. It produces a model ranking that correlates highly with Chatbot Arena (0.96) while running locally and quickly (6% of the time and cost of running MMLU).
M3CoT
A benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
CompMix
A benchmark for evaluating QA methods that operate over a mixture of heterogeneous input sources (knowledge bases, text, tables, and infoboxes).
MMToM-QA
A multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.
ACLUE
An evaluation benchmark focused on ancient Chinese language comprehension.
FELM
A meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
Chatbot Arena Leaderboard
A benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner.
AlpacaEval
An automatic evaluator for instruction-following language models, using the Nous benchmark suite.
CompassRank
Dedicated to exploring the most advanced language and vision models, offering a comprehensive, objective, and neutral evaluation reference for industry and research.
MathEval
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.