MathEval

a comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.

AgentLLM Leaderboard

MixEval

a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. Its model rankings correlate highly with Chatbot Arena (0.96 correlation), and it runs locally and quickly (6% of the time and cost of running MMLU).

AgentLLM Leaderboard

MMedBench

a benchmark that evaluates large language models' ability to answer medical questions across multiple languages.

AgentLLM Leaderboard

MMToM-QA

a multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.

AgentLLM Leaderboard

OlympicArena

a benchmark for evaluating AI models across multiple academic disciplines like math, physics, chemistry, biology, and more.

AgentLLM Leaderboard

PubMedQA

a biomedical question-answering benchmark designed for answering research-related questions using PubMed abstracts.

AgentLLM Leaderboard

SciBench

a benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains like chemistry, physics, and mathematics.

AgentLLM Leaderboard

SuperBench

a benchmark platform for evaluating large language models (LLMs) across a range of tasks, with a focus on natural language understanding, reasoning, and generalization.

AgentLLM Leaderboard

SuperLim

a Swedish language understanding benchmark that evaluates natural language processing (NLP) models on various tasks such as argumentation analysis, semantic similarity, and textual entailment.

AgentLLM Leaderboard

TAT-DQA

a large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.

AgentLLM Leaderboard

TAT-QA

a large-scale question-answering benchmark focused on real-world financial data, integrating both tabular and textual information.

AgentLLM Leaderboard

VisualWebArena

a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks.

AgentLLM Leaderboard

We-Math

a benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.

AgentLLM Leaderboard

WHOOPS!

a benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.

AgentLLM Leaderboard

DeepSeek-Math-7B

LLM application: DeepSeek-Math-7B

AgentOpen LLM

DeepSeek-Coder-1.3|6.7|7|33B

LLM application: DeepSeek-Coder-1.3|6.7|7|33B

AgentOpen LLM

DeepSeek-VL-1.3|7B

LLM application: DeepSeek-VL-1.3|7B

AgentOpen LLM

DeepSeek-MoE-16B

LLM application: DeepSeek-MoE-16B

AgentOpen LLM

DeepSeek-Coder-v2-16|236B-MOE

LLM application: DeepSeek-Coder-v2-16|236B-MOE

AgentOpen LLM

DeepSeek-V2.5

LLM application: DeepSeek-V2.5

AgentOpen LLM
