MathEval
a comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
MixEval
a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which ranks LLMs in close agreement with Chatbot Arena (0.96 correlation) while running locally and quickly (6% of the time and cost of running MMLU).
MMedBench
a benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
MMToM-QA
a multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.
OlympicArena
a benchmark for evaluating AI models across multiple academic disciplines like math, physics, chemistry, biology, and more.
PubMedQA
a biomedical question-answering benchmark designed for answering research-related questions using PubMed abstracts.
SciBench
a benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains like chemistry, physics, and mathematics.
SuperBench
a benchmark platform for evaluating large language models (LLMs) across a range of tasks, with a focus on natural language understanding, reasoning, and generalization.
SuperLim
a Swedish language understanding benchmark that evaluates natural language processing (NLP) models on various tasks such as argumentation analysis, semantic similarity, and textual entailment.
TAT-DQA
a large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.
TAT-QA
a large-scale question-answering benchmark focused on real-world financial data, integrating both tabular and textual information.
VisualWebArena
a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks.
We-Math
a benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
WHOOPS!
a benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.
DeepSeek-Math-7B
LLM application: DeepSeek-Math-7B
DeepSeek-Coder-1.3|6.7|7|33B
LLM application: DeepSeek-Coder-1.3|6.7|7|33B
DeepSeek-VL-1.3|7B
LLM application: DeepSeek-VL-1.3|7B
DeepSeek-MoE-16B
LLM application: DeepSeek-MoE-16B
DeepSeek-Coder-v2-16|236B-MOE
LLM application: DeepSeek-Coder-v2-16|236B-MOE
DeepSeek-V2.5
LLM application: DeepSeek-V2.5