A benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains such as chemistry, physics, and mathematics.
Cross-referenced across 55 tracked directories.
Popularity Rank: #3924
Listed In: 1 / 55
Adoption Stage: Emerging
First Seen: 3/13/2026 (recently added to the ecosystem)
Related entries:
A leaderboard that aims to track, rank, and evaluate LLMs and chatbots as they are released.
An evaluation benchmark focused on ancient Chinese language comprehension.
A benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner.
An automatic evaluator for instruction-following language models using the Nous benchmark suite.