LLM Evaluation
48 AI tools in the LLM Evaluation category
LiveBench
A Challenging, Contamination-Free LLM Benchmark.
OpenAI Evals
An open-source library for evaluating task performance of language models and prompts.
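The evaluation libraries in this category share a common core pattern: run a model over a dataset of (prompt, expected answer) pairs and score the completions with a match metric. A minimal illustrative sketch of that pattern, with a hypothetical `call_model` stub standing in for any real LLM API call:

```python
# Minimal sketch of the eval-harness pattern these tools implement.
# `call_model` is a hypothetical stand-in; replace it with a real
# LLM API call (OpenAI, a local model, etc.).

def call_model(prompt: str) -> str:
    # Stub model for illustration only.
    return "Paris" if "capital of France" in prompt else "unknown"

def exact_match_eval(dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose completion exactly matches the target."""
    correct = sum(
        call_model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)

dataset = [
    ("What is the capital of France? Answer in one word.", "Paris"),
    ("What is 2 + 2? Answer in one word.", "4"),
]
print(exact_match_eval(dataset))  # 0.5 with the stub model above
```

Real harnesses add prompt templating, batching, and richer metrics (F1, BLEU, model-graded scoring), but the score loop above is the shape they all build on.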
llm-comparator
Google, LLC
LLM Comparator: An interactive visualization tool for side-by-side LLM evaluation
How to Evaluate LLM Applications: The Complete Guide - Confident AI
How to Evaluate, Compare, and Optimize LLM Systems
LLM Benchmarks: MMLU, HellaSwag, BBH, and Beyond - Confident AI
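Multiple-choice benchmarks like MMLU are typically scored by having the model pick one lettered option per question and reporting accuracy. An illustrative sketch, with a hypothetical `pick_option` stub in place of a real model call:

```python
# Illustrative sketch of multiple-choice benchmark scoring (MMLU-style).
# `pick_option` is a hypothetical stand-in: a real harness would prompt
# an LLM with the question and options, then parse the letter it returns.

def pick_option(question: str, options: list[str]) -> str:
    # Stub picker for illustration: always answers "A".
    return "A"

def mc_accuracy(items: list[tuple[str, list[str], str]]) -> float:
    """Accuracy over (question, options, correct_letter) items."""
    correct = sum(
        pick_option(question, options) == answer
        for question, options, answer in items
    )
    return correct / len(items)

items = [
    ("What gas do plants absorb?", ["CO2", "O2", "N2", "H2"], "A"),
    ("Which is the largest planet?", ["Mars", "Jupiter", "Venus", "Earth"], "B"),
]
print(mc_accuracy(items))  # 0.5 with the stub picker
```

Published implementations differ in how the choice is elicited (letter generation vs. comparing per-option log-likelihoods), but the accuracy computation is the same.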
AI Evaluation Metrics | Microsoft Learn
OLMO-eval
A repository for evaluating open language models.
Reward Bench Leaderboard - a Hugging Face Space by allenai
Evaluating Large Language Models
Methods, Best Practices & Tools | Lakera
instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
ianarawjo/ChainForge
An open-source visual programming environment for battle-testing prompts to LLMs.
The Ultimate Guide to LLM Product Evaluation
Cleanlab Trustworthy Language Model: Score the trustworthiness of any LLM response
LLM Evaluation | Clarifai Guide
simple-evals
Eval tools by OpenAI.
LLM Leaderboards
lighteval
A lightweight LLM evaluation suite that Hugging Face has been using internally.
Prometheus-2 Cookbook - LlamaIndex
"An Open Source Language Model Specialized in Evaluating Other Language Models."
LLM Evaluation: Everything You Need To Run, Benchmark Evals