LLM Evaluation
48 AI tools in the LLM Evaluation category
LiveBench
A Challenging, Contamination-Free LLM Benchmark.
OpenAI Evals
An open-source library for evaluating task performance of language models and prompts.
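The evaluation libraries in this category share a common core pattern: run a model over a dataset of (prompt, expected answer) pairs and score the completions with a match metric. A minimal illustrative sketch of that pattern, with a hypothetical `call_model` stub standing in for any real LLM API call:

```python
# Minimal sketch of the eval-harness pattern these tools implement.
# `call_model` is a hypothetical stand-in; replace it with a real
# LLM API call (OpenAI, a local model, etc.).

def call_model(prompt: str) -> str:
    # Stub model for illustration only.
    return "Paris" if "capital of France" in prompt else "unknown"

def exact_match_eval(dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose completion exactly matches the target."""
    correct = sum(
        call_model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)

dataset = [
    ("What is the capital of France? Answer in one word.", "Paris"),
    ("What is 2 + 2? Answer in one word.", "4"),
]
print(exact_match_eval(dataset))  # 0.5 with the stub model above
```

Real harnesses add prompt templating, batching, and richer metrics (F1, BLEU, model-graded scoring), but the score loop above is the shape they all build on.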
llm-comparator
Google, LLC
LLM Comparator: An interactive visualization tool for side-by-side LLM evaluation
How to Evaluate LLM Applications: The Complete Guide - Confident AI
How to Evaluate, Compare, and Optimize LLM Systems
LLM Benchmarks: MMLU, HellaSwag, BBH, and Beyond - Confident AI
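Multiple-choice benchmarks like MMLU are typically scored by having the model pick one lettered option per question and reporting accuracy. An illustrative sketch, with a hypothetical `pick_option` stub in place of a real model call:

```python
# Illustrative sketch of multiple-choice benchmark scoring (MMLU-style).
# `pick_option` is a hypothetical stand-in: a real harness would prompt
# an LLM with the question and options, then parse the letter it returns.

def pick_option(question: str, options: list[str]) -> str:
    # Stub picker for illustration: always answers "A".
    return "A"

def mc_accuracy(items: list[tuple[str, list[str], str]]) -> float:
    """Accuracy over (question, options, correct_letter) items."""
    correct = sum(
        pick_option(question, options) == answer
        for question, options, answer in items
    )
    return correct / len(items)

items = [
    ("What gas do plants absorb?", ["CO2", "O2", "N2", "H2"], "A"),
    ("Which is the largest planet?", ["Mars", "Jupiter", "Venus", "Earth"], "B"),
]
print(mc_accuracy(items))  # 0.5 with the stub picker
```

Published implementations differ in how the choice is elicited (letter generation vs. comparing per-option log-likelihoods), but the accuracy computation is the same.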
AI Evaluation Metrics | Microsoft Learn
OLMO-eval
A repository for evaluating open language models.
Reward Bench Leaderboard - a Hugging Face Space by allenai
Evaluating Large Language Models
Methods, Best Practices & Tools | Lakera
instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
ianarawjo/ChainForge
An open-source visual programming environment for battle-testing prompts to LLMs.
The Ultimate Guide to LLM Product Evaluation
Cleanlab Trustworthy Language Model: Score the trustworthiness of any LLM response
LLM Evaluation | Clarifai Guide
simple-evals
Eval tools by OpenAI.
LLM Leaderboards
lighteval
A lightweight LLM evaluation suite that Hugging Face has been using internally.
Prometheus-2 Cookbook - LlamaIndex
"An Open Source Language Model Specialized in Evaluating Other Language Models."
LLM Evaluation: Everything You Need To Run, Benchmark Evals