AI Evals — What, Why, How, and Where

AI Evals — A Practical, End‑to‑End Overview

This document captures a consolidated, practical understanding of AI evaluations (evals) based on our discussions and the reference material you shared. It is intended as a mental model + system design reference for building, running, and operationalizing evals in modern, agentic AI systems.


1. What Are AI Evals?

AI evals are structured mechanisms for measuring, validating, and improving the behavior of AI systems—especially LLM‑powered applications and agents—against explicit goals.

At their core, evals answer:

  • Did the system do the right thing?
  • How well did it do it?
  • Is it improving or regressing over time?

Evals are not just tests. They are feedback instruments that sit inside a continuous learning loop.

Key characteristics:

  • Quantitative and/or qualitative
  • Automated or human‑in‑the‑loop
  • Offline (pre‑deploy) and online (post‑deploy)
  • Model‑centric and system‑centric

2. Why Evals Matter (The Aspiration)

The aspiration of evals is high‑agency AI systems:

Systems that can learn, self‑correct, and improve autonomously while remaining safe, grounded, and aligned.

Evals enable:

  • Trust — confidence in outputs, safety, and compliance
  • Velocity — faster iteration without fear of regressions
  • Autonomy — agents that adapt their behavior based on feedback
  • Accountability — explicit signals tied to product goals
  • Comparability — informed choices between models, prompts, tools, and architectures

Long‑term aspiration:

  • Evals become a control plane, not a reporting artifact
  • Systems dynamically select models, tools, or strategies based on eval outcomes
  • Human review is focused on edge cases, not routine validation

3. Where Evals Live in the System

Evals are not a single component—they span the lifecycle.

3.1 Offline / Pre‑Deployment

Used during:

  • Prompt engineering
  • Model selection
  • RAG tuning
  • Agent design

Examples:

  • Benchmarking different prompts
  • Testing retrieval quality on golden datasets
  • Safety and policy validation

3.2 CI/CD Integration

Evals increasingly behave like unit tests + integration tests for AI (a minimal CI gate is sketched after the lists below):

  • Triggered on PRs
  • Run in pipelines
  • Gate deployments

Signals:

  • Pass/fail thresholds
  • Regression detection
  • Composite quality scores
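
A minimal sketch of such a gate in Python, applied to the per‑metric scores an eval run produces. The metric names, thresholds, and the idea of comparing against a stored baseline are illustrative assumptions, not a specific tool's API:

    # Fail the pipeline when a metric drops below its threshold or regresses
    # versus the last accepted baseline.
    THRESHOLDS = {"groundedness": 0.85, "task_success": 0.80}
    MAX_REGRESSION = 0.02  # largest allowed drop per metric vs. baseline

    def gate(scores: dict[str, float], baseline: dict[str, float]) -> list[str]:
        failures = []
        for metric, minimum in THRESHOLDS.items():
            score = scores.get(metric, 0.0)
            if score < minimum:
                failures.append(f"{metric}={score:.2f} is below threshold {minimum}")
            if metric in baseline and baseline[metric] - score > MAX_REGRESSION:
                failures.append(f"{metric} regressed from {baseline[metric]:.2f} to {score:.2f}")
        return failures

    if __name__ == "__main__":
        # Example values; in CI these would come from the eval run and the stored baseline.
        scores = {"groundedness": 0.88, "task_success": 0.79}
        baseline = {"groundedness": 0.90, "task_success": 0.82}
        failures = gate(scores, baseline)
        if failures:
            raise SystemExit("Eval gate failed:\n" + "\n".join(failures))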

3.3 Online / Production

Post‑deployment evals monitor real behavior:

  • Live traffic sampling
  • Shadow evals
  • Drift detection
  • User feedback loops

This is where evals transition from quality assurance → operational intelligence.
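
A small sketch of that transition, assuming you can hook the request path: a fraction of live interactions is sampled into a queue and scored asynchronously (a shadow eval), so scoring never blocks the user‑facing response. The names here are illustrative:

    import random

    EVAL_SAMPLE_RATE = 0.05  # score roughly 5% of production traffic

    def maybe_enqueue_for_eval(interaction: dict, queue: list) -> None:
        """Shadow eval: sample and defer scoring; never block the response path."""
        if random.random() < EVAL_SAMPLE_RATE:
            queue.append(interaction)  # scored later by an offline eval worker

    # Example
    eval_queue: list[dict] = []
    maybe_enqueue_for_eval(
        {"trace_id": "abc123", "prompt": "How do I reset my password?", "response": "..."},
        eval_queue,
    )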


4. When to Run Evals

Phase          Purpose
Design         Validate assumptions and UX intent
Development    Iterate prompts, tools, and flows
Pre‑Deploy     Catch regressions, safety issues
Deploy         Confidence gating
Post‑Deploy    Drift detection, learning loops
Continuous     Autonomous optimization

Evals should be:

  • Cheap and frequent early
  • Representative and precise later

5. How Evals Work (Conceptual Model)

5.1 Inputs

  • Test prompts / conversations
  • Context (documents, tools, memory)
  • Expected outputs or scoring criteria

5.2 Execution

  • Run model / agent
  • Capture outputs, traces, tool calls

5.3 Scoring

  • Automatic metrics (scores, labels)
  • Model‑based judges (a minimal judge sketch follows this list)
  • Human review (as needed)
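
A minimal sketch of a model‑based judge. The call_llm parameter stands in for whatever model client you use; the prompt, scale, and parsing are illustrative assumptions:

    JUDGE_PROMPT = """Rate the RESPONSE for relevance to the QUESTION on a 1-5 scale.
    Reply with a single integer only.
    QUESTION: {question}
    RESPONSE: {response}"""

    def judge_relevance(question: str, response: str, call_llm) -> int:
        """Returns 1-5, or 0 when the judge output is unparseable (flag for human review)."""
        raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
        digits = [c for c in raw if c.isdigit()]
        return int(digits[0]) if digits else 0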

5.4 Aggregation

  • Per‑metric scores
  • Weighted composite scores
  • Pass/fail thresholds

5.5 Action

  • Block deployment
  • Select alternate model
  • Trigger self‑correction
  • Update prompts, skills, or policies
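
Putting 5.1 through 5.5 together, a minimal harness sketch. Here app stands in for the system under test and scorers for your evaluators (automatic metrics, judges, or wrappers around human labels); all names are illustrative:

    from statistics import mean

    def run_evals(cases, app, scorers, weights, threshold=0.8):
        results = []
        for case in cases:                                # 5.1 inputs
            output = app(case["input"])                   # 5.2 execution
            scores = {name: fn(case, output) for name, fn in scorers.items()}  # 5.3 scoring
            composite = sum(weights[n] * s for n, s in scores.items())         # 5.4 aggregation
            results.append({"case": case, "scores": scores, "composite": composite})
        overall = mean(r["composite"] for r in results)
        return overall >= threshold, results              # 5.5 action: gate, retry, or escalate

    # Example: the returned boolean drives the action (block deploy, switch model, self-correct).
    passed, details = run_evals(
        cases=[{"input": "What is our refund policy?", "expected": "30 days"}],
        app=lambda prompt: "Refunds are accepted within 30 days.",
        scorers={"contains_expected": lambda case, out: float(case["expected"] in out)},
        weights={"contains_expected": 1.0},
    )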

6. Core Categories of Eval Metrics

6.1 Quality & Language

Measures how well‑formed and understandable the response is.

  • Fluency
  • Coherence
  • Similarity
  • Relevance
  • Response Completeness

6.2 Grounding & Retrieval

Measures whether responses are based on provided sources.

  • Groundedness
  • GroundednessPro
  • Retrieval
  • Ungrounded Attributes

Critical for RAG and enterprise systems.


6.3 Task & Intent Success

Measures whether the system achieved the user’s goal.

  • Intent Resolution
  • Task Success / ECI
  • F1 / Exact Match, where applicable (sketched below)
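
A short sketch of the classic extractive‑QA style versions of these two metrics; the normalization here is deliberately minimal:

    def exact_match(prediction: str, reference: str) -> float:
        """1.0 if the normalized strings are identical, else 0.0."""
        return float(prediction.strip().lower() == reference.strip().lower())

    def token_f1(prediction: str, reference: str) -> float:
        """Harmonic mean of token-level precision and recall."""
        pred, ref = prediction.lower().split(), reference.lower().split()
        common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
        if common == 0:
            return 0.0
        precision, recall = common / len(pred), common / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(exact_match("Paris", "paris"))                  # 1.0
    print(token_f1("the capital is Paris", "Paris"))      # 0.4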

6.4 Safety & Policy

Measures harmful or disallowed content.

  • Violence
  • Sexual
  • Self‑Harm
  • Hate & Unfairness
  • Indirect Attacks
  • Protected Material
  • Code Vulnerability

These often operate as hard gates.


6.5 Text Similarity & NLP Benchmarks

Often used for summarization, translation, or generation comparison.

  • BLEU
  • ROUGE
  • METEOR
  • GLEU

Useful but not sufficient alone.
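
A quick sketch of computing two of these, assuming the nltk and rouge-score packages are installed (pip install nltk rouge-score); other libraries expose similar functions:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer

    reference = "the cat sat on the mat"
    candidate = "a cat was sitting on the mat"

    # BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)

    # ROUGE-1 / ROUGE-L measure unigram and longest-common-subsequence overlap.
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)

    print(f"BLEU: {bleu:.3f}")
    print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")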


7. Composite Scoring & Tradeoffs

Single metrics are rarely enough.

Modern systems use:

  • Weighted metric bundles
  • Context‑specific thresholds
  • Scenario‑based scoring

Example:

  • Retrieval quality weighted higher for RAG
  • Safety metrics always blocking
  • Fluency de‑prioritized for internal tools

Composite scores are typically external to individual evaluators and live in an orchestration or policy layer.
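
A minimal sketch of what that policy layer might compute: safety acts as a hard gate, and everything else is blended with scenario‑specific weights. Metric names and weights are illustrative:

    RAG_WEIGHTS = {"groundedness": 0.4, "retrieval": 0.3, "relevance": 0.2, "fluency": 0.1}

    def composite(scores: dict[str, float], weights: dict[str, float],
                  safety_flags: dict[str, bool]) -> float:
        """Weighted quality score; any tripped safety flag zeroes it out (hard gate)."""
        if any(safety_flags.values()):
            return 0.0
        return sum(weights[m] * scores.get(m, 0.0) for m in weights)

    # Example
    score = composite(
        {"groundedness": 0.9, "retrieval": 0.85, "relevance": 0.8, "fluency": 0.7},
        RAG_WEIGHTS,
        {"violence": False, "protected_material": False},
    )
    print(round(score, 3))  # 0.845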


8. Evals in Agentic Systems

Agent evals expand beyond text:

  • Tool selection correctness
  • Planning quality
  • Step ordering
  • Error recovery
  • Self‑reflection quality

Key insight:

You don’t just eval outputs — you eval decisions (a tool‑selection scoring sketch follows the list below).

This enables:

  • Self‑correction loops
  • Strategy switching
  • Skill refinement
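
A small sketch of scoring one such decision: comparing the agent's observed tool calls against an expected plan. The tool names and the two sub‑scores are illustrative assumptions:

    def tool_selection_score(expected: list[str], actual: list[str]) -> dict[str, float]:
        """Coverage: share of expected tools used.
        Ordering: do the shared tools appear in the expected relative order?"""
        coverage = len(set(expected) & set(actual)) / max(len(set(expected)), 1)
        ordering = float(
            [t for t in actual if t in expected] == [t for t in expected if t in actual]
        )
        return {"tool_coverage": coverage, "step_ordering": ordering}

    # Example: the agent retrieved and replied, but skipped the policy check.
    print(tool_selection_score(
        expected=["search_docs", "check_policy", "draft_reply"],
        actual=["search_docs", "draft_reply"],
    ))  # tool_coverage ~0.67, step_ordering = 1.0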

9. Human‑in‑the‑Loop (HITL)

Humans are still essential for:

  • Defining intent
  • Labeling gold datasets
  • Reviewing edge cases
  • Training evaluators

But the trajectory is clear:

Humans move from graders → designers of grading systems.


10. System Design Principles for Evals

  • Evals are first‑class citizens, not afterthoughts
  • Separate signal generation from decision policy
  • Optimize for iteration speed early
  • Treat eval datasets as versioned assets
  • Expect metrics to evolve

11. Reference Materials

Primary documents referenced:


12. Mental Model Summary

Think of evals as:

The nervous system of AI products

They sense, score, and signal—so systems can act with speed, safety, and intent.

If you want, next we can:

  • Turn this into a one‑page executive view
  • Map evals to a reference architecture
  • Design an eval control plane for agents
  • Create a metric → decision matrix