Consider a scenario where your team builds an AI agent that aces every demo. But when leadership asks, “How do we know it will work with real customers?” or “What if it fails? How can we trust it?” doubts suddenly begin to creep in.
This uncertainty highlights an AI trust deficit: building an AI that works is one thing; building one you can trust is the real challenge.
This isn't just a hypothetical issue. As companies put AI deeper into their operations, the stakes climb and the need to get it right becomes more urgent.
Regulations like the EU’s AI Act require proof that AI systems are fair and safe. Plus, the risks of untested AI, such as unexpected costs or brand-damaging mistakes, are too serious to ignore.
So, how can you close this trust gap? The solution lies in AI evaluations: a framework for testing, measuring, and monitoring your AI system’s performance.
In this article, we’ll explore what AI evaluations are, why they’re crucial for responsible AI, and how a platform like Domo can help you implement responsible AI evaluations at scale.
Let’s start with clarifying what exactly AI evaluations are and how they differ from traditional software testing.
What are AI evaluations?
AI evaluations (AI evals) are systematic measurements of an AI system’s performance, safety, and reliability against predefined standards, both during development and in production (real-world) environments. They give you evidence that an AI behaves as expected, aligns with your organization’s values and policies, and can be trusted to operate safely.
Evaluations fall into three categories, each serving a distinct purpose within the AI lifecycle:
- Background monitoring: These evals run on live traffic or logs, without interrupting the core workflow. They track drift or performance degradation over time (checking if response quality or speed has declined).
- Guardrails: These are real-time evaluation methods embedded directly in the AI pipeline to intercept and mitigate risks before outputs reach users. They enforce boundaries by scanning for harmful content, such as toxicity, bias, personally identifiable information (PII), or off-topic responses, and then block, reroute, or rewrite outputs to ensure compliance (see the sketch after this list).
- Improvement-focused evals: These are offline evaluations used during development and fine-tuning cycles to optimize AI performance. For example, after building a prompt or training a model, run evals on a benchmark data set or user stories to see how performance changed. These evals help you rank models, select prompts, and identify failure cases. They often involve human annotation or LLM-based scoring to compare different versions.
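To make the guardrail idea concrete, here's a minimal sketch in Python of a check that scans a draft response for personally identifiable information and withholds it before it reaches the user. The regex patterns and the apply_guardrail helper are illustrative assumptions, not part of any particular product; a production guardrail would use a far more robust detector.

```python
import re

# Illustrative PII patterns -- a real guardrail would rely on a dedicated
# PII/moderation detector rather than a couple of regexes.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def apply_guardrail(draft_response: str) -> str:
    """Block a response that contains PII; otherwise pass it through unchanged."""
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(draft_response):
            # Block entirely and return a safe fallback instead of the draft.
            return f"[Response withheld: detected {label} in generated output]"
    return draft_response

# The guardrail intercepts the output before the user ever sees it.
print(apply_guardrail("Sure, the customer's email is jane.doe@example.com"))
print(apply_guardrail("Our support hours are 9am-5pm, Monday to Friday."))
```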
AI evaluations vs traditional software testing
Although both strategies aim to ensure a system’s reliability, they differ greatly because of AI’s inherent unpredictability. Traditional testing works in predictable environments where outcomes are deterministic (the same input yields identical output). On the other hand, large language models (LLMs) and agents produce probability-based outputs, involve subjective judgments (such as determining whether the language sounds natural), and use data that can change.
As a result, AI testing and validation must evaluate more nuanced qualities and allow for greater flexibility, drawing on a mix of strategies to handle the variability and complexity of AI systems. In short, traditional testing checks for exact, repeatable results, while AI evaluation scores outputs against quality thresholds and tracks them over time.
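To make the contrast concrete, here's a small, hypothetical sketch: a traditional unit test asserts one exact answer, while an AI eval scores a batch of sampled outputs against a quality threshold. The score_relevance function is a placeholder for whatever metric or judge you actually use.

```python
# Traditional test: deterministic, pass/fail on an exact value.
def add(a, b):
    return a + b

assert add(2, 2) == 4  # same input, identical output, every time

# AI eval: outputs vary, so we score a batch and check an aggregate threshold.
def score_relevance(question: str, answer: str) -> float:
    """Placeholder scorer (0.0-1.0); in practice this might be an LLM judge
    or a human rubric rather than a keyword check."""
    return 1.0 if "refund" in answer.lower() else 0.0

sampled_answers = [
    "You can request a refund within 30 days of purchase.",
    "Our policy allows returns and refunds for most items.",
    "Please contact support for help with your order.",
]
scores = [score_relevance("What is the refund policy?", a) for a in sampled_answers]
avg = sum(scores) / len(scores)
assert avg >= 0.6, f"Relevance {avg:.2f} fell below the acceptance threshold"
```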
Let’s explore why evaluations are critical for LLMs and AI agents and how they form the backbone of responsible, governed AI systems.
Why evaluations matter for AI agents and LLMs
Implementing an evaluation framework is a strategic necessity: it turns promising technologies like LLMs and AI agents into trusted business tools. It helps mitigate critical business risks and ultimately proves an AI system’s value to the organization.
AI evals are how responsible AI principles like fairness, accountability, transparency, and privacy become measurable outcomes during development. To assess fairness, we check bias and fairness scores to detect disparities across user groups.
For reliability, we verify that outputs are accurate rather than misleading or hallucinated. Similarly, safety is maintained by testing for issues like prompt injection and monitoring for harmful content. We also measure how well the system follows rules using a policy adherence rate.
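As a rough sketch of how those principles become numbers, the snippet below computes a hypothetical policy adherence rate and a fairness gap across user groups from logged evaluation results. The record fields and the 0.05 gap threshold are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical eval log: one record per evaluated interaction.
records = [
    {"group": "en", "policy_compliant": True,  "task_success": True},
    {"group": "en", "policy_compliant": True,  "task_success": True},
    {"group": "es", "policy_compliant": True,  "task_success": False},
    {"group": "es", "policy_compliant": False, "task_success": True},
]

# Policy adherence rate: share of outputs that passed every policy check.
adherence = sum(r["policy_compliant"] for r in records) / len(records)

# Fairness gap: largest difference in task success rate between user groups.
by_group = defaultdict(list)
for r in records:
    by_group[r["group"]].append(r["task_success"])
rates = {g: sum(v) / len(v) for g, v in by_group.items()}
fairness_gap = max(rates.values()) - min(rates.values())

print(f"Policy adherence rate: {adherence:.0%}")
print(f"Success rate by group: {rates}, gap: {fairness_gap:.2f}")
if fairness_gap > 0.05:  # illustrative threshold
    print("Warning: disparity across groups exceeds the allowed gap")
```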
If organizations don’t do a formal evaluation, they face considerable business risks, including:
- Financial risk: An improperly designed agent can get stuck in an execution loop, making thousands of unnecessary API calls and racking up considerable costs in a matter of hours. A thorough evaluation process stress-tests an agent’s logic to prevent such costly runaway scenarios.
- Brand risk: A customer-facing chatbot might give wrong pricing or misrepresent products, causing brand damage. It could also produce offensive responses, eroding customer trust built over the years. Evals help to catch off-brand or insensitive outputs before customers see them.
- Operational and compliance risk: Internally, an unvetted AI tool can erode confidence. For example, if an internal data-analysis agent keeps giving subtly wrong insights, teams may start ignoring it or reverting to manual work. Beyond these internal challenges, this lack of validation creates significant external exposure. A failure to ensure an AI system adheres to industry regulations like GDPR or HIPAA, or even internal data governance policies, can result in significant fines, legal action, and a loss of customer confidence.
Finally, evaluations bridge the gap between a working demo and a valuable product. When AI is treated as an engineering project, leadership wants evidence of impact: not just proof that it works, but proof that it works well. A good eval framework provides that evidence by measuring metrics such as accuracy, speed, cost, and user satisfaction.
How startups use evaluations to improve AI outcomes
Modern AI-centric startups put evaluation at the core of their workflow to make AI ready for real-world use. They adopt eval-driven development (EDD) to quickly build, test, and improve features, turning prototypes into robust solutions.
Here are some common practices:
Adopting “eval-driven development” (EDD)
Instead of treating testing as just a final checkbox, teams incorporate evaluations into the core feedback process. They develop a feature or model update, immediately run it through a series of evaluation tests, and analyze the results before moving further.
This institutionalizes the test-learn-iterate cycle, using quantitative feedback to make improvements and quickly innovate products. It’s similar to how software teams depend on unit tests and CI/CD, but for AI, it often involves more comprehensive metrics.
Making data-driven model selections
Instead of simply choosing the latest hyped or largest foundation model, agile teams use evaluations to pick the right model for the job.
They run multiple candidate models through the same eval tests to compare performance on their specific tasks. Startups optimize the balance of accuracy, speed, and expense by benchmarking each model.
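Here's a hedged sketch of that benchmarking loop: run every candidate model over the same test set and record accuracy, latency, and estimated cost. The call_model helper and the per-call costs are placeholders, not real vendor APIs or prices.

```python
import time

test_set = [
    {"prompt": "Summarize: the meeting was moved to Friday.", "expected": "friday"},
    {"prompt": "Summarize: revenue grew 12% last quarter.", "expected": "12%"},
]

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real model call via your provider's SDK."""
    return f"({model_name}) the meeting was moved to friday; revenue grew 12%"

def benchmark(model_name: str, cost_per_call: float) -> dict:
    correct, latencies = 0, []
    for case in test_set:
        start = time.perf_counter()
        output = call_model(model_name, case["prompt"])
        latencies.append(time.perf_counter() - start)
        correct += case["expected"] in output.lower()
    return {
        "model": model_name,
        "accuracy": correct / len(test_set),
        "avg_latency_s": sum(latencies) / len(latencies),
        "est_cost": cost_per_call * len(test_set),  # assumed flat per-call cost
    }

# Compare candidates on the same eval before picking one.
for result in (benchmark("model-a", 0.002), benchmark("model-b", 0.0005)):
    print(result)
```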
Optimizing prompts systematically
Crafting the perfect prompt requires creativity, but improving it should be data-driven. Evals are used to A/B test different prompt variations at scale by comparing their outputs against ground truth or human preference scores.
Teams can measure which prompt yields better accuracy, clarity, and on-brand responses by running a standard set of tests on each prompt. It eliminates guesswork and makes the development process more precise.
Catching regressions before they ship
Startups roll out new code and AI changes every day. To keep everything running smoothly and prevent new changes from causing regressions, teams integrate evals into their CI/CD pipelines.
Whenever code or prompts are modified, a script automatically runs the standard eval suite (unit tests, automated checks, and LLM-based scoring) on a test set. If critical metrics fall below a threshold, the system fails the build.
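A minimal sketch of such a gate, assuming your eval suite can report a dictionary of aggregate scores: the thresholds and the run_eval_suite helper are illustrative, and in a real pipeline the script's non-zero exit code is what fails the CI build.

```python
import sys

# Minimum acceptable values for the metrics your eval suite reports.
THRESHOLDS = {"accuracy": 0.85, "faithfulness": 0.90, "toxicity_free": 0.99}

def run_eval_suite() -> dict:
    """Placeholder: run unit tests, automated checks, and LLM-based scoring
    on the standard test set, then return aggregate scores."""
    return {"accuracy": 0.88, "faithfulness": 0.93, "toxicity_free": 0.995}

def main() -> int:
    scores = run_eval_suite()
    failures = [
        f"{metric}: {scores[metric]:.3f} < {minimum:.3f}"
        for metric, minimum in THRESHOLDS.items()
        if scores.get(metric, 0.0) < minimum
    ]
    if failures:
        print("Eval gate failed:\n  " + "\n  ".join(failures))
        return 1  # non-zero exit code fails the CI build
    print("Eval gate passed:", scores)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```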
Key metrics and methods for evaluating AI
To ensure an AI works as intended, it’s important to decide how to evaluate it (the method) and what to measure (the metrics). There are three core AI evaluation methods:
- Automated evals: These are code-driven checks that automatically score outputs. They’re fast, scalable, and consistent, and let you run thousands of test cases without manual labor. Automated evals are great for catching obvious failures, such as incorrect data formatting or a missing required keyword.
- Human in the loop (HITL): HITL is considered the gold standard for assessing subjective quality. It uses human reviewers with a detailed rubric to judge nuances like tone, coherence, or faithfulness in context. HITL also captures dimensions that automated checks miss, such as subtle biases or user satisfaction.
- LLM-as-a-judge: A hybrid approach that uses a language model (like DomoGPT) to score or compare outputs based on natural language criteria. It’s faster and more scalable than HITL while capturing more nuance than purely automated methods.
Practically, a strong evaluation strategy combines these methods: automated checks as a first-pass filter, LLM judges for semantic scoring, and humans for final validation. No matter which methods you use, run the same test set under controlled conditions so you can track changes over time.
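As a rough illustration of the LLM-as-a-judge pattern, the sketch below builds a grading prompt and expects the judge model to return a 1-to-5 score with a short rationale. The call_judge_model function is a placeholder for whichever LLM client you use, and the rubric and JSON format are assumptions.

```python
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) for accuracy and helpfulness.
Respond as JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def call_judge_model(prompt: str) -> str:
    """Placeholder for a real LLM call; swap in your provider's client here."""
    return '{"score": 4, "reason": "Correct and clear, but misses one edge case."}'

def judge(question: str, answer: str) -> dict:
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # in production, validate and retry on malformed JSON

print(judge("How do I reset my password?", "Click 'Forgot password' on the login page."))
```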
To track your evaluations and understand how to test AI agents in practice, you’ll typically build a dashboard of key metrics. Here are common categories:
- Accuracy and relevance: Measures if the AI’s answers are correct and on-topic, ensuring they’re helpful.
- Task success rate: Tracks the percentage of goals or instructions the AI successfully completes, serving as a high-level indicator of effectiveness.
- Faithfulness (vs hallucination): Assesses how often the AI invents facts or strays from source data, which is critical for building user trust.
- Contextual recall and precision (for RAG): For systems that retrieve information, this verifies that the AI found all the right data (recall) and that the data it used was relevant (precision); a worked example follows this list.
- Toxicity and bias: Scans for harmful, offensive, or unfair content to safeguard user experience and ensure compliance.
- Latency and cost: Measures response time and resource consumption (like API tokens) to manage user experience and operational budgets.
- Robustness: Tests the AI’s performance against unexpected or adversarial inputs to ensure it isn’t brittle.
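To show how two of these metrics are computed, here's a small sketch of contextual recall and precision for a RAG system, comparing the documents the retriever returned against a labeled set of relevant documents. The document IDs are purely illustrative.

```python
# Ground-truth relevant documents for one test query (hypothetical IDs).
relevant = {"doc_12", "doc_34", "doc_56"}
# Documents the retriever actually returned for that query.
retrieved = {"doc_12", "doc_34", "doc_99"}

true_positives = relevant & retrieved
recall = len(true_positives) / len(relevant)       # did we find all the right data?
precision = len(true_positives) / len(retrieved)   # was what we used relevant?

print(f"Contextual recall: {recall:.2f}, contextual precision: {precision:.2f}")
# -> recall 0.67 (missed doc_56), precision 0.67 (doc_99 was irrelevant)
```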
Common challenges in AI evaluations
Despite the benefits, designing good evals for LLMs and agents is hard. Here are some common hurdles:
- The three gulfs: Many of the evaluation challenges can be framed as three conceptual “gulfs” that separate a successful prototype from a reliable production system.
  - Gulf 1. Comprehension: AI developers can’t know the full range of user inputs. It’s impossible to review every possible query or data point manually. Without evals, you might overlook rare edge cases or corrupt data patterns. We need evaluations to understand the input data and the diversity of outputs at scale.
  - Gulf 2. Specification: The goal of a prompt or task is often unclear, and the model may not understand your unstated expectations. Evals help clarify these implicit requirements and craft prompts and evaluation rubrics that are precise enough to define what a “good” output looks like.
  - Gulf 3. Generalization: Even with good data and clear prompts, models can generalize incorrectly. A model might misinterpret a rare edge case or confuse context. No model is perfect, so evaluations must probe how the system handles unusual scenarios and ensure it doesn’t systematically fail on them.
- Choosing the right metrics: While it’s easy to use standard NLP benchmarks like BLEU, ROUGE, and others, they may not accurately represent your specific use case. The definitions of these metrics can often be ambiguous for various tasks. You might require custom metrics or human evaluations for subjective aspects. Spend time defining exactly what “correct” means for your task.
- Data understanding and curation: Evaluations are only as good as the data behind them. Creating a high-quality test set (with correct labels and diversity) is challenging. Biases or gaps in the eval data will skew results. It’s essential to continually audit and update your test collections, as sketched below.
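A small sketch of that auditing step, assuming a labeled eval set: it flags duplicate test cases and labels that are badly under-represented. The field names and the 10 percent floor are illustrative assumptions.

```python
from collections import Counter

eval_set = [
    {"input": "What is your refund policy?", "label": "billing"},
    {"input": "What is your refund policy?", "label": "billing"},  # duplicate
    {"input": "The app crashes on startup.", "label": "bug"},
    {"input": "How do I export a report?", "label": "how_to"},
]

# 1. Duplicates inflate some behaviors and hide others.
inputs = [case["input"] for case in eval_set]
duplicates = [text for text, n in Counter(inputs).items() if n > 1]

# 2. Label balance: flag categories below an illustrative 10% floor.
label_counts = Counter(case["label"] for case in eval_set)
sparse = [lbl for lbl, n in label_counts.items() if n / len(eval_set) < 0.10]

print("Duplicate inputs:", duplicates)
print("Under-represented labels:", sparse)
```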
How Domo helps operationalize AI evaluations at scale
AI evaluations are only as strong as the data and systems behind them. Domo provides an end-to-end platform to connect your AI eval efforts with your organization’s data and workflows. Here’s how we at Domo can help:
- Unified, governed data foundation: Domo lets you bring all relevant data into a single governed platform. This means your test cases can be directly sourced from real production data or business performance indicators, which ensures they are representative and up-to-date.
- Connect AI results to business metrics: In Domo, you can visualize AI evaluation results alongside core key performance indicators (KPIs). It lets you trace the impact of AI behavior on real outcomes.
- Real-time monitoring and alerts: Domo supports live dashboards and automated alerts. You can instrument your AI system to send eval metrics into Domo as they happen (see the sketch after this list). If any metric crosses a threshold, Domo can automatically notify you, helping ensure that degradations or drift are caught early, before they impact users.
- Controlled AI development lifecycle: Domo’s Agent Catalyst helps structure the agent-building process, from defining use cases and data schema to testing prompts and setting up guardrails. For example, Domo’s AI Service Layer ensures queries run on live, governed data for accurate answers. And you test and refine agents in a staging environment with audit trails before going live.
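As a hedged sketch of that kind of instrumentation, the snippet below posts eval metrics to a hypothetical HTTP endpoint as they're produced. The URL, payload shape, and metric names are assumptions, not a real Domo API; in practice you'd use whichever connector or API your platform exposes.

```python
import json
from datetime import datetime, timezone
from urllib import request, error

# Hypothetical endpoint -- replace with your platform's connector or API.
METRICS_ENDPOINT = "https://example.com/eval-metrics"

def push_metric(name: str, value: float) -> None:
    """Send one eval metric as a JSON payload so a live dashboard can chart it."""
    payload = json.dumps({
        "metric": name,
        "value": value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }).encode("utf-8")
    req = request.Request(
        METRICS_ENDPOINT, data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        request.urlopen(req, timeout=10)
    except error.URLError as exc:  # the endpoint above is only a placeholder
        print(f"Could not deliver metric '{name}': {exc}")

# Emit metrics as eval jobs finish; alerts fire when thresholds are crossed.
push_metric("faithfulness", 0.91)
push_metric("avg_latency_s", 1.8)
```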
Ready to build AI you can trust? Explore Domo’s resources on AI readiness and see how Domo’s Agent Catalyst helps you build, govern, and deploy agents with confidence.
Author

Haziqa Sajid, a data scientist and technical writer, loves to apply her technical skills and share her knowledge and experience through content. She holds an MS in data science and has over five years of experience as a developer advocate for AI and data companies.


