In “Unwinding Cultural Debt,” I introduced the idea that every agentic workflow requires an eval (evaluation framework), a rubric that grades the quality of an agent’s output. Evals are the single most important (and most misunderstood) component of any AI deployment, and they deserve a deeper look.
An eval is a structured test that measures whether an AI agent’s output meets your standards. Every time an agent drafts an email, generates a report, or makes a recommendation, something needs to decide whether the output is good enough to use. That something is your eval. Without one, you are trusting the model’s default behavior, which means you are trusting a foundation model’s opinion of what “good” looks like for your business. That should make you uncomfortable.
The Objective/Subjective Split
The most useful framework I’ve found divides evaluation criteria into two categories: objective and subjective.
Objective criteria have deterministic answers. “Is the response under 200 words?” “Does the email include the client’s name?” “Is the topic sentence in active voice?” These can be graded automatically by another model or by simple code. They are cheap, fast, and infinitely scalable. Build as many of these as you can.
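To make "graded automatically by simple code" concrete, here is a minimal sketch of the two example checks above as plain Python functions. The word-limit and client-name criteria come straight from this section; the function names, the sample draft, and the report format are illustrative, not a prescribed framework.

```python
def under_word_limit(text: str, limit: int = 200) -> bool:
    """Objective check: is the response under the word limit?"""
    return len(text.split()) < limit

def includes_client_name(text: str, client_name: str) -> bool:
    """Objective check: does the draft mention the client by name?"""
    return client_name.lower() in text.lower()

def run_objective_evals(text: str, client_name: str) -> dict:
    """Run every deterministic check and return a pass/fail report."""
    return {
        "under_200_words": under_word_limit(text),
        "includes_client_name": includes_client_name(text, client_name),
    }

draft = "Hi Dana, attached is the Q3 summary you asked for. Best, Sam."
print(run_objective_evals(draft, "Dana"))
# {'under_200_words': True, 'includes_client_name': True}
```

Because checks like these are just functions, you can run them on every single output at effectively zero cost, which is exactly why you want as many of them as possible.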
Subjective criteria require judgment. “Does this sound like our brand?” “Would a senior partner send this email?” “Is this analysis missing important context?” These require a human evaluator (most often a subject matter expert) who knows what good looks like. You cannot automate taste (at least, not yet). However, you can (and should) use AI to help surface the decisions that need human judgment.
Generally speaking, pure automation produces technically correct output that feels wrong. Pure human review creates a bottleneck that defeats the purpose of using AI. The art is in the split: automate what you can measure, and route the rest to the right human.
The HITL Problem
Human-in-the-loop (HITL) is a buzzwordy way to describe a simple idea. Someone reviews the output, approves or corrects it, and the system learns. In practice, HITL is where most agentic deployments die because the people best qualified to evaluate AI output are your most experienced (and most expensive) employees. They are already overloaded. Asking them to review agent output feels like adding work, and it is.
Worse, the Upton Sinclair quote I mentioned last week applies directly: “It is difficult to get a man to understand something when his salary depends on his not understanding it.” Still, I’ve seen three patterns that actually work.

First, make evals part of every job description, with dedicated time and explicit compensation. “Review and improve AI outputs” needs to appear on someone’s calendar, not exist as an afterthought squeezed between meetings.
Second, separate the evaluator role from the role being automated. The person reviewing AI-generated client emails should ideally be a senior leader who writes well, not the junior associate whose email-writing workload is being reduced.
Third, create feedback loops that are so frictionless people actually use them. A thumbs-up/thumbs-down button with a comments dialog box integrated into the workflow captures far more signal than a weekly review.
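A frictionless feedback loop can be as simple as one function call behind that thumbs button. The sketch below shows the idea; the log filename, field names, and JSONL format are my assumptions, not a specific product's API.

```python
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("eval_feedback.jsonl")  # hypothetical log location

def record_feedback(output_id: str, thumbs_up: bool, comment: str = "") -> dict:
    """One-call feedback capture: a thumbs signal plus an optional comment."""
    entry = {
        "output_id": output_id,
        "thumbs_up": thumbs_up,
        "comment": comment,
        "ts": time.time(),
    }
    # Append-only JSONL keeps capture cheap and the history replayable.
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# A reviewer clicks thumbs-down and types one sentence:
record_feedback("email-0042", thumbs_up=False, comment="Too formal for this client.")
```

The design point is the interface, not the storage: if leaving feedback takes one click and one optional sentence inside the tool people already use, you will collect signal every day instead of once a week.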
Building Your Eval Stack
The hardest part of building evals is deciding what to measure. The technology is straightforward. The taxonomy is where almost everyone gets stuck. Start with three layers:
Format Compliance: did the agent follow the structural rules? Correct length, required fields, proper formatting. This is fully automatable and should catch 60-70% of bad outputs before a human ever sees them.
Factual Accuracy: are the claims true, the numbers right, the references valid? This can be partially automated (fact-checking APIs, database lookups, cross-referencing source documents) but often needs a human to verify nuance.
Quality Judgment: is this actually good? Does it reflect the organization’s standards, voice, and values? This is exclusively human territory, and it is where most organizations underinvest. A brilliant eval at the format and accuracy layers will still produce mediocre output if no one is grading for quality.
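The three layers above compose naturally into a routing pipeline: automated checks reject outright, partially automated checks queue for verification, and everything that survives is sampled by a human. This is a sketch of that flow; the two check functions are hypothetical stand-ins for whatever your workflow actually measures.

```python
def format_checks_pass(output: dict) -> bool:
    """Layer 1 stand-in: structural rules (non-empty body, length cap)."""
    body = output.get("body", "")
    return body != "" and len(body.split()) < 200

def facts_verified(output: dict) -> bool:
    """Layer 2 stand-in: pretend a lookup or cross-reference already ran."""
    return output.get("claims_checked", False)

def eval_pipeline(output: dict) -> str:
    """Route an agent output through the three layers in order."""
    # Layer 1 — format compliance: fully automated, cheap, rejects outright.
    if not format_checks_pass(output):
        return "rejected: format"
    # Layer 2 — factual accuracy: partially automated; anything unverified
    # is queued for a human rather than silently approved.
    if not facts_verified(output):
        return "queued: fact check"
    # Layer 3 — quality judgment: always ends with a human reviewer.
    return "queued: human quality review"

print(eval_pipeline({"body": "Hi Dana, the Q3 summary is attached.", "claims_checked": True}))
# queued: human quality review
```

Note the asymmetry: the cheap layers run on everything, while the expensive human layer only sees what the cheap layers could not settle. That is how you keep your senior reviewers' hour a week focused on judgment calls.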
The Continuous Improvement Flywheel
Every human correction improves performance. Every thumbs-down tells you something specific about where the model’s defaults diverge from your standards. Aggregate enough of this data and patterns emerge: the model consistently overexplains, or uses passive voice, or misses industry jargon your clients expect.
The flywheel is the magic! Corrections feed back into your prompts, your few-shot examples, your system instructions. The agent gets measurably better each week. I’ve seen diligently run eval flywheels yield measurable performance gains of 20-30% in the first 30 days. Without a continuous-improvement eval process, you’ll see quality plateau (or degrade) just as quickly.
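Surfacing those divergence patterns is a counting exercise once feedback carries a reason tag. In this sketch the reason codes are the examples from the paragraph above, applied by a reviewer (or a cheap classifier pass over free-text comments); the data is invented for illustration.

```python
from collections import Counter

# Hypothetical week of thumbs-down feedback, each entry tagged with a reason.
feedback = [
    {"output_id": "email-0042", "reason": "overexplains"},
    {"output_id": "email-0044", "reason": "passive_voice"},
    {"output_id": "email-0047", "reason": "overexplains"},
    {"output_id": "email-0051", "reason": "missed_jargon"},
    {"output_id": "email-0053", "reason": "overexplains"},
]

def divergence_patterns(feedback: list[dict]) -> list[tuple[str, int]]:
    """Rank where the model's defaults diverge from your standards."""
    return Counter(item["reason"] for item in feedback).most_common()

# The top pattern becomes next week's prompt or few-shot fix.
print(divergence_patterns(feedback))
# [('overexplains', 3), ('passive_voice', 1), ('missed_jargon', 1)]
```

The output is the flywheel's to-do list: the most frequent reason code is the next edit to your prompts or few-shot examples, and next week's counts tell you whether the edit worked.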
Your Evals ToDo List
Identify one high-volume agentic workflow in your organization. Build three objective eval criteria and two subjective ones. Assign a senior person to review a sample of outputs for one hour per week. Track the results. Adjust the criteria based on what you learn. If you’ve never created an eval, ask your favorite AI model to help you create one.
That is the entire playbook. Evals are a practice, like code review or quality assurance. They compound. Start now.
A bit of a shameless plug: if you need help organizing your agentic workflows or eval capabilities, please reach out.
Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it. This work was created with the assistance of various generative AI models.