Who Needs a “Human in the Loop” When AI Gives Itself Feedback
Directionally Correct Newsletter, The #1 People Analytics Substack
By: Ketaki Sodhi & Cole Napper
Subscribe to Directionally Correct newsletter to find more articles and access more insights on people analytics.
Introduction
While organizations rush to deploy generative AI in their analytics workflows, they're overlooking something critical. Teams obsess over prompt engineering and model selection, yet almost none give the same energy to the hidden lever behind every exceptional AI system: evaluations. Just as human performance management systems define what good work looks like and measure progress against those standards, evals serve as the performance management framework for AI - determining whether your GenAI-powered systems succeed or fail in production. Those who succeed are the ones who've cracked the code on systematic evaluation. They understand that in a world where generative AI produces endless "right answers," the competitive advantage belongs to those who can reliably identify which answers drive business impact.
The Familiar Challenge in a New Context
Industrial-organizational psychology has long grappled with defining and measuring good performance in ambiguous knowledge-work environments; there's even a name for it: the criterion problem. How do you evaluate the quality of a strategic recommendation? What makes one analysis more valuable than another? The same questions that have challenged HR professionals for decades are now at the heart of AI system development. Just as organizations invest heavily in competency frameworks to define excellent human performance, they must now develop evaluation frameworks to define excellent AI performance. The difference is stakes and scale. While a poorly performing employee might impact their immediate team, a poorly evaluated AI system can make thousands of flawed decisions before anyone notices the pattern.
The Analytics Performance Challenge
Analytics work has always walked the tightrope of technical rigor and business judgment. This challenge intensifies with generative AI, which can produce analytically sound but strategically misguided insights or brilliant recommendations based on subtle data misinterpretations.
The analytics lifecycle can help us better see why traditional software testing approaches applied to LLMs fall short. Consider these three critical stages:
Analyze: Planning and executing analyses to find signal in data
Build: Crafting compelling narratives to simplify and share insights
Connect: Linking insights to broader context and actionable recommendations
Each stage requires different evaluation criteria, much like how performance management systems use different competencies for individual contributors versus senior leaders. What constitutes "good" analysis differs fundamentally from what makes a "good" executive summary or strategic recommendation.
From Competency Models to Evaluation Frameworks
The best evaluation frameworks for AI-driven analytics systems mirror proven competency modeling approaches. Just as HR professionals conduct structured interviews with subject matter experts to define performance standards, building effective AI evals requires capturing expert judgment about what separates excellent analytics from mediocre work — it all comes back to judgment.
This is where human-in-the-loop evaluation becomes critical, but which human matters enormously. Different stages of analytics require different domain expertise:
Statistical rigor: Data scientists and analysts evaluate technical correctness
Narrative effectiveness: Experienced practitioners assess clarity and persuasiveness
Strategic alignment: Business leaders judge actionability and organizational fit
The goal isn't to replace human judgment but to systematically capture and scale it. Think of evals as performance management systems for AI - they encode organizational standards for what good work looks like and provide consistent feedback for improvement. However, just as performance management can go wrong and end up punishing the right behaviors while rewarding the wrong ones, evaluation systems for AI can encode the wrong standards. That is what makes them so tricky to get right.
The Three Evaluation Approaches
Just as performance management combines multiple assessment methods — sometimes called a multi-trait, multi-method analysis in the social sciences — effective AI evaluation requires a multi-faceted approach:
Human Evaluations: Direct assessment by domain experts. This is what all of us do intuitively when we ask Claude or ChatGPT for something and go back and forth until we get an acceptable outcome. These provide the richest insights but are expensive and don't scale. Best used for high-stakes decisions or for calibrating other evaluation methods.
Code-Based Evaluations: Automated checks for technical accuracy, like ensuring calculations are mathematically sound or data sources are correctly referenced. These parallel objective performance metrics like sales numbers - clear and measurable, but limited in scope, especially in the context of GenAI where you can have multiple "correct" answers to any question (a minimal sketch of such a check follows this list).
LLM-as-Judge Evaluations: Using AI systems to evaluate other AI systems based on carefully crafted criteria. This approach scales human expertise by encoding expert judgment into evaluation prompts. When done well, it combines the consistency of automated testing with the nuance of human assessment.
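To make the code-based flavor concrete, here is a minimal sketch that recomputes a reported metric from the underlying rows and verifies the cited sources. The output format, source list, and tolerance are assumptions for illustration, not a prescribed standard:

```python
# A minimal, illustrative code-based evaluation: recompute a reported metric
# and verify cited data sources. The output structure and source list are
# assumptions for the sake of the example.
APPROVED_SOURCES = {"hris_core", "engagement_survey_2024"}  # hypothetical names

def eval_attrition_claim(analysis_output: dict, raw_rows: list[dict]) -> dict:
    """Check that the reported attrition rate matches the raw data and
    that every cited source is on the approved list."""
    terminations = sum(1 for r in raw_rows if r["terminated"])
    expected_rate = terminations / len(raw_rows)
    reported_rate = analysis_output["attrition_rate"]

    return {
        "math_is_sound": abs(reported_rate - expected_rate) < 0.001,
        "sources_verified": set(analysis_output["sources"]) <= APPROVED_SOURCES,
    }

# Toy usage
rows = [{"terminated": True}, {"terminated": False}, {"terminated": False}, {"terminated": False}]
output = {"attrition_rate": 0.25, "sources": ["hris_core"]}
print(eval_attrition_claim(output, rows))  # {'math_is_sound': True, 'sources_verified': True}
```

Checks like these are cheap to run on every output, which is exactly why they pair well with the more expensive human and LLM-as-judge approaches.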
Building Your AI Performance Management System
The most sophisticated organizations are developing "LLM-as-judge" systems that mirror competency-based performance reviews. Each evaluation contains four essential components:
Role Definition: Just as performance reviews specify the evaluator's perspective (peer, manager, direct report), AI evaluations must clearly define the expert role the judge should adopt.
Context Provision: Supplying all relevant information, similar to how performance reviews consider project context and organizational priorities.
Success Criteria: Explicit standards for what constitutes good performance, analogous to competency definitions in human performance systems. This can be in the form of clear descriptions and actual examples of what good looks like for a given task.
Output Standards: Consistent scoring mechanisms that enable comparison and tracking over time. LLMs are very good at producing structured outputs that fit a predetermined schema, and having these helps you scale - kind of like going from a one-off chat conversation to a spreadsheet with a column for each success criterion and an LLM judgment for each. A minimal sketch of these four components follows.
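Here is one way those four components might be stitched together into a judge prompt with a structured verdict. The role wording, criteria, scoring scale, and the call_llm helper are illustrative assumptions, not a prescribed implementation:

```python
import json

# Illustrative LLM-as-judge prompt built from the four components above:
# role definition, context provision, success criteria, and output standards.
# The criteria, scale, and call_llm() helper are hypothetical stand-ins for
# whatever model client your team uses.
JUDGE_PROMPT = """\
Role: You are a senior people analytics leader reviewing an analyst's work.

Context:
{context}

Output to evaluate:
{output}

Success criteria (score each 1-5 and explain briefly):
1. data_accuracy: figures match the cited sources
2. narrative_clarity: a non-technical executive could follow the argument
3. actionability: recommendations are specific enough to act on

Respond as JSON: {{"data_accuracy": int, "narrative_clarity": int,
"actionability": int, "rationale": str}}
"""

def judge(context: str, output: str, call_llm) -> dict:
    """Run one LLM judge and parse its structured verdict."""
    raw = call_llm(JUDGE_PROMPT.format(context=context, output=output))
    return json.loads(raw)
```

Each verdict then becomes one row in that spreadsheet-style view, keyed by the same success criteria, which is what makes comparison across outputs and model versions practical.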
Here's how this could work across the analytics lifecycle:
Stage 1: Analyze (Narrow Set of Right Answers)
At the analysis stage, evaluation criteria focus on technical accuracy and methodological soundness. Like evaluating a financial analyst's calculations, there are relatively clear standards for correctness. Key evaluation dimensions can include:
Data accuracy and source verification
Appropriate statistical methods for the research question
Logical soundness of analytical approach
Completeness of relevant variables considered
Stage 2: Build (Expanding Set of Right Answers)
Building analytical narratives requires more subjective judgment, similar to evaluating a consultant's presentation skills. Multiple approaches can be equally valid, but quality still varies significantly. Evaluation focuses on:
Clarity and logical flow of the narrative
Appropriate use of visualizations and examples
Accessibility for the intended audience
Compelling evidence for key claims
Stage 3: Connect (Large Variety of Right Answers)
Strategic recommendations exist in the realm of endless right answers, much like evaluating leadership decisions. Context, timing, and organizational culture all influence what constitutes good advice. And as the number of right answers grows, setting evaluation criteria well becomes harder. Evaluation criteria could include the following (a rubric sketch covering all three stages appears after the list):
Feasibility given organizational constraints
Alignment with strategic priorities
Consideration of implementation challenges
Quality of risk assessment and mitigation strategies
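One way to keep these stage-specific criteria organized is a simple rubric keyed by lifecycle stage that judge prompts can draw from. The structure below is a sketch reusing the criteria listed above, not a fixed standard:

```python
# Illustrative rubric mapping each lifecycle stage to its evaluation criteria.
# The wording mirrors the lists above; the data structure itself is an assumption.
LIFECYCLE_RUBRIC = {
    "analyze": [  # narrow set of right answers
        "Data accuracy and source verification",
        "Appropriate statistical methods for the research question",
        "Logical soundness of the analytical approach",
        "Completeness of relevant variables considered",
    ],
    "build": [  # expanding set of right answers
        "Clarity and logical flow of the narrative",
        "Appropriate use of visualizations and examples",
        "Accessibility for the intended audience",
        "Compelling evidence for key claims",
    ],
    "connect": [  # large variety of right answers
        "Feasibility given organizational constraints",
        "Alignment with strategic priorities",
        "Consideration of implementation challenges",
        "Quality of risk assessment and mitigation strategies",
    ],
}

def criteria_block(stage: str) -> str:
    """Render the criteria for one stage as a numbered block for a judge prompt."""
    return "\n".join(
        f"{i}. {criterion}"
        for i, criterion in enumerate(LIFECYCLE_RUBRIC[stage], start=1)
    )

print(criteria_block("connect"))
```

Keeping the rubric in one place also makes it easier to revisit the criteria as the organization learns what good looks like in its own context.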
Expertise Matters in an LLM World
As evaluation systems become more sophisticated, the question of human expertise keeps resurfacing. As we build more GenAI-powered systems, organizations are discovering that evaluating whether the AI is doing what we need requires someone who can actually assess the AI's output. As Amit Mohindra presciently asks, "Who is the Human in the Loop?" Building effective AI evaluation systems requires the same strategic thinking that goes into designing executive assessment centers or leadership development programs.
The most successful teams are those that recognize evaluation as a core competency, not an afterthought. They're building evaluation expertise just as deliberately as they once built analytics capabilities. This means:
Investing in evaluation training for AI product teams
Creating libraries of proven evaluation frameworks
Establishing feedback loops between evaluation results and system improvements
Building institutional knowledge about what good AI performance looks like in their specific context
Calibration / Learning from Human Performance Management
Organizations have spent decades trying to perfect the art of evaluating human performance in subjective, knowledge-intensive roles. The most sophisticated performance management systems don't rely on a single manager's judgment—they use calibration sessions, multiple raters, and structured frameworks to ensure consistency and fairness. Also, what works for one organization might be completely different from what works in another. These same principles are proving essential for AI evaluation systems, especially as we move beyond the unrealistic expectation that AI systems must be perfect.
One of the most counterproductive assumptions in AI development is that systems must achieve perfect accuracy or have zero room for error. This expectation becomes particularly problematic with generative AI, where success often depends on judgment and taste rather than binary correctness.
Consider how we evaluate human analysts: We don't expect every strategic recommendation to be identical or every market analysis to reach the same conclusions. Instead, we establish acceptable ranges of variance and focus on whether the reasoning process is sound and the outputs are valuable. We've learned that some degree of disagreement between expert evaluators often signals healthy diversity of thought rather than system failure.
The same principle should apply to AI systems. In a world of endless right answers, the goal isn't perfect consistency—it's reliable quality within acceptable bounds of variation.
The Multi-Judge Architecture
Just as 360-degree feedback incorporates perspectives from peers, subordinates, and supervisors, effective AI evaluation benefits from multiple evaluation "judges." For an AI-generated market analysis, you might employ:
A data scientist to verify statistical methodology and data interpretation
A domain expert to assess industry-specific insights and context
A business stakeholder to evaluate strategic relevance and actionability
But here's where AI evaluation can surpass human performance management: We can also deploy multiple AI judges to scale expert perspectives. Just as we might have three managers independently rate an employee's performance before calibrating their assessments, we can have multiple LLM judges evaluate the same AI output using different but complementary criteria.
When these evaluators disagree—which they will—that disagreement often reveals the most important insights about system performance. The key is distinguishing between productive variance (reflecting different valid perspectives) and problematic inconsistency (suggesting unclear standards or system issues).
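A minimal sketch of that multi-judge pattern, reusing the hypothetical judge() helper from the earlier sketch, might look like the following. The personas and the disagreement threshold are assumptions to illustrate the idea:

```python
from statistics import mean, pstdev

# Hypothetical judge personas with complementary evaluation lenses.
JUDGE_PERSONAS = {
    "data_scientist": "Focus on statistical methodology and data interpretation.",
    "domain_expert": "Focus on industry-specific insights and context.",
    "business_stakeholder": "Focus on strategic relevance and actionability.",
}
DISAGREEMENT_THRESHOLD = 1.0  # assumed: a score spread above this flags human review

def multi_judge(output: str, context: str, call_llm) -> dict:
    """Score one output with several LLM judges and flag meaningful disagreement."""
    scores = {}
    for name, lens in JUDGE_PERSONAS.items():
        verdict = judge(context=f"{lens}\n{context}", output=output, call_llm=call_llm)
        # Average only the numeric criterion scores, ignoring the rationale text.
        scores[name] = mean(v for v in verdict.values() if isinstance(v, (int, float)))

    spread = pstdev(scores.values())
    return {
        "scores": scores,
        "consensus": mean(scores.values()),
        "needs_human_review": spread > DISAGREEMENT_THRESHOLD,
    }
```

Flagged disagreements are exactly the cases worth routing back to human calibration sessions rather than averaging away.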
Establishing Inter-Rater Reliability Standards
Human resources professionals have long relied on inter-rater reliability metrics to ensure performance evaluations are fair and consistent. These metrics measure the degree of agreement between different evaluators, with established thresholds for acceptable variance. For instance, a correlation of 0.7-0.8 between raters is often considered acceptable for performance reviews, acknowledging that some subjectivity is both inevitable and valuable.
AI evaluation systems need similar standards, but with three critical reliability measures:
Human-Human Inter-Rater Reliability: Establishing baseline agreement levels among human domain experts. This creates the foundation for understanding what constitutes reasonable variance in expert judgment.
Human-AI Inter-Rater Reliability: Measuring how closely AI judges align with human expert assessments. This validates whether LLM judges are capturing the nuanced criteria that human experts value.
AI-AI Inter-Rater Reliability: Assessing consistency across multiple AI judges evaluating the same outputs. This helps identify when evaluation criteria are clear enough for reliable assessment versus when they need refinement.
And in a system built on LLM-as-judge evaluations, this third type, AI-AI inter-rater reliability, quickly becomes the most important one.
Consider this implementation framework (a sketch of the reliability checks follows the list):
Baseline Human Calibration: Have 3-5 domain experts evaluate 100 sample AI outputs, measuring inter-rater reliability and discussing discrepancies
Human-AI Alignment Testing: Deploy LLM judges on the same samples, measuring correlation with human assessments
Multi-Judge AI Deployment: Use 2-3 AI judges per evaluation, tracking both individual performance and consensus patterns
Acceptable Variance Thresholds: Establish standards (e.g., 80% agreement on binary decisions, correlation >0.75 on scaled ratings)
Continuous Recalibration: Regular spot-checks to ensure all reliability measures remain within acceptable bounds
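Here is a small sketch of the reliability checks themselves, assuming ratings have already been collected from two raters (human or AI) on the same sample. The 80% and 0.75 bars mirror the thresholds suggested above; the kappa threshold is an added assumption:

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def scaled_reliability(rater_a: list[float], rater_b: list[float]) -> bool:
    """Check correlation between two raters' scaled ratings against the >0.75 bar."""
    r, _ = pearsonr(rater_a, rater_b)
    return r > 0.75

def binary_reliability(rater_a: list[int], rater_b: list[int]) -> bool:
    """Check agreement on binary pass/fail decisions against the 80% bar,
    plus chance-corrected agreement via Cohen's kappa."""
    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    kappa = cohen_kappa_score(rater_a, rater_b)
    return agreement >= 0.80 and kappa > 0.6  # kappa cutoff is an assumption

# The same two checks apply to all three pairings: human-human, human-AI, AI-AI.
```

Running these checks on every recalibration cycle turns "are our judges still aligned?" from a gut feeling into a tracked metric.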
Embracing Evaluator Variance as Signal
In human performance management, we've learned that some degree of rater variance is not just acceptable—it's valuable. Different perspectives capture different aspects of performance that matter to organizational success. A project manager and a technical architect will rightfully emphasize different qualities in the same work product.
This principle becomes even more powerful when applied to AI evaluation. When building an evaluation system for AI-generated strategic recommendations, you might discover that:
Finance-focused judges consistently prioritize cost considerations
Operations judges emphasize implementation feasibility
Innovation teams value creative approaches that others find risky
Rather than forcing artificial consensus, mature evaluation systems calibrate these differences and use them strategically. The variance itself becomes a quality signal—recommendations that satisfy all perspectives might be more robust, while those that strongly appeal to specific viewpoints might be perfect for targeted scenarios.
The Compound Effect of Calibrated Multi-Judge Systems
Organizations that master this multi-judge approach with proper inter-rater reliability standards gain a compounding advantage. They develop institutional intelligence about what good AI performance looks like across different contexts and stakeholder perspectives. And this institutional intelligence can be encoded into the “memory” of AI systems to ensure continued advantage as time progresses.
More importantly, they build confidence in AI systems by establishing realistic performance expectations. Instead of pursuing the impossible goal of perfect AI outputs, they create systems that reliably deliver valuable outputs within well-understood quality bounds, where individual judges' errors tend to cancel each other out.
The performance management parallel is instructive: Companies with mature, multi-rater performance systems don't just evaluate better, they develop talent more effectively and make decisions with greater confidence. Similarly, companies with mature, multi-judge AI evaluation systems can scale AI adoption more rapidly because they've solved the fundamental challenge of defining and measuring success in an endless-right-answers world.
The calibration imperative isn't just about measurement; it's about building the organizational intelligence to manage AI systems with the same sophistication we've learned to apply to human performance.
From Building to Babysitting AI Systems
The evolution from building AI systems to managing them mirrors the shift from hiring individual contributors to developing organizational capability. Early AI implementations focused on getting systems to work; mature AI operations focus on ensuring systems consistently deliver value.
This shift demands new organizational capabilities. Just as companies invested heavily in performance management systems as they scaled, they must now invest in AI evaluation infrastructure. The organizations that recognize this early will build sustainable competitive advantages in the AI-powered analytics era.
The evaluation advantage isn't just about better AI systems, it's about building the management discipline to harness AI's potential consistently and at scale. In a world where every analysis could be powered by AI, the differentiator won't be access to the technology. It will be the organizational capability to define, measure, and optimize what good AI-powered analytics actually looks like.
As we enter this new era of AI-augmented analytics, the lessons from decades of human performance management become our guide. The organizations that master this translation will be the ones that truly unlock AI's transformative potential for data-driven decision making.
I hope you like this article. If so, I have a few more articles coming out soon. Stay tuned. If you are interested in learning more directly from me, please connect with me on LinkedIn.
Cole’s recent articles
Beyond Prediction: Exploiting Organizational Events for Causal Inference in People Analytics
Charisma Won’t Save Us: The Cult of Personality in People Analytics
Convenience or Complicity? When AI Knows You Better Than You Know Yourself
Exposing Anchoring Bias in ChatGPT and Its Inevitable Choice of Jordan as the GOAT
The Evolution of People Analytics: A Decade of Transformation
For access to all of Cole’s previous articles, go here.