We Built an AI Agent That Audits Itself — Here's What Broke

by Alien Brain Trust AI Learning

In January, we set out to solve a problem that shouldn’t exist: our AI agents were producing work that looked correct but failed silently in production.

We’d already built a $7 QA pipeline. We’d already written evaluators. We’d already learned that test scores don’t correlate with real-world performance. So we decided to go deeper: what if the agent could audit its own work, in real-time, and flag problems before we shipped them?

Six months later, after three complete rewrites and one particularly expensive failure, we have a system that works. Here’s what we learned.

The Problem We Thought We Were Solving

Our agents were writing code, generating documentation, and building SQL queries. All of it passed our tests. None of it was correct when customers used it.

The root cause was simple: our test suite only validated the happy path. An agent could generate syntactically correct Python that would fail on edge cases we didn’t anticipate. A documentation generator could miss security implications. A query builder could be inefficient without being wrong.

We needed a second opinion. An agent that didn’t just check outputs against a rubric, but understood the domain and could spot subtle failures.

So we built one.

Iteration 1: The Naive Self-Auditor (Failure Rate: 78%)

Our first attempt was straightforward: take the agent’s output, pass it to an auditor agent, ask the auditor “is this correct?” and flag anything marked as “needs work.”

Primary Agent Output: SELECT * FROM users WHERE created_at > NOW() - INTERVAL 7 DAY
Auditor Prompt: "Is this SQL query correct? Yes or no."
Auditor Response: "Yes"

Result: The query was inefficient (missing an index hint), but the auditor said it was fine.

We ran 100 test cases. The auditor caught 22 actual problems. False negative rate: 78%.

What went wrong? The auditor was answering the question we asked, not the question we needed answered. “Is this correct?” is binary. Real quality is multidimensional.

What we learned: Yes/no questions produce yes/no answers. You get what you ask for.
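The failure mode above is easy to reproduce. Here's a minimal sketch of the iteration-1 auditor; `ask_llm` is a hypothetical stand-in for whatever model call you use, and the point is that the prompt forces a binary answer:

```python
# Minimal sketch of the naive auditor (iteration 1). `ask_llm` is a
# hypothetical stand-in for an LLM call; the binary prompt is the bug.

def naive_audit(output: str, ask_llm) -> bool:
    """Return True if the auditor flags the output as needing work."""
    prompt = f"Is this SQL query correct? Yes or no.\n\n{output}"
    answer = ask_llm(prompt).strip().lower()
    # Anything other than a clear "no" passes -- which is exactly how
    # inefficiency, missing indexes, and edge cases slip through.
    return answer.startswith("no")

# Stubbed model that (like ours) answers "Yes":
flagged = naive_audit(
    "SELECT * FROM users WHERE created_at > NOW() - INTERVAL 7 DAY",
    ask_llm=lambda _prompt: "Yes",
)
```

The inefficient query sails through because "correct" and "good" were never distinguished in the prompt.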

Iteration 2: The Rubric-Based Auditor (Failure Rate: 45%)

We tried giving the auditor a detailed rubric. For code, we scored efficiency, readability, security, and correctness separately. For queries, we added performance impact, index usage, and query plan complexity.

Auditor Prompt:
"Rate this output on the following dimensions:
1. Correctness (0-10)
2. Efficiency (0-10)
3. Security (0-10)
4. Readability (0-10)

Anything below 7 should be flagged."
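The flagging rule in that prompt reduces to a simple threshold check. A sketch, assuming the scores have already been parsed out of the auditor's response:

```python
# Sketch of the rubric-based flagging logic (iteration 2). Assumes the
# per-dimension scores were already parsed from the auditor's response.

FLAG_THRESHOLD = 7  # matches "anything below 7 should be flagged"

def flag_dimensions(scores: dict[str, int]) -> list[str]:
    """Return the dimensions scoring below the threshold."""
    return [dim for dim, score in scores.items() if score < FLAG_THRESHOLD]

flags = flag_dimensions(
    {"correctness": 9, "efficiency": 5, "security": 8, "readability": 9}
)
# only "efficiency" falls below 7, so only it gets flagged
```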

Better. The auditor caught 55 of 100 problems.

But the remaining 45 failures revealed a pattern: the auditor was scoring based on what it could see, not on what actually mattered. A query would score 9/10 on readability but fail in production because the auditor didn’t know about the actual data distribution. Code would score 10/10 on correctness but time out because the auditor never ran it.

What we learned: Scoring systems can be gamed. An agent optimizes for the metric, not the outcome.

Iteration 3: The Execution-Based Auditor (Failure Rate: 12%)

We stopped asking the auditor to judge and started asking it to verify.

For code: we executed it against a test suite we built specifically for the auditor. For queries: we ran the query plan analyzer and checked for warnings. For documentation: we extracted claims and verified them against our knowledge base.

Auditor Workflow:
1. Execute the output (code) or analyze it (SQL/docs)
2. Collect concrete results (test pass/fail, execution time, claim verification)
3. Compare results to thresholds (tests must pass, execution < 2s, 90%+ claim accuracy)
4. Report findings, not opinions
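For the code path, steps 1–3 of that workflow can be sketched as a subprocess run against a time budget. The file layout is hypothetical and the 2-second threshold matches step 3; the key difference from iterations 1 and 2 is that every field in the report is a measurement, not a judgment:

```python
# Sketch of the execution-based audit for code outputs (iteration 3).
# Runs the agent's code in a subprocess and reports concrete findings.

import subprocess
import sys
import time

def audit_by_execution(code_path: str, timeout_s: float = 2.0) -> dict:
    """Execute the output, collect results, compare to thresholds."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            [sys.executable, code_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        elapsed = time.monotonic() - start
        return {
            "tests_passed": proc.returncode == 0,   # findings, not opinions
            "elapsed_s": round(elapsed, 3),
            "within_budget": elapsed < timeout_s,   # step 3: execution < 2s
            "stderr": proc.stderr[-500:],
        }
    except subprocess.TimeoutExpired:
        return {"tests_passed": False, "within_budget": False,
                "stderr": f"timed out after {timeout_s}s"}
```

The SQL and documentation paths follow the same shape: replace the subprocess with a query-plan analyzer or a claim extractor, and keep the report structure identical.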

This worked much better. Failure rate dropped to 12%.

But 12% is still significant. When we analyzed the failures, we found the issue: the auditor was solving the wrong problem at the wrong time.

In 8 out of 12 failures, the problem wasn’t with the agent’s output — it was with the test suite we’d given the auditor. The test suite didn’t cover the edge case that failed in production. The auditor was correctly validating against a broken spec.

In the other 4 failures, the auditor was catching real problems but too late. The agent had already committed to a direction that couldn’t be salvaged without a rewrite.

What we learned: Auditing after the fact catches some problems. Auditing during execution catches more.

What Actually Worked: Streaming Audits

We stopped treating auditing as a final gate and started treating it as a continuous process.

Instead of “run the agent, then audit the output,” we built:

  1. Early validation — before the agent commits to a direction, validate the approach (do we have the right data? do we understand the constraint?)
  2. Execution monitoring — as the agent works, sample intermediate outputs and check them
  3. Final verification — after completion, run the full test suite

The key insight: agents work step-by-step. If you can catch a problem at step 2, you avoid 50 more steps of work in the wrong direction.
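The three-stage pipeline can be sketched as a short control loop. The stage names and callable signatures here are illustrative, not our production API; what matters is that the loop halts at the first finding instead of letting the agent run to completion:

```python
# Sketch of the streaming audit: early validation, step sampling, and
# final verification. Each stage is a callable returning a list of
# findings; names are illustrative.

def streaming_audit(validate_approach, sample_step, final_verify, steps):
    """Stop the run at the first stage that reports a finding."""
    findings = validate_approach()
    if findings:
        return {"stage": "early_validation", "findings": findings}
    for i, step_output in enumerate(steps):
        findings = sample_step(step_output)
        if findings:
            # caught at step i -- the remaining steps never run
            return {"stage": f"step_{i}", "findings": findings}
    findings = final_verify()
    return {"stage": "final" if findings else "passed", "findings": findings}

# A bad intermediate output at step 1 halts the run there:
result = streaming_audit(
    validate_approach=lambda: [],
    sample_step=lambda out: ["bad output"] if out == "wrong" else [],
    final_verify=lambda: [],
    steps=["ok", "wrong", "never-reached"],
)
```

This is where the cost savings come from: the third step is never executed, so neither are the 50 that would have followed it.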

Here’s the real-world impact:

Metric                               Before Streaming   After Streaming
Problems caught before production    55%                94%
Agent rewrites required              67%                12%
Cost per output                      $0.47              $0.31
Time to completion (avg)             45s                38s

The efficiency gain came from agents spending less time going down dead ends.

The Numbers That Actually Matter

Over 6 months of testing:

  • 1,247 outputs audited across code generation, query building, and documentation
  • 78 critical problems caught before shipping (6.2% of total)
  • 3 complete system rewrites (iterations 1–3)
  • 2 database queries that would have caused incidents (wrong indexes, missing WHERE clause)
  • 1 documentation claim that contradicted security policy (would have created liability)
  • $0.00 in production incidents attributed to auditor failure since March

The failed audits (12% failure rate) were mostly in novel domains where we hadn’t built test suites yet. As we built more tests, the failure rate dropped. This is the important part: the auditor’s effectiveness is directly tied to test coverage. A better auditor won’t help if you’re testing the wrong things.

What We’d Do Differently

If we were starting over:

  1. Start with execution, not judgment. An auditor that runs your code beats an auditor that scores it.
  2. Build the test suite first. The auditor is only as good as what you ask it to verify.
  3. Audit continuously, not at the end. Catch problems early, when they’re cheap to fix.
  4. Track audit accuracy separately. Don’t trust that a high audit score means high quality. Test it. We were wrong 45% of the time with iteration 2, and we didn’t know it until we ran the numbers.

Open Questions

  • Can we predict which domains will have high failure rates and auto-build more comprehensive tests?
  • How much does auditor performance degrade with agent hallucination? (Untested.)
  • What’s the optimal audit frequency? We haven’t found the point of diminishing returns yet.

This is still in production. We’re shipping it to customers next month. When we do, we’ll learn whether the 6.2% of problems we catch are the 6.2% that actually matter.

We’ll report back.

Tags: #ai-agents #quality-assurance #observability #building-in-public #failure-modes
