Using an Agent‑as‑a‑Judge to Fix AI‑Written Unit Tests

Max Roche

Published on: June 18, 2026

Updated on: July 27, 2026

The Problem

Does Claude constantly write bad unit tests for you?

In our repo, Claude kept producing tests with the same recurring issues:

Using Task.sleep instead of XCTestExpectation, leading to flaky tests
Writing far too many tests (often the same test expressed 10 different ways)
Producing tautological tests that always pass but don't actually test anything
Ignoring good dependency‑injection patterns
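
To make the first and third items concrete, here is a minimal sketch in Swift. `ProfileLoader`, its `load(completion:)` method, and the `"profile-42"` value are hypothetical names invented for illustration; they are not taken from the repo described here. The first test is tautological, and the second shows the expectation-based pattern that a sleep-based test should be replaced with.

```swift
import Foundation
import XCTest

// Hypothetical system under test (an assumption for this example):
// delivers a value asynchronously on a background queue.
final class ProfileLoader {
    func load(completion: @escaping @Sendable (String) -> Void) {
        DispatchQueue.global().asyncAfter(deadline: .now() + 0.05) {
            completion("profile-42")
        }
    }
}

final class ProfileLoaderTests: XCTestCase {
    // Anti-pattern (tautological): this passes even if ProfileLoader is
    // deleted, because it only compares a constant with itself.
    func testTautology() {
        let expected = "profile-42"
        XCTAssertEqual(expected, "profile-42")
    }

    // Preferred: the callback fulfills the expectation, so the test waits
    // only as long as the work takes, up to the timeout. A sleep-based
    // version would instead guess a fixed delay and then assert.
    func testLoadDeliversProfile() {
        let loaded = expectation(description: "profile is delivered")
        ProfileLoader().load { value in
            XCTAssertEqual(value, "profile-42")
            loaded.fulfill()
        }
        wait(for: [loaded], timeout: 1.0)
    }
}
```

The timeout in `wait(for:timeout:)` is an upper bound rather than a guess at how long the work takes, which is why this pattern tends to be less flaky than a fixed sleep.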

Despite trying all kinds of prompts, I couldn't get a Claude agent to reliably write solid unit tests. Reviewing its output started to feel very familiar — and very repetitive.

Every review cycle looked something like this:

"Don't use Task.sleep — always prefer XCTestExpectation."
"You don't need this many tests."
"This test isn't actually asserting anything meaningful."

Claude and I would do this dance for two or three iterations before the tests finally came out looking nice and clean 🧼

The Aha Moment

Then it hit me: I could automate myself out of this loop.

What I really needed wasn't a better prompt — it was a second agent.

I created a unit‑test‑reviewer agent whose only job is to review generated tests and look for:

Flaky async patterns
Redundant or duplicate tests
Tautological assertions
Poor dependency‑injection practices

This agent doesn't rewrite tests. It simply reviews them and returns a PASS or FAIL, along with concrete reasons for any failure.
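
One way to picture the reviewer's output is as a small structured verdict. The sketch below is an assumption, not the actual format the agent uses: the JSON shape, the type names (`ReviewVerdict`, `Finding`, `ReviewIssue`), and the `parseVerdict` helper are all hypothetical. Only the four issue categories and the PASS/FAIL result come from the description above.

```swift
import Foundation

// The four categories the reviewer looks for. The raw string values are
// an assumed wire format.
enum ReviewIssue: String, Codable {
    case flakyAsync = "flaky_async"
    case redundantTest = "redundant_test"
    case tautologicalAssertion = "tautological_assertion"
    case poorDependencyInjection = "poor_dependency_injection"
}

// One concrete reason for a failure, tied to a specific test.
struct Finding: Codable, Equatable {
    let issue: ReviewIssue
    let testName: String
    let reason: String
}

struct ReviewVerdict: Codable, Equatable {
    enum Status: String, Codable { case pass = "PASS", fail = "FAIL" }
    let status: Status
    let findings: [Finding]
}

struct InvalidVerdict: Error {}

// Decodes a verdict and rejects the two shapes that make no sense:
// a FAIL with no reasons, or a PASS that still lists findings.
func parseVerdict(_ json: String) throws -> ReviewVerdict {
    let verdict = try JSONDecoder().decode(ReviewVerdict.self, from: Data(json.utf8))
    let failed = verdict.status == .fail
    guard failed != verdict.findings.isEmpty else { throw InvalidVerdict() }
    return verdict
}
```

Requiring at least one finding on a FAIL matters for the next step: a bare FAIL would give the writing agent nothing to act on.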

Wiring It Together with a Skill

Once I had the two agents, I just needed to connect them. I did this using a skill called create-unit-tests.

Here's how the flow works:

1. The unit‑test‑writing agent generates tests for the current changes
2. The unit‑test‑reviewing agent reviews those tests
3. If the review returns PASS → we're done ✅
4. If the review returns FAIL → the failure reasons are fed back into step 1

This loop continues until the reviewing agent returns a PASS.

Crucially, the feedback is always specific and consistent — the same things I used to comment on manually, every single time.
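
The control flow of that loop can be modeled in a few lines of Swift. This is only a sketch of the logic: the real skill is presumably a prompt-driven workflow rather than compiled code. The `write` and `review` closures are stand-ins for the two agents, and `ReviewResult`, `LoopOutcome`, and `createUnitTests` are hypothetical names. The `maxRounds` cap is an added assumption. The flow described above simply loops until PASS, but a cap guards against a writer and reviewer that never converge.

```swift
// Simplified stand-in for the reviewer's answer.
struct ReviewResult {
    let passed: Bool
    let reasons: [String]
}

enum LoopOutcome: Equatable {
    case passed(tests: String, rounds: Int)
    case gaveUp(lastReasons: [String])
}

// Runs write -> review until the reviewer passes the tests, feeding the
// failure reasons back into the writer on each retry.
func createUnitTests(
    maxRounds: Int = 5, // assumption: not part of the flow described above
    write: (_ feedback: [String]) -> String,
    review: (_ tests: String) -> ReviewResult
) -> LoopOutcome {
    precondition(maxRounds >= 1, "need at least one round")
    var feedback: [String] = []
    for round in 1...maxRounds {
        let tests = write(feedback)
        let result = review(tests)
        if result.passed { return .passed(tests: tests, rounds: round) }
        feedback = result.reasons
    }
    return .gaveUp(lastReasons: feedback)
}

// Usage with stub agents: the writer fixes its output once it gets feedback.
let outcome = createUnitTests(
    write: { feedback in
        feedback.isEmpty ? "try await Task.sleep(...)" : "wait(for: [loaded], timeout: 1)"
    },
    review: { tests in
        tests.contains("Task.sleep")
            ? ReviewResult(passed: false, reasons: ["Don't use Task.sleep; prefer XCTestExpectation."])
            : ReviewResult(passed: true, reasons: [])
    }
)
```

With these stubs the first round fails on `Task.sleep`, the reason is passed back to the writer, and the second round passes.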

The Result

Hooking up two agents with a skill like this has been a huge win for the quality and consistency of the unit tests we write. It also saves me a lot of time, since I no longer need to review the "first pass" of unit tests myself.

Next up: applying this paradigm to other parts of our workflow where human review patterns are predictable and repeatable.
