I was presenting a vision for AI-assisted development to my engineering team. The model is spec-driven — engineers define architecture, standards, and constraints, and AI writes the code. Someone asked: if an agent writes the code and the tests, what are the tests actually validating?

My first instinct was to treat it as a testing problem — make tests part of the specification, use TDD, have a human review them. But the more I sat with it, the more I realized the problem isn’t with the tests. It’s with a single agent playing roles that are supposed to hold each other accountable.

I see two structural problems with a single agent doing everything.

First, reasoning works differently across roles. During planning and review, reasoning is an asset: looking around corners, surfacing edge cases, challenging assumptions. During implementation, it's a liability. An implementation agent that questions the plan is no longer implementing. It's redesigning.

Second, I’ve found that context carries intent with it. No prompt in a session is processed in isolation. The model sees the full conversation history — every prior decision, every line of code, every resolved tradeoff. During implementation, that context focuses attention on one goal: ship the feature. From what I’ve seen, asking the same session to review its own work doesn’t reset that attention. The review executes within the same frame and confirms the biases the session has already built.

The fix, as I see it, is multiple agents, each scoped to a single role. A planning agent that reasons expansively. An implementation agent that follows the spec. A review agent that sees the output against the original requirements and constraints — but not the session that produced it. The same model is fine. The independence comes from isolated context, not different intelligence.
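To make the separation concrete, here is a minimal sketch. Everything in it is illustrative: `call_model` is a stub standing in for a real model API, and the prompts are placeholders. The structural point is that each agent holds its own history, and the reviewer receives the spec and the output but never the implementer's conversation.

```python
from dataclasses import dataclass, field

def call_model(messages: list[dict]) -> str:
    # Stub standing in for a real model API call. In practice every agent
    # would hit the same underlying model.
    return f"response to {len(messages)} messages"

@dataclass
class Agent:
    """One role, one context. Nothing leaks in from other sessions."""
    system_prompt: str
    history: list[dict] = field(default_factory=list)

    def run(self, prompt: str) -> str:
        self.history.append({"role": "user", "content": prompt})
        reply = call_model(
            [{"role": "system", "content": self.system_prompt}, *self.history]
        )
        self.history.append({"role": "assistant", "content": reply})
        return reply

# Same model, three isolated contexts, three different intents.
planner = Agent("Reason expansively. Surface edge cases and tradeoffs.")
implementer = Agent("Follow the spec exactly. Do not redesign.")
reviewer = Agent("Find problems. Judge the output against the spec only.")

spec = planner.run("Plan the feature.")
code = implementer.run(f"Implement this spec:\n{spec}")
# The reviewer sees the spec and the output, never the session that produced it.
verdict = reviewer.run(f"Spec:\n{spec}\n\nOutput:\n{code}")
```

The design choice worth noticing is that isolation is a property of the plumbing, not the model: the reviewer is independent only because nothing from the implementer's history is ever passed into its context.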

But separation alone isn’t enough. In my experience, AI agents default to being helpful — and in a review context, helpful means agreeable. I scope review agents with adversarial intent: their job is to find problems, not confirm quality. Without that, role separation just produces independent agents that all agree with each other.

That brings me back to the original question. If an agent writes the code and the tests, what are the tests validating? When a separate agent writes tests from the original spec, those tests become another constraint — same as the spec itself. The implementation agent follows them. It doesn’t own them, can’t modify them, and has to satisfy them to ship. Tests validate the code against the spec.
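The workflow above can be sketched as a loop in which spec-derived tests act as a fixed gate. All the names here (`TestAgent`, `ImplAgent`, `run_tests`, `build_feature`) are hypothetical stubs, not a real framework; the stub `run_tests` passes on the second attempt purely to exercise the retry loop.

```python
class TestAgent:
    """Writes tests from the original spec, never from the implementation."""
    def write_tests(self, spec: str) -> str:
        return f"tests derived from: {spec}"  # stub

class ImplAgent:
    """Follows the spec and the tests; owns neither."""
    def __init__(self) -> None:
        self.attempts = 0

    def implement(self, spec: str, tests: str) -> str:
        self.attempts += 1
        return f"code v{self.attempts}"  # stub

def run_tests(code: str, tests: str) -> bool:
    return code == "code v2"  # stub: fails once, then passes

def build_feature(spec: str, max_attempts: int = 3) -> str:
    tester, implementer = TestAgent(), ImplAgent()
    tests = tester.write_tests(spec)  # fixed before implementation starts
    for _ in range(max_attempts):
        code = implementer.implement(spec, tests)  # tests are read-only input
        if run_tests(code, tests):
            return code  # ships only when the spec-derived tests pass
    raise RuntimeError("spec-derived tests never satisfied")
```

The ordering is the whole point: the tests exist before the implementation agent runs, so they constrain it the same way the spec does, rather than describing whatever it happened to build.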