Testing in the Age of AI

AI coding agents made writing software fast and cheap. They did not make it correct. Verification — establishing that software does what it should and keeps doing it under change — is now the binding constraint on quality. This is a working playbook for requirements, static analysis, testing, and reliability when most code is machine-written.

spec-driven development · static analysis · property-based testing · mutation testing · deterministic simulation · evals · progressive delivery

The bottleneck moved. When humans wrote every line, the cost of producing code throttled output, and a careful author was the first line of defense against defects. AI removes that throttle. The scarce resource is no longer code — it is justified confidence that the code is correct. The engineer's central job shifts to two verbs: specify the intended behavior, and verify it independently of how the code was produced.

Why AI-written code is hard to verify

AI assistance does not just speed up the old workflow — it changes the shape of the risk. Five failure modes matter most:

Plausible but wrong. A model emits the statistically most likely implementation. That is usually correct on the common path and silently wrong on the edge cases it never enumerated. The code reads as confident and reviews as clean.

Correlated blind spots. A model that writes code and a model that reviews it share a training distribution. The reviewer rationalizes the same mistakes the author made. Same-distribution review checks code against itself, not against intent.

Volume outpaces review. AI raises the rate of change. Human review, QA, and incident response do not scale at the same rate. DORA's 2025 research links rapid AI adoption to rising change-failure rates and delivery instability where verification did not keep pace.

The oracle problem deepens. The hard part of a test is knowing the right answer. When a human no longer writes the code, fewer people hold the mental model of what "right" even is, so the expected value is itself uncertain.

Tests inherit the bug. Ask one agent for the code and its tests in a single pass and both encode the same misconception. The suite goes green, coverage looks excellent, and the defect ships anyway.

The core implication: you cannot verify AI-generated code with AI-generated tests alone, and you cannot verify it with coverage numbers. Verification has to be anchored to something the code generator did not produce — an explicit specification, a type system, an independent property, or a real-world signal.

Five layers of verification

No single technique is sufficient. Reliability comes from layering — cheap, fast, narrow checks early; expensive, slow, broad checks late — so each layer catches what the previous one missed. The rest of this page works through the five layers and what to actually adopt in each.

1 · Specify intent. Turn what the software should do into an explicit, versioned, checkable artifact before generating code.
2 · Constrain by construction. Use types, contracts, and static analysis so whole classes of defect become impossible or are caught before a test runs.
3 · Test behavior. Property-based, mutation, fuzz, metamorphic, differential, and simulation testing: verifying behavior, not line counts.
4 · Verify the AI itself. Evals, LLM-as-judge, and the generator–verifier pattern for AI-generated code and AI-powered features.
5 · Catch it in production. Observability, progressive delivery, chaos engineering, and SLOs for the bugs that reach real users.

Layer 1 — Specify intent

An AI agent cannot read your mind, and a prompt typed once and then discarded leaves nothing to verify against. The specification is the one artifact a same-distribution model cannot fake — so it must exist, it must be explicit, and it must outlive the prompt.

Specification-driven development

Treat a written specification as the durable source of truth and code as a derivative of it. The 2025–2026 tooling wave makes this concrete: GitHub Spec Kit structures work into Specify → Plan → Tasks → Implement phases and separates stable intent from flexible implementation; AWS Kiro is an agentic IDE that turns a prompt into structured requirements, a design document, and sequenced tasks; Tessl pushes toward "spec-as-source," where humans edit only the spec. Repo-level convention files — AGENTS.md and the equivalent CLAUDE.md — encode project-wide constraints every agent must respect.

Requirements that cannot be misread

Ambiguous requirements are where AI agents go wrong fastest. EARS notation ("When [trigger], the [system] shall [response]") forces requirements into a constrained, testable grammar. Run an LLM ambiguity pass over requirements before any code is generated — detecting vague quantifiers, missing triggers, and undefined terms is something models now do well.
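A cheap deterministic pre-check can run before the LLM ambiguity pass. The sketch below assumes only the event-driven EARS form quoted above; the function name `check_requirement` and the vague-term list are hypothetical and far from exhaustive. It is a minimal illustration, not a full EARS parser.

```python
import re

# Event-driven EARS form quoted in the text:
# "When <trigger>, the <system> shall <response>".
EARS_EVENT = re.compile(
    r"^When (?P<trigger>.+?), the (?P<system>.+?) shall (?P<response>.+?)\.?$",
    re.IGNORECASE,
)

# Illustrative (assumed) list of vague terms worth flagging for human review.
VAGUE_TERMS = {"fast", "quickly", "user-friendly", "appropriate", "some", "etc"}


def check_requirement(text: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if EARS_EVENT.match(text.strip()) is None:
        problems.append(
            "does not match 'When <trigger>, the <system> shall <response>'"
        )
    words = set(re.findall(r"[a-z\-]+", text.lower()))
    for term in sorted(VAGUE_TERMS & words):
        problems.append(f"vague term: {term!r}")
    return problems


if __name__ == "__main__":
    good = ("When the card is declined, the checkout service shall "
            "display an error within 2 seconds.")
    bad = "The system should respond quickly."
    print(check_requirement(good))
    print(check_requirement(bad))
```

A check like this only catches structural problems; undefined terms and missing triggers that happen to fit the grammar still need the model pass or a human reader.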

Executable acceptance criteria

Acceptance criteria that only live in prose get skipped. Express them as executable scenarios — Gherkin/BDD scenarios authored and human-reviewed before implementation — so they double as agent guardrails and as the acceptance test. For AI-powered features, the equivalent of the spec is the eval suite (see Layer 4).

Recommended: keep a versioned spec in the repo for every non-trivial feature; couple it to automated acceptance tests; treat spec drift as a first-class defect. Spec-anchored — specs that evolve with the code and are validated by tests — is the right default. Reserve full "spec-as-source" for stable, safety-critical domains.

Layer 2 — Constrain by construction

The cheapest defect is the one the language will not let you express. Before writing a single test, push as much correctness as possible into types, contracts, and static analysis — so whole classes of bug are either impossible or caught automatically.

Make illegal states unrepresentable

Model the domain so invalid combinations cannot be constructed: sum types and discriminated unions instead of loose flags, non-empty types instead of "a list that must not be empty," enums for state machines. Parse, don't validate — validate raw input once at the boundary into a well-typed value that carries proof of validity, so downstream code never re-checks. Tools: TypeScript strict mode with Zod for boundary parsing; Pydantic in Python; Rust's enums and exhaustive matching. An agent that generates code against a well-typed model inherits these invariants automatically — they are the shape of the data, not a test it can forget to write.
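As a rough illustration of both ideas using only the Python standard library (Pydantic or Zod would do the boundary parsing in a real project), the sketch below parses an email once at the boundary and models an order's state as a union of separate types. All names (`Email`, `parse_email`, `Pending`, `Paid`, `Refunded`) are hypothetical, and the email check is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Email:
    """By convention constructed only via parse_email, so holding one
    means the value was already checked at the boundary."""
    value: str


def parse_email(raw: str) -> Email:
    """Parse raw input once; raise ValueError rather than return a bad value."""
    candidate = raw.strip().lower()
    local, sep, domain = candidate.partition("@")
    if not sep or not local or "." not in domain:
        raise ValueError(f"not an email address: {raw!r}")
    return Email(candidate)


# One class per state instead of loose flags: a Pending order has no
# receipt_id field at all, so "pending but has a receipt" cannot be built.
@dataclass(frozen=True)
class Pending:
    pass


@dataclass(frozen=True)
class Paid:
    receipt_id: str


@dataclass(frozen=True)
class Refunded:
    receipt_id: str
    reason: str


OrderState = Union[Pending, Paid, Refunded]


def describe(state: OrderState) -> str:
    if isinstance(state, Pending):
        return "awaiting payment"
    if isinstance(state, Paid):
        return f"paid ({state.receipt_id})"
    if isinstance(state, Refunded):
        return f"refunded ({state.receipt_id}): {state.reason}"
    raise TypeError(f"unknown state: {state!r}")
```

Python does not enforce the "only via parse_email" convention or exhaustive matching by itself; a static type checker or a language such as Rust gives stronger guarantees for the same design.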

Design by contract

Preconditions, postconditions, and invariants turn a specification into a machine-checkable annotation on the code itself: icontract (Python), the contracts crate (Rust), Eiffel, and — for certified software — SPARK/Ada, where contracts are formally proven, not merely checked at runtime.
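To make the idea concrete without depending on any of the libraries above, here is a hand-rolled stand-in: a hypothetical `contract` decorator that checks a precondition before the call and a postcondition after it. It is not icontract's API, only a sketch of the runtime-checked flavor of contracts; `withdraw` is an invented example function.

```python
import functools
from typing import Callable


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Check `pre` on the arguments before the call and `post` on
    (result, *args) after it. Hypothetical helper, not a library API."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise AssertionError(
                    f"precondition failed for {fn.__name__}{args}")
            result = fn(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise AssertionError(
                    f"postcondition failed for {fn.__name__}{args}")
            return result
        return wrapper
    return decorate


@contract(
    # Caller's obligation: a positive amount that the balance can cover.
    pre=lambda balance, amount: 0 < amount <= balance,
    # Implementation's obligation: exact arithmetic, never negative.
    post=lambda result, balance, amount: result == balance - amount
    and result >= 0,
)
def withdraw(balance: int, amount: int) -> int:
    return balance - amount
```

Calling `withdraw(100, 30)` returns 70, while `withdraw(10, 50)` raises before the body runs. The contract states intent separately from the implementation, which is what lets it catch a later rewrite that breaks the rule.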

Static analysis as a merge gate

Run semantic static analysis in CI on every change. Semgrep and CodeQL perform cross-file dataflow and taint analysis that finds injection and other security flaws that simple pattern matching misses; SonarQube adds a quality dashboard. The 2025–2026 shift is AI-assisted triage: the tools now learn a codebase's false-positive patterns and propose autofixes, which is what makes high-volume scanning tolerable. Linters and type-checkers belong on the same gate: fast, deterministic, non-negotiable.

Lightweight formal methods, now practical

Formal methods are no longer only for academia. TLA+ and the P language model-check distributed-protocol designs — AWS uses both in production to find correctness bugs before implementation begins. Kani brings bounded model checking to Rust; Alloy verifies data models and access-control policies; Dafny and Lean prove critical algorithms correct. AI lowers the cost of writing the specs and proofs; SMT solvers do the checking. Reserve these for distributed protocols, security-sensitive state machines, and safety-critical logic — that is where they pay back.

Recommended: adopt strict typing plus boundary parsing everywhere; contracts on high-value boundary functions; Semgrep + CodeQL as a blocking CI gate with AI triage enabled. Add TLA+ or Kani only where the problem is a protocol or safety-critical — for ordinary application code, types and property tests give the better return.

Layer 3 — Test behavior, not lines

Code coverage measures which lines ran, not whether the assertions mean anything. AI reliably produces high-coverage suites with weak assertions — so coverage as a merge gate actively misleads. The techniques below test behavior, and the strength of the tests themselves.

Property-based testing

Instead of fixed examples, declare an invariant that must always hold, let the framework generate thousands of inputs, and have it shrink any failure to a minimal reproducing case: Hypothesis (Python), fast-check (JS/TS), jqwik (Java), proptest (Rust). This is the natural division of labor with AI: let the agent draft the implementation, but have a human — or an independent agent — state the properties. Use stateful / model-based property testing for APIs and state machines.
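The mechanics can be sketched with the standard library alone. The block below is a simplified stand-in for a framework such as Hypothesis: it generates seeded random inputs and reports the first counterexample, but it does not shrink failures. The run-length encoder and the helper names (`check_property`, `random_text`) are hypothetical.

```python
import random


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Implementation under test (imagine an agent drafted it)."""
    runs: list[tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in runs)


def check_property(prop, gen, trials: int = 1000, seed: int = 0):
    """Run `prop` on generated inputs; return the first failing input or None.
    Unlike a real framework, this does no shrinking."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value
    return None


def random_text(rng: random.Random) -> str:
    # Tiny alphabet so repeated characters, the interesting case, are common.
    return "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))


# Properties stated independently of how rle_encode is written.
def round_trip(text: str) -> bool:
    return rle_decode(rle_encode(text)) == text


def runs_are_maximal(text: str) -> bool:
    runs = rle_encode(text)
    return all(a[0] != b[0] for a, b in zip(runs, runs[1:]))


if __name__ == "__main__":
    print(check_property(round_trip, random_text))
    print(check_property(runs_are_maximal, random_text))
```

The two properties say what any correct encoder must satisfy without restating its algorithm, which is why they can be written by someone other than the author of the code.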

Mutation testing — who tests the tests?

Mutation testing injects small faults (flip a comparison, delete a line) and measures how many the suite catches. The kill rate is a direct measure of test effectiveness — the thing coverage cannot see, and the only practical way to know AI-written tests have teeth. Tools: Stryker (JS/TS, C#, Scala), PIT (Java), mutmut (Python), cargo-mutants (Rust). Gate on mutation score, not coverage percentage.
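A toy version shows why the score differs from coverage. Real tools generate mutants automatically from the source; in this sketch the mutants are written by hand, and every name (`is_adult`, `MUTANTS`, `weak_suite`, `strong_suite`) is hypothetical.

```python
def is_adult(age: int) -> bool:
    return age >= 18


# Hand-written mutants standing in for what a mutation tool would generate.
MUTANTS = {
    ">= becomes >": lambda age: age > 18,
    ">= becomes <": lambda age: age < 18,
    "18 becomes 19": lambda age: age >= 19,
    "always True": lambda age: True,
}


def weak_suite(fn) -> bool:
    # Executes every line of is_adult (full line coverage) but only
    # checks an input far from the boundary.
    return fn(40) is True


def strong_suite(fn) -> bool:
    # Adds the boundary value and the first value below it.
    return fn(40) is True and fn(18) is True and fn(17) is False


def mutation_score(suite) -> float:
    """Fraction of mutants the suite kills (a killed mutant fails the suite)."""
    assert suite(is_adult), "suite must pass on the original code"
    killed = sum(1 for mutant in MUTANTS.values() if not suite(mutant))
    return killed / len(MUTANTS)


if __name__ == "__main__":
    print(mutation_score(weak_suite))    # 0.25
    print(mutation_score(strong_suite))  # 1.0
```

Both suites reach the same line coverage of `is_adult`, yet the weak one lets three of the four mutants survive. That gap is the signal a mutation gate is designed to expose.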

Fuzzing

Coverage-guided fuzzers — libFuzzer, AFL++ — explore inputs no one thought to write a test for, and OSS-Fuzz runs them continuously. The historic barrier was writing fuzz harnesses by hand; LLMs now generate harnesses well, which removes the main excuse. Make fuzzing routine for every parser, deserializer, and protocol handler.

Metamorphic and differential testing — the oracle problem

When you cannot compute the correct output directly, test relations instead. Metamorphic testing asserts properties that survive a transformation (rotate an image 360° and it should be unchanged; add an irrelevant document and the top search result should not move). Differential testing runs two implementations on the same input and flags divergence — and the pre-refactor version of a function is a free oracle for checking the AI's rewrite of it. For these techniques worked end to end in one domain, see the deep-dive on testing rule-based format converters.
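Both techniques fit in a few lines. The sketch below is illustrative only: `slugify_old` plays the trusted pre-refactor function, `slugify_new` the rewrite under test, and `top_result` is a toy search used for the irrelevant-document relation. All of these names and behaviors are assumptions made for the example.

```python
import random
import re


def slugify_old(title: str) -> str:
    """Pre-refactor version, used as the oracle."""
    out = []
    for ch in title.lower():
        if ch.isalnum():
            out.append(ch)
        elif out and out[-1] != "-":
            out.append("-")
    return "".join(out).strip("-")


def slugify_new(title: str) -> str:
    """Rewritten version under test."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def differential(old, new, inputs):
    """Return every input on which the two implementations disagree."""
    return [x for x in inputs if old(x) != new(x)]


def top_result(query: str, docs: list[str]) -> str:
    """Toy search: rank by shared words, ties broken alphabetically."""
    terms = set(query.lower().split())
    return max(sorted(docs), key=lambda d: len(terms & set(d.lower().split())))


def irrelevant_doc_keeps_top(query: str, docs: list[str], extra: str) -> bool:
    """Metamorphic relation: adding a document that shares no query
    terms must not change the top hit."""
    return top_result(query, docs) == top_result(query, docs + [extra])


rng = random.Random(0)
ascii_samples = [
    "".join(rng.choice("aB3 -_!") for _ in range(rng.randint(0, 10)))
    for _ in range(500)
]
DOCS = ["rust borrow checker", "python type hints", "go channels"]

if __name__ == "__main__":
    print(differential(slugify_old, slugify_new, ascii_samples))
    print(differential(slugify_old, slugify_new, ["Café au lait"]))
    print(irrelevant_doc_keeps_top("python hints", DOCS, "aardvark habitats"))
```

On the ASCII samples the two slug functions agree, but on "Café au lait" they diverge because the rewrite drops the accented letter. Differential testing only flags the divergence; whether the old or the new behavior is correct is still a specification decision.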

Deterministic simulation testing

For distributed systems, run the whole system on a single thread with seeded randomness and injectable faults — network drops, disk failures, clock skew — so concurrency and timing bugs become perfectly reproducible. The pattern was proven by FoundationDB and TigerBeetle; Antithesis offers it as a platform, and madsim brings it to Rust. It finds the multi-failure, timing-dependent bugs that escape unit tests and ordinary CI entirely.
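The core trick, a single thread with every source of randomness drawn from one seed, can be shown at toy scale. The sketch below is not how FoundationDB, TigerBeetle, Antithesis, or madsim work internally; it is a hypothetical miniature in which a sender retries over a lossy network and a failing seed becomes an exact reproduction.

```python
import random


def simulate(seed: int, messages: int = 5, drop_rate: float = 0.3,
             max_attempts: int = 3):
    """Single-threaded run. All nondeterminism comes from one seeded RNG,
    so the seed alone reproduces the whole execution."""
    rng = random.Random(seed)
    log = []
    delivered = []
    for msg in range(messages):
        for attempt in range(1, max_attempts + 1):
            if rng.random() < drop_rate:  # injected fault: network drop
                log.append(f"msg {msg} attempt {attempt}: dropped")
                continue
            log.append(f"msg {msg} attempt {attempt}: delivered")
            delivered.append(msg)
            break
    return delivered, log


def find_failing_seed(seeds, messages: int = 5):
    """Search seeds for a run that violates the invariant
    'every message is eventually delivered'."""
    for seed in seeds:
        delivered, _ = simulate(seed, messages=messages)
        if len(delivered) != messages:
            return seed
    return None


if __name__ == "__main__":
    seed = find_failing_seed(range(1000))
    print("failing seed:", seed)
    if seed is not None:
        # Replaying the seed yields the identical fault schedule.
        for line in simulate(seed)[1]:
            print(line)
```

The bug here (a bounded retry count that can lose a message) only appears under a particular sequence of drops. Once a seed exposes it, the same seed replays it every time, which is what makes the failure debuggable.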

Lock behavior for safe change

Snapshot / approval testing (ApprovalTests, insta, Jest snapshots) and characterization tests capture current behavior before an AI-driven refactor. Consumer-driven contract testing with Pact protects service boundaries — it does not care how the AI rewrote the internals, only that the observable contract still holds.

Technique	What it catches	Representative tools
Property-based testing	Invariant violations across a huge input space	Hypothesis, fast-check, jqwik, proptest
Mutation testing	Weak or absent assertions in the test suite	Stryker, PIT, mutmut, cargo-mutants
Fuzzing	Crashes, panics, unhandled inputs	libFuzzer, AFL++, OSS-Fuzz
Metamorphic / differential	Wrong output when the correct answer is unknown	PBT frameworks; pre-/post-refactor comparison
Deterministic simulation	Concurrency, timing, and multi-failure bugs	Antithesis, madsim
Snapshot / contract testing	Unintended behavior or interface change	ApprovalTests, insta, Pact

The AI-generated-test trap: AI test suites fail in characteristic ways — tautological tests that re-derive the expected value from the implementation; mirror bugs, where code and tests written in one pass share the same misconception; and coverage theater, high coverage with assertions that prove nothing. Defenses: write the property or specification first; make every test fail before it passes (no red phase means no evidence it detects anything); gate on mutation score; and never accept code and its tests from the same agent in one pass without independent scrutiny.

Layer 4 — Verify the AI itself

Two distinct problems live here: verifying code that AI wrote, and verifying a feature that calls a model at runtime. Both come down to the same asymmetry — checking an answer is cheaper than producing one.

Evals — the spec and the test for AI features

For anything that calls a model, a deterministic unit test does not apply. An eval — a scored run over a dataset of inputs — is both the spec and the regression test. Practice eval-driven development: write the eval suite before the feature, gate merges on eval scores, and keep a locked regression set that survives model upgrades. Tooling spans a CI layer — Promptfoo, DeepEval, and RAGAS for retrieval — and a platform layer — Braintrust, LangSmith, Arize Phoenix. Inspect, from the UK AI Security Institute, is the reference framework for rigorous capability and safety evals.
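Stripped of tooling, an eval gate is a scoring loop plus a threshold. The sketch below is a minimal, assumed shape: `stub_model` stands in for a real model call, the dataset and the substring scorer are invented for illustration, and the 0.9 threshold is arbitrary. Tools such as Promptfoo or DeepEval provide far richer scorers.

```python
from typing import Callable

# Hypothetical locked regression set: input plus a simple expectation.
DATASET = [
    {"input": "Refund for order 1001", "must_contain": "refund"},
    {"input": "I forgot my password", "must_contain": "security"},
    {"input": "Where is my parcel?", "must_contain": "shipping"},
    {"input": "Refund not received", "must_contain": "refund"},
]


def stub_model(prompt: str) -> str:
    """Stand-in for the model call so the example runs offline."""
    text = prompt.lower()
    if "refund" in text:
        return "Routing to the refund team"
    if "password" in text:
        return "Routing to account security"
    return "Routing to general support"


def run_eval(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Score = fraction of cases whose output contains the expected text."""
    passed = sum(
        1 for case in dataset
        if case["must_contain"] in model(case["input"]).lower()
    )
    return passed / len(dataset)


def gate(score: float, threshold: float = 0.9) -> bool:
    """Merge gate: True means the change may proceed."""
    return score >= threshold


if __name__ == "__main__":
    score = run_eval(stub_model, DATASET)
    print(score, gate(score))  # 0.75 False
```

Because the dataset is versioned separately from the feature, the same run doubles as a regression check when the underlying model is upgraded.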

LLM-as-judge — and its biases

Using a model to grade model output scales evaluation past human annotation, but the judge carries real biases: position bias (favoring whichever answer came first), verbosity bias (longer reads as better), and self-family bias (over-rewarding its own model family). Calibrate the judge against a human-labeled gold set, use a judge from a different model family than the generator, give it an explicit rubric rather than "rate 1–10," and recalibrate on a schedule.
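One common mitigation for position bias is to ask the judge twice with the answers swapped and accept only verdicts that survive the swap. The sketch below assumes a judge callable that returns "first" or "second"; `biased_judge` is a deliberately flawed stand-in for a model call, and `agreement` is a hypothetical helper for calibration against a gold set.

```python
def biased_judge(question: str, first: str, second: str) -> str:
    """Stand-in for a model call. Prefers the longer answer (verbosity
    bias) and the first answer on ties (position bias)."""
    return "first" if len(first) >= len(second) else "second"


def debiased_verdict(judge, question: str, a: str, b: str) -> str:
    """Judge both orderings; keep the verdict only if it is stable."""
    forward = judge(question, a, b)    # a is shown first
    backward = judge(question, b, a)   # b is shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "inconsistent"  # route to a human or discard


def agreement(judge_labels: list[str], gold_labels: list[str]) -> float:
    """Fraction of items where the judge matches the human gold label."""
    matches = sum(1 for j, g in zip(judge_labels, gold_labels) if j == g)
    return matches / len(gold_labels)


if __name__ == "__main__":
    print(debiased_verdict(biased_judge, "q", "short", "a much longer answer"))
    print(debiased_verdict(biased_judge, "q", "abc", "xyz"))
    print(agreement(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
```

Swapping removes the pure position effect, as the tie case shows, but not the verbosity preference: the longer answer still wins in both orderings. That residual bias is why calibration against human labels remains necessary.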

The generator–verifier pattern

The deepest principle of the AI era: it is cheaper to check than to generate. Design AI workflows so an independent — ideally deterministic — check verifies the agent's output: a separate reviewer agent that reads the source itself rather than trusting the generator's summary; self-consistency voting across several generations; confidence scoring that routes low-confidence output to a human. Never let the agent that produced the code be the only thing that approves it.
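Two of these mechanisms are easy to sketch. Below, `is_valid_sort` is a deterministic verifier that checks a claimed sort result without trusting whoever produced it, and `vote` applies self-consistency voting with a routing threshold. The names and the 0.6 agreement threshold are assumptions for illustration.

```python
from collections import Counter


def is_valid_sort(original: list[int], claimed: list[int]) -> bool:
    """Deterministic check: same elements, in non-decreasing order.
    It never needs to know how `claimed` was produced."""
    same_items = Counter(original) == Counter(claimed)
    ordered = all(x <= y for x, y in zip(claimed, claimed[1:]))
    return same_items and ordered


def vote(candidates: list[str], min_agreement: float = 0.6):
    """Self-consistency voting over several generations.
    Returns (winner, confidence, route)."""
    winner, count = Counter(candidates).most_common(1)[0]
    confidence = count / len(candidates)
    route = "auto" if confidence >= min_agreement else "human"
    return winner, confidence, route


if __name__ == "__main__":
    print(is_valid_sort([3, 1, 2], [1, 2, 3]))      # True
    print(is_valid_sort([3, 1, 2], [1, 2, 2]))      # False: element swapped
    print(vote(["42", "42", "41", "42", "42"]))     # ('42', 0.8, 'auto')
    print(vote(["a", "b", "c"])[2])                 # human
```

The verifier is the stronger of the two because it is independent of the generator. Voting only measures agreement among samples from the same model, so a shared misconception can still win the vote.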

AI code review, and non-determinism

AI review bots — CodeRabbit, Qodo, GitLab Duo and similar — are a useful first-pass triage that cuts human reviewer load, but they do not replace human judgment on architecture, concurrency, and cross-system impact, and they are themselves a prompt-injection surface. To test a system that includes a model, separate the deterministic layer (everything around the model call — unit-test it normally) from the model call itself (stub or replay it for fast CI; use probabilistic assertions and semantic-similarity scoring for integration runs). Ask whether output falls within an acceptable distribution, not whether it equals an exact string.

Recommended: treat evals as a required deliverable for any model-powered feature, versioned and gated like unit tests. Build at least one independent verification step into every agentic workflow. Keep AI review as triage, not as the gate of record.

Layer 5 — Catch it in production

No pre-merge process catches everything — the real distribution of inputs only exists in production. The goal of this layer is to make production observable and releases reversible, so the bugs that escape are found in minutes and contained to a fraction of users.

Observability

OpenTelemetry is the vendor-neutral standard for traces, metrics, and logs. Its GenAI semantic conventions now standardize telemetry for model calls — model, token counts, latency, cost — so AI features are observable on the same footing as everything else. Instrument every model call and every agent step.

Progressive delivery

Feature flags plus canary and ring releases bound the blast radius of bad — possibly AI-written — code: release to 1% of traffic, watch the metrics, then expand. Argo Rollouts and Flagger automate metric-gated canaries; LaunchDarkly leads on flags; OpenFeature is the vendor-neutral flag API. Ship every AI-built feature behind a flag; gate each promotion on metrics, not on the clock.
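The decision logic behind a metric-gated canary can be sketched as a pure function. This is not the configuration model of Argo Rollouts or Flagger; the step list, the minimum sample size, and the allowed regression are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


# Assumed rollout steps, in percent of traffic.
STEPS = [1, 10, 50, 100]


def next_step(current_percent: int, baseline: Metrics, canary: Metrics,
              min_requests: int = 500, max_regression: float = 0.005) -> int:
    """Return the next traffic percentage: hold, roll back (0), or promote.
    Promotion depends on evidence, never on elapsed time."""
    if canary.requests < min_requests:
        return current_percent  # hold: not enough traffic to judge
    if canary.error_rate > baseline.error_rate + max_regression:
        return 0  # roll back: canary is measurably worse
    idx = STEPS.index(current_percent)
    return STEPS[min(idx + 1, len(STEPS) - 1)]


if __name__ == "__main__":
    baseline = Metrics(requests=100_000, errors=100)
    print(next_step(1, baseline, Metrics(1_000, 1)))    # 10 (promote)
    print(next_step(10, baseline, Metrics(1_000, 20)))  # 0 (roll back)
    print(next_step(1, baseline, Metrics(100, 0)))      # 1 (hold)
```

The hold branch matters as much as the rollback branch: a canary with too little traffic has proven nothing, so the clock running out is not a reason to promote.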

Chaos engineering

Deliberately inject failures to prove the system is as resilient as designed: Chaos Mesh, Gremlin, AWS Fault Injection Service. For AI features specifically, inject model timeouts and error responses to confirm the fallback paths actually work.

SLOs and error budgets

Define reliability quantitatively — a service-level objective and the error budget it implies. A healthy budget means ship freely; a budget burning hot means freeze and investigate. The error budget is the rate limiter for AI-accelerated delivery: it converts "we are shipping faster" into a measured, bounded decision.
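The arithmetic is short enough to show directly. For a 99.9% availability SLO over 30 days, the error budget is 0.1% of 43,200 minutes, or about 43.2 minutes. Burn rate is the observed error rate divided by the rate the SLO allows; a sustained value above 1 exhausts the budget before the window ends. The function names and the freeze threshold of 1.0 below are assumptions, and production alerting usually combines several windows.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed bad minutes in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate relative to the rate the SLO permits.
    1.0 means the budget is consumed exactly at the end of the window."""
    observed = bad_events / total_events
    return observed / (1.0 - slo)


def release_decision(rate: float, freeze_above: float = 1.0) -> str:
    """Assumed policy: freeze when the budget is burning too fast."""
    return "freeze and investigate" if rate > freeze_above else "ship"


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))       # 43.2
    hot = burn_rate(bad_events=50, total_events=10_000, slo=0.999)
    calm = burn_rate(bad_events=5, total_events=10_000, slo=0.999)
    print(round(hot, 2), release_decision(hot))        # 5.0 freeze and investigate
    print(round(calm, 2), release_decision(calm))      # 0.5 ship
```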

AIOps

Anomaly detection, AI incident summarization, and automated rollback — Datadog Watchdog, Dynatrace Davis AI — cut detection and triage time. Automate remediation only for well-understood, reversible failures; require a human decision on severe incidents.

Recommended: instrument with OpenTelemetry from day one; ship every feature behind a flag with a metric-gated canary; define SLOs and burn-rate alerts before launch; run at least one chaos experiment per critical dependency.

A recommended verification pipeline

The five layers assemble into one staged pipeline. Each stage is a gate — cheap and fast first, broad and slow last. The point is not to run every tool on every change; it is that nothing reaches users without passing the gates its risk level demands.

Stage	What it gates	Representative tooling
Author	A reviewed spec exists in the repo; requirements are unambiguous	Spec Kit, Kiro, EARS, AGENTS.md
Pre-commit	Formatting, linting, type-checking, secret scanning	Language toolchain, type checker
Pull request	Static + taint analysis; AI review as triage; human review of intent	Semgrep, CodeQL, review bot
Test — fast	Unit + property tests pass; mutation score above threshold; contracts hold	Hypothesis / fast-check, Stryker / PIT, Pact
Test — deep	Fuzzing; differential vs. prior version; simulation for distributed systems; evals for AI features	OSS-Fuzz, Antithesis / madsim, Promptfoo
Release	Behind a flag; canary promoted on metrics, not time	OpenFeature, Argo Rollouts
Production	Tracing in place; SLO burn-rate alerts; chaos drills	OpenTelemetry, Datadog / Dynatrace, Chaos Mesh

Eight principles

Specify before you generate. A prompt typed once and discarded is not a specification.
Push correctness into types and contracts so tests have less to catch and the agent inherits the invariants.
Test properties and behavior, not line coverage. Gate on mutation score; stop using coverage percentage as a merge gate.
Make every test fail before it passes — including, especially, AI-written tests.
Never let the agent that wrote the code be the only thing that approves it. Build in an independent verifier.
Treat evals as the spec and the test for anything that calls a model at runtime.
Assume some defects reach production. Make releases reversible and systems observable.
Invest where the constraint is. Verification capacity, not generation speed, now sets real throughput.

Anti-patterns to retire: coverage percentage as a merge gate; accepting code and its tests generated together in one unreviewed pass; trusting an AI reviewer as the gate of record; shipping without an executable spec; and measuring the team on PRs merged or lines shipped rather than on escaped defects and change-failure rate.

The bottom line: AI did not make testing obsolete — it made testing the job. Writing code was the work that gated software; now it is the easy part, and the discipline that separates a strong team is its ability to specify intended behavior and verify it independently of how the code was produced. The organizations that come out ahead will treat verification — not generation — as the product.

Domain deep-dive

How the five layers play out in one demanding domain — a worked example:

Testing rule-based format converters. Round-trip and metamorphic properties, schema-driven corner-case generation, rule coverage, and environment-matrix testing for format conversion software.

Related

Tool landscape. Assistants, agents, IDEs, review bots, test tools, and observability.
Risks + governance. Security, hallucinations, technical debt, and adoption guardrails.
