Testing in the Age of AI
AI coding agents made writing software fast and cheap. They did not make it correct. Verification — establishing that software does what it should and keeps doing it under change — is now the binding constraint on quality. This is a working playbook for requirements, static analysis, testing, and reliability when most code is machine-written.
spec-driven development · static analysis · property-based testing · mutation testing · deterministic simulation · evals · progressive delivery
Why AI-written code is hard to verify
AI assistance does not just speed up the old workflow — it changes the shape of the risk. Five failure modes matter most:
Five layers of verification
No single technique is sufficient. Reliability comes from layering — cheap, fast, narrow checks early; expensive, slow, broad checks late — so each layer catches what the previous one missed. The rest of this page works through the five layers and what to actually adopt in each.
Layer 1 — Specify intent
An AI agent cannot read your mind, and a prompt typed once and then discarded leaves nothing to verify against. The specification is the one artifact a same-distribution model cannot fake — so it must exist, it must be explicit, and it must outlive the prompt.
Specification-driven development
Treat a written specification as the durable source of truth and code as a derivative of it. The 2025–2026 tooling wave makes this concrete: GitHub Spec Kit structures work into Specify → Plan → Tasks → Implement phases and separates stable intent from flexible implementation; AWS Kiro is an agentic IDE that turns a prompt into structured requirements, a design document, and sequenced tasks; Tessl pushes toward "spec-as-source," where humans edit only the spec. Repo-level convention files — AGENTS.md and the equivalent CLAUDE.md — encode project-wide constraints every agent must respect.
Requirements that cannot be misread
Ambiguous requirements are where AI agents go wrong fastest. EARS notation ("When [trigger], the [system] shall [response]") forces requirements into a constrained, testable grammar. Run an LLM ambiguity pass over requirements before any code is generated — detecting vague quantifiers, missing triggers, and undefined terms is something models now do well.
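A constrained grammar is easy to check mechanically, even before a model is involved. The sketch below is a minimal illustration, assuming only the event-driven EARS form quoted above; the regular expression, the `VAGUE_TERMS` list, and the function name `check_requirement` are hypothetical choices for this example, not part of EARS itself.

```python
import re

# Assumed pattern for the event-driven form quoted in the text:
# "When <trigger>, the <system> shall <response>."
EARS_EVENT = re.compile(
    r"^When\s+(?P<trigger>.+?),\s+the\s+(?P<system>.+?)\s+shall\s+(?P<response>.+?)\.?$",
    re.IGNORECASE,
)

# Illustrative list of vague words; a real project would maintain its own.
VAGUE_TERMS = {"fast", "quickly", "user-friendly", "appropriate", "some", "several"}


def check_requirement(text: str) -> list[str]:
    """Return the problems found in one requirement; an empty list means it passed."""
    problems = []
    if EARS_EVENT.match(text.strip()) is None:
        problems.append("does not match 'When <trigger>, the <system> shall <response>'")
    words = set(re.findall(r"[a-z\-]+", text.lower()))
    for term in sorted(VAGUE_TERMS & words):
        problems.append(f"vague term: {term!r}")
    return problems


good = "When the cart total changes, the checkout service shall recompute tax within 200 ms."
bad = "The system should be fast."
print(check_requirement(good))
print(check_requirement(bad))
```

A check like this only covers structure and a word list; the LLM ambiguity pass described above is what catches undefined terms and missing triggers that no fixed pattern anticipates.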
Executable acceptance criteria
Acceptance criteria that only live in prose get skipped. Express them as executable scenarios — Gherkin/BDD scenarios authored and human-reviewed before implementation — so they double as agent guardrails and as the acceptance test. For AI-powered features, the equivalent of the spec is the eval suite (see Layer 4).
Layer 2 — Constrain by construction
The cheapest defect is the one the language will not let you express. Before writing a single test, push as much correctness as possible into types, contracts, and static analysis — so whole classes of bug are either impossible or caught automatically.
Make illegal states unrepresentable
Model the domain so invalid combinations cannot be constructed: sum types and discriminated unions instead of loose flags, non-empty types instead of "a list that must not be empty," enums for state machines. Parse, don't validate — validate raw input once at the boundary into a well-typed value that carries proof of validity, so downstream code never re-checks. Tools: TypeScript strict mode with Zod for boundary parsing; Pydantic in Python; Rust's enums and exhaustive matching. An agent that generates code against a well-typed model inherits these invariants automatically — they are the shape of the data, not a test it can forget to write.
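As a rough illustration of "parse, don't validate" using only the Python standard library (Pydantic or Zod would replace most of this in practice), the sketch below makes an invalid email unconstructable and models order state as a union, so a tracking id can only exist on a shipped order. The names `Email`, `Pending`, `Shipped`, and `describe` are invented for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Email:
    """Construction performs the check, so holding an Email implies it is well-formed."""
    value: str

    def __post_init__(self) -> None:
        local, sep, domain = self.value.partition("@")
        if not (sep and local and "." in domain):
            raise ValueError(f"not an email address: {self.value!r}")


@dataclass(frozen=True)
class Pending:
    pass


@dataclass(frozen=True)
class Shipped:
    tracking_id: str  # only a shipped order can carry a tracking id


# A union of variants instead of loose flags such as is_shipped plus an optional tracking_id.
OrderState = Union[Pending, Shipped]


def describe(state: OrderState) -> str:
    """Handle every variant explicitly; an unknown variant is an error, not a default."""
    if isinstance(state, Pending):
        return "pending"
    if isinstance(state, Shipped):
        return f"shipped ({state.tracking_id})"
    raise TypeError(f"unknown state: {state!r}")


print(describe(Shipped("T1")))
```

The email check here is deliberately crude. The point is where the check lives: once, at construction, rather than repeated in every function that receives a string.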
Design by contract
Preconditions, postconditions, and invariants turn a specification into a machine-checkable annotation on the code itself: icontract (Python), the contracts crate (Rust), Eiffel, and — for certified software — SPARK/Ada, where contracts are formally proven, not merely checked at runtime.
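To show the idea without depending on any of the libraries above, here is a minimal runtime-checked contract written as a plain decorator. It is a sketch only: the `contract` helper and the `withdraw` example are hypothetical, and icontract's real API is different and considerably richer.

```python
import functools
from typing import Callable


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Hypothetical helper: check `pre` on the arguments and `post` on the result."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise AssertionError(f"precondition failed for {func.__name__}")
            result = func(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise AssertionError(f"postcondition failed for {func.__name__}")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda balance, amount: 0 < amount <= balance,
    post=lambda result, balance, amount: result == balance - amount and result >= 0,
)
def withdraw(balance: int, amount: int) -> int:
    return balance - amount


print(withdraw(100, 30))
```

Calling `withdraw(10, 50)` raises an `AssertionError` at the boundary instead of returning a negative balance for some later code path to trip over.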
Static analysis as a merge gate
Run semantic static analysis in CI on every change. Semgrep and CodeQL perform cross-file dataflow and taint analysis, which finds injection and other security flaws that plain pattern matching misses; SonarQube adds a quality dashboard. The 2025–2026 shift is AI-assisted triage: tools increasingly suppress a codebase's recurring false positives and propose autofixes, which is what makes high-volume scanning tolerable. Linters and type-checkers belong on the same gate: fast, deterministic, non-negotiable.
Lightweight formal methods, now practical
Formal methods are no longer only for academia. TLA+ and the P language model-check distributed-protocol designs — AWS uses both in production to find correctness bugs before implementation begins. Kani brings bounded model checking to Rust; Alloy verifies data models and access-control policies; Dafny and Lean prove critical algorithms correct. AI lowers the cost of writing the specs and proofs; SMT solvers do the checking. Reserve these for distributed protocols, security-sensitive state machines, and safety-critical logic — that is where they pay back.
Layer 3 — Test behavior, not lines
Code coverage measures which lines ran, not whether the assertions mean anything. AI reliably produces high-coverage suites with weak assertions — so coverage as a merge gate actively misleads. The techniques below test behavior, and the strength of the tests themselves.
Property-based testing
Instead of fixed examples, declare an invariant that must always hold, let the framework generate thousands of inputs, and have it shrink any failure to a minimal reproducing case: Hypothesis (Python), fast-check (JS/TS), jqwik (Java), proptest (Rust). This is the natural division of labor with AI: let the agent draft the implementation, but have a human — or an independent agent — state the properties. Use stateful / model-based property testing for APIs and state machines.
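The mechanics (generate, check, shrink) can be sketched with the standard library alone. This is an imitation of the idea, not the API of Hypothesis or any other framework named above, which generate and shrink far more cleverly. The function under test (`dedupe`), the property (`holds`), and the helpers `shrink` and `find_counterexample` are all hypothetical.

```python
import random


def dedupe(items: list[int]) -> list[int]:
    """Function under test: drop repeated values, keeping first occurrences."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def holds(items: list[int]) -> bool:
    """Properties: no duplicates, same set of values, and idempotence."""
    result = dedupe(items)
    return (
        len(result) == len(set(result))
        and set(result) == set(items)
        and dedupe(result) == result
    )


def shrink(items, prop):
    """Greedy shrink: keep removing single elements while the property still fails."""
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if not prop(candidate):
                current, changed = candidate, True
                break
    return current


def find_counterexample(prop, runs=1000, seed=0):
    """Try `runs` seeded random lists; return a shrunk failing input, or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        items = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if not prop(items):
            return shrink(items, prop)
    return None


print(find_counterexample(holds))
# A deliberately false property, to show shrinking at work:
print(find_counterexample(lambda xs: len(dedupe(xs)) == len(xs)))
```

The false property "dedupe never changes the length" fails on any list with a repeat, and the shrinker reduces whatever it finds to a two-element list of equal values, which is the kind of minimal reproducing case the frameworks report.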
Mutation testing — who tests the tests?
Mutation testing injects small faults (flip a comparison, delete a line) and measures how many the suite catches. The kill rate is a direct measure of test effectiveness — the thing coverage cannot see, and the only practical way to know AI-written tests have teeth. Tools: Stryker (JS/TS, C#, Scala), PIT (Java), mutmut (Python), cargo-mutants (Rust). Gate on mutation score, not coverage percentage.
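A toy version of one mutation operator shows why coverage and kill rate diverge. The sketch below uses Python's `ast` module to flip `>=` to `>` in a one-line function; both suites execute every line, but only one of them notices the change. The function `is_adult` and the two suites are invented for illustration, and real tools such as mutmut apply many operators across a whole codebase.

```python
import ast

SOURCE = '''
def is_adult(age):
    return age >= 18
'''


class FlipGtE(ast.NodeTransformer):
    """Mutation operator: replace every `>=` with `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the is_adult function it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn) -> bool:
    # Full line coverage, but no check at the boundary value.
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    return weak_suite(fn) and fn(18) is True and fn(17) is False


def mutant_killed(suite) -> bool:
    """True if the suite passes on the original code and fails on the mutant."""
    original = load(ast.parse(SOURCE))
    mutant = load(ast.fix_missing_locations(FlipGtE().visit(ast.parse(SOURCE))))
    assert suite(original), "suite must pass on the unmutated code"
    return not suite(mutant)


print("weak suite kills mutant:", mutant_killed(weak_suite))
print("strong suite kills mutant:", mutant_killed(strong_suite))
```

The mutation score is the fraction of mutants killed; here the weak suite scores 0 of 1 and the strong suite 1 of 1, despite identical line coverage.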
Fuzzing
Coverage-guided fuzzers — libFuzzer, AFL++ — explore inputs no one thought to write a test for, and OSS-Fuzz runs them continuously. The historic barrier was writing fuzz harnesses by hand; LLMs now generate harnesses well, which removes the main excuse. Make fuzzing routine for every parser, deserializer, and protocol handler.
Metamorphic and differential testing — the oracle problem
When you cannot compute the correct output directly, test relations instead. Metamorphic testing asserts properties that survive a transformation (rotate an image 360° and it should be unchanged; add an irrelevant document and the top search result should not move). Differential testing runs two implementations on the same input and flags divergence — and the pre-refactor version of a function is a free oracle for checking the AI's rewrite of it. For these techniques worked end to end in one domain, see the deep-dive on testing rule-based format converters.
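A differential check against the pre-refactor version can be very small. In the sketch below, `slugify_old` stands in for the existing function, `slugify_new` for a rewrite that should be equivalent, and `slugify_buggy` for a rewrite that is not; all three, and the input alphabet, are hypothetical examples rather than code from any real project.

```python
import random
import re


def slugify_old(title: str) -> str:
    """Pre-refactor version, treated as the oracle."""
    out = []
    for ch in title.lower():
        if ch.isalnum():
            out.append(ch)
        elif out and out[-1] != "-":
            out.append("-")
    return "".join(out).strip("-")


def slugify_new(title: str) -> str:
    """Rewrite under test: intended to behave identically on ASCII input."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_buggy(title: str) -> str:
    """A plausible but wrong rewrite: only handles single spaces."""
    return title.lower().replace(" ", "-")


def divergences(old, new, runs=500, seed=0):
    """Run both versions on seeded random ASCII strings and collect disagreements."""
    rng = random.Random(seed)
    alphabet = "abcXYZ019 -_!"
    found = []
    for _ in range(runs):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        if old(text) != new(text):
            found.append((text, old(text), new(text)))
    return found


print(len(divergences(slugify_old, slugify_new)))
print(divergences(slugify_old, slugify_buggy)[:1])
```

The alphabet is restricted to ASCII on purpose: the two versions are not claimed to agree on non-ASCII input, where `str.isalnum` and the `[a-z0-9]` class differ. Widening the generator is how that kind of hidden behavior change would surface.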
Deterministic simulation testing
For distributed systems, run the whole system on a single thread with seeded randomness and injectable faults — network drops, disk failures, clock skew — so concurrency and timing bugs become perfectly reproducible. The pattern was proven by FoundationDB and TigerBeetle; Antithesis offers it as a platform, and madsim brings it to Rust. It finds the multi-failure, timing-dependent bugs that escape unit tests and ordinary CI entirely.
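The core trick, that a run is a pure function of its seed, fits in a few lines. The toy below simulates a sender retrying a counter update over a lossy link, with request and acknowledgement drops injected from one seeded generator. It is a sketch under simplifying assumptions (one sender, one receiver, one message in flight) and does not reflect how FoundationDB, TigerBeetle, Antithesis, or madsim are built.

```python
import random


def simulate(seed: int, steps: int = 200, drop_rate: float = 0.2):
    """Single-threaded simulation; all randomness comes from one seeded generator."""
    rng = random.Random(seed)
    trace = []
    applied = 0    # receiver state: number of updates applied
    next_seq = 0   # sender state: sequence number awaiting acknowledgement
    for tick in range(steps):
        if rng.random() < drop_rate:
            trace.append((tick, "request dropped", next_seq))
            continue  # injected fault; the sender retries on the next tick
        if next_seq == applied:
            applied += 1  # receiver deduplicates retried requests by sequence number
        if rng.random() < drop_rate:
            trace.append((tick, "ack dropped", next_seq))
            continue  # ack lost; the sender will resend the same sequence number
        trace.append((tick, "acked", next_seq))
        next_seq += 1
        # Invariant: every acknowledged update was applied exactly once.
        assert applied == next_seq, "an update was lost or applied twice"
    return trace


first = simulate(seed=7)
second = simulate(seed=7)
print(first == second, len(first))
```

If the invariant ever failed, rerunning with the same seed would replay the identical sequence of drops, which is what makes a timing-dependent bug debuggable rather than a flaky report.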
Lock behavior for safe change
Snapshot / approval testing (ApprovalTests, insta, Jest snapshots) and characterization tests capture current behavior before an AI-driven refactor. Consumer-driven contract testing with Pact protects service boundaries — it does not care how the AI rewrote the internals, only that the observable contract still holds.
| Technique | What it catches | Representative tools |
|---|---|---|
| Property-based testing | Invariant violations across a huge input space | Hypothesis, fast-check, jqwik, proptest |
| Mutation testing | Weak or absent assertions in the test suite | Stryker, PIT, mutmut, cargo-mutants |
| Fuzzing | Crashes, panics, unhandled inputs | libFuzzer, AFL++, OSS-Fuzz |
| Metamorphic / differential | Wrong output when the correct answer is unknown | Property-based testing frameworks; pre-/post-refactor comparison |
| Deterministic simulation | Concurrency, timing, and multi-failure bugs | Antithesis, madsim |
| Snapshot / contract testing | Unintended behavior or interface change | ApprovalTests, insta, Pact |
Layer 4 — Verify the AI itself
Two distinct problems live here: verifying code that AI wrote, and verifying a feature that calls a model at runtime. Both come down to the same asymmetry — checking an answer is cheaper than producing one.
Evals — the spec and the test for AI features
For anything that calls a model, a deterministic unit test does not apply. An eval — a scored run over a dataset of inputs — is both the spec and the regression test. Practice eval-driven development: write the eval suite before the feature, gate merges on eval scores, and keep a locked regression set that survives model upgrades. Tooling spans a CI layer — Promptfoo, DeepEval, and RAGAS for retrieval — and a platform layer — Braintrust, LangSmith, Arize Phoenix. Inspect, from the UK AI Security Institute, is the reference framework for rigorous capability and safety evals.
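Stripped of tooling, an eval is a dataset, a scorer, and a threshold. The following sketch assumes a substring-match scorer and a stubbed model so that it runs offline; the dataset, `stub_model`, `run_eval`, and `gate` are hypothetical and far simpler than what Promptfoo, DeepEval, or Inspect provide.

```python
from typing import Callable

# Hypothetical eval dataset: (input, substring the answer must contain).
DATASET = [
    ("What is the capital of France?", "paris"),
    ("What is 2 + 2?", "4"),
    ("Name the largest planet.", "jupiter"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; one canned answer is deliberately wrong."""
    canned = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 2 + 2?": "2 + 2 = 4",
        "Name the largest planet.": "Saturn",
    }
    return canned.get(prompt, "")


def run_eval(model: Callable[[str], str], dataset) -> float:
    """Score = fraction of cases whose answer contains the expected substring."""
    passed = sum(1 for prompt, expected in dataset if expected in model(prompt).lower())
    return passed / len(dataset)


def gate(score: float, threshold: float) -> bool:
    """Merge gate: pass only if the score meets the threshold."""
    return score >= threshold


score = run_eval(stub_model, DATASET)
print(round(score, 3), gate(score, threshold=0.9))
```

Keeping `DATASET` under version control and unchanged across model upgrades is what turns the same harness into the locked regression set described above.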
LLM-as-judge — and its biases
Using a model to grade model output scales evaluation past human annotation, but the judge carries real biases: position bias (favoring whichever answer came first), verbosity bias (longer reads as better), and self-family bias (over-rewarding its own model family). Calibrate the judge against a human-labeled gold set, use a judge from a different model family than the generator, give it an explicit rubric rather than "rate 1–10," and recalibrate on a schedule.
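One common mitigation for position bias, not named in the text above but widely used, is to ask the judge twice with the answers swapped and accept only a consistent winner. The sketch below uses two stub judges in place of real model calls; every name in it is hypothetical.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def position_biased_judge(first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers whichever answer came first."""
    return "first"


def content_judge(first: str, second: str) -> str:
    """Stub judge that looks at content only: prefers an answer that cites a source."""
    return "first" if "[source]" in first else "second"


def debiased_verdict(judge: Judge, a: str, b: str) -> str:
    """Ask twice with the order swapped; return 'a', 'b', or 'tie' if the runs disagree."""
    forward = judge(a, b)    # a is shown first
    backward = judge(b, a)   # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else "tie"


answer_a = "Paris [source]"
answer_b = "Probably Paris"
print(debiased_verdict(position_biased_judge, answer_a, answer_b))
print(debiased_verdict(content_judge, answer_a, answer_b))
```

A high rate of `tie` verdicts on the human-labeled gold set is itself a calibration signal: it means the judge's preferences are driven by order more than by content.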
The generator–verifier pattern
The deepest principle of the AI era: it is cheaper to check than to generate. Design AI workflows so an independent — ideally deterministic — check verifies the agent's output: a separate reviewer agent that reads the source itself rather than trusting the generator's summary; self-consistency voting across several generations; confidence scoring that routes low-confidence output to a human. Never let the agent that produced the code be the only thing that approves it.
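Self-consistency voting with human routing can be sketched as a majority vote plus an agreement threshold. The function name `vote` and the 0.6 threshold are assumptions made for illustration; the answers would come from several independent generations of the same task.

```python
from collections import Counter


def vote(answers: list[str], min_agreement: float = 0.6):
    """Return (majority answer, agreement fraction, needs_human)."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    answer, count = Counter(answers).most_common(1)[0]
    confidence = count / len(answers)
    # Low agreement between generations is routed to a human instead of auto-accepted.
    return answer, confidence, confidence < min_agreement


print(vote(["42", "42", "41", "42", "42"]))
print(vote(["a", "b", "c"])[1:])
```

Agreement between samples from one model is a weak form of independence. It filters out unstable answers but not errors the model makes consistently, which is why the text above prefers a deterministic check or a separate reviewer where one is available.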
AI code review, and non-determinism
AI review bots — CodeRabbit, Qodo, GitLab Duo and similar — are a useful first-pass triage that cuts human reviewer load, but they do not replace human judgment on architecture, concurrency, and cross-system impact, and they are themselves a prompt-injection surface. To test a system that includes a model, separate the deterministic layer (everything around the model call — unit-test it normally) from the model call itself (stub or replay it for fast CI; use probabilistic assertions and semantic-similarity scoring for integration runs). Ask whether output falls within an acceptable distribution, not whether it equals an exact string.
Layer 5 — Catch it in production
No pre-merge process catches everything — the real distribution of inputs only exists in production. The goal of this layer is to make production observable and releases reversible, so the bugs that escape are found in minutes and contained to a fraction of users.
Observability
OpenTelemetry is the vendor-neutral standard for traces, metrics, and logs. Its GenAI semantic conventions now standardize telemetry for model calls — model, token counts, latency, cost — so AI features are observable on the same footing as everything else. Instrument every model call and every agent step.
Progressive delivery
Feature flags plus canary and ring releases bound the blast radius of bad code, AI-written or not: release to 1% of traffic, watch the metrics, then expand. Argo Rollouts and Flagger automate metric-gated canaries; LaunchDarkly is a widely used commercial flag service; OpenFeature is the vendor-neutral flag API. Ship every AI-built feature behind a flag; gate each promotion on metrics, not on the clock.
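"Gate on metrics, not on the clock" reduces to a decision function over the canary's and the baseline's observed metrics. The sketch below is one possible shape for that decision, using error rate only; the thresholds, the `Metrics` type, and `canary_decision` are hypothetical and are not how Argo Rollouts or Flagger express their analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Metrics,
    canary: Metrics,
    min_requests: int = 500,     # assumed: do not judge on too little traffic
    max_ratio: float = 1.5,      # assumed: allowed multiple of the baseline error rate
    tolerance: float = 0.001,    # assumed: absolute slack so tiny baselines are not punitive
) -> str:
    """Return 'wait', 'promote', or 'rollback' from observed metrics alone."""
    if canary.requests < min_requests:
        return "wait"
    limit = baseline.error_rate * max_ratio + tolerance
    return "rollback" if canary.error_rate > limit else "promote"


baseline = Metrics(requests=10_000, errors=50)
print(canary_decision(baseline, Metrics(requests=100, errors=0)))
print(canary_decision(baseline, Metrics(requests=1_000, errors=6)))
print(canary_decision(baseline, Metrics(requests=1_000, errors=30)))
```

Elapsed time never appears in the function. A canary that has not yet seen enough traffic stays at its current percentage no matter how long it has been running.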
Chaos engineering
Deliberately inject failures to prove the system is as resilient as designed: Chaos Mesh, Gremlin, AWS Fault Injection Service. For AI features specifically, inject model timeouts and error responses to confirm the fallback paths actually work.
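For the model-specific case, fault injection can start in a unit test long before a chaos platform is involved. The sketch below wraps a stand-in model so that chosen calls raise `TimeoutError`, then checks that the caller degrades to its fallback. `answer_with_fallback`, `make_faulty`, and `healthy_model` are hypothetical names for this example.

```python
from typing import Callable


def answer_with_fallback(model: Callable[[str], str], prompt: str, fallback: str) -> str:
    """Return the model's answer, or `fallback` if the call times out or disconnects."""
    try:
        return model(prompt)
    except (TimeoutError, ConnectionError):
        return fallback


def make_faulty(model: Callable[[str], str], fail_on_calls: set[int]):
    """Fault injector: wrap `model` so the listed call numbers raise TimeoutError."""
    calls = {"n": 0}

    def wrapped(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise TimeoutError("injected model timeout")
        return model(prompt)

    return wrapped


def healthy_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"summary of: {prompt}"


faulty = make_faulty(healthy_model, fail_on_calls={2})
answers = [answer_with_fallback(faulty, "doc", "unavailable") for _ in range(3)]
print(answers)
```

The same wrapper pattern extends to injected error responses or slow replies; what matters is that the fallback path is exercised on purpose rather than discovered during an outage.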
SLOs and error budgets
Define reliability quantitatively — a service-level objective and the error budget it implies. A healthy budget means ship freely; a budget burning hot means freeze and investigate. The error budget is the rate limiter for AI-accelerated delivery: it converts "we are shipping faster" into a measured, bounded decision.
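The arithmetic behind an error budget is short enough to write out. Assuming a 30-day window and an availability SLO, the sketch below computes the budget in minutes and the burn rate (how fast the budget is being consumed relative to plan). The function names and the `hot_burn_rate` threshold of 2.0 are illustrative assumptions, not a standard.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: (1 - SLO) times the length of the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """1.0 means the budget lasts exactly the window; above 1.0 it runs out early."""
    return observed_error_rate / (1.0 - slo)


def release_policy(budget_remaining: float, rate: float, hot_burn_rate: float = 2.0) -> str:
    """Map budget state to the ship / freeze decision described in the text."""
    if budget_remaining <= 0 or rate >= hot_burn_rate:
        return "freeze"
    return "ship"


print(round(error_budget_minutes(0.999), 1))   # 99.9% over 30 days
print(round(burn_rate(0.005, 0.999), 1))       # 0.5% of requests failing
print(release_policy(budget_remaining=0.4, rate=burn_rate(0.005, 0.999)))
```

A 99.9% SLO over 30 days allows about 43.2 minutes of unavailability, and an observed error rate of 0.5% burns that budget five times faster than planned, so the policy above freezes releases even though 40% of the budget remains.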
AIOps
Anomaly detection, AI incident summarization, and automated rollback — Datadog Watchdog, Dynatrace Davis AI — cut detection and triage time. Automate remediation only for well-understood, reversible failures; require a human decision on severe incidents.
A recommended verification pipeline
The five layers assemble into one staged pipeline. Each stage is a gate — cheap and fast first, broad and slow last. The point is not to run every tool on every change; it is that nothing reaches users without passing the gates its risk level demands.
| Stage | What it gates | Representative tooling |
|---|---|---|
| Author | A reviewed spec exists in the repo; requirements are unambiguous | Spec Kit, Kiro, EARS, AGENTS.md |
| Pre-commit | Formatting, linting, type-checking, secret scanning | Language toolchain, type checker |
| Pull request | Static + taint analysis; AI review as triage; human review of intent | Semgrep, CodeQL, review bot |
| Test — fast | Unit + property tests pass; mutation score above threshold; contracts hold | Hypothesis / fast-check, Stryker / PIT, Pact |
| Test — deep | Fuzzing; differential vs. prior version; simulation for distributed systems; evals for AI features | OSS-Fuzz, Antithesis / madsim, Promptfoo |
| Release | Behind a flag; canary promoted on metrics, not time | OpenFeature, Argo Rollouts |
| Production | Tracing in place; SLO burn-rate alerts; chaos drills | OpenTelemetry, Datadog / Dynatrace, Chaos Mesh |
Eight principles
- Specify before you generate. A prompt typed once and discarded is not a specification.
- Push correctness into types and contracts so tests have less to catch and the agent inherits the invariants.
- Test properties and behavior, not line coverage. Gate on mutation score; stop using coverage percentage as a merge gate.
- Make every test fail before it passes, especially AI-written tests.
- Never let the agent that wrote the code be the only thing that approves it. Build in an independent verifier.
- Treat evals as the spec and the test for anything that calls a model at runtime.
- Assume some defects reach production. Make releases reversible and systems observable.
- Invest where the constraint is. Verification capacity, not generation speed, now sets real throughput.
Domain deep-dive
How the five layers play out in one demanding domain — a worked example: