TDD + SDD for AI-Assisted Infrastructure Automation

Last reviewed: June 2026
Part 8 of 8 · Using AI in infrastructure work — the full series is listed at the end.

Part 7 ended on the question that decides whether coding agents are an asset or a liability: how do you verify delegated work rigorously? This final part answers it with the two practices that make AI-assisted automation trustworthy. Test-driven development (TDD) checks whether the implementation behaves correctly. Spec-driven development (SDD) checks whether the team agreed on the right behaviour before implementation started. One proves the code; the other proves you're building the correct thing. With AI in the loop, you need both — and you need them more than you did when humans wrote every line.

Why AI raises the stakes on both

An AI coding agent can produce a script, module, or pull request that looks complete while missing a business rule, a safety control, a rollback requirement, or an edge case. It does this confidently, with clean formatting and a tidy summary — which is exactly what makes the gap hard to spot. Infrastructure automation is unusually sensitive to this, because a small miss can affect identity, access, networking, reporting, or production operations. A reporting script that silently includes disabled accounts is an annoyance; one that feeds a disable list and includes the wrong accounts is an incident.

The answer is not to avoid AI. It's to give the agent a bounded specification, a testable contract, and a review process that catches gaps before production. SDD writes the contract; TDD enforces it; the agent fills in the middle at speed. Speed is only valuable when the specification is clear and the tests prove the behaviour that actually matters.

The workflow I prefer

  1. Write the problem statement.
  2. Define scope and non-goals.
  3. Define acceptance criteria.
  4. List safety constraints.
  5. Create test cases before implementation.
  6. Generate or write the implementation.
  7. Run tests locally and in CI where available.
  8. Review the result against the specification.
  9. Document decisions and known limits.

Steps 1–4 are SDD: they exist whether or not AI writes the code, and they're where most of the real thinking happens. Step 5 is the pivot — tests written before the implementation become the contract the agent has to satisfy, instead of an afterthought it can quietly weaken.

A worked example: stale AD account reporting

Make it concrete with a task I'd genuinely hand to an agent — a module that reports stale Active Directory accounts. Here's the spec.

Specification:

  • Find enabled user accounts inactive for at least 90 days.
  • Exclude disabled users.
  • Exclude service accounts by agreed naming or OU rules.
  • Exclude break-glass accounts.
  • Search only inside the approved search base.
  • Output CSV and HTML.
  • Include timestamp, domain, search base, and filter summary.
  • Make no changes to Active Directory.
  • Fail safely if the domain query fails.

Non-goals:

  • Do not disable accounts.
  • Do not move accounts.
  • Do not send email automatically.
  • Do not store credentials in the script.

Notice how much of that spec is about what the module must not do. In infrastructure automation, the non-goals are often more important than the goals — they're the difference between a read-only report and a script that can take down authentication. An agent given only the happy path will build something that works in the demo and surprises you in production. An agent given the non-goals has the boundary drawn for it.
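One way to make that boundary concrete is to express the spec's inclusion rules as a single pure function that has no write path at all. What follows is a minimal sketch in Python, not the module itself. The `Account` and `Rules` types, the `svc-` prefix, the `Service Accounts` OU, the `bg-admin` name, and the `example.com` domain are all hypothetical stand-ins for whatever rules the team has actually agreed, and the comma-split on the distinguished name ignores escaped commas for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Account:
    # Hypothetical, already-fetched record; this sketch never talks to AD.
    sam: str
    dn: str
    enabled: bool
    last_logon: datetime


@dataclass(frozen=True)
class Rules:
    # Every default here is an assumption standing in for the agreed rules.
    search_base: str
    service_prefixes: tuple = ("svc-",)
    service_ous: tuple = ("ou=service accounts",)
    break_glass: frozenset = frozenset({"bg-admin"})
    stale_days: int = 90


def in_search_base(dn: str, base: str) -> bool:
    """True if the DN is the search base itself or sits underneath it."""
    dn, base = dn.lower(), base.lower()
    return dn == base or dn.endswith("," + base)


def is_stale(acct: Account, rules: Rules, now: datetime) -> bool:
    """Return True only if the account belongs in the report."""
    sam, dn = acct.sam.lower(), acct.dn.lower()
    if not in_search_base(dn, rules.search_base):
        return False  # outside the approved search base
    if not acct.enabled:
        return False  # disabled users are excluded
    if sam in rules.break_glass:
        return False  # break-glass accounts are excluded even if inactive
    if sam.startswith(rules.service_prefixes):
        return False  # service account by naming rule
    if any(ou in dn.split(",") for ou in rules.service_ous):
        return False  # service account by OU rule
    return now - acct.last_logon >= timedelta(days=rules.stale_days)


NOW = datetime(2026, 6, 1, tzinfo=timezone.utc)
RULES = Rules(search_base="OU=Staff,DC=example,DC=com")
alice = Account("alice", "CN=Alice,OU=Staff,DC=example,DC=com", True,
                NOW - timedelta(days=120))
print(is_stale(alice, RULES, NOW))  # prints True
```

Because the function only reads records and returns a boolean, the "make no changes" constraint holds by construction rather than by convention. The exclusions are checked before the inactivity window, so an excluded account can never reach the report just because it is old.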

The tests are the contract

Before any implementation exists, the behaviour gets pinned down as tests:

  Test                   Expected behaviour
  Inactive enabled user  User appears in the report.
  Disabled user          User is excluded.
  Service account        Account is excluded by rule.
  Break-glass account    Account is excluded even if inactive.
  Empty result           Report is generated with a clear zero-result summary.
  Domain query failure   Script exits with an error and no misleading report.

Each of these encodes a way the naive implementation goes wrong. The break-glass test matters because a break-glass account is, by design, inactive almost all the time: it's the textbook case an inactivity report sweeps up and someone disables, right before the emergency where they needed it. The domain-query-failure test matters because the dangerous failure isn't a crash; it's a script that catches the error, returns an empty result, and lets someone conclude there are zero stale accounts when the query never actually ran to completion. Those two tests are worth more than the rest combined, because they encode hard-won operational knowledge an agent simply doesn't have.
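The two tests singled out above, plus the empty-result case they are easily confused with, might look like this as executable checks. This is a sketch under stated assumptions: `DomainQueryError`, `generate_report`, and the fake `fetch` callables are hypothetical names invented for illustration, and the small `generate_report` exists only so the tests have something to run against.

```python
class DomainQueryError(Exception):
    """Hypothetical error raised when the directory query itself fails."""


BREAK_GLASS = {"bg-admin"}  # assumed break-glass account name


def generate_report(fetch, break_glass):
    """Stand-in implementation: returns report data and writes nothing.

    `fetch` is any callable returning a list of dicts for accounts that
    are already known to be enabled and inactive. Errors from `fetch`
    are deliberately not caught, so a failed query can never be
    mistaken for an empty result.
    """
    accounts = fetch()
    rows = [a for a in accounts if a["sam"].lower() not in break_glass]
    return {"rows": rows, "summary": f"{len(rows)} stale account(s) found"}


def test_break_glass_excluded_even_if_inactive():
    fetch = lambda: [{"sam": "bg-admin"}, {"sam": "alice"}]
    report = generate_report(fetch, BREAK_GLASS)
    assert [r["sam"] for r in report["rows"]] == ["alice"]


def test_empty_result_still_produces_a_report():
    report = generate_report(lambda: [], BREAK_GLASS)
    assert report["rows"] == []
    assert report["summary"] == "0 stale account(s) found"


def test_query_failure_is_not_an_empty_report():
    def failing_fetch():
        raise DomainQueryError("domain controller unreachable")

    try:
        generate_report(failing_fetch, BREAK_GLASS)
    except DomainQueryError:
        return  # expected: the error propagates and no report exists
    raise AssertionError("a failed query must not produce a report")
```

The functions follow the `test_` naming that pytest collects, but they use only plain `assert`, so they can also be called directly. The last test is the one that separates "zero stale accounts" from "the query failed": an implementation that swallows the exception and returns an empty list fails it immediately.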

What the agent gets wrong, and the tests catch

In practice the gaps an agent leaves on a task like this are predictable, which is the whole point of writing the tests first. It will usually nail "find inactive enabled users" — that's the obvious requirement. The misses cluster on the edges the spec called out: it forgets that break-glass accounts must survive the filter even when inactive; it treats an empty result and a failed query as the same thing; it includes service accounts because it inferred a naming rule instead of using the agreed one. With the tests written up front, none of those reach review as a surprise — they show up as red tests, which is exactly where you want a gap to appear. Without the tests, they show up in production, framed as "the report looked fine."

The prompt pattern for an agent

When I hand this to a coding agent, the prompt enforces the workflow rather than asking for code:

Implement the reporting module from this specification.
Do not add behaviour outside the specification.
Write tests first.
The module must not modify Active Directory.
If a requirement is ambiguous, ask a question before implementing.
After implementation, summarise how each acceptance criterion is covered by tests.

The last two lines do the heavy lifting. "Ask a question before implementing" gives the agent permission to surface ambiguity instead of guessing — the behaviour the whole series has been arguing for. "Summarise how each acceptance criterion is covered by tests" forces it to map its work back to the contract, which is also my review checklist handed back to me pre-filled.
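The coverage summary requested in the final line of the prompt can also be checked mechanically before a human reads it. A minimal sketch, assuming the agent hands its summary back as a mapping from acceptance criterion to test names; the criterion IDs and test names below are invented for illustration.

```python
def uncovered_criteria(criteria, coverage):
    """Return the criteria that no test claims to cover, in spec order."""
    return [c for c in criteria if not coverage.get(c)]


# Hypothetical criterion IDs, listed in the order the spec gives them.
CRITERIA = [
    "inactive-enabled-included",
    "disabled-excluded",
    "service-excluded",
    "break-glass-excluded",
    "empty-result-summary",
    "query-failure-errors",
]

# Hypothetical summary as an agent might hand it back.
AGENT_SUMMARY = {
    "inactive-enabled-included": ["test_inactive_enabled_user_is_reported"],
    "disabled-excluded": ["test_disabled_user_is_excluded"],
    "service-excluded": ["test_service_account_is_excluded"],
    "empty-result-summary": ["test_empty_result_has_zero_summary"],
    "query-failure-errors": [],  # criterion listed, but no test named
}

print(uncovered_criteria(CRITERIA, AGENT_SUMMARY))
# prints ['break-glass-excluded', 'query-failure-errors']
```

This only confirms that some test is claimed for each criterion. Whether the named test actually proves the behaviour is still a review judgement.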

I hold my own projects to this

This isn't theory I apply only to client work. The tools I build for myself carry real test suites and CI as a matter of course — the identity reporting platform I built, the infrastructure-as-code tooling, the AD hardening lab. Partly that's professional habit; increasingly it's because tests are how I collaborate with coding agents at all. A clear failing test gives the model a concrete target. A good regression test protects the behaviour after the next change. Layered verification — static checks, targeted unit tests, integration tests, then a smoke check of the real workflow — turns "tests passed" from a vague claim into reviewable evidence. The agent should report exactly which layers it ran, and say why if it skipped one.
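The reporting rule in that last sentence can be made checkable. The sketch below assumes the agent returns one record per verification layer; the `LayerResult` type, the layer names, and the status strings are illustrative choices, not a standard format.

```python
from dataclasses import dataclass

# Assumed layer names, in the order described above.
LAYERS = ("static", "unit", "integration", "smoke")


@dataclass(frozen=True)
class LayerResult:
    layer: str
    status: str       # "passed", "failed", or "skipped"
    reason: str = ""  # expected whenever status is "skipped"


def review_problems(results):
    """Return the reasons a verification report is not yet acceptable."""
    problems = []
    by_layer = {r.layer: r for r in results}
    for layer in LAYERS:
        r = by_layer.get(layer)
        if r is None:
            problems.append(f"{layer}: not reported")
        elif r.status == "skipped":
            if not r.reason.strip():
                problems.append(f"{layer}: skipped without a reason")
        elif r.status != "passed":
            problems.append(f"{layer}: {r.status}")
    return problems


report = [
    LayerResult("static", "passed"),
    LayerResult("unit", "passed"),
    LayerResult("integration", "skipped"),
]
print(review_problems(report))
# prints ['integration: skipped without a reason', 'smoke: not reported']
```

A skipped layer with a stated reason passes this check on purpose: skipping can be legitimate, for example when no test domain is available, but it has to be visible to the reviewer.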

Review and approval boundaries

The specification gets reviewed before implementation starts — that's the cheapest place to catch a wrong assumption. Tests are part of the definition of done, not a follow-up ticket. And AI output is reviewed against the spec, not merely against whether the code runs. For infrastructure automation I also want a human reviewer checking the things tests don't fully capture: permissions, logging, failure behaviour, data exposure, and operational handover. A module can pass every test and still log a bind credential or run with more privilege than it needs.

Practical checklist

  • Does the spec define what the automation must not do?
  • Are the safety constraints explicit?
  • Are tests written for the edge cases and the failure paths?
  • Can the code run in report-only mode where appropriate?
  • Is the output suitable for operations or audit review?
  • Are secrets excluded by design, not by luck?
  • Is human approval required before any production-impacting action?

The point — and where the series lands

Use SDD to agree on the behaviour and TDD to prove it. Together they give an AI agent a clear contract and give the team a concrete review standard, which is what makes AI-assisted infrastructure automation genuinely useful rather than merely fast.

That's also where these eight parts come to rest. The whole series has made one argument from different angles: AI is worth using across infrastructure work — drafting, retrieving, reasoning, acting, and coding — but every one of those uses is safe only inside the controls we already know. Verification (Part 1). Bounded prompts (Parts 2–3). Governed retrieval and an identity-aware control plane (Parts 4–5). Workflow guardrails (Part 6). Reviewed delegation (Part 7). And a tested, specified contract (Part 8). The model keeps getting better. The discipline around it is still the job.


The full series

  1. How I Use AI Safely in Infrastructure Workflows
  2. Prompt Engineering for IT Infrastructure Consultants
  3. Advanced Prompting Patterns for Infrastructure Planning
  4. RAG for Enterprise Infrastructure Knowledge
  5. Context Engineering for Enterprise AI Systems
  6. Workflow Engineering for Agentic AI in Infrastructure Operations
  7. AI Coding Agents in Infrastructure Automation
  8. TDD + SDD for AI-Assisted Infrastructure Automation (you're reading this)