Workflow Engineering for Agentic AI in Infrastructure Operations

Last reviewed: June 2026
Part 6 of 8 · Using AI in infrastructure work — the full series is listed at the end.

Up to here the series has kept the model on the answering side of the line: prompt it, ground it, govern what it can read. This part crosses the line. An agentic system doesn't just answer. It takes actions: calls tools, queries systems, updates tickets, maybe changes state. That shift is where most of the engineering lives, and most of the risk too.

The single most useful reframing I can offer: agentic AI is workflow engineering, not model autonomy. The model is one component. The real system is permissions, tools, state, approvals, logging, retries, rollback, and escalation wrapped around it. I learned this building orchestration rather than reading about it — wiring model steps into Windmill-based workflows in a project I call DevGodzilla, and building TerraformManager around secure Terraform blueprints, drift, and policy. In both, the "AI" was a small box in a much larger diagram, and the diagram was the product. The goal isn't an agent that's more autonomous. It's a workflow that's bounded enough to be useful and controlled enough to be trusted.

Why infrastructure raises the stakes

Infrastructure operations means privileged systems, identity changes, firewall rules, production incidents, and business-facing outages. In that environment a poorly designed agent doesn't just give a wrong answer — it takes a wrong action, faster than a human would have. "Move fast" is a fine motto for a prototype and a terrible one for something with write access to Active Directory.

A well-designed agent, on the other hand, can collect evidence, draft summaries, find missing steps, prepare reviews, and cut the manual coordination that eats an operations team's day. The entire difference between those two outcomes is workflow design — what the agent is allowed to touch, when it has to stop, and what happens when a step fails.

Three workflows I'd actually build

The pattern across all three: the agent does the gathering and drafting; a human owns anything that changes state.

Incident triage assistant

Step | Allowed behaviour
Read alert | Parse alert, affected service, severity, time, owner.
Collect evidence | Query approved logs and monitoring dashboards.
Check runbook | Retrieve the current runbook and known issues.
Summarise | Separate facts, assumptions, and missing evidence.
Suggest next steps | Recommend checks, not uncontrolled remediation.
Escalate | Route to the on-call engineer or owner.

The boundary is hard: this assistant collects and summarises. It does not restart services, disable accounts, change firewall rules, or close incidents without approval. It compresses the first frantic twenty minutes of an incident — gathering context — without ever touching the levers.

Change-request assistant

It reads technical notes, drafts the CAB summary, identifies missing rollback and validation steps, lists affected systems, and routes to the owner for approval. The control that makes it safe is simple: it does not approve the change. It prepares better material for the people who own the risk — the same Part 2 idea, now running as a workflow instead of a single prompt.

Script review workflow

For PowerShell and infrastructure automation, an agent reviews code for risky operations, missing -WhatIf, missing SupportsShouldProcess, weak logging, secret exposure, missing rollback, and thin tests, then produces a reviewer checklist and test ideas. What it does not do is run privileged automation against production. Review is safe; execution is gated.
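
A minimal sketch of the static half of that review, in Python, assuming the script arrives as plain text and is never executed. The verb list, the regexes, and the name review_script are illustrative assumptions, not a complete rule set; a real review would more likely lean on a proper parser or a tool such as PSScriptAnalyzer.

```python
import re

# Hypothetical rule set: cmdlet verbs treated as state-changing for review purposes.
RISKY_VERBS = ("Remove", "Set", "Stop", "Restart", "Disable", "Clear")
RISKY_CALL = re.compile(r"\b(?:%s)-[A-Za-z]+" % "|".join(RISKY_VERBS))
HARDCODED_SECRET = re.compile(
    r"(?i)\b(?:password|secret|apikey)\s*=\s*['\"][^'\"]+['\"]"
)


def review_script(source: str) -> list[str]:
    """Return reviewer checklist items for a PowerShell script given as text."""
    findings = []
    risky = sorted(set(RISKY_CALL.findall(source)))
    if risky and "SupportsShouldProcess" not in source:
        findings.append(
            "State-changing cmdlets without SupportsShouldProcess: " + ", ".join(risky)
        )
    if HARDCODED_SECRET.search(source):
        findings.append("Possible hard-coded secret")
    if risky and not re.search(r"\b(?:try|catch)\b", source, re.IGNORECASE):
        findings.append("No try/catch around state-changing calls")
    return findings
```

The function only reads text and returns findings for a human reviewer. Nothing in it can run the script, which is the point of keeping review and execution apart.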

What "production-ready" actually requires

A demo agent needs a model and a tool. A production agent needs all of this, and the gap between the two lists is where most agentic projects quietly fail:

  • A clear workflow boundary.
  • Identity-aware permissions (the Part 5 control plane, now governing actions as well as reads).
  • Tool contracts with strict inputs and outputs.
  • Human approval gates for risky actions.
  • Audit logging for prompts, retrieval, tool calls, and outputs.
  • Retry and timeout handling.
  • Error handling that fails closed, not open.
  • A rollback path for any state-changing action.
  • Observability and alerting.
  • An escalation path to a named owner.

"Fails closed" deserves emphasis. When a step errors — a tool times out, a permission check is ambiguous, a response doesn't parse — the safe default is to stop and escalate, not to guess and proceed. A human who hits uncertainty slows down. An agent has to be designed to do the same, because its default is to keep going.

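As a sketch of what failing closed can look like in code, assuming each tool is a plain callable that returns a JSON string. The names run_step and Escalation are hypothetical; the idea is only that every error path ends in a stop, never in a guess.

```python
import json


class Escalation(Exception):
    """Raised when the workflow must stop and hand off to a person."""


def run_step(name, tool, parse=json.loads):
    """Run one tool call; any error or unparseable reply stops the workflow."""
    try:
        raw = tool()
        return parse(raw)
    except Exception as exc:  # timeout, permission error, bad JSON: all treated alike
        raise Escalation(f"step '{name}' failed closed: {exc!r}") from exc
```

There is deliberately no fallback value and no retry inside the function. Retries, if wanted, belong in an outer loop with its own limit, so that the last failure still surfaces as an escalation.
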
Tool contracts are where safety is enforced

The moment an agent can call a tool, it stops being language generation and becomes a small automation system — and that system needs the same discipline as any other automation with credentials. Each tool the agent can reach should have a narrow, explicit contract: what it accepts, what it returns, what it's allowed to touch, and whether it needs approval. Read-only tools (query logs, read a runbook, search the CMDB) can be broadly available. Write tools (update a ticket, change a config, run a deployment) need scoping and, for anything production-impacting, an explicit human gate. This is just least privilege — the principle every infrastructure engineer already applies to service accounts — pointed at a new kind of caller.
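
One way to make such a contract concrete, sketched in Python. ToolContract, call_tool, and the field names are assumptions for illustration; the sketch shows only the shape: declared parameters, a write flag, and an approval gate that is checked before the handler runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolContract:
    name: str
    params: dict            # parameter name -> required Python type
    writes: bool            # True if the tool changes state
    needs_approval: bool    # True if a human must approve before execution
    handler: Callable[..., object]


class ContractViolation(Exception):
    pass


def call_tool(contract: ToolContract, args: dict, approved_by: Optional[str] = None):
    """Validate args against the contract, enforce the gate, then call the handler."""
    if set(args) != set(contract.params):
        raise ContractViolation(f"{contract.name}: unexpected or missing parameters")
    for key, expected in contract.params.items():
        if not isinstance(args[key], expected):
            raise ContractViolation(f"{contract.name}: {key} must be {expected.__name__}")
    if contract.needs_approval and not approved_by:
        raise ContractViolation(f"{contract.name}: human approval required")
    return contract.handler(**args)
```

The model never calls a handler directly in this shape. It proposes a tool name and arguments, and call_tool decides whether the call is even well-formed before anything touches a system.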

The decisions behind an agentic workflow

Decision | The question to answer
Tool access | Which APIs can the agent call, and under whose authority?
State | What does the workflow remember between steps?
Approval | Which actions require explicit human approval?
Rollback | What happens if a state-changing action only partly succeeds?
Logging | Can an engineer reconstruct exactly what happened, later?
Escalation | When does the workflow stop and hand off to a person?

That partial-success row is the one teams skip and regret. A human operator interrupted halfway through a change knows to check what already happened. An agent has to be explicitly designed to handle "step 3 of 5 succeeded, step 4 failed" — or it will happily leave the system in a state no one planned for.
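
A small sketch of handling partial success, assuming every state-changing step is registered together with an undo action. The function name and the (name, do, undo) tuple shape are hypothetical. On the first failure, completed steps are undone in reverse order and every outcome is written to a log.

```python
def run_with_rollback(steps):
    """steps: list of (name, do, undo) tuples. Returns (ok, log)."""
    log, done = [], []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            # Undo completed steps, most recent first.
            for prev_name, prev_undo in reversed(done):
                try:
                    prev_undo()
                    log.append(f"ROLLED BACK {prev_name}")
                except Exception as undo_exc:
                    log.append(f"ROLLBACK FAILED {prev_name}: {undo_exc}")
            return False, log
        done.append((name, undo))
        log.append(f"OK {name}")
    return True, log
```

A failed rollback is logged rather than hidden. That entry is exactly what the escalation to a human needs to contain, because it describes a state nobody planned for.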

A production workflow shape

The pattern I trust looks more like a change-control workflow than a chat session:

Stage | What happens | Control
Intake | User asks for analysis, triage, or a prepared change. | Classify intent and risk before any tool runs.
Context build | Workflow retrieves tickets, runbooks, logs, configuration, and policy. | Retrieval is permission-trimmed and source-limited.
Plan | Agent proposes actions, checks, or a change package. | Output separates facts, assumptions, and recommendations.
Review | Human owner approves, rejects, or edits the proposed action. | Approval is recorded with identity and timestamp.
Execute | Approved tool runs with bounded parameters. | Tool identity has least privilege and strict input schema.
Verify | Workflow checks the expected outcome. | Verification is a separate step, not the tool's promise.
Close or escalate | Workflow updates ticket or hands off with evidence. | Audit record includes prompts, sources, tool calls, and result.

The important part is that planning and execution are different states. A model may draft a firewall-change request. It should not quietly become the firewall administrator because the next sentence sounded confident. State transitions need gates.
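
The stages can be sketched as an explicit state machine. This is an illustration under assumptions, not the layout of any particular orchestrator: the Stage names mirror the stages described above, while the Workflow class and its methods are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    INTAKE = "intake"
    CONTEXT = "context"
    PLAN = "plan"
    REVIEW = "review"
    EXECUTE = "execute"
    VERIFY = "verify"
    CLOSED = "closed"
    ESCALATED = "escalated"


# Every stage may escalate; otherwise only the next stage is reachable.
ALLOWED = {
    Stage.INTAKE: {Stage.CONTEXT, Stage.ESCALATED},
    Stage.CONTEXT: {Stage.PLAN, Stage.ESCALATED},
    Stage.PLAN: {Stage.REVIEW, Stage.ESCALATED},
    Stage.REVIEW: {Stage.EXECUTE, Stage.PLAN, Stage.ESCALATED},
    Stage.EXECUTE: {Stage.VERIFY, Stage.ESCALATED},
    Stage.VERIFY: {Stage.CLOSED, Stage.ESCALATED},
}


class Workflow:
    def __init__(self):
        self.stage = Stage.INTAKE
        self.approval = None  # (approver, timestamp) once a human approves

    def approve(self, approver, timestamp):
        if self.stage is not Stage.REVIEW:
            raise RuntimeError("approval is only valid during review")
        self.approval = (approver, timestamp)

    def advance(self, target):
        if target not in ALLOWED.get(self.stage, set()):
            raise RuntimeError(f"{self.stage.value} -> {target.value} not allowed")
        if target is Stage.EXECUTE and self.approval is None:
            raise RuntimeError("execute requires a recorded approval")
        if target is Stage.PLAN:
            self.approval = None  # a revised plan needs a fresh approval
        self.stage = target
```

Because there is no edge from plan to execute, a confident-sounding plan has no path to a tool call that does not pass through a recorded human approval.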

Failure modes I design for

Agentic systems fail in ways that normal automation does not. A script usually fails because an API call errors, a parameter is wrong, or a dependency is unavailable. An agent can fail because the plan is plausible but incomplete, the retrieved context is stale, the tool call is technically valid but operationally wrong, or the human approval prompt hides the real risk.

The failure modes I explicitly design for:

  • Context drift: the workflow retrieves an old runbook or stale architecture diagram.
  • Permission flattening: the agent sees more than the requesting user should see.
  • Tool overreach: a general-purpose tool can touch too many systems or too broad a scope.
  • Approval fatigue: humans approve generated plans without reading the risk.
  • Partial execution: one state-changing step succeeds and the next fails.
  • Silent verification failure: the workflow updates a ticket as complete without proving the system state changed.
  • Ownership ambiguity: nobody owns the agent's runbooks, credentials, incidents, or audit review.

For each of those, the fix is not a better prompt. It is an engineering control: source ranking, pre-retrieval authorisation, narrower tools, approval summaries with risk callouts, transaction logs, independent verification, and named owners.
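
Independent verification is the easiest of those controls to show in a few lines. A minimal sketch, assuming the workflow has some way to read the real system state that is separate from the tool that made the change; close_or_escalate and its parameters are hypothetical names.

```python
def close_or_escalate(tool_reported_success, read_actual_state, expected_state):
    """Decide the ticket outcome from observed state, not only the tool's report."""
    actual = read_actual_state()  # independent read, e.g. a fresh query of the system
    if tool_reported_success and actual == expected_state:
        return ("close", f"verified state: {actual}")
    return (
        "escalate",
        f"tool_reported_success={tool_reported_success}, "
        f"observed {actual!r}, expected {expected_state!r}",
    )
```

The ticket closes only when the tool's report and the observed state agree. Any disagreement, in either direction, goes to a person with both values attached.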

Tool design: narrow beats clever

The safest tools for agents are boring tools with narrow verbs. GetRecentFailedLogons is safer than RunPowerShell. CreateDraftChangeRequest is safer than UpdateTicket. QueryApprovedRunbook is safer than SearchAllDocuments. The narrower the tool, the less judgment the model needs to exercise at the boundary.

For infrastructure operations I would group tools like this:

Tool class | Examples | Default access
Read-only evidence | query logs, get ticket, read runbook, list recent alerts | Allowed when user has source access.
Drafting | create CAB draft, summarise incident, propose checklist | Allowed with clear labelling as draft.
Low-impact write | add ticket comment, attach evidence, assign review task | Allowed only inside approved workflow.
Production-impacting write | restart service, change group membership, modify firewall, run Terraform | Human approval plus scoped execution identity.

The execution identity matters. An agent should not inherit a human administrator's broad session just because the user is privileged. It should use a purpose-built identity with fewer permissions than the human, narrow API scopes, and audit logs that clearly identify machine action.
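
The default-access rules for the four tool classes can be written down as a small policy function. This is a sketch under assumptions: CallContext, its fields, and the svc-agent- identity prefix are invented for illustration, and a real deployment would take these facts from its identity and workflow systems.

```python
from dataclasses import dataclass
from enum import Enum


class ToolClass(Enum):
    READ_ONLY = "read-only evidence"
    DRAFTING = "drafting"
    LOW_IMPACT_WRITE = "low-impact write"
    PRODUCTION_WRITE = "production-impacting write"


@dataclass(frozen=True)
class CallContext:
    user_has_source_access: bool = False
    output_labelled_draft: bool = False
    in_approved_workflow: bool = False
    human_approved: bool = False
    identity: str = ""  # name of the identity the tool would run as


# Hypothetical naming rule so audit logs can tell machine action from human action.
AGENT_IDENTITY_PREFIX = "svc-agent-"


def is_allowed(tool_class: ToolClass, ctx: CallContext) -> bool:
    if tool_class is ToolClass.READ_ONLY:
        return ctx.user_has_source_access
    if tool_class is ToolClass.DRAFTING:
        return ctx.output_labelled_draft
    if tool_class is ToolClass.LOW_IMPACT_WRITE:
        return ctx.in_approved_workflow
    if tool_class is ToolClass.PRODUCTION_WRITE:
        # Approval alone is not enough: the call must also run as the scoped identity.
        return ctx.human_approved and ctx.identity.startswith(AGENT_IDENTITY_PREFIX)
    return False  # unknown class: deny
```

Every field in CallContext defaults to False or empty, so a context that was built carelessly denies by default rather than allowing by accident.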

Observability and review

If an agentic workflow touches operations, it needs its own operational telemetry. I want dashboards and logs that answer:

  • Which workflows ran, by whom, for what ticket or incident?
  • Which sources were retrieved and which were used in the answer?
  • Which tools were called, with which parameters and results?
  • Which approvals were requested, granted, denied, or bypassed?
  • Which workflows failed closed and escalated?
  • Which outputs were later corrected by humans?

That last item is how the system improves. Without feedback loops, the team only sees the successful demos and the embarrassing failures. With feedback loops, recurring errors become backlog items: improve runbook quality, tighten a tool contract, add a missing validation step, or remove a source that keeps misleading the workflow.
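
Those questions are only answerable if each event is recorded in a consistent, queryable shape. A minimal sketch using one JSON line per event; the field names and event labels such as failed_closed are assumptions, not a standard schema.

```python
import json
from datetime import datetime, timezone


def audit_event(workflow_id, actor, ticket, event, detail, now=None):
    """Return one JSON line describing a workflow event."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "time": stamp,
        "workflow_id": workflow_id,
        "actor": actor,    # the requesting user or the machine identity
        "ticket": ticket,
        "event": event,    # e.g. "tool_call", "approval_granted", "failed_closed"
        "detail": detail,  # parameters, sources, or the human correction
    }
    return json.dumps(record, sort_keys=True)


def count_events(lines, event):
    """Count how many logged lines carry the given event label."""
    return sum(1 for line in lines if json.loads(line)["event"] == event)
```

With records like these, "which workflows failed closed" or "which outputs were corrected by humans" becomes a filter over the log rather than an investigation.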

Guardrails need owners, just like everything else

Agentic workflows need a RACI as much as any infrastructure project. Someone owns the process, someone owns the tools, someone approves risky actions, someone reviews incidents, someone maintains the runbooks the agent reads. If those roles are unclear, the agent becomes another unowned automation path — and an unowned automation path with tool access isn't a productivity gain, it's operational debt that compounds quietly until it causes an incident.

The point

Start where the risk is low and the value is real: triage, summarisation, review, checklist generation, CAB drafting. Add state-changing actions only after the permissions, approvals, logging, rollback, and support ownership are designed — not bolted on afterwards. The model is the easy part; the workflow around it is the engineering, and the engineering is what makes it safe to run in operations. Part 7 turns to the agentic workflow most engineers will meet first — coding agents, the ones that write and change the automation itself.


The full series

  1. How I Use AI Safely in Infrastructure Workflows
  2. Prompt Engineering for IT Infrastructure Consultants
  3. Advanced Prompting Patterns for Infrastructure Planning
  4. RAG for Enterprise Infrastructure Knowledge
  5. Context Engineering for Enterprise AI Systems
  6. Workflow Engineering for Agentic AI in Infrastructure Operations (you're reading this)
  7. AI Coding Agents in Infrastructure Automation
  8. TDD + SDD for AI-Assisted Infrastructure Automation