Context Engineering for Enterprise AI Systems
Last reviewed: June 2026
Part 5 of 8 · Using AI in infrastructure work — the full series is listed at the end.
Part 4 was about getting the right content to the model through retrieval. This part is about everything that has to wrap around that retrieval before an enterprise can trust the result. Prompt wording matters, and retrieval quality matters, but enterprise reliability is decided by something larger: what the system is allowed to fetch, what it's allowed to reveal, what it refuses, and how the whole interaction is audited. That envelope is what I mean by context engineering — the control plane around the model.
It helps to be blunt about the distinction. Context engineering is not "a bigger context window." A larger window lets you stuff more text in; it does nothing about whether that text is something this user is allowed to see, whether it's current, or whether the answer leaves a trail you can inspect later. After twenty years in identity and infrastructure — and recent work focused on Azure security — my instinct is that the model is the least interesting part of an enterprise AI system. The interesting, hard, career-relevant part is the controls.
Why this is an architecture problem, not a prompt problem
An AI system that answers from stale runbooks, old architecture diagrams, draft migration plans, or documents the user shouldn't be able to read can create real operational and compliance risk. The answer may be perfectly fluent. The failure is architectural: the system retrieved something it shouldn't have, or revealed something it shouldn't have, and no prompt tweak fixes that. You fix it the way you fix any access problem in infrastructure — with identity, permissions, policy, and logging, applied before the model ever sees the data.
For enterprise infrastructure use I care less about whether the demo is impressive and more about whether the system respects permissions, cites approved sources, detects uncertainty, and leaves an audit trail. Those are the same four properties I'd demand of any system that touches production data, because that's what this is.
A reference architecture for the control plane
The pipeline I design toward puts the controls around retrieval and generation, not as an afterthought:
| Step | Purpose |
|---|---|
| User request | Capture the task, intent, and business context. |
| Identity and permission check | Confirm who the user is and what sources they may access. |
| Policy check | Apply security, compliance, data-loss, and operational rules. |
| Approved-source retrieval | Retrieve only from governed repositories, tickets, runbooks, logs, docs. |
| Ranking by authority and freshness | Prefer current, approved, high-authority sources over informal or stale ones. |
| Answer with citations | Show where the key claims came from. |
| Uncertainty check | Separate facts, assumptions, missing information, and recommendations. |
| Human escalation | Route high-risk or low-confidence cases to the right owner. |
| Audit log | Record prompt, retrieval, tool actions, response, and the user's decision. |
The hard part: permission-trimming before retrieval
If there's one idea in this whole post that earns its keep, it's this: permissions have to be enforced before retrieval, not after generation. The naive design retrieves from everything and hopes the model won't reveal what it shouldn't. That is not a security control — it's a wish. The correct design trims the candidate set to what this specific user is authorised to see, and only then retrieves and answers.
This matters acutely in infrastructure, because the source systems already have rich permission models that the AI layer loves to flatten. A junior operator and a security lead should get materially different answers from the same assistant about privileged-access configuration, because they have materially different rights to that information. If your AI system collapses that distinction, you haven't built a helpful assistant — you've built a beautifully conversational data-exfiltration tool. Identity-aware retrieval is the line between the two.
A Microsoft-oriented example
Because so much of my work lives in the Microsoft stack, here's what this looks like concretely. A realistic design might combine Microsoft Entra ID for identity and access, SharePoint and Teams as controlled document sources, Azure AI Search for retrieval, Azure OpenAI or Microsoft Foundry for model orchestration, Microsoft Purview for compliance and data governance, Azure Monitor and Log Analytics for operational evidence, and systems like ServiceNow, Jira, or Confluence for ticket and project context.
But the integration is the easy half. The design question is never "can we connect the data?" — of course we can. The question is: which data is approved for which user, for which task, with which evidence and audit trail? Entra carries the identity, Purview carries the sensitivity labels and policy, Azure AI Search has to honour both during retrieval, and Monitor has to capture what happened. Wire those together correctly and the model is almost incidental. Skip them and no model is smart enough to save you.
Source quality, made operational
Part 4's authority matrix shows up here too, but in context engineering it's not just advice — it's a control the system enforces:
| Source | Use | Control concern |
|---|---|---|
| Approved architecture document | High-authority design context | Version and owner must be clear |
| ServiceNow incident | Operational evidence | May be case-specific and noisy |
| Log Analytics query result | Current telemetry | Needs time range and query context |
| Teams discussion | Background context | Informal and often not approved |
| Old wiki page | Legacy knowledge | High stale-content risk |
| Vendor documentation | Product behaviour and limits | Must be current and correctly scoped |
Evaluation is part of the control plane
Context engineering is not complete until retrieval quality is measured. A system can have perfect architecture diagrams and still answer poorly if it retrieves the wrong chunks, ranks stale sources too highly, or fails to retrieve the one document that contains the exception. In an enterprise setting, evaluation has to be operational, not academic.
I would build an evaluation set from real infrastructure questions:
| Question type | Example | What evaluation checks |
|---|---|---|
| Known-answer | Which runbook owns Citrix Cloud Connector replacement? | Correct source retrieved and cited. |
| Permission-sensitive | Show privileged access process for break-glass accounts. | User only sees sources they are allowed to read. |
| Freshness-sensitive | What is the current VPN cutover plan? | Current approved plan outranks old Teams discussion. |
| Unanswerable | Which app owner approved an undocumented exception? | System says evidence is missing instead of guessing. |
| Conflict | Runbook and ticket disagree. | Answer surfaces conflict and escalates. |
This is the same discipline as testing infrastructure automation. You do not trust the workflow because it worked once in a demo. You keep a regression set of questions and run it whenever sources, chunking, permissions, ranking, or model behavior changes.
Source lifecycle matters more than chunk size
Teams often jump straight to chunking strategy because it feels technical and solvable. Chunking matters, but source lifecycle usually matters more. A beautifully chunked wiki full of stale migration notes will still produce bad answers. A context system needs document governance:
- Every approved source has an owner.
- Important documents have review dates.
- Superseded designs are archived or clearly marked.
- Drafts are excluded unless the user explicitly asks for draft material.
- Sensitive documents carry labels or access controls the retrieval layer respects.
- Operational runbooks include last-tested dates.
- Tickets and incident records are treated as evidence, not universal truth.
In infrastructure terms, this is configuration management for knowledge. Nobody would accept a CMDB with no owners, no lifecycle, and no retirement process. The knowledge base feeding an AI assistant deserves the same scrutiny.
Threat model the context layer
The context layer creates its own threat model. The obvious risk is data leakage, but the more subtle risks are just as important:
| Threat | Practical example | Control |
|---|---|---|
| Cross-user disclosure | Assistant cites a security design to a user who cannot open the source. | Pre-retrieval permission trimming and post-answer citation access checks. |
| Prompt injection in documents | A retrieved page tells the model to ignore policy. | Treat retrieved text as untrusted data, not instructions. |
| Stale authority | Old migration plan outranks the approved current plan. | Ranking by source authority, status, and freshness. |
| Overconfident synthesis | Model combines two partial facts into a false conclusion. | Require uncertainty labeling and citation-level traceability. |
| Audit gap | No one can reconstruct which sources shaped an answer. | Log request, retrieval set, final citations, and user action. |
This is why I dislike designs that treat retrieval as a backend convenience. Retrieval is a security boundary, an evidence boundary, and a governance boundary. The model response is downstream of all three.
Citations, uncertainty, and audit are not optional extras
Three properties turn a chatbot into something an enterprise can stand behind:
- Citations — every operational claim shows its source, so a human can open it and verify. Without this, "trust me" is the system's only argument.
- Uncertainty — facts, assumptions, missing information, and recommendations are kept visibly separate, so nobody mistakes a guess for a finding.
- Audit — prompt, retrieved sources, tool actions, response, and the resulting decision are logged. When the system eventually gives a bad answer — and it will — the audit trail is how you find out why, who acted on it, and what to fix.
These are the same instincts behind change control and access review, applied to an AI surface.
The control plane needs owners
Context engineering creates a set of ownership questions that don't answer themselves. Someone has to own source quality, document lifecycle, permission design, exception handling, and the incident review when the AI gives a wrong answer. In a delivery team I'd assign explicit owners for approved sources, stale-content cleanup, retrieval evaluation, policy exceptions, audit review, and the escalation paths. Context quality decays fast and silently when no one owns it — the same way an AD environment accumulates stale groups and orphaned service accounts when nobody runs the reviews.
Implementation checklist
For an enterprise pilot, I would not start with every document and every user. I would start with one controlled domain, such as infrastructure change management or identity operations, and prove the controls:
- Define approved source repositories and exclude drafts by default.
- Map source permissions to the retrieval layer before indexing.
- Assign source owners and review dates.
- Create an evaluation set of real questions, including unanswerable and permission-sensitive cases.
- Require citations for operational claims.
- Log retrieval sources and answer metadata.
- Define escalation rules for low-confidence, conflicting, or high-impact answers.
- Review failures like incidents: source issue, permission issue, retrieval issue, model issue, or user-process issue.
The goal of the pilot is not to impress people with a chatbot. The goal is to prove that the assistant can answer useful questions without bypassing the enterprise controls that already exist for a reason.
References
- Microsoft Learn: Retrieval Augmented Generation in Azure AI Search
- Microsoft Learn: Azure AI Search relevance and scoring
- Microsoft Learn: Microsoft Purview sensitivity labels
- NIST AI Risk Management Framework
The point
Don't treat context engineering as prompt expansion. Treat it as enterprise control design: the system should know what it can use, what it must ignore, what it must cite, and when it must stop and ask a person. Get identity, permissions, policy, citations, and audit right and a fairly ordinary model becomes a system you can put in front of an infrastructure team. Get them wrong and the smartest model on the market is a liability. So far this series has kept the model on the answering side of the line. Part 6 crosses it — what changes when the system is allowed to take actions, and how you keep that safe.
The full series
- How I Use AI Safely in Infrastructure Workflows
- Prompt Engineering for IT Infrastructure Consultants
- Advanced Prompting Patterns for Infrastructure Planning
- RAG for Enterprise Infrastructure Knowledge
- Context Engineering for Enterprise AI Systems — (you're reading this)
- Workflow Engineering for Agentic AI in Infrastructure Operations
- AI Coding Agents in Infrastructure Automation
- TDD + SDD for AI-Assisted Infrastructure Automation