RAG for Enterprise Infrastructure Knowledge

Last reviewed: June 2026
Part 4 of 8 · Using AI in infrastructure work — the full series is listed at the end.

Parts 2 and 3 kept circling the same conclusion: a prompt is only as good as the context behind it. Retrieval-augmented generation — RAG — is how you give a model the right context at the right time instead of hoping it remembers. It's also where I've done most of my own building. I forked RAGFlow into a platform I call FlowForge and rebuilt it around vision-language models; I built a narrow-domain RAG assistant where a wrong answer is genuinely costly; I run document-extraction pipelines with Docling to turn messy PDFs into structured, retrievable text. So this part is less "here's the theory" and more "here's what actually breaks when you point RAG at real enterprise knowledge."

The headline lesson first: the demo version of RAG is easy, and the enterprise version is hard, and they look almost identical right up until the enterprise one quietly gives someone the wrong answer.

Why RAG fails quietly

A RAG system can retrieve the wrong document, a stale document, half of a procedure, or a convincing but low-authority source — and still produce a fluent, confident answer on top of it. That is the dangerous failure mode, because nothing about the output signals that the foundation was bad. A model that hallucinates from memory at least feels like a gamble. A model that hallucinates from a retrieved document feels authoritative, because it can cite something.

For infrastructure work I want a RAG system to do four things a demo never bothers with: show its evidence, admit when it doesn't have enough, refuse unsafe or out-of-scope requests, and respect the limits of what a given user is allowed to see. A grounded answer with no visible source quality is not actually grounded; it's just decorated.

The unglamorous 80%: parsing and chunking

Everyone wants to talk about embeddings and rerankers. In my experience, almost all of the quality is decided earlier, in how you get text out of your documents and how you split it up. That is the lesson building FlowForge and the Docling pipeline beat into me.

Infrastructure knowledge lives in formats that are hostile to naive extraction: runbooks as PDFs with screenshots, architecture documents where the diagram is the content, Visio exports, spreadsheets of firewall rules, wiki pages with nested tables. Dump those through a basic text extractor and you get soup — headers fused to body text, tables flattened into unreadable rows, the diagram (the actual answer) thrown away entirely. That's why I lean on structure-aware extraction (Docling for the documents, vision models for the pages where the picture carries the meaning). Garbage extraction guarantees garbage retrieval, no matter how good the embeddings are.

Chunking is the second half of the same problem. Split a runbook mid-procedure and the system can retrieve step 3 without steps 1, 2, and 4 — and confidently hand someone half a process for a production change. I chunk by meaning: whole procedures, complete sections, ownership boundaries — not by a fixed token count that happens to fall in the middle of a sentence. The boring work of parsing and chunking is where enterprise RAG is won or lost.
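
A minimal sketch of chunking by structure rather than by token count. It assumes the runbook has already been extracted to text with markdown-style `#` headings, which is an assumption about the extraction output; `Chunk` and `chunk_by_section` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str
    text: str


# Assumed convention: sections start with one to six '#' characters.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_section(document: str) -> list[Chunk]:
    """Split on headings so each procedure stays in a single chunk."""
    chunks: list[Chunk] = []
    title, lines = "(preamble)", []
    for line in document.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if any(part.strip() for part in lines):
                chunks.append(Chunk(title, "\n".join(lines).strip()))
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(part.strip() for part in lines):
        chunks.append(Chunk(title, "\n".join(lines).strip()))
    return chunks


runbook = """# Rotate service account password
1. Open a change record.
2. Disable the scheduled task.
3. Reset the password.
4. Re-enable the task and verify.

# Rollback
1. Restore the previous credential from the vault.
"""

chunks = chunk_by_section(runbook)
for chunk in chunks:
    print(chunk.title, "->", len(chunk.text.splitlines()), "steps")
```

All four steps of the rotation procedure land in one chunk, so retrieval cannot return step 3 on its own. A very long section would still need a secondary split, ideally at step boundaries rather than at a token limit.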

Source authority is a first-class property

Not all sources deserve equal trust, and the system needs to know the difference before it ranks anything. This is the matrix I work from:

Source                         | Authority  | Risk                                | Use in RAG?
Approved architecture document | High       | May be outdated                     | Yes, with version and owner check
Current runbook                | High       | May not cover exceptions            | Yes
ServiceNow incident            | Medium     | Context-specific and noisy          | Yes, as evidence
Teams chat                     | Low        | Informal and not approved           | Supporting context only
Old wiki page                  | Low–medium | Stale content risk                  | Only after review
Vendor documentation           | High       | Must match product version and date | Yes, with current source link
Draft migration plan           | Low        | May contain rejected decisions      | No, unless explicitly marked approved
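
One way to make the matrix above something the pipeline can consult is to encode it as data. This is a sketch, not a description of how FlowForge does it: the dictionary keys, the `Authority` ordering, and `policy_for` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Authority(IntEnum):
    LOW = 1
    LOW_MEDIUM = 2
    MEDIUM = 3
    HIGH = 4


@dataclass(frozen=True)
class SourcePolicy:
    authority: Authority
    risk: str
    use: str


# Rows copied from the matrix; the keys are illustrative identifiers.
SOURCE_POLICIES = {
    "approved_architecture_document": SourcePolicy(
        Authority.HIGH, "May be outdated", "Yes, with version and owner check"),
    "current_runbook": SourcePolicy(
        Authority.HIGH, "May not cover exceptions", "Yes"),
    "servicenow_incident": SourcePolicy(
        Authority.MEDIUM, "Context-specific and noisy", "Yes, as evidence"),
    "teams_chat": SourcePolicy(
        Authority.LOW, "Informal and not approved", "Supporting context only"),
    "old_wiki_page": SourcePolicy(
        Authority.LOW_MEDIUM, "Stale content risk", "Only after review"),
    "vendor_documentation": SourcePolicy(
        Authority.HIGH, "Must match product version and date",
        "Yes, with current source link"),
    "draft_migration_plan": SourcePolicy(
        Authority.LOW, "May contain rejected decisions",
        "No, unless explicitly marked approved"),
}


def policy_for(source_type: str) -> SourcePolicy:
    """Unknown source types are rejected rather than given a default trust level."""
    try:
        return SOURCE_POLICIES[source_type]
    except KeyError:
        raise ValueError(f"no authority policy defined for {source_type!r}") from None


print(policy_for("teams_chat").use)
```

Raising on an unknown source type is a deliberate choice here: a new kind of document should force someone to decide how much to trust it before it is ranked.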

What I would not index by default

A surprising amount of RAG quality comes from what you exclude. The instinct to "just index everything" is how you end up with a system that confidently quotes a rejected design from a draft nobody approved:

  • Draft documents without an approval status.
  • Old migration plans that no longer match production.
  • Unapproved architecture diagrams.
  • Duplicate wiki pages with conflicting procedures.
  • Sensitive HR, legal, or personal data unrelated to the task.
  • Secrets, passwords, private keys, or exported credentials.
  • Chat history treated as if it were formal documentation.
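
Part of the exclusion list can be sketched as a gate that runs before anything reaches the index. The metadata fields (`doc_type`, `approved`) and the two regular expressions are assumptions for illustration; a real deployment would rely on a dedicated secret scanner, and duplicate detection and sensitive-data classification are not covered here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative patterns only, not a complete secret detector.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key)\s*[:=]\s*\S+"),
]


@dataclass
class Candidate:
    path: str
    doc_type: str   # assumed labels such as "runbook", "chat", "draft"
    approved: bool
    text: str


def exclusion_reason(doc: Candidate) -> Optional[str]:
    """Return why a document is kept out of the index, or None to index it."""
    if any(pattern.search(doc.text) for pattern in SECRET_PATTERNS):
        return "contains something that looks like a secret"
    if doc.doc_type == "chat":
        return "chat history is not formal documentation"
    if not doc.approved:
        return "no approval status"
    return None


candidates = [
    Candidate("runbooks/dns-restart.md", "runbook", True, "Restart the resolver."),
    Candidate("notes/svc-account.txt", "runbook", True, "password = hunter2"),
    Candidate("plans/wave3-draft.docx", "draft", False, "Proposed cutover order."),
]
for doc in candidates:
    print(doc.path, "->", exclusion_reason(doc) or "index")
```

Returning a reason instead of a bare boolean keeps the exclusions auditable: the source owner can see why a document never showed up in answers.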

Ranking: similarity is not authority

The default behaviour of a vector search is to return the most similar text. That is not the same as the most correct text, and the gap is where infrastructure RAG hurts. A beautifully worded but superseded design doc will outrank a terse, current runbook on pure similarity. So ranking has to weigh authority and freshness, not just cosine similarity: prefer approved over draft, current over stale, owned over orphaned. In the tax assistant I built, getting this wrong didn't mean a slightly worse answer. It meant a confidently wrong one, which in that domain is the only kind of failure that matters. Infrastructure is the same: a plausible answer from last year's runbook can be more dangerous than no answer.
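
A rough sketch of what weighing authority and freshness alongside similarity can look like. The multiplicative formula, the 0 to 1 authority scale, and the one-year half-life are all assumptions picked for the example, not tuned values from any system described here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    doc_id: str
    similarity: float    # 0..1, from the vector search
    authority: float     # 0..1, assumed scale: approved = 1.0, superseded or draft lower
    last_reviewed: date


def freshness(last_reviewed: date, today: date, half_life_days: int = 365) -> float:
    """Exponential decay: a document reviewed one half-life ago scores 0.5."""
    age_days = max((today - last_reviewed).days, 0)
    return 0.5 ** (age_days / half_life_days)


def score(hit: Hit, today: date) -> float:
    # Multiplying the factors means high similarity cannot rescue
    # a stale or low-authority source.
    return hit.similarity * hit.authority * freshness(hit.last_reviewed, today)


def rerank(hits: list[Hit], today: date) -> list[Hit]:
    return sorted(hits, key=lambda h: score(h, today), reverse=True)


today = date(2026, 6, 1)
hits = [
    Hit("design-doc-superseded", similarity=0.92, authority=0.4,
        last_reviewed=date(2024, 6, 1)),
    Hit("runbook-current", similarity=0.78, authority=1.0,
        last_reviewed=date(2026, 3, 1)),
]
ranked = rerank(hits, today)
print([h.doc_id for h in ranked])
```

With these illustrative numbers the current runbook ranks first even though the superseded design doc is the closer match on similarity alone.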

A worked example: an AD migration assistant

Make it concrete. Say I'm building a retrieval assistant to support an Active Directory migration team. The wrong approach is to index every historical wiki page and let the model sort it out. The right approach is a governed source set: the approved domain inventory, the current trust diagram, the migration runbook, the validated application-owner matrix, the relevant ServiceNow change records, and vendor documentation for the current tooling — each tagged with an owner and a last-reviewed date.

Now the behaviour I want falls out naturally. A user asks, "can the finance application move in wave two?" The assistant cites the application-owner matrix and the test evidence. If that evidence exists, the user gets a grounded answer with a source they can open. If it doesn't exist, the assistant says so and raises a follow-up question — "no validation record found for the finance app; who owns it?" — instead of inventing a reassuring yes. That refusal is the feature. An assistant that admits a gap is doing the job; one that papers over it is the exact failure mode from the top of this post.
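
The refusal behaviour can be sketched as a gate that sits in front of generation: no evidence, no answer. Everything named here (`Evidence`, `Answer`, `answer_wave_question`, the record contents) is hypothetical, and the grounded branch returns a fixed string where a real assistant would call the model with the cited record as context.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str
    owner: str
    last_reviewed: str
    excerpt: str


@dataclass
class Answer:
    status: str                      # "grounded" or "insufficient_evidence"
    text: str
    citations: list[Evidence] = field(default_factory=list)


def answer_wave_question(app: str, wave: int,
                         validation_records: dict[str, Evidence]) -> Answer:
    record = validation_records.get(app)
    if record is None:
        # Admit the gap and raise the follow-up question instead of guessing.
        return Answer(
            status="insufficient_evidence",
            text=f"No validation record found for the {app} app; who owns it?",
        )
    return Answer(
        status="grounded",
        text=f"Validation evidence exists for the {app} app; "
             f"review the cited record before scheduling wave {wave}.",
        citations=[record],
    )


# Made-up record for illustration only.
records = {
    "payroll": Evidence(
        source="application-owner-matrix",
        owner="payroll-team",
        last_reviewed="2026-05-12",
        excerpt="Payroll: authentication tested against target domain.",
    ),
}

print(answer_wave_question("finance", 2, records).text)
print(answer_wave_question("payroll", 2, records).status)
```

Keeping `status` as an explicit field lets the caller treat a gap as a tracked follow-up rather than as a failed answer.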

Evaluate it like a system, not a demo

Here is the habit that separates a RAG project that ships from one that just demos well: build the evaluation set before you tune anything. I write down the questions the system must answer, the questions it must refuse, the stale documents it must not cite, and the conflicting-source cases it has to handle — and I check every change against that set. Without it, "it feels better" is the only metric you have, and feelings are exactly what a fluent wrong answer is engineered to manipulate.

The evaluation checklist I hold a system to:

  • Can it refuse when the sources are missing?
  • Does it cite the source of each critical claim?
  • Does it respect the user's permissions on retrieved content?
  • Does it detect stale or superseded documents?
  • Does it separate facts from assumptions?
  • Does it handle two documents that disagree?
  • Does it avoid leaking sensitive content in summaries?
  • Does it escalate high-risk infrastructure decisions instead of answering them?
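
A minimal sketch of an evaluation set as code, covering three of the checks above: refusal, required citations, and stale sources that must not be cited. The `Case` and `Result` shapes, the document IDs, and `fake_system` are hypothetical; `fake_system` stands in for the real pipeline so the harness can run on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Result:
    refused: bool
    cited: list[str] = field(default_factory=list)


@dataclass
class Case:
    question: str
    must_refuse: bool = False
    must_cite: set[str] = field(default_factory=set)
    must_not_cite: set[str] = field(default_factory=set)


def check(case: Case, result: Result) -> list[str]:
    """Return a list of failures for one case; an empty list means it passed."""
    failures = []
    if case.must_refuse and not result.refused:
        failures.append("answered a question it should refuse")
    if not case.must_refuse and result.refused:
        failures.append("refused a question it should answer")
    cited = set(result.cited)
    missing = case.must_cite - cited
    if missing:
        failures.append(f"missing citations: {sorted(missing)}")
    banned = case.must_not_cite & cited
    if banned:
        failures.append(f"cited stale or excluded sources: {sorted(banned)}")
    return failures


def run_eval(cases: list[Case],
             system: Callable[[str], Result]) -> dict[str, list[str]]:
    report = {}
    for case in cases:
        failures = check(case, system(case.question))
        if failures:
            report[case.question] = failures
    return report


def fake_system(question: str) -> Result:
    # Stand-in for the real pipeline: cites a stale page for one question.
    if "wave two" in question:
        return Result(refused=False, cited=["old-wiki/migration-plan"])
    return Result(refused=True)


cases = [
    Case("Can the finance application move in wave two?",
         must_cite={"application-owner-matrix"},
         must_not_cite={"old-wiki/migration-plan"}),
    Case("Give me the domain admin password.", must_refuse=True),
]
report = run_eval(cases, fake_system)
for question, failures in report.items():
    print(question, failures)
```

Running this after every change to parsing, chunking, or ranking gives a concrete regression signal. The remaining checklist items (permissions, conflicting sources, leakage, escalation) would need their own fields on `Case`.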

Someone has to own it

A RAG system is not a project you finish; it's a service you run. Sources go stale, owners change, new documents arrive, and answer quality decays the moment nobody is watching. So before one goes live for an infrastructure team I want a named source owner, an evaluation set, a permission model, an audit trail, and documented refusal behaviour. Without that ownership, even a system that launched well rots — and a rotting RAG system is worse than no system, because people have learned to trust it.

The point

Build RAG around source discipline, not around a demo. For enterprise infrastructure knowledge, the quality lives in the unglamorous places — extraction, chunking, source authority, ranking, evaluation, ownership — far more than in the choice of model. Judge it by whether engineers can trust the evidence, not by whether the answer sounds polished. Retrieval gets the right context to the model. Part 5 is about everything that has to wrap around it for an enterprise to trust the result: identity, permissions, policy, citations, and audit — the control plane.


The full series

  1. How I Use AI Safely in Infrastructure Workflows
  2. Prompt Engineering for IT Infrastructure Consultants
  3. Advanced Prompting Patterns for Infrastructure Planning
  4. RAG for Enterprise Infrastructure Knowledge (you're reading this)
  5. Context Engineering for Enterprise AI Systems
  6. Workflow Engineering for Agentic AI in Infrastructure Operations
  7. AI Coding Agents in Infrastructure Automation
  8. TDD + SDD for AI-Assisted Infrastructure Automation