Risk Analysis for Infrastructure Migration Projects

Last reviewed: June 2026

Scope

This article covers risk analysis for infrastructure migration projects — Active Directory, Citrix, Azure, network, security, and operational change. It focuses on practical delivery control, not academic risk terminology. The goal is a risk process that actually changes decisions, not one that produces a tidy register nobody reads.

Why RAID logs fail

Risk analysis fails when it's treated as a document exercise. I've seen plenty of migration programs with a beautifully maintained RAID log (risks, assumptions, issues, decisions) and an outage anyway, because the log was a record, not a control. In infrastructure projects, risks have to carry owners, a review cadence, decision links, escalation paths, and visible mitigation actions. The RAID log is not risk management; it's the visible artifact of it. If a risk has no owner, no due date, and no mitigation, it is just a sentence in a spreadsheet, and a sentence has never prevented an incident.

Good risk ownership does two things: it protects production, and it forces decisions early — while they're still cheap.

What actually sinks a migration

In my experience the migrations that go wrong rarely fail on an unknown risk. They fail on a risk that was known, written down, and never escalated. The pattern is depressingly consistent: someone flags a dependency in week one — an unowned application, a firewall path nobody can confirm, a profile assumption that hasn't been tested — it gets logged as "medium," it sits there because no one owns it, and it resurfaces as a Sev-1 on cutover night. The risk register did its job. The escalation didn't.

That's why I care less about capturing risks (everyone can make a list) and far more about ownership and escalation thresholds. Risk analysis gives the project a way to make uncertainty visible before it becomes an outage — but only if the visible thing then forces a decision.

Common infrastructure risks

The recurring ones, by area, with the mitigation that usually works:

Area | Example risk | Typical mitigation
Identity | Unknown service accounts block AD migration | Discovery, owner mapping, pilot validation
Citrix | Profile behavior changes after migration | Pilot users, rollback, profile testing
Azure | Landing zone lacks a governance baseline | Policy, tagging, RBAC, cost controls before workload migration
Network | Firewall dependencies are incomplete | Traffic analysis and application-owner validation
Security | Excessive privileges copied into the target state | Access review and least-privilege remediation
Operations | No support model after go-live | Service-desk briefing, runbooks, hypercare
Project | No decision owner for cutover | RACI and go/no-go authority
Business | Unclear outage window | Communication plan and business approval

RAID template

A RAID log earns its keep when every row is actionable. The columns I insist on:

Type | Description | Owner | Impact | Probability | Mitigation | Due date | Status
Risk | Application uses a hardcoded LDAP path to the old domain | Application owner + identity engineer | 5 | 3 | Validate LDAP configuration in test before the wave | Five business days before the wave | Open
Issue | Pilot user cannot access shared drive after migration | EUC lead | 4 | 5 | Fix group mapping and retest before expanding pilot | Today | In progress
Decision | Whether to migrate the legacy app in Wave 1 | Business owner | 4 | 4 | Approve deferral or accept the migration risk | Before CAB | Pending

The "Owner" and "Due date" columns are the ones that get fudged, and they're the only two that matter for whether anything actually happens.
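The "is this row actionable?" test can be automated when the log is exported. The following is a minimal sketch, not a prescribed tool: the `RaidEntry` class, the `PLACEHOLDER_OWNERS` set, and the `actionability_gaps` function are hypothetical names, and the sketch assumes due dates are stored as calendar dates rather than relative phrases such as "Before CAB".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RaidEntry:
    """One row of the RAID log, mirroring the template columns."""
    type: str                 # "Risk", "Issue", or "Decision"
    description: str
    owner: str
    impact: int               # 1 to 5
    probability: int          # 1 to 5
    mitigation: str
    due_date: Optional[date]  # None means nobody committed to a date
    status: str


# Assumption: owner values that name nobody in particular.
PLACEHOLDER_OWNERS = {"", "tbd", "tbc", "the team", "it", "project"}


def actionability_gaps(entry: RaidEntry) -> list[str]:
    """Return the reasons a row is not yet actionable (empty list = fine)."""
    gaps = []
    if entry.owner.strip().lower() in PLACEHOLDER_OWNERS:
        gaps.append("no accountable owner")
    if entry.due_date is None:
        gaps.append("no due date")
    if not entry.mitigation.strip():
        gaps.append("no mitigation")
    return gaps
```

Running this over every open row before the weekly review gives a short list of rows to challenge, instead of relying on someone to spot the fudged columns by eye.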

Risk scoring

I prefer a scoring model simple enough that a team will actually use it:

  • Impact from 1 to 5.
  • Probability from 1 to 5.
  • Score equals impact multiplied by probability.
  • Scores of 15 and higher require escalation or explicit acceptance.
  • Scores of 20 and higher should block the wave unless the accountable owner accepts the risk in writing.

The point of the threshold isn't precision — a 12 and a 15 aren't meaningfully different in reality. The point is that crossing the line triggers an action: an escalation, a written acceptance, a blocked wave. Scoring without a triggered action is just decorating the spreadsheet.
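The scoring rules above are small enough to express directly. This is an illustrative sketch: the function names and the returned action strings are assumptions, while the thresholds (15 and 20) and the 1 to 5 ranges come from the list above.

```python
def risk_score(impact: int, probability: int) -> int:
    """Score = impact x probability, each on a 1 to 5 scale."""
    for name, value in (("impact", impact), ("probability", probability)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * probability


def required_action(score: int, accepted_in_writing: bool = False) -> str:
    """Map a score to the action the threshold is meant to trigger."""
    if score >= 20:
        # 20+ blocks the wave unless the accountable owner accepts in writing.
        if accepted_in_writing:
            return "proceed with written acceptance"
        return "block wave"
    if score >= 15:
        return "escalate or accept explicitly"
    return "track in regular review"
```

For the hardcoded-LDAP row in the template (impact 5, probability 3), `risk_score` gives 15, so `required_action` returns "escalate or accept explicitly".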

Turn risk into a decision path

A risk entry is weak until it says what decision it is trying to force. In infrastructure projects I classify high risks by the decision they need:

Decision path | Decision needed | Example
Mitigate | What work reduces the risk before the wave? | Build a test LDAP endpoint and validate the application bind path.
Accept | Who is allowed to accept the remaining risk? | Business owner accepts a short reporting outage during cutover.
Defer | Should the workload move to a later wave? | Legacy Citrix application waits until profile behavior is validated.
Block | Is the wave allowed to proceed? | No rollback path for a critical identity change blocks go-live.

This matters because many RAID logs confuse risk tracking with risk treatment. A row can be perfectly written and still not cause anything to happen. The delivery lead's job is to connect the risk to one of those decision paths and then make the path visible. If the decision owner will not accept, mitigate, defer, or block, the project has an accountability problem, not a documentation problem.

What good mitigation looks like

Mitigation has to be specific enough that another engineer can verify it. "Test application access" is not a mitigation. "Application owner validates login, search, create, update, and scheduled job behavior against the target LDAP path using pilot user and service accounts" is a mitigation. It has an owner, a method, and an observable result.

The mitigation I want in a migration risk log has five fields:

Field | Why it matters
Action | The concrete work that reduces probability or impact.
Owner | The named person or role accountable for completion.
Evidence | The artifact proving the action happened: log, screenshot, test result, sign-off.
Deadline | The latest date before the wave where this still helps.
Residual risk | What remains after mitigation, and who accepts it.

That last field is the one teams skip. Mitigation rarely removes all risk; it reduces it to a level someone can accept. If residual risk is not stated, the team can accidentally treat a partial mitigation as a complete fix.
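One way to keep the residual-risk field from being skipped is to treat a mitigation as incomplete until all five fields are filled. A minimal sketch, assuming the fields are captured as free text; the `Mitigation` class and `missing_fields` function are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass
class Mitigation:
    """The five mitigation fields, all stored as free text."""
    action: str
    owner: str
    evidence: str
    deadline: str
    residual_risk: str


def missing_fields(mitigation: Mitigation) -> list[str]:
    """Return the names of fields left blank, in declaration order."""
    return [
        f.name
        for f in fields(mitigation)
        if not getattr(mitigation, f.name).strip()
    ]
```

A mitigation with a blank residual-risk field returns `["residual_risk"]`, which is the cue to ask what remains and who accepts it.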

The risks that hide inside dependencies

Infrastructure migration risk is often dependency risk wearing a technical label. "AD migration risk" may actually be an application-owner risk. "Citrix migration risk" may be a profile and print dependency. "Azure migration risk" may be an identity, cost-management, or firewall-rule dependency. The practical technique is to rewrite each technical risk into the dependency it exposes.

For example:

  • Weak: "Application may fail after AD migration."
  • Stronger: "Finance application uses an undocumented LDAP bind path to the source domain; no owner has confirmed target-domain behavior."
  • Weak: "Users may have Citrix issues."
  • Stronger: "Pilot did not include users with redirected folders, legacy printers, and profile containers from the highest-risk business unit."
  • Weak: "Azure landing zone not ready."
  • Stronger: "Policy, RBAC, diagnostic settings, and budget alerts are not approved; workloads could migrate without governance baseline."

The stronger version gives the team something to do. It identifies the missing owner, test, or control. It also stops the project from hiding behind broad category names.

Escalation is a designed workflow

Escalation should not depend on who is loudest in the meeting. I prefer to write escalation rules into the project operating model:

Trigger | Escalation
High risk has no owner after 48 hours | Migration lead escalates to accountable service owner.
Mitigation date slips inside the freeze window | Wave lead requests go/no-go decision.
Application owner unavailable for validation | Business owner accepts deferral or assigns alternate approver.
Rollback cannot be described in executable steps | Change owner blocks CAB approval.
Same issue appears in two pilot waves | Program lead pauses expansion and opens root-cause review.

This is not bureaucracy. It is how risk analysis becomes a control loop. Without a trigger and an escalation path, the risk log waits politely while the outage walks toward you.
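Trigger-and-escalation rules of this kind can be written as data plus a predicate, so the review meeting evaluates them instead of debating them. The sketch below encodes three of the five rules as an illustration. The `RiskState` fields, the `EscalationRule` class, and the choice to treat a score of 15 or more as "high" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RiskState:
    """Hypothetical snapshot of one risk, as a review meeting would see it."""
    score: int
    has_owner: bool
    hours_open: float
    mitigation_slipped_into_freeze: bool
    rollback_steps_executable: bool


@dataclass
class EscalationRule:
    trigger: str
    escalation: str
    applies: Callable[[RiskState], bool]


# Assumption: "high" reuses the escalation threshold from the scoring model.
HIGH_RISK_SCORE = 15

RULES = [
    EscalationRule(
        "High risk has no owner after 48 hours",
        "Migration lead escalates to accountable service owner",
        lambda s: s.score >= HIGH_RISK_SCORE and not s.has_owner and s.hours_open >= 48,
    ),
    EscalationRule(
        "Mitigation date slips inside the freeze window",
        "Wave lead requests go/no-go decision",
        lambda s: s.mitigation_slipped_into_freeze,
    ),
    EscalationRule(
        "Rollback cannot be described in executable steps",
        "Change owner blocks CAB approval",
        lambda s: not s.rollback_steps_executable,
    ),
]


def due_escalations(state: RiskState) -> list[str]:
    """Return every escalation whose trigger currently applies."""
    return [rule.escalation for rule in RULES if rule.applies(state)]
```

Keeping the trigger text next to the predicate means the rule that fired can be quoted verbatim in the escalation message.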

Anti-patterns that make risk logs useless

The fastest way to weaken risk management is to let the register fill with vague, ownerless statements. These are the anti-patterns I challenge immediately:

Anti-pattern | Why it fails | Better version
"May impact users" | Every migration may impact users; the risk says nothing. | "Profile migration may add more than 30 seconds to logon for finance users with redirected folders."
"IT to investigate" | No accountable owner or deadline. | "EUC lead to validate profile container behavior with five pilot users by Friday."
"Accepted by project" | A project cannot accept business risk. | "Finance business owner accepts one-hour reporting outage during Wave 2 cutover."
"Rollback available" | Not executable and not symptom-specific. | "If LDAP bind fails, app owner repoints service to source-domain endpoint and validates scheduled job."
"Monitor closely" | Observation is not mitigation. | "Operations monitors failed logons, app errors, and queue length every 15 minutes during hypercare."

These look like writing problems, but they are delivery problems. Vague risks let everyone agree in the meeting and disagree during the incident. Precise risks force uncomfortable conversations earlier, when the team can still change the plan.
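A crude text check can catch these phrases before a human has to. This is a rough sketch under obvious limits: it only matches the five literal phrases from the table, the `VAGUE_PHRASES` mapping and `lint_entry` function are hypothetical, and a clean result does not mean the entry is well written.

```python
import re

# The five anti-patterns from the table, each with a prompt for the author.
VAGUE_PHRASES = {
    "may impact users": "name the affected users and the measurable effect",
    "it to investigate": "name an owner, an action, and a deadline",
    "accepted by project": "name the business owner who accepts the risk",
    "rollback available": "describe the trigger symptom and the executable steps",
    "monitor closely": "say who monitors what, and how often",
}


def lint_entry(text: str) -> list[str]:
    """Return one finding per vague phrase found in a RAID log entry."""
    # Collapse whitespace and lowercase so matching ignores formatting.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    findings = []
    for phrase, advice in VAGUE_PHRASES.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            findings.append(f'"{phrase}": {advice}')
    return findings
```

The check is deliberately narrow. Its value is in forcing the rewrite conversation early, not in judging prose quality.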

Risk review cadence

A register is only as good as how often it's worked:

  • Review risks weekly during planning.
  • Review risks daily during migration windows and hypercare.
  • Escalate blocked decisions early — early escalation is cheap, late escalation is an incident.
  • Close outdated risks so the log stays readable and trusted.
  • Convert issues into actions with owners.
  • Link major risks to milestones, CAB approvals, and go/no-go criteria.

Go/no-go risk questions

The questions that decide whether a wave proceeds:

  • Are all critical risks owned?
  • Are the open high risks accepted by the right accountable owner?
  • Are the rollback triggers clear?
  • Are support teams briefed?
  • Are application owners available during the migration window?
  • Are monitoring and validation steps ready?
  • Have user communications been sent?
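The questions above can be treated as a gate. The sketch below assumes a strict rule that is not stated in the list itself: every question must be answered "yes", and an unanswered question counts as "no". The `go_no_go` function name and the return shape are hypothetical.

```python
GO_NO_GO_QUESTIONS = [
    "Are all critical risks owned?",
    "Are the open high risks accepted by the right accountable owner?",
    "Are the rollback triggers clear?",
    "Are support teams briefed?",
    "Are application owners available during the migration window?",
    "Are monitoring and validation steps ready?",
    "Have user communications been sent?",
]


def go_no_go(answers: dict[str, bool]) -> tuple[str, list[str]]:
    """Return ("go" | "no-go", questions that failed or were not answered)."""
    # Assumption: a missing answer is treated the same as "no".
    failed = [q for q in GO_NO_GO_QUESTIONS if not answers.get(q, False)]
    return ("go" if not failed else "no-go", failed)
```

Returning the failed questions alongside the verdict keeps the decision explainable: the wave lead can see which item blocked the wave and who needs to close it.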

Practical checklist

  • Start the RAID log during discovery, not the week before go-live.
  • Separate risks, assumptions, issues, and decisions — they need different handling.
  • Assign owners by name or role, never "the team."
  • Use a review cadence and explicit escalation thresholds.
  • Keep mitigations concrete and dated.
  • Connect the top risks to migration-wave planning.
  • Review accepted risks after go-live so they don't become permanent exceptions.

Final recommendation

Risk analysis should help the project make better decisions, not satisfy a template. Keep the model simple, make ownership visible, escalate early, and use the RAID log to control production risk rather than to document it after the fact. The migrations I've seen succeed weren't the ones with the prettiest registers — they were the ones where a flagged risk reliably turned into a decision with an owner, well before cutover night.
