Risk Analysis for Infrastructure Migration Projects
Last reviewed: June 2026
Scope
This article covers risk analysis for infrastructure migration projects — Active Directory, Citrix, Azure, network, security, and operational change. It focuses on practical delivery control, not academic risk terminology. The goal is a risk process that actually changes decisions, not one that produces a tidy register nobody reads.
Why RAID logs fail
Risk analysis fails when it's treated as a document exercise. I've seen plenty of migration programs with a beautifully maintained RAID log (risks, assumptions, issues, decisions) and an outage anyway, because the log was a record, not a control. In infrastructure projects, risks have to carry owners, a review cadence, decision links, escalation paths, and visible mitigation actions. The RAID log is not risk management; it's the visible artifact of it. If a risk has no owner, no due date, and no mitigation, it is just a sentence in a spreadsheet, and a sentence has never prevented an incident.
Good risk ownership does two things: it protects production, and it forces decisions early — while they're still cheap.
What actually sinks a migration
In my experience the migrations that go wrong rarely fail on an unknown risk. They fail on a risk that was known, written down, and never escalated. The pattern is depressingly consistent: someone flags a dependency in week one — an unowned application, a firewall path nobody can confirm, a profile assumption that hasn't been tested — it gets logged as "medium," it sits there because no one owns it, and it resurfaces as a Sev-1 on cutover night. The risk register did its job. The escalation didn't.
That's why I care less about capturing risks (everyone can make a list) and far more about ownership and escalation thresholds. Risk analysis gives the project a way to make uncertainty visible before it becomes an outage — but only if the visible thing then forces a decision.
Common infrastructure risks
The risks that recur most often, by area, with the mitigation that usually works:
| Area | Example risk | Typical mitigation |
|---|---|---|
| Identity | Unknown service accounts block AD migration | Discovery, owner mapping, pilot validation |
| Citrix | Profile behavior changes after migration | Pilot users, rollback, profile testing |
| Azure | Landing zone lacks a governance baseline | Policy, tagging, RBAC, cost controls before workload migration |
| Network | Firewall dependencies are incomplete | Traffic analysis and application-owner validation |
| Security | Excessive privileges copied into the target state | Access review and least-privilege remediation |
| Operations | No support model after go-live | Service-desk briefing, runbooks, hypercare |
| Project | No decision owner for cutover | RACI and go/no-go authority |
| Business | Unclear outage window | Communication plan and business approval |
RAID template
A RAID log earns its keep when every row is actionable. The columns I insist on:
| Type | Description | Owner | Impact | Probability | Mitigation | Due date | Status |
|---|---|---|---|---|---|---|---|
| Risk | Application uses a hardcoded LDAP path to the old domain | Application owner + identity engineer | 5 | 3 | Validate LDAP configuration in test before the wave | Five business days before the wave | Open |
| Issue | Pilot user cannot access shared drive after migration | EUC lead | 4 | 5 | Fix group mapping and retest before expanding pilot | Today | In progress |
| Decision | Whether to migrate the legacy app in Wave 1 | Business owner | 4 | 4 | Approve deferral or accept the migration risk | Before CAB | Pending |
The "Owner" and "Due date" columns are the ones that get fudged, and they're the two that decide whether the mitigation actually happens.
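To make "every row is actionable" checkable, the template can be sketched as a small data structure with a completeness check. This is only an illustration: the `RaidEntry` class, the `missing_fields` helper, and the list of placeholder owner names are hypothetical, and the sketch assumes relative due dates such as "Before CAB" have already been resolved to a calendar date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical placeholder values that should not count as a real owner.
VAGUE_OWNERS = {"", "it", "team", "the team", "project", "tbd"}


@dataclass
class RaidEntry:
    kind: str                    # "Risk", "Issue", or "Decision"
    description: str
    owner: Optional[str]         # named person or role
    impact: int                  # 1 to 5
    probability: int             # 1 to 5
    mitigation: Optional[str]
    due_date: Optional[date]     # assumption: resolved to a calendar date
    status: str = "Open"


def missing_fields(entry: RaidEntry) -> list[str]:
    """Return the names of fields that keep this row from being actionable."""
    problems = []
    if entry.owner is None or entry.owner.strip().lower() in VAGUE_OWNERS:
        problems.append("owner")
    if entry.due_date is None:
        problems.append("due_date")
    if not (entry.mitigation or "").strip():
        problems.append("mitigation")
    return problems
```

A row that comes back with a non-empty list is the "sentence in a spreadsheet" case: it is recorded, but nothing in it obliges anyone to act.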
Risk scoring
I prefer a scoring model simple enough that a team will actually use it:
- Impact from 1 to 5.
- Probability from 1 to 5.
- Score equals impact multiplied by probability.
- Scores of 15 and higher require escalation or explicit acceptance.
- Scores of 20 and higher should block the wave unless the accountable owner accepts the risk in writing.
The point of the threshold isn't precision — a 12 and a 15 aren't meaningfully different in reality. The point is that crossing the line triggers an action: an escalation, a written acceptance, a blocked wave. Scoring without a triggered action is just decorating the spreadsheet.
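The scoring rules above can be sketched in a few lines. The function names and the returned action strings are illustrative assumptions; only the 1 to 5 scales and the thresholds of 15 and 20 come from the model described here.

```python
ESCALATE_AT = 15  # scores of 15 and higher need escalation or explicit acceptance
BLOCK_AT = 20     # scores of 20 and higher block the wave unless accepted in writing


def risk_score(impact: int, probability: int) -> int:
    """Score equals impact multiplied by probability, each on a 1 to 5 scale."""
    for name, value in (("impact", impact), ("probability", probability)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * probability


def required_action(impact: int, probability: int,
                    accepted_in_writing: bool = False) -> str:
    """Map a score to the action it triggers (action labels are hypothetical)."""
    score = risk_score(impact, probability)
    if score >= BLOCK_AT:
        if accepted_in_writing:
            return "proceed with written acceptance"
        return "block wave"
    if score >= ESCALATE_AT:
        return "escalate or accept explicitly"
    return "track in regular review"
```

With the template rows above, the hardcoded LDAP risk (impact 5, probability 3) scores 15 and triggers escalation, while the pilot shared-drive issue (impact 4, probability 5) scores 20 and blocks the wave until the accountable owner accepts it in writing.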
Turn risk into a decision path
A risk entry is weak until it says what decision it is trying to force. In infrastructure projects I classify high risks by the decision they need:
| Risk type | Decision needed | Example |
|---|---|---|
| Mitigate | What work reduces the risk before the wave? | Build a test LDAP endpoint and validate the application bind path. |
| Accept | Who is allowed to accept the remaining risk? | Business owner accepts a short reporting outage during cutover. |
| Defer | Should the workload move to a later wave? | Legacy Citrix application waits until profile behavior is validated. |
| Block | Is the wave allowed to proceed? | No rollback path for a critical identity change blocks go-live. |
This matters because many RAID logs confuse risk tracking with risk treatment. A row can be perfectly written and still not cause anything to happen. The delivery lead's job is to connect the risk to one of those decision paths and then make the path visible. If the decision owner will not accept, mitigate, defer, or block, the project has an accountability problem, not a documentation problem.
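One way to make the gap between tracking and treatment visible is to list every high risk that has no decision path or no decision owner. The sketch below is a hypothetical illustration: the dictionary keys, the `needs_decision` helper, and the assumption that "high" means a score of 15 or more are mine, not a fixed standard.

```python
from enum import Enum

HIGH_SCORE = 15  # assumption: "high" uses the escalation threshold from the scoring model


class Treatment(Enum):
    """Decision paths, each paired with the question it forces."""
    MITIGATE = "What work reduces the risk before the wave?"
    ACCEPT = "Who is allowed to accept the remaining risk?"
    DEFER = "Should the workload move to a later wave?"
    BLOCK = "Is the wave allowed to proceed?"


def needs_decision(risks: list[dict]) -> list[str]:
    """Return descriptions of high risks lacking a treatment or a decision owner."""
    flagged = []
    for risk in risks:
        if risk["score"] < HIGH_SCORE:
            continue
        if risk.get("treatment") is None or not risk.get("decision_owner"):
            flagged.append(risk["description"])
    return flagged
```

A non-empty result is the accountability problem described above, expressed as a list that can be read out in the risk review.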
What good mitigation looks like
Mitigation has to be specific enough that another engineer can verify it. "Test application access" is not a mitigation. "Application owner validates login, search, create, update, and scheduled job behavior against the target LDAP path using pilot user and service accounts" is a mitigation. It has an owner, a method, and an observable result.
The mitigation I want in a migration risk log has five fields:
| Field | Why it matters |
|---|---|
| Action | The concrete work that reduces probability or impact. |
| Owner | The named person or role accountable for completion. |
| Evidence | The artifact proving the action happened: log, screenshot, test result, sign-off. |
| Deadline | The latest date before the wave where this still helps. |
| Residual risk | What remains after mitigation, and who accepts it. |
That last field is the one teams skip. Mitigation rarely removes all risk; it reduces it to a level someone can accept. If residual risk is not stated, the team can accidentally treat a partial mitigation as a complete fix.
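The five fields can be checked mechanically before a wave. The following is a minimal sketch under stated assumptions: the `Mitigation` class and `review` function are hypothetical names, residual risk is split into a statement and an accepting owner, and the deadline is assumed to be a calendar date that must fall strictly before the wave date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Mitigation:
    action: str                 # concrete work that reduces probability or impact
    owner: str                  # named person or role accountable for completion
    evidence: str               # log, screenshot, test result, or sign-off
    deadline: date              # latest date where the action still helps
    residual_risk: str          # what remains after mitigation
    residual_accepted_by: str   # who accepts what remains


def review(m: Mitigation, wave_date: date) -> list[str]:
    """Return findings that make this mitigation unverifiable or too late."""
    findings = []
    for name in ("action", "owner", "evidence",
                 "residual_risk", "residual_accepted_by"):
        if not getattr(m, name).strip():
            findings.append(f"{name} is empty")
    if m.deadline >= wave_date:
        findings.append("deadline is not before the wave")
    return findings
```

An empty `residual_risk` is flagged in the same way as a missing owner, which keeps a partial mitigation from being read as a complete fix.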
The risks that hide inside dependencies
Infrastructure migration risk is often dependency risk wearing a technical label. "AD migration risk" may actually be an application-owner risk. "Citrix migration risk" may be a profile and print dependency. "Azure migration risk" may be an identity, cost-management, or firewall-rule dependency. The practical technique is to rewrite each technical risk into the dependency it exposes.
For example:
- Weak: "Application may fail after AD migration."
- Stronger: "Finance application uses an undocumented LDAP bind path to the source domain; no owner has confirmed target-domain behavior."
- Weak: "Users may have Citrix issues."
- Stronger: "Pilot did not include users with redirected folders, legacy printers, and profile containers from the highest-risk business unit."
- Weak: "Azure landing zone not ready."
- Stronger: "Policy, RBAC, diagnostic settings, and budget alerts are not approved; workloads could migrate without governance baseline."
The stronger version gives the team something to do. It identifies the missing owner, test, or control. It also stops the project from hiding behind broad category names.
Escalation is a designed workflow
Escalation should not depend on who is loudest in the meeting. I prefer to write escalation rules into the project operating model:
| Trigger | Escalation |
|---|---|
| High risk has no owner after 48 hours | Migration lead escalates to accountable service owner. |
| Mitigation date slips inside the freeze window | Wave lead requests go/no-go decision. |
| Application owner unavailable for validation | Business owner accepts deferral or assigns alternate approver. |
| Rollback cannot be described in executable steps | Change owner blocks CAB approval. |
| Same issue appears in two pilot waves | Program lead pauses expansion and opens root-cause review. |
This is not bureaucracy. It is how risk analysis becomes a control loop. Without a trigger and an escalation path, the risk log waits politely while the outage walks toward you.
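Two of the triggers in the table lend themselves to a simple automated check. This is a sketch, not a prescribed tool: the record layout, the function names, and the use of a score of 15 as the definition of "high" are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

HIGH_SCORE = 15  # assumption: "high risk" uses the escalation threshold


def unowned_high_risks(risks: list[dict], now: datetime,
                       limit: timedelta = timedelta(hours=48)) -> list[str]:
    """High risks still without an owner once the limit has elapsed.

    Each risk is a dict with 'description', 'score', 'owner', 'raised_at'.
    """
    return [
        r["description"]
        for r in risks
        if r["score"] >= HIGH_SCORE
        and not r.get("owner")
        and now - r["raised_at"] >= limit
    ]


def repeated_issues(issue_log: list[tuple[str, str]]) -> list[str]:
    """Issues seen in two or more pilot waves; entries are (issue, wave) pairs."""
    waves_by_issue = defaultdict(set)
    for issue, wave in issue_log:
        waves_by_issue[issue].add(wave)
    return sorted(i for i, waves in waves_by_issue.items() if len(waves) >= 2)
```

The first result goes to the migration lead for escalation to the accountable service owner; the second is the signal for the program lead to pause expansion and open a root-cause review.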
Anti-patterns that make risk logs useless
The fastest way to weaken risk management is to let the register fill with vague, ownerless statements. These are the anti-patterns I challenge immediately:
| Anti-pattern | Why it fails | Better version |
|---|---|---|
| "May impact users" | Every migration may impact users; the risk says nothing. | "Profile migration may add more than 30 seconds to logon for finance users with redirected folders." |
| "IT to investigate" | No accountable owner or deadline. | "EUC lead to validate profile container behavior with five pilot users by Friday." |
| "Accepted by project" | A project cannot accept business risk. | "Finance business owner accepts one-hour reporting outage during Wave 2 cutover." |
| "Rollback available" | Not executable and not symptom-specific. | "If LDAP bind fails, app owner repoints service to source-domain endpoint and validates scheduled job." |
| "Monitor closely" | Observation is not mitigation. | "Operations monitors failed logons, app errors, and queue length every 15 minutes during hypercare." |
These look like writing problems, but they are delivery problems. Vague risks let everyone agree in the meeting and disagree during the incident. Precise risks force uncomfortable conversations earlier, when the team can still change the plan.
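A register can be screened for these phrases before each review. The sketch below assumes a simple, case-insensitive phrase match; the `lint_risk_text` name and the advice strings are hypothetical, and a match is a prompt for a conversation, not proof that the entry is wrong.

```python
import re

# Phrases from the anti-pattern table, paired with a short rewrite prompt.
ANTI_PATTERNS = {
    r"\bmay impact users\b": "State who is affected and by how much.",
    r"\bIT to investigate\b": "Name an accountable owner and a deadline.",
    r"\baccepted by project\b": "Name the business owner who accepts the risk.",
    r"\brollback available\b": "Describe executable, symptom-specific rollback steps.",
    r"\bmonitor closely\b": "State what is monitored, by whom, and how often.",
}


def lint_risk_text(text: str) -> list[str]:
    """Return rewrite prompts for every anti-pattern phrase found in the text."""
    return [
        advice
        for pattern, advice in ANTI_PATTERNS.items()
        if re.search(pattern, text, re.IGNORECASE)
    ]
```

Running this over the description and mitigation columns gives the reviewer a short list of rows to challenge, using the "better version" column as the model for the rewrite.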
Risk review cadence
A register is only as good as how often it is reviewed and acted on:
- Review risks weekly during planning.
- Review risks daily during migration windows and hypercare.
- Escalate blocked decisions early — early escalation is cheap, late escalation is an incident.
- Close outdated risks so the log stays readable and trusted.
- Convert issues into actions with owners.
- Link major risks to milestones, CAB approvals, and go/no-go criteria.
Go/no-go risk questions
The questions that decide whether a wave proceeds:
- Are all critical risks owned?
- Are the open high risks accepted by the right accountable owner?
- Are the rollback triggers clear?
- Are support teams briefed?
- Are application owners available during the migration window?
- Are monitoring and validation steps ready?
- Have user communications been sent?
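The questions above can be held as an explicit checklist so the outcome does not depend on memory in the meeting. This is a sketch under one loud assumption: any question answered "no", or left unanswered, yields NO-GO. The article does not prescribe that rule, and a real project may allow documented exceptions. The `go_no_go` name is hypothetical.

```python
GO_NO_GO_QUESTIONS = [
    "Are all critical risks owned?",
    "Are the open high risks accepted by the right accountable owner?",
    "Are the rollback triggers clear?",
    "Are support teams briefed?",
    "Are application owners available during the migration window?",
    "Are monitoring and validation steps ready?",
    "Have user communications been sent?",
]


def go_no_go(answers: dict[str, bool]) -> tuple[str, list[str]]:
    """Return the decision and the questions still open.

    Assumption: a missing answer counts as "no".
    """
    open_items = [q for q in GO_NO_GO_QUESTIONS if not answers.get(q, False)]
    decision = "GO" if not open_items else "NO-GO"
    return decision, open_items
```

Returning the open questions alongside the decision gives the wave lead the exact list to resolve or escalate before the next checkpoint.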
Practical checklist
- Start the RAID log during discovery, not the week before go-live.
- Separate risks, assumptions, issues, and decisions — they need different handling.
- Assign owners by name or role, never "the team."
- Use a review cadence and explicit escalation thresholds.
- Keep mitigations concrete and dated.
- Connect the top risks to migration-wave planning.
- Review accepted risks after go-live so they don't become permanent exceptions.
Final recommendation
Risk analysis should help the project make better decisions, not satisfy a template. Keep the model simple, make ownership visible, escalate early, and use the RAID log to control production risk rather than to document it after the fact. The migrations I've seen succeed weren't the ones with the prettiest registers — they were the ones where a flagged risk reliably turned into a decision with an owner, well before cutover night.
References
- PMI: Project Management Professional handbook
- PMI: PMP Examination Content Outline
- NIST SP 800-30 Rev. 1: Guide for Conducting Risk Assessments