September 20, 2024 IT Management

Risk Analysis for Infrastructure Migration Projects

Last reviewed: June 2026

Scope

This article covers risk analysis for infrastructure migration projects — Active Directory, Citrix, Azure, network, security, and operational change. It focuses on practical delivery control, not academic risk terminology. The goal is a risk process that actually changes decisions, not one that produces a tidy register nobody reads.

Why RAID logs fail

Risk analysis fails when it's treated as a document exercise. I've seen plenty of migration programs with a beautifully maintained RAID log and an outage anyway, because the log was a record, not a control. In infrastructure projects, risks have to carry owners, a review cadence, decision links, escalation paths, and visible mitigation actions. The RAID log is not risk management — it's the visible artifact of it. If a risk has no owner, no due date, and no mitigation, it is just a sentence in a spreadsheet, and a sentence has never prevented an incident.

Good risk ownership does two things: it protects production, and it forces decisions early — while they're still cheap.

What actually sinks a migration

In my experience the migrations that go wrong rarely fail on an unknown risk. They fail on a risk that was known, written down, and never escalated. The pattern is depressingly consistent: someone flags a dependency in week one — an unowned application, a firewall path nobody can confirm, a profile assumption that hasn't been tested — it gets logged as "medium," it sits there because no one owns it, and it resurfaces as a Sev-1 on cutover night. The risk register did its job. The escalation didn't.

That's why I care less about capturing risks (everyone can make a list) and far more about ownership and escalation thresholds. Risk analysis gives the project a way to make uncertainty visible before it becomes an outage — but only if the visible thing then forces a decision.

Common infrastructure risks

The recurring ones, by area, with the mitigation that usually works:

Area	Example risk	Typical mitigation
Identity	Unknown service accounts block AD migration	Discovery, owner mapping, pilot validation
Citrix	Profile behaviour changes after migration	Pilot users, rollback, profile testing
Azure	Landing zone lacks a governance baseline	Policy, tagging, RBAC, cost controls before workload migration
Network	Firewall dependencies are incomplete	Traffic analysis and application-owner validation
Security	Excessive privileges copied into the target state	Access review and least-privilege remediation
Operations	No support model after go-live	Service-desk briefing, runbooks, hypercare
Project	No decision owner for cutover	RACI and go/no-go authority
Business	Unclear outage window	Communication plan and business approval

RAID template

A RAID log earns its keep when every row is actionable. The columns I insist on:

Type	Description	Owner	Impact (1-5)	Probability (1-5)	Mitigation	Due date	Status
Risk	Application uses a hardcoded LDAP path to the old domain	Application owner + identity engineer	5	3	Validate LDAP configuration in test before the wave	Five business days before the wave	Open
Issue	Pilot user cannot access shared drive after migration	EUC lead	4	5	Fix group mapping and retest before expanding pilot	Today	In progress
Decision	Whether to migrate the legacy app in Wave 1	Business owner	4	4	Approve deferral or accept the migration risk	Before CAB	Pending

The "Owner" and "Due date" columns are the ones that get fudged, and they're the only two that matter for whether anything actually happens.
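
One way to catch fudged rows is to check them mechanically before each review. The following is a minimal sketch in Python, assuming a hypothetical RaidRow record whose fields mirror the template columns; the names RaidRow, PLACEHOLDERS, and missing_fields are illustrative and do not come from any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RaidRow:
    """One RAID log row; field names mirror the template columns (hypothetical)."""
    type: str
    description: str
    owner: str
    impact: int
    probability: int
    mitigation: str
    due_date: str
    status: str


# Assumed placeholder values that fill a cell without committing anyone.
PLACEHOLDERS = {"", "tbd", "tbc", "n/a", "the team", "it", "asap"}


def missing_fields(row: RaidRow) -> List[str]:
    """Return the action-bearing columns that are empty or placeholder-only."""
    gaps = []
    for name in ("owner", "due_date", "mitigation"):
        value = getattr(row, name).strip().lower()
        if value in PLACEHOLDERS:
            gaps.append(name)
    return gaps


row = RaidRow(
    type="Risk",
    description="Application uses a hardcoded LDAP path to the old domain",
    owner="TBD",
    impact=5,
    probability=3,
    mitigation="Validate LDAP configuration in test before the wave",
    due_date="",
    status="Open",
)
print(missing_fields(row))  # ['owner', 'due_date']
```

The placeholder list is a guess at what teams typically type into a cell they do not want to fill; adjust it to whatever shows up in your own log.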

Risk scoring

I prefer a scoring model simple enough that a team will actually use it:

Impact from 1 to 5.
Probability from 1 to 5.
Score equals impact multiplied by probability.
Scores of 15 and higher require escalation or explicit acceptance.
Scores of 20 and higher should block the wave unless the accountable owner accepts the risk in writing.

The point of the threshold isn't precision — a 12 and a 15 aren't meaningfully different in reality. The point is that crossing the line triggers an action: an escalation, a written acceptance, a blocked wave. Scoring without a triggered action is just decorating the spreadsheet.
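
The model is small enough to express in a few lines. This Python sketch uses the thresholds stated above (15 and 20); the function names and the wording of the returned actions are illustrative, and the "track in weekly review" default for lower scores is an assumption, since the model only defines what happens at 15 and above.

```python
def risk_score(impact: int, probability: int) -> int:
    """Score = impact x probability, each on a 1 to 5 scale."""
    for name, value in (("impact", impact), ("probability", probability)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * probability


def required_action(score: int) -> str:
    """Map a score to the action its threshold triggers."""
    if score >= 20:
        return "block wave unless accepted in writing"
    if score >= 15:
        return "escalate or explicitly accept"
    # Assumption: below 15 the risk stays in the normal review cycle.
    return "track in weekly review"


score = risk_score(impact=5, probability=3)
print(score, required_action(score))  # 15 escalate or explicitly accept
```

Only the threshold comparison drives behavior here, which matches the intent: the number exists to trigger an action, not to rank risks precisely.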

Turn risk into a decision path

A risk entry is weak until it says what decision it is trying to force. In infrastructure projects I classify high risks by the decision they need:

Decision path	Decision needed	Example
Mitigate	What work reduces the risk before the wave?	Build a test LDAP endpoint and validate the application bind path.
Accept	Who is allowed to accept the remaining risk?	Business owner accepts a short reporting outage during cutover.
Defer	Should the workload move to a later wave?	Legacy Citrix application waits until profile behavior is validated.
Block	Is the wave allowed to proceed?	No rollback path for a critical identity change blocks go-live.

This matters because many RAID logs confuse risk tracking with risk treatment. A row can be perfectly written and still not cause anything to happen. The delivery lead's job is to connect the risk to one of those decision paths and then make the path visible. If the decision owner will not accept, mitigate, defer, or block, the project has an accountability problem, not a documentation problem.

What good mitigation looks like

Mitigation has to be specific enough that another engineer can verify it. "Test application access" is not a mitigation. "Application owner validates login, search, create, update, and scheduled job behavior against the target LDAP path using pilot user and service accounts" is a mitigation. It has an owner, a method, and an observable result.

The mitigation I want in a migration risk log has five fields:

Field	Why it matters
Action	The concrete work that reduces probability or impact.
Owner	The named person or role accountable for completion.
Evidence	The artifact proving the action happened: log, screenshot, test result, sign-off.
Deadline	The latest date before the wave where this still helps.
Residual risk	What remains after mitigation, and who accepts it.

That last field is the one teams skip. Mitigation rarely removes all risk; it reduces it to a level someone can accept. If residual risk is not stated, the team can accidentally treat a partial mitigation as a complete fix.
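
A completeness check over the five fields makes a skipped residual risk visible. This is a minimal Python sketch, assuming a hypothetical Mitigation record with one string per field; the class and function names are illustrative.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Mitigation:
    """The five mitigation fields from the table above (hypothetical record)."""
    action: str
    owner: str
    evidence: str
    deadline: str
    residual_risk: str


def incomplete_fields(m: Mitigation) -> List[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(m) if not getattr(m, f.name).strip()]


m = Mitigation(
    action="Validate login, search, and scheduled job against the target LDAP path",
    owner="Application owner",
    evidence="Signed test result",
    deadline="Five business days before the wave",
    residual_risk="",
)
print(incomplete_fields(m))  # ['residual_risk']
```

The check only proves a field was filled in, not that the content is any good; whether the stated residual risk is acceptable is still a human decision.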

The risks that hide inside dependencies

Infrastructure migration risk is often dependency risk wearing a technical label. "AD migration risk" may actually be an application-owner risk. "Citrix migration risk" may be a profile and print dependency. "Azure migration risk" may be an identity, cost-management, or firewall-rule dependency. The practical technique is to rewrite each technical risk into the dependency it exposes.

For example:

Weak: "Application may fail after AD migration."
Stronger: "Finance application uses an undocumented LDAP bind path to the source domain; no owner has confirmed target-domain behavior."
Weak: "Users may have Citrix issues."
Stronger: "Pilot did not include users with redirected folders, legacy printers, and profile containers from the highest-risk business unit."
Weak: "Azure landing zone not ready."
Stronger: "Policy, RBAC, diagnostic settings, and budget alerts are not approved; workloads could migrate without governance baseline."

The stronger version gives the team something to do. It identifies the missing owner, test, or control. It also stops the project from hiding behind broad category names.

Escalation is a designed workflow

Escalation should not depend on who is loudest in the meeting. I prefer to write escalation rules into the project operating model:

Trigger	Escalation
High risk has no owner after 48 hours	Migration lead escalates to accountable service owner.
Mitigation date slips inside the freeze window	Wave lead requests go/no-go decision.
Application owner unavailable for validation	Business owner accepts deferral or assigns alternate approver.
Rollback cannot be described in executable steps	Change owner blocks CAB approval.
Same issue appears in two pilot waves	Program lead pauses expansion and opens root-cause review.

This is not bureaucracy. It is how risk analysis becomes a control loop. Without a trigger and an escalation path, the risk log waits politely while the outage walks toward you.
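
As an illustration, the first trigger in the table can be written as a check that runs against the log on a schedule. This Python sketch makes two assumptions that are labeled in the code: "high" means a score of 15 or more, borrowed from the scoring model, and the 48 hours are counted from the time the risk was raised. The Risk record and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

OWNER_GRACE = timedelta(hours=48)  # from the trigger table
HIGH_SCORE = 15                    # assumption: "high" = escalation threshold


@dataclass
class Risk:
    """Minimal risk record for the trigger check (hypothetical)."""
    description: str
    score: int
    raised_at: datetime
    owner: Optional[str] = None


def needs_owner_escalation(risk: Risk, now: datetime) -> bool:
    """True when a high risk has gone unowned for more than 48 hours."""
    return (
        risk.score >= HIGH_SCORE
        and not risk.owner
        and now - risk.raised_at > OWNER_GRACE
    )


raised = datetime(2024, 9, 16, 9, 0)
risk = Risk("Firewall path to finance app unconfirmed", score=16, raised_at=raised)

print(needs_owner_escalation(risk, raised + timedelta(hours=24)))  # False
print(needs_owner_escalation(risk, raised + timedelta(hours=49)))  # True
```

The other triggers follow the same shape: a condition that can be evaluated from the log, and a named role that receives the escalation when it is true.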

Anti-patterns that make risk logs useless

The fastest way to weaken risk management is to let the register fill with vague, ownerless statements. These are the anti-patterns I challenge immediately:

Anti-pattern	Why it fails	Better version
"May impact users"	Every migration may impact users; the risk says nothing.	"Profile migration may add more than 30 seconds to logon for finance users with redirected folders."
"IT to investigate"	No accountable owner or deadline.	"EUC lead to validate profile container behavior with five pilot users by Friday."
"Accepted by project"	A project cannot accept business risk.	"Finance business owner accepts one-hour reporting outage during Wave 2 cutover."
"Rollback available"	Not executable and not symptom-specific.	"If LDAP bind fails, app owner repoints service to source-domain endpoint and validates scheduled job."
"Monitor closely"	Observation is not mitigation.	"Operations monitors failed logons, app errors, and queue length every 15 minutes during hypercare."

These look like writing problems, but they are delivery problems. Vague risks let everyone agree in the meeting and disagree during the incident. Precise risks force uncomfortable conversations earlier, when the team can still change the plan.
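
A crude lint pass can flag these phrases before the review meeting. The sketch below is a minimal Python example that searches a risk entry for the five phrases from the table; the pattern list and function name are illustrative, and a phrase match is only a prompt to rewrite the entry, not proof that it is vague.

```python
import re
from typing import List

# Phrases taken from the anti-pattern table; extend per project.
VAGUE_PATTERNS = [
    r"\bmay impact users\b",
    r"\bIT to investigate\b",
    r"\baccepted by project\b",
    r"\brollback available\b",
    r"\bmonitor closely\b",
]


def vague_phrases(text: str) -> List[str]:
    """Return the anti-pattern phrases found in a risk entry, ignoring case."""
    hits = []
    for pattern in VAGUE_PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            hits.append(match.group(0))
    return hits


print(vague_phrases("Profile change may impact users. Monitor closely."))
# ['may impact users', 'Monitor closely']
```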

Risk review cadence

A register is only as good as how often it is reviewed and acted on:

Review risks weekly during planning.
Review risks daily during migration windows and hypercare.
Escalate blocked decisions early — early escalation is cheap, late escalation is an incident.
Close outdated risks so the log stays readable and trusted.
Convert issues into actions with owners.
Link major risks to milestones, CAB approvals, and go/no-go criteria.

Go/no-go risk questions

The questions that decide whether a wave proceeds:

Are all critical risks owned?
Are the open high risks accepted by the right accountable owner?
Are the rollback triggers clear?
Are support teams briefed?
Are application owners available during the migration window?
Are monitoring and validation steps ready?
Have user communications been sent?
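
The questions above can be treated as a simple gate. This Python sketch assumes that every question must be answered "yes" for the wave to proceed and that an unanswered question counts as "no"; both rules are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# The seven questions from the list above, verbatim.
GO_NO_GO_QUESTIONS = [
    "Are all critical risks owned?",
    "Are the open high risks accepted by the right accountable owner?",
    "Are the rollback triggers clear?",
    "Are support teams briefed?",
    "Are application owners available during the migration window?",
    "Are monitoring and validation steps ready?",
    "Have user communications been sent?",
]


def go_no_go(answers: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (go, blockers); a missing answer is treated as a 'no'."""
    blockers = [q for q in GO_NO_GO_QUESTIONS if not answers.get(q, False)]
    return (not blockers, blockers)


answers = {q: True for q in GO_NO_GO_QUESTIONS}
answers["Are the rollback triggers clear?"] = False

go, blockers = go_no_go(answers)
print(go)        # False
print(blockers)  # ['Are the rollback triggers clear?']
```

Returning the list of blockers, rather than a bare yes or no, keeps the gate tied to a decision: each blocker still needs an owner who fixes it or accepts it.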

Practical checklist

Start the RAID log during discovery, not the week before go-live.
Separate risks, assumptions, issues, and decisions — they need different handling.
Assign owners by name or role, never "the team."
Use a review cadence and explicit escalation thresholds.
Keep mitigations concrete and dated.
Connect the top risks to migration-wave planning.
Review accepted risks after go-live so they don't become permanent exceptions.

Final recommendation

Risk analysis should help the project make better decisions, not satisfy a template. Keep the model simple, make ownership visible, escalate early, and use the RAID log to control production risk rather than to document it after the fact. The migrations I've seen succeed weren't the ones with the prettiest registers — they were the ones where a flagged risk reliably turned into a decision with an owner, well before cutover night.
