A few years ago, a financial services company asked us to help them roll back a cloud migration. Not pause it. Not optimize it. Roll it back.
This wasn’t a company that was new to the cloud. They had been running cloud-native workloads on AWS for years: new applications, innovation projects, critical business services built from scratch on modern architectures. That part of their cloud journey was working well.
The problem started when they decided to migrate their development and testing environments, dozens of IaaS-based servers that had been running on-premises for years. They engaged a different partner for the migration. The workloads were lifted and shifted successfully. The servers were running in AWS. By any migration metric, the project was a success.
Then operations took over.
The team responsible for running those environments had never been trained on cloud operations. They applied the same on-premises practices they had always used: manual patching cycles, ticket-based change requests, monitoring dashboards designed for static infrastructure. Nobody had defined how roles needed to change. Nobody had rethought incident response for a cloud context. Nobody had addressed the fundamental question of who owns what when the infrastructure is no longer in your data center.
Eight months later, after persistent operational friction, escalating costs, security gaps they couldn’t close, and a team that was frustrated and overwhelmed, they made the call: bring it back on-premises.
That rollback wasn’t a technology failure. The migration had worked. It was an operational readiness failure, and it cost them over a year of effort and budget with nothing to show for it.
The Day Two Blind Spot
This pattern shows up consistently across engagements, and it doesn’t only affect organizations that struggle with migration. I’ve seen companies execute technically sound migrations, with well-planned waves, clean cutovers, and regulatory-compliant landing zones, only to discover six months later that cloud costs were climbing beyond projections, incident response involved too many handoffs, and security posture was drifting because nobody owned it as a continuous practice.
Organizations invest months of planning, significant budget, and executive attention into the migration itself. They run discovery workshops, build business cases, execute readiness assessments, plan wave after wave of workload moves. The migration program has a dedicated team, a governance structure, and weekly steering committee reviews.
Then the migration ends. The program team disbands. The steering committee moves on to the next initiative. And the workloads that were carefully planned, tested, and migrated are handed over to an operations team that was never prepared to run them in the cloud.
The assumption, sometimes explicit but more often unspoken, is that operating in the cloud is just like operating on-premises, except the servers are somewhere else. That assumption is where the failure begins.
I wrote about the cost of migrating without a foundation in a previous post. But this is a different problem. That post was about organizations that weren’t ready to migrate. This one is about organizations that migrated well and still weren’t ready to operate.
The migration succeeded. The operations didn’t. And the gap between those two realities is where organizations lose the value they expected the cloud to deliver.
Five Operational Muscles That Migration Doesn’t Build
Cloud operations requires capabilities that migration programs rarely develop. Not because migration teams are negligent, but because the skills, processes, and organizational structures needed to operate in the cloud are fundamentally different from those needed to get there.
1. Monitoring That Understands Cloud-Native Behavior
Most organizations migrate their existing monitoring approach alongside their workloads. They set up CloudWatch dashboards that mirror what they had in Datadog or Nagios. They create alarms based on the same thresholds they used on-premises: CPU above 80%, disk above 90%, memory above 85%.
The problem is that cloud-native behavior doesn’t follow on-premises patterns. Auto-scaling groups are supposed to spike CPU. Serverless functions have no persistent memory to monitor. Managed databases handle storage differently than self-hosted ones. Container orchestration creates and destroys instances constantly.
When monitoring doesn’t understand the platform, it generates noise instead of signal. Teams either drown in false alarms and start ignoring them, or they set thresholds so high that real problems go undetected. Both outcomes are operational failures hiding behind a monitoring dashboard that looks active.
Effective cloud monitoring starts with observability: understanding what “normal” looks like for each workload in its cloud-native context, then instrumenting for deviations that actually matter. That requires operational knowledge that migration projects don’t build.
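To make the shift concrete, here is a minimal sketch of moving from a static CPU threshold to a learned baseline: a CloudWatch anomaly-detection alarm on request latency, which is usually what the workload’s users actually feel. The metric, load balancer dimension, and SNS topic are illustrative assumptions, not from any specific engagement.

```python
# Sketch: alarm on deviation from learned behavior instead of a fixed CPU percentage.
# Assumes boto3 credentials are configured; all names and ARNs are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-api-p99-latency-anomaly",      # hypothetical workload name
    ComparisonOperator="GreaterThanUpperThreshold",   # breach only above the learned band
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    ThresholdMetricId="band",
    TreatMissingData="notBreaching",
    Metrics=[
        {   # the metric that matters for this workload: latency, not host CPU
            "Id": "latency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "TargetResponseTime",
                    "Dimensions": [{"Name": "LoadBalancer", "Value": "app/orders-api/abc123"}],
                },
                "Period": 300,
                "Stat": "p99",
            },
        },
        {   # the "normal" band learned from the metric's history; 2 = band width in std deviations
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(latency, 2)",
        },
    ],
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:orders-oncall"],  # hypothetical topic
)
```

The point is not this particular alarm; it is that the alarm encodes what “normal” means for one workload, instead of reusing a threshold that was tuned for a static server fleet.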
2. Incident Response Without the Handoff Chain
On-premises incident response typically follows a tiered model: the NOC detects the issue, escalates to the infrastructure team, who escalates to the application team, who escalates to the vendor. Each handoff adds time, loses context, and dilutes accountability.
Organizations that migrate without rethinking incident response bring this handoff chain into the cloud. The result is familiar: a four-hour outage for a problem that should have been resolved in thirty minutes, because the people closest to the workload didn’t have the authority, the access, or the operational literacy to act.
Cloud incident response works when the team that builds the service owns its operational posture. That means developers who understand monitoring, on-call rotations that include the people who wrote the code, and runbooks that are maintained by the teams that use them. As I explored in a recent post on DevOps culture, renaming teams doesn’t change how they respond to failure. Ownership does.
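One small, mechanical expression of that ownership is routing alerts straight to the team that owns the workload rather than into a central NOC queue. The sketch below assumes alarms follow a team naming prefix and that the team has its own notification topic; both are conventions you would have to establish, not defaults.

```python
# Sketch: send alarm state changes directly to the owning team's topic, skipping the handoff chain.
# Assumes a naming convention (alarms prefixed "orders-") and a team-owned SNS topic; both hypothetical.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="orders-team-alarm-routing",
    EventPattern="""{
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "state": {"value": ["ALARM"]},
            "alarmName": [{"prefix": "orders-"}]
        }
    }""",
    State="ENABLED",
)

events.put_targets(
    Rule="orders-team-alarm-routing",
    Targets=[{
        "Id": "orders-oncall",
        # The team's own paging topic; its access policy must allow EventBridge to publish.
        "Arn": "arn:aws:sns:eu-west-1:123456789012:orders-oncall",
    }],
)
```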
3. Cost Management as an Operational Discipline
In on-premises environments, infrastructure costs are capital expenditures: you buy servers, you depreciate them, the finance team understands the model. In the cloud, infrastructure costs are operational expenditures that change every hour based on usage, configuration, and decisions made by individual engineers.
Most migration programs include a TCO analysis that projects cloud costs based on current utilization. That projection is almost always wrong, not because the analysis was bad, but because it assumes static behavior in a dynamic environment. The moment workloads are live, engineers start experimenting, scaling, provisioning new resources, and forgetting to clean up the ones they no longer need.
Without FinOps as an operational discipline, cloud costs drift. Reservations expire without renewal. Development environments run 24/7 when they’re used 8 hours a day. Storage accumulates because nobody owns the lifecycle policy. The bill grows not because the cloud is expensive, but because nobody is managing it as the living, dynamic system it is.
Across organizations ranging from startups to regulated financial institutions, the pattern is remarkably consistent: the ones that control cloud costs are the ones that treat cost management as a continuous operational practice, not a quarterly review.
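As a small illustration of what “continuous practice” means in code, here is a sketch of one of the simplest FinOps automations: stopping development instances outside working hours. It assumes a tagging convention (Environment=dev) that your organization may or may not have, and it would typically run on a schedule rather than by hand; the point is that the policy executes continuously, not in a quarterly review.

```python
# Sketch: stop tagged development instances outside working hours instead of paying for 24/7.
# Intended to run on a schedule (e.g. an evening cron or scheduled Lambda); tag keys are assumptions.
import boto3

ec2 = boto3.client("ec2")

def stop_idle_dev_instances() -> list[str]:
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=[
        {"Name": "tag:Environment", "Values": ["dev"]},           # assumed tagging convention
        {"Name": "instance-state-name", "Values": ["running"]},
    ])

    instance_ids = [
        instance["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

if __name__ == "__main__":
    print(f"Stopped: {stop_idle_dev_instances()}")
```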
4. Security Posture That Doesn’t Drift
Migration programs typically include a security workstream: landing zone design, network segmentation, IAM policies, encryption standards. The security posture at the moment of migration is usually solid.
Then it drifts.
Engineers create security group rules to troubleshoot a connectivity issue and forget to remove them. IAM policies accumulate permissions over time because it’s easier to add than to audit. Patching baselines that were defined during migration aren’t enforced because the operations team doesn’t have the automation in place. New services are deployed without going through the security review that the original workloads received.
Security in the cloud is not a state; it’s a practice. It requires continuous posture assessment, automated compliance checks, and a team that treats security findings as operational work, not as audit preparation. Organizations that achieved a strong security posture during migration and lost it within a year didn’t have a security problem. They had an operational discipline problem.
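Here is a minimal sketch of treating drift as operational work rather than audit preparation: a single AWS Config managed rule that flags security groups allowing unrestricted SSH, exactly the kind of “temporary troubleshooting rule” that tends to outlive the troubleshooting. It assumes an AWS Config recorder is already enabled in the account; the rule name is arbitrary.

```python
# Sketch: detect security-group drift continuously with an AWS Config managed rule,
# then pull non-compliant resources into the team's operational backlog.
# Assumes a Config recorder is already running; the rule name is a placeholder.
import boto3

config = boto3.client("config")

config.put_config_rule(ConfigRule={
    "ConfigRuleName": "no-unrestricted-ssh",
    "Source": {
        "Owner": "AWS",
        "SourceIdentifier": "INCOMING_SSH_DISABLED",  # managed rule: flags 0.0.0.0/0 on port 22
    },
})

# Findings become work items for the owning team, not an annual audit surprise.
for result in config.get_compliance_details_by_config_rule(
    ConfigRuleName="no-unrestricted-ssh",
    ComplianceTypes=["NON_COMPLIANT"],
)["EvaluationResults"]:
    print(result["EvaluationResultIdentifier"]["EvaluationResultQualifier"]["ResourceId"])
```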
5. Change Management That Matches Cloud Speed
On-premises change management was designed for a world where changes were infrequent, high-risk, and required maintenance windows. Change Advisory Board (CAB) meetings, change freezes, and manual approval chains made sense when a single deployment could affect the entire data center.
In the cloud, the deployment model is fundamentally different. Infrastructure changes can be tested, deployed, and rolled back in minutes. Services can be updated independently. Blue-green deployments and canary releases reduce the risk of individual changes to near zero.
But organizations that bring their on-premises change management process into the cloud create a bottleneck that negates one of the cloud’s primary advantages. Weekly CAB meetings for changes that could be safely deployed in minutes. Manual approval chains for infrastructure-as-code changes that have already been validated by automated tests. Change freezes that block the entire organization because one team has a compliance deadline.
The result is that teams either follow the process and lose agility, or they bypass the process and lose governance. Neither outcome is acceptable. Cloud change management needs to be automated, policy-driven, and fast enough to match the platform’s capabilities. That requires investment in Infrastructure as Code, automated testing, and deployment pipelines: capabilities that, in our experience delivering AWS CloudFormation and Control Tower implementations, are consistently underinvested in during migration programs.
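For illustration, here is a sketch of what policy-driven approval can look like with CloudFormation change sets: low-risk changes deploy automatically, destructive ones are held for a human. The stack name, the specific risk policy, and the assumption that the template has already passed automated tests and that the stack exists are all placeholders for decisions each organization has to make for itself.

```python
# Sketch: replace the weekly CAB with a policy applied to every infrastructure-as-code change.
# Assumes the stack already exists and the template was validated in CI; names are hypothetical.
import boto3

cfn = boto3.client("cloudformation")

def apply_change(stack_name: str, template_body: str, change_set_name: str) -> str:
    cfn.create_change_set(
        StackName=stack_name,
        TemplateBody=template_body,
        ChangeSetName=change_set_name,
    )
    cfn.get_waiter("change_set_create_complete").wait(
        StackName=stack_name, ChangeSetName=change_set_name
    )

    changes = cfn.describe_change_set(
        StackName=stack_name, ChangeSetName=change_set_name
    )["Changes"]

    # The "policy": anything that removes or may replace a resource needs a human; everything else ships.
    risky = [
        c for c in changes
        if c["ResourceChange"]["Action"] == "Remove"
        or c["ResourceChange"].get("Replacement") in ("True", "Conditional")
    ]
    if risky:
        return "held-for-review"

    cfn.execute_change_set(StackName=stack_name, ChangeSetName=change_set_name)
    return "auto-deployed"
```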
Why This Keeps Happening
The structural reason is straightforward: migration programs and operations programs have different sponsors, different teams, different success metrics, and different timelines.
Migration success is measured by workloads moved, downtime avoided, and budget adherence. Those are project metrics. They have a start date and an end date. When the project ends, success is declared.
Operational success is measured by availability, incident response time, cost efficiency, security posture, and change velocity. Those are continuous metrics. They don’t have an end date. And they require sustained investment in people, processes, and automation that migration budgets rarely include.
Having earned both the AWS Cloud Operations Competency and the AWS Migration & Modernization Competency, we operate on both sides of this divide every day. The migration side is well understood: there are frameworks, tools, financial incentives, and a mature partner ecosystem to support it. The operations side is where organizations are left to figure things out on their own, often after the migration budget is spent and the executive attention has moved elsewhere.
The gap isn’t technical. The cloud provides every tool you need to operate well: CloudWatch, Config, Security Hub, Cost Explorer, Systems Manager, EventBridge. The gap is organizational. It’s the absence of a deliberate transition from “we moved to the cloud” to “we operate in the cloud,” with the people, processes, and practices that transition requires.
What Operational Readiness Actually Requires
I outlined a 90-day framework for building cloud operational readiness in a previous post. The specifics matter, but the principle is simple: operational readiness is not a phase that happens after migration. It’s a capability that must be built alongside it.
That means:
- Monitoring and observability designed for cloud-native behavior from day one, not retrofitted after the first outage.
- Incident response processes that give workload teams ownership and authority, not just escalation paths.
- FinOps embedded as a continuous practice with clear accountability, not a quarterly cost review.
- Security posture management automated and continuous, not a point-in-time assessment.
- Change management that leverages Infrastructure as Code and automated pipelines, not manual approval chains designed for a different era.
And underneath all of it, the people dimension. Operational capability is not a tooling problem. It’s a skills problem, a culture problem, and a leadership problem. The qualities that make operations teams effective (the discipline to maintain standards, the adaptability to learn a new platform, the ownership to act without waiting for escalation) are the same qualities that define effective teams in every context. You can’t automate judgment. You can’t migrate culture. You have to build it.
The Question Worth Asking
If your organization has completed a cloud migration, or is in the middle of one, here’s the question that matters most:
When the migration program ends, who owns the operational posture of your cloud environment? Not who monitors the dashboards. Not who responds to tickets. Who owns the continuous improvement of how you operate, secure, and optimize your cloud?
If the answer is unclear, or if the answer is “the same team that ran our data center,” then your migration may have succeeded, but your cloud journey is just beginning.
And the longer you wait to build that operational muscle, the harder it becomes. Because unlike migration, which is a project with a defined scope, operations is a discipline that either improves continuously or degrades silently.
The cloud doesn’t reward organizations that arrive. It rewards organizations that operate.
Ricardo
