The multi-cloud conversation in tech tends to happen at two altitudes: the architecture diagram altitude, where everything is clean and portable, and the 2am incident altitude, where you discover that your EKS cluster's IAM assumptions are fundamentally different from your GKE cluster's Workload Identity setup and the deployment tool that works perfectly on one doesn't understand the other.
I use multi-cloud labs to study the operational seams between AWS, GCP, Azure, and OCI. The lessons below come from independent lab work, public patterns, and technology behavior that shows up when platforms cross provider boundaries.
Why Multi-Cloud Happens (It's Usually Not Strategic)
Before the lessons, some honesty about why organizations end up multi-cloud, because it's usually not what the strategy deck says.
Acquisition. Company buys another company. That company runs on GCP. Now you're multi-cloud. Consolidation takes years, if it happens at all.
Vendor lock-in paranoia. Engineering leadership decides that running everything on one cloud is a risk. This is a legitimate concern at a certain scale; it's a premature optimization for most organizations.
Specific services that don't exist on your primary cloud. GCP's BigQuery is meaningfully better than its AWS equivalent for certain workloads. Some teams use it for that reason alone, then end up with a data pipeline that straddles two clouds.
SaaS dependencies that run on different clouds. Your data enrichment vendor runs on GCP. Your ML vendor runs on Azure. The networking to connect them to your AWS-primary infrastructure is your problem now.
The real-world multi-cloud organization usually has a primary cloud (most workloads), a secondary cloud (specific use cases or an acquired company's infrastructure), and a collection of SaaS products that technically run on various clouds.
Lesson 1: Networking Is Where Complexity Lives
The hardest part of multi-cloud is networking. Not at the concept level — connect VPCs with VPN or Direct Connect and you're done — but at the operational level, where routing decisions, firewall rules, DNS resolution, and latency characteristics create a combinatorial mess.
Key pain points that show up in multi-cloud designs:
Asymmetric latency kills assumptions. AWS → GCP traffic goes over the public internet unless you've set up dedicated private connectivity (AWS Direct Connect on one side, GCP Cloud Interconnect on the other, typically meeting at a colocation facility or through a partner). Depending on where your regions are and what path the traffic takes, you can have latency variance that makes synchronous cross-cloud calls unreliable. Services that depend on sub-10ms latency should not span clouds.
DNS becomes complicated. Private DNS resolution inside a VPC is straightforward. Cross-cloud private DNS is not. If service A on AWS needs to reach service B on GCP using a private hostname, you need a resolution path — whether that's centralized DNS with forwarding rules, a service mesh that handles discovery, or explicit endpoint configuration. The number of places this can silently break is non-trivial.
Security group equivalence is a lie. AWS security groups, GCP firewall rules, Azure NSGs, OCI security lists — conceptually similar, operationally different. The Terraform abstractions paper over some of this, but the mental model for "how does traffic flow" is different on each cloud, and your platform team needs to understand each one, not just the abstraction.
My practical advice: treat cross-cloud network calls as inherently less reliable than same-cloud calls. Build timeouts, retries, and circuit breakers explicitly for cross-cloud traffic. Monitor cross-cloud latency as a first-class metric.
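Here is a minimal sketch of that advice, using only the Python standard library. The class and function names are my own, and the thresholds are placeholders rather than recommendations. It assumes the wrapped callable already enforces its own timeout (for example, an HTTP client call with a `timeout` argument).

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are short-circuited."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a trial call
    once `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("cross-cloud dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(fn, breaker, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Retry `fn` with exponential backoff and jitter, going through the breaker."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker gave up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

One breaker per remote dependency keeps a degraded GCP endpoint from dragging down calls to an unrelated Azure one. The breaker's open/closed state is also a reasonable signal to export alongside cross-cloud latency.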
Lesson 2: Identity and Access Is a Full-Time Problem
Each cloud has a different identity model. AWS has IAM roles with IRSA for Kubernetes workloads. GCP has Workload Identity and Service Accounts. Azure has Managed Identities. OCI has Instance Principals and Dynamic Groups.
When you're multi-cloud, the questions get complicated:
- How does an AWS workload assume a GCP Service Account? (Federation with Workload Identity pools)
- How do you rotate credentials that services on different clouds share?
- How do you audit access across all four clouds in a unified way?
- How do you enforce least-privilege when each cloud's IAM model expresses permissions differently?
A strong lab pattern is centralizing secrets management in HashiCorp Vault because it gives teams a single plane for credential management across clouds. Vault can authenticate workloads using their native cloud identity (AWS IAM, GCP Workload Identity, Azure AD, now Microsoft Entra ID) and issue short-lived credentials or tokens for other clouds. The complexity of the identity federation is hidden from the applications: they ask Vault for credentials, and Vault handles the cross-cloud translation.
This is not a simple setup. But the alternative — managing secrets per-cloud with per-cloud tooling — creates an audit and governance nightmare at scale.
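From the application's side, the pattern reduces to "ask for a credential, reuse it until it is close to expiry, then ask again." The sketch below shows only that client-side piece. `Lease`, `LeaseCache`, and the `fetch` callable are hypothetical names of mine; in a real setup `fetch` would wrap the Vault login and secrets-engine request, which I have deliberately left out rather than guess at a specific configuration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    token: str
    expires_at: float  # seconds, on the same clock the cache uses


class LeaseCache:
    """Caches a short-lived credential and renews it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Lease],      # hypothetical: wraps the broker call
        renew_margin: float = 60.0,      # renew this many seconds early
        clock: Callable[[], float] = time.time,
    ):
        self._fetch = fetch
        self._renew_margin = renew_margin
        self._clock = clock
        self._lease: Optional[Lease] = None

    def token(self) -> str:
        lease = self._lease
        if lease is None or self._clock() >= lease.expires_at - self._renew_margin:
            # No credential yet, or it is about to expire: fetch a fresh one.
            lease = self._fetch()
            self._lease = lease
        return lease.token
```

Because the application only ever calls `token()`, rotation becomes a property of the lease lifetime instead of a coordinated event across clouds.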
Lesson 3: Terraform Works, Until It Doesn't
Terraform is the practical choice for multi-cloud infrastructure-as-code. The provider ecosystem is comprehensive. The state model is portable. HCP Terraform (formerly Terraform Cloud) gives you a managed control plane.
What Terraform doesn't solve is the conceptual incompatibility between cloud resource models. You can manage an S3 bucket and a GCS bucket and an Azure Blob container with Terraform modules, but they're not the same thing. The permission model is different. The consistency guarantees are different. The event notification model is different. If you abstract them behind a "blob storage module," you end up with the lowest common denominator of functionality or with a leaky abstraction that exposes cloud-specific differences.
My view: abstract for deployment consistency (using Terraform modules to enforce tagging standards, naming conventions, network configuration), not for portability. The promise of "write once, run on any cloud" infrastructure is mostly fiction for anything beyond the most basic resources.
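One way to enforce a tagging standard across providers without pretending the resources are interchangeable is to check the plan, not the module. The sketch below walks the `resource_changes` list from `terraform show -json` output and reports resources missing required tag keys. The required keys are an assumed example standard, and the provider-to-attribute mapping is deliberately small. It only looks at the resource's own tag attribute, so tags injected through provider-level defaults would need separate handling.

```python
# Assumed example standard; substitute your organization's required keys.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}

# Attribute that carries tags or labels on each provider's resources.
TAG_ATTRIBUTE_BY_PREFIX = {
    "aws_": "tags",
    "azurerm_": "tags",
    "google_": "labels",
    "oci_": "freeform_tags",
}


def tag_attribute(resource_type):
    """Return the tag attribute name for a resource type, or None if unknown."""
    for prefix, attr in TAG_ATTRIBUTE_BY_PREFIX.items():
        if resource_type.startswith(prefix):
            return attr
    return None


def missing_tags(plan):
    """Return {resource address: sorted missing tag keys} for a parsed plan."""
    violations = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        attr = tag_attribute(rc.get("type", ""))
        if after is None or attr is None:
            continue  # resource is being deleted, or provider is not covered
        if attr not in after:
            continue  # this resource type does not expose the tag attribute
        present = set(after.get(attr) or {})
        missing = REQUIRED_TAGS - present
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations
```

Running `missing_tags(json.load(f))` against the plan JSON in CI and failing on a non-empty result gives the same guardrail on every cloud while leaving each module free to stay cloud-specific.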
Lesson 4: AI Agents Are Actually Well-Suited for Multi-Cloud Ops
This is the pattern I'm most optimistic about. Multi-cloud operations are expensive because human expertise is specialized. Your engineer who knows AWS deeply probably doesn't have the same depth on GCP. Rotating on-call coverage across a multi-cloud environment means your on-call engineer might get paged for a GCP incident they're less equipped to handle.
LLM-powered incident response agents can operate across clouds without the expertise bottleneck. The agent that knows "when you see a spike in GCS error rates, check the associated service account's recent activity and look for quota exhaustion events in the same region" can apply that knowledge regardless of which cloud engineer is on-call.
A useful lab pattern is an incident response agent with tools spanning AWS, GCP, Azure, and OCI lab environments:
```python
# Each `...` stands for that tool's description and input schema, elided here.
tools = [
    # AWS tools
    {"name": "query_cloudwatch_logs", ...},
    {"name": "describe_ec2_instances", ...},
    {"name": "get_rds_events", ...},
    # GCP tools
    {"name": "query_cloud_logging", ...},
    {"name": "get_gke_cluster_events", ...},
    {"name": "check_gcp_quotas", ...},
    # Azure tools
    {"name": "query_azure_monitor", ...},
    {"name": "get_aks_diagnostics", ...},
    # OCI tools
    {"name": "query_oci_logging", ...},
    {"name": "get_oci_metrics", ...},
    # Cross-cloud tools
    {"name": "get_cross_cloud_latency", ...},
    {"name": "check_vault_secret_health", ...},
]
```

The agent doesn't need to know in advance which cloud a problem is on; it discovers that through the investigation. It queries the alert context, determines which cloud and region, and then selects the appropriate tools. The on-call engineer gets a diagnosis that spans all the relevant surfaces, regardless of their cloud-specific expertise.
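The selection step can be made explicit with a small amount of routing code before the model is involved. This sketch assumes the alert arrives as a dictionary with an optional `cloud` field, which is my assumption about the alert shape. The tool names are the ones from the list above, and the fallback of exposing every tool when the provider is unknown is one possible design choice, not the only one.

```python
TOOLS_BY_CLOUD = {
    "aws": ["query_cloudwatch_logs", "describe_ec2_instances", "get_rds_events"],
    "gcp": ["query_cloud_logging", "get_gke_cluster_events", "check_gcp_quotas"],
    "azure": ["query_azure_monitor", "get_aks_diagnostics"],
    "oci": ["query_oci_logging", "get_oci_metrics"],
}
CROSS_CLOUD_TOOLS = ["get_cross_cloud_latency", "check_vault_secret_health"]


def select_tools(alert):
    """Pick tool names for an alert; expose everything if the cloud is unknown."""
    cloud = str(alert.get("cloud", "")).lower()
    if cloud in TOOLS_BY_CLOUD:
        # Known provider: its own tools plus the cross-cloud ones.
        return TOOLS_BY_CLOUD[cloud] + CROSS_CLOUD_TOOLS
    # Unknown provider: let the investigation narrow it down.
    all_tools = [name for names in TOOLS_BY_CLOUD.values() for name in names]
    return all_tools + CROSS_CLOUD_TOOLS
```

Narrowing the tool list up front keeps the prompt smaller when the alert already names a provider, while the cross-cloud tools stay available for the cases where the root cause sits on the seam.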
Lesson 5: Egress Costs Will Surprise You
Cloud providers charge for data leaving their network. Within a cloud, cross-AZ and cross-region data transfer is often billed as well. Between clouds, you typically pay internet egress rates unless a dedicated interconnect changes the pricing.
In a multi-cloud architecture, this cost can be material — and it's easy to miss during design because you're thinking about compute and storage costs, not bandwidth. A service that does significant data movement between clouds can generate substantial egress costs that dwarf its compute bill.
Mitigations: minimize cross-cloud data movement by design (process data where it lives; don't move raw data across clouds for processing), use direct interconnect where the volume justifies it, use a FinOps tagging strategy that attributes egress costs to the services generating them (so teams feel the cost of their cross-cloud calls).
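The attribution piece is simple arithmetic once flow records carry a service tag. The sketch below assumes a hypothetical record shape (`service`, `src_cloud`, `dst_cloud`, `bytes`) and a placeholder per-GB rate that is not a quoted price. Real rates vary by provider, region, and volume tier.

```python
from collections import defaultdict

# Placeholder USD per GB, for illustration only; not a quoted provider price.
ASSUMED_RATE_PER_GB = 0.09


def egress_cost_by_service(records, rate_per_gb=ASSUMED_RATE_PER_GB):
    """Sum cross-cloud egress per tagged service and estimate its cost."""
    gb_by_service = defaultdict(float)
    for rec in records:
        if rec["src_cloud"] == rec["dst_cloud"]:
            continue  # same-cloud traffic is priced differently; ignored here
        service = rec.get("service") or "untagged"
        gb_by_service[service] += rec["bytes"] / 1e9
    return {svc: round(gb * rate_per_gb, 2) for svc, gb in gb_by_service.items()}
```

The `untagged` bucket is worth watching on its own: if it grows, the tagging strategy has a gap, and nobody is feeling the cost of those calls.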
What I'd Tell Someone Starting a Multi-Cloud Architecture
Don't start multi-cloud by choice. If you're greenfield and choosing your cloud provider, pick one and go deep. The operational complexity of multi-cloud is real and it's a cost you pay forever. Portability is less valuable than you think.
If you're already multi-cloud, invest in the control plane. Unified observability (a single place to query logs and metrics across clouds), unified identity (Vault or equivalent), and unified deployment (Terraform with consistent module standards) are the foundation. Without these, you're operating multiple siloed clouds, not a multi-cloud system.
Treat cross-cloud calls as external. The same reliability engineering you'd apply to calls to third-party APIs (timeouts, retries with backoff, circuit breakers, fallback behavior) applies to cross-cloud calls. They are slower and less reliable than same-cloud calls.
Use AI to close the expertise gap. A well-instrumented, well-prompted incident agent is a force multiplier for on-call coverage in a multi-cloud environment. The expertise bottleneck is real; the agent approach is a tractable solution.
*Zak Hassan is a Staff SRE with experience operating production systems across AWS, GCP, Azure, and OCI. Find him at zakhassan.com or on LinkedIn.*