Staff SRE / Platform Engineer / AI Infrastructure
Staff SRE / Platform Engineer for high-scale reliability, identity, Kubernetes, and AI infrastructure.
Zak Hassan is a staff-level site reliability engineer and production engineer with 10+ years building and operating mission-critical backend infrastructure at internet scale: identity systems processing hundreds of millions of authentications, enterprise platforms serving 400M+ global users, social platforms serving 200M+ users, Kubernetes fleets, real-time data pipelines, and GPU-backed ML infrastructure. I turn high-pressure infrastructure into observable, automated, cost-efficient systems.
Reliability Signal
What hiring managers should know in the first minute.
I am useful when the system is too important for guesswork: identity paths, platform migrations, Kubernetes fleets, cloud cost pressure, observability gaps, incident prevention, and AI infrastructure that needs real operational judgment.
Reliability measured by user pain, not dashboard decoration.
I build SLOs, burn-rate alerts, synthetic checks, and release gates that tell teams when customers are actually at risk.
Blast-radius thinking for authentication and access paths.
My strongest reliability work sits around identity, backend platforms, migrations, and the operational edges where small mistakes become global incidents.
Large migrations without turning users into test traffic.
I like progressive cutovers, reversible changes, shadow validation, data consistency checks, and visible rollback criteria.
AI used as operational leverage, not theater.
I use independent lab work to pressure-test agents, log analysis, cloud-cost telemetry, and triage workflows before treating them as operational patterns.
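The migration habits above (progressive cutovers, shadow validation, visible rollback criteria) can be pictured as a small gate function. This is a minimal sketch under stated assumptions, not a description of any production system: the traffic steps, the `StepHealth` fields, and both thresholds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutover schedule: percent of traffic on the new path.
STEPS = (1, 5, 25, 50, 100)


@dataclass(frozen=True)
class StepHealth:
    """Signals observed while the current step was serving traffic."""
    error_rate: float     # fraction of failed requests on the new path
    mismatch_rate: float  # fraction of shadow comparisons that disagreed


def next_traffic_percent(
    current: int,
    health: StepHealth,
    max_error_rate: float = 0.001,   # assumed rollback criterion
    max_mismatch_rate: float = 0.0,  # assumed: any data mismatch blocks
) -> int:
    """Return the next traffic percentage, or 0 to roll back."""
    # Rollback criteria are explicit and checked before any advance.
    if health.error_rate > max_error_rate:
        return 0
    if health.mismatch_rate > max_mismatch_rate:
        return 0
    # Advance to the next larger step; stay put once fully cut over.
    later = [step for step in STEPS if step > current]
    return later[0] if later else current


if __name__ == "__main__":
    print(next_traffic_percent(5, StepHealth(0.0002, 0.0)))
    print(next_traffic_percent(25, StepHealth(0.0002, 0.01)))
```

Returning 0 on any breached criterion keeps the change reversible by default: advancing has to be earned at every step, while rolling back needs no extra decision.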
Scale Proof
Career signal for teams operating at serious scale.
These numbers come from previous organizations, public speaking, and resume-level experience, with the work framed around public context recruiters can evaluate quickly.
Hundreds of millions of authentication events protected through SLOs, progressive delivery, synthetic monitoring, and incident prevention patterns.
400M+ global users served by previous enterprise platform work across Kubernetes, service mesh, and multi-cloud systems.
200M+ users supported during zero-downtime Kafka, EC2, and EKS migrations with no data loss.
200+ production microservices migrated to Kubernetes with Istio service mesh, mTLS, and distributed observability.
Zero downtime for critical Kafka, EC2, EKS, and Kubernetes migrations supporting large production user bases.
80%+ infrastructure cost reductions through fleet right-sizing, serverless GPU capacity, and automated cloud waste detection.
production deployment automation using validation gates, rollout safety, and human-error reduction patterns.
Petabyte-scale data processing, object storage, streaming, model training, and GPU infrastructure reliability experience.
Operating Ledger
The pattern across my career: reduce uncertainty under load.
The work has ranged from user-facing identity reliability to Kubernetes migrations, service mesh, GPU-backed ML infrastructure, data platforms, and multi-cloud cost control.
- Hundreds of millions of authentications across critical identity and backend services.
- 400M+ global users on previous enterprise platform infrastructure.
- 200M+ user social platform migration from on-prem VMware and EC2 toward AWS and EKS.
- 200+ production microservices migrated to Kubernetes with Istio, mTLS, and observability.
- 80%+ cost reductions through multi-cloud resource scanners, fleet right-sizing, and serverless GPU capacity.
- Petabyte-scale data and ML platform experience with Ceph, Spark, Airflow, MLFlow, SageMaker, Snowflake, and Flink.
- Conference talks at KubeCon EU, Spark Summit, and OpenShift Commons on Prometheus, GPU monitoring, MLFlow, and Kubernetes.
SLO Thinking
The job is not to worship uptime. The job is to make risk visible early.
I build systems around practical reliability loops: service-level objectives, telemetry that answers operational questions, release health, capacity forecasts, and incident learning that turns into better architecture.
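To make "risk visible early" concrete, here is a minimal sketch of a multi-window burn-rate check. The `Window` type, the function names, and the sample numbers are hypothetical; the 14.4 threshold is the commonly cited value for spending 2% of a 30-day error budget in one hour, used here only as an illustrative default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""
    name: str
    errors: int
    requests: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    if window.requests == 0:
        return 0.0  # no traffic means no budget was spent
    error_budget = 1.0 - slo_target
    return (window.errors / window.requests) / error_budget


def should_page(
    long_window: Window,
    short_window: Window,
    slo_target: float,
    threshold: float = 14.4,  # assumed default, tune per service
) -> bool:
    """Page only when both windows burn budget faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    one_hour = Window("1h", errors=200, requests=10_000)
    five_min = Window("5m", errors=20, requests=1_000)
    print(should_page(one_hour, five_min, slo_target=0.999))
```

A burn rate of 1.0 means the service would spend exactly its budget over the SLO period; anything well above that is customer risk worth interrupting someone for.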
Career pipeline
Experience as a production commit graph.
Staff-level Production Engineering
Identity reliability, production architecture, incident leadership
Owns reliability patterns for critical identity and backend services processing massive authentication volume, with SLOs, progressive delivery, synthetic monitoring, and systemic incident prevention.
Workday
Kubernetes migration, developer platforms, production standards
Led service migrations to Kubernetes, built self-service deployment tooling, standardized Helm charts, Prometheus sidecars, secret management, and capacity-aware autoscaling.
SAP
400M+ user enterprise platform, service mesh, multi-cloud cost control
Migrated 200+ production microservices to Kubernetes, implemented Istio service mesh, built Terratest-backed Terraform modules, and drove major multi-cloud spend reductions.
Hootsuite and Red Hat
200M+ user migrations, GPU ML infrastructure, open-source platforms
Delivered zero-downtime Kafka and EKS migrations, operated Kubernetes ML platforms, built serverless GPU fleet management, Spark-on-Kubernetes, Ceph, Airflow, MLFlow, and anomaly detection systems.
Operating stack
Tools chosen for leverage under load.
Kubernetes
Design multi-tenant platforms with predictable rollouts, policy boundaries, and useful failure modes.
AWS
Shape global primitives around IAM, networking, queues, compute, and observability with cost-aware defaults.
Terraform
Turn infrastructure into reviewed, tested, reusable modules instead of tribal CLI archaeology.
Proxmox
Run durable lab and edge clusters for experiments that deserve production-grade feedback loops.
LXC
Use lightweight isolation for fast, reproducible services without wasting compute on ceremony.
Tailscale
Build private mesh access paths that keep operators fast and exposed surfaces small.
Prometheus
Model systems as signals: useful SLOs, actionable alerts, and dashboards that reduce uncertainty.
Cloudflare
Push compute, cache, and AI inference closer to users for low-latency edge experiences.
Curated Signal Paths
Choose the engineering surface you care about.
The writing is organized around the operating problems senior infrastructure teams keep running into: reliability, Kubernetes platforms, observability, AI infrastructure, and identity paths.
Site Reliability Engineering
Reliability engineering research on SLOs, incident learning, capacity, safe deployments, operational risk, and production engineering judgment.
Kubernetes and Platform Engineering
Kubernetes, service mesh, GitOps, Terraform, platform engineering, and cloud-native migration patterns for serious infrastructure teams.
Observability and Incident Learning
Prometheus, Grafana, OpenTelemetry, tracing, metrics, alerting, and operational feedback loops that reduce uncertainty during incidents.
AI Infrastructure and Operations
Independent research on AI infrastructure, LLM operations, AI agents, model serving, GPU telemetry, and reliability for AI systems.
Identity Reliability
Reliability thinking for authentication, authorization, secrets, zero trust, blast radius, and critical access paths.
Cloud Cost and Capacity
FinOps, forecasting, capacity planning, right-sizing, edge architecture, and cost-aware reliability patterns for cloud platforms.
Public Proof
Pillar posts and talks that show how I reason.
These are the pieces I would send to a technical peer who wants to understand my operating model: reliability loops, migration safety, observability, AI operations, and capacity thinking.
CI/CD Pipeline Reliability: Flaky Tests, Build Observability, and Deployment Gates
Most engineering teams invest heavily in making user-facing services reliable — SLOs, blog posts, on-call rotations, incident reviews.
Capacity Forecasting for SREs: Time Series Models, Anomaly Detection, and Automated Scaling Triggers
Most capacity planning conversations start the same way: someone pulls up a Grafana dashboard, draws a mental line through the last thirty days of CPU or memory data, and declares "capacity will hit the limit in about six weeks." That estimate gets entered...
Zero-Downtime Deployments: Rolling Updates, Blue-Green, and Traffic Shifting
Every engineering team eventually puts a "the team has zero-downtime deployments" slide in its reliability review. Then they get paged at 2am because a rolling update dropped three percent of requests during a high-traffic window, and the slide quietly...