Staff SRE / Platform Engineer / AI Infrastructure

Staff SRE / Platform Engineer for high-scale reliability, identity, Kubernetes, and AI infrastructure.

I'm Zak Hassan, a staff-level site reliability engineer and production engineer with 10+ years building and operating mission-critical backend infrastructure at internet scale: identity systems processing hundreds of millions of authentications, enterprise platforms serving 400M+ global users, social platforms serving 200M+ users, Kubernetes fleets, real-time data pipelines, and GPU-backed ML infrastructure. I turn high-pressure infrastructure into observable, automated, cost-efficient systems.


live-telemetry.zakhassan.edge: SLO OK

Auth events: 100M+ (identity reliability)
Enterprise reach: 400M+ (global users)
Migration downtime: 0 (critical systems)
Cost reduction: 80%+ (fleet optimization)

Edge latency distribution: p99 87ms

iad03: green
yyz01: canary
sfo08: green
fra11: canary
nrt04: green
syd02: canary

Reliability Signal

What hiring managers should know in the first minute.

I am useful when the system is too important for guesswork: identity paths, platform migrations, Kubernetes fleets, cloud cost pressure, observability gaps, incident prevention, and AI infrastructure that needs real operational judgment.


SLO programs

Reliability measured by user pain, not dashboard decoration.

I build SLOs, burn-rate alerts, synthetic checks, and release gates that tell teams when customers are actually at risk.
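As an illustration of the burn-rate idea, here is a minimal sketch rather than the tooling used in any of the roles described. It assumes a request-based SLO; the `Window` shape, the function names, and the default threshold of 14.4 (a commonly cited paging value for a 1-hour window against a 30-day SLO) are all assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent in this window.

    1.0 means the budget would be used up exactly at the end of the SLO
    period; values above 1.0 mean it would run out early.
    """
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = window.errors / window.total
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    return error_ratio / budget


def should_page(long: Window, short: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    Requiring both keeps a brief spike (short window only) or an already
    recovered incident (long window only) from paging anyone.
    """
    return (burn_rate(long, slo_target) >= threshold
            and burn_rate(short, slo_target) >= threshold)
```

With a 99.9% target, 42 errors in 100,000 requests gives a burn rate of roughly 0.42x, which is well inside budget and should not page.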

Identity reliability

Blast-radius thinking for authentication and access paths.

My strongest reliability work sits around identity, backend platforms, migrations, and the operational edges where small mistakes become global incidents.

Migration leadership

Large migrations without turning users into test traffic.

I like progressive cutovers, reversible changes, shadow validation, data consistency checks, and visible rollback criteria.
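The cutover pattern can be sketched as a loop that shifts traffic in steps and checks explicit rollback criteria after each one. This is a hedged illustration only: `StepHealth`, `RollbackCriteria`, `run_cutover`, and the default limits are hypothetical names and values, and the `set_weight` and `observe` callbacks stand in for whatever traffic router and telemetry a real migration would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StepHealth:
    """Signals gathered after a traffic step (hypothetical shape)."""
    error_rate_delta: float  # new-path error rate minus old-path, as a fraction
    mismatch_ratio: float    # share of shadow-compared responses that differ


@dataclass(frozen=True)
class RollbackCriteria:
    """Limits written down before the cutover starts (assumed values)."""
    max_error_rate_delta: float = 0.001
    max_mismatch_ratio: float = 0.0


def run_cutover(steps: Sequence[int],
                set_weight: Callable[[int], None],
                observe: Callable[[int], StepHealth],
                criteria: RollbackCriteria) -> bool:
    """Shift traffic step by step; roll back on the first breached criterion.

    Returns True if every step passed, False if the cutover was rolled back.
    """
    for percent in steps:
        set_weight(percent)        # send `percent` of traffic to the new path
        health = observe(percent)  # caller decides how long to soak and measure
        if (health.error_rate_delta > criteria.max_error_rate_delta
                or health.mismatch_ratio > criteria.max_mismatch_ratio):
            set_weight(0)          # reversible: all traffic back to the old path
            return False
    return True
```

Keeping the criteria in a plain data object makes them easy to review before the migration and easy to point at when a rollback happens.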

AI operations

AI used as operational leverage, not theater.

I use independent lab work to pressure-test agents, log analysis, cloud-cost telemetry, and triage workflows before treating them as operational patterns.

Scale Proof

Career signal for teams operating at serious scale.

These numbers come from work at previous organizations, public talks, and resume-level experience, framed around public context that recruiters can evaluate quickly.

Authentication scale: 100M+

authentication events protected through SLOs, progressive delivery, synthetic monitoring, and incident prevention patterns.

Enterprise platform reach: 400M+

global users served by previous enterprise platform work across Kubernetes, service mesh, and multi-cloud systems.

Social platform scale: 200M+

users supported during zero-downtime Kafka, EC2, and EKS migrations with no data loss.

Microservice migration: 200+

production microservices migrated to Kubernetes with Istio service mesh, mTLS, and distributed observability.

Migration safety: 0 downtime

for critical Kafka, EC2, EKS, and Kubernetes migrations supporting large production user bases.

Cost efficiency: 80%+

infrastructure cost reductions through fleet right-sizing, serverless GPU capacity, and automated cloud waste detection.

Deployment automation: 95%

production deployment automation using validation gates, rollout safety, and human-error reduction patterns.

Data and ML systems: PB-scale

data processing, object storage, streaming, model training, and GPU infrastructure reliability experience.

Operating Ledger

The pattern across my career: reduce uncertainty under load.

The work has ranged from user-facing identity reliability to Kubernetes migrations, service mesh, GPU-backed ML infrastructure, data platforms, and multi-cloud cost control.

Hundreds of millions of authentications across critical identity and backend services.
400M+ global users on previous enterprise platform infrastructure.
200M+ user social platform migration from on-prem VMware and EC2 toward AWS and EKS.
200+ production microservices migrated to Kubernetes with Istio, mTLS, and observability.
80%+ cost reductions through multi-cloud resource scanners, fleet right-sizing, and serverless GPU capacity.
Petabyte-scale data and ML platform experience with Ceph, Spark, Airflow, MLflow, SageMaker, Snowflake, and Flink.
Conference talks at KubeCon EU, Spark Summit, and OpenShift Commons on Prometheus, GPU monitoring, MLflow, and Kubernetes.

slo-control-plane: healthy

burn_rate_1h: 0.42x
canary_error_delta: +0.03%
rollback_readiness: verified

SLO Thinking

The job is not to worship uptime. The job is to make risk visible early.

I build systems around practical reliability loops: service-level objectives, telemetry that answers operational questions, release health, capacity forecasts, and incident learning that turns into better architecture.

01 Define the user-visible failure mode before choosing the metric.
02 Make every rollout observable, reversible, and boring to operate.
03 Treat cost, capacity, security, and reliability as one system.
04 Write the blog post after the lab work, not instead of the lab work.

Career pipeline

Experience as a production commit graph.

commit 9f42c1a (HEAD -> main)
Staff-level Production Engineering
Identity reliability, production architecture, incident leadership
Owns reliability patterns for critical identity and backend services processing massive authentication volume, with SLOs, progressive delivery, synthetic monitoring, and systemic incident prevention.
commit 61aa8e0 (main~1)
Workday
Kubernetes migration, developer platforms, production standards
Led service migrations to Kubernetes, built self-service deployment tooling, standardized Helm charts, Prometheus sidecars, secret management, and capacity-aware autoscaling.
commit 2d71b90 (main~2)
SAP
400M+ user enterprise platform, service mesh, multi-cloud cost control
Migrated 200+ production microservices to Kubernetes, implemented Istio service mesh, built Terratest-backed Terraform modules, and drove major multi-cloud spend reductions.
commit c0ffee7 (main~3)
Hootsuite and Red Hat
200M+ user migrations, GPU ML infrastructure, open-source platforms
Delivered zero-downtime Kafka and EKS migrations, operated Kubernetes ML platforms, and built serverless GPU fleet management, Spark-on-Kubernetes, Ceph, Airflow, MLflow, and anomaly detection systems.

Operating stack

Tools chosen for leverage under load.

Kubernetes

Design multi-tenant platforms with predictable rollouts, policy boundaries, and useful failure modes.

AWS

Shape global primitives around IAM, networking, queues, compute, and observability with cost-aware defaults.

Terraform

Turn infrastructure into reviewed, tested, reusable modules instead of tribal CLI archaeology.

Proxmox

Run durable lab and edge clusters for experiments that deserve production-grade feedback loops.

LXC

Use lightweight isolation for fast, reproducible services without wasting compute on ceremony.

Tailscale

Build private mesh access paths that keep operators fast and exposed surfaces small.

Prometheus

Model systems as signals: useful SLOs, actionable alerts, and dashboards that reduce uncertainty.

Cloudflare

Push compute, cache, and AI inference closer to users for low-latency edge experiences.

Curated Signal Paths

Choose the engineering surface you care about.

The writing is organized around the operating problems senior infrastructure teams keep running into: reliability, Kubernetes platforms, observability, AI infrastructure, and identity paths.

SRE and Reliability

Site Reliability Engineering

Reliability engineering research on SLOs, incident learning, capacity, safe deployments, operational risk, and production engineering judgment.

Kubernetes and Platform Engineering

Kubernetes Platform Engineering

Kubernetes, service mesh, GitOps, Terraform, platform engineering, and cloud-native migration patterns for serious infrastructure teams.

Observability and Incident Learning

Prometheus, Grafana, OpenTelemetry, tracing, metrics, alerting, and operational feedback loops that reduce uncertainty during incidents.

AI Infrastructure and Operations

Independent research on AI infrastructure, LLM operations, AI agents, model serving, GPU telemetry, and reliability for AI systems.

Identity Reliability

Reliability thinking for authentication, authorization, secrets, zero trust, blast radius, and critical access paths.

Cloud Cost and Capacity

FinOps, forecasting, capacity planning, right-sizing, edge architecture, and cost-aware reliability patterns for cloud platforms.

Public Proof

Pillar posts and talks that show how I reason.

These are the pieces I would send to a technical peer who wants to understand my operating model: reliability loops, migration safety, observability, AI operations, and capacity thinking.

May 9, 2026 | 9 min read | Observability

CI/CD Pipeline Reliability: Flaky Tests, Build Observability, and Deployment Gates

Most engineering teams invest heavily in making user-facing services reliable — SLOs, blog posts, on-call rotations, incident reviews.


May 9, 2026 | 11 min read | SRE Research

Capacity Forecasting for SREs: Time Series Models, Anomaly Detection, and Automated Scaling Triggers

Most capacity planning conversations start the same way: someone pulls up a Grafana dashboard, draws a mental line through the last thirty days of CPU or memory data, and declares "capacity will hit the limit in about six weeks." That estimate gets entered...


May 8, 2026 | 10 min read | Platform Engineering

Zero-Downtime Deployments: Rolling Updates, Blue-Green, and Traffic Shifting

Every engineering team eventually ships the "we have zero-downtime deployments" slide in their reliability review. Then they get paged at 2am because a rolling update dropped three percent of requests during a high-traffic window, and the slide quietly...


