Staff SRE / Platform Engineer / AI Infrastructure

Staff SRE / Platform Engineer for high-scale reliability, identity, Kubernetes, and AI infrastructure.

I am Zak Hassan, a staff-level site reliability engineer and production engineer with 10+ years of experience building and operating mission-critical backend infrastructure at internet scale: identity systems processing hundreds of millions of authentications, enterprise platforms serving 400M+ global users, social platforms serving 200M+ users, Kubernetes fleets, real-time data pipelines, and GPU-backed ML infrastructure. I turn high-pressure infrastructure into observable, automated, cost-efficient systems.

live-telemetry.zakhassan.edge: SLO OK
Auth events: 100M+ (identity reliability)
Enterprise reach: 400M+ (global users)
Migration downtime: 0 (critical systems)
Cost reduction: 80%+ (fleet optimization)
Edge latency distribution: p99 87ms
Edge regions: iad03 green, yyz01 canary, sfo08 green, fra11 canary, nrt04 green, syd02 canary

Reliability Signal

What hiring managers should know in the first minute.

I am useful when the system is too important for guesswork: identity paths, platform migrations, Kubernetes fleets, cloud cost pressure, observability gaps, incident prevention, and AI infrastructure that needs real operational judgment.

SLO programs

Reliability measured by user pain, not dashboard decoration.

I build SLOs, burn-rate alerts, synthetic checks, and release gates that tell teams when customers are actually at risk.
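As a minimal sketch of what a burn-rate alert computes, assuming a 99.9% availability target and a 14.4x paging threshold (both illustrative values, not numbers from any system described on this page), the check might look like this. The function names `burn_rate` and `should_page` are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    2.0 means the budget would be exhausted in half the SLO period.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float = 0.999,   # assumed target
                threshold: float = 14.4) -> bool:  # assumed threshold
    """Page only when both the long and the short window burn too fast.

    Requiring both windows keeps the alert from firing on a spike that
    has already ended, and lets it clear soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% errors over the long window and 3% over the short window
    # against a 0.1% budget: both windows burn far above the threshold.
    print(should_page(0.02, 0.03))
    # The short window has recovered, so the alert stays quiet.
    print(should_page(0.02, 0.0005))
```

The two-window shape is one common way to tie an alert to customer risk rather than to a raw error count; the right windows and thresholds depend on the service.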

Identity reliability

Blast-radius thinking for authentication and access paths.

My strongest reliability work sits around identity, backend platforms, migrations, and the operational edges where small mistakes become global incidents.

Migration leadership

Large migrations without turning users into test traffic.

I like progressive cutovers, reversible changes, shadow validation, data consistency checks, and visible rollback criteria.
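A progressive cutover with visible rollback criteria can be sketched as a small decision function. This is an illustration under stated assumptions, not a description of any specific migration: the traffic steps, the 0.1 percentage-point error-delta limit, and the names `StepResult` and `next_action` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    """Observed health at one stage of a cutover."""
    traffic_percent: int        # share of traffic on the new system
    baseline_error_rate: float  # error rate on the old system
    canary_error_rate: float    # error rate on the new system


def next_action(result: StepResult,
                max_error_delta: float = 0.001,       # assumed limit
                steps: tuple = (1, 5, 25, 50, 100)) -> str:
    """Decide whether to roll back, advance, or finish the cutover."""
    delta = result.canary_error_rate - result.baseline_error_rate
    if delta > max_error_delta:
        # The rollback criterion is explicit and checked before any advance.
        return "rollback"
    later = [s for s in steps if s > result.traffic_percent]
    return f"advance to {later[0]}%" if later else "complete"


if __name__ == "__main__":
    print(next_action(StepResult(5, 0.002, 0.0021)))
    print(next_action(StepResult(25, 0.002, 0.01)))
    print(next_action(StepResult(100, 0.002, 0.002)))
```

Writing the criterion down as code makes the rollback decision reviewable before the migration starts instead of negotiable during it.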

AI operations

AI used as operational leverage, not theater.

I use independent lab work to pressure-test agents, log analysis, cloud-cost telemetry, and triage workflows before treating them as operational patterns.

Scale Proof

Career signal for teams operating at serious scale.

These numbers come from work at previous organizations, public talks, and resume-level experience, framed around public context that recruiters can evaluate quickly.

Authentication scale: 100M+ authentication events protected through SLOs, progressive delivery, synthetic monitoring, and incident prevention patterns.

Enterprise platform reach: 400M+ global users served by previous enterprise platform work across Kubernetes, service mesh, and multi-cloud systems.

Social platform scale: 200M+ users supported during zero-downtime Kafka, EC2, and EKS migrations with no data loss.

Microservice migration: 200+ production microservices migrated to Kubernetes with Istio service mesh, mTLS, and distributed observability.

Migration safety: 0 downtime for critical Kafka, EC2, EKS, and Kubernetes migrations supporting large production user bases.

Cost efficiency: 80%+ infrastructure cost reductions through fleet right-sizing, serverless GPU capacity, and automated cloud waste detection.

Deployment automation: 95% production deployment automation using validation gates, rollout safety, and human-error reduction patterns.

Data and ML systems: PB-scale data processing, object storage, streaming, model training, and GPU infrastructure reliability experience.

Operating Ledger

The pattern across my career: reduce uncertainty under load.

The work has ranged from user-facing identity reliability to Kubernetes migrations, service mesh, GPU-backed ML infrastructure, data platforms, and multi-cloud cost control.

  • Hundreds of millions of authentications across critical identity and backend services.
  • 400M+ global users on previous enterprise platform infrastructure.
  • 200M+ user social platform migration from on-prem VMware and EC2 toward AWS and EKS.
  • 200+ production microservices migrated to Kubernetes with Istio, mTLS, and observability.
  • 80%+ cost reductions through multi-cloud resource scanners, fleet right-sizing, and serverless GPU capacity.
  • Petabyte-scale data and ML platform experience with Ceph, Spark, Airflow, MLflow, SageMaker, Snowflake, and Flink.
  • Conference talks at KubeCon EU, Spark Summit, and OpenShift Commons on Prometheus, GPU monitoring, MLflow, and Kubernetes.

slo-control-plane: healthy
burn_rate_1h: 0.42x
canary_error_delta: +0.03%
rollback_readiness: verified

SLO Thinking

The job is not to worship uptime. The job is to make risk visible early.

I build systems around practical reliability loops: service-level objectives, telemetry that answers operational questions, release health, capacity forecasts, and incident learning that turns into better architecture.

  1. Define the user-visible failure mode before choosing the metric.
  2. Make every rollout observable, reversible, and boring to operate.
  3. Treat cost, capacity, security, and reliability as one system.
  4. Write the blog post after the lab work, not instead of the lab work.

Career pipeline

Experience as a production commit graph.

  1. commit 9f42c1a (HEAD -> main)

    Staff-level Production Engineering

    Identity reliability, production architecture, incident leadership

    Owns reliability patterns for critical identity and backend services processing massive authentication volume, with SLOs, progressive delivery, synthetic monitoring, and systemic incident prevention.

  2. commit 61aa8e0 (main~1)

    Workday

    Kubernetes migration, developer platforms, production standards

    Led service migrations to Kubernetes, built self-service deployment tooling, and standardized Helm charts, Prometheus sidecars, secret management, and capacity-aware autoscaling.

  3. commit 2d71b90 (main~2)

    SAP

    400M+ user enterprise platform, service mesh, multi-cloud cost control

    Migrated 200+ production microservices to Kubernetes, implemented Istio service mesh, built Terratest-backed Terraform modules, and drove major multi-cloud spend reductions.

  4. commit c0ffee7 (main~3)

    Hootsuite and Red Hat

    200M+ user migrations, GPU ML infrastructure, open-source platforms

    Delivered zero-downtime Kafka and EKS migrations, operated Kubernetes ML platforms, and built serverless GPU fleet management, Spark-on-Kubernetes, Ceph, Airflow, MLflow, and anomaly detection systems.

Operating stack

Tools chosen for leverage under load.

Kubernetes

Design multi-tenant platforms with predictable rollouts, policy boundaries, and useful failure modes.

AWS

Shape global primitives around IAM, networking, queues, compute, and observability with cost-aware defaults.

Terraform

Turn infrastructure into reviewed, tested, reusable modules instead of tribal CLI archaeology.

Proxmox

Run durable lab and edge clusters for experiments that deserve production-grade feedback loops.

LXC

Use lightweight isolation for fast, reproducible services without wasting compute on ceremony.

Tailscale

Build private mesh access paths that keep operators fast and exposed surfaces small.

Prometheus

Model systems as signals: useful SLOs, actionable alerts, and dashboards that reduce uncertainty.

Cloudflare

Push compute, cache, and AI inference closer to users for low-latency edge experiences.

Public Proof

Pillar posts and talks that show how I reason.

These are the pieces I would send to a technical peer who wants to understand my operating model: reliability loops, migration safety, observability, AI operations, and capacity thinking.