*By Zak Hassan — Staff SRE | May 2026*


Platform engineering is the practice of building internal infrastructure products that let application developers deploy, operate, and observe their services without deep expertise in Kubernetes, cloud networking, or observability tooling. The platform team is, in effect, a product team whose customers are internal developers.

Most platform efforts fail not because of technical limitations but because of product failures: they offer capabilities developers don't need, in forms they won't use. A world-class Kubernetes abstraction that requires 20 YAML files to deploy a service isn't an improvement; it's just different complexity. The platform's job is to reduce the total burden of building and operating software, not to centralize complexity behind a different interface.


The Golden Path: Opinionated Defaults Over Infinite Flexibility

The most valuable thing a platform team can offer isn't the ability to do anything — it's sensible defaults for the things most teams need. The "golden path" is the opinionated, well-supported, end-to-end journey from new service to production deployment.

A golden path includes:

  • A service template that generates a new service with correct structure, dependencies, and configuration
  • A CI/CD pipeline that tests, scans, builds, and deploys automatically
  • Default monitoring configuration that provides the four golden signals on day one
  • Default alerting rules tuned to reasonable thresholds
  • A service catalog entry that documents the service for other teams

When a developer uses the golden path, they get all of this without thinking about any of it. When they deviate from the golden path, they take on responsibility for what they've replaced.

Service scaffolding:

bash
# Developer creates a new service with one command
platform new-service \
  --name payment-processor \
  --language python \
  --type api \
  --team backend

# Generated structure:
# payment-processor/
# ├── src/
# │   ├── main.py           (with health endpoint, OTel instrumentation)
# │   └── config.py         (with 12-factor config pattern)
# ├── tests/
# │   └── test_main.py      (basic structure test)
# ├── Dockerfile            (multi-stage build, non-root user, health check)
# ├── k8s/
# │   ├── deployment.yaml   (correct resources, probes, disruption budget)
# │   ├── service.yaml
# │   └── hpa.yaml          (HPA with sensible defaults)
# ├── .github/workflows/
# │   └── ci.yaml           (lint, test, build, security scan, deploy)
# ├── monitoring/
# │   ├── dashboard.json    (pre-built Grafana dashboard)
# │   └── alerts.yaml       (SLO-based alert rules)
# └── catalog-info.yaml     (Backstage service catalog entry)

The scaffold isn't a starting point that developers immediately customize — it's a working system from day one. The Dockerfile is production-ready. The Kubernetes manifests have correct resource requests. The monitoring is pre-wired. A team using the golden path deploys to production with zero additional platform work.
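
As a rough illustration of what a command like `platform new-service` might do internally, the sketch below renders a couple of file templates into a new directory. The function name `scaffold_service`, the template contents, and the two-file subset are assumptions made for illustration; they are not the actual platform CLI, which would generate the full structure listed above.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical templates. A real scaffold would ship many more files
# (Dockerfile, k8s manifests, CI workflow, dashboards, alert rules).
TEMPLATES = {
    "src/main.py": Template(
        '"""Entry point for $name."""\n\n'
        "def health() -> dict:\n"
        '    return {"service": "$name", "status": "ok"}\n'
    ),
    "catalog-info.yaml": Template(
        "apiVersion: backstage.io/v1alpha1\n"
        "kind: Component\n"
        "metadata:\n"
        "  name: $name\n"
        "spec:\n"
        "  type: service\n"
        "  lifecycle: experimental\n"
        "  owner: $team\n"
    ),
}


def scaffold_service(name: str, team: str, root: Path) -> list:
    """Render every template under root/name and return the created paths."""
    service_dir = root / name
    if service_dir.exists():
        # Refuse to overwrite an existing service directory.
        raise FileExistsError(f"{service_dir} already exists")
    created = []
    for relative_path, template in TEMPLATES.items():
        target = service_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(template.substitute(name=name, team=team))
        created.append(target)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_service("payment-processor", "backend", Path(tmp)):
            print(path.relative_to(tmp))
```

Keeping the templates as data rather than code is one plausible design choice: the platform team can update defaults in one place, and every newly scaffolded service picks them up.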


The Service Catalog: Making Systems Legible

In a microservice architecture with 50+ services, the hardest operational problem is often just knowing what exists. Which service owns the user authentication flow? Who is the on-call for the recommendation engine? What does payment-service depend on?

Backstage is a widely used open-source platform for service catalogs, and many platform teams adopt its catalog-info.yaml format:

yaml
# catalog-info.yaml — lives in the service repo, registered in the catalog
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-processor
  description: Handles payment processing and reconciliation for all order flows
  tags:
    - payments
    - critical
  annotations:
    # Links to operational resources
    pagerduty.com/service-id: "PABC123"
    grafana/dashboard-url: "https://grafana.example.com/d/payment-processor"
    github.com/project-slug: "myorg/payment-processor"
    backstage.io/techdocs-ref: dir:.  # Documentation location
    
    # Blog post location
    blog_post_url: "https://zakhassan.com/blog/payment-processor"
    
    # SLO information
    slo/availability-target: "99.95"
    slo/latency-p99-ms: "200"

spec:
  type: service
  lifecycle: production
  owner: group:backend-payments
  
  # What this service depends on
  dependsOn:
    - component:postgres-primary
    - component:stripe-api
    - component:redis-cache
    - resource:orders-kafka-topic
  
  # What APIs this service provides
  providesApis:
    - payment-processor-api

The catalog becomes the authoritative source for who owns what, what depends on what, where to find relevant posts and dashboards, and whom to page when something breaks. With Backstage's plugin ecosystem, the catalog can also surface current deployment status, recent incidents, and SLO status directly on the service page.
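
To make the "who owns what, what depends on what" questions concrete, here is a minimal sketch that answers them from catalog data. It assumes the catalog-info.yaml files have already been parsed into dictionaries (parsing is left out to keep the example stdlib-only). The sample entries and the helper names `owner_of` and `build_dependents_index` are hypothetical and are not part of Backstage's API.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical catalog entries, flattened from catalog-info.yaml files.
CATALOG = [
    {"name": "payment-processor", "owner": "backend-payments",
     "dependsOn": ["component:postgres-primary", "component:redis-cache"]},
    {"name": "order-service", "owner": "backend-orders",
     "dependsOn": ["component:payment-processor", "component:postgres-primary"]},
    {"name": "postgres-primary", "owner": "data-platform", "dependsOn": []},
]


def owner_of(catalog: list, name: str) -> Optional[str]:
    """Return the owning team of a component, or None if it is not registered."""
    for entry in catalog:
        if entry["name"] == name:
            return entry["owner"]
    return None


def build_dependents_index(catalog: list) -> dict:
    """Map each component to the sorted names of components that depend on it."""
    dependents = defaultdict(set)
    for entry in catalog:
        for ref in entry.get("dependsOn", []):
            kind, _, target = ref.partition(":")
            if kind == "component":  # skip resource:... references
                dependents[target].add(entry["name"])
    return {name: sorted(names) for name, names in dependents.items()}


if __name__ == "__main__":
    print(owner_of(CATALOG, "payment-processor"))
    print(build_dependents_index(CATALOG)["postgres-primary"])
```

The reverse index is the useful part during an incident: given a failing component, it lists the services likely to be affected.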


Self-Service Infrastructure: The Platform API

The goal of platform engineering is self-service — teams getting what they need without filing tickets to the platform team. The platform exposes infrastructure capabilities through APIs, CLIs, and UIs that developers use directly.

Database provisioning (self-service):

yaml
# Developer creates a database by committing this file to their repo
# The platform's controller reconciles it against the actual cloud state

apiVersion: platform.example.com/v1
kind: Database
metadata:
  name: payment-processor-db
  namespace: payment
spec:
  engine: postgresql
  version: "15"
  tier: production        # production | staging | development (controls size + backup policy)
  storage_gb: 100
  
  # Platform handles: provisioning, backups, monitoring, failover, connection pooling
  # Team gets: a secret with connection string, automatic PgBouncer sidecar
  
  backup:
    retention_days: 30    # Default for production tier
    point_in_time: true   # PITR enabled for production
  
  monitoring:
    enabled: true         # Automatic dashboard + alerts for provisioned databases

python
# Platform controller reconciles the desired state with actual cloud resources.
# The helpers called below (get_rds_instance, provision_rds, get_instance_class,
# create_database_secret, create_pgbouncer_config, deploy_database_dashboard,
# deploy_database_alerts) are platform-internal functions that are not shown here.
# The kubernetes import is used by the watch loop that calls reconcile_database
# (also not shown).
from kubernetes import client, config, watch

def reconcile_database(db_resource: dict) -> None:
    spec = db_resource['spec']
    name = db_resource['metadata']['name']
    namespace = db_resource['metadata']['namespace']
    identifier = f"{namespace}-{name}"
    
    # Check if RDS instance exists
    rds_instance = get_rds_instance(identifier)
    
    if not rds_instance:
        # Provision new RDS instance. provision_rds is assumed to return the
        # connection string of the new instance.
        connection_string = provision_rds(
            identifier=identifier,
            engine=spec['engine'],
            version=spec['version'],
            storage_gb=spec['storage_gb'],
            instance_class=get_instance_class(spec['tier']),
            backup_retention=spec['backup']['retention_days'],
            tags={"namespace": namespace, "service": name, "tier": spec['tier']}
        )
        
        # Create Kubernetes secret with connection string
        create_database_secret(namespace, name, connection_string)
        
        # Provision PgBouncer sidecar
        create_pgbouncer_config(namespace, name)
        
        # Set up monitoring
        deploy_database_dashboard(namespace, name)
        deploy_database_alerts(namespace, name)

Self-service infrastructure eliminates the platform team as a bottleneck. Instead of filing a ticket and waiting for a database, a developer commits a YAML file and the platform provisions it. The platform team's job shifts from request fulfillment to building and maintaining the reconciliation systems.
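
The reconciliation idea can be sketched without any cloud dependency. In the example below, an in-memory dictionary stands in for the cloud provider, and `reconcile` is a hypothetical function that compares desired specs with actual state and returns the actions it took. `DatabaseSpec` mirrors only a subset of the resource fields, and both names are assumptions for illustration. Running it twice shows the property that matters for a controller: a second pass over unchanged input does nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSpec:
    """Desired state, mirroring a subset of the Database resource fields."""
    name: str
    engine: str
    version: str
    storage_gb: int


def reconcile(desired: list, actual: dict) -> list:
    """Bring `actual` (a stand-in for the cloud provider) in line with `desired`.

    Returns a log of the actions taken. Deleting resources that are no longer
    declared is deliberately left out of this sketch.
    """
    actions = []
    for spec in desired:
        current = actual.get(spec.name)
        if current is None:
            actual[spec.name] = spec
            actions.append(f"create {spec.name}")
        elif current != spec:
            actual[spec.name] = spec
            actions.append(f"update {spec.name}")
        # If current == spec there is nothing to do: the pass is idempotent.
    return actions


if __name__ == "__main__":
    cloud = {}
    desired = [DatabaseSpec("payment-processor-db", "postgresql", "15", 100)]
    print(reconcile(desired, cloud))  # first pass creates the database
    print(reconcile(desired, cloud))  # second pass finds nothing to do
```

Because the function only acts on differences, it is safe to run on every commit, on a timer, or after a crash, which is what lets the platform team step out of the request path.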


Developer Experience Metrics: Measuring Platform Value

Platform teams often struggle to demonstrate value because their work is invisible when working correctly. Measuring developer experience gives the platform team concrete data on where to invest.

DORA metrics (DevOps Research and Assessment):

The four DORA metrics measure engineering team performance, and platform investments directly improve them:

python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import median
from typing import Optional


@dataclass
class DORAMetrics:
    deployment_frequency: float        # deployments per day
    lead_time_hours: Optional[float]   # median; None if no data
    change_failure_rate: float         # fraction between 0 and 1
    mttr_minutes: Optional[float]      # mean; None if no resolved incidents


def compute_dora_metrics(deployments: list, incidents: list, lookback_days: int = 30) -> DORAMetrics:
    """
    Deployment Frequency: How often does the team deploy to production?
    Lead Time for Changes: How long from merge to production?
    Change Failure Rate: What % of deployments cause incidents?
    Mean Time to Recovery: How long to recover from incidents?

    Expects deployments with `merged_at` and `deployed_at`, and incidents with
    `start_time` and `resolved_at`, all as naive UTC datetimes.
    """
    # Naive UTC "now", so it compares cleanly with naive UTC timestamps
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    window_start = now - timedelta(days=lookback_days)
    recent_deployments = [d for d in deployments if d.deployed_at > window_start]
    recent_incidents = [i for i in incidents if i.start_time > window_start]
    
    # Deployment Frequency
    deployment_frequency_per_day = len(recent_deployments) / lookback_days
    
    # Lead Time: median time from PR merge to production deployment
    lead_times = [(d.deployed_at - d.merged_at).total_seconds() / 3600 
                  for d in recent_deployments if d.merged_at]
    lead_time_hours = median(lead_times) if lead_times else None
    
    # Change Failure Rate: a deployment counts as failed if an incident
    # starts within one hour after it (a heuristic, not proof of causation)
    deployments_causing_incidents = sum(
        1 for d in recent_deployments 
        if any(d.deployed_at < i.start_time < d.deployed_at + timedelta(hours=1)
               for i in recent_incidents)
    )
    change_failure_rate = deployments_causing_incidents / len(recent_deployments) if recent_deployments else 0
    
    # MTTR: mean over incidents that have been resolved
    recovery_times = [(i.resolved_at - i.start_time).total_seconds() / 60 
                      for i in recent_incidents if i.resolved_at]
    mttr_minutes = sum(recovery_times) / len(recovery_times) if recovery_times else None
    
    return DORAMetrics(
        deployment_frequency=deployment_frequency_per_day,
        lead_time_hours=lead_time_hours,
        change_failure_rate=change_failure_rate,
        mttr_minutes=mttr_minutes
    )

Platform-specific metrics:

Beyond DORA, measure the platform's direct value:

text
Time to first deployment for a new service:
  Target: <30 minutes for a team using the golden path

Percentage of services on golden path:
  Target: >80% — services deviating from the path are operational risk

Platform ticket volume per developer:
  Target: declining over time — self-service working
  
P75 CI/CD pipeline duration:
  Target: <15 minutes — long pipelines slow down everyone

Infrastructure provisioning time:
  Target: <5 minutes for databases, caches, queues via self-service
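
Two of these targets are straightforward to check in code. The sketch below computes a P75 pipeline duration with the nearest-rank method and the share of services on the golden path, then compares both against the targets listed above. The function names and the sample numbers are made up for illustration. Nearest-rank is only one of several percentile definitions, so a monitoring tool may report a slightly different P75.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_path_adoption(services: dict) -> float:
    """Fraction of services flagged as being on the golden path."""
    if not services:
        return 0.0
    return sum(1 for on_path in services.values() if on_path) / len(services)


def check_targets(pipeline_minutes: list, services: dict) -> dict:
    """Compare measured values with the targets from the list above."""
    return {
        "p75_pipeline_under_15_min": percentile(pipeline_minutes, 75) < 15,
        "golden_path_over_80_pct": golden_path_adoption(services) > 0.80,
    }


if __name__ == "__main__":
    # Illustrative sample data, not real measurements.
    durations = [8, 10, 12, 14, 20, 9, 11, 13]
    services = {"payments": True, "orders": True, "legacy-batch": False, "search": True}
    print(percentile(durations, 75))
    print(golden_path_adoption(services))
    print(check_targets(durations, services))
```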

Platform Team Structure: Treating Developers as Customers

The organizational structure of a platform team determines whether it builds things developers use or things developers work around.

Product management for platforms: every platform capability should have a product owner who talks to developer customers, understands their pain points, and prioritizes investments based on developer impact, not on what's technically interesting or what the platform team wants to build. A platform team without a product manager tends to build internal tools no one asked for.

The platform team anti-patterns:

The "gatekeeper" anti-pattern: the platform team reviews and approves every infrastructure request. This creates a bottleneck, builds resentment, and makes the platform team's job worse (pure request processing) while providing no leverage.

The "one-size-fits-all" anti-pattern: insisting that every service use the golden path for everything, with no escape hatches. Services with unusual requirements (GPU-intensive ML training, ultra-low latency trading systems) need customization, and refusing it pushes them off the platform entirely.

The "beautiful infrastructure, no one asked for" anti-pattern: building technically impressive systems without validating that application developers need them. A custom service mesh implementation is impressive; developers who were moved off a perfectly good managed solution to make way for it are not impressed.

The product model that works: quarterly developer surveys ("What slows you down the most?"), regular office hours where developers can ask questions and the platform team learns where they're struggling, and a public roadmap where developers can see what's coming and vote on priorities. A platform team that knows its customers' problems builds things they use.


The Migration Problem: Getting Teams Onto the Platform

The hardest part of platform engineering isn't building the platform — it's getting existing teams to adopt it. Every team has their own deployment scripts, their own monitoring setup, their own infrastructure provisioning process. Migrating to the platform is work that competes with feature development.

The strategies that work:

Incentive alignment: tie platform adoption to things teams care about. SLO dashboards are available automatically for platform users. Services on the platform can bypass production deployment approval gates. Cost visibility dashboards that show teams their spend are available only to services in the service catalog.

Making the first step easy: the migration journey should start with low-effort, high-value steps. Registering in the service catalog takes 30 minutes. A registered team automatically gets a directory listing, aggregation of published notes, and on-call routing. That's enough to make registration worthwhile before they've touched CI/CD or infrastructure.

Avoiding the big bang: never require full migration before any benefit. Every step should provide value independent of the steps before and after it. A team that adopts CI/CD but not infrastructure provisioning should still benefit from the CI/CD adoption.
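
One way to keep the "no big bang" rule honest is to model each adoption step with the benefits it unlocks on its own, so that no step depends on another. The sketch below does that. The step names and benefit labels are assumptions loosely based on the text above, not a real platform's feature list.

```python
# Illustrative mapping from adoption step to the benefits it unlocks by itself.
BENEFITS_BY_STEP = {
    "catalog": ["directory listing", "on-call routing", "cost visibility dashboard"],
    "ci_cd": ["automated test, scan, build, and deploy pipeline"],
    "infrastructure": ["self-service database provisioning"],
}


def unlocked_benefits(adopted_steps: set) -> list:
    """Return the benefits a team gets for the steps it has adopted, in any order."""
    unknown = adopted_steps - set(BENEFITS_BY_STEP)
    if unknown:
        raise ValueError(f"unknown adoption steps: {sorted(unknown)}")
    benefits = []
    for step, step_benefits in BENEFITS_BY_STEP.items():
        # Each step is checked independently: there are no prerequisites.
        if step in adopted_steps:
            benefits.extend(step_benefits)
    return benefits


if __name__ == "__main__":
    # A team that adopted CI/CD but nothing else still gets the CI/CD benefit.
    print(unlocked_benefits({"ci_cd"}))
    print(unlocked_benefits({"catalog", "ci_cd"}))
```

If adding a benefit ever requires checking for another step first, that is a sign a prerequisite has crept into the migration path.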


*Zak Hassan is a Staff SRE specializing in platform engineering, developer experience, and internal infrastructure products. Find him at zakhassan.com or on LinkedIn.*
