Prometheus Operator at Scale: CRD-Based Monitoring for Large Kubernetes Clusters

*By Zak Hassan — Staff SRE | May 2026*

Running Prometheus inside Kubernetes sounds straightforward until the cluster reaches any meaningful size. At a few dozen pods the friction is manageable. At a few hundred—or a few thousand—the operational model breaks down completely. This post covers how the Prometheus Operator and its CRD-based approach solve the core problems of static configuration, tenant isolation, and long-term retention, and what you need to think about when cardinality and federation enter the picture.

The Problem with Static Prometheus Config

Kubernetes workloads are ephemeral. Pods are scheduled, rescheduled, and replaced constantly. A static prometheus.yml that lists scrape targets by IP address is wrong before you finish writing it. Even using DNS-based service discovery with hand-maintained config files creates a painful ops loop: every new team that wants to expose metrics has to open a ticket, wait for an SRE to edit the config, and trigger a Prometheus reload. At scale that loop becomes a bottleneck that actively discourages teams from instrumenting their services.

The other failure mode is ownership. When every team's scrape config lives in one giant file owned by the platform team, nobody has clear accountability for individual jobs. A misconfigured label or a runaway high-cardinality metric from one team affects everyone on the same Prometheus instance. You need a model where teams own their monitoring configuration the same way they own their deployment manifests.

Prometheus Operator CRDs: Self-Service Monitoring

The Prometheus Operator introduces a set of Kubernetes Custom Resource Definitions that replace static config files entirely. Instead of editing YAML inside a ConfigMap, teams declare their monitoring intent as first-class Kubernetes objects.

The CRDs you'll use every day are ServiceMonitor, PodMonitor, and PrometheusRule for per-team configuration, plus Prometheus and Alertmanager, which define the server deployments themselves. The Operator watches these resources and reconciles the running Prometheus configuration automatically.

A ServiceMonitor is the most common. It selects a Kubernetes Service by label and tells Prometheus how to scrape it:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
  namespace: payments
  labels:
    team: payments
    prometheus: platform
spec:
  selector:
    matchLabels:
      app: payments-api
  namespaceSelector:
    matchNames:
      - payments
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop

The prometheus: platform label on the ServiceMonitor is important. The Prometheus CRD specifies which ServiceMonitor objects it should pick up via serviceMonitorSelector. This is how you route monitoring config to the right Prometheus instance without giving teams access to global configuration.
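For reference, the selecting side might look like the sketch below. The instance name platform and the monitoring namespace are assumptions for illustration, not taken from a real deployment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: platform          # hypothetical instance name
  namespace: monitoring   # assumed namespace
spec:
  # Pick up only ServiceMonitors that carry this label
  serviceMonitorSelector:
    matchLabels:
      prometheus: platform
  # An empty selector matches ServiceMonitors in every namespace;
  # leaving the field unset limits discovery to this Prometheus's own namespace
  serviceMonitorNamespaceSelector: {}
```

A ServiceMonitor without the prometheus: platform label is simply ignored by this instance, which is what makes the label a routing key.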

PodMonitor works the same way but targets Pod objects directly—useful when a pod exposes metrics but isn't fronted by a Service.
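A minimal PodMonitor sketch, assuming a hypothetical payments-worker Deployment whose pods carry the label app: payments-worker and expose a container port named metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: payments-worker     # hypothetical workload name
  namespace: payments
  labels:
    team: payments
    prometheus: platform
spec:
  # Selects Pods directly by label; no Service is involved
  selector:
    matchLabels:
      app: payments-worker
  # PodMonitor uses podMetricsEndpoints where ServiceMonitor uses endpoints
  podMetricsEndpoints:
    - port: metrics         # must match a named container port
      path: /metrics
      interval: 30s
```

The Prometheus resource needs a matching podMonitorSelector for this object to be picked up, in the same way serviceMonitorSelector gates ServiceMonitors.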

PrometheusRule handles alerting and recording rules:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-rules
  namespace: payments
  labels:
    team: payments
    prometheus: platform
    role: alert-rules
spec:
  groups:
    - name: payments.recording
      interval: 60s
      rules:
        - record: job:http_requests_total:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job, status_code)
        - record: job:http_request_duration_seconds:p99_5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
            )

    - name: payments.slo_burn
      rules:
        - alert: PaymentsHighErrorBurnRate
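          # sum() on both sides so the ratio is 5xx requests / all requests.
          # Without it, the division matches series one-to-one on status_code
          # and every 5xx series divides by itself, always yielding 1.
          expr: |
            (
              sum(job:http_requests_total:rate5m{job="payments-api", status_code=~"5.."})
              /
              sum(job:http_requests_total:rate5m{job="payments-api"})
            ) > (14.4 * (1 - 0.999))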
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payments API burning error budget at 14.4x rate"
            description: |
              Error rate {{ $value | humanizePercentage }} over 5m window.
              At a 14.4x burn rate, about 2% of a 30-day error budget is consumed
              per hour, so the full budget is exhausted in roughly two days.
        - alert: PaymentsElevatedErrorBurnRate
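          # sum() on both sides for the same reason as the critical alert:
          # the ratio must be 5xx requests / all requests, not per-status-code.
          expr: |
            (
              sum(job:http_requests_total:rate5m{job="payments-api", status_code=~"5.."})
              /
              sum(job:http_requests_total:rate5m{job="payments-api"})
            ) > (6 * (1 - 0.999))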
          for: 15m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payments API elevated error burn rate"

Tenant Isolation with Multiple Prometheus Instances

A single Prometheus instance for an entire large cluster is an antipattern. Memory pressure from one team's high-cardinality metrics affects every other team's query latency. The Operator makes it straightforward to run multiple Prometheus instances, each scoped to a namespace or team:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: payments-prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceMonitorSelector:
    matchLabels:
      prometheus: payments
  serviceMonitorNamespaceSelector:
    matchLabels:
      team: payments
  ruleSelector:
    matchLabels:
      prometheus: payments
  resources:
    requests:
      memory: 4Gi
      cpu: 500m
    limits:
      memory: 8Gi
  retention: 12h
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
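Note that serviceMonitorNamespaceSelector matches labels on Namespace objects, not on the ServiceMonitors inside them. A minimal sketch of the namespace side, assuming the team's namespace is named payments as in the earlier examples:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    # Matched by serviceMonitorNamespaceSelector on payments-prometheus
    team: payments
```

ServiceMonitors in that namespace would additionally need the label prometheus: payments to satisfy serviceMonitorSelector on this instance.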

RBAC is the enforcement layer. Grant each team get, list, watch, create, update, patch, and delete on servicemonitors, podmonitors, and prometheusrules within their own namespace, and nothing else:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: monitoring-editor
  namespace: payments
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["servicemonitors", "podmonitors", "prometheusrules"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Teams ship their monitoring config alongside their application in the same PR. No tickets. No platform-team bottleneck.
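The Role above grants nothing until it is bound to a subject. A sketch of the binding, assuming a hypothetical group named payments-developers supplied by your identity provider:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitoring-editor
  namespace: payments           # binding is scoped to the team's namespace
subjects:
  - kind: Group
    name: payments-developers   # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: monitoring-editor       # the Role defined above
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, members of the group cannot create or edit monitoring resources in any other team's namespace.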

Recording Rules: Query Performance at Scale

As the time series count grows, ad-hoc PromQL queries that scan millions of raw series become slow. A histogram_quantile query over a 30-day window across thousands of pods will often time out in Grafana before it returns a result. Recording rules pre-compute expensive expressions on a regular interval and store the result as a new, lower-cardinality time series.

Design recording rules in a hierarchy. The first level aggregates per-pod raw counters into per-job rates. The second level aggregates per-job into per-team or per-environment totals. Dashboards query the highest-level recording rule that still contains the labels they need.

# Level 1: job-level aggregation from raw pod metrics
- record: job:http_requests_total:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job, namespace, status_code)

# Level 2: namespace-level aggregation
- record: namespace:http_requests_total:rate5m
  expr: sum(job:http_requests_total:rate5m) by (namespace, status_code)

# Level 3: cluster-level totals for executive dashboards
- record: cluster:http_requests_total:rate5m
  expr: sum(namespace:http_requests_total:rate5m) by (status_code)

Keep recording rule intervals consistent across levels. If level-1 rules evaluate every 60 seconds, level-2 rules that depend on them should also evaluate every 60 seconds or a multiple thereof. Mismatched intervals cause subtle staleness in aggregated dashboards.
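One way to keep the levels aligned is to place them in a single rule group with one explicit interval. The sketch below wraps the three levels in a PrometheusRule; the resource name, namespace, and labels are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: http-request-hierarchy   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: platform
spec:
  groups:
    - name: http.hierarchy
      interval: 60s              # one interval shared by every level
      rules:
        # Rules within a group are evaluated in order, so a later level
        # can read what the earlier level has just recorded
        - record: job:http_requests_total:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job, namespace, status_code)
        - record: namespace:http_requests_total:rate5m
          expr: sum(job:http_requests_total:rate5m) by (namespace, status_code)
        - record: cluster:http_requests_total:rate5m
          expr: sum(namespace:http_requests_total:rate5m) by (status_code)
```

If the levels must live in separate groups, for example because different teams own them, setting the same interval on each group is the simplest way to avoid the staleness described above.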

Long-Term Storage: Thanos Sidecar Pattern

Prometheus's local storage is not designed for long retention. Querying data older than a few weeks against local TSDB blocks is slow, and local disk is expensive in Kubernetes. For multi-month retention—quarterly reviews, SLO trending, capacity planning—you need external object storage.

Thanos is the most widely deployed solution. The sidecar pattern is the lowest-friction entry point: a Thanos sidecar container runs alongside each Prometheus instance, uploads completed TSDB blocks to object storage, and exposes a gRPC Store API for queries against historical data.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: payments-prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: 12h
  thanos:
    image: quay.io/thanos/thanos:v0.35.0
    objectStorageConfig:
      # objectStorageConfig is a SecretKeySelector: name and key sit directly under it
      name: thanos-objstore-config
      key: objstore.yml
---
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
stringData:
  objstore.yml: |
    type: GCS
    config:
      bucket: my-thanos-metrics-bucket
      service_account: ""

The Thanos Querier then federates across all sidecar Store APIs and the long-term object store, presenting a single query endpoint to Grafana:

# thanos-querier deployment args (abbreviated)
args:
  - query
  - --store=dnssrv+_grpc._tcp.thanos-store-gateway.monitoring.svc.cluster.local
  - --store=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
  - --query.replica-label=prometheus_replica
  - --query.auto-downsampling

Set Prometheus local retention to something short—12 to 24 hours—and let Thanos own everything older. This keeps Prometheus memory and disk footprint bounded regardless of how long you retain data.

Cardinality Management

High cardinality is the most common cause of unexpected memory growth in Prometheus. Every unique combination of label values is a distinct time series. A label like user_id or request_id on a counter creates as many series as there are users or requests—potentially millions. Each series consumes memory in the TSDB head block, which is always in RAM.

Monitor Prometheus's own cardinality with these queries:

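# Total active time series per job
# (expensive: touches every series, so run it sparingly; counting `up` would
# only count scrape targets, not series)
topk(20, count by (job) ({__name__=~".+"}))

# Series count growth rate: rising fast means cardinality leak
# (prometheus_tsdb_head_series is a gauge, so use deriv() rather than rate())
deriv(prometheus_tsdb_head_series[1h])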

# Samples ingested per second — budget this against your RAM
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Scrape duration — slow scrapes often mean high-cardinality targets
topk(10, scrape_duration_seconds)

# Rule evaluation latency — slow rules are usually querying too many series
topk(10, prometheus_rule_evaluation_duration_seconds)

# Memory usage of Prometheus process itself
process_resident_memory_bytes{job="prometheus"}
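These queries are more useful as alerts than as something you remember to run. A minimal sketch of an alert on head series growth; the resource name, the 100000 threshold, and the 30m hold time are assumptions to tune against your own baseline.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-cardinality   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: platform
spec:
  groups:
    - name: prometheus.cardinality
      rules:
        - alert: PrometheusHeadSeriesGrowing
          # Assumed threshold: more than 100k net new series over the past hour.
          # delta() is used because prometheus_tsdb_head_series is a gauge.
          expr: delta(prometheus_tsdb_head_series[1h]) > 100000
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Head series on {{ $labels.instance }} grew by {{ $value | humanize }} in 1h"
```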

When you find a high-cardinality label, the fix is almost always to drop it at the metricRelabelings stage in the ServiceMonitor:

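metricRelabelings:
  # labeldrop matches its regex against label NAMES; sourceLabels is not used.
  # Name the one label explicitly: a regex of ".*" would strip every label.
  - regex: "user_id"
    action: labeldrop
  # drop matches the regex against the sourceLabels values and discards whole series
  - sourceLabels: [__name__]
    regex: "go_memstats_.*"
    action: drop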

Dropping a label after the fact without breaking dashboards requires a coordinated migration: add a recording rule that pre-aggregates the metric without the offending label, update dashboards to use the recording rule, then drop the label. Do not drop first and ask questions later.
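The first step of that migration might look like the following sketch. The metric checkout_requests_total and the rule name are hypothetical; substitute the metric that carries the offending label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-label-migration   # hypothetical name
  namespace: payments
  labels:
    prometheus: platform
spec:
  groups:
    - name: payments.migration
      rules:
        # Step 1: record the metric with user_id aggregated away.
        # Only job and namespace survive the aggregation.
        - record: job:checkout_requests:rate5m
          expr: sum by (job, namespace) (rate(checkout_requests_total[5m]))
        # Step 2 (outside this file): point dashboards at job:checkout_requests:rate5m.
        # Step 3: only then add the labeldrop for user_id to the ServiceMonitor.
```

The recorded series keeps producing the same values before and after the label is dropped, so dashboards built on it do not notice step 3.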

Federation and Remote Write

Federation—having one Prometheus scrape metrics from another—is appropriate for pulling small, pre-aggregated summary metrics from many tenant Prometheus instances up into a global view. It is not appropriate for copying raw high-resolution data. Federated scrapes are themselves scrape targets that can fail and create gaps.
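With the Operator, one way to express a federation scrape is a ServiceMonitor that targets the /federate endpoint of the tenant instances. The sketch below rests on several assumptions: a global Prometheus that selects prometheus: global, tenant instances in the monitoring namespace, and the Operator-managed governing Service carrying the label operated-prometheus: "true" with a port named web. Verify each against your cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: federate-tenants     # hypothetical name
  namespace: monitoring
  labels:
    prometheus: global       # assumed label selected by the global instance
spec:
  selector:
    matchLabels:
      operated-prometheus: "true"   # assumed label on the governing Service
  endpoints:
    - port: web              # assumed port name
      path: /federate
      interval: 60s
      honorLabels: true      # keep the tenants' own job and instance labels
      params:
        # Pull only pre-aggregated recording-rule output, never raw series
        "match[]":
          - '{__name__=~"(job|namespace|cluster):.*"}'
```

If the global instance runs in the same namespace as the tenants, the selector may also match its own pods, so a relabeling rule to exclude them may be needed.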

Remote write is the right mechanism for global aggregation at scale. Configure each tenant Prometheus to remote-write its recording-rule output (not raw series) to a central Prometheus or Thanos Receiver:

spec:
  remoteWrite:
    - url: https://thanos-receive.monitoring.svc.cluster.local/api/v1/receive
      writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: "^(job|namespace|cluster):.*"
          action: keep
      queueConfig:
        capacity: 10000
        maxSamplesPerSend: 5000
        batchSendDeadline: 5s

The writeRelabelConfigs filter is critical. Only remote-write metrics whose names match the recording-rule naming convention. Sending every raw series to a central store multiplies your cardinality problem across the entire cluster.

Use federation for global dashboards that only need aggregated totals. Use remote write when you need the central store for alerting on cross-team signals. Use Thanos Querier when you need ad-hoc queries that span both historical data and live data from multiple instances.

*Zak Hassan is a Staff SRE specializing in observability platforms, Kubernetes infrastructure, and reliability engineering. Find him at zakhassan.com or on LinkedIn.*
