*By Zak Hassan — Staff SRE | May 2026*
Running Prometheus inside Kubernetes sounds straightforward until the cluster reaches any meaningful size. At a few dozen pods the friction is manageable. At a few hundred—or a few thousand—the operational model breaks down completely. This post covers how the Prometheus Operator and its CRD-based approach solve the core problems of static configuration, tenant isolation, and long-term retention, and what you need to think about when cardinality and federation enter the picture.
The Problem with Static Prometheus Config
Kubernetes workloads are ephemeral. Pods are scheduled, rescheduled, and replaced constantly. A static prometheus.yml that lists scrape targets by IP address is wrong before you finish writing it. Even using DNS-based service discovery with hand-maintained config files creates a painful ops loop: every new team that wants to expose metrics has to open a ticket, wait for an SRE to edit the config, and trigger a Prometheus reload. At scale that loop becomes a bottleneck that actively discourages teams from instrumenting their services.
The other failure mode is ownership. When every team's scrape config lives in one giant file owned by the platform team, nobody has clear accountability for individual jobs. A misconfigured label or a runaway high-cardinality metric from one team affects everyone on the same Prometheus instance. You need a model where teams own their monitoring configuration the same way they own their deployment manifests.
Prometheus Operator CRDs: Self-Service Monitoring
The Prometheus Operator introduces a set of Kubernetes Custom Resource Definitions that replace static config files entirely. Instead of editing YAML inside a ConfigMap, teams declare their monitoring intent as first-class Kubernetes objects.
The four CRDs you'll use every day are ServiceMonitor, PodMonitor, PrometheusRule, and Alertmanager. The Operator watches these resources and reconciles the running Prometheus configuration automatically.
A ServiceMonitor is the most common. It selects a Kubernetes Service by label and tells Prometheus how to scrape it:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: payments-api
      namespace: payments
      labels:
        team: payments
        prometheus: platform
    spec:
      selector:
        matchLabels:
          app: payments-api
      namespaceSelector:
        matchNames:
          - payments
      endpoints:
        - port: metrics
          path: /metrics
          interval: 30s
          scrapeTimeout: 10s
          relabelings:
            - sourceLabels: [__meta_kubernetes_pod_node_name]
              targetLabel: node
            - sourceLabels: [__meta_kubernetes_namespace]
              targetLabel: namespace
          metricRelabelings:
            - sourceLabels: [__name__]
              regex: "go_.*"
              action: drop

The prometheus: platform label on the ServiceMonitor is important. The Prometheus CRD specifies which ServiceMonitor objects it should pick up via serviceMonitorSelector. This is how you route monitoring config to the right Prometheus instance without giving teams access to global configuration.
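As a minimal sketch of the selecting side, a Prometheus object that picks up that label might look like the following. The object name and the choice to watch every namespace are assumptions for illustration, not part of the setup described here.

```yaml
# Hypothetical platform-wide instance; name and namespace are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: platform-prometheus
  namespace: monitoring
spec:
  # Pick up only ServiceMonitors that carry the routing label.
  serviceMonitorSelector:
    matchLabels:
      prometheus: platform
  # An empty selector matches every namespace. Leaving the field out
  # restricts discovery to the Prometheus object's own namespace.
  serviceMonitorNamespaceSelector: {}
```

A ServiceMonitor without the prometheus: platform label is simply ignored by this instance, which is what keeps one team's config out of another team's Prometheus.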
PodMonitor works the same way but targets Pod objects directly—useful when a pod exposes metrics but isn't fronted by a Service.
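A PodMonitor for a Service-less workload could be sketched as follows. The workload name and its app label are hypothetical; the port must match a named container port on the pod.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: payments-batch-worker      # hypothetical workload
  namespace: payments
  labels:
    team: payments
    prometheus: platform
spec:
  selector:
    matchLabels:
      app: payments-batch-worker   # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics                # named containerPort on the pod
      path: /metrics
      interval: 30s
```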
PrometheusRule handles alerting and recording rules:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: payments-api-rules
      namespace: payments
      labels:
        team: payments
        prometheus: platform
        role: alert-rules
    spec:
      groups:
        - name: payments.recording
          interval: 60s
          rules:
            - record: job:http_requests_total:rate5m
              expr: sum(rate(http_requests_total[5m])) by (job, status_code)
            - record: job:http_request_duration_seconds:p99_5m
              expr: |
                histogram_quantile(0.99,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
                )
        - name: payments.slo_burn
          rules:
            - alert: PaymentsHighErrorBurnRate
              # Sum both sides so the ratio is 5xx over all requests. Dividing the
              # per-status_code series directly would match each 5xx series with itself.
              expr: |
                (
                  sum(job:http_requests_total:rate5m{job="payments-api", status_code=~"5.."})
                  /
                  sum(job:http_requests_total:rate5m{job="payments-api"})
                ) > (14.4 * (1 - 0.999))
              for: 2m
              labels:
                severity: critical
                team: payments
              annotations:
                summary: "Payments API burning error budget at 14.4x rate"
                description: |
                  Error rate {{ $value | humanizePercentage }} over 5m window.
                  At this rate a 30-day error budget is exhausted in about 50 hours.
            - alert: PaymentsElevatedErrorBurnRate
              expr: |
                (
                  sum(job:http_requests_total:rate5m{job="payments-api", status_code=~"5.."})
                  /
                  sum(job:http_requests_total:rate5m{job="payments-api"})
                ) > (6 * (1 - 0.999))
              for: 15m
              labels:
                severity: warning
                team: payments
              annotations:
                summary: "Payments API elevated error burn rate"

Tenant Isolation with Multiple Prometheus Instances
A single Prometheus instance for an entire large cluster is an antipattern. Memory pressure from one team's high-cardinality metrics affects every other team's query latency. The Operator makes it straightforward to run multiple Prometheus instances, each scoped to a namespace or team:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: payments-prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceMonitorSelector:
        matchLabels:
          prometheus: payments
      serviceMonitorNamespaceSelector:
        matchLabels:
          team: payments
      ruleSelector:
        matchLabels:
          prometheus: payments
      # Without a ruleNamespaceSelector, rules are discovered only in the
      # Prometheus object's own namespace (monitoring here).
      ruleNamespaceSelector:
        matchLabels:
          team: payments
      resources:
        requests:
          memory: 4Gi
          cpu: 500m
        limits:
          memory: 8Gi
      retention: 12h
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: fast-ssd
            resources:
              requests:
                storage: 50Gi

RBAC is the enforcement layer. Grant each team read and write access to servicemonitors, podmonitors, and prometheusrules within their own namespace, and nothing else:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: monitoring-editor
      namespace: payments
    rules:
      - apiGroups: ["monitoring.coreos.com"]
        resources: ["servicemonitors", "podmonitors", "prometheusrules"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Teams ship their monitoring config alongside their application in the same PR. No tickets. No platform-team bottleneck.
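A Role has no effect until it is bound to a subject. A minimal RoleBinding sketch follows; the group name is a hypothetical value that would come from your identity provider.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-monitoring-editor
  namespace: payments
subjects:
  - kind: Group
    name: payments-engineers        # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: monitoring-editor           # the Role defined for the payments namespace
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, the grant stops at the payments namespace and never reaches the Prometheus or Alertmanager objects in monitoring.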
Recording Rules: Query Performance at Scale
As the time series count grows, ad-hoc PromQL queries that scan millions of raw series become slow. A histogram_quantile query over a 30-day window across thousands of pods can easily time out in Grafana before it returns a result. Recording rules pre-compute expensive expressions on a regular interval and store the result as a new, lower-cardinality time series.
Design recording rules in a hierarchy. The first level aggregates per-pod raw counters into per-job rates. The second level aggregates per-job into per-team or per-environment totals. Dashboards query the highest-level recording rule that still contains the labels they need.

    # Level 1: job-level aggregation from raw pod metrics
    - record: job:http_requests_total:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job, namespace, status_code)
    # Level 2: namespace-level aggregation
    - record: namespace:http_requests_total:rate5m
      expr: sum(job:http_requests_total:rate5m) by (namespace, status_code)
    # Level 3: cluster-level totals for executive dashboards
    - record: cluster:http_requests_total:rate5m
      expr: sum(namespace:http_requests_total:rate5m) by (status_code)

Keep recording rule intervals consistent across levels. If level-1 rules evaluate every 60 seconds, level-2 rules that depend on them should also evaluate every 60 seconds or a multiple thereof. Mismatched intervals cause subtle staleness in aggregated dashboards.
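One way to keep the levels aligned is to place them in a single rule group, since rules within a group run sequentially, in order, on the group's interval. The sketch below assumes a hypothetical object name and the prometheus: platform routing label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: http-aggregation-rules     # hypothetical name
  namespace: monitoring
  labels:
    prometheus: platform
spec:
  groups:
    - name: http.aggregation
      interval: 60s                # one interval shared by all three levels
      rules:
        # Order matters: each level reads the series recorded just above it.
        - record: job:http_requests_total:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job, namespace, status_code)
        - record: namespace:http_requests_total:rate5m
          expr: sum(job:http_requests_total:rate5m) by (namespace, status_code)
        - record: cluster:http_requests_total:rate5m
          expr: sum(namespace:http_requests_total:rate5m) by (status_code)
```

Splitting the levels across groups still works, but then each group has its own schedule and the higher levels can lag by up to one interval.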
Long-Term Storage: Thanos Sidecar Pattern
Prometheus's local storage is not designed for long retention. Querying data older than a few weeks against local TSDB blocks is slow, and local disk is expensive in Kubernetes. For multi-month retention—quarterly reviews, SLO trending, capacity planning—you need external object storage.
Thanos is one of the most widely deployed solutions. The sidecar pattern is the lowest-friction entry point: a Thanos sidecar container runs alongside each Prometheus instance, uploads completed TSDB blocks to object storage, and exposes a gRPC Store API that serves the recent data still held by the local Prometheus. Historical blocks in the bucket are served by a separate Store Gateway.

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: payments-prometheus
      namespace: monitoring
    spec:
      replicas: 2
      retention: 12h
      thanos:
        image: quay.io/thanos/thanos:v0.35.0
        objectStorageConfig:
          secret:
            name: thanos-objstore-config
            key: objstore.yml
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: thanos-objstore-config
      namespace: monitoring
    stringData:
      objstore.yml: |
        type: GCS
        config:
          bucket: my-thanos-metrics-bucket
          service_account: ""

The Thanos Querier then fans out across all sidecar Store APIs and the Store Gateway that serves blocks from object storage, presenting a single query endpoint to Grafana:

    # thanos-querier deployment args (abbreviated)
    args:
      - query
      - --store=dnssrv+_grpc._tcp.thanos-store-gateway.monitoring.svc.cluster.local
      - --store=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
      - --query.replica-label=prometheus_replica
      - --query.auto-downsampling

Set Prometheus local retention to something short, 12 to 24 hours, and let Thanos own everything older. This keeps Prometheus memory and disk footprint bounded regardless of how long you retain data.
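Auto-downsampling only helps when downsampled blocks exist in the bucket, and those are produced by the Thanos Compactor, which also enforces retention on the object store. A minimal sketch of its container args follows; the mount paths and retention values are placeholders, not recommendations.

```yaml
# thanos-compactor container args (sketch; paths and durations are assumptions)
args:
  - compact
  - --wait                                     # keep running instead of exiting after one pass
  - --data-dir=/var/thanos/compact             # scratch space for compaction
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d             # raw samples
  - --retention.resolution-5m=180d             # 5-minute downsampled blocks
  - --retention.resolution-1h=730d             # 1-hour downsampled blocks
```

Run a single Compactor per bucket. Concurrent compactors working on the same blocks can corrupt them.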
Cardinality Management
High cardinality is the most common cause of unexpected memory growth in Prometheus. Every unique combination of label values is a distinct time series. A label like user_id or request_id on a counter creates as many series as there are users or requests—potentially millions. Each series consumes memory in the TSDB head block, which is always in RAM.
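One guardrail worth considering is a per-scrape sample limit on the ServiceMonitor. When a target exposes more samples than the limit after metric relabeling, Prometheus fails the whole scrape, so a runaway label shows up as a failed target rather than as silent memory growth. The limit below is a placeholder to tune per service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
  namespace: payments
  labels:
    prometheus: platform
spec:
  sampleLimit: 5000        # placeholder; scrapes exceeding this are rejected
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: metrics
      interval: 30s
```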
Monitor Prometheus's own cardinality with these queries:

    # Approximate active series per job, from samples kept per scrape after relabeling
    topk(20, sum by (job) (scrape_samples_post_metric_relabeling))
    # Series count growth: head series is a gauge, so use deriv() rather than rate()
    deriv(prometheus_tsdb_head_series[1h])
    # Samples ingested per second: budget this against your RAM
    rate(prometheus_tsdb_head_samples_appended_total[5m])
    # Scrape duration: slow scrapes often mean high-cardinality targets
    topk(10, scrape_duration_seconds)
    # Rule group evaluation time: slow groups are usually querying too many series
    topk(10, prometheus_rule_group_last_duration_seconds)
    # Memory usage of Prometheus process itself
    process_resident_memory_bytes{job="prometheus"}

When you find a high-cardinality label, the fix is almost always to drop it at the metricRelabelings stage in the ServiceMonitor:

    metricRelabelings:
      # labeldrop matches label names against regex and takes no sourceLabels
      - regex: "user_id"
        action: labeldrop
      - sourceLabels: [__name__]
        regex: "go_memstats_.*"
        action: drop

Dropping a label after the fact without breaking dashboards requires a coordinated migration: add a recording rule that pre-aggregates the metric without the offending label, update dashboards to use the recording rule, then drop the label. Check that the remaining labels still identify each series uniquely, because labeldrop can otherwise collapse distinct series into duplicates. Do not drop first and ask questions later.
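The first step of that migration can be sketched as a recording rule that aggregates the offending label away. The metric name checkout_requests_total and the group name are hypothetical.

```yaml
groups:
  - name: payments.cardinality_migration   # hypothetical group name
    interval: 60s
    rules:
      # checkout_requests_total is a hypothetical counter that carries user_id.
      # Aggregating by (job, status_code) removes user_id from the recorded series.
      - record: job:checkout_requests_total:rate5m
        expr: sum by (job, status_code) (rate(checkout_requests_total[5m]))
```

Once dashboards read job:checkout_requests_total:rate5m, dropping user_id at scrape time no longer changes what they display.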
Federation and Remote Write
Federation—having one Prometheus scrape metrics from another—is appropriate for pulling small, pre-aggregated summary metrics from many tenant Prometheus instances up into a global view. It is not appropriate for copying raw high-resolution data. Federated scrapes are themselves scrape targets that can fail and create gaps.
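With the Operator, a federated scrape can be expressed as a ServiceMonitor pointed at the tenant's /federate endpoint. The sketch below assumes a global Prometheus that selects prometheus: global, and a tenant Service labeled app: payments-prometheus with a port named web; all three are illustrative assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: federate-payments            # hypothetical
  namespace: monitoring
  labels:
    prometheus: global               # assumes a global instance selects this label
spec:
  selector:
    matchLabels:
      app: payments-prometheus       # hypothetical label on the tenant Prometheus Service
  endpoints:
    - port: web
      path: /federate
      interval: 60s
      honorLabels: true              # keep the job and instance labels from the tenant
      params:
        'match[]':
          # Pull only recording-rule output, never raw series.
          - '{__name__=~"(job|namespace|cluster):.*"}'
```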
Remote write is the right mechanism for global aggregation at scale. Configure each tenant Prometheus to remote-write its recording-rule output (not raw series) to a central Prometheus or Thanos Receiver:

    spec:
      remoteWrite:
        - url: https://thanos-receive.monitoring.svc.cluster.local/api/v1/receive
          writeRelabelConfigs:
            - sourceLabels: [__name__]
              regex: "^(job|namespace|cluster):.*"
              action: keep
          queueConfig:
            capacity: 10000
            maxSamplesPerSend: 5000
            batchSendDeadline: 5s

The writeRelabelConfigs filter is critical. Only remote-write metrics whose names match the recording-rule naming convention. Sending every raw series to a central store multiplies your cardinality problem across the entire cluster.
Use federation for global dashboards that only need aggregated totals. Use remote write when you need the central store for alerting on cross-team signals. Use Thanos Querier when you need ad-hoc queries that span both historical data and live data from multiple instances.
*Zak Hassan is a Staff SRE specializing in observability platforms, Kubernetes infrastructure, and reliability engineering. Find him at zakhassan.com or on LinkedIn.*