Service Mesh Observability: Getting the Golden Signals Without Touching Application Code

*By Zak Hassan — Staff SRE | May 2026*

A service mesh sits between the services and handles network communication transparently: retries, circuit breaking, mTLS, load balancing, and — the part SREs care about most — observability. Every request that transits the mesh is visible to the mesh's data plane, which means you can get latency, error rate, and throughput metrics for every service-to-service call without writing a single line of application code.

This is the operational guide to service mesh observability with Istio: what the mesh gives you by default, how to extend it, and how to integrate it with your broader observability stack.

What the Mesh Observes Automatically

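Istio's data plane is the Envoy proxy, running as a sidecar container inside every application pod. Envoy intercepts all inbound and outbound network traffic and exposes L7 metrics on its own Prometheus endpoint, which Prometheus scrapes directly from each sidecar. The control plane (istiod) is not in the metrics path.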

Out of the box, every service in the mesh gets:

# Request rate per service pair (source → destination)
sum(rate(istio_requests_total{destination_service="payment-service.production.svc.cluster.local"}[5m])) by (source_workload, response_code)

# Latency distribution (p50, p95, p99)
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{destination_service="payment-service.production.svc.cluster.local"}[5m])) by (le, source_workload))

# Error rate (non-2xx responses) between any two services
sum(rate(istio_requests_total{destination_service="payment-service.production.svc.cluster.local", response_code!~"2.."}[5m]))
/ sum(rate(istio_requests_total{destination_service="payment-service.production.svc.cluster.local"}[5m]))

# Bytes transferred
sum(rate(istio_response_bytes_sum{destination_service="payment-service.production.svc.cluster.local"}[5m]))

These metrics exist for every service-to-service call in the cluster without any instrumentation. The service doesn't need to emit metrics, have a Prometheus exporter, or even be aware it's in a mesh.
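
Because the queries above differ only in the destination label, they are easy to generate per service. The following is a minimal sketch, not part of Istio: the function name `golden_signal_queries` and its parameters are hypothetical, and it assumes the standard `<service>.<namespace>.svc.cluster.local` naming used in the examples.

```python
def golden_signal_queries(service: str, namespace: str, window: str = "5m") -> dict:
    """Build throughput, error-ratio, and p99 latency PromQL for one destination service."""
    # Label matcher shared by all three queries.
    dest = f'destination_service="{service}.{namespace}.svc.cluster.local"'
    requests = f"sum(rate(istio_requests_total{{{dest}}}[{window}]))"
    errors = f'sum(rate(istio_requests_total{{{dest}, response_code!~"2.."}}[{window}]))'
    latency = (
        "histogram_quantile(0.99, sum by (le) ("
        f"rate(istio_request_duration_milliseconds_bucket{{{dest}}}[{window}])))"
    )
    return {
        "throughput": requests,
        "error_ratio": f"{errors} / {requests}",
        "latency_p99_ms": latency,
    }


if __name__ == "__main__":
    for name, query in golden_signal_queries("payment-service", "production").items():
        print(f"{name}: {query}")
```

The returned strings can be pasted into Grafana panels or sent to the Prometheus HTTP API unchanged.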

Configuring Telemetry for Production

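The default Istio telemetry configuration is functional but not production-tuned. One caveat for tracing: the sidecars generate spans on their own, but spans only join into a single trace if the application forwards the trace context headers (for example traceparent or the B3 headers) from inbound to outbound requests. Key adjustments: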

# Increase sampling rate for tracing (default is 1%)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default-tracing
  namespace: istio-system  # Applies cluster-wide
spec:
  tracing:
    - providers:
        - name: tempo        # Your tracing backend
      randomSamplingPercentage: 5.0  # 5% of traces — adjust based on volume

---
# Per-namespace override: higher sampling for critical namespaces
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: payment-tracing
  namespace: payment
spec:
  tracing:
    - providers:
        - name: tempo
      randomSamplingPercentage: 100.0  # Trace everything in the payment namespace

---
# Access logging configuration: log only errors and slow requests to reduce cost
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: access-log-config
  namespace: production
spec:
  accessLogging:
    - providers:
        - name: otel
      filter:
        # Only log failed requests and requests over 1 second
        expression: "response.code >= 400 || request.duration > duration('1s')"
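
Before raising a sampling percentage, it helps to estimate what it does to trace volume. The arithmetic below is a rough sketch: the function name is hypothetical, the request rates are made-up illustration values, and it assumes one sampling decision per request at the point where the trace starts.

```python
SECONDS_PER_DAY = 86_400


def sampled_traces_per_day(requests_per_second: float, sampling_percentage: float) -> float:
    """Estimate how many traces per day a given random sampling percentage produces."""
    if not 0.0 <= sampling_percentage <= 100.0:
        raise ValueError("sampling_percentage must be between 0 and 100")
    return requests_per_second * SECONDS_PER_DAY * sampling_percentage / 100.0


if __name__ == "__main__":
    # Hypothetical traffic levels, for illustration only.
    print(sampled_traces_per_day(2000, 5.0))    # cluster-wide default at 5%
    print(sampled_traces_per_day(200, 100.0))   # one namespace traced at 100%
```

Multiplying the result by your average spans per trace and bytes per span gives a first-order storage estimate for the tracing backend.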

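Metric customization: Istio already emits TCP metrics for L4 services by default (istio_tcp_sent_bytes_total, istio_tcp_received_bytes_total, istio_tcp_connections_opened_total, istio_tcp_connections_closed_total), so the more common production change is adding dimensions to the standard metrics. Header-derived labels such as user_tier are only populated for HTTP traffic, and every added label multiplies series cardinality, so keep the set of possible values small:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metric-tags
  namespace: production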
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: ALL_METRICS
            mode: CLIENT_AND_SERVER
          tagOverrides:
            # Add custom labels from request headers
            user_tier:
              value: "request.headers['x-user-tier'] | 'unknown'"
            region:
              value: "node.labels['topology.kubernetes.io/region'] | 'unknown'"
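
Custom tags are not free: each new label can multiply the number of time series Prometheus has to store. The sketch below computes a worst-case upper bound, assuming every label value can co-occur with every existing series. The function name and the example counts are hypothetical.

```python
from math import prod


def estimated_series(base_series: int, label_value_counts: dict[str, int]) -> int:
    """Upper bound on series count after adding labels with the given number of values."""
    if any(count < 1 for count in label_value_counts.values()):
        raise ValueError("each label needs at least one possible value")
    return base_series * prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical: 1,000 existing series, 4 user tiers (including 'unknown'), 3 regions.
    print(estimated_series(1000, {"user_tier": 4, "region": 3}))
```

Real cardinality is usually lower than this bound, but a label with unbounded values (user IDs, raw paths) makes the bound meaningless and should not be used as a tag.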

The Service Graph: Understanding Dependency Topology

The mesh gives you something metrics alone can't: the actual service dependency graph derived from observed traffic, not from documentation or code.

Kiali is the Istio-native visualization tool for this:

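# Install Kiali with Helm (anonymous auth is convenient for a first look, not for production)
helm repo add kiali https://kiali.org/helm-charts
helm repo update
helm install kiali-server kiali/kiali-server \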
  --namespace istio-system \
  --set auth.strategy="anonymous" \
  --set external_services.prometheus.url="http://prometheus.monitoring:9090" \
  --set external_services.tracing.url="http://tempo.monitoring:3100" \
  --set external_services.grafana.url="http://grafana.monitoring:3000"

# Port-forward for access
kubectl port-forward svc/kiali 20001:20001 -n istio-system

The Kiali service graph shows every active service-to-service connection with real-time error rate and latency overlaid on the edges. A red edge between checkout-service and payment-service immediately shows where errors are occurring without needing to query logs or metrics.

For programmatic access to the service graph:

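# Query Prometheus for the actual service topology
import json
import urllib.parse
import urllib.request

# Same Prometheus address as in the Kiali install above; adjust for your cluster.
PROMETHEUS_URL = "http://prometheus.monitoring:9090"


def prometheus_query(query: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]


def get_service_dependency_graph(namespace: str) -> dict:
    """
    Return the actual observed service dependency graph from Istio metrics.
    """
    # reporter="source" counts each call once, from the client sidecar's point of view.
    query = f"""
    sum by (source_workload, destination_service_name) (
      rate(istio_requests_total{{
        reporter="source",
        source_workload_namespace="{namespace}",
        destination_service_namespace="{namespace}"
      }}[5m])
    )
    """

    result = prometheus_query(query)

    nodes = set()
    edges = []
    for series in result:
        source = series["metric"]["source_workload"]
        dest = series["metric"]["destination_service_name"]
        rps = float(series["value"][1])  # instant vector value is [timestamp, "value"]

        nodes.add(source)
        nodes.add(dest)

        # Keep only edges that carried traffic in the last 5 minutes.
        if rps > 0:
            edges.append({
                "source": source,
                "destination": dest,
                "requests_per_second": rps,
            })

    return {"nodes": sorted(nodes), "edges": edges}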
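
One practical use of the observed graph is answering "who is affected if this service degrades?". The sketch below walks the edges backwards from a service to find every direct and transitive caller. It takes a dict in the same shape that get_service_dependency_graph returns. The function name `upstream_callers` is hypothetical, and it assumes workload names match service names, since the graph mixes source_workload with destination_service_name.

```python
from collections import defaultdict, deque


def upstream_callers(graph: dict, service: str) -> set:
    """Return every node that directly or transitively calls `service`."""
    # Reverse index: destination -> set of sources that call it.
    callers_of = defaultdict(set)
    for edge in graph["edges"]:
        callers_of[edge["destination"]].add(edge["source"])

    seen = set()
    queue = deque([service])
    while queue:  # breadth-first walk against the direction of traffic
        current = queue.popleft()
        for caller in callers_of[current]:
            if caller not in seen and caller != service:
                seen.add(caller)
                queue.append(caller)
    return seen


if __name__ == "__main__":
    # Hand-written example graph, not real cluster data.
    example = {
        "nodes": ["checkout-service", "frontend", "ledger", "payment-service"],
        "edges": [
            {"source": "frontend", "destination": "checkout-service", "requests_per_second": 40.0},
            {"source": "checkout-service", "destination": "payment-service", "requests_per_second": 12.0},
            {"source": "payment-service", "destination": "ledger", "requests_per_second": 12.0},
        ],
    }
    print(sorted(upstream_callers(example, "payment-service")))
```

The `seen` set also guards against cycles, which do occur in real dependency graphs.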

mTLS: Observing Encryption Status

Istio can enforce mutual TLS between all services in the mesh. The observability question: is mTLS actually active, and are there any services communicating in plaintext?

# Check mTLS status and applied policies for a single service
istioctl x describe service payment-service.production

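# List inbound filter chains that have no TLS transport socket. Outbound listeners are
# skipped because Envoy originates mTLS on clusters rather than listeners, so they would always match.
# Under STRICT mode, chains that match application ports should not appear in the output.
kubectl exec -n production deployment/checkout-service -c istio-proxy -- \
  curl -s localhost:15000/config_dump | \
  jq '.configs[] | select(.["@type"] | contains("Listener")) |
      .dynamic_listeners[] | select(.name == "virtualInbound") |
      .active_state.listener.filter_chains[] |
      select(.transport_socket | not) | {name, filter_chain_match}'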

# Verify peer authentication policy
kubectl get peerauthentication -A

# Enforce strict mTLS cluster-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # Reject all plaintext traffic

---
# Exception for specific services that can't support mTLS yet
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-service-permissive
  namespace: production
spec:
  selector:
    matchLabels:
      app: legacy-service
  mtls:
    mode: PERMISSIVE  # Accept both mTLS and plaintext during migration

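Monitor istio_requests_total{reporter="destination"} broken down by connection_security_policy to track what percentage of traffic is encrypted. The label is only set reliably on server-side reports, where it is "mutual_tls" for mesh-encrypted requests and "none" for plaintext. Alert if connection_security_policy="none" traffic appears in a namespace configured for STRICT mode.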
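
The percentage can be computed from a single grouped query. The sketch below is one way to do it, under stated assumptions: `MTLS_QUERY` and `mtls_share` are hypothetical names, the input is a Prometheus instant-vector result (a list of `{"metric": ..., "value": [ts, "rate"]}` entries), and the sample data is hand-written.

```python
# Assumed query; run it with any Prometheus client and pass the result vector to mtls_share().
MTLS_QUERY = (
    "sum by (destination_workload_namespace, connection_security_policy) ("
    'rate(istio_requests_total{reporter="destination"}[5m]))'
)


def mtls_share(samples: list) -> dict:
    """Return the fraction of request rate that used mutual TLS, per destination namespace."""
    totals = {}
    encrypted = {}
    for sample in samples:
        labels = sample["metric"]
        ns = labels.get("destination_workload_namespace", "unknown")
        rps = float(sample["value"][1])
        totals[ns] = totals.get(ns, 0.0) + rps
        if labels.get("connection_security_policy") == "mutual_tls":
            encrypted[ns] = encrypted.get(ns, 0.0) + rps
    # Namespaces with no traffic are omitted to avoid dividing by zero.
    return {ns: encrypted.get(ns, 0.0) / total for ns, total in totals.items() if total > 0}


if __name__ == "__main__":
    fake = [
        {"metric": {"destination_workload_namespace": "production",
                    "connection_security_policy": "mutual_tls"}, "value": [0, "90"]},
        {"metric": {"destination_workload_namespace": "production",
                    "connection_security_policy": "none"}, "value": [0, "10"]},
    ]
    print(mtls_share(fake))
```

Any namespace with a share below 1.0 that is supposed to be STRICT is worth investigating.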

Traffic Management Observability

Istio's traffic management features (retries, timeouts, circuit breaking) need their own observability. When a circuit breaker opens, when retries are occurring, or when timeout policies are firing — these events should be visible.

# VirtualService with observability-friendly configuration
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
  namespace: production
spec:
  hosts:
    - payment-service
  http:
    - timeout: 5s          # Timeout — fires if payment takes >5s
      retries:
        attempts: 3        # Retry up to 3 times
        perTryTimeout: 2s
        retryOn: "5xx,reset,connect-failure,retriable-4xx"
      route:
        - destination:
            host: payment-service
            port:
              number: 8080

---
# DestinationRule with circuit breaking
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
  namespace: production
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 5     # Eject after 5 consecutive 502/503/504 responses
      interval: 30s                   # How often the ejection analysis runs
      baseEjectionTime: 30s           # Minimum ejection duration
      maxEjectionPercent: 50          # Never eject more than 50% of endpoints
      splitExternalLocalOriginErrors: false
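
The overall timeout and the per-try timeout interact: with timeout: 5s and perTryTimeout: 2s, not all configured attempts can fit. The arithmetic below is a simplified worst-case sketch in which every try runs until its per-try timeout and retry backoff is ignored. The function name is hypothetical.

```python
import math


def tries_within_timeout(timeout_s: float, per_try_timeout_s: float, retry_attempts: int) -> int:
    """Worst-case number of tries (original + retries) that can start before the overall timeout."""
    if timeout_s <= 0 or per_try_timeout_s <= 0 or retry_attempts < 0:
        raise ValueError("timeouts must be positive and retry_attempts non-negative")
    configured = 1 + retry_attempts                     # original request plus retries
    fits = math.ceil(timeout_s / per_try_timeout_s)     # tries that can begin inside the budget
    return min(configured, fits)


if __name__ == "__main__":
    # Values from the VirtualService above: timeout 5s, perTryTimeout 2s, attempts 3.
    print(tries_within_timeout(5, 2, 3))
```

With the values from the VirtualService above, tries start at 0s, 2s, and 4s, and the third is cut short by the 5s overall timeout, so the fourth configured try never happens when the upstream is hanging.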

Monitor circuit breaker activity:

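# Endpoints currently ejected by outlier detection (circuit breaker open)
# Istio sidecars do not export Envoy's outlier_detection stats by default; include them with proxyStatsMatcher
envoy_cluster_outlier_detection_ejections_active{cluster_name=~"outbound.*payment-service.*"}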

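# Retry rate: high retry rate indicates upstream instability
# (Envoy's own counter; like the stat above, it must be included with proxyStatsMatcher)
sum(rate(envoy_cluster_upstream_rq_retry{cluster_name=~"outbound.*payment-service.*"}[5m]))

# Requests that failed after retries were exhausted (URX = UpstreamRetryLimitExceeded;
# plain UR means upstream remote reset, not a retry)
sum(rate(istio_requests_total{
  destination_service="payment-service.production.svc.cluster.local",
  response_flags=~".*URX.*"
}[5m]))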

# Upstream request timeout rate
sum(rate(istio_requests_total{
  destination_service="payment-service.production.svc.cluster.local",
  response_flags="UT"  # UpstreamRequestTimeout
}[5m]))

Alert when retry rate exceeds 5% of total requests — it indicates that the upstream service is flapping and retries are masking instability. High retry rates also multiply load on the upstream service, which can accelerate a degradation into a full outage.
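
The alert condition and the load multiplication can be expressed in a few lines. This is an illustrative sketch with hypothetical function names. It assumes `total_rps` counts original requests rather than individual attempts, and it uses the 5% threshold from the text as the default.

```python
def retry_ratio(retry_rps: float, total_rps: float) -> float:
    """Retries per original request; 0.0 when there is no traffic."""
    if total_rps <= 0:
        return 0.0
    return retry_rps / total_rps


def should_alert(retry_rps: float, total_rps: float, threshold: float = 0.05) -> bool:
    """True when the retry ratio is strictly above the threshold."""
    return retry_ratio(retry_rps, total_rps) > threshold


def upstream_load_factor(retry_rps: float, total_rps: float) -> float:
    """How much more traffic the upstream sees than the callers originally sent."""
    return 1.0 + retry_ratio(retry_rps, total_rps)


if __name__ == "__main__":
    # Hypothetical rates: 100 requests/s with 8 retries/s.
    print(should_alert(8, 100), upstream_load_factor(8, 100))
```

A load factor that climbs while success rate stays flat is the signature of retries masking instability.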

Integrating Mesh Metrics with Existing Observability

Istio metrics, application metrics, and infrastructure metrics should live in the same Prometheus to enable correlation:

# PodMonitor for Istio sidecar metrics
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxy-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      security.istio.io/tlsMode: istio
  podMetricsEndpoints:
    - port: http-envoy-prom   # Port 15090 — Envoy's metrics port
      path: /stats/prometheus
      interval: 15s

With mesh and application metrics in the same Prometheus, you can build correlation queries:

# Is latency from the mesh perspective matching application-reported latency?
# Large discrepancy indicates the bottleneck is in the network, not the application

# Mesh-observed p99 latency (measured by the client-side Envoy, so it includes the network and both sidecars)
histogram_quantile(0.99, sum by (le) (rate(istio_request_duration_milliseconds_bucket{
  destination_service="payment-service.production.svc.cluster.local",
  reporter="source"
}[5m])))

# Application-reported p99 latency (inside the process)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{
  job="payment-service"
}[5m]))) * 1000

# If mesh latency >> application latency: the time is going to the network or the sidecars
# If mesh latency ≈ application latency: the application itself is slow

This correlation query is one of the highest-value things a service mesh enables that pure application instrumentation can't: distinguishing network latency from application processing latency without touching application code.
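
Once both p99 values are available, the comparison is easy to automate. The sketch below is only illustrative: the function name is hypothetical, and the 1.5x gap ratio is an arbitrary starting point rather than a recommended value.

```python
def classify_latency(mesh_p99_ms: float, app_p99_ms: float, gap_ratio: float = 1.5) -> str:
    """Label where most of the p99 latency appears to come from."""
    if mesh_p99_ms <= 0 or app_p99_ms <= 0:
        return "insufficient-data"
    # Mesh latency well above application latency: time is spent outside the process.
    if mesh_p99_ms >= app_p99_ms * gap_ratio:
        return "network-dominated"
    # Otherwise the application accounts for most of what the mesh observes.
    return "application-dominated"


if __name__ == "__main__":
    # Hypothetical measurements in milliseconds.
    print(classify_latency(900.0, 120.0))
    print(classify_latency(130.0, 120.0))
```

Treat the label as a hint for where to look first, since quantiles from two different histograms with different bucket boundaries are never exactly comparable.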

*Zak Hassan is a Staff SRE specializing in service mesh operations, Kubernetes networking, and observability engineering. Find him at zakhassan.com or on LinkedIn.*
