*By Zak Hassan — Staff SRE | May 2026*


In a microservices architecture, every service eventually reinvents the same wheel. One team wires up JWT validation with a subtle clock-skew bug. Another ships rate limiting with an off-by-one in the sliding window. A third service logs request IDs in a different format from every other service in the fleet. The API gateway exists to solve exactly this problem — not as an architectural luxury, but as an operational necessity. When cross-cutting concerns live at the edge, they become consistent, auditable, and operationally manageable without touching every downstream service.

Why the Gateway Is the Right Place

The argument against a gateway usually goes: "it's a single point of failure, and it adds latency." Both concerns are addressable through redundancy and co-location. The argument for a gateway is harder to dismiss: without one, you are distributing your authentication and rate-limiting logic across every service team, every language runtime, and every deployment cycle in the organization.

When auth logic lives in fifteen services, a CVE in your JWT library requires fifteen coordinated deployments. When rate limiting is per-service, a badly behaved API consumer can spread load across many endpoints, staying safely under every individual service's threshold while still overwhelming shared infrastructure. When logs are inconsistent, correlating a cascade failure across services becomes archaeological work.

The gateway centralizes: authentication, authorization policy enforcement, rate limiting, request/response transformation, TLS termination, routing, and observability instrumentation. Services behind the gateway can focus on their domain logic and trust that the edge has already validated the caller.

Rate Limiting Strategies

Three algorithms dominate gateway-level rate limiting, and they are not interchangeable.

Token bucket maintains a bucket with a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token. Bursts are allowed up to the bucket's capacity. This is the right model for APIs where legitimate clients have bursty workloads — a CI pipeline that fires off thirty requests in two seconds and then goes quiet.
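A minimal sketch of the token bucket described above may make the burst behaviour concrete. The class name, field names, and the choice to pass the clock in as `now` are assumptions made for illustration; a real gateway would keep this state in a shared store rather than in process memory.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy token bucket; `now` is injected so the behaviour is deterministic."""

    capacity: float        # maximum burst size, in tokens
    refill_per_sec: float  # steady-state request rate
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self, now: float) -> bool:
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # each request consumes one token
            return True
        return False


# A full bucket of 30 absorbs a CI-style burst, then throttles to the refill rate.
bucket = TokenBucket(capacity=30, refill_per_sec=1.0, tokens=30)
burst = [bucket.allow(now=0.0) for _ in range(31)]
```

In this sketch the first 30 calls at the same instant succeed and the 31st is refused; one second later a single token has been refilled.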

Leaky bucket enforces a strictly uniform output rate, queuing or dropping excess requests. It prevents bursts entirely, which is useful when backends are genuinely sensitive to instantaneous load spikes, but it creates head-of-line blocking for legitimate clients during bursts.
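The queuing variant of the leaky bucket can be sketched as a bounded queue that releases requests at evenly spaced slots. Everything here (class name, `offer`/`drain` methods, the injected clock) is hypothetical and only meant to show the uniform output rate and the drop-on-overflow behaviour.

```python
from collections import deque


class LeakyBucket:
    """Toy leaky bucket used as a queue: fixed drain rate, bounded backlog."""

    def __init__(self, drain_per_sec: float, max_queue: int) -> None:
        self.interval = 1.0 / drain_per_sec  # seconds between released requests
        self.max_queue = max_queue
        self.queue: deque[tuple[float, str]] = deque()
        self.next_release = 0.0

    def offer(self, request_id: str, now: float) -> bool:
        """Enqueue a request, or drop it (return False) when the backlog is full."""
        if len(self.queue) >= self.max_queue:
            return False
        self.queue.append((now, request_id))
        return True

    def drain(self, now: float) -> list[tuple[float, str]]:
        """Release queued requests whose evenly spaced slot has arrived."""
        released = []
        while self.queue:
            arrived, request_id = self.queue[0]
            # A request can leave no earlier than it arrived, and no earlier
            # than one interval after the previous release.
            slot = max(self.next_release, arrived)
            if slot > now:
                break
            self.queue.popleft()
            released.append((slot, request_id))
            self.next_release = slot + self.interval
        return released


leaky = LeakyBucket(drain_per_sec=2, max_queue=3)
accepted = [leaky.offer(f"r{i}", now=0.0) for i in range(5)]
first = leaky.drain(now=0.0)
later = leaky.drain(now=1.0)
```

Here a burst of five requests fills the three-slot backlog and the rest are dropped; the accepted requests then leave half a second apart regardless of how they arrived, which is the head-of-line delay the paragraph above refers to.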

Sliding window tracks request counts within a rolling time window rather than a fixed bucket. It avoids the boundary-game exploit of fixed windows (where a client sends 100 requests at 11:59:59 and 100 more at 12:00:01 without triggering a 100-req/min limit), and it gives a more accurate view of recent activity than a token bucket's fill-level.
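As an illustration of the boundary case, here is a small sliding-window log limiter. It is a sketch under simplifying assumptions: it stores one timestamp per request in memory (production limiters typically use counters in a shared store), and the class and parameter names are invented for the example.

```python
from collections import deque


class SlidingWindowLimiter:
    """Toy sliding-window log: allow at most `limit` requests per rolling window."""

    def __init__(self, limit: int, window_sec: float) -> None:
        self.limit = limit
        self.window_sec = window_sec
        self.hits: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have slid out of the window ending at `now`.
        while self.hits and self.hits[0] <= now - self.window_sec:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False
        self.hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=100, window_sec=60.0)
# 100 requests one second before a minute boundary, 100 more just after it.
before_boundary = [limiter.allow(now=59.0) for _ in range(100)]
after_boundary = [limiter.allow(now=61.0) for _ in range(100)]
```

The first hundred requests pass and the second hundred are all refused, because the window is anchored to each request's timestamp rather than to the clock minute.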

For most production APIs, sliding window per API key is the right default. Token bucket is the right choice when you are explicitly designing for burst tolerance.

When a request is rate-limited, return 429 Too Many Requests with enough information for the client to back off correctly:

text
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1746700800
Retry-After: 47

{"error": "rate_limit_exceeded", "message": "Request limit of 1000/hour exceeded. Retry after 47 seconds."}

Never return a 503 for rate limiting — that signals a service problem, not a client problem, and will confuse your SLO dashboards. The Retry-After header is mandatory if you want clients to implement sensible backoff rather than hammering you with retries that compound the problem.
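The relationship between the reset timestamp and Retry-After in the response above can be sketched as a small helper. The function name and signature are assumptions for illustration; the header names are the ones used in the example response.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int, now: float) -> dict[str, str]:
    """Build rate-limit response headers; Retry-After is added only when blocked."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Seconds until the window resets, rounded up so clients never retry early.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


# Reproduces the numbers in the example response: 47 seconds before the reset.
blocked = rate_limit_headers(limit=1000, remaining=0, reset_epoch=1746700800, now=1746700753)
allowed = rate_limit_headers(limit=1000, remaining=12, reset_epoch=1746700800, now=1746700753)
```

Deriving Retry-After from the same reset value keeps the two headers from drifting apart, which is one less thing for client authors to reconcile.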

Here is a Kong rate-limiting plugin configuration with JWT auth applied to a route:

yaml
# kong-plugins.yaml
_format_version: "3.0"

services:
  - name: orders-api
    url: http://orders-service.internal:8080
    routes:
      - name: orders-v1
        paths:
          - /v1/orders
        methods:
          - GET
          - POST
        plugins:
          - name: jwt
            config:
              key_claim_name: kid
              claims_to_verify:
                - exp
                - nbf
              uri_param_names:
                - jwt
              cookie_names: []
              header_names:
                - authorization
              maximum_expiration: 3600
              run_on_preflight: false

          - name: rate-limiting
            config:
              minute: ~
              hour: 1000
              day: ~
              policy: redis
              redis_host: redis.internal
              redis_port: 6379
              redis_database: 0
              limit_by: credential
              fault_tolerant: true
              hide_client_headers: false
              error_code: 429
              error_message: "API rate limit exceeded"

          - name: request-transformer
            config:
              add:
                headers:
                  # X-Consumer-ID and X-Consumer-Username are set by Kong's auth
                  # plugins after a successful match, so they are not added here.
                  - "X-Forwarded-Service:orders-api"
              remove:
                headers:
                  - "X-Internal-Token"

fault_tolerant: true on the rate-limiting plugin is important: if Redis goes down, Kong will let requests through rather than blocking all traffic. The alternative — failing closed — sounds safer but will cause a full outage if your rate-limit store becomes unavailable.

Authentication at the Gateway

The gateway should validate tokens, not just forward them. The distinction matters: validating at the gateway means an expired, forged, or malformed token never reaches your backend services, reducing their attack surface and eliminating redundant validation logic in each service.

For JWT validation, the gateway checks signature, expiry, and the nbf (not before) claim. It does not need to call an identity service for stateless JWTs: verifying the signature against the known public key is sufficient to prove the token is authentic and unexpired. It does not prove the token is unrevoked, so if you need revocation, keep token lifetimes short or check a denylist or introspection endpoint. Pass verified claims downstream as trusted headers (X-Consumer-ID, X-Consumer-Username) so services can use them without re-parsing the token, and strip any client-supplied copies of those headers so they cannot be spoofed.
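To make the three checks concrete, here is a standard-library sketch of stateless JWT validation. Note the swap: it uses HS256 (a shared secret) instead of public-key verification, purely so the example runs without third-party packages. The function names, the `leeway` parameter, and the demo secret are all assumptions; in practice you would rely on the gateway's built-in plugin or a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Demo helper only: mint a token that the verifier below can check."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, now: float, leeway: float = 0.0) -> dict:
    """Return the claims if signature, exp and nbf check out; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(sig_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":  # never trust the token to pick the algorithm
        raise ValueError("unexpected alg")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    if "exp" not in claims or now >= claims["exp"] + leeway:
        raise ValueError("token expired")
    if "nbf" in claims and now < claims["nbf"] - leeway:
        raise ValueError("token not yet valid")
    return claims


SECRET = b"demo-secret"  # hypothetical shared secret for this sketch
token = sign_hs256({"sub": "consumer-42", "nbf": 100, "exp": 200}, SECRET)
claims = verify_hs256(token, SECRET, now=150)
```

The `leeway` argument is where a deliberate clock-skew tolerance would go; keeping it explicit and small avoids the kind of ad hoc skew handling that tends to differ between services.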

For API key management, store a hash of the key (not the key itself) in the gateway's data store. The gateway hashes the incoming key and does a lookup. If found, it resolves the associated consumer ID and applies that consumer's rate limits and permissions.
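The hash-then-lookup flow can be sketched in a few lines. The store, key value, and consumer ID below are hypothetical, and the use of plain SHA-256 assumes keys are long random strings rather than user-chosen secrets.

```python
import hashlib
from typing import Optional


def hash_key(api_key: str) -> str:
    # Assumption: keys are high-entropy random strings, so a fast hash is adequate.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical data store: key hash -> consumer ID. Only hashes are persisted.
KEY_STORE: dict[str, str] = {hash_key("demo-key-abc123"): "consumer-42"}


def resolve_consumer(presented_key: str) -> Optional[str]:
    """Hash the presented key and look up its consumer; None means reject."""
    return KEY_STORE.get(hash_key(presented_key))


known = resolve_consumer("demo-key-abc123")
unknown = resolve_consumer("not-a-real-key")
```

Because only the digest is stored, a leaked copy of the gateway's data store does not directly expose usable keys.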

For OAuth2 token introspection, the gateway calls the authorization server's introspection endpoint. This adds a network hop and latency, so cache introspection results with a TTL shorter than the token's expiry — typically 60 seconds is a good balance between freshness and performance.
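A caching layer of this kind might look like the sketch below. The class, its method names, and the injected `introspect` callable are assumptions; the `active` and `exp` fields follow the standard introspection response format. A real cache would also bound its size and key on a hash of the token rather than the raw value.

```python
from typing import Callable


class IntrospectionCache:
    """Toy TTL cache in front of a (hypothetical) introspection call."""

    def __init__(self, introspect: Callable[[str], dict], ttl_sec: float = 60.0) -> None:
        self.introspect = introspect
        self.ttl_sec = ttl_sec
        self._cache: dict[str, tuple[float, dict]] = {}

    def lookup(self, token: str, now: float) -> dict:
        cached = self._cache.get(token)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh: skip the network hop
        result = self.introspect(token)
        # Never cache past the token's own expiry, and never cache inactive tokens.
        expires_at = now + self.ttl_sec
        if "exp" in result:
            expires_at = min(expires_at, float(result["exp"]))
        if result.get("active"):
            self._cache[token] = (expires_at, result)
        else:
            self._cache.pop(token, None)
        return result


calls: list[str] = []


def fake_introspect(token: str) -> dict:
    """Stand-in for the authorization server; records each call."""
    calls.append(token)
    return {"active": True, "exp": 1000, "sub": "user-1"}


cache = IntrospectionCache(fake_introspect, ttl_sec=60.0)
cache.lookup("tok", now=0.0)
cache.lookup("tok", now=30.0)   # served from cache
calls_within_ttl = len(calls)
cache.lookup("tok", now=61.0)   # TTL elapsed, introspects again
calls_after_ttl = len(calls)
```

The TTL is also the upper bound on how long a revoked token keeps working at the gateway, which is the freshness side of the trade-off.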

When to pass through to services: if a service handles its own user-level permissions (e.g., "can this user access this specific document?"), that authorization check belongs in the service, not the gateway. The gateway enforces: is this caller authenticated and allowed to call this API at all? The service enforces: is this caller allowed to perform this specific action on this specific resource?

Request and Response Transformation

The gateway is the natural home for API versioning. Rather than routing /v1/ and /v2/ to entirely different service deployments, you can often route both to the same service and use the gateway to transform requests:

yaml
# Route /v1/users/{id} to the users service, rewriting the path
# and injecting a version header so the service can behave accordingly
- name: users-v1
  paths:
    - ~/v1/users/(?P<id>[^/]+)$
  strip_path: false
  plugins:
    - name: request-transformer
      config:
        replace:
          uri: "/users/$(uri_captures.id)"
        add:
          headers:
            - "X-API-Version:1"

- name: users-v2
  paths:
    - ~/v2/users/(?P<id>[^/]+)$
  strip_path: false
  plugins:
    - name: request-transformer
      config:
        replace:
          uri: "/users/$(uri_captures.id)"
        add:
          headers:
            - "X-API-Version:2"

This keeps the service interface clean while letting the gateway manage the versioning contract with external consumers. When you deprecate v1, you update one gateway route, not a service.
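One way to sanity-check the intended rewrite before deploying the routes is to model it in a few lines of Python. This is only a model of the behaviour described above, not Kong's implementation; the function name and return shape are made up for the example.

```python
import re
from typing import Optional

# Same capture idea as the route paths above, with the version captured as well.
VERSIONED_USER = re.compile(r"^/v(?P<version>[12])/users/(?P<id>[^/]+)$")


def rewrite(path: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (upstream_path, extra_headers) for a versioned path, or None if unrouted."""
    match = VERSIONED_USER.match(path)
    if match is None:
        return None
    return f"/users/{match['id']}", {"X-API-Version": match["version"]}


v1 = rewrite("/v1/users/42")
v2 = rewrite("/v2/users/42")
unrouted = rewrite("/v3/users/42")
```

Both versions map to the same upstream path and differ only in the injected header, which is the contract the service has to honour.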

Circuit Breaking and Timeouts at the Gateway

A slow backend is more dangerous than a failed one, because slow backends consume connection pool slots, queue space, and thread time — they cascade. The gateway is the right place to impose timeouts and circuit breaking because it sees all traffic to a service and can trip a circuit before the service takes down dependent callers.

Envoy splits these concerns: timeouts and retries are configured on the route, while circuit breakers and outlier detection are configured on the cluster. The excerpt below shows both:

yaml
# envoy-route-config.yaml
virtual_hosts:
  - name: orders-service
    domains:
      - "api.example.com"
    routes:
      - match:
          prefix: "/v1/orders"
        route:
          cluster: orders-cluster
          timeout: 5s
          retry_policy:
            retry_on: "5xx,reset,connect-failure,retriable-4xx"
            num_retries: 2
            per_try_timeout: 2s
            retry_host_predicate:
              - name: envoy.retry_host_predicates.previous_hosts

clusters:
  - name: orders-cluster
    connect_timeout: 1s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1000
          max_pending_requests: 500
          max_requests: 2000
          max_retries: 10
          track_remaining: true
    outlier_detection:
      consecutive_5xx: 5
      interval: 10s
      base_ejection_time: 30s
      max_ejection_percent: 50
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 100

max_ejection_percent: 50 is critical: it prevents Envoy from ejecting so many hosts that the remaining ones cannot carry the load. The outlier_detection block handles the gradual failure case: if a single host starts returning 5xx, it gets ejected temporarily without tripping the cluster-wide circuit breaker.

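Set per_try_timeout to be shorter than your overall route timeout. With a 5-second route timeout and a 2-second per-try timeout, a hung attempt is abandoned after 2 seconds, which leaves room for retries instead of letting a single misbehaving attempt consume the whole budget. The route timeout still caps the total: three attempts at 2 seconds each would need 6 seconds, so the last retry is cut off at the 5-second mark.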
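The arithmetic behind that budget can be written down as a short sketch. It deliberately ignores the back-off delay between retries, and the function names are invented for the example.

```python
def worst_case_latency(route_timeout: float, per_try_timeout: float, num_retries: int) -> float:
    """Longest a caller can wait: all attempts time out, capped by the route timeout."""
    attempts = num_retries + 1  # the original try plus the retries
    return min(route_timeout, attempts * per_try_timeout)


def full_attempts(route_timeout: float, per_try_timeout: float, num_retries: int) -> int:
    """How many attempts can run to their full per-try timeout within the budget."""
    return min(num_retries + 1, int(route_timeout // per_try_timeout))


# Values from the route above: timeout 5s, per_try_timeout 2s, num_retries 2.
worst = worst_case_latency(5.0, 2.0, 2)
complete = full_attempts(5.0, 2.0, 2)
```

With these values the caller waits at most 5 seconds, and only two of the three possible attempts get their full 2 seconds; the third is cut short by the route timeout.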

Observability: What the Gateway Sees That Services Don't

The gateway has a unique observational position. It sees every request before routing, which means it captures a latency breakdown that no individual service can see: total latency measured at the gateway (from receiving the request to returning the response) and upstream latency (the time the backend service took to respond). The difference between these is your gateway tax, the overhead introduced by auth, rate limiting, and routing logic, and it should be stable and measurable.

More importantly, the gateway sees requests that never reach a service: rate-limited requests, auth failures, requests to unknown routes. These are invisible to the service-level metrics, but they are rich signals. A spike in 401 responses from a single IP often signals a credential stuffing attempt. A surge in 429s from a single API key usually means a client has a retry loop with no backoff.

Here is a Python script for parsing gateway access logs to surface these patterns:

python
#!/usr/bin/env python3
"""
analyze_gateway_logs.py

Parses structured JSON gateway access logs to identify:
- Top rate-limited consumers
- Auth failure spikes by IP
- Upstream latency outliers by route
"""

import json
import sys
from collections import defaultdict
from statistics import mean, quantiles


def parse_logs(path: str) -> list[dict]:
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
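            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting the run
            if isinstance(entry, dict):  # ignore valid JSON that is not an object
                entries.append(entry)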
    return entries


def analyze(entries: list[dict]) -> None:
    rate_limited: dict[str, int] = defaultdict(int)
    auth_failures: dict[str, int] = defaultdict(int)
    upstream_latencies: dict[str, list[float]] = defaultdict(list)
    gateway_overhead: list[float] = []

    for e in entries:
        status = e.get("status", 0)
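        # Fall back to the client IP when consumer_id is missing or null
        # (unauthenticated requests are often logged with a null consumer).
        consumer = e.get("consumer_id") or e.get("client_ip") or "unknown"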
        client_ip = e.get("client_ip", "unknown")
        route = e.get("route_name", "unknown")

        upstream_ms = e.get("upstream_latency_ms")
        gateway_ms = e.get("gateway_latency_ms")

        if status == 429:
            rate_limited[consumer] += 1

        if status == 401:
            auth_failures[client_ip] += 1

        if upstream_ms is not None and route != "unknown":
            upstream_latencies[route].append(float(upstream_ms))

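        # Assumes gateway_latency_ms is the total time measured at the gateway,
        # so subtracting upstream time leaves the gateway's own overhead.
        if upstream_ms is not None and gateway_ms is not None:
            overhead = float(gateway_ms) - float(upstream_ms)
            if overhead >= 0:
                gateway_overhead.append(overhead)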

    print("=== Top Rate-Limited Consumers ===")
    for consumer, count in sorted(rate_limited.items(), key=lambda x: -x[1])[:10]:
        print(f"  {consumer}: {count} requests blocked")

    print("\n=== Auth Failure Spikes by IP (top 10) ===")
    for ip, count in sorted(auth_failures.items(), key=lambda x: -x[1])[:10]:
        flag = " *** INVESTIGATE" if count > 100 else ""
        print(f"  {ip}: {count} 401s{flag}")

    print("\n=== Upstream Latency p50/p95/p99 by Route ===")
    for route, latencies in sorted(upstream_latencies.items()):
        if len(latencies) < 10:
            continue
        qs = quantiles(latencies, n=100)
        print(
            f"  {route}: p50={qs[49]:.1f}ms  p95={qs[94]:.1f}ms  p99={qs[98]:.1f}ms"
            f"  (n={len(latencies)})"
        )

    if len(gateway_overhead) >= 2:  # quantiles() needs at least two data points
        qs = quantiles(gateway_overhead, n=100)
        print("\n=== Gateway Processing Overhead ===")
        print(
            f"  p50={qs[49]:.1f}ms  p95={qs[94]:.1f}ms  p99={qs[98]:.1f}ms"
            f"  mean={mean(gateway_overhead):.1f}ms"
        )


if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kong/access.log"
    entries = parse_logs(log_path)
    print(f"Parsed {len(entries)} log entries\n")
    analyze(entries)

Run this against an hour of logs after any incident — the auth failure distribution by IP and the rate-limit hit counts by consumer frequently tell a cleaner story than service-level traces.

Choosing a Gateway

Kong is plugin-heavy by design. Its declarative configuration model (shown above) is mature, and its plugin ecosystem covers nearly every cross-cutting concern you can name. The operational complexity scales with your plugin count: every enabled plugin adds latency and a potential failure mode. Kong is the right choice when you need rapid feature iteration on gateway behavior and the team is comfortable managing a Postgres-backed control plane or running DB-less from declarative files.

Envoy / Contour is the Kubernetes-native option. Envoy's configuration model is verbose but extraordinarily precise — the circuit breaking and outlier detection configuration above is representative. Contour wraps Envoy with a Kubernetes-native CRD layer (HTTPProxy) that makes common routing patterns manageable without writing raw Envoy xDS config. Choose this when you are already deep in Kubernetes and want your gateway configuration to live alongside your workload manifests.

AWS API Gateway eliminates the infrastructure management problem entirely. It scales to zero, handles TLS, and integrates natively with Lambda, Cognito, and IAM. The trade-offs are real: Lambda cold starts behind the gateway affect latency, the 10 MB payload limit is a hard constraint, the 29-second default integration timeout can be raised only for some REST API configurations, and complex routing logic quickly becomes difficult to test locally. It is the right choice for serverless workloads and teams that cannot justify operating gateway infrastructure.

The selection framework: if you are on Kubernetes and want infrastructure-as-code, use Envoy/Contour. If you need rapid plugin iteration and a rich ecosystem, use Kong. If you are building serverless and want managed infrastructure, use AWS API Gateway.

What all three share: they will tell you more about the health of your API surface than any individual service can. A gateway that is configured well, observed properly, and given authority over cross-cutting concerns will prevent more incidents than it causes — and the incidents it prevents are the hardest kind to debug after the fact.


*Zak Hassan is a Staff SRE specializing in distributed systems reliability, API infrastructure, and observability. Find him at zakhassan.com or on LinkedIn.*
