*By Zak Hassan — Staff SRE | May 2026*
In a microservices architecture, every service eventually reinvents the same wheel. One team wires up JWT validation with a subtle clock-skew bug. Another ships rate limiting with an off-by-one in the sliding window. A third service logs request IDs in a different format from every other service in the fleet. The API gateway exists to solve exactly this problem — not as an architectural luxury, but as an operational necessity. When cross-cutting concerns live at the edge, they become consistent, auditable, and operationally manageable without touching every downstream service.
Why the Gateway Is the Right Place
The argument against a gateway usually goes: "it's a single point of failure, and it adds latency." Both concerns are addressable through redundancy and co-location. The argument for a gateway is harder to dismiss: without one, you are distributing your authentication and rate-limiting logic across every service team, every language runtime, and every deployment cycle in the organization.
When auth logic lives in fifteen services, a CVE in your JWT library requires fifteen coordinated deployments. When rate limiting is per-service, a badly-behaved API consumer can hammer one endpoint while staying safely under every individual service's threshold. When logs are inconsistent, correlating a cascade failure across services becomes archaeological work.
The gateway centralizes: authentication, authorization policy enforcement, rate limiting, request/response transformation, TLS termination, routing, and observability instrumentation. Services behind the gateway can focus on their domain logic and trust that the edge has already validated the caller.
Rate Limiting Strategies
Three algorithms dominate gateway-level rate limiting, and they are not interchangeable.
Token bucket maintains a bucket with a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token. Bursts are allowed up to the bucket's capacity. This is the right model for APIs where legitimate clients have bursty workloads — a CI pipeline that fires off thirty requests in two seconds and then goes quiet.
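The token bucket described above can be sketched in a few lines. This is a minimal single-process illustration, not any gateway's actual implementation; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    """Single-process token bucket; names and parameters are illustrative."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity        # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With a capacity of 30 and a refill rate of one token per second, the CI pipeline in the example gets its thirty-request burst immediately and is then held to the refill rate until the bucket recovers.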
Leaky bucket enforces a strictly uniform output rate, queuing or dropping excess requests. It prevents bursts entirely, which is useful when backends are genuinely sensitive to instantaneous load spikes, but it creates head-of-line blocking for legitimate clients during bursts.
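A rough sketch of the queuing variant, assuming a single process: each request is assigned the delay it must wait so that output stays evenly spaced, and requests that would wait too long are dropped. `LeakyBucket`, `max_wait_sec`, and the injectable `clock` are hypothetical names chosen for the illustration.

```python
import time
from typing import Optional


class LeakyBucket:
    """Leaky bucket as a queue: forwarded requests are spaced at a fixed interval."""

    def __init__(self, rate_per_sec: float, max_wait_sec: float, clock=time.monotonic):
        self.interval = 1.0 / rate_per_sec
        self.max_wait_sec = max_wait_sec
        self.clock = clock
        self.next_free = clock()      # earliest time the next request may leave

    def schedule(self) -> Optional[float]:
        """Return the delay before the request is forwarded, or None if it is dropped."""
        now = self.clock()
        start = max(now, self.next_free)
        wait = start - now
        if wait > self.max_wait_sec:
            return None               # queue is too deep: drop instead of waiting
        self.next_free = start + self.interval
        return wait
```

The growing delay returned for each request in a burst is the head-of-line blocking mentioned above: the fourth request waits behind the first three even if the backend is idle.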
Sliding window tracks request counts within a rolling time window rather than a fixed bucket. It avoids the boundary-game exploit of fixed windows (where a client sends 100 requests at 11:59:59 and 100 more at 12:00:01 without triggering a 100-req/min limit), and it gives a more accurate view of recent activity than a token bucket's fill-level.
For most production APIs, sliding window per API key is the right default. Token bucket is the right choice when you are explicitly designing for burst tolerance.
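One way to picture the sliding window per API key is a log of recent timestamps for each key. The sketch below keeps that log in process memory; a real gateway would more likely hold the counters in a shared store such as Redis. `SlidingWindowLimiter` and its parameters are illustrative names, not part of any gateway API.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` accepted requests per rolling window."""

    def __init__(self, limit: int, window_sec: float, clock=time.monotonic):
        self.limit = limit
        self.window_sec = window_sec
        self.clock = clock
        self.hits = defaultdict(deque)    # api_key -> timestamps of accepted requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        q = self.hits[api_key]
        # Discard timestamps that have aged out of the rolling window.
        while q and q[0] <= now - self.window_sec:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

Because the window rolls with each request, the 11:59:59 and 12:00:01 bursts fall inside the same 60-second window and the second burst is rejected. The cost is memory proportional to the limit for each active key.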
When a request is rate-limited, return 429 Too Many Requests with enough information for the client to back off correctly:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1746700800
Retry-After: 47

{"error": "rate_limit_exceeded", "message": "Request limit of 1000/hour exceeded. Retry after 47 seconds."}

Never return a 503 for rate limiting: that signals a service problem, not a client problem, and will confuse your SLO dashboards. The Retry-After header is mandatory if you want clients to implement sensible backoff rather than hammering you with retries that compound the problem.
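As a sketch of how those headers relate to each other, the function below derives Retry-After from the window's reset time so the two values cannot drift apart. The function name and its parameters are hypothetical; the header names and body shape are the ones used in the response above.

```python
import json
import math


def rate_limit_response(limit: int, window_label: str, reset_epoch: int, now_epoch: float):
    """Build the status, headers, and body for a rate-limited request."""
    # Round up so the client never retries a fraction of a second too early.
    retry_after = max(0, math.ceil(reset_epoch - now_epoch))
    headers = {
        "Content-Type": "application/json",
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
        "Retry-After": str(retry_after),
    }
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": (
            f"Request limit of {limit}/{window_label} exceeded. "
            f"Retry after {retry_after} seconds."
        ),
    })
    return 429, headers, body
```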
Here is a Kong rate-limiting plugin configuration with JWT auth applied to a route:
# kong-plugins.yaml
_format_version: "3.0"
services:
  - name: orders-api
    url: http://orders-service.internal:8080
    routes:
      - name: orders-v1
        paths:
          - /v1/orders
        methods:
          - GET
          - POST
        plugins:
          - name: jwt
            config:
              key_claim_name: kid
              claims_to_verify:
                - exp
                - nbf
              uri_param_names:
                - jwt
              cookie_names: []
              header_names:
                - authorization
              maximum_expiration: 3600
              run_on_preflight: false
          - name: rate-limiting
            config:
              minute: ~
              hour: 1000
              day: ~
              policy: redis
              redis_host: redis.internal
              redis_port: 6379
              redis_database: 0
              limit_by: credential
              fault_tolerant: true
              hide_client_headers: false
              error_code: 429
              error_message: "API rate limit exceeded"
          - name: request-transformer
            config:
              add:
                headers:
                  - "X-Consumer-ID:$(consumer.id)"
                  - "X-Forwarded-Service:orders-api"
              remove:
                headers:
                  - "X-Internal-Token"

Setting fault_tolerant: true on the rate-limiting plugin is important: if Redis goes down, Kong will let requests through rather than blocking all traffic. The alternative, failing closed, sounds safer but will cause a full outage if your rate-limit store becomes unavailable.
Authentication at the Gateway
The gateway should validate tokens, not just forward them. The distinction matters: validating at the gateway means an expired, forged, or malformed token never reaches your backend services, reducing their attack surface and eliminating redundant validation logic in each service.
For JWT validation, the gateway checks the signature, the expiry (exp), and the not-before (nbf) claim. It does not need to call an identity service for stateless JWTs, because signature verification against the known public key is sufficient. The trade-off is that a signature check alone cannot see revocation, so keep token lifetimes short or have the gateway consult a revocation list. Pass verified claims downstream as trusted headers (X-Consumer-ID, X-Consumer-Username) so services can use them without re-parsing the token, and strip any client-supplied copies of those headers at the edge.
For API key management, store a hash of the key (not the key itself) in the gateway's data store. The gateway hashes the incoming key and does a lookup. If found, it resolves the associated consumer ID and applies that consumer's rate limits and permissions.
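The hash-then-lookup flow might look like the following sketch. The in-memory `KEY_STORE` stands in for the gateway's data store, and `issue_key` and `resolve_consumer` are hypothetical helper names. Plain SHA-256 is used on the assumption that keys are long random strings; a keyed HMAC with a server-side secret is a common hardening step.

```python
import hashlib
import secrets
from typing import Optional

# Hypothetical stand-in for the gateway data store: sha256(key) -> consumer ID.
KEY_STORE: dict[str, str] = {}


def hash_key(api_key: str) -> str:
    """Hash the raw key; only this digest is ever persisted."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def issue_key(consumer_id: str) -> str:
    """Create a random key, store its hash, and return the raw key once."""
    api_key = secrets.token_urlsafe(32)
    KEY_STORE[hash_key(api_key)] = consumer_id
    return api_key


def resolve_consumer(api_key: str) -> Optional[str]:
    """Hash the incoming key and look up the consumer; None means unknown key."""
    return KEY_STORE.get(hash_key(api_key))
```

A leaked copy of the store then exposes only digests, which cannot be replayed as credentials.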
For OAuth2 token introspection, the gateway calls the authorization server's introspection endpoint. This adds a network hop and latency, so cache introspection results with a TTL shorter than the token's expiry — typically 60 seconds is a good balance between freshness and performance.
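A minimal sketch of that caching rule, assuming the introspection response carries the standard `active` flag and an optional `exp` timestamp. `IntrospectionCache` and the `introspect` callable are hypothetical; the callable stands in for the HTTP call to the authorization server.

```python
import time
from typing import Callable


class IntrospectionCache:
    """Cache introspection results for at most max_ttl, never past token expiry."""

    def __init__(self, introspect: Callable[[str], dict], max_ttl: float = 60.0,
                 clock=time.time):
        self.introspect = introspect      # hypothetical call to the auth server
        self.max_ttl = max_ttl
        self.clock = clock
        self._cache: dict[str, tuple[float, dict]] = {}

    def lookup(self, token: str) -> dict:
        now = self.clock()
        hit = self._cache.get(token)
        if hit and hit[0] > now:
            return hit[1]                 # still fresh: skip the network hop
        result = self.introspect(token)
        ttl = self.max_ttl
        if result.get("active") and "exp" in result:
            ttl = min(ttl, result["exp"] - now)   # do not outlive the token
        if ttl > 0:
            self._cache[token] = (now + ttl, result)
        return result
```

The TTL is also the worst-case delay before a revocation at the authorization server takes effect at the gateway, which is the main reason to keep it short.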
When to pass through to services: if a service handles its own user-level permissions (e.g., "can this user access this specific document?"), that authorization check belongs in the service, not the gateway. The gateway enforces: is this caller authenticated and allowed to call this API at all? The service enforces: is this caller allowed to perform this specific action on this specific resource?
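The split can be illustrated with two small checks. The scope string, the `owner_id` and `shared_with` fields, and both function names are assumptions made for this sketch, not part of any gateway or service API.

```python
def gateway_allows(claims: dict, required_scope: str) -> bool:
    """Gateway layer: may this authenticated caller use this API at all?"""
    granted = claims.get("scope", "").split()
    return required_scope in granted


def service_allows(consumer_id: str, document: dict) -> bool:
    """Service layer: may this caller act on this specific resource?"""
    return (
        consumer_id == document["owner_id"]
        or consumer_id in document.get("shared_with", [])
    )
```

The gateway check needs only the verified token, while the service check needs the resource itself, which is why it cannot move to the edge without the gateway learning the service's data model.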
Request and Response Transformation
The gateway is the natural home for API versioning. Rather than routing /v1/ and /v2/ to entirely different service deployments, you can often route both to the same service and use the gateway to transform requests:
# Route /v1/users/{id} to the users service, rewriting the path
# and injecting a version header so the service can behave accordingly
- name: users-v1
  paths:
    - "~/v1/users/(?P<id>[^/]+)$"
  strip_path: false
  plugins:
    - name: request-transformer
      config:
        replace:
          uri: "/users/$(uri_captures.id)"
        add:
          headers:
            - "X-API-Version:1"
- name: users-v2
  paths:
    - "~/v2/users/(?P<id>[^/]+)$"
  strip_path: false
  plugins:
    - name: request-transformer
      config:
        replace:
          uri: "/users/$(uri_captures.id)"
        add:
          headers:
            - "X-API-Version:2"

This keeps the service interface clean while letting the gateway manage the versioning contract with external consumers. When you deprecate v1, you update one gateway route, not a service.
Circuit Breaking and Timeouts at the Gateway
A slow backend is more dangerous than a failed one, because slow backends consume connection pool slots, queue space, and thread time — they cascade. The gateway is the right place to impose timeouts and circuit breaking because it sees all traffic to a service and can trip a circuit before the service takes down dependent callers.
Envoy expresses the timeout and retry policy on the route and the circuit breaker on the upstream cluster; the excerpt below shows both side by side:
# envoy-route-config.yaml
virtual_hosts:
  - name: orders-service
    domains:
      - "api.example.com"
    routes:
      - match:
          prefix: "/v1/orders"
        route:
          cluster: orders-cluster
          timeout: 5s
          retry_policy:
            retry_on: "5xx,reset,connect-failure,retriable-4xx"
            num_retries: 2
            per_try_timeout: 2s
            retry_host_predicate:
              - name: envoy.retry_host_predicates.previous_hosts
                # The v3 API identifies the extension by its typed_config.
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.retry.host.previous_hosts.v3.PreviousHostsPredicate
clusters:
  - name: orders-cluster
    connect_timeout: 1s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1000
          max_pending_requests: 500
          max_requests: 2000
          max_retries: 10
          track_remaining: true
    outlier_detection:
      consecutive_5xx: 5
      interval: 10s
      base_ejection_time: 30s
      max_ejection_percent: 50
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 100

max_ejection_percent: 50 is critical: it prevents Envoy from ejecting so many hosts that the remaining ones cannot carry the load. The outlier_detection block handles the gradual failure case: if a single host starts returning 5xx, it gets ejected temporarily without tripping the cluster-wide circuit breaker.
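Set per_try_timeout to be shorter than your overall route timeout. With a 5-second route timeout, a 2-second per-try timeout, and 2 retries, no single misbehaving attempt can consume more than 2 seconds of the budget. Three attempts at 2 seconds each would need 6 seconds, so against a consistently slow backend the route timeout cuts the final attempt short and the caller still gets an answer within 5 seconds.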
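The arithmetic behind the 5-second route timeout and 2-second per-try timeout can be worked through with a small helper. This is only a model of the worst case in which every attempt runs to its timeout; it ignores the backoff Envoy inserts between retries, and `attempt_budgets` is a hypothetical name.

```python
def attempt_budgets(route_timeout: float, per_try_timeout: float,
                    num_retries: int) -> list[float]:
    """Worst-case run time of each attempt when every attempt times out."""
    budgets = []
    remaining = route_timeout
    for _ in range(num_retries + 1):      # the first try plus each retry
        if remaining <= 0:
            break
        allowed = min(per_try_timeout, remaining)
        budgets.append(allowed)
        remaining -= allowed
    return budgets


print(attempt_budgets(5.0, 2.0, 2))   # [2.0, 2.0, 1.0]
print(attempt_budgets(5.0, 5.0, 2))   # [5.0]
```

The second call shows why the per-try value matters: when it equals the route timeout, one slow attempt uses the whole budget and the retries never run.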
Observability: What the Gateway Sees That Services Don't
The gateway has a unique observational position. It sees every request before routing, which means it captures two latency figures that are invisible from the service's perspective: total latency as measured at the gateway (which includes auth, rate limiting, and routing logic) and upstream latency (the time the backend service took to respond). The difference between these is your gateway tax, and it should be stable and measurable.
More importantly, the gateway sees requests that never reach a service: rate-limited requests, auth failures, requests to unknown routes. These are invisible to the service-level metrics, but they are rich signals. A spike in 401 responses from a single IP often signals a credential stuffing attempt. A surge in 429s from a single API key usually means a client has a retry loop with no backoff.
Here is a Python script for parsing gateway access logs to surface these patterns:
#!/usr/bin/env python3
"""
analyze_gateway_logs.py
Parses structured JSON gateway access logs to identify:
- Top rate-limited consumers
- Auth failure spikes by IP
- Upstream latency outliers by route
"""
import json
import sys
from collections import defaultdict
from statistics import mean, quantiles


def parse_logs(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank and malformed lines."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Ignore valid JSON that is not an object (a bare number or list).
            if isinstance(entry, dict):
                entries.append(entry)
    return entries


def analyze(entries: list[dict]) -> None:
    rate_limited: dict[str, int] = defaultdict(int)
    auth_failures: dict[str, int] = defaultdict(int)
    upstream_latencies: dict[str, list[float]] = defaultdict(list)
    gateway_overhead: list[float] = []

    for e in entries:
        status = e.get("status", 0)
        # Fall back to the client IP when the request was never tied to a consumer.
        consumer = e.get("consumer_id", e.get("client_ip", "unknown"))
        client_ip = e.get("client_ip", "unknown")
        route = e.get("route_name", "unknown")
        upstream_ms = e.get("upstream_latency_ms")
        gateway_ms = e.get("gateway_latency_ms")
        if status == 429:
            rate_limited[consumer] += 1
        if status == 401:
            auth_failures[client_ip] += 1
        if upstream_ms is not None and route != "unknown":
            upstream_latencies[route].append(float(upstream_ms))
        # gateway_latency_ms is treated as total time at the gateway, so
        # subtracting upstream time leaves the gateway's own overhead.
        if upstream_ms is not None and gateway_ms is not None:
            overhead = float(gateway_ms) - float(upstream_ms)
            if overhead >= 0:
                gateway_overhead.append(overhead)

    print("=== Top Rate-Limited Consumers ===")
    for consumer, count in sorted(rate_limited.items(), key=lambda x: -x[1])[:10]:
        print(f" {consumer}: {count} requests blocked")

    print("\n=== Auth Failure Spikes by IP (top 10) ===")
    for ip, count in sorted(auth_failures.items(), key=lambda x: -x[1])[:10]:
        flag = " *** INVESTIGATE" if count > 100 else ""
        print(f" {ip}: {count} 401s{flag}")

    print("\n=== Upstream Latency p50/p95/p99 by Route ===")
    for route, latencies in sorted(upstream_latencies.items()):
        # Percentiles from a handful of samples are noise; skip sparse routes.
        if len(latencies) < 10:
            continue
        qs = quantiles(latencies, n=100)
        print(
            f" {route}: p50={qs[49]:.1f}ms p95={qs[94]:.1f}ms p99={qs[98]:.1f}ms"
            f" (n={len(latencies)})"
        )

    # quantiles() raises StatisticsError with fewer than two data points.
    if len(gateway_overhead) >= 2:
        qs = quantiles(gateway_overhead, n=100)
        print("\n=== Gateway Processing Overhead ===")
        print(
            f" p50={qs[49]:.1f}ms p95={qs[94]:.1f}ms p99={qs[98]:.1f}ms"
            f" mean={mean(gateway_overhead):.1f}ms"
        )


if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kong/access.log"
    entries = parse_logs(log_path)
    print(f"Parsed {len(entries)} log entries\n")
    analyze(entries)

Run this against an hour of logs after any incident. The auth failure distribution by IP and the rate-limit hit counts by consumer frequently tell a cleaner story than service-level traces.
Choosing a Gateway
Kong is plugin-heavy by design. Its declarative configuration model (shown above) is mature, and its plugin ecosystem covers nearly every cross-cutting concern you can name. The operational complexity scales with your plugin count: every enabled plugin adds latency and a potential failure mode. Kong is the right choice when you need rapid feature iteration on gateway behavior and the team is comfortable managing a Postgres-backed control plane, or running Kong DB-less from declarative files like the one above.
Envoy / Contour is the Kubernetes-native option. Envoy's configuration model is verbose but extraordinarily precise — the circuit breaking and outlier detection configuration above is representative. Contour wraps Envoy with a Kubernetes-native CRD layer (HTTPProxy) that makes common routing patterns manageable without writing raw Envoy xDS config. Choose this when you are already deep in Kubernetes and want your gateway configuration to live alongside your workload manifests.
AWS API Gateway eliminates the infrastructure management problem entirely. It scales to zero, handles TLS, and integrates natively with Lambda, Cognito, and IAM. The trade-offs are real: cold starts in the Lambda functions behind it affect latency, the request limits (10 MB payload, 29-second default integration timeout) constrain what you can route through it, and complex routing logic quickly becomes difficult to test locally. It is the right choice for serverless workloads and teams that cannot justify operating gateway infrastructure.
The selection framework: if you are on Kubernetes and want infrastructure-as-code, use Envoy/Contour. If you need rapid plugin iteration and a rich ecosystem, use Kong. If you are building serverless and want managed infrastructure, use AWS API Gateway.
What all three share: they will tell you more about the health of your API surface than any individual service can. A gateway that is configured well, observed properly, and given authority over cross-cutting concerns will prevent more incidents than it causes — and the incidents it prevents are the hardest kind to debug after the fact.
*Zak Hassan is a Staff SRE specializing in distributed systems reliability, API infrastructure, and observability. Find him at zakhassan.com or on LinkedIn.*