API Gateway Reliability: Rate Limiting, Auth, and the Patterns That Actually Scale

The API gateway is the front door of your platform. It's the layer that authenticates every incoming request, enforces rate limits, routes to the right backend, and — if designed correctly — protects your services from the failure modes that would otherwise reach them directly. Getting the gateway layer right is a reliability multiplier for everything behind it.

This is the production guide: what the gateway should handle, how to configure it reliably, and the patterns that protect your platform at scale.

What the Gateway Layer Should Own

The question of "what belongs at the gateway vs. in the service" is foundational. Get it wrong and you have either a bloated gateway that's a single point of failure for too much logic, or anemic services that have to re-implement cross-cutting concerns themselves.

Gateway-owned responsibilities:

Authentication (JWT validation, API key verification, OAuth token introspection)
Rate limiting and quota enforcement
Request routing (path-based, header-based, weighted)
TLS termination
Request/response logging for audit purposes
DDoS protection and IP-based blocking

Service-owned responsibilities:

Authorization (does this authenticated user have permission to this resource?)
Business logic validation
Domain-specific rate limiting (user can create max 100 items/day)
Application-level error handling

The dividing line: authentication (who are you?) belongs at the gateway; authorization (what can you do?) belongs in the service. The service has the domain context to make authorization decisions; the gateway does not.
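The split can be sketched from the service's side. This is a minimal illustration, assuming the gateway has already authenticated the caller and passed along a user ID, org ID, and roles; the `Identity` and `Order` shapes and the `admin` role name are hypothetical, not part of any particular gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Who the caller is, as established by the gateway (hypothetical shape)."""
    user_id: str
    org_id: str
    roles: frozenset


@dataclass(frozen=True)
class Order:
    """Domain object that only the service knows about (hypothetical)."""
    order_id: str
    owner_id: str
    org_id: str


def can_read_order(identity: Identity, order: Order) -> bool:
    """Authorization: a domain decision the gateway has no context for."""
    if identity.org_id != order.org_id:
        return False  # never cross tenant boundaries
    if "admin" in identity.roles:
        return True  # org admins may read any order in their own org
    return identity.user_id == order.owner_id


order = Order("o-1", owner_id="u-1", org_id="acme")
owner = Identity("u-1", "acme", frozenset())
print(can_read_order(owner, order))
```

The gateway could not make this decision without loading the order, which is exactly the domain coupling it should avoid.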

Rate Limiting That Actually Protects Your Services

Naive fixed-window rate limiting (100 requests/minute per API key) fails in predictable ways. A client that sends 100 requests in the first second of each minute passes the rate limiter, but for that second the backend sees the equivalent of 6,000 requests/minute. Token bucket caps bursts at the bucket's capacity while enforcing the average rate, sliding window removes the spike at window boundaries, and leaky bucket smooths traffic to a constant outflow.
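A related weakness of fixed windows shows up at the window boundary, and a small in-memory comparison makes it concrete. The two classes below are illustrative sketches (not a production limiter, and not tied to any gateway product); timestamps are passed in explicitly so the behavior is deterministic.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per window; the counter resets at each boundary."""

    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.window_id, self.count = None, 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window_s)
        if window_id != self.window_id:
            self.window_id, self.count = window_id, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLimiter:
    """Allows at most `limit` requests in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


def count_allowed(limiter, times) -> int:
    return sum(limiter.allow(t) for t in times)


# 100 requests just before the minute boundary and 100 just after it.
burst = [59.0 + i / 100 for i in range(100)] + [60.0 + i / 100 for i in range(100)]
fixed_allowed = count_allowed(FixedWindowLimiter(100, 60), burst)
sliding_allowed = count_allowed(SlidingWindowLimiter(100, 60), burst)
print(fixed_allowed, sliding_allowed)
```

The fixed window admits both halves of the burst because each half falls in a different window; the sliding window admits only the first half.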

Token bucket for AWS API Gateway:

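# Lambda authorizer with token bucket rate limiting using ElastiCache
import time

redis_client = get_redis_client()  # ElastiCache Redis; helper assumed to be defined elsewhere

# Refill and consume inside one Lua script so concurrent requests cannot
# race between reading the bucket and writing it back.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity  -- new key: start with a full bucket
local last = tonumber(state[2]) or now
tokens = math.min(capacity, tokens + math.max(0, now - last) * refill_per_sec)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
redis.call('EXPIRE', KEYS[1], 120)  -- an idle bucket would be full again anyway
return allowed
"""
token_bucket = redis_client.register_script(TOKEN_BUCKET_LUA)

def rate_limit_check(api_key: str, limit_per_minute: int) -> bool:
    """
    Token bucket: replenish tokens over time, consume one per request.
    Returns True if request should proceed, False if rate limited.
    """
    bucket_key = f"rate_limit:{api_key}"
    # Capacity equals the per-minute limit here; pass a smaller capacity
    # to cap how large a single burst can be.
    allowed = token_bucket(
        keys=[bucket_key],
        args=[limit_per_minute, limit_per_minute / 60.0, time.time()],
    )
    return allowed == 1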

def lambda_handler(event, context):
    api_key = event['headers'].get('x-api-key', '')
    
    if not api_key:
        return generate_deny_policy('unauthorized', event['methodArn'])
    
    # Validate API key and get associated limit
    key_config = validate_api_key(api_key)
    if not key_config:
        return generate_deny_policy('invalid_key', event['methodArn'])
    
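    # Check rate limit (only runs on every request if authorizer result caching is disabled)
    if not rate_limit_check(api_key, key_config['requests_per_minute']):
        # A Lambda authorizer cannot return 429 itself: an exception other than
        # 'Unauthorized' surfaces as a 500. Deny instead (403 by default) and
        # customize the ACCESS_DENIED gateway response if clients need a 429.
        return generate_deny_policy('rate_limited', event['methodArn'])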
    
    return generate_allow_policy(key_config['principal_id'], event['methodArn'])

Tiered rate limits by customer plan. Free tier customers get 60 requests/minute; paid customers get 1,000; enterprise customers get custom limits. Store the limit in your API key metadata store and look it up during the auth/rate-limit check.
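The lookup can be sketched as a small function over the key's metadata record. The free and paid numbers come from the paragraph above; the metadata field names (`plan`, `custom_requests_per_minute`) and the choice to treat unknown plans as free tier are assumptions for illustration.

```python
# Requests per minute for the fixed tiers described above.
PLAN_LIMITS = {"free": 60, "paid": 1000}


def resolve_limit(key_metadata: dict) -> int:
    """Return the requests-per-minute limit for an API key's metadata record."""
    plan = key_metadata.get("plan", "free")
    if plan == "enterprise":
        # Enterprise limits are negotiated per customer and stored on the key.
        custom = key_metadata.get("custom_requests_per_minute")
        if custom is None:
            raise ValueError("enterprise key is missing its custom limit")
        return int(custom)
    # Unknown plans fall back to the most restrictive tier (an assumption).
    return PLAN_LIMITS.get(plan, PLAN_LIMITS["free"])


print(resolve_limit({"plan": "paid"}))
```

Failing loudly on an enterprise key without a limit avoids silently giving the customer either unlimited or free-tier access.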

Rate limiting by endpoint, not just by key. Some endpoints are more expensive than others. A search endpoint that does full-text queries should have a lower rate limit than a simple GET-by-ID endpoint. Apply endpoint-specific limits in addition to overall key limits.
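One way to layer the two limits is to give each request a list of buckets it must pass. The sketch below only builds that list; the `ENDPOINT_LIMITS` table, its values, and the key naming scheme are hypothetical, and each bucket would then be checked with whatever limiter you already run.

```python
from typing import List, Tuple

# Hypothetical per-endpoint limits (requests/minute) for expensive routes.
ENDPOINT_LIMITS = {"GET /search": 30}


def buckets_for(api_key: str, method: str, route: str, key_limit: int) -> List[Tuple[str, int]]:
    """Return (bucket_key, limit) pairs; the request must pass every bucket."""
    buckets = [(f"rate_limit:{api_key}", key_limit)]
    endpoint = f"{method} {route}"
    endpoint_limit = ENDPOINT_LIMITS.get(endpoint)
    if endpoint_limit is not None:
        # An endpoint limit never exceeds the key's overall limit.
        limit = min(endpoint_limit, key_limit)
        buckets.append((f"rate_limit:{api_key}:{endpoint}", limit))
    return buckets


print(buckets_for("abc123", "GET", "/search", 1000))
```

If the buckets are checked in order with short-circuiting, a request rejected by the endpoint bucket has already consumed a token from the key bucket. That errs on the conservative side; checking the endpoint bucket first is the alternative.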

Circuit Breaking at the Gateway

The gateway can implement circuit breaking to protect services from cascading failures. When a backend is returning elevated error rates, the circuit breaker stops forwarding requests to it rather than letting them queue and time out.
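The state machine behind a circuit breaker is small. The sketch below uses consecutive failures as a simple stand-in for an error-rate threshold; the class name, thresholds, and injectable clock are illustrative choices rather than any gateway's actual implementation.

```python
import time


class CircuitBreaker:
    """closed -> open after repeated failures -> half_open after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout_s:
                # Cool-down elapsed: let trial traffic through to probe the backend.
                self.state = "half_open"
                return True
            return False  # fail fast instead of queueing on a sick backend
        return True  # closed or half_open

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


breaker = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

The caller wraps each backend call: check `allow_request()`, forward the request if allowed, then report the outcome with `record_success()` or `record_failure()`.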

AWS API Gateway's built-in integrations don't provide circuit breaking. The pattern requires custom logic in a Lambda authorizer or an intermediary that supports it (Kong, Envoy as a gateway sidecar, or a service mesh such as AWS App Mesh).

With Kong or similar:

# Kong plugin configuration approximating a circuit breaker
plugins:
- name: request-termination
  config:
    status_code: 503
    message: "Service temporarily unavailable"
  enabled: false  # Toggle via Admin API when circuit is open

- name: proxy-cache
  config:
    strategy: memory  # Required by the plugin: where cache entries are stored
    response_code: [200, 201]
    request_method: ["GET"]
    cache_ttl: 30  # Cached GET responses keep being served for up to 30s if the backend degrades

A more sophisticated pattern: proxy to a fallback response or a degraded-mode backend when the primary is circuit-tripped, rather than returning 503. The gateway that returns cached-but-stale data during a backend outage is more reliable than the gateway that returns errors.
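The stale-fallback idea can be sketched independently of any gateway product. In this minimal version the cache is a plain dict, `fetch` is any callable that hits the backend, and the function name and status labels are made up for illustration.

```python
import time
from typing import Callable, Dict, Tuple


def get_with_stale_fallback(key: str, fetch: Callable[[], str],
                            cache: Dict[str, Tuple[float, str]],
                            fresh_ttl_s: float = 30.0,
                            clock=time.monotonic) -> Tuple[str, str]:
    """Return (body, status) where status is 'fresh', 'cached', or 'stale'."""
    now = clock()
    entry = cache.get(key)
    if entry is not None and now - entry[0] < fresh_ttl_s:
        return entry[1], "cached"  # still within TTL: skip the backend
    try:
        body = fetch()
    except Exception:
        if entry is not None:
            # Backend is failing: an old answer beats an error.
            return entry[1], "stale"
        raise  # nothing cached, so the failure has to surface
    cache[key] = (now, body)
    return body, "fresh"


cache: Dict[str, Tuple[float, str]] = {}
print(get_with_stale_fallback("/v2/orders/42", lambda: '{"id": 42}', cache))
```

Returning the status alongside the body lets the gateway tag stale responses (for example with a response header) so clients can tell degraded data from live data.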

JWT Validation at the Gateway

Validating JWTs at the gateway — rather than in each individual service — eliminates duplicated auth code and ensures consistency. Every service behind the gateway can trust that requests have been authenticated.

The validation steps that must happen at the gateway:

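import json
from functools import lru_cache

import jwt
import requests

class AuthError(Exception):
    """Raised when a token fails validation."""

@lru_cache(maxsize=1)  # Cache JWKS; cleared below when an unknown kid appears
def get_jwks() -> dict:
    """Fetch public keys from your IdP's JWKS endpoint."""
    response = requests.get(
        'https://your-auth-provider.com/.well-known/jwks.json',
        timeout=5
    )
    response.raise_for_status()
    return response.json()

def find_key_by_kid(jwks: dict, kid: str):
    """Return the RSA public key whose kid matches, or None."""
    for key in jwks.get('keys', []):
        if key.get('kid') == kid:
            return jwt.algorithms.RSAAlgorithm.from_jwk(json.dumps(key))
    return None

def validate_jwt(token: str) -> dict:
    """
    Validate JWT signature, expiry, audience, and issuer.
    Returns claims if valid, raises if invalid.
    """
    try:
        # 1. Decode header to find key ID
        header = jwt.get_unverified_header(token)
        kid = header.get('kid')

        # 2. Find the public key; on a miss, refetch once in case keys rotated
        public_key = find_key_by_kid(get_jwks(), kid)
        if public_key is None:
            get_jwks.cache_clear()
            public_key = find_key_by_kid(get_jwks(), kid)
        if public_key is None:
            raise AuthError("No signing key matches the token's kid")

        # 3. Validate signature, expiry, issuer, audience
        claims = jwt.decode(
            token,
            public_key,
            algorithms=['RS256'],  # Pin the algorithm; never trust the token's alg header
            audience='your-api-audience',
            issuer='https://your-auth-provider.com'
        )

        return claims

    except AuthError:
        raise
    except jwt.ExpiredSignatureError:
        raise AuthError("Token expired")
    except jwt.InvalidAudienceError:
        raise AuthError("Token not intended for this API")
    except jwt.InvalidIssuerError:
        raise AuthError("Token from untrusted issuer")
    except Exception as e:
        raise AuthError(f"Token validation failed: {e}")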

Forward validated claims to services. After JWT validation, forward the claims (user ID, roles, org ID) as request headers to the backend:

# Add to API Gateway integration request mapping
context.requestOverride.header['X-User-Id'] = "$context.authorizer.userId"
context.requestOverride.header['X-User-Roles'] = "$context.authorizer.roles"
context.requestOverride.header['X-Org-Id'] = "$context.authorizer.orgId"

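Services receive pre-validated identity information without doing JWT validation themselves. This only holds if the gateway overwrites any client-supplied copies of these headers and services accept traffic solely from the gateway; otherwise a caller can forge an identity by setting X-User-Id directly.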
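On the receiving end, a service might turn those headers into a typed identity. This sketch assumes the three header names shown above and, as a further assumption, that roles arrive as a comma-separated string; the `CallerIdentity` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class CallerIdentity:
    user_id: str
    org_id: str
    roles: Tuple[str, ...]


def identity_from_headers(headers: Mapping[str, str]) -> CallerIdentity:
    """Build the caller's identity from gateway-set headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_id = normalized.get("x-user-id", "").strip()
    if not user_id:
        # No identity means the request did not pass through the gateway's auth.
        raise PermissionError("missing X-User-Id header")
    raw_roles = normalized.get("x-user-roles", "")
    roles = tuple(role.strip() for role in raw_roles.split(",") if role.strip())
    return CallerIdentity(user_id, normalized.get("x-org-id", "").strip(), roles)


print(identity_from_headers({"X-User-Id": "u-1", "X-User-Roles": "admin, billing", "X-Org-Id": "acme"}))
```

Rejecting requests with no user ID gives the service a cheap safety net if traffic ever reaches it without passing through the gateway.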

API Versioning as a Reliability Practice

API versioning isn't just a developer experience concern — it's a reliability practice. When you need to make breaking changes to an API, versioning gives you the ability to migrate consumers incrementally rather than forcing simultaneous cutover.

URL versioning (/v1/orders, /v2/orders) is the most common approach and the most gateway-friendly: routing rules are simple path prefixes.

The gateway routing pattern for versioned APIs:

# API Gateway route configuration
routes:
  - path: /v1/**
    backend: orders-service-v1
    timeout: 5s
    
  - path: /v2/**
    backend: orders-service-v2
    timeout: 5s
    
  - path: /orders/**  # Unversioned → redirect to latest stable
    redirect: /v2/orders/**
    status_code: 307  # Temporary and method-preserving; a cached 301 would pin clients to v2
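The routing table above reduces to a prefix match. The function below is an illustrative sketch of that decision, reusing the backend names from the configuration; the return shape and the `resolve` name are assumptions.

```python
from typing import Optional, Tuple

# Version prefix -> backend, mirroring the route configuration.
ROUTES = [("/v1/", "orders-service-v1"), ("/v2/", "orders-service-v2")]
LATEST_PREFIX = "/v2"


def resolve(path: str) -> Tuple[str, Optional[str]]:
    """Return ('proxy', backend), ('redirect', new_path), or ('not_found', None)."""
    for prefix, backend in ROUTES:
        if path.startswith(prefix):
            return ("proxy", backend)
    if path == "/orders" or path.startswith("/orders/"):
        # Unversioned paths are sent to the latest stable version.
        return ("redirect", LATEST_PREFIX + path)
    return ("not_found", None)


print(resolve("/v1/orders/42"))
print(resolve("/orders/42"))
```

Because versions are plain prefixes, adding v3 is one new entry in `ROUTES` plus a change to `LATEST_PREFIX`, with no effect on v1 or v2 traffic.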

Deprecation as a process: When deprecating v1, add a Deprecation response header and a Sunset header (RFC 8594) carrying the sunset date. Log v1 usage by consumer so you can identify which consumers haven't migrated. Set a hard sunset date and enforce it: a deprecated API that never actually sunsets trains consumers to ignore deprecation warnings.
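The header and enforcement logic can be sketched as two small functions. The header formats follow my reading of RFC 9745 (Deprecation as an `@`-prefixed Unix timestamp) and RFC 8594 (Sunset as an HTTP date), so verify them against the specs before relying on this; the function names and the example dates are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(deprecated_at: datetime, sunset_at: datetime) -> dict:
    """Headers to attach to every response from a deprecated API version."""
    return {
        # When the version became deprecated, as a Unix timestamp.
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # When the version stops responding, as an HTTP date in GMT.
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
    }


def is_past_sunset(now: datetime, sunset_at: datetime) -> bool:
    """True once the hard sunset date has arrived; the gateway can then return 410 Gone."""
    return now >= sunset_at


deprecated = datetime(2025, 7, 1, tzinfo=timezone.utc)
sunset = datetime(2025, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
print(deprecation_headers(deprecated, sunset))
```

Enforcing the sunset in the gateway, rather than by removing the backend, keeps the failure explicit: consumers get a clear 410 instead of a routing error.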

Observability at the Gateway Layer

The gateway sees every API request. This makes it an ideal place to capture API usage metrics that individual services can't:

Request rate by endpoint, by consumer, by region
Error rate by endpoint and error type (auth failure vs. rate limit vs. backend error)
Latency by endpoint (P50, P95, P99)
Traffic by consumer (which API keys generate the most traffic?)

These metrics inform both reliability (is error rate elevated for a specific endpoint?) and business decisions (which API consumers are power users? which endpoints are most critical?).
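The metrics listed above can be derived from access-log records. The sketch below assumes each record is a dict with `endpoint`, `status`, and `latency_ms` fields (a hypothetical log shape), counts 5xx responses as errors, and uses a nearest-rank percentile.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def summarize(records: Iterable[dict]) -> Dict[str, dict]:
    """Per-endpoint request count, 5xx error rate, and latency percentiles."""
    by_endpoint = defaultdict(list)
    for record in records:
        by_endpoint[record["endpoint"]].append(record)

    summary = {}
    for endpoint, rows in by_endpoint.items():
        latencies = sorted(row["latency_ms"] for row in rows)
        errors = sum(1 for row in rows if row["status"] >= 500)
        summary[endpoint] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
        }
    return summary


logs = [
    {"endpoint": "/search", "status": 200, "latency_ms": 10},
    {"endpoint": "/search", "status": 200, "latency_ms": 20},
    {"endpoint": "/search", "status": 500, "latency_ms": 30},
    {"endpoint": "/search", "status": 200, "latency_ms": 40},
]
print(summarize(logs))
```

Grouping by consumer or region instead of endpoint is the same loop with a different key.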

Emit these from your Lambda authorizer or from your gateway's access logging:

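# Lambda authorizer: emit gateway metrics
import boto3

# Create the client once per execution environment, not on every invocation
cw = boto3.client('cloudwatch')

def emit_gateway_metrics(api_key: str, endpoint: str, result: str, latency_ms: float):
    # Each unique dimension combination is billed as its own custom metric,
    # so keep the cardinality of these values bounded.
    dimensions = [
        {'Name': 'Endpoint', 'Value': endpoint},
        {'Name': 'ApiKey', 'Value': api_key[:8]},  # Partial key for privacy
        {'Name': 'Result', 'Value': result}        # 'allowed', 'rate_limited', 'unauthorized'
    ]
    cw.put_metric_data(
        Namespace='APIGateway/Custom',
        MetricData=[
            {
                'MetricName': 'RequestCount',
                'Dimensions': dimensions,
                'Value': 1,
                'Unit': 'Count'
            },
            {
                'MetricName': 'Latency',
                'Dimensions': dimensions,
                'Value': latency_ms,
                'Unit': 'Milliseconds'
            }
        ]
    )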

The gateway metric layer is the foundation for understanding your API's usage patterns — and for detecting abuse (a consumer suddenly generating 100x their normal traffic) before it becomes a reliability incident.
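The abuse check described above can be sketched as a comparison between current and baseline request counts per consumer. The 100x factor comes from the text; the function name, the input shape, and the `min_requests` floor (to ignore tiny consumers whose traffic jumps from 1 to 100) are assumptions.

```python
from typing import Dict, List


def find_traffic_spikes(current: Dict[str, int], baseline: Dict[str, int],
                        factor: float = 100.0, min_requests: int = 1000) -> List[str]:
    """Return consumers whose current traffic is at least `factor` times their baseline."""
    spikes = []
    for consumer, count in current.items():
        # Treat a missing or zero baseline as 1 to avoid dividing the world by zero.
        normal = max(baseline.get(consumer, 0), 1)
        if count >= min_requests and count >= factor * normal:
            spikes.append(consumer)
    return sorted(spikes)


current_hour = {"key-a": 50_000, "key-b": 1_200, "key-c": 900}
typical_hour = {"key-a": 400, "key-b": 1_000, "key-c": 1}
print(find_traffic_spikes(current_hour, typical_hour))
```

Running this on a short interval against a rolling baseline turns the gateway's metrics into an early warning, ahead of the backend alarms that fire once the damage is done.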

*Zak Hassan is a Staff SRE specializing in distributed systems reliability, API platform engineering, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*
