When most SREs think about their observability data, they think about it in silos: logs in CloudWatch or Splunk, metrics in Prometheus or Datadog, traces in Jaeger or Tempo. You query each tool separately, mentally join the results, and piece together a picture of what happened.
This works. Mostly. Until the incident is complex enough that the picture you need spans all three data types, goes back 30 days, involves multiple services, and you're trying to do it at 2am.
There's a better architecture for how you store and query operational data, and it's been quietly maturing under the radar while SREs were busy watching the AI wave. Apache Iceberg on S3, queried with Amazon Athena, is becoming the foundation for how serious teams handle observability data at scale.
Here's why I'm bullish on it, and how we actually use it.
Why Iceberg for Observability Data?
Apache Iceberg is an open table format for huge analytic datasets. If that sounds like a data engineering problem, not an SRE problem, stay with me.
The traditional pain with storing observability data long-term is the query problem. Storing logs in S3 is cheap. Querying them efficiently is hard. Raw S3 with prefix partitioning gets you partway there, but you end up with queries that scan far more data than they should, poorly estimated costs, and no schema evolution as your log format changes.
Iceberg solves these problems for observability data:
Partition pruning that actually works. Iceberg maintains a metadata tree that knows exactly which files contain data for a given time range and service. A query for "logs from the payment service between 14:00 and 15:00 yesterday" skips 99% of your data lake and only reads the relevant files. This is not how raw S3 partitioning works.
Schema evolution without rewriting data. Log schemas change. A new field gets added, a field gets renamed, a service starts emitting a new structured key. With raw Parquet on S3, schema changes are painful. With Iceberg, you add a column and old data just shows nulls for that field. Zero rewrites.
Time travel. Iceberg tables maintain historical snapshots. You can query "what did sample error logs look like on March 15th at 11:47" against the exact snapshot from that moment, even if the table has been compacted since. For incident retrospectives and compliance audits, this is invaluable.
ACID transactions. When multiple services are writing observability data simultaneously, you need guarantees about consistency. Iceberg provides this. Concurrent writers don't corrupt the table.
The Architecture We Use
Here is a production-style lab setup:
Services (multiple, multiple regions)
↓
Kinesis Firehose (with transformation Lambda)
↓
S3 (Iceberg table format, partitioned by service/date/hour)
↓
AWS Glue Data Catalog (table metadata)
↓
Amazon Athena (queries) ←── Ad hoc investigation
↓
CloudWatch Dashboards ←── Automated reports from scheduled queries
↓
Claude LLM Agent ←── AI-powered analysisThe key is the Kinesis Firehose → S3 Iceberg pipeline. Firehose buffers incoming log and metric data, applies a transformation Lambda that normalizes schema and enriches with metadata (account ID, region, environment), and writes Iceberg-formatted Parquet files to S3.
The Glue Data Catalog registers the Iceberg table metadata, which is what allows Athena to discover and query it without knowing the physical file layout.
Sample Athena Query for Incident Investigation
-- Error rate by service in the last 2 hours, including P95 latency
SELECT
service_name,
COUNT(*) as total_requests,
SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) as error_count,
ROUND(100.0 * SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) / COUNT(*), 2) as error_rate_pct,
APPROX_PERCENTILE(duration_ms, 0.95) as p95_latency_ms
FROM prod_logs.application_events
WHERE event_time >= NOW() - INTERVAL '2' HOUR
AND environment = 'production'
GROUP BY service_name
ORDER BY error_rate_pct DESC;This query scans only the relevant Iceberg partitions — not the entire table. On a table with 6 months of data across 50 services, the difference is night and day.
The AI Angle: LLMs Over Iceberg Tables
This is where it gets interesting for 2026. Once your observability data lives in a queryable, structured format like Iceberg, you can give an LLM agent the ability to write and execute SQL against it.
Teams have a Claude-powered agent with a run_athena_query tool. The agent receives an incident description, generates SQL to gather the relevant data, interprets the results, generates follow-up queries based on what it finds, and produces a synthesis.
The key insight: LLMs are quite good at writing SQL, and Iceberg tables with good schemas are an ideal surface for LLM-driven investigation. The alternative — giving an LLM raw log files to scan — is expensive, slow, and less accurate.
Here's the tool definition in the agent:
tools = [
{
"name": "run_observability_query",
"description": "Execute a SQL query against the observability data lake to investigate incidents. Tables available: prod_logs.application_events, prod_logs.infrastructure_metrics, prod_logs.deployment_events, prod_logs.dependency_health",
"input_schema": {
"type": "object",
"properties": {
"sql": {
"type": "string",
"description": "The SQL query to execute against Athena"
},
"reasoning": {
"type": "string",
"description": "Why you're running this query and what you expect to find"
}
},
"required": ["sql", "reasoning"]
}
}
]Requiring the agent to provide reasoning for each query is a pattern I strongly recommend. It makes the agent's investigation legible to humans reviewing the output, and it tends to produce better queries because the model has to articulate intent before writing SQL.
AWS Iceberg Updates You Should Know (2025-2026)
AWS has been aggressive about Iceberg support:
- Amazon Redshift now writes directly to Iceberg tables. This means your analytical workloads and your operational data can live in the same format, queryable by the same tools.
- S3 Tables went GA — native Iceberg table management with automatic compaction and snapshot management. Less operational overhead than managing it yourself.
- Iceberg V3 support landed in EMR, Glue, and Athena. V3 brings deletion vectors, which allow updates and deletes without expensive file rewrites — critical for GDPR compliance and incident data correction workflows.
- AWS Glue Data Catalog can now federate queries to remote Iceberg catalogs. If you're mixing AWS and non-AWS storage, you can still query everything through a unified catalog.
Getting Started
If you're starting from scratch:
- Enable S3 Tables in your account and create an Iceberg namespace
- Set up Kinesis Firehose with a Parquet/Iceberg output format (available natively since re:Invent 2025)
- Register the table in Glue Data Catalog
- Run your first Athena query
The full setup is documented in AWS's Iceberg getting started guide and takes an afternoon to get a basic pipeline working. The investment pays back the first time you run a complex cross-service incident investigation and get sub-10-second query results instead of waiting for a CloudWatch Insights query to time out.
*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation and data platform reliability. Find him at zakhassan.com or on LinkedIn.*
Topic Paths