AWS re:Invent 2025: The Announcements That Actually Matter for SREs

re:Invent is a firehose. AWS announces hundreds of services, features, and previews across five days and dozens of keynotes. Most of it is noise for any individual team; a small fraction is signal that changes how you operate production systems. Here's my annotated list of the 2025 announcements with real operational relevance for SRE and platform teams, six months after the event when it's clearer which things have actually landed in production and which were vapor.

The Ones That Landed

Amazon Data Firehose (formerly Kinesis Data Firehose) Native Iceberg Support (GA)

This is the change that simplifies the observability data lake architecture significantly. Previously, writing structured data from Kinesis to Iceberg-formatted Parquet on S3 required an intermediate Lambda for schema enforcement and format conversion. With native Iceberg support, Firehose handles the format directly.

What changed operationally: The transformation Lambda is gone from the critical path. One fewer component to maintain, one fewer failure mode, simpler debugging. For teams building the Iceberg+Athena observability stack, this removes the highest-friction piece of the setup.

Production status: In production for a lab log ingestion pipeline since February. Reliable, no operational surprises. The setup time for a new Iceberg sink went from an afternoon to about 30 minutes.
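To give a sense of what the 30-minute setup involves, here is a minimal sketch of the request for a Firehose stream with an Iceberg destination. The field names follow my reading of the Firehose CreateDeliveryStream API and should be checked against the current boto3 documentation. The function name, stream name, ARNs, and buffering values are hypothetical placeholders, not the lab's actual configuration.

```python
from typing import Any


def build_iceberg_stream_request(
    stream_name: str,
    role_arn: str,
    catalog_arn: str,
    database: str,
    table: str,
    error_bucket_arn: str,
) -> dict[str, Any]:
    """Build keyword arguments for a Firehose CreateDeliveryStream call.

    All values passed in are placeholders supplied by the caller; the
    buffering hints below are illustrative assumptions, not recommendations.
    """
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",
        "IcebergDestinationConfiguration": {
            "RoleARN": role_arn,
            # The Glue catalog that holds the Iceberg table metadata.
            "CatalogConfiguration": {"CatalogARN": catalog_arn},
            "DestinationTableConfigurationList": [
                {
                    "DestinationDatabaseName": database,
                    "DestinationTableName": table,
                }
            ],
            "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},
            # Bucket used for records that fail delivery.
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": error_bucket_arn},
        },
    }


# With boto3 installed and credentials configured, the request would be sent as:
#   boto3.client("firehose").create_delivery_stream(**request)
```

Keeping the request as a plain dictionary makes it easy to diff between environments and to unit test without touching AWS.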

Amazon S3 Tables (GA)

S3 Tables is AWS's managed Iceberg table service. Rather than managing Iceberg metadata yourself (or using Glue Data Catalog), S3 Tables provides a fully managed namespace with automatic compaction, automatic snapshot cleanup, and native integration with Athena and Redshift.

What changed operationally: Compaction — the process of merging small Iceberg files into larger ones for query efficiency — used to be a scheduled job you ran yourself (or forgot to run, and your Athena queries got progressively slower). S3 Tables handles this automatically. Snapshot management (old snapshots accumulating and consuming storage) is also automatic.

Production status: Using S3 Tables for new pipeline outputs. Existing Iceberg tables migrated incrementally. The operational overhead reduction is real — no more compaction cron jobs.
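For readers who have never run compaction by hand, the sketch below shows the core idea of what those cron jobs did: group many small files into a few rewrite batches near a target size. It is a simplified greedy planner for illustration only, not how S3 Tables works internally. The function name and the 512 MB target are assumptions.

```python
def plan_compaction(
    file_sizes_mb: list[float], target_mb: float = 512.0
) -> list[list[float]]:
    """Greedily group files into batches whose total stays at or below target_mb.

    Each batch would be rewritten as one larger file. A file that is already
    larger than the target ends up in a batch of its own.
    """
    batches: list[list[float]] = []
    current: list[float] = []
    current_total = 0.0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current batch when the next file would push it over target.
        if current and current_total + size > target_mb:
            batches.append(current)
            current, current_total = [], 0.0
        current.append(size)
        current_total += size
    if current:
        batches.append(current)
    return batches


# 1,000 files of 5 MB each collapse into 10 output files.
print(len(plan_compaction([5.0] * 1000)))  # prints 10
```

The query-side effect is the point: Athena opens 10 files instead of 1,000 for the same data.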

AWS DevOps Agent (GA) — Already Covered Separately

Covered in depth in an earlier post. The short version: impressive for AWS-native stacks, limited value if your observability is Datadog or Grafana.

Amazon Bedrock AgentCore (GA)

Also covered in depth separately. The managed agent runtime is genuinely useful for teams that don't want to build and operate their own agent infrastructure.

EC2 Auto Scaling Predictive Scaling Improvements

Predictive scaling (using ML to pre-scale before traffic arrives) was significantly improved. The model now incorporates more signals: not just historical traffic patterns but deployment events, marketing calendar, and custom business metrics you can inject.

What changed operationally: For teams with predictable traffic patterns (business hours peaks, weekly cycles), predictive scaling now works noticeably better. The false positive rate (pre-scaling when you didn't need to) is lower, and the lead time accuracy (adding the right amount of capacity the right number of minutes ahead of the traffic) improved.

Lab status: Tested against a production-style API fleet model. The lab has since modeled two significant traffic events, and both were absorbed without the reactive scaling scramble teams often see.
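If you want to run a similar evaluation, a reasonable starting point is a predictive scaling policy in forecast-only mode, which lets you compare forecasts against real traffic before the policy is allowed to change capacity. The sketch below builds the arguments for the Auto Scaling PutScalingPolicy call using fields I know exist in that API. It does not cover the newer signals mentioned above (deployment events, calendars, custom business metrics). The function name, policy name, and default values are assumptions.

```python
from typing import Any


def build_predictive_policy(
    asg_name: str,
    target_cpu: float = 50.0,
    forecast_only: bool = True,
    buffer_seconds: int = 300,
) -> dict[str, Any]:
    """Build keyword arguments for an Auto Scaling PutScalingPolicy call.

    The defaults are illustrative assumptions, not tuned recommendations.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-predictive",  # hypothetical naming scheme
        "PolicyType": "PredictiveScaling",
        "PredictiveScalingConfiguration": {
            "MetricSpecifications": [
                {
                    "TargetValue": target_cpu,
                    "PredefinedMetricPairSpecification": {
                        "PredefinedMetricType": "ASGCPUUtilization"
                    },
                }
            ],
            # ForecastOnly produces forecasts without changing capacity.
            "Mode": "ForecastOnly" if forecast_only else "ForecastAndScale",
            # How many seconds ahead of the forecast instances should launch.
            "SchedulingBufferTime": buffer_seconds,
        },
    }


# With boto3 installed and credentials configured, the policy would be applied as:
#   boto3.client("autoscaling").put_scaling_policy(**policy)
```

Switching `forecast_only` to `False` is the single change that moves the policy from observation to action, which keeps the rollout easy to review.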

The Ones Still Maturing

Amazon Q Developer Operational Investigations

AWS's expansion of Amazon Q into operational investigations — where Q can analyze CloudWatch data and propose diagnoses — is directionally interesting but not yet at the level of DIY Claude-based agents in my experience. The query generation is solid; the multi-step investigation (follow-up queries based on what the first query reveals) is limited compared to a properly prompted custom agent.

Honest assessment: Worth having as a supplementary tool for your AWS team. Not a replacement for a custom incident response agent if you've built one.

VPC Lattice Improvements

VPC Lattice (AWS's application networking layer for service-to-service communication) received improved observability and auth capabilities. For teams building multi-account, multi-VPC architectures, Lattice continues to mature toward being the right abstraction.

Honest assessment: Still not at feature parity with service meshes like Istio for teams that need sophisticated traffic management. The integration story with non-EKS workloads improved but isn't complete.

RDS Multi-AZ Cluster Improved Failover Times

Aurora and RDS Multi-AZ failover times improved again. Aurora Multi-AZ can now fail over in under 10 seconds in many cases.

What changed operationally: For applications that experience connection pool exhaustion during failover events, 10-second failover is meaningfully better than 30-second. The connection retry logic in your application still needs to be correct, but the window is shorter.
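As a concrete picture of "correct" retry logic, here is a minimal sketch of a retry loop with capped exponential backoff and full jitter, bounded by a time budget slightly longer than the expected failover window. The function name, the 15-second budget, and the delay values are assumptions, and real database drivers raise their own exception types rather than the built-in `ConnectionError` used here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_through_failover(
    operation: Callable[[], T],
    max_wait_seconds: float = 15.0,
    base_delay: float = 0.2,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry operation on ConnectionError until it succeeds or the budget runs out."""
    deadline = clock() + max_wait_seconds
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:
            # Full jitter: wait a random time up to the capped exponential delay
            # so that many clients do not reconnect at the same instant.
            delay = random.uniform(0, min(max_delay, base_delay * 2**attempt))
            if clock() + delay > deadline:
                raise  # budget exhausted, surface the last error to the caller
            sleep(delay)
            attempt += 1
```

The `sleep` and `clock` parameters exist so the loop can be tested without waiting in real time. The jitter matters as much as the backoff: without it, every client in the pool retries in lockstep and hammers the newly promoted instance.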

The Ones Worth Watching

AWS Inferentia3 / Trainium3

New generations of AWS's purpose-built ML chips. For teams running inference on their own instances on AWS (rather than through Bedrock APIs), the cost-per-token is improving generation over generation. If your AI workloads are large enough to justify dedicated instances rather than per-request API costs, the Inferentia/Trainium lineup deserves re-evaluation with each generation.

EKS Auto Mode

Auto Mode extends what EKS manages on your behalf to cover more of the node lifecycle automatically, including compute optimization, storage management, and networking configuration. For teams that are primarily Kubernetes users rather than Kubernetes operators, this reduces the amount of EKS-specific knowledge required to operate the cluster.

The tradeoff: Less control in exchange for less operational overhead. For teams with specialized node requirements or security constraints, Auto Mode's automation may conflict with your policies. For teams that mainly want "run my containers reliably," the direction is right.

CloudWatch Logs Transformation (Preview)

The ability to parse, filter, and transform log data within CloudWatch Logs — without routing through Firehose or a Lambda — is a significant operational simplification if it lands as advertised. Fewer hops in the log pipeline means fewer failure modes and lower cost. Still in preview; production recommendation withheld until GA.

The Broader Pattern

Looking across re:Invent 2025 as a whole, the pattern is clear: AWS is investing heavily in making its AI and data services feel native to infrastructure teams, not just data scientists. The Iceberg integrations, the agent runtimes, and the operational AI tooling are all aimed at the SRE and platform engineering audience that has historically been less engaged with AI than product engineering teams.

The implication: SREs who understand the AI tooling landscape have a real advantage in using AWS services that are being specifically designed for their use cases. The teams that are most effective in 2026 are using the new AWS capabilities (S3 Tables, Bedrock AgentCore, DevOps Agent) as building blocks rather than treating them as black boxes to avoid.

*Zak Hassan is a Staff SRE specializing in AWS infrastructure, AI-powered operations, and data platform reliability. Find him at zakhassan.com or on LinkedIn.*
