There's a category of production system that most SRE teams have now deployed but almost none have properly instrumented: LLM-powered agents. Your incident response agent, your cost optimization bot, your AI-powered release validator — these are production systems making consequential decisions, and most of them have worse observability than a 2018 monolith.

This problem will get a lot more expensive as these agents become more autonomous. Here's what the observability gap looks like, why it's hard to close, and what actually works.


Why LLM Observability Is Different

Traditional service observability answers three questions: Is the service up? Is it fast? Is it correct? For a REST API, "correct" usually means "returns the right HTTP status codes and response schema." Monitoring this is well-understood — you track error rates, latency percentiles, and run synthetic checks against known inputs.

LLM observability needs to answer different questions:

  • Is the reasoning correct? The agent returned a response and called tools — but was the diagnosis right? Did it find the actual root cause or a plausible-sounding wrong one?
  • Is the cost within budget? LLM inference has per-token costs. An agent that's consuming 10x the expected tokens per invocation is either reasoning very deeply or is stuck in a loop.
  • Is the behavior drifting? Model providers update models. Prompt changes have non-obvious effects. The agent that worked correctly last month may behave differently today.
  • What did it actually do? When an autonomous agent takes an action in production, you need an immutable record of what it decided, why, and what it executed.

None of these questions are answered by standard APM tooling. Datadog, New Relic, CloudWatch — they can all tell you your agent Lambda ran for 4.2 seconds. They cannot tell you whether the diagnosis the agent produced was correct.
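
Of the four questions, the cost one is the easiest to check mechanically. The sketch below assumes you already record total tokens per invocation; the function name `flag_token_outliers`, the input shape, and the use of the median as the "expected" baseline are illustrative choices, not part of any library:

```python
from statistics import median


def flag_token_outliers(token_counts: dict[str, int], multiplier: float = 10.0) -> list[str]:
    """Return IDs of invocations whose token usage exceeds multiplier x the median."""
    if not token_counts:
        return []
    # Assumption: the median of recent invocations stands in for "expected" usage.
    baseline = median(token_counts.values())
    return sorted(
        invocation_id
        for invocation_id, tokens in token_counts.items()
        if tokens > multiplier * baseline
    )


# Hypothetical per-invocation totals (input + output tokens).
usage = {"inv-1": 4_000, "inv-2": 5_200, "inv-3": 4_600, "inv-4": 61_000}
print(flag_token_outliers(usage))  # ['inv-4']
```

A flagged invocation is only a prompt to look at the trace: the token count alone cannot distinguish deep reasoning from a loop.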


The OpenTelemetry GenAI Semantic Conventions

The finalization of OpenTelemetry's GenAI semantic conventions in late 2025 was a meaningful step. For the first time, there's a standard schema for how LLM interactions should be instrumented. The key span attributes:

text
gen_ai.system                  # "anthropic", "openai", "bedrock"
gen_ai.request.model           # "claude-sonnet-4-6"
gen_ai.request.max_tokens      # token budget
gen_ai.request.temperature     # sampling temperature
gen_ai.response.id             # response identifier (for correlation)
gen_ai.response.model          # actual model used (may differ from requested)
gen_ai.response.finish_reasons # array of strings, e.g. ["end_turn"], ["max_tokens"], ["tool_use"]
gen_ai.usage.input_tokens      # prompt tokens consumed
gen_ai.usage.output_tokens     # completion tokens generated
gen_ai.usage.cache_read_tokens # tokens served from prompt cache
gen_ai.usage.cache_creation_tokens  # tokens written to cache

For tool-use agents, each tool call should be a child span:

text
gen_ai.tool.name               # "fetch_cloudwatch_logs"
gen_ai.tool.call.id            # unique call identifier
gen_ai.event.prompt            # the tool arguments (as event)
gen_ai.event.completion        # the tool result (as event)

This gives you a trace tree that maps the full agent reasoning: the top-level LLM call, each tool call it spawned, the results returned, and the final response. In Jaeger or Tempo or any OTLP-compatible backend, you can see the entire investigation lifecycle as a structured trace.
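
To make the shape of that trace tree concrete, here is a small stdlib-only sketch that renders a flat list of exported spans as an indented tree. The span dictionaries, their field names, and the timings are made-up illustration data, not the export format of any particular backend:

```python
from collections import defaultdict


def render_trace(spans: list[dict]) -> list[str]:
    """Return one indented line per span, with children nested under their parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines: list[str] = []

    def walk(parent_id, depth: int) -> None:
        # Order siblings by start time so the tree reads chronologically.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans from one investigation.
spans = [
    {"span_id": "a", "parent_id": None, "name": "agent.run", "start_ms": 0, "duration_ms": 4200},
    {"span_id": "b", "parent_id": "a", "name": "gen_ai.chat", "start_ms": 5, "duration_ms": 1800},
    {"span_id": "c", "parent_id": "a", "name": "gen_ai.tool.fetch_cloudwatch_logs",
     "start_ms": 1810, "duration_ms": 600},
    {"span_id": "d", "parent_id": "a", "name": "gen_ai.chat", "start_ms": 2415, "duration_ms": 1700},
]
print("\n".join(render_trace(spans)))
```

A trace UI such as Jaeger or Tempo does this rendering for you; the sketch only shows what structure the instrumentation needs to produce.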


Implementing Agent Instrumentation

Here's a practical wrapper that adds OTel instrumentation to a Claude agent:

python
from opentelemetry import trace
from opentelemetry.trace import SpanKind
import anthropic
import time

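# execute_tool, calculate_cost, and extract_final_response are application-specific
# helpers that you supply; they are not part of the Anthropic or OpenTelemetry SDKs.
tracer = trace.get_tracer("sre-agent", "1.0.0")
client = anthropic.Anthropic()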

def instrumented_agent_run(
    system_prompt: str,
    user_message: str,
    tools: list,
    incident_id: str
) -> dict:
    
    with tracer.start_as_current_span(
        "agent.run",
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "anthropic",
            "gen_ai.request.model": "claude-sonnet-4-6",
            "incident.id": incident_id,
            "agent.type": "incident_response"
        }
    ) as root_span:
        
        messages = [{"role": "user", "content": user_message}]
        total_input_tokens = 0
        total_output_tokens = 0
        tool_calls_made = 0
        
        while True:
            with tracer.start_as_current_span(
                "gen_ai.chat",
                attributes={
                    "gen_ai.request.max_tokens": 4096,
                    "message_count": len(messages)
                }
            ) as llm_span:
                # perf_counter is monotonic, so clock adjustments cannot skew the latency
                start = time.perf_counter()
                response = client.messages.create(
                    model="claude-sonnet-4-6",
                    max_tokens=4096,
                    system=system_prompt,
                    tools=tools,
                    messages=messages
                )
                latency_ms = (time.perf_counter() - start) * 1000
                
                llm_span.set_attributes({
                    "gen_ai.response.id": response.id,
                    # The conventions define finish_reasons as an array of strings
                    "gen_ai.response.finish_reasons": [response.stop_reason],
                    "gen_ai.usage.input_tokens": response.usage.input_tokens,
                    "gen_ai.usage.output_tokens": response.usage.output_tokens,
                    # Custom attribute, not part of the GenAI conventions
                    "gen_ai.latency_ms": latency_ms
                })
                
                total_input_tokens += response.usage.input_tokens
                total_output_tokens += response.usage.output_tokens
            
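            # Any stop reason other than "tool_use" (such as end_turn or max_tokens)
            # ends the run; otherwise the loop would resend the same messages forever.
            if response.stop_reason != "tool_use":
                break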
            
            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        tool_calls_made += 1
                        
                        with tracer.start_as_current_span(
                            f"gen_ai.tool.{block.name}",
                            attributes={
                                "gen_ai.tool.name": block.name,
                                "gen_ai.tool.call.id": block.id,
                                "tool.input": str(block.input)
                            }
                        ) as tool_span:
                            result = execute_tool(block.name, block.input)
                            tool_span.set_attribute("tool.result_length", len(str(result)))
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": str(result)
                            })
                
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_results})
        
        # Set summary attributes on root span
        root_span.set_attributes({
            "agent.total_input_tokens": total_input_tokens,
            "agent.total_output_tokens": total_output_tokens,
            "agent.tool_calls_made": tool_calls_made,
            "agent.estimated_cost_usd": calculate_cost(total_input_tokens, total_output_tokens)
        })
        
        return extract_final_response(response)

This gives you a trace that shows every LLM call and tool call in the investigation, with latencies, token counts, and estimated costs attached. When an investigation goes wrong, you can pull the trace and understand exactly what the agent observed and decided at each step.
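
The wrapper calls three helpers that it does not define: `execute_tool`, `calculate_cost`, and `extract_final_response`. What follows is one possible sketch of them, not the original implementation. The `TOOL_REGISTRY` dict, the `echo` tool, and the per-million-token rates are placeholders (the rates are not real pricing), and `SimpleNamespace` stands in for the SDK response object so the example runs without an API key:

```python
from types import SimpleNamespace

# Hypothetical registry mapping tool names to plain Python callables.
TOOL_REGISTRY = {
    "echo": lambda args: args.get("text", ""),
}


def execute_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call; return errors as text so the model can react to them."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return f"error: unknown tool '{name}'"
    try:
        return str(handler(tool_input))
    except Exception as exc:  # reported back to the model as the tool result
        return f"error: {type(exc).__name__}: {exc}"


def calculate_cost(input_tokens: int, output_tokens: int,
                   usd_per_mtok_in: float = 1.0, usd_per_mtok_out: float = 5.0) -> float:
    """Estimate cost in USD. The default rates are placeholders, not real pricing."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


def extract_final_response(response) -> dict:
    """Join the text blocks of the final response and keep its stop reason."""
    text = "".join(block.text for block in response.content if block.type == "text")
    return {"text": text, "stop_reason": response.stop_reason}


# Stand-in for an SDK response object, for illustration only.
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Root cause: expired certificate.")],
)
print(extract_final_response(fake))
print(calculate_cost(1_000_000, 200_000))  # 2.0 with the placeholder rates
```

Returning tool errors as strings rather than raising keeps the agent loop alive and lets the model see the failure. If you would rather have failures recorded as exceptions on the tool span, let them propagate instead.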


The Quality Metrics You Need

Beyond infrastructure metrics, you need quality metrics — and these are harder. Some approaches:

Outcome tracking. After an agent produces a diagnosis, does the human on-call confirm it or override it? Track the override rate. A rising override rate is a signal that agent quality is degrading.

Resolution correlation. Did incidents where the agent produced a confident diagnosis resolve faster than incidents where it expressed uncertainty or the human ignored the recommendation? This tells you whether agent confidence is calibrated.

Tool call efficiency. How many tool calls does the agent make per investigation? A well-designed agent should converge to a diagnosis in a reasonable number of calls. An agent making 30+ tool calls for a routine incident is probably confused or has a prompt issue.

Cost per investigation. Track this over time. Unexpected cost increases can indicate prompt drift (the agent is generating longer responses), context window growth (you've been accumulating more context per investigation), or an agent entering reasoning loops.
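
Three of these metrics (override rate, tool call efficiency, and cost per investigation) can be derived from one record per investigation. A minimal sketch, assuming a hypothetical `Investigation` record that your pipeline populates after each run; the field names and the nearest-rank p95 method are illustrative choices:

```python
from dataclasses import dataclass


@dataclass
class Investigation:
    overridden: bool   # did the on-call human override the agent's diagnosis?
    tool_calls: int    # tool calls made during the investigation
    cost_usd: float    # estimated cost of the investigation


def summarize(investigations: list[Investigation]) -> dict:
    """Compute override rate (%), p95 tool calls (nearest rank), and mean cost."""
    n = len(investigations)
    if n == 0:
        return {}
    overrides = sum(1 for inv in investigations if inv.overridden)
    calls = sorted(inv.tool_calls for inv in investigations)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), avoids float rounding
    return {
        "override_rate_pct": 100 * overrides / n,
        "tool_calls_p95": calls[rank - 1],
        "mean_cost_usd": sum(inv.cost_usd for inv in investigations) / n,
    }


# Hypothetical batch: 20 investigations, 2 overridden, one tool-call outlier.
batch = (
    [Investigation(False, 4, 0.25) for _ in range(18)]
    + [Investigation(True, 9, 0.25), Investigation(True, 31, 0.25)]
)
print(summarize(batch))
```

Resolution correlation is left out because it needs incident timelines joined against agent output, which depends on your incident tooling.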


Alerting on Agent Behavior

Once you have these metrics, you can alert on agent health like any other service:

yaml
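# Sketches using CloudWatch alarm property names. Namespace, Statistic, and Period are assumptions.

# CloudWatch Alarm: Agent quality degrading
AlarmName: SREAgent-HighOverrideRate
Namespace: SREAgent
MetricName: agent.human_override_count  # assumes the count is published per 100 investigations
Statistic: Average
Period: 3600               # one-hour periods
EvaluationPeriods: 24
Threshold: 5               # more than 5 overrides per 100 investigations
ComparisonOperator: GreaterThanThreshold
# -> Alert: prompt review needed

# CloudWatch Alarm: Cost spike
AlarmName: SREAgent-CostAnomaly
Namespace: SREAgent
MetricName: agent.estimated_cost_usd
Statistic: Sum
Period: 86400              # one day
EvaluationPeriods: 1
Threshold: 50              # $50 per day baseline
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
# -> Alert: possible reasoning loop or input data change

# CloudWatch Alarm: Tool call explosion
AlarmName: SREAgent-ToolCallAnomaly
Namespace: SREAgent
MetricName: agent.tool_calls_per_investigation_p95
Statistic: Maximum
Period: 3600
EvaluationPeriods: 1
Threshold: 25              # p95 above 25 tool calls per investigation
ComparisonOperator: GreaterThanThreshold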

The Bigger Picture

The observability gap in AI agents isn't a technical problem: the tooling exists, the OTel conventions are standardized, the instrumentation patterns are known. It's a prioritization problem. Teams are focused on building agents and treating observability as something to add later.

"Later" is the wrong time. The trust your platform team places in an AI agent (the degree to which they rely on its diagnosis, extend its autonomy, build workflows around its output) is built on their ability to see what it's doing and verify that it's working. Invisible agents don't get trusted, and untrusted agents don't get used at their full potential.

Instrument from day one. The tooling overhead is low, the payoff in debugging time and trust-building is high, and when something goes wrong at 2am — and it will — you'll want to understand what your agent was thinking.


*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation and observability. Find him at zakhassan.com or on LinkedIn.*
