SREs are good at writing precise specifications: runbooks, alert definitions, SLO documents. Prompt engineering is the same discipline applied to AI systems: writing instructions that produce consistent, correct behavior from a model. It's not magic; it's systems thinking applied to a probabilistic system.
After building several production AI agents for operational use cases — incident response, cost optimization, security triage — here are the patterns that work and the mistakes I made that you don't need to repeat.
The Mental Model: You're Configuring a System, Not Having a Conversation
The biggest mistake people make with production prompts is treating them like chat messages. Chat is casual and iterative — you can follow up, clarify, and correct. A production system prompt runs thousands of times without you in the loop. Every ambiguity, every missing constraint, every unclear instruction will manifest as variance in outputs — some of them wrong.
Write system prompts the way you write runbooks: complete, precise, and assuming the reader has no additional context beyond what's written.
Specifically:
- Define the role precisely. Not "you are a helpful assistant" but "you are a Tier-1 incident triage agent for a cloud infrastructure team. Your outputs will be used to make real-time operational decisions."
- Define the output format explicitly. Not "give me a summary" but "respond in JSON with fields: diagnosis (string), confidence (enum: high/medium/low), recommended_action (string), evidence (array of strings), risk_flags (array of strings)."
- Define what NOT to do. Negative constraints are as important as positive ones. "Do not recommend database configuration changes during business hours." "Do not classify an alert as false positive if you cannot explain the behavior."
- Define escalation conditions. When should the agent ask for help rather than produce an answer? "If you have less than medium confidence in your diagnosis, say so explicitly and recommend human review."
The System Prompt Structure That Works
For production operational agents, I've settled on this structure:

    [ROLE AND CONTEXT]
    You are [specific role] for [specific organization/system type].
    Your outputs are used for [specific downstream use].

    [INVESTIGATION APPROACH]
    When investigating, follow this order:
    1. [Step 1 with specific criteria]
    2. [Step 2 with specific criteria]
    ...

    [OUTPUT FORMAT]
    Respond exclusively in the following JSON format:
    {
      "field1": "description of what goes here",
      "field2": "description",
      ...
    }

    [CONSTRAINTS - HARD RULES]
    NEVER:
    - [Hard constraint 1]
    - [Hard constraint 2]
    ALWAYS:
    - [Hard requirement 1]

    [ESCALATION CONDITIONS]
    Escalate to human review (set confidence: "low") when:
    - [Condition 1]
    - [Condition 2]

    [DOMAIN KNOWLEDGE]
    Key facts about this system:
    - [Relevant operational knowledge that helps the agent]
    - [Known issues, quirks, historical patterns]

Each section serves a different purpose. Role and context sets expectations. Investigation approach prevents the agent from jumping to conclusions before gathering evidence. Output format enables programmatic parsing. Constraints prevent the specific failure modes you've identified. Domain knowledge lets you inject operational knowledge that isn't in the training data.
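One way to keep that structure from drifting across agents is to render it from data instead of hand-editing one long string. The sketch below is only an illustration: `PromptSpec`, `render_prompt`, and the investigation steps and domain fact in the example are hypothetical names and values, not part of any library or of the prompts described here.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the sections of a system prompt."""
    role: str
    downstream_use: str
    investigation_steps: list[str]
    output_format: str
    never: list[str] = field(default_factory=list)
    always: list[str] = field(default_factory=list)
    escalate_when: list[str] = field(default_factory=list)
    domain_facts: list[str] = field(default_factory=list)


def _bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


def render_prompt(spec: PromptSpec) -> str:
    """Render the sections in a fixed order so every prompt has the same shape."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(spec.investigation_steps, 1))
    sections = [
        ("ROLE AND CONTEXT",
         f"You are {spec.role}.\nYour outputs are used for {spec.downstream_use}."),
        ("INVESTIGATION APPROACH",
         "When investigating, follow this order:\n" + steps),
        ("OUTPUT FORMAT",
         "Respond exclusively in the following JSON format:\n" + spec.output_format),
        ("CONSTRAINTS - HARD RULES",
         "NEVER:\n" + _bullets(spec.never) + "\nALWAYS:\n" + _bullets(spec.always)),
        ("ESCALATION CONDITIONS",
         'Escalate to human review (set confidence: "low") when:\n'
         + _bullets(spec.escalate_when)),
        ("DOMAIN KNOWLEDGE",
         "Key facts about this system:\n" + _bullets(spec.domain_facts)),
    ]
    return "\n\n".join(f"[{title}]\n{body}" for title, body in sections)


# Example values: the role and constraints echo the text above; the steps and
# the domain fact are made up for illustration.
spec = PromptSpec(
    role="a Tier-1 incident triage agent for a cloud infrastructure team",
    downstream_use="real-time operational decisions",
    investigation_steps=["Check recent deployments", "Check upstream dependencies"],
    output_format='{"diagnosis": "...", "confidence": "high | medium | low"}',
    never=["Recommend database configuration changes during business hours"],
    always=["List the evidence that supports the diagnosis"],
    escalate_when=["You have less than medium confidence in your diagnosis"],
    domain_facts=["Scheduled batch jobs temporarily raise memory utilization"],
)
prompt = render_prompt(spec)
print(prompt)
```

Rendering from a spec also makes a prompt change show up in review as a one-line diff to a list rather than an edit buried in a paragraph.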
Structured Output: Don't Parse Prose in Production
Free-text responses are fine for chat applications. In production systems, you need to parse the agent's output programmatically — extract the diagnosis, check the confidence level, route based on the recommended action. Parsing prose is fragile. Parsing JSON is reliable.
Force structured output at the prompt level:

    SYSTEM_PROMPT = """
    ...investigation instructions...

    YOU MUST RESPOND ONLY WITH VALID JSON. No preamble, no explanation outside
    the JSON structure. Any explanatory text goes inside the JSON fields.

    Required response format:
    {
      "diagnosis": "One-sentence description of the root cause",
      "confidence": "high | medium | low",
      "evidence": [
        "Specific fact 1 that supports the diagnosis",
        "Specific fact 2"
      ],
      "recommended_action": "Specific action to take",
      "rollback_safe": true | false,
      "estimated_resolution_time_minutes": 5,
      "risk_flags": [
        "Any risks or caveats the human should know"
      ],
      "follow_up_queries": [
        "Additional data that would increase confidence if available"
      ]
    }
    """

Then validate and parse on the output side:

    import json
    import logging
    from typing import Literal

    from pydantic import BaseModel

    # Module-level logger and error type used by parse_diagnosis below.
    logger = logging.getLogger(__name__)


    class DiagnosisParseError(Exception):
        """Raised when the agent's response is not valid JSON or fails validation."""


    class IncidentDiagnosis(BaseModel):
        diagnosis: str
        confidence: Literal["high", "medium", "low"]
        evidence: list[str]
        recommended_action: str
        rollback_safe: bool
        estimated_resolution_time_minutes: int
        risk_flags: list[str]
        follow_up_queries: list[str]


    def parse_diagnosis(raw_response: str) -> IncidentDiagnosis:
        try:
            data = json.loads(raw_response)
            return IncidentDiagnosis(**data)
        except (json.JSONDecodeError, ValueError, TypeError) as e:
            # Pydantic's ValidationError subclasses ValueError. TypeError covers
            # valid JSON that is not an object (for example, a bare list).
            # Log the raw response for debugging
            logger.error(f"Failed to parse agent response: {raw_response[:500]}")
            raise DiagnosisParseError(f"Agent produced invalid output: {e}") from e

Pydantic validation catches schema violations automatically. When the model produces output that doesn't match the schema (misnamed or missing required fields, wrong types, a confidence value outside the allowed set), you get a clean error rather than silent incorrect behavior.
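Once the output is structured, routing becomes a plain function of a couple of fields. The following is a minimal sketch that uses only the standard library so it stands alone; the three route names and the policy itself (auto-remediate only on high confidence with a safe rollback) are assumptions for illustration, not a recommendation for any particular environment.

```python
import json

# Hypothetical routing targets; substitute whatever your paging or ticketing
# setup actually uses.
AUTO_REMEDIATE = "auto_remediate"
NOTIFY_CHANNEL = "notify_channel"
PAGE_HUMAN = "page_human"


def route(raw_response: str) -> str:
    """Pick a destination from the agent's JSON; anything unparseable goes to a human."""
    try:
        data = json.loads(raw_response)
        confidence = data["confidence"]
        rollback_safe = data["rollback_safe"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fail closed: prose, a missing field, or a non-object all page a human.
        return PAGE_HUMAN

    if confidence == "high" and rollback_safe is True:
        return AUTO_REMEDIATE
    if confidence == "medium":
        return NOTIFY_CHANNEL
    # Low confidence, unknown values, or high confidence without a safe rollback.
    return PAGE_HUMAN


print(route('{"confidence": "high", "rollback_safe": true}'))
print(route("Sure! Here is my diagnosis..."))
```

The design choice worth noting is the default: every branch that is not explicitly recognized ends at a human, so a malformed response can never trigger an automated action.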
Few-Shot Examples: Calibrating for Your Domain
For complex classification tasks — "is this alert a false positive?" or "what category of failure is this?" — few-shot examples in the system prompt dramatically improve accuracy. The model learns the target behavior from examples rather than trying to infer it from abstract instructions.
Structure: provide 3-5 examples, covering the most common cases and at least one edge case.

    [CLASSIFICATION EXAMPLES]

    Example 1 - HIGH CONFIDENCE INFRASTRUCTURE FAILURE:
    Alert: payment-service P99 latency > 5000ms
    Recent events: deployment of payment-service v2.3 at 14:22 UTC
    Evidence gathered: error spike begins 14:24, slow query log shows
    full table scan on transactions table for queries with 90-day lookback
    Expected output:
    {
      "diagnosis": "Deployment v2.3 introduced a missing database index causing full table scans",
      "confidence": "high",
      "rollback_safe": true,
      ...
    }

    Example 2 - LOW CONFIDENCE / ESCALATE:
    Alert: checkout-service error rate 0.8% (threshold 0.5%)
    Recent events: no deployments in 48 hours
    Evidence gathered: errors concentrated in eu-west-1, no corresponding AWS health issues,
    upstream dependencies healthy, traffic pattern normal
    Expected output:
    {
      "diagnosis": "Intermittent errors in eu-west-1 with no clear root cause identified",
      "confidence": "low",
      "recommended_action": "Escalate to on-call engineer for manual investigation",
      ...
    }

    Example 3 - FALSE POSITIVE:
    Alert: high memory utilization on worker-pool-3 (88%)
    Recent events: scheduled batch job began 30 minutes ago
    Evidence gathered: batch job is expected to use 85-90% memory per runbook,
    job is 40% complete, memory is stable (not growing)
    Expected output:
    {
      "diagnosis": "Expected memory utilization from scheduled batch job batch-etl-daily",
      "confidence": "high",
      "recommended_action": "No action required. Alert can be suppressed during batch window.",
      ...
    }

The Reasoning Field: Making Agents Legible
For any agent that takes or recommends actions with real consequences, add a reasoning field to the output schema. Require the agent to explain its chain of logic before stating its conclusion.

    {
      "reasoning": "The error spike began at 14:23 UTC. The deployment of v2.3 completed at 14:21 UTC. The 2-minute gap is consistent with deployment propagation time. The slow query log shows queries on the transactions table taking 2,400ms where the same queries took 8ms before the deployment. The git diff for v2.3 shows the removal of an index on (customer_id, created_at). These facts together strongly support a missing index as the root cause.",
      "diagnosis": "Missing database index introduced in v2.3 deployment",
      "confidence": "high"
    }

Two reasons this matters. First, it improves output quality: models that articulate reasoning before concluding make fewer logical errors. Second, it makes the agent's output auditable. When a human reviews the recommendation, they can see whether the reasoning is sound rather than just whether the conclusion looks plausible.
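The quality benefit depends on the reasoning actually coming first: a model emits its output left to right, so a reasoning field written after the diagnosis is a justification, not a derivation. A small check like the one below can enforce the ordering on the output side. It is a sketch under that assumption, and `reasoning_precedes_conclusion` is a hypothetical helper name.

```python
import json


def reasoning_precedes_conclusion(raw_response: str) -> bool:
    """True if a non-empty "reasoning" key appears before "diagnosis" in the JSON object."""
    data = json.loads(raw_response)
    if not isinstance(data, dict):
        return False
    keys = list(data)  # json.loads keeps keys in the order they appear in the text
    if "reasoning" not in keys or "diagnosis" not in keys:
        return False
    if not str(data["reasoning"]).strip():
        return False
    return keys.index("reasoning") < keys.index("diagnosis")


good = '{"reasoning": "Spike follows the deploy.", "diagnosis": "Missing index", "confidence": "high"}'
late = '{"diagnosis": "Missing index", "reasoning": "Spike follows the deploy."}'
empty = '{"reasoning": "  ", "diagnosis": "Missing index"}'

print(reasoning_precedes_conclusion(good))
print(reasoning_precedes_conclusion(late))
print(reasoning_precedes_conclusion(empty))
```

Responses that fail the check can be logged and counted; a rising failure rate after a prompt change is a sign the format instructions got weaker.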
Iterating on Prompts: Treating It Like Software
Production prompts need a development lifecycle:
Version control. Store system prompts in git. Tag versions. Treat prompt changes as code changes — reviewed, tested, and deployed intentionally.
Evaluation sets. Maintain a set of test cases — historical incidents with known correct diagnoses. When you change a prompt, run it against the evaluation set and compare outputs. Did accuracy improve? Did it regress on any cases?
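A minimal sketch of such a comparison, assuming each prompt version can be wrapped as a function from alert text to a category label. The case ids, labels, and the two stand-in classifiers are hypothetical; in practice each classifier would call the model with a different system prompt.

```python
from typing import Callable

# Hypothetical evaluation set: historical alerts paired with the category a
# human confirmed after the incident.
EVAL_SET = [
    {"id": "inc-001", "alert": "payment-service P99 latency > 5000ms", "expected": "bad_deploy"},
    {"id": "inc-002", "alert": "worker-pool-3 memory 88%", "expected": "false_positive"},
    {"id": "inc-003", "alert": "checkout-service error rate 0.8%", "expected": "escalate"},
]


def evaluate(classify: Callable[[str], str], cases: list[dict]) -> dict[str, bool]:
    """Run one prompt version over every case and map case id to pass/fail."""
    return {case["id"]: classify(case["alert"]) == case["expected"] for case in cases}


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed with the old prompt and fail with the new one."""
    return sorted(cid for cid, passed in before.items() if passed and not after.get(cid, False))


# Stand-ins for two prompt versions (no model call, purely illustrative).
def prompt_v1(alert: str) -> str:
    return "false_positive" if "memory" in alert else "bad_deploy"


def prompt_v2(alert: str) -> str:
    return "escalate" if "checkout" in alert else "bad_deploy"


before = evaluate(prompt_v1, EVAL_SET)
after = evaluate(prompt_v2, EVAL_SET)
accuracy_before = sum(before.values()) / len(before)
accuracy_after = sum(after.values()) / len(after)
regressed = regressions(before, after)

print(f"accuracy: {accuracy_before:.2f} -> {accuracy_after:.2f}, regressions: {regressed}")
```

In this toy run both versions get two of three cases right, yet the new one regresses on inc-002. That is why the per-case comparison matters: an aggregate accuracy number alone would hide it.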
A/B testing for significant changes. For changes to critical prompts, run the new version against a small fraction of real traffic alongside the current version. Compare output quality and human override rates.
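One way to split traffic is to hash a stable identifier so the same incident always sees the same prompt version. The sketch below assumes incidents carry a string id; `assign_variant`, `override_rate`, and the 5% default are illustrative choices, not a prescribed setup.

```python
import hashlib


def assign_variant(incident_id: str, candidate_fraction: float = 0.05) -> str:
    """Deterministically send a fixed fraction of incidents to the candidate prompt."""
    digest = hashlib.sha256(incident_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # integer in 0..9999
    return "candidate" if bucket < candidate_fraction * 10_000 else "current"


def override_rate(outcomes: list[bool]) -> float:
    """Share of recommendations a human overrode; True marks an override."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# The same id always lands in the same arm, so retries and follow-up queries
# for one incident never mix prompt versions.
print(assign_variant("incident-12345"))
print(override_rate([True, False, False, False]))
```

Hashing instead of calling a random number generator keeps the assignment reproducible, which helps when reconstructing after the fact which prompt produced a given recommendation.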
Prompt change postmortems. When a prompt change causes a regression, do a brief postmortem. What assumption was wrong? What test case was missing from the evaluation set? Update the evaluation set before deploying the fix.
The teams that get the most out of AI operational tooling treat prompts as first-class engineering artifacts — not as configuration that gets tweaked and forgotten.
*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation, prompt engineering, and operational tooling. Find him at zakhassan.com or on LinkedIn.*