SRE: Incident Management, On-Call, and Postmortems as Code


Introduction

In the previous article we covered how to define SLIs and SLOs as code, deploy them with ArgoCD, and use MCP servers to automate SRE workflows. But we left out a critical question: what happens when your SLOs actually break?


That is what this article is about. We are going to cover the full incident lifecycle: detecting problems through good alerting, responding effectively with on-call rotations and escalation policies, mitigating incidents fast with runbook automation, and learning from failures through blameless postmortems. All of it as code, because if it is not automated and version-controlled, it will rot.


Let’s get into it.


The incident lifecycle

Before we dive into tooling, let’s understand the phases every incident goes through:


  1. Detection: Your monitoring notices something is wrong (ideally before users do)
  2. Triage: Someone assesses severity and decides if it’s a real incident
  3. Response: The right people are paged and start working on it
  4. Mitigation: You stop the bleeding, even if you don’t fully understand the root cause yet
  5. Resolution: You fix the underlying problem
  6. Learning: You run a postmortem to prevent it from happening again

Most teams are decent at steps 1 through 5 but terrible at step 6. We are going to fix that.


Alerting that does not suck

The foundation of incident management is alerting. Bad alerts lead to alert fatigue, which leads to people ignoring alerts, which leads to incidents that go unnoticed for hours. I have seen this cycle more times than I would like to admit.


If you followed the previous article, you already have SLO-based alerts generated by Sloth. Those are great because they use multi-window, multi-burn-rate detection. But let’s talk about the principles that make alerting actually useful:


1. Alert on symptoms, not causes

Your users do not care that CPU usage is at 90%. They care that the page takes 10 seconds to load. Alert on the SLIs (latency, availability, quality), not on infrastructure metrics.


# BAD: alerting on cause
- alert: HighCPUUsage
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
  labels:
    severity: critical

# GOOD: alerting on symptom (SLI-based)
- alert: TrWebHighLatency
  expr: |
    sli:latency:ratio_rate5m < 0.99
    and
    sli:latency:ratio_rate1h < 0.99
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Latency SLO is being violated"
    runbook: "https://runbooks.example.com/tr-web-high-latency"

2. Every alert must have a runbook

If an alert fires and the on-call person has no idea what to do, the alert is useless. Every alert annotation should include a link to a runbook that explains what the alert means, how to triage it, and how to mitigate it.


3. Two severity levels are enough

Keep it simple. You need two levels:


  • Page (critical): Wake someone up at 3am. The service is down or the SLO is being burned fast.
  • Ticket (warning): Create a ticket for business hours. Something is degrading slowly.

If you have five severity levels, nobody knows which ones to take seriously. Two levels force you to make a clear decision: is this worth waking someone up? If not, it is a ticket.
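The two levels map directly onto Alertmanager routing. A minimal sketch, assuming a `pagerduty` receiver for pages and a `ticketing` receiver for everything else (the receiver names are illustrative):

```yaml
# Two severities, two destinations: critical pages, warning tickets
route:
  receiver: ticketing            # default: anything unmatched becomes a ticket
  routes:
    - match:
        severity: critical
      receiver: pagerduty        # wake someone up
    - match:
        severity: warning
      receiver: ticketing        # business-hours queue
```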


On-call that does not burn people out

On-call is one of the hardest parts of SRE to get right. Done poorly, it destroys morale and burns out your best engineers. Done well, it is a manageable responsibility that makes everyone better at building reliable systems.


Here are the principles that work:


1. Rotation schedule as code

Store your on-call schedule in a Git repository and sync it with your alerting tool. Here is an example using PagerDuty’s Terraform provider:


# on-call/main.tf
resource "pagerduty_schedule" "platform_primary" {
  name      = "Platform Team - Primary"
  time_zone = "America/Argentina/Buenos_Aires"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-01T09:00:00-03:00"
    rotation_virtual_start       = "2026-01-01T09:00:00-03:00"
    rotation_turn_length_seconds = 604800  # 1 week

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.charlie.id,
      pagerduty_user.diana.id,
    ]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "09:00:00"
      duration_seconds  = 57600  # 16 hours (9am to 1am)
    }
  }
}

resource "pagerduty_escalation_policy" "platform" {
  name      = "Platform Team Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.platform_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = pagerduty_user.team_lead.id
    }
  }
}

2. Follow the sun (if you can)

If your team spans multiple time zones, set up follow-the-sun rotations so nobody gets paged at 3am. Even with a small team, you can overlap shifts so the person in the best time zone handles most off-hours alerts.


3. Compensate on-call properly

This is not a technical point, but it matters. On-call is extra work. If your organization does not compensate it (extra pay, time off in lieu, or at minimum acknowledging it), people will resent it and leave. The Google SRE book recommends that on-call should not exceed 25% of an engineer’s time.


4. Track on-call health metrics

Measure how your on-call is doing:


  • Pages per shift: If someone gets paged more than twice per shift, your alerts are too noisy
  • Time to acknowledge: How long before someone responds?
  • False positive rate: What percentage of pages turn out to be nothing?
  • After-hours pages: How many pages happen outside business hours?
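Given a list of incidents, these four metrics are straightforward to compute. A sketch — the field names (`created_at`, `acknowledged_at`, `actionable`) are assumptions for illustration, not the PagerDuty schema:

```python
from datetime import datetime

def _parse(ts):
    # ISO-8601 with a trailing Z, as most paging APIs return it
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def oncall_health(incidents, shifts=1):
    """Summarize on-call health for a rotation period.

    `incidents`: list of dicts with ISO-8601 `created_at`, optional
    `acknowledged_at`, and an `actionable` flag (hypothetical fields).
    """
    ack = [
        (_parse(i["acknowledged_at"]) - _parse(i["created_at"])).total_seconds()
        for i in incidents
        if i.get("acknowledged_at")
    ]
    return {
        "pages_per_shift": len(incidents) / shifts,
        "mean_time_to_ack_s": sum(ack) / len(ack) if ack else 0.0,
        "false_positive_rate": (
            sum(1 for i in incidents if not i.get("actionable", True)) / len(incidents)
            if incidents else 0.0
        ),
        # Pages outside 09:00-18:00 (in whatever zone the timestamps use)
        "off_hours_pages": sum(
            1 for i in incidents if not 9 <= _parse(i["created_at"]).hour < 18
        ),
    }
```

If pages per shift creeps above two, treat it as a signal to prune alerts, not as a fact of life.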

You can pull these metrics from PagerDuty’s API and feed them into your dashboards:


# on-call-health/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: oncall-health-metrics
  namespace: monitoring
spec:
  schedule: "0 8 * * 1"  # Every Monday at 8am
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: metrics-collector
              image: kainlite/oncall-health:latest
              env:
                - name: PAGERDUTY_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: pagerduty-token
                      key: token
                - name: PROMETHEUS_PUSHGATEWAY
                  value: "http://prometheus-pushgateway:9091"
              command:
                - /bin/sh
                - -c
                - |
                  # Fetch incidents from the last 7 days
                  SINCE=$(date -d '7 days ago' -Iseconds)
                  UNTIL=$(date -Iseconds)

                  curl -s -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
                    "https://api.pagerduty.com/incidents?since=$SINCE&until=$UNTIL" \
                    | jq '{
                        total_incidents: (.incidents | length),
                        ack_avg_seconds: ([.incidents[]
                          | select(.acknowledgements[0].at != null)
                          | (.acknowledgements[0].at | fromdateiso8601) - (.created_at | fromdateiso8601)]
                          | if length > 0 then add / length else 0 end),
                        off_hours: ([.incidents[]
                          | (.created_at | fromdateiso8601 | gmtime | strftime("%H") | tonumber)
                          | select(. < 9 or . > 17)] | length)
                      }' > /tmp/metrics.json

                  # Push to Prometheus
                  TOTAL=$(jq .total_incidents /tmp/metrics.json)
                  OFF_HOURS=$(jq .off_hours /tmp/metrics.json)

                  cat <<PROM | curl --data-binary @- "$PROMETHEUS_PUSHGATEWAY/metrics/job/oncall_health"
                  # HELP oncall_incidents_total Total incidents in the last 7 days
                  oncall_incidents_total $TOTAL
                  # HELP oncall_off_hours_total Off-hours incidents in the last 7 days
                  oncall_off_hours_total $OFF_HOURS
                  PROM
          restartPolicy: Never

Runbooks as code

A runbook is a step-by-step guide for handling a specific alert or incident type. The best runbooks are not documents gathering dust in a wiki — they are executable code that lives next to your infrastructure.


Here is a structure that works well:


runbooks/
├── tr-web-high-latency.md
├── tr-web-high-error-rate.md
├── database-connection-pool-exhausted.md
├── certificate-expiring.md
├── disk-space-critical.md
└── templates/
    └── runbook-template.md

And here is what a good runbook looks like:


# TR-Web High Latency Runbook

## Alert

- **Name**: TrWebHighLatency
- **Severity**: critical (page) / warning (ticket)
- **SLO**: 99% of requests under 300ms

## Quick diagnosis

1. Check if it is a deploy-related issue:

       kubectl -n default rollout history deployment/tr-web
       argocd app history tr-web

2. Check pod resource usage:

       kubectl -n default top pods -l app=tr-web

3. Check database connection pool:

       kubectl -n default exec deploy/tr-web -- bin/tr rpc "Ecto.Adapters.SQL.query(Tr.Repo, \"SELECT count(*) FROM pg_stat_activity\")"

4. Check for upstream dependency issues:

       kubectl -n default logs deploy/tr-web --since=10m | grep -i error | head -20

## Mitigation actions

### If caused by a recent deploy

    argocd app rollback tr-web

### If caused by the database

    kubectl -n default rollout restart deployment/tr-web

### If caused by an external dependency (e.g., GitHub API)

- Check lib/tr/sponsors.ex - the GitHub sponsor fetch runs on a schedule
- The dedicated :github_pool Hackney pool should isolate it, but verify:

      kubectl -n default logs deploy/tr-web --since=10m | grep "github_pool\|sponsors"

## Escalation

- If not resolved in 30 minutes, escalate to team lead
- If data loss suspected, escalate immediately

The key is that every diagnostic step has actual commands you can copy-paste. When you are half-asleep at 3am, you do not want to think — you want to follow steps.


Automating runbook steps with Kubernetes Jobs

Some runbook steps can be fully automated. For example, if a high latency alert always resolves with a pod restart, why not automate that?


Here is a pattern using Alertmanager webhooks and a simple automation controller:


# alertmanager-config.yaml
receivers:
  - name: auto-remediation
    webhook_configs:
      - url: http://remediation-controller:8080/webhook
        send_resolved: true

route:
  routes:
    - match:
        alertname: TrWebHighLatency
        auto_remediate: "true"
      receiver: auto-remediation
      continue: true  # also send to normal receivers

The remediation controller receives the webhook and creates a Kubernetes Job:


# remediation-controller/handler.py
import time

from kubernetes import client, config

# The controller runs in-cluster; use the pod's service account credentials
config.load_incluster_config()

def handle_alert(alert):
    alertname = alert["labels"]["alertname"]
    namespace = alert["labels"].get("namespace", "default")

    if alertname == "TrWebHighLatency" and alert["status"] == "firing":
        # Create a remediation job
        job = client.V1Job(
            metadata=client.V1ObjectMeta(
                name=f"remediate-{alertname.lower()}-{int(time.time())}",
                namespace=namespace,
            ),
            spec=client.V1JobSpec(
                ttl_seconds_after_finished=3600,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        service_account_name="remediation-sa",
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="remediate",
                                image="bitnami/kubectl:latest",
                                command=[
                                    "/bin/sh", "-c",
                                    f"kubectl -n {namespace} rollout restart deployment/tr-web"
                                ],
                            )
                        ],
                    )
                ),
            ),
        )

        batch_v1 = client.BatchV1Api()
        batch_v1.create_namespaced_job(namespace=namespace, body=job)
        print(f"Created remediation job for {alertname}")
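Alertmanager delivers alerts to the webhook in batches, wrapped in a JSON payload with an `alerts` array. Before the controller can call `handle_alert` on each one, it has to unpack that payload — a minimal sketch:

```python
import json

def parse_webhook(body):
    """Unpack an Alertmanager webhook payload into individual alerts.

    Each alert carries its own status ("firing" or "resolved") and labels,
    which is the shape handle_alert above expects.
    """
    payload = json.loads(body)
    return [
        {"labels": a["labels"], "status": a["status"]}
        for a in payload.get("alerts", [])
    ]
```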

This is powerful but dangerous. Automatic remediation should:


  • Only trigger for well-understood, safe actions (like a pod restart)
  • Have a cooldown period (do not restart pods in a loop)
  • Always notify humans about what it did
  • Never make destructive changes (scaling down, deleting data)
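The cooldown is the easiest of these to get wrong. A sketch of a guard that refuses to remediate the same alert twice inside the window (the class name and API are hypothetical):

```python
import time

class RemediationGate:
    """Cooldown guard so auto-remediation cannot fire in a loop."""

    def __init__(self, cooldown_seconds=1800):
        self.cooldown = cooldown_seconds
        self._last = {}  # alertname -> timestamp of last remediation

    def allow(self, alertname, now=None):
        """Return True and record the action, or False if still cooling down."""
        now = time.time() if now is None else now
        last = self._last.get(alertname)
        if last is not None and now - last < self.cooldown:
            return False  # inside the cooldown window: escalate to a human instead
        self._last[alertname] = now
        return True
```

Call `gate.allow(alertname)` before creating the remediation Job; on `False`, page the on-call instead of restarting again.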

Blameless postmortems

This is arguably the most important part of incident management, and the one most teams skip. A postmortem is not about finding who screwed up. It is about understanding what happened, why it happened, and how to prevent it from happening again.


The word “blameless” is critical. If people are afraid of being blamed, they will hide information and the postmortem becomes useless. The goal is to improve the system, not punish individuals.


Postmortem template as code

Store your postmortem template in Git and generate new postmortems from it:


# postmortems/template.md
# Postmortem: [TITLE]

## Incident summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Severity**: SEV-1 / SEV-2
- **Impact**: [Who was affected and how]
- **Detection**: [How was it detected - alert, user report, etc.]

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | Alert fires |
| HH:MM | On-call acknowledges |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Full resolution |

## Root cause
[What actually went wrong at a technical level]

## Contributing factors
- [Factor 1]
- [Factor 2]

## What went well
- [Thing 1]
- [Thing 2]

## What could be improved
- [Thing 1]
- [Thing 2]

## Action items
| Action | Owner | Priority | Due date | Status |
|--------|-------|----------|----------|--------|
| [Action 1] | @person | P1 | YYYY-MM-DD | TODO |
| [Action 2] | @person | P2 | YYYY-MM-DD | TODO |

## Lessons learned
[Key takeaways for the team]

## Error budget impact
- **SLO affected**: [Which SLO was impacted]
- **Budget consumed**: [How much error budget this incident burned]
- **Budget remaining**: [Current error budget after this incident]
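The error budget section is simple arithmetic you can compute rather than estimate. A sketch, assuming a time-based SLO where the incident caused a uniform error ratio:

```python
def budget_consumed(slo_target, window_days, outage_minutes, error_ratio=1.0):
    """Fraction of the error budget burned by one incident.

    slo_target:     e.g. 0.99 for a 99% SLO
    window_days:    SLO window, e.g. 30
    outage_minutes: how long the incident lasted
    error_ratio:    share of requests failing during it (1.0 = full outage)
    """
    # Total minutes of "allowed" failure in the window
    budget_minutes = (1 - slo_target) * window_days * 24 * 60
    return (outage_minutes * error_ratio) / budget_minutes
```

For a 99% SLO over 30 days the budget is 432 minutes, so a 43-minute full outage burns roughly 10% of it — a number worth stating explicitly in the postmortem.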

Automating postmortem creation

You can automate the creation of a postmortem document when a SEV-1 incident is resolved:


# postmortem-bot/create.sh
#!/bin/bash
# Triggered by PagerDuty webhook when a SEV-1 incident is resolved

INCIDENT_ID=$1
DATE=$(date +%Y-%m-%d)
TITLE=$(curl -s -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  "https://api.pagerduty.com/incidents/$INCIDENT_ID" | jq -r '.incident.title')

SLUG=$(echo "$TITLE" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd '[:alnum:]-')
FILENAME="postmortems/${DATE}-${SLUG}.md"

# Copy template and fill in known fields
cp postmortems/template.md "$FILENAME"
sed -i "s/\[TITLE\]/$TITLE/g" "$FILENAME"
sed -i "s/YYYY-MM-DD/$DATE/g" "$FILENAME"

# Create a GitHub issue for the postmortem review
gh issue create \
  --title "Postmortem: $TITLE" \
  --body "Postmortem document created at \`$FILENAME\`. Please fill in the details within 48 hours of incident resolution." \
  --label "postmortem" \
  --assignee "$ON_CALL_ENGINEER"

# Create a PR with the postmortem stub
git checkout -b "postmortem/$SLUG"
git add "$FILENAME"
git commit -m "chore: create postmortem for $TITLE"
git push origin "postmortem/$SLUG"
gh pr create --title "Postmortem: $TITLE" --body "Auto-generated postmortem. Please fill in details."

This ensures that postmortems are never forgotten. The 48-hour rule is important — details fade fast, and if you wait a week to write the postmortem, people will have forgotten crucial context.


Tracking action items

The postmortem is only useful if the action items actually get done. Too many teams write great postmortems and then never follow through on the actions.


Here is a simple approach: track postmortem action items as GitHub issues with a dedicated label, and report on completion rates weekly:


# action-item-tracker/report.sh
#!/bin/bash
# Run weekly to report on postmortem action item completion

OPEN=$(gh issue list --label "postmortem-action" --state open --limit 1000 --json number | jq length)
CLOSED=$(gh issue list --label "postmortem-action" --state closed --limit 1000 --json number | jq length)
TOTAL=$((OPEN + CLOSED))

if [ "$TOTAL" -gt 0 ]; then
  RATE=$(echo "scale=1; $CLOSED * 100 / $TOTAL" | bc)
else
  RATE="N/A"
fi

OVERDUE=$(gh issue list --label "postmortem-action" --state open --limit 1000 --json number,title,createdAt \
  | jq '[.[] | select((.createdAt | fromdateiso8601) < (now - 2592000))] | length')

cat <<EOF | curl -X POST -H 'Content-Type: application/json' \
  -d @- "$SLACK_WEBHOOK_URL"
{
  "text": "Weekly Postmortem Action Items Report",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Postmortem Action Items Report*\n\n:white_check_mark: Completed: $CLOSED/$TOTAL ($RATE%)\n:hourglass: Open: $OPEN\n:warning: Overdue (>30 days): $OVERDUE"
      }
    }
  ]
}
EOF

If your completion rate is below 80%, something is wrong. Either the action items are too ambitious, they are not prioritized properly, or the team does not see value in them. Fix the process, not the people.


Incident severity levels

Having clear severity levels prevents the “is this bad enough to page someone?” debate during an active incident. Here is a simple framework:


  • SEV-1 (Critical): Service is down or severely degraded for most users. Error budget is burning fast. Page immediately, all hands on deck.
  • SEV-2 (Major): Service is degraded for some users or a non-critical function is down. Page the on-call, but no need to wake up the whole team.
  • SEV-3 (Minor): Something is wrong but users are not significantly impacted. Create a ticket, handle during business hours.

Map these to your SLO burn rates:


# severity-mapping.yaml
# Based on Sloth-generated multi-burn-rate alerts

severity_mapping:
  SEV-1:
    description: "Fast burn on critical SLO"
    conditions:
      - burn_rate: "> 14.4x"
        window: "5m and 1h"
    response:
      - page on-call
      - open incident channel
      - start timeline

  SEV-2:
    description: "Medium burn on any SLO"
    conditions:
      - burn_rate: "> 6x"
        window: "30m and 6h"
    response:
      - page on-call
      - create incident ticket

  SEV-3:
    description: "Slow burn on any SLO"
    conditions:
      - burn_rate: "> 3x"
        window: "2h and 1d"
    response:
      - create ticket
      - review next business day
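These thresholds are not arbitrary. A burn rate of N means you are consuming the budget N times faster than the SLO window allows, so each (rate, window) pair corresponds to a fixed fraction of budget spent:

```python
def budget_fraction_burned(burn_rate, hours, window_days=30):
    """Fraction of the SLO window's error budget consumed after `hours`
    at a constant burn rate (a 1x rate burns the budget over the full window)."""
    return burn_rate * hours / (window_days * 24)
```

So 14.4x sustained for 1 hour is 2% of a 30-day budget, and 6x for 6 hours is 5% — which is exactly why those pairs map to SEV-1 and SEV-2.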

Incident communication

During an incident, communication is as important as the technical fix. Users and stakeholders need to know what is happening, and the responders need a clear channel to coordinate.


1. Dedicated incident channel

When a SEV-1 or SEV-2 fires, automatically create a dedicated Slack channel. This keeps incident noise out of the main channels:


# incident-bot/create_channel.py
import datetime
import os

import slack_sdk

def create_incident_channel(incident_title, severity):
    client = slack_sdk.WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    date = datetime.datetime.now().strftime("%Y%m%d")
    slug = incident_title.lower().replace(" ", "-")[:30]
    channel_name = f"inc-{date}-{slug}"

    channel = client.conversations_create(name=channel_name)
    channel_id = channel["channel"]["id"]

    # Post incident header
    client.chat_postMessage(
        channel=channel_id,
        text=f"*Incident: {incident_title}*\n"
             f"*Severity*: {severity}\n"
             f"*Status*: Investigating\n"
             f"*On-call*: <@{get_current_oncall()}>\n\n"
             f"Please use this channel for all incident communication.\n"
             f"Update the status with `/incident update <status>`"
    )

    # Invite relevant people
    client.conversations_invite(
        channel=channel_id,
        users=get_oncall_and_leads()
    )

    return channel_id

2. Status page updates

Keep users informed with a status page. You do not need a fancy tool — a simple GitHub Pages site that gets updated by your incident bot works fine:


# status-page/update.sh
#!/bin/bash
# Update status page during an incident

STATUS=$1  # investigating | identified | monitoring | resolved
MESSAGE=$2
DATE=$(date -Iseconds)

# Append to incidents file
cat >> _data/incidents.yml <<EOF
- date: "$DATE"
  status: "$STATUS"
  message: "$MESSAGE"
  service: "tr-web"
EOF

git add _data/incidents.yml
git commit -m "status: $STATUS - $MESSAGE"
git push origin main

MCP tools for incident management

Building on the MCP server from the previous article, you can add incident management tools that let you manage incidents through natural language:


#[derive(Tool)]
#[tool(name = "create_incident", description = "Create a new incident with severity and description")]
struct CreateIncident {
    title: String,
    severity: String,  // SEV-1, SEV-2, SEV-3
    description: String,
}

impl CreateIncident {
    async fn execute(&self) -> ToolResult {
        // Create PagerDuty incident
        let pd_incident = pagerduty_create_incident(&self.title, &self.severity).await?;

        // Create Slack channel
        let channel = create_incident_channel(&self.title, &self.severity).await?;

        // Start postmortem stub if SEV-1
        if self.severity == "SEV-1" {
            create_postmortem_stub(&self.title).await?;
        }

        ToolResult::text(format!(
            "Incident created:\n- PagerDuty: {}\n- Slack: #{}\n- Severity: {}",
            pd_incident.id, channel.name, self.severity
        ))
    }
}

#[derive(Tool)]
#[tool(name = "incident_timeline", description = "Get the timeline of the current active incident")]
struct IncidentTimeline {
    incident_id: String,
}

impl IncidentTimeline {
    async fn execute(&self) -> ToolResult {
        let events = pagerduty_get_timeline(&self.incident_id).await?;
        let timeline = events.iter()
            .map(|e| format!("{} - {}", e.timestamp, e.description))
            .collect::<Vec<_>>()
            .join("\n");

        ToolResult::text(format!("Incident timeline:\n{}", timeline))
    }
}

#[derive(Tool)]
#[tool(name = "recent_deploys", description = "List recent deployments that might have caused the incident")]
struct RecentDeploys {
    service: String,
    hours: u32,
}

impl RecentDeploys {
    async fn execute(&self) -> ToolResult {
        let deploys = argocd_get_history(&self.service, self.hours).await?;
        let summary = deploys.iter()
            .map(|d| format!("{} - {} by {} ({})", d.timestamp, d.revision[..8].to_string(), d.author, d.message))
            .collect::<Vec<_>>()
            .join("\n");

        ToolResult::text(format!("Recent deploys for {} (last {}h):\n{}", self.service, self.hours, summary))
    }
}

With these tools you can ask things like:


  • “Create a SEV-2 incident for high latency on tr-web”
  • “What deployments happened in the last 6 hours?”
  • “Show me the timeline for the current incident”

This is incredibly useful during an incident because you do not have to context-switch between multiple UIs. You stay in one place and let the AI coordinate across systems.


Putting it all together

Here is a summary of the full incident management workflow as code:


  1. SLO-based alerts detect the problem (from the previous article)
  2. PagerDuty/OpsGenie pages the on-call engineer based on Terraform-managed schedules
  3. Incident bot creates a Slack channel and starts the timeline
  4. Runbooks guide the on-call through diagnosis and mitigation
  5. Auto-remediation handles known issues automatically (pod restarts, cache clears)
  6. MCP tools let you query systems and take actions without leaving your terminal
  7. Postmortem bot creates a document from template when the incident resolves
  8. Action item tracker ensures follow-through with weekly reports

The key insight is that every step can be automated or at least templated. You should never start from scratch during an incident. The tools, processes, and documents should already be there waiting for you.


What not to do

I have been in enough incident responses to have a list of anti-patterns:


  • Do not blame people in the postmortem. “Bob pushed a bad config” is useless. “Our config review process did not catch an invalid value” is actionable.
  • Do not skip the postmortem for small incidents. Small incidents are the best opportunities to learn because the stakes are low.
  • Do not let one person be on-call forever. Rotate. If you have only one person who can handle incidents, that is a bus factor problem, not an on-call problem.
  • Do not alert on everything. More alerts does not mean better monitoring. It means more noise.
  • Do not treat incidents as failures. Incidents are inevitable in complex systems. They are opportunities to learn and improve.

Closing notes

Incident management is not just about responding to fires. It is about building a system where fires are detected quickly, responded to efficiently, mitigated automatically when possible, and learned from every time.


The tools we covered today — PagerDuty/Terraform for on-call, runbooks as code, auto-remediation with Kubernetes Jobs, blameless postmortems with templates, and MCP tools for incident coordination — are all things you can start implementing today. You do not need to do everything at once. Pick the one that would have helped most in your last incident and start there.


In the next article we could explore observability in depth — distributed tracing with OpenTelemetry, log aggregation, and building dashboards that actually help during incidents. But that is for another day.


I hope you found this useful and enjoyed reading it. Until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

Also, you can check the source code and changes in the sources here




by Gabriel Garrido