DevOps from Zero to Hero: Incident Response and On-Call

2026-06-14 | Gabriel Garrido | 25 min read

Introduction#

Welcome to article nineteen of the DevOps from Zero to Hero series. In the previous articles we set up observability with Prometheus and Grafana, built dashboards, configured alerts, and deployed complete CI/CD pipelines. Everything is monitored and automated. But here is the question nobody wants to ask: what happens at 3am when an alert fires and your API is down?

That is what incident response is about. It is the human side of reliability. You can have the best monitoring in the world, but if nobody knows what to do when an alert goes off, it does not matter. In this article we are going to cover the fundamentals: what incidents are, how to classify them, how on-call rotations work, how to write runbooks that actually help, and how to learn from failures without blaming anyone.

This is a beginner-friendly introduction. If you want to go deeper into topics like incident commanders, SRE-specific practices, postmortems as code, and advanced on-call automation, check out the SRE Incident Management article from the SRE series. That article assumes you already understand the basics we cover here.

Let's get into it.

What is an incident?#

An incident is any unplanned event that disrupts or degrades a service for your users. Not every bug is an incident. A typo in a footer is a bug. Your payment processing being down for 500 users is an incident. The key distinction is user impact.

Most teams classify incidents by severity levels. The exact definitions vary between organizations, but here is a common framework:

SEV1 (Critical): Complete service outage or data loss. All or most users are affected. Example: the entire API is returning 500 errors, the database is unreachable, or customer data has been corrupted. This requires all hands on deck, immediately.

SEV2 (Major): Significant degradation but the service is partially working. Example: response times are 10x slower than normal, a key feature like checkout is broken, or 30% of requests are failing. This needs immediate attention from the on-call engineer.

SEV3 (Minor): A noticeable issue that affects a small number of users or a non-critical feature. Example: search suggestions are not loading, a dashboard widget shows stale data, or image uploads are slow. This should be addressed during business hours.

SEV4 (Low): A cosmetic issue or minor inconvenience with minimal user impact. Example: a tooltip has the wrong text, a non-critical background job is retrying more than usual, or a monitoring dashboard has a broken panel. This goes into the normal backlog.

The severity level determines everything else: who gets paged, how fast you need to respond, whether you need a status page update, and how much of the team gets pulled in. Getting this classification right is important because over-escalating burns people out and under-escalating lets problems grow.

The incident lifecycle#

Every incident, regardless of severity, follows the same basic lifecycle. Understanding these phases helps you stay organized when things are stressful.

flowchart LR
  D["Detect<br/>alerts fire"] --> R["Respond<br/>page the on-call"] --> M["Mitigate<br/>stop the bleeding"] --> S["Resolve<br/>fix the root cause"] --> L["Learn<br/>run a postmortem"]
  L -.->|prevention improves detection| D

Let's walk through each phase:

1. Detect

Something tells you there is a problem. Ideally, your monitoring catches it before users do. In article fifteen we set up Prometheus alerts that fire when error rates or latency exceed thresholds. Those alerts are your first line of detection. Other sources include health check failures, user reports, and automated smoke tests from your CI/CD pipeline.

The goal is simple: know about problems before your users tweet about them.

2. Respond

The alert reaches the on-call engineer through a tool like PagerDuty or OpsGenie. The on-call engineer acknowledges the alert (so the system knows someone is looking at it), assesses the severity, and decides if they need to pull in more people. For a SEV1, they might immediately start a war room. For a SEV3, they might just open a ticket and investigate during normal hours.

3. Mitigate

This is the most important phase and the one that trips up beginners. Mitigation is not about finding the root cause. It is about stopping the user impact as fast as possible. If your API is slow because a bad deployment went out, you roll back first and investigate later. If a database is overwhelmed, you scale it up or redirect traffic. Fix it enough to stop the pain, then figure out why it happened.

Common mitigation actions include:

Rollback: Revert the last deployment if the issue started after a deploy

Restart: Sometimes a simple pod restart clears a stuck process

Scale up: Add more replicas or increase resource limits

Feature flag: Disable a broken feature without rolling back everything

Traffic shift: Route users to a healthy region or instance

4. Resolve

Once users are no longer affected, you can take the time to find and fix the actual root cause. Maybe the deployment was fine but it exposed a latent bug triggered by a specific data pattern. Maybe the database needs an index. Maybe the retry logic is creating a thundering herd. This is where you do the real engineering work.

5. Learn

After the incident is resolved, you run a postmortem. We will cover this in detail later in the article, but the short version is: you document what happened, build a timeline, identify the root cause, and create action items to prevent it from happening again. No blame. Just learning.

On-call basics#

On-call means you are the designated person who responds when alerts fire outside of normal working hours (and often during them too). If you have never been on call before, the idea can be intimidating. Let's break it down.

What does being on call actually mean?

When you are on call, you carry a phone (or have a laptop nearby) and you commit to responding to alerts within a defined time window, usually 5 to 15 minutes for critical alerts. You do not need to be sitting at your computer staring at dashboards. You can go to dinner, watch a movie, or sleep. But you need to be reachable and able to start investigating within the response time.

Rotation schedules

No one should be on call all the time. Teams set up rotations where the on-call responsibility passes from person to person on a regular schedule. Common patterns include:

Weekly rotation: Person A is on call Monday to Monday, then Person B takes over. Simple and predictable. Works well for teams of 4 or more.

Daily rotation: On-call shifts change every day. Less burden per shift but more handoffs. Good for teams that want to spread the load evenly.

Follow-the-sun: If your team spans time zones, each region covers their daytime hours. Nobody gets woken up at 3am. This is the dream, but requires a globally distributed team.

Primary and secondary: Two people are on call at the same time. The primary gets paged first. If they do not respond within the escalation window (say 10 minutes), the secondary gets paged. This provides a safety net.

Compensation

Being on call is work. Good organizations compensate for it. This can take different forms: extra pay for on-call shifts, time off after a busy on-call week, or a flat per-shift stipend. The specific approach varies, but the principle is clear: if you are asking someone to be available outside normal hours, you should recognize and compensate that time. Teams that do not compensate on-call eventually lose their best engineers.

Handoff procedures

When your on-call shift ends and someone else takes over, you should do a proper handoff. This means summarizing any ongoing issues, alerting quirks you noticed, or anything the next person should know. A quick message in Slack or a shared document works. The worst thing is inheriting an on-call shift with no context about what has been happening.

On-call tools#

You need a tool that receives alerts from your monitoring system and routes them to the right person at the right time through the right channel (phone call, SMS, push notification, Slack). Here are the most common options:

PagerDuty: The most established incident management platform. It handles alert routing, escalation policies, on-call schedules, and incident tracking. It integrates with everything: Prometheus, Grafana, AWS CloudWatch, Datadog, you name it. It is the industry standard but it is also the most expensive option.

OpsGenie (by Atlassian): Similar to PagerDuty in features, with strong integrations into the Atlassian ecosystem (Jira, Confluence, Statuspage). A solid choice if your team already uses Atlassian tools. Pricing is more accessible than PagerDuty.

Grafana OnCall: An open-source option that integrates natively with Grafana. If you already use the Grafana stack for observability (as we set up in article fifteen), this is a natural fit. You can self-host it or use the Grafana Cloud managed version. It handles schedules, escalations, and routing, and it is free for self-hosted.

All three tools follow the same basic flow:

flowchart TD
  A[Prometheus Alert] --> AM[Alertmanager] --> W[Webhook] --> PD["PagerDuty / OpsGenie / Grafana OnCall"]
  PD --> SCH[On-call schedule] --> PAGE["Page the on-call engineer<br/>(phone, SMS, push, Slack)"]
  PAGE --> Q{"Acknowledged<br/>within SLA?"}
  Q -->|yes| WORK[Engineer works the incident]
  Q -->|no| ESC["Page secondary /<br/>escalate to manager"]

Setting up alerting to on-call tools#

In article fifteen we configured Prometheus alerts using PrometheusRule resources. Those alerts go to Alertmanager, which is part of the kube-prometheus-stack. Now we need to connect Alertmanager to an on-call tool so alerts actually reach a human.

Here is how you configure Alertmanager to send critical alerts to PagerDuty and non-critical alerts to a Slack channel:

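# alertmanager-config.yaml
# The Prometheus Operator only loads the Secret referenced by the Alertmanager
# resource (spec.configSecret, defaulting to alertmanager-<name>), so
# metadata.name below must match your installation.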
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m

    route:
      receiver: slack-default
      group_by: [alertname, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

      routes:
        # Critical alerts go to PagerDuty
        - receiver: pagerduty-critical
          match:
            severity: critical
          continue: false

        # Warning alerts go to Slack only
        - receiver: slack-warnings
          match:
            severity: warning
          continue: false

    receivers:
      - name: slack-default
        slack_configs:
          - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            channel: "#alerts"
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
            severity: critical
            description: '{{ .GroupLabels.alertname }}'
            details:
              namespace: '{{ .GroupLabels.namespace }}'
              summary: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

      - name: slack-warnings
        slack_configs:
          - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            channel: "#alerts-low-priority"
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

For OpsGenie, you would replace the pagerduty_configs with:

      - name: opsgenie-critical
        opsgenie_configs:
          - api_key: "YOUR_OPSGENIE_API_KEY"
            message: '{{ .GroupLabels.alertname }}'
            priority: P1
            description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

For Grafana OnCall, you typically use a webhook receiver that points to your Grafana OnCall instance:

      - name: grafana-oncall
        webhook_configs:
          - url: "https://oncall.your-grafana.com/integrations/v1/alertmanager/YOUR_ID/"
            send_resolved: true

The key concept is routing. Not every alert should wake someone up. Route by severity: in the configuration above, critical alerts page the on-call, warnings go to a low-priority Slack channel, and anything that matches neither route falls back to the default Slack receiver. This is foundational for avoiding alert fatigue.

Alert fatigue#

Alert fatigue is the number one killer of on-call programs. It happens when engineers get so many alerts that they start ignoring them. When you are getting paged 20 times a night, you stop taking alerts seriously. And when you stop taking alerts seriously, the one real outage that matters gets lost in the noise.

Here is the hard truth: too many alerts is worse than too few. With too few alerts, you might miss something, but at least when an alert fires, people pay attention. With too many alerts, people tune them out entirely, and you miss everything.

Signs of alert fatigue:

High acknowledge rate, low action rate: People click "acknowledge" on alerts just to silence them, without actually investigating.

Duplicate alerts: The same underlying issue triggers five different alerts, flooding the on-call with noise.

Flapping alerts: An alert fires, resolves, fires, resolves, all within minutes. Each cycle generates a page.

Low-value alerts: Alerts for things that do not require human action. "Disk usage at 70%" when you auto-scale at 80% is noise, not signal.

After-hours pages for non-urgent issues: Getting woken up for a SEV4 that could wait until morning.

How to fight alert fatigue:

Alert on symptoms, not causes: Page when users are affected (high error rate, slow responses), not when infrastructure metrics spike (CPU at 80%). We covered this in the observability article.

Set meaningful thresholds: Do not set a latency alert at 200ms if your p99 is normally 180ms. Set it at a level that indicates a real problem, like 2x your normal p99.

Use severity-based routing: Only page the on-call for critical alerts. Everything else goes to Slack or a ticket queue.

Group related alerts: Configure Alertmanager's group_by to combine related alerts into a single notification instead of five separate pages.

Add inhibition rules: If the entire cluster is down, you do not need individual alerts for every service. An inhibition rule suppresses child alerts when a parent alert is firing.

Review alerts regularly: Once a month, review all alerts that fired. Delete the ones that never led to action. Tune the thresholds on the ones that fire too often. This is ongoing maintenance, not a one-time task.
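
As a rough illustration of the inhibition idea from the list above, here is a sketch of an inhibit_rules block that would sit at the top level of alertmanager.yaml, next to route and receivers. The ClusterDown alert name and the cluster label are assumptions made up for this example; substitute whatever parent alert and grouping label your setup actually has.

```yaml
# Top level of alertmanager.yaml, alongside "route" and "receivers".
inhibit_rules:
  # While the (hypothetical) ClusterDown alert is firing...
  - source_matchers:
      - 'alertname = "ClusterDown"'
    # ...suppress notifications for warning and critical alerts...
    target_matchers:
      - 'severity =~ "warning|critical"'
    # ...but only when source and target carry the same value for the
    # "cluster" label (an assumed label name).
    equal: [cluster]
```

The source_matchers and target_matchers fields need a reasonably recent Alertmanager (0.22 or newer). One gotcha worth checking: if the labels listed under equal are missing from both alerts, Alertmanager treats them as equal and the rule still applies, so make sure the label really exists on your alerts.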

A good benchmark: the on-call engineer should get no more than two pages per on-call shift on average. If your team is consistently above that, you have a tuning problem, not a reliability problem.
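
One way to keep an eye on that benchmark is to count how many notifications Alertmanager pushed to the paging integration over a shift. The sketch below is a recording rule under a few assumptions: Prometheus scrapes Alertmanager's own metrics, the paging receiver uses the pagerduty integration, and a shift lasts seven days. The resource name and the recorded metric name are invented for this example.

```yaml
# oncall-page-volume.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oncall-page-volume
  namespace: monitoring
spec:
  groups:
    - name: oncall-page-volume
      rules:
        # Notifications sent through the PagerDuty integration during the
        # last 7 days, summed across all Alertmanager replicas.
        # This counts repeat and resolved notifications too, so read it as
        # an upper bound on pages rather than an exact count.
        - record: oncall:pagerduty_notifications:increase7d
          expr: |
            sum(increase(alertmanager_notifications_total{integration="pagerduty"}[7d]))
```

Plotting this value on the on-call dashboard gives the monthly alert review a concrete number to discuss instead of a gut feeling.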

Runbooks#

A runbook is a documented procedure for handling a specific type of incident. When an alert fires and you are half-asleep at 3am, you do not want to figure out the debugging steps from scratch. You want a clear, step-by-step guide that tells you exactly what to check and what to do.

What makes a good runbook:

It is linked from the alert: The alert annotation includes a URL to the runbook. One click from the page to the instructions.

It starts with quick checks: The first steps should help you assess the severity and scope in under two minutes.

It has concrete commands: Not "check the database" but "run this specific query and compare the result to this threshold."

It covers mitigation first, root cause second: Tell the engineer how to stop the bleeding before asking them to diagnose.

It is kept up to date: A stale runbook is worse than no runbook because it gives false confidence. Review runbooks after every incident that uses them.

Here is a template you can use for any runbook:

# Runbook: [Alert Name]

## Overview
- **Alert**: [Name of the alert that links here]
- **Severity**: [SEV1/SEV2/SEV3]
- **Service**: [Which service is affected]
- **Last updated**: [Date]
- **Owner**: [Team or person responsible for this runbook]

## Quick assessment (do this first, under 2 minutes)
1. Check [dashboard link] for the current state
2. Run: `[specific command]` to confirm the issue
3. Determine scope: is it all users, a subset, or a single endpoint?

## Mitigation steps (stop the bleeding)
1. If this started after a recent deploy, rollback:
   `kubectl rollout undo deployment/[service] -n [namespace]`
2. If the issue is load-related, scale up:
   `kubectl scale deployment/[service] --replicas=[N] -n [namespace]`
3. [Any other quick fixes specific to this alert]

## Diagnosis (find the root cause)
1. Check logs: `kubectl logs -l app=[service] -n [namespace] --tail=100`
2. Check metrics: [specific PromQL query]
3. Check recent changes: [link to deploy history or git log]

## Escalation
- If you cannot mitigate within 30 minutes, escalate to [team/person]
- For data loss or security issues, immediately page [team/person]

## Previous incidents
- [Date]: [Brief description and link to postmortem]

The most important part of this template is the "Quick assessment" section. It is what the on-call engineer reads first, bleary-eyed and trying to figure out if this is a real problem or a false alarm.

Practical example: API response time runbook#

Let's write a real runbook for one of the most common alerts: API response time exceeding 2 seconds. This connects directly to the Prometheus alerts we set up in article fifteen.

First, the alert rule that would trigger this:

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-latency-alerts
  namespace: monitoring
spec:
  groups:
    - name: api-latency
      rules:
        - alert: APIHighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="task-api"}[5m]))
              by (le)
            ) > 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API p99 latency is above 2 seconds"
            description: "The task-api p99 latency has been above 2s for 5 minutes."
            runbook_url: "https://wiki.example.com/runbooks/api-high-latency"

Notice the runbook_url annotation. When this alert fires and reaches PagerDuty or OpsGenie, the runbook link is included in the notification. The on-call engineer can click it immediately.
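
To make the link easy to spot, you can also surface it as its own field in the receiver instead of relying on the default notification templates. Here is a sketch that extends the pagerduty-critical receiver from earlier with one extra details entry. The key name runbook is an arbitrary choice, and .CommonAnnotations only contains the URL when every alert in the group shares the same runbook_url.

```yaml
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        severity: critical
        description: '{{ .GroupLabels.alertname }}'
        details:
          namespace: '{{ .GroupLabels.namespace }}'
          summary: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
          # Assumes all alerts in the group carry the same runbook_url;
          # CommonAnnotations holds only annotations shared by every alert.
          runbook: '{{ .CommonAnnotations.runbook_url }}'
```

Because the route groups by alertname and namespace, alerts in one group come from the same rule and normally share the annotation, so the field should be populated.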

Now the runbook itself:

# Runbook: APIHighLatency

## Overview
- **Alert**: APIHighLatency
- **Severity**: SEV2 (becomes SEV1 if latency exceeds 10s or error rate rises above 5%)
- **Service**: task-api
- **Last updated**: 2026-06-14
- **Owner**: Platform team

## Quick assessment (under 2 minutes)
1. Open the API dashboard: https://grafana.example.com/d/task-api
2. Check current p99 latency. Is it above 2s? How far above?
3. Check if error rate has also increased (indicates a deeper problem)
4. Check if the issue is isolated to one endpoint or all endpoints:
   Query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

## Mitigation steps
### If latency started after a recent deploy:
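1. Check the revision history and when the newest ReplicaSet was created:
   kubectl rollout history deployment/task-api -n production
   kubectl get replicasets -l app=task-api -n production --sort-by=.metadata.creationTimestamp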
2. If timing matches, roll back:
   kubectl rollout undo deployment/task-api -n production
3. Verify latency is recovering on the dashboard

### If latency is caused by high traffic:
1. Check current replica count and CPU usage:
   kubectl top pods -l app=task-api -n production
2. Scale up if replicas are at resource limits:
   kubectl scale deployment/task-api --replicas=6 -n production
3. Verify new pods are healthy:
   kubectl get pods -l app=task-api -n production

### If latency is caused by slow database queries:
1. Check database connection pool usage:
   Query: pg_stat_activity_count{datname="taskapi"}
2. Check for long-running queries:
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;
3. If a single query is blocking, consider cancelling it:
   SELECT pg_cancel_backend(<pid>);

## Diagnosis
1. Check logs for slow request patterns:
   kubectl logs -l app=task-api -n production --tail=200 | grep -i "slow\|timeout"
2. Check traces in Jaeger/Tempo for high-latency requests
3. Compare with normal baseline: typical p99 is 200-400ms
4. Review recent PRs merged to main for query changes or new endpoints

## Escalation
- If not mitigated within 30 minutes: page the backend team lead
- If database-related: page the DBA or infrastructure team
- If latency exceeds 10s or error rate > 5%: escalate to SEV1

## Previous incidents
- 2026-05-20: Slow queries after a migration shipped without a needed index. Fixed with CREATE INDEX.
- 2026-04-15: Memory leak caused GC pauses. Fixed with Node.js version upgrade.

This runbook is specific, actionable, and organized by likelihood. The on-call engineer does not need to guess. They follow the steps, check the relevant data, and take action based on what they find.

Communication during incidents#

When something is broken, people want to know. Your users, your support team, your leadership. Good communication during incidents reduces panic, builds trust, and lets you focus on fixing the problem instead of answering "is it fixed yet?" messages from twelve different people.

Status pages

A public (or internal) status page is the single source of truth during an incident. Tools like Statuspage (by Atlassian), Cachet (open source), or even a simple static page give your users a place to check instead of flooding your support channels.

A good status update includes:

What is affected: "The checkout API is experiencing slow response times"

Current status: "Investigating / Identified / Monitoring / Resolved"

Impact: "Some users may experience delays when completing purchases"

Next update: "We will provide an update in 30 minutes or when we have more information"

Keep updates factual and concise. Do not speculate about root causes in public updates. Say "we have identified the issue and are implementing a fix" rather than "we think the database index got corrupted."

Internal communication

For your team and stakeholders, you need more detail. Most teams use a dedicated Slack channel per incident (for example, #inc-2026-06-14-api-latency). This keeps the conversation focused and creates a written record you can reference in the postmortem.

In the incident channel, post regular updates even if there is nothing new. "Still investigating, no new findings" is better than silence. Silence makes people nervous.

War rooms

For SEV1 incidents, teams often open a video call (sometimes called a war room or bridge call) where everyone working on the incident can communicate in real time. The key rules for war rooms:

Keep it focused: Only people actively working on the incident should be in the call. Observers can follow the Slack channel.

Designate a communication lead: One person handles all external updates so the engineers can focus on fixing things.

Document decisions: Someone should be writing down what is happening, what has been tried, and what the current plan is. This becomes your postmortem timeline.

Blameless postmortems#

A postmortem (also called a retrospective or incident review) is a structured analysis of what happened during an incident. The word "blameless" is the most important part. A blameless postmortem focuses on systems and processes, not on individuals.

Why blameless matters

If people are afraid they will be punished for causing an incident, they will hide information, avoid taking risks, and not report near-misses. Blame creates a culture of fear and silence. Blamelessness creates a culture of transparency and learning. The person who made the change that caused the outage is often the person who best understands the system and can help prevent it from happening again. You want them talking openly, not defending themselves.

This does not mean ignoring accountability. It means recognizing that most incidents are caused by system flaws (bad tooling, missing guardrails, unclear processes), not by people being careless.

Postmortem template

# Incident Postmortem: [Title]

## Summary
- **Date**: [When the incident occurred]
- **Duration**: [How long it lasted]
- **Severity**: [SEV level]
- **Impact**: [Who was affected and how]
- **Authors**: [Who wrote this postmortem]

## Timeline (all times in UTC)
- 14:32 - Monitoring alert fires: APIHighLatency
- 14:35 - On-call engineer acknowledges the alert
- 14:38 - Engineer checks dashboard, confirms p99 latency at 4.2s
- 14:42 - Identifies that latency spike started at 14:25, correlating with deploy abc123
- 14:45 - Initiates rollback of deployment
- 14:48 - Rollback complete, latency begins recovering
- 14:55 - Latency back to normal (p99 at 280ms)
- 14:58 - Incident marked as resolved

## Root cause
A database migration in commit abc123 added a new column to the orders table without an index.
The /orders endpoint performs a filter query on this column, which caused a full table scan on
every request. Under normal traffic, this increased p99 latency from 250ms to over 4 seconds.

## What went well
- Alert fired within 7 minutes of the deploy
- On-call engineer responded within 3 minutes
- Rollback was fast and effective (3 minutes from decision to completed rollback, 10 minutes to full recovery)
- Status page was updated within 10 minutes

## What went wrong
- The migration did not include an index for the new column
- No load testing was done against the staging database with realistic data volumes
- The staging database has 1,000 rows; production has 2 million, so the performance difference
  was not visible in staging

## Action items
- [ ] Add an index to the new column (owner: backend team, due: 2026-06-16)
- [ ] Add a CI check that flags migrations without indexes on queried columns (owner: platform team, due: 2026-06-30)
- [ ] Seed staging database with realistic data volumes (owner: platform team, due: 2026-07-15)
- [ ] Add a latency check to the post-deploy smoke tests (owner: platform team, due: 2026-06-30)

Notice the structure. The timeline is factual and precise. The root cause is technical, not personal. "What went well" is just as important as "what went wrong" because it reinforces the things your team should keep doing. And the action items are specific, assigned, and have deadlines.

Running the postmortem meeting

Schedule the postmortem within 48 hours of the incident while memories are fresh. Keep it to 30-60 minutes. The facilitator (usually not someone directly involved in the incident) walks through the timeline and asks questions:

"What information did you have at this point?"

"What did you try and why?"

"What would have helped you resolve this faster?"

"Were there signals we missed that could have caught this earlier?"

The goal is to understand the system, not to judge decisions made under pressure. People make reasonable decisions based on the information they have at the time. If the system made it easy to deploy a migration without an index, the fix is a better system, not a lecture.

Building a healthy on-call culture#

On-call does not have to be miserable. I have seen teams where on-call is dreaded and teams where it is manageable and even rewarding. The difference comes down to culture and investment.

Reasonable expectations

Frequency: Nobody should be on call more than one week in four. If your team is too small for that rotation, you need to hire, share the rotation with another team, or reduce your on-call scope.

Workload: The on-call engineer should be able to do their regular work during calm on-call shifts. If on-call is so busy that they cannot write code during the day, your alerts need tuning.

Sleep: Getting paged once a night is acceptable occasionally. Getting paged three or four times every night is a systemic problem. Track after-hours pages as a metric and set a goal to reduce them.

Practice incidents

The worst time to learn incident response is during a real incident. Practice with game days or tabletop exercises. A game day is a planned exercise where you intentionally break something (in a controlled way) and practice the response. A tabletop exercise is where you walk through an incident scenario verbally without actually breaking anything.

Example tabletop scenario:

"It is 2am on a Tuesday. You get paged for APIHighLatency.
 You check the dashboard and see p99 latency at 8 seconds.
 Error rate is at 12%. The last deploy was 6 hours ago.

 What do you do first?
 What do you check?
 Who do you contact?
 How do you communicate with stakeholders?"

These exercises build muscle memory. When a real incident happens, the on-call engineer is not thinking "what do I do?" They are thinking "I have done this before, let me follow the process."

Handoff quality

A good handoff between on-call shifts includes:

Active incidents: Anything still ongoing or recently resolved

Recent alerts: Alerts that fired and were handled, with context

Known issues: Things that might page you but are already being worked on

Environment changes: Recent deployments, infrastructure changes, or maintenance windows

A quick 15-minute call or a structured Slack message at handoff time prevents a lot of confusion.

Investing in tooling

Every time someone gets paged for something that could have been automated, that is a failure of tooling. Track your incidents and look for patterns. If the same issue keeps happening and the runbook is always "restart the pod," automate the restart. If a particular alert always turns out to be a false positive, fix the alert. On-call should be for problems that genuinely need a human brain, not for tasks a script could handle.
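
For the "restart the pod" case specifically, Kubernetes can often do the restart for you. Below is a minimal sketch of a liveness probe on the task-api container. The /healthz path, port 8080, and the image name are assumptions for illustration, and the probe only helps if the health endpoint actually fails when the process is stuck.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec).
containers:
  - name: task-api
    image: registry.example.com/task-api:1.0.0  # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz         # assumed health endpoint
        port: 8080             # assumed container port
      initialDelaySeconds: 15  # give the process time to start
      periodSeconds: 10        # probe every 10 seconds
      failureThreshold: 3      # restart after 3 consecutive failures
```

With this in place the kubelet restarts the container after about 30 seconds of failed probes, and the restart shows up in the pod's restart count, so you can still track how often it happens and fix the underlying cause.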

Advanced topics#

We have covered the fundamentals of incident response in this article, but there is much more to explore as your team and systems grow:

Incident commander role: For SEV1 incidents, a dedicated incident commander coordinates the response, manages communication, and makes decisions about escalation. This role is separate from the engineers doing the technical work.

SRE practices: Error budgets, SLO-based alerting, and toil reduction are advanced concepts that build on everything we covered here.

Postmortems as code: Version-controlled postmortem templates, automated timeline generation, and action item tracking integrated into your project management tool.

Chaos engineering: Intentionally injecting failures to test your incident response process before real incidents happen.

For a deep dive into all of these topics, check out the SRE Incident Management article. It covers incident commander workflows, on-call automation with Kubernetes operators, postmortem templates as code managed through GitOps, and advanced alerting strategies.

Closing notes#

Incident response is not just about tools and processes. It is about people. It is about making sure the person who gets paged at 3am has what they need to solve the problem: clear alerts, good runbooks, the right access, and the confidence that comes from practice.

In this article we covered what incidents are and how to classify them with severity levels, the five phases of the incident lifecycle, how on-call rotations work and how to make them fair, setting up alerting from Prometheus to PagerDuty or OpsGenie, why alert fatigue is dangerous and how to fight it, how to write runbooks that actually help, communication best practices during incidents, blameless postmortems that drive improvement, and building a culture where on-call is sustainable.

The most important takeaway is this: invest in your incident response process before you need it. Write the runbooks, tune the alerts, practice the scenarios, and run the postmortems. When the real incident happens, you will be ready.

In the next and final article of the series, we will bring everything together and look at what comes after mastering the fundamentals.

I hope you found this useful and enjoyed reading it. Until next time!
