SRE: SLIs, SLOs, and Automations That Actually Help


Introduction

In this article we will explore the practical side of Site Reliability Engineering (SRE), specifically how to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) as code, deploy them using ArgoCD, and leverage MCP servers and automations to make the whole process less painful.


If you have been doing operations or platform engineering for a while, you probably already know that monitoring alone is not enough. Having a dashboard full of green lights does not mean your users are happy. SLIs and SLOs give you a framework to measure what actually matters and make informed decisions about reliability vs. feature velocity.


Let’s get into it.


What is SRE anyway?

Site Reliability Engineering is a discipline that applies software engineering practices to operations problems. Google popularized the concept, but the core idea is simple: treat your infrastructure and operational processes as code, measure what matters, and use error budgets to balance reliability with the speed of shipping new features.


The key components are:

  • SLIs (Service Level Indicators): Metrics that measure the quality of your service from the user’s perspective
  • SLOs (Service Level Objectives): Targets you set for your SLIs (e.g., “99.9% of requests should succeed”)
  • Error Budgets: The acceptable amount of unreliability (100% - SLO target)
  • SLAs (Service Level Agreements): Business contracts based on SLOs (we won’t focus on these here)

Understanding SLIs

An SLI is a carefully defined quantitative measure of some aspect of the level of service provided. The most common SLIs are:


  • Availability: The proportion of requests that succeed
  • Latency: The proportion of requests that are faster than a threshold
  • Quality: The proportion of responses that are not degraded

The important thing here is the “proportion” part. SLIs are expressed as ratios:

SLI = good events / total events

For example, for an HTTP service:

# Availability SLI
availability = (total_requests - errors_5xx) / total_requests

# Latency SLI
latency = requests_faster_than_300ms / total_requests

This is much more useful than raw metrics because it directly reflects user experience. A spike in errors that lasts 5 seconds is very different from one that lasts 5 minutes, and the ratio captures that difference over a time window.


Understanding SLOs

An SLO is the target value for an SLI over a specific time window. For example:


  • “99.9% of HTTP requests should return a non-error response over a 30-day rolling window”
  • “99% of requests should complete in less than 300ms over a 30-day rolling window”

The SLO gives you an error budget. If your SLO is 99.9%, your error budget is 0.1%. Over 30 days, that means you can afford roughly 43 minutes of total downtime. This is incredibly powerful because it turns reliability into a measurable resource you can spend. Want to do a risky deployment? Check your error budget first.
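
To make the arithmetic concrete:

error_budget     = 100% - 99.9% = 0.1%
window           = 30 days = 30 * 24 * 60 = 43,200 minutes
allowed downtime = 43,200 * 0.001 ≈ 43.2 minutes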


Putting SLIs into code with Prometheus

Now let’s get practical. The most common way to implement SLIs is with Prometheus metrics. If you are running workloads in Kubernetes, you probably already have Prometheus or a compatible system collecting metrics.


For a typical web service, you want to expose a histogram that tracks request duration and status:

# If your application uses Prometheus client, expose something like:
# histogram: http_request_duration_seconds (with labels: method, path, status)
# counter: http_requests_total (with labels: method, path, status)

# For our Phoenix/Elixir app, we rely on Phoenix telemetry events and peep to expose these.
# But the concept applies to any language.
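
Whatever the stack, the goal is the standard Prometheus exposition format on your /metrics endpoint, something like this (illustrative values):

http_requests_total{method="GET",path="/api/posts",status="200"} 10234
http_requests_total{method="GET",path="/api/posts",status="500"} 3
http_request_duration_seconds_bucket{le="0.3",method="GET",path="/api/posts"} 9871
http_request_duration_seconds_count{method="GET",path="/api/posts"} 10237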

With those metrics in Prometheus, you can define recording rules that calculate the SLI ratios. Here is an example of Prometheus recording rules for an HTTP availability SLI:

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sli-availability
  namespace: monitoring
spec:
  groups:
    - name: sli.availability
      interval: 30s
      rules:
        # Total requests rate over 5m window
        - record: sli:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m]))

        # Error requests rate over 5m window (5xx responses)
        - record: sli:http_errors:rate5m
          expr: sum(rate(http_requests_total{status=~"5.."}[5m]))

        # Availability SLI (ratio of successful requests)
        - record: sli:availability:ratio_rate5m
          expr: |
            1 - (sli:http_errors:rate5m / sli:http_requests:rate5m)

And for a latency SLI:

# prometheus-rules-latency.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sli-latency
  namespace: monitoring
spec:
  groups:
    - name: sli.latency
      interval: 30s
      rules:
        # Requests faster than 300ms
        - record: sli:http_request_duration:rate5m
          expr: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))

        # All requests
        - record: sli:http_request_duration_total:rate5m
          expr: sum(rate(http_request_duration_seconds_count[5m]))

        # Latency SLI
        - record: sli:latency:ratio_rate5m
          expr: |
            sli:http_request_duration:rate5m / sli:http_request_duration_total:rate5m

These recording rules pre-compute the SLI ratios so you can use them in alerting and dashboards without running expensive queries every time.
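
Even before adding more tooling, you can hang a simple alert off the recorded ratio. A naive single-window rule (the next section generates much better ones) might look like:

# Illustrative single-window alert on the recorded SLI
- alert: AvailabilityBelowObjective
  expr: sli:availability:ratio_rate5m < 0.999
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Availability SLI below 99.9% over the last 5 minutes"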


SLOs as code with Sloth

Writing Prometheus recording rules and alert rules by hand for every SLO gets tedious fast. That’s where Sloth comes in. Sloth is a tool that generates all the Prometheus rules you need from a simple SLO definition.


Here is an SLO definition for our service:

# slos/tr-web.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: tr-web
  namespace: default
spec:
  service: "tr-web"
  labels:
    team: "platform"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "99.9% of HTTP requests should succeed"
      sli:
        events:
          error_query: sum(rate(http_requests_total{status=~"5..",service="tr-web"}[{{.window}}]))
          total_query: sum(rate(http_requests_total{service="tr-web"}[{{.window}}]))
      alerting:
        name: TrWebHighErrorRate
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on tr-web"
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

    - name: "requests-latency"
      objective: 99.0
      description: "99% of requests should be faster than 300ms"
      sli:
        events:
          error_query: |
            sum(rate(http_request_duration_seconds_count{service="tr-web"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{le="0.3",service="tr-web"}[{{.window}}]))
          total_query: sum(rate(http_request_duration_seconds_count{service="tr-web"}[{{.window}}]))
      alerting:
        name: TrWebHighLatency
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High latency on tr-web"
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

Then you generate the Prometheus rules:

sloth generate -i slos/tr-web.yaml -o prometheus-rules/tr-web-slo.yaml

Sloth generates multi-window, multi-burn-rate alerts following the Google SRE book recommendations. You get fast-burn alerts (something is very wrong right now) and slow-burn alerts (you are consuming error budget faster than expected). This is a massive improvement over manually crafting alert thresholds.
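
If you are curious what those look like, a fast-burn alert expression has roughly this shape (a simplified sketch, not Sloth's literal output):

# Fire when the 0.1% budget is burning at 14.4x the sustainable rate,
# measured over both a long (1h) and a short (5m) window:
(
  sum(rate(http_requests_total{status=~"5..",service="tr-web"}[1h]))
  /
  sum(rate(http_requests_total{service="tr-web"}[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{status=~"5..",service="tr-web"}[5m]))
  /
  sum(rate(http_requests_total{service="tr-web"}[5m]))
) > (14.4 * 0.001)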


Deploying SLOs with ArgoCD

Now that we have our SLO definitions and generated Prometheus rules as YAML files, we can deploy them the GitOps way using ArgoCD. If you read my previous article about GitOps, this will feel familiar.


The idea is simple: store your SLO definitions and generated rules in a Git repository, and let ArgoCD sync them to your cluster.


Here is the repository structure:

slo-configs/
├── slos/
│   ├── tr-web.yaml           # Sloth SLO definitions
│   └── api-gateway.yaml
├── generated/
│   ├── tr-web-slo.yaml       # Generated PrometheusRule resources
│   └── api-gateway-slo.yaml
├── dashboards/
│   ├── tr-web-slo.json       # Grafana dashboard JSON
│   └── api-gateway-slo.json
└── argocd/
    └── application.yaml      # ArgoCD Application manifest

The ArgoCD Application manifest:

# argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slo-configs
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/kainlite/slo-configs
    targetRevision: HEAD
    path: generated
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

With this setup, every time you update an SLO definition, regenerate the rules, and push to Git, ArgoCD automatically applies the changes to your cluster. No manual kubectl commands, no forgetting to apply that one file you changed last week.


You can also set up a CI step to automatically regenerate the Prometheus rules when the SLO definitions change:

# .github/workflows/generate-slos.yaml
name: Generate SLO Rules

on:
  push:
    paths:
      - 'slos/**'

jobs:
  generate:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # required so the job can push the regenerated rules
    steps:
      - uses: actions/checkout@v4

      - name: Install Sloth
        run: |
          curl -L https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64 -o sloth
          chmod +x sloth

      - name: Generate rules
        run: |
          for slo in slos/*.yaml; do
            name=$(basename "$slo" .yaml)
            ./sloth generate -i "$slo" -o "generated/${name}-slo.yaml"
          done

      - name: Commit and push
        run: |
          git config user.name "github-actions"
          git config user.email "[email protected]"
          git add generated/
          git diff --staged --quiet || git commit -m "chore: regenerate SLO rules"
          git push

Now you have a fully automated pipeline: edit an SLO definition, push, CI generates the rules, ArgoCD deploys them. Beautiful.


MCP servers for SRE automation

This is where things get really interesting. Model Context Protocol (MCP) servers allow you to give AI assistants like Claude access to your infrastructure tools. Imagine being able to ask “what’s my current error budget for tr-web?” and getting an actual answer from your live Prometheus data.


An MCP server is essentially an API that exposes tools an AI can call. You can build one that wraps your Prometheus and Kubernetes APIs:

// mcp-sre-server/src/main.rs
// A simplified example of an MCP server for SRE queries

use mcp_server::{Server, Tool, ToolResult};

#[derive(Tool)]
#[tool(name = "query_error_budget", description = "Query remaining error budget for a service")]
struct QueryErrorBudget {
    service: String,
    slo_name: String,
    objective: f64, // the SLO target, e.g. 99.9
}

impl QueryErrorBudget {
    async fn execute(&self) -> ToolResult {
        // Budget remaining = 1 - (actual error rate / allowed error rate).
        // Assumes a 30-day recording rule analogous to the 5m one shown earlier.
        let query = format!(
            r#"1 - (
                (1 - sli:availability:ratio_rate30d{{service="{}"}})
                /
                (1 - {} / 100)
            )"#,
            self.service, self.objective
        );

        let result = prometheus_query(&query).await?;
        ToolResult::text(format!(
            "Error budget for {}/{}: {:.2}% remaining",
            self.service, self.slo_name, result * 100.0
        ))
    }
}

#[derive(Tool)]
#[tool(name = "list_slo_violations", description = "List SLOs that are currently burning too fast")]
struct ListSloViolations;

impl ListSloViolations {
    async fn execute(&self) -> ToolResult {
        let query = r#"ALERTS{alertname=~".*SLO.*", alertstate="firing"}"#;
        let alerts = prometheus_query(query).await?;
        ToolResult::text(format!("Active SLO violations:\n{}", alerts))
    }
}

#[derive(Tool)]
#[tool(name = "get_deployment_risk", description = "Assess deployment risk based on current error budget")]
struct GetDeploymentRisk {
    service: String,
}

impl GetDeploymentRisk {
    async fn execute(&self) -> ToolResult {
        let budget = get_error_budget(&self.service).await?;
        let recent_deploys = get_recent_deploys(&self.service).await?;

        let risk = match budget {
            b if b > 0.5 => "LOW - plenty of error budget remaining",
            b if b > 0.2 => "MEDIUM - error budget is getting low",
            b if b > 0.0 => "HIGH - very little error budget left",
            _ => "CRITICAL - error budget exhausted, consider freezing deploys",
        };

        ToolResult::text(format!(
            "Deployment risk for {}: {}\nBudget remaining: {:.1}%\nRecent deploys: {}",
            self.service, risk, budget * 100.0, recent_deploys
        ))
    }
}
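
The examples above lean on prometheus_query and get_error_budget helpers that no MCP crate provides for you. A minimal sketch of the first one, assuming reqwest and serde_json as dependencies (list_slo_violations would want the full result set rather than a single float):

// Hypothetical helper: run an instant query against the Prometheus
// HTTP API and return the first sample as a float.
async fn prometheus_query(query: &str) -> anyhow::Result<f64> {
    let resp: serde_json::Value = reqwest::Client::new()
        .get("http://prometheus:9090/api/v1/query")
        .query(&[("query", query)])
        .send()
        .await?
        .json()
        .await?;

    // Instant queries return: data.result[0].value = [timestamp, "value"]
    let value = resp["data"]["result"][0]["value"][1]
        .as_str()
        .ok_or_else(|| anyhow::anyhow!("empty query result"))?;

    Ok(value.parse()?)
}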

With this MCP server running, you can configure Claude Code or any MCP-compatible client to connect to it. Then you get natural language access to your SRE data:


  • “What’s the error budget for tr-web?” → Queries Prometheus, returns remaining budget
  • “Is it safe to deploy right now?” → Checks error budget + recent incidents
  • “Which SLOs are at risk this week?” → Lists SLOs with high burn rates
  • “Show me the latency trend for the last 24h” → Queries Prometheus and summarizes

You can also build MCP tools that integrate with ArgoCD:

use tokio::process::Command; // assumed: tokio's async process API for shelling out to the argocd CLI

#[derive(Tool)]
#[tool(name = "argocd_sync_status", description = "Check ArgoCD sync status for SLO configs")]
struct ArgoCDSyncStatus;

impl ArgoCDSyncStatus {
    async fn execute(&self) -> ToolResult {
        let output = Command::new("argocd")
            .args(["app", "get", "slo-configs", "-o", "json"])
            .output()
            .await?;

        let app: ArgoApp = serde_json::from_slice(&output.stdout)?;
        ToolResult::text(format!(
            "SLO configs sync status: {}\nHealth: {}\nLast sync: {}",
            app.status.sync.status,
            app.status.health.status,
            app.status.sync.compared_to.revision
        ))
    }
}

#[derive(Tool)]
#[tool(name = "rollback_deployment", description = "Rollback a service deployment via ArgoCD")]
struct RollbackDeployment {
    service: String,
    history_id: Option<String>, // a deployment history ID from `argocd app history`
}

impl RollbackDeployment {
    async fn execute(&self) -> ToolResult {
        // This would be gated behind confirmation in a real setup.
        // Note: `argocd app rollback` takes a deployment history ID,
        // not a git revision.
        let Some(id) = self.history_id.as_deref() else {
            return ToolResult::text(
                "A history ID is required (run `argocd app history <app>` to list them)".to_string(),
            );
        };

        let output = Command::new("argocd")
            .args(["app", "rollback", &self.service, id])
            .output()
            .await?;

        if !output.status.success() {
            return ToolResult::text(format!(
                "Rollback failed: {}",
                String::from_utf8_lossy(&output.stderr)
            ));
        }

        ToolResult::text(format!("Rollback initiated for {} (history ID {})", self.service, id))
    }
}

The MCP server config in your Claude Code settings would look something like:

{
  "mcpServers": {
    "sre-tools": {
      "command": "mcp-sre-server",
      "args": ["--prometheus-url", "http://prometheus:9090", "--argocd-url", "https://argocd.example.com"],
      "env": {
        "ARGOCD_AUTH_TOKEN": "your-token-here"
      }
    }
  }
}

Automations that tie it all together

The real power comes when you combine SLOs, ArgoCD, and MCP servers into automated workflows. Here are some patterns that work well in practice:


1. Automated deployment gates

Use error budgets as deployment gates. If the error budget is below a threshold, block deployments automatically:

# In your CI pipeline
- name: Check error budget
  run: |
    # "error_budget_remaining" stands in for whatever recording rule your
    # setup provides (Sloth generates similar metadata series).
    BUDGET=$(curl -sG "http://prometheus:9090/api/v1/query" \
      --data-urlencode "query=error_budget_remaining{service='tr-web'}" \
      | jq -r '.data.result[0].value[1]')

    if (( $(echo "$BUDGET < 0.1" | bc -l) )); then
      echo "Error budget below 10%, blocking deployment"
      exit 1
    fi

2. Automated incident creation

When an SLO is breached, automatically create an issue or incident:

# alertmanager-config.yaml
receivers:
  - name: slo-breach
    webhook_configs:
      - url: http://incident-bot:8080/create
        send_resolved: true

route:
  routes:
    - match:
        severity: critical
        type: slo_breach
      receiver: slo-breach
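
The incident-bot itself is out of scope here, but to make the pattern concrete, a minimal sketch of the receiving end (assuming axum and Alertmanager's standard webhook payload) could be as small as:

// Hypothetical incident-bot: receives Alertmanager webhooks on /create.
use axum::{routing::post, Json, Router};
use serde_json::Value;

async fn create_incident(Json(payload): Json<Value>) {
    // Alertmanager posts a JSON body with an "alerts" array.
    if let Some(alerts) = payload["alerts"].as_array() {
        for alert in alerts {
            let name = alert["labels"]["alertname"].as_str().unwrap_or("unknown");
            let status = alert["status"].as_str().unwrap_or("unknown");
            // Here you would call your issue tracker's API (GitHub, Jira, etc.).
            println!("{}: {} -> create/resolve incident", status, name);
        }
    }
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/create", post(create_incident));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}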

3. Weekly SLO reports

Automate weekly SLO reporting to keep the team informed:

# A CronJob that queries Prometheus and sends a summary to Slack
apiVersion: batch/v1
kind: CronJob
metadata:
  name: slo-weekly-report
  namespace: monitoring
spec:
  schedule: "0 9 * * 1"  # Every Monday at 9am
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: reporter
              image: kainlite/slo-reporter:latest
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus:9090"
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook
                      key: url
          restartPolicy: Never

4. Error budget-based feature freeze

This is one of the most powerful SRE patterns. When error budget is exhausted, the team should shift focus from features to reliability work:


  • Budget > 50%: Ship features freely
  • Budget 20-50%: Be cautious with risky changes
  • Budget 5-20%: Focus on reliability improvements
  • Budget < 5%: Feature freeze, all hands on reliability

You can automate this by having your MCP server update a status page or Slack channel with the current budget level, so everyone on the team knows where things stand without having to check dashboards.
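
As a sketch of that last idea, assuming a Slack incoming webhook and the same error-budget helper used by the MCP server:

// Hypothetical: post the current budget tier to Slack.
async fn post_budget_status(service: &str, webhook_url: &str) -> anyhow::Result<()> {
    let budget = get_error_budget(service).await?; // same helper as in the MCP server
    let tier = match budget {
        b if b > 0.5 => "ship features freely",
        b if b > 0.2 => "be cautious with risky changes",
        b if b > 0.05 => "focus on reliability improvements",
        _ => "feature freeze",
    };

    reqwest::Client::new()
        .post(webhook_url)
        .json(&serde_json::json!({
            "text": format!("{}: {:.1}% error budget left ({})", service, budget * 100.0, tier)
        }))
        .send()
        .await?;

    Ok(())
}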


Putting it all together

Here is a summary of what we built:


  1. SLIs as Prometheus metrics: Recording rules that calculate availability and latency ratios
  2. SLOs with Sloth: Declarative SLO definitions that generate multi-window, multi-burn-rate alerts
  3. GitOps with ArgoCD: SLO configs stored in Git, automatically synced to the cluster
  4. MCP servers: Natural language interface to query error budgets, check deployment risk, and manage ArgoCD
  5. Automations: Deployment gates, incident creation, weekly reports, and error budget policies

The beauty of this approach is that each piece is simple on its own, but together they create a system where reliability is measurable, automated, and part of the team’s daily workflow rather than an afterthought.


Closing notes

SRE does not have to be complicated. Start with one SLI for your most important service, set a reasonable SLO, and build from there. The tooling we covered (Prometheus, Sloth, ArgoCD, MCP servers) is all open source and battle-tested.


The key takeaway is this: measure what matters to your users, set targets, and let automation handle the rest. Your future self during the next on-call rotation will thank you.


I hope you found this useful and enjoyed reading it. Until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

Also, you can check the source code and changes in the sources here




by Gabriel Garrido