SRE: Cost Optimization in the Cloud
Introduction
Throughout this SRE series we have covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, and secrets management. We have built a solid foundation for running reliable systems, but reliability is only half the picture. If your infrastructure bill keeps growing unchecked, it does not matter how reliable things are because eventually someone is going to ask hard questions about cost.
Cloud spending has a tendency to creep up. You spin up a test cluster and forget about it, someone requests a large instance “just in case,” dev environments run 24/7, and before you know it your monthly bill has doubled. The FinOps movement emerged to bring financial accountability to cloud spending, and SRE teams are in a unique position to drive cost optimization because they already understand the infrastructure deeply.
In this article we will cover FinOps principles, right-sizing workloads, spot instances, resource quotas, cost visibility with Kubecost and OpenCost, idle resource detection, storage tiering, reserved capacity planning, cost alerts tied to SLOs, and tagging strategies for cost allocation. These are all practical techniques you can start applying today.
Let’s get into it.
FinOps principles
FinOps (Financial Operations) is a cultural practice that brings together engineering, finance, and business teams to manage cloud costs collaboratively. It is not about cutting costs at all costs. It is about making informed decisions and getting the most value from every dollar spent.
The FinOps lifecycle has three phases:
- Inform: Understand what you are spending, where, and why. You cannot optimize what you cannot see.
- Optimize: Take action to reduce waste. Right-size instances, use spot nodes, clean up idle resources.
- Operate: Continuously monitor costs, set budgets, and build cost awareness into your engineering culture.
For SRE teams, the key insight is that cost should be treated as a first-class metric, just like latency, availability, and error rate. You already have dashboards for SLIs. Add a cost panel to those dashboards. When you review your SLO performance weekly, review your cost metrics too.
Some practical principles to adopt:
- Everyone is accountable for cost, not just finance. Engineers who provision resources should understand the cost impact.
- Cost decisions are data-driven. Use actual utilization data, not guesses or “we might need it someday.”
- Cost optimization is continuous, not a one-time project. Treat it like reliability: always improving.
- Optimize for value, not just savings. Sometimes spending more is the right call if it improves reliability or developer productivity.
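To make the budget-oriented, data-driven principles concrete, here is a minimal sketch of month-to-date burn projection; the spend figure and dates are invented for illustration:

```python
# Sketch: project end-of-month spend from month-to-date burn.
# The spend figure and dates are made up for illustration.
from calendar import monthrange
from datetime import date

def projected_monthly_spend(mtd_spend: float, today: date) -> float:
    """Linear projection of month-to-date spend to a full month."""
    days_in_month = monthrange(today.year, today.month)[1]
    return mtd_spend / today.day * days_in_month

spend = projected_monthly_spend(mtd_spend=250.0, today=date(2025, 6, 15))
print(round(spend, 2))  # $250 over 15 of 30 days projects to 500.0
```

If the projection exceeds the budget, that is a conversation to have in week two, not after the invoice arrives.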
Right-sizing workloads
Right-sizing is the single most impactful cost optimization you can make in Kubernetes. Most teams over-provision their workloads significantly because developers request resources based on worst-case estimates rather than actual usage.
The Vertical Pod Autoscaler (VPA) is your best friend here. Even if you do not enable it in auto mode, running it in recommendation mode gives you data on what your pods actually use versus what they request.
Install the VPA:
# Install VPA components
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Create a VPA in recommendation mode for your workloads:
# vpa/tr-web-vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: tr-web-vpa
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: tr-web
updatePolicy:
updateMode: "Off" # Recommendation only, no auto-updates
resourcePolicy:
containerPolicies:
- containerName: tr-web
minAllowed:
cpu: 50m
memory: 64Mi
maxAllowed:
cpu: 2000m
memory: 2Gi
controlledResources:
- cpu
- memory
After a few days of running, check the recommendations:
kubectl describe vpa tr-web-vpa
# Output will look something like:
# Recommendation:
# Container Recommendations:
# Container Name: tr-web
# Lower Bound:
# Cpu: 25m
# Memory: 80Mi
# Target:
# Cpu: 100m
# Memory: 180Mi
# Uncapped Target:
# Cpu: 100m
# Memory: 180Mi
# Upper Bound:
# Cpu: 350m
# Memory: 400Mi
Now compare that to what you actually requested:
# Check current resource requests across all pods
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{"\t"}Req: {.resources.requests.cpu}/{.resources.requests.memory}{"\t"}Lim: {.resources.limits.cpu}/{.resources.limits.memory}{"\n"}{end}{end}' | column -t
If your pods are requesting 500m CPU but only using 100m on average, you are paying for 5x more compute than you need. That gap is pure waste.
A good rule of thumb for setting requests and limits:
- Requests: Set to the P95 of actual usage (from VPA recommendations or Prometheus metrics). This ensures the scheduler places pods on nodes with enough capacity.
- Limits: Set to 2-3x the request for CPU (to allow bursting), and 1.5-2x for memory (to avoid OOM kills while still preventing runaway consumption).
- Review quarterly: Usage patterns change as your application evolves. What was right-sized six months ago might be wrong today.
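The rule of thumb above can be expressed as a small calculation. This is an illustrative sketch with made-up usage samples, not the VPA's actual recommender logic:

```python
# Sketch: derive requests/limits from observed usage, per the rule of thumb:
# request = P95 of actual usage, CPU limit = 2x request.
# Usage samples are invented for illustration.

def p95(samples: list[float]) -> float:
    """95th percentile via nearest-rank on sorted samples."""
    s = sorted(samples)
    idx = max(0, int(0.95 * len(s)) - 1)
    return s[idx]

cpu_usage_cores = [0.05, 0.08, 0.1, 0.12, 0.09, 0.11, 0.1, 0.3, 0.1, 0.09]
cpu_request = p95(cpu_usage_cores)   # schedulable request
cpu_limit = 2 * cpu_request          # headroom for bursting
print(cpu_request, cpu_limit)
```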
Here is a Prometheus query to find the most over-provisioned workloads:
# CPU over-provisioning ratio by pod
# Values > 2 mean the workload is requesting 2x+ more CPU than it uses
# (join with kube_pod_owner if you want to roll this up by deployment)
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
)
/
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[24h])
)
Spot and preemptible instances
Spot instances (AWS), preemptible VMs (GCP), or spot VMs (Azure) offer 60-90% discounts compared to on-demand pricing. The tradeoff is that the cloud provider can reclaim them with short notice (usually 2 minutes). For stateless, fault-tolerant workloads in Kubernetes, this is a great deal.
The trick is to run your workloads on a mix of on-demand and spot nodes. Critical workloads like your database go on on-demand nodes. Stateless web servers and batch jobs go on spot nodes.
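Back-of-the-envelope math shows why the mix matters. The sketch below assumes a hypothetical $0.08/hr on-demand price and a 70% spot discount (within the 60-90% range mentioned above):

```python
# Sketch: blended hourly cost of a mixed on-demand/spot node pool.
# The price and the 70% spot discount are assumptions for illustration.

def blended_cost(nodes_on_demand: int, nodes_spot: int,
                 on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost of the pool given a per-node on-demand price."""
    spot_price = on_demand_price * (1 - spot_discount)
    return nodes_on_demand * on_demand_price + nodes_spot * spot_price

# 2 on-demand + 3 spot nodes vs. 5 on-demand nodes
mixed = blended_cost(2, 3, on_demand_price=0.08, spot_discount=0.70)
all_od = blended_cost(5, 0, on_demand_price=0.08, spot_discount=0.70)
print(round(mixed, 3), round(all_od, 3))
```

Even with critical workloads pinned to on-demand, the blended pool costs roughly 40% less than running everything on-demand.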
Set up a spot node group (EKS example):
# eks-nodegroup-spot.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-east-1
managedNodeGroups:
  - name: on-demand-critical
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    labels:
      node-type: on-demand
      workload-type: critical
    taints:
      - key: workload-type
        value: critical
        effect: NoSchedule
  - name: spot-general
    instanceTypes:
      - t3.large
      - t3.xlarge
      - t3a.large
      - t3a.xlarge
      - m5.large
      - m5a.large
    spot: true
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
    labels:
      node-type: spot
      workload-type: general
Notice the spot node group uses multiple instance types. This is important because spot availability varies by instance type. Using a diverse set increases your chances of getting capacity.
Now schedule your workloads appropriately using node affinity and tolerations:
# deployments/tr-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tr-web
namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: node-type
operator: In
values:
- spot
# Spread across nodes for resilience
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- tr-web
topologyKey: kubernetes.io/hostname
tolerations:
- key: "node-type"
operator: "Equal"
value: "spot"
effect: "NoSchedule"
containers:
- name: tr-web
image: kainlite/tr:latest
resources:
requests:
cpu: 100m
memory: 180Mi
limits:
cpu: 300m
memory: 360Mi
The preferredDuringSchedulingIgnoredDuringExecution with weight 80 means the scheduler will try to place pods
on spot nodes but will fall back to on-demand if no spot capacity is available. This is important for resilience.
You also need a PodDisruptionBudget to handle spot node reclamation gracefully:
# pdb/tr-web-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: tr-web-pdb
namespace: default
spec:
minAvailable: 2
selector:
matchLabels:
app: tr-web
This ensures that at least 2 pods are always running, even during spot node reclamation. Combined with multiple replicas spread across different nodes, your service stays available while saving 60-90% on compute.
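The constraint the PDB enforces is simple arithmetic. A quick sketch (of the invariant, not the eviction API itself):

```python
# Sketch: how a PodDisruptionBudget constrains voluntary evictions.
# With 3 replicas and minAvailable: 2, at most one pod may be
# evicted at a time during a spot reclamation or node drain.

def max_evictable(replicas: int, min_available: int) -> int:
    """Pods the eviction API may remove while honoring the PDB."""
    return max(0, replicas - min_available)

print(max_evictable(replicas=3, min_available=2))  # 1
```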
Resource quotas and limit ranges
Without guardrails, any team member can deploy a workload that requests 64 CPUs and 256GB of memory. Resource quotas and limit ranges prevent this kind of runaway cost.
A ResourceQuota sets hard limits per namespace:
# quotas/dev-namespace-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: dev
spec:
hard:
requests.cpu: "4" # Total CPU requests across all pods
requests.memory: 8Gi # Total memory requests
limits.cpu: "8" # Total CPU limits
limits.memory: 16Gi # Total memory limits
pods: "20" # Maximum number of pods
services.loadbalancers: "2" # Limit expensive LB services
persistentvolumeclaims: "10"
requests.storage: 100Gi # Total PVC storage
A LimitRange sets defaults and per-pod constraints. This is especially useful for catching pods deployed without resource requests:
# quotas/dev-namespace-limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: dev
spec:
limits:
- type: Container
default: # Default limits if not specified
cpu: 200m
memory: 256Mi
defaultRequest: # Default requests if not specified
cpu: 50m
memory: 64Mi
min: # Minimum allowed
cpu: 10m
memory: 16Mi
max: # Maximum allowed per container
cpu: "2"
memory: 4Gi
- type: Pod
max: # Maximum per pod (all containers combined)
cpu: "4"
memory: 8Gi
- type: PersistentVolumeClaim
min:
storage: 1Gi
max:
storage: 50Gi
Now if someone deploys a pod without resource requests, it automatically gets 50m CPU and 64Mi memory as defaults. And if someone tries to request 32 CPUs, the API server rejects the request.
For production namespaces, you want different quotas:
# quotas/production-namespace-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: production
spec:
hard:
requests.cpu: "16"
requests.memory: 32Gi
limits.cpu: "32"
limits.memory: 64Gi
pods: "50"
services.loadbalancers: "5"
persistentvolumeclaims: "20"
requests.storage: 500Gi
scopeSelector:
matchExpressions:
- scopeName: PriorityClass
operator: In
values:
- high
- medium
Kubecost and OpenCost
You cannot optimize what you cannot measure. Kubecost (and its open source core, OpenCost) gives you cost visibility into your Kubernetes cluster, broken down by namespace, deployment, label, and team.
Install OpenCost with Helm:
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
helm install opencost opencost/opencost \
--namespace opencost \
--create-namespace \
--set opencost.exporter.defaultClusterId="production" \
--set opencost.ui.enabled=true \
--set opencost.prometheus.internal.enabled=false \
--set opencost.prometheus.external.url="http://prometheus-server.monitoring.svc:9090"
For Kubecost (which includes more features like recommendations and savings insights):
helm repo add kubecost https://kubecost.github.io/cost-analyzer
helm repo update
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="your-token-here" \
--set prometheus.server.global.external_labels.cluster_id="production" \
--set prometheus.nodeExporter.enabled=false \
--set prometheus.serviceAccounts.nodeExporter.create=false
Once installed, you can query cost data via the API:
# Get cost allocation by namespace for the last 7 days
curl -s "http://kubecost.kubecost.svc:9090/model/allocation?window=7d&aggregate=namespace" \
| jq '.data[0] | to_entries[] | {
namespace: .key,
totalCost: .value.totalCost,
cpuCost: .value.cpuCost,
memCost: .value.ramCost,
pvCost: .value.pvCost,
cpuEfficiency: .value.cpuEfficiency,
ramEfficiency: .value.ramEfficiency
}'
# Example output:
# {
# "namespace": "default",
# "totalCost": 42.15,
# "cpuCost": 18.30,
# "memCost": 15.85,
# "pvCost": 8.00,
# "cpuEfficiency": 0.35,
# "ramEfficiency": 0.42
# }
That CPU efficiency of 0.35 means you are only using 35% of the CPU you are paying for. That is a big optimization opportunity.
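A rough way to turn those efficiency numbers into a savings estimate, assuming (optimistically) that requests could shrink all the way down to actual usage; the figures come from the example output above:

```python
# Sketch: rough right-sizing savings from the efficiency figures above.
# Assumes cost scales with requests and requests could shrink to usage,
# which is optimistic; real right-sizing keeps headroom.

def rightsizing_savings(cost: float, efficiency: float) -> float:
    """Portion of cost attributable to requested-but-unused capacity."""
    return cost * (1 - efficiency)

cpu_savings = rightsizing_savings(cost=18.30, efficiency=0.35)
ram_savings = rightsizing_savings(cost=15.85, efficiency=0.42)
print(round(cpu_savings + ram_savings, 2))  # rough upper bound on savings
```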
Create a Grafana dashboard for cost visibility:
# grafana/cost-dashboard.json (simplified)
# Useful Prometheus queries for cost panels:
# Monthly cost estimate by namespace
sum by (namespace) (
container_cpu_allocation * on(node) group_left()
node_cpu_hourly_cost * 730
) +
sum by (namespace) (
container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left()
node_ram_hourly_cost * 730
)
# Idle cost (resources requested but not used)
sum by (namespace) (
(kube_pod_container_resource_requests{resource="cpu"} -
rate(container_cpu_usage_seconds_total[1h]))
* on(node) group_left() node_cpu_hourly_cost * 730
)
# Cost per request (useful for cost-per-SLI tracking)
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[1h])
* on(node) group_left() node_cpu_hourly_cost)
/
sum(rate(http_requests_total{namespace="default"}[1h]))
Idle resource detection
Idle resources are the low-hanging fruit of cost optimization. These are things you are paying for but nobody is using. In a typical Kubernetes cluster, 20-30% of spend goes to idle resources.
Here is a script to find common idle resources:
#!/bin/bash
# idle-resource-audit.sh
# Find idle and wasted resources in your cluster
echo "=== Unused PersistentVolumeClaims ==="
# PVCs not mounted by any pod
kubectl get pvc -A -o json | jq -r '
.items[] |
select(.status.phase == "Bound") |
.metadata.namespace + "/" + .metadata.name
' | while read pvc; do
ns=$(echo $pvc | cut -d/ -f1)
name=$(echo $pvc | cut -d/ -f2)
# Check if any pod references this PVC
used=$(kubectl get pods -n $ns -o json | jq -r \
--arg pvc "$name" \
'.items[].spec.volumes[]? | select(.persistentVolumeClaim.claimName == $pvc) | .name' \
2>/dev/null)
if [ -z "$used" ]; then
size=$(kubectl get pvc $name -n $ns -o jsonpath='{.spec.resources.requests.storage}')
echo " UNUSED: $pvc ($size)"
fi
done
echo ""
echo "=== LoadBalancer Services ==="
# Each LB costs money even if no traffic flows through it
kubectl get svc -A --field-selector spec.type=LoadBalancer \
-o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.loadBalancer.ingress[0].ip,AGE:.metadata.creationTimestamp'
echo ""
echo "=== Deployments with 0 replicas ==="
# Scaled to 0 but still have PVCs, configmaps, secrets attached
kubectl get deploy -A -o json | jq -r '
.items[] |
select(.spec.replicas == 0) |
.metadata.namespace + "/" + .metadata.name
'
echo ""
echo "=== Pods in CrashLoopBackOff ==="
# Burning CPU on restart loops (CrashLoopBackOff pods still report phase
# Running, so filter on the container waiting reason, not the pod phase)
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.status.containerStatuses[]?.state.waiting.reason == "CrashLoopBackOff") |
  .metadata.namespace + "/" + .metadata.name +
    " (restarts: " + (.status.containerStatuses[0].restartCount | tostring) + ")"
'
echo ""
echo "=== Unattached Persistent Volumes ==="
kubectl get pv -o json | jq -r '
.items[] |
select(.status.phase == "Available" or .status.phase == "Released") |
.metadata.name + " (" + .spec.capacity.storage + ") - " + .status.phase
'
For a more automated approach, set up a CronJob that runs this audit weekly and sends results to Slack:
# cronjob/idle-resource-audit.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: idle-resource-audit
namespace: monitoring
spec:
schedule: "0 9 * * 1" # Every Monday at 9am
jobTemplate:
spec:
template:
spec:
serviceAccountName: resource-auditor
containers:
- name: auditor
image: bitnami/kubectl:latest
command:
- /bin/bash
- -c
- |
# Run audit and post to Slack
TOTAL_PVCS=$(kubectl get pvc -A -o json | jq '.items | length')
LB_COUNT=$(kubectl get svc -A --field-selector spec.type=LoadBalancer -o json | jq '.items | length')
ZERO_REPLICAS=$(kubectl get deploy -A -o json | jq '[.items[] | select(.spec.replicas == 0)] | length')
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-type: application/json' \
-d "{
\"text\": \"Weekly Idle Resource Report\",
\"blocks\": [{
\"type\": \"section\",
\"text\": {
\"type\": \"mrkdwn\",
\"text\": \"*Weekly Idle Resource Audit*\n- PVCs: $TOTAL_PVCS total\n- LoadBalancers: $LB_COUNT active\n- Zero-replica deployments: $ZERO_REPLICAS\"
}
}]
}"
env:
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: slack-webhook
key: url
restartPolicy: OnFailure
Storage tiering
Storage costs can sneak up on you, especially if everything defaults to high-performance SSD. Not all data needs fast storage. Logs, backups, and archived data can live on cheaper storage tiers.
Define multiple StorageClasses for different tiers:
# storage/storageclass-fast.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
labels:
cost-tier: high
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "5000"
throughput: "250"
encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage/storageclass-standard.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: standard
labels:
cost-tier: medium
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "3000"
throughput: "125"
encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage/storageclass-cold.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cold-storage
labels:
cost-tier: low
provisioner: ebs.csi.aws.com
parameters:
type: sc1
encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Use the right tier for each workload:
# Database: fast SSD for low latency
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgresql-data
namespace: default
spec:
storageClassName: fast-ssd
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
---
# Application logs: standard storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: app-logs
namespace: default
spec:
storageClassName: standard
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
---
# Backups and archives: cold storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: backup-archive
namespace: default
spec:
storageClassName: cold-storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 200Gi
For object storage (S3, GCS), set up lifecycle policies to move data to cheaper tiers automatically:
# terraform/s3-lifecycle.tf
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "archive-old-logs"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # ~45% cheaper
}
transition {
days = 90
storage_class = "GLACIER" # ~80% cheaper
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE" # ~95% cheaper
}
expiration {
days = 730 # Delete after 2 years
}
}
}
The cost difference between tiers is significant. For AWS EBS, gp3 costs about $0.08/GB/month while sc1 costs $0.015/GB/month. For S3, Standard is $0.023/GB/month while Deep Archive is $0.00099/GB/month. Moving 1TB of archive data from Standard to Deep Archive saves about $264/year.
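The $264/year figure works out as follows, using the quoted prices and 1 TB = 1000 GB:

```python
# Sketch: the S3 tiering savings quoted above, using the listed
# per-GB-month prices and 1 TB = 1000 GB.

STANDARD = 0.023        # $/GB/month
DEEP_ARCHIVE = 0.00099  # $/GB/month

def annual_savings(gb: float, from_price: float, to_price: float) -> float:
    """Yearly savings from moving `gb` of data between storage tiers."""
    return gb * (from_price - to_price) * 12

print(round(annual_savings(1000, STANDARD, DEEP_ARCHIVE), 2))  # ~264
```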
Reserved vs on-demand
If you know you will need a certain amount of compute for the next 1-3 years, reserved instances or savings plans offer 30-60% discounts compared to on-demand. The tradeoff is commitment: you pay whether you use it or not.
The key is to only commit to your baseline: the minimum compute you always need. Let on-demand and spot handle the peaks.
Here is how to analyze your reservation coverage:
# Prometheus query: average CPU utilization over 30 days
# This shows your baseline compute needs
avg_over_time(
sum(
rate(container_cpu_usage_seconds_total[5m])
)[30d:1h]
)
# Compare against your reserved capacity
# If reserved < baseline, you are under-committed (paying too much on-demand)
# If reserved > baseline, you are over-committed (paying for unused reservations)
A practical approach to reservation planning:
- Measure your baseline for at least 3 months. Look at the minimum sustained usage, not the average.
- Reserve 70-80% of baseline. This gives you a safety margin for workload changes.
- Use savings plans over reserved instances when possible. Savings plans are more flexible because they apply to any instance family.
- Review quarterly. If your baseline has shifted, adjust your commitments at renewal time.
- Consider 1-year terms first. The savings gap between 1-year and 3-year is often not worth the risk of being locked in.
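The measurement steps above reduce to a small calculation; the usage series here is invented for illustration:

```python
# Sketch: reservation sizing from hourly usage samples, per the guidance:
# baseline = minimum sustained usage, commit to ~75% of it.
# The usage series is invented for illustration.

def reservation_target(hourly_cores: list[float],
                       commit_ratio: float = 0.75) -> float:
    """Cores to reserve: a fraction of the minimum sustained usage."""
    baseline = min(hourly_cores)  # minimum sustained usage, not the average
    return baseline * commit_ratio

usage = [12.0, 14.5, 11.0, 18.0, 13.2, 11.5]  # cores per hour over a window
print(reservation_target(usage))  # reserve 75% of the 11-core baseline
```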
For Kubernetes specifically, you can use Karpenter (AWS) or the cluster autoscaler with mixed instance policies to automatically choose the cheapest available instance types:
# karpenter/provisioner.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: default
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- spot
- key: node.kubernetes.io/instance-type
operator: In
values:
- t3.medium
- t3.large
- t3a.medium
- t3a.large
- m5.large
- m5a.large
- m6i.large
- m6a.large
- key: kubernetes.io/arch
operator: In
values:
- amd64
- arm64 # ARM instances are ~20% cheaper
limits:
resources:
cpu: "64"
memory: 128Gi
providerRef:
name: default
  # Consolidation: Karpenter will replace underutilized nodes with
  # smaller ones to save money (in the v1alpha5 API this is mutually
  # exclusive with ttlSecondsAfterEmpty, so only consolidation is set)
  consolidation:
    enabled: true
Notice the arm64 architecture option. ARM instances (like AWS Graviton) are typically 20% cheaper and offer
comparable or better performance for most workloads. If your container images support multi-arch builds (which
they should), this is an easy win.
Cost alerts tied to SLOs
Here is where SRE and FinOps intersect beautifully: using your error budget as a cost control mechanism. The idea is that if you are spending more than necessary to maintain your SLOs, you have room to optimize.
Think about it this way. If your availability SLO is 99.9% and you are running at 99.99%, you are probably over-provisioned. That extra “9” is costing you money and it is not required by the SLO. You could reduce capacity until availability drops to around 99.95% and still have plenty of error budget left.
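That reasoning reduces to a simple predicate; the thresholds and figures below are illustrative:

```python
# Sketch: the "over-provisioned for the SLO" signal in plain code.
# The 10% budget-consumption threshold and the figures are illustrative.

def can_reduce_capacity(slo_target: float, measured_availability: float,
                        budget_consumed: float) -> bool:
    """True when availability comfortably exceeds the SLO and little
    error budget has been spent, i.e. there is headroom to cut capacity."""
    return measured_availability > slo_target and budget_consumed < 0.10

# 99.9% SLO, running at 99.99% with 5% of the error budget consumed
print(can_reduce_capacity(0.999, 0.9999, 0.05))  # True
```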
Set up cost-per-request as a metric:
# prometheus/cost-per-request-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cost-metrics
namespace: monitoring
spec:
groups:
- name: cost.rules
interval: 5m
rules:
# Cost per request (estimated)
- record: cost:per_request:ratio
expr: |
(
sum(container_cpu_allocation{namespace="default"} *
on(node) group_left() node_cpu_hourly_cost)
+
sum(container_memory_allocation_bytes{namespace="default"} / 1024 / 1024 / 1024 *
on(node) group_left() node_ram_hourly_cost)
)
/
sum(rate(http_requests_total{namespace="default"}[1h]))
# Monthly cost estimate
- record: cost:monthly:estimate
expr: |
sum(
container_cpu_allocation * on(node) group_left()
node_cpu_hourly_cost * 730
) +
sum(
container_memory_allocation_bytes / 1024 / 1024 / 1024 *
on(node) group_left() node_ram_hourly_cost * 730
)
# Cost efficiency: value delivered per dollar
- record: cost:efficiency:ratio
expr: |
sum(rate(http_requests_total{status=~"2.."}[1h]))
/
(
sum(container_cpu_allocation{namespace="default"} *
on(node) group_left() node_cpu_hourly_cost)
+
sum(container_memory_allocation_bytes{namespace="default"} / 1024 / 1024 / 1024 *
on(node) group_left() node_ram_hourly_cost)
)
Now create alerts that fire when costs exceed thresholds:
# prometheus/cost-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cost-alerts
namespace: monitoring
spec:
groups:
- name: cost.alerts
rules:
# Alert when monthly cost estimate exceeds budget
- alert: MonthlyCostExceedsBudget
expr: cost:monthly:estimate > 500
for: 6h
labels:
severity: warning
team: platform
annotations:
summary: "Monthly cost estimate exceeds $500 budget"
description: "Current estimated monthly cost is ${{ $value | printf \"%.2f\" }}. Budget is $500."
runbook_url: "https://wiki.internal/runbooks/cost-overrun"
# Alert when cost per request spikes
- alert: CostPerRequestSpike
expr: cost:per_request:ratio > 0.001
for: 1h
labels:
severity: warning
team: platform
annotations:
summary: "Cost per request exceeds $0.001"
description: "Current cost per request is ${{ $value | printf \"%.6f\" }}. This may indicate over-provisioning or a traffic drop."
# Alert when CPU efficiency drops (over-provisioning)
- alert: LowCPUEfficiency
expr: |
sum by (namespace) (rate(container_cpu_usage_seconds_total[24h]))
/
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
< 0.2
for: 24h
labels:
severity: info
team: platform
annotations:
summary: "Namespace {{ $labels.namespace }} CPU utilization below 20%"
description: "The namespace {{ $labels.namespace }} is only using {{ $value | humanizePercentage }} of requested CPU. Consider right-sizing."
# Alert when error budget is healthy but costs are high
# This is the key FinOps+SRE integration
- alert: OverProvisionedForSLO
expr: |
(1 - slo:error_budget:remaining_ratio) < 0.1
and
cost:monthly:estimate > 400
for: 24h
labels:
severity: info
team: platform
annotations:
summary: "Over-provisioned: SLO healthy but costs high"
description: "Error budget consumed is only {{ $value | humanizePercentage }} but monthly cost is high. Consider reducing capacity to save costs while maintaining the SLO."
The OverProvisionedForSLO alert is the most interesting one. It fires when your error budget is barely
touched (meaning you are way above your SLO target) AND your costs are high. This is a signal that you can
safely reduce capacity.
Tagging strategies
Without proper tagging, your cost data is just a big number with no context. You need to know which team, project, and environment is responsible for each cost.
In Kubernetes, labels serve as tags for cost allocation. Define a consistent labeling standard:
# labels/standard-labels.yaml
# Every resource should have these labels
metadata:
labels:
# Who owns this?
app.kubernetes.io/name: tr-web
app.kubernetes.io/component: frontend
app.kubernetes.io/part-of: tr-blog
app.kubernetes.io/managed-by: argocd
# Cost allocation
cost-center: engineering
team: platform
environment: production
project: tr-blog
# Lifecycle
lifecycle: permanent # or: temporary, ephemeral, review
expiry: "none" # or: "2026-04-01" for temporary resources
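Once workloads carry these labels consistently, a per-team cost report is just a group-by. A sketch with invented cost records (real data would come from a tool like OpenCost):

```python
# Sketch: cost allocation by the `team` label, as a plain group-by.
# The cost records are invented for illustration.
from collections import defaultdict

records = [
    {"workload": "tr-web", "team": "platform", "cost": 42.25},
    {"workload": "api", "team": "backend", "cost": 30.00},
    {"workload": "tr-worker", "team": "platform", "cost": 12.75},
]

by_team: dict[str, float] = defaultdict(float)
for r in records:
    by_team[r["team"]] += r["cost"]

print(dict(by_team))  # {'platform': 55.0, 'backend': 30.0}
```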
Enforce these labels with a policy engine like Kyverno:
# kyverno/require-cost-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-cost-labels
annotations:
policies.kyverno.io/title: Require Cost Allocation Labels
policies.kyverno.io/description: >-
All deployments must have cost allocation labels for
tracking and chargeback purposes.
spec:
validationFailureAction: Enforce
background: true
rules:
- name: check-cost-labels
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
- DaemonSet
- Job
- CronJob
validate:
message: >-
All workloads must have cost allocation labels:
cost-center, team, environment, and project.
pattern:
metadata:
labels:
cost-center: "?*"
team: "?*"
environment: "?*"
project: "?*"
- name: check-pvc-labels
match:
any:
- resources:
kinds:
- PersistentVolumeClaim
validate:
message: "PVCs must have cost-center and team labels."
pattern:
metadata:
labels:
cost-center: "?*"
team: "?*"
- name: check-service-labels
match:
any:
- resources:
kinds:
- Service
validate:
message: "Services must have cost-center and team labels."
pattern:
metadata:
labels:
cost-center: "?*"
team: "?*"
With this policy in place, any deployment without cost allocation labels is rejected at admission time. This ensures 100% label coverage, which means your cost reports are accurate.
For cloud resources outside Kubernetes (S3 buckets, RDS instances, etc.), use Terraform to enforce tags:
# terraform/provider.tf
provider "aws" {
region = "us-east-1"
default_tags {
tags = {
Environment = "production"
Team = "platform"
Project = "tr-blog"
ManagedBy = "terraform"
CostCenter = "engineering"
}
}
}
# terraform/tag-policy.tf
resource "aws_organizations_policy" "require_tags" {
name = "require-cost-tags"
description = "Require cost allocation tags on all resources"
type = "TAG"
content = jsonencode({
tags = {
CostCenter = {
tag_key = {
"@@assign" = "CostCenter"
}
enforced_for = {
"@@assign" = [
"ec2:instance",
"ec2:volume",
"s3:bucket",
"rds:db",
"elasticloadbalancing:loadbalancer"
]
}
}
Team = {
tag_key = {
"@@assign" = "Team"
}
enforced_for = {
"@@assign" = [
"ec2:instance",
"ec2:volume",
"s3:bucket",
"rds:db"
]
}
}
}
})
}
Once tagging is consistent, you can generate cost reports per team:
# Query Kubecost for cost by team label
curl -s "http://kubecost.kubecost.svc:9090/model/allocation?window=30d&aggregate=label:team" \
| jq '.data[0] | to_entries[] | {
team: .key,
monthlyCost: (.value.totalCost | . * 100 | round / 100),
cpuEfficiency: (.value.cpuEfficiency | . * 100 | round),
ramEfficiency: (.value.ramEfficiency | . * 100 | round)
}'
# Example output:
# { "team": "platform", "monthlyCost": 285.42, "cpuEfficiency": 45, "ramEfficiency": 52 }
# { "team": "backend", "monthlyCost": 156.78, "cpuEfficiency": 62, "ramEfficiency": 58 }
# { "team": "data", "monthlyCost": 412.33, "cpuEfficiency": 78, "ramEfficiency": 71 }
This data makes cost conversations productive. Instead of “we need to cut costs,” you can say “the platform team has 45% CPU efficiency, let’s right-size those workloads to save an estimated $128/month.”
Closing notes
Cost optimization in the cloud is not a one-time project. It is an ongoing practice that requires visibility, accountability, and continuous improvement. The good news is that as an SRE team, you already have most of the skills and tooling you need. You know how to measure things (SLIs), set targets (SLOs), and automate responses (alerts and runbooks). Apply those same patterns to cost.
Start with the quick wins: run VPA in recommendation mode and right-size your top 10 over-provisioned workloads. Install OpenCost to get visibility into where your money goes. Set up a weekly cost review alongside your SLO review. Then gradually adopt spot instances, storage tiering, and cost-aware alerting.
The key takeaway is that reliability and cost efficiency are not in conflict. With the right approach, you can reduce spending while maintaining or even improving your SLOs. Every dollar saved on over-provisioning is a dollar you can invest in better tooling, more reliability features, or your team.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here