SRE: Cost Optimization in the Cloud
Introduction
Throughout this SRE series we have covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, and secrets management. We have built a solid foundation for running reliable systems, but reliability is only half the picture. If your infrastructure bill keeps growing unchecked, it does not matter how reliable things are because eventually someone is going to ask hard questions about cost.
Cloud spending has a tendency to creep up. You spin up a test cluster and forget about it, someone requests a large instance “just in case,” dev environments run 24/7, and before you know it your monthly bill has doubled. The FinOps movement emerged to bring financial accountability to cloud spending, and SRE teams are in a unique position to drive cost optimization because they already understand the infrastructure deeply.
In this article we will cover FinOps principles, right-sizing workloads, spot instances, resource quotas, cost visibility with Kubecost and OpenCost, idle resource detection, storage tiering, reserved capacity planning, cost alerts tied to SLOs, and tagging strategies for cost allocation. These are all practical techniques you can start applying today.
Let’s get into it.
FinOps principles
FinOps (Financial Operations) is a cultural practice that brings together engineering, finance, and business teams to manage cloud costs collaboratively. It is not about cutting costs at all costs. It is about making informed decisions and getting the most value from every dollar spent.
The FinOps lifecycle has three phases:
- Inform: Understand what you are spending, where, and why. You cannot optimize what you cannot see.
- Optimize: Take action to reduce waste. Right-size instances, use spot nodes, clean up idle resources.
- Operate: Continuously monitor costs, set budgets, and build cost awareness into your engineering culture.
For SRE teams, the key insight is that cost should be treated as a first-class metric, just like latency, availability, and error rate. You already have dashboards for SLIs. Add a cost panel to those dashboards. When you review your SLO performance weekly, review your cost metrics too.
Some practical principles to adopt:
- Everyone is accountable for cost, not just finance. Engineers who provision resources should understand the cost impact.
- Cost decisions are data-driven. Use actual utilization data, not guesses or “we might need it someday.”
- Cost optimization is continuous, not a one-time project. Treat it like reliability: always improving.
- Optimize for value, not just savings. Sometimes spending more is the right call if it improves reliability or developer productivity.
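To make the budget-oriented, data-driven principles concrete, here is a minimal sketch of month-to-date burn projection; the spend figure and dates are invented for illustration:

```python
# Sketch: project end-of-month spend from month-to-date burn.
# The spend figure and dates are made up for illustration.
from calendar import monthrange
from datetime import date

def projected_monthly_spend(mtd_spend: float, today: date) -> float:
    """Linear projection of month-to-date spend to a full month."""
    days_in_month = monthrange(today.year, today.month)[1]
    return mtd_spend / today.day * days_in_month

spend = projected_monthly_spend(mtd_spend=250.0, today=date(2025, 6, 15))
print(round(spend, 2))  # $250 over 15 of 30 days projects to 500.0
```

If the projection exceeds the budget, that is a conversation to have in week two, not after the invoice arrives.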
Right-sizing workloads
Right-sizing is the single most impactful cost optimization you can make in Kubernetes. Most teams over-provision their workloads significantly because developers request resources based on worst-case estimates rather than actual usage.
The Vertical Pod Autoscaler (VPA) is your best friend here. Even if you do not enable it in auto mode, running it in recommendation mode gives you data on what your pods actually use versus what they request.
Install the VPA:
# Install VPA components
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Create a VPA in recommendation mode for your workloads:
# vpa/tr-web-vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: tr-web-vpa
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: tr-web
updatePolicy:
updateMode: "Off" # Recommendation only, no auto-updates
resourcePolicy:
containerPolicies:
- containerName: tr-web
minAllowed:
cpu: 50m
memory: 64Mi
maxAllowed:
cpu: 2000m
memory: 2Gi
controlledResources:
- cpu
- memory
After a few days of running, check the recommendations:
kubectl describe vpa tr-web-vpa
# Output will look something like:
# Recommendation:
# Container Recommendations:
# Container Name: tr-web
# Lower Bound:
# Cpu: 25m
# Memory: 80Mi
# Target:
# Cpu: 100m
# Memory: 180Mi
# Uncapped Target:
# Cpu: 100m
# Memory: 180Mi
# Upper Bound:
# Cpu: 350m
# Memory: 400Mi
Now compare that to what you actually requested:
# Check current resource requests across all pods
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{"\t"}Req: {.resources.requests.cpu}/{.resources.requests.memory}{"\t"}Lim: {.resources.limits.cpu}/{.resources.limits.memory}{"\n"}{end}{end}' | column -t
If your pods are requesting 500m CPU but only using 100m on average, you are paying for 5x more compute than you need. That gap is pure waste.
A good rule of thumb for setting requests and limits:
- Requests: Set to the P95 of actual usage (from VPA recommendations or Prometheus metrics). This ensures the scheduler places pods on nodes with enough capacity.
- Limits: Set to 2-3x the request for CPU (to allow bursting), and 1.5-2x for memory (to avoid OOM kills while still preventing runaway consumption).
- Review quarterly: Usage patterns change as your application evolves. What was right-sized six months ago might be wrong today.
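The rule of thumb above can be expressed as a small calculation. This is an illustrative sketch with made-up usage samples, not the VPA's actual recommender logic:

```python
# Sketch: derive requests/limits from observed usage, per the rule of thumb:
# request = P95 of actual usage, CPU limit = 2x request.
# Usage samples are invented for illustration.

def p95(samples: list[float]) -> float:
    """95th percentile via nearest-rank on sorted samples."""
    s = sorted(samples)
    idx = max(0, int(0.95 * len(s)) - 1)
    return s[idx]

cpu_usage_cores = [0.05, 0.08, 0.1, 0.12, 0.09, 0.11, 0.1, 0.3, 0.1, 0.09]
cpu_request = p95(cpu_usage_cores)   # schedulable request
cpu_limit = 2 * cpu_request          # headroom for bursting
print(cpu_request, cpu_limit)
```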
Here is a Prometheus query to find the most over-provisioned workloads:
# CPU over-provisioning ratio by pod
# Values > 2 mean the workload is requesting 2x+ more CPU than it uses
# (join with kube_pod_owner if you want to roll this up by deployment)
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
)
/
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[24h])
)
Spot and preemptible instances
Spot instances (AWS), preemptible VMs (GCP), or spot VMs (Azure) offer 60-90% discounts compared to on-demand pricing. The tradeoff is that the cloud provider can reclaim them with short notice (usually 2 minutes). For stateless, fault-tolerant workloads in Kubernetes, this is a great deal.
The trick is to run your workloads on a mix of on-demand and spot nodes. Critical workloads like your database go on on-demand nodes. Stateless web servers and batch jobs go on spot nodes.
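Back-of-the-envelope math shows why the mix matters. The sketch below assumes a hypothetical $0.08/hr on-demand price and a 70% spot discount (within the 60-90% range mentioned above):

```python
# Sketch: blended hourly cost of a mixed on-demand/spot node pool.
# The price and the 70% spot discount are assumptions for illustration.

def blended_cost(nodes_on_demand: int, nodes_spot: int,
                 on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost of the pool given a per-node on-demand price."""
    spot_price = on_demand_price * (1 - spot_discount)
    return nodes_on_demand * on_demand_price + nodes_spot * spot_price

# 2 on-demand + 3 spot nodes vs. 5 on-demand nodes
mixed = blended_cost(2, 3, on_demand_price=0.08, spot_discount=0.70)
all_od = blended_cost(5, 0, on_demand_price=0.08, spot_discount=0.70)
print(round(mixed, 3), round(all_od, 3))
```

Even with critical workloads pinned to on-demand, the blended pool costs roughly 40% less than running everything on-demand.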
Set up a spot node group (EKS example):
# eks-nodegroup-spot.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-east-1
managedNodeGroups:
  - name: on-demand-critical
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    labels:
      node-type: on-demand
      workload-type: critical
    taints:
      - key: workload-type
        value: critical
        effect: NoSchedule
  - name: spot-general
    instanceTypes:
      - t3.large
      - t3.xlarge
      - t3a.large
      - t3a.xlarge
      - m5.large
      - m5a.large
    spot: true
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
    labels:
      node-type: spot
      workload-type: general
Notice the spot node group uses multiple instance types. This is important because spot availability varies by instance type. Using a diverse set increases your chances of getting capacity.
Now schedule your workloads appropriately using node affinity and tolerations:
# deployments/tr-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tr-web
namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: node-type
operator: In
values:
- spot
# Spread across nodes for resilience
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- tr-web
topologyKey: kubernetes.io/hostname
tolerations:
- key: "node-type"
operator: "Equal"
value: "spot"
effect: "NoSchedule"
containers:
- name: tr-web
image: kainlite/tr:latest
resources:
requests:
cpu: 100m
memory: 180Mi
limits:
cpu: 300m
memory: 360Mi
The preferredDuringSchedulingIgnoredDuringExecution with weight 80 means the scheduler will try to place pods
on spot nodes but will fall back to on-demand if no spot capacity is available. This is important for resilience.
You also need a PodDisruptionBudget to handle spot node reclamation gracefully:
# pdb/tr-web-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: tr-web-pdb
namespace: default
spec:
minAvailable: 2
selector:
matchLabels:
app: tr-web
This ensures that at least 2 pods are always running, even during spot node reclamation. Combined with multiple replicas spread across different nodes, your service stays available while saving 60-90% on compute.
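The constraint the PDB enforces is simple arithmetic. A quick sketch (of the invariant, not the eviction API itself):

```python
# Sketch: how a PodDisruptionBudget constrains voluntary evictions.
# With 3 replicas and minAvailable: 2, at most one pod may be
# evicted at a time during a spot reclamation or node drain.

def max_evictable(replicas: int, min_available: int) -> int:
    """Pods the eviction API may remove while honoring the PDB."""
    return max(0, replicas - min_available)

print(max_evictable(replicas=3, min_available=2))  # 1
```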
Resource quotas and limit ranges
Without guardrails, any team member can deploy a workload that requests 64 CPUs and 256GB of memory. Resource quotas and limit ranges prevent this kind of runaway cost.
A ResourceQuota sets hard limits per namespace:
# quotas/dev-namespace-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: dev
spec:
hard:
requests.cpu: "4" # Total CPU requests across all pods
requests.memory: 8Gi # Total memory requests
limits.cpu: "8" # Total CPU limits
limits.memory: 16Gi # Total memory limits
pods: "20" # Maximum number of pods
services.loadbalancers: "2" # Limit expensive LB services
persistentvolumeclaims: "10"
requests.storage: 100Gi # Total PVC storage
A LimitRange sets defaults and per-pod constraints. This is especially useful for catching pods deployed without resource requests:
# quotas/dev-namespace-limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: dev
spec:
limits:
- type: Container
default: # Default limits if not specified
cpu: 200m
memory: 256Mi
defaultRequest: # Default requests if not specified
cpu: 50m
memory: 64Mi
min: # Minimum allowed
cpu: 10m
memory: 16Mi
max: # Maximum allowed per container
cpu: "2"
memory: 4Gi
- type: Pod
max: # Maximum per pod (all containers combined)
cpu: "4"
memory: 8Gi
- type: PersistentVolumeClaim
min:
storage: 1Gi
max:
storage: 50Gi
Now if someone deploys a pod without resource requests, it automatically gets 50m CPU and 64Mi memory as defaults. And if someone tries to request 32 CPUs, the API server rejects the request.
For production namespaces, you want different quotas:
# quotas/production-namespace-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: production
spec:
hard:
requests.cpu: "16"
requests.memory: 32Gi
limits.cpu: "32"
limits.memory: 64Gi
pods: "50"
services.loadbalancers: "5"
persistentvolumeclaims: "20"
requests.storage: 500Gi
scopeSelector:
matchExpressions:
- scopeName: PriorityClass
operator: In
values:
- high
- medium
Kubecost and OpenCost
You cannot optimize what you cannot measure. Kubecost (and its open source core, OpenCost) gives you cost visibility into your Kubernetes cluster, broken down by namespace, deployment, label, and team.
Install OpenCost with Helm:
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
helm install opencost opencost/opencost \
--namespace opencost \
--create-namespace \
--set opencost.exporter.defaultClusterId="production" \
--set opencost.ui.enabled=true \
--set opencost.prometheus.internal.enabled=false \
--set opencost.prometheus.external.url="http://prometheus-server.monitoring.svc:9090"
For Kubecost (which includes more features like recommendations and savings insights):
helm repo add kubecost https://kubecost.github.io/cost-analyzer
helm repo update
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="your-token-here" \
--set prometheus.server.global.external_labels.cluster_id="production" \
--set prometheus.nodeExporter.enabled=false \
--set prometheus.serviceAccounts.nodeExporter.create=false
Once installed, you can query cost data via the API:
# Get cost allocation by namespace for the last 7 days
curl -s "http://kubecost.kubecost.svc:9090/model/allocation?window=7d&aggregate=namespace" \
| jq '.data[0] | to_entries[] | {
namespace: .key,
totalCost: .value.totalCost,
cpuCost: .value.cpuCost,
memCost: .value.ramCost,
pvCost: .value.pvCost,
cpuEfficiency: .value.cpuEfficiency,
ramEfficiency: .value.ramEfficiency
}'
# Example output:
# {
# "namespace": "default",
# "totalCost": 42.15,
# "cpuCost": 18.30,
# "memCost": 15.85,
# "pvCost": 8.00,
# "cpuEfficiency": 0.35,
# "ramEfficiency": 0.42
# }
That CPU efficiency of 0.35 means you are only using 35% of the CPU you are paying for. That is a big optimization opportunity.
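A rough way to turn those efficiency numbers into a savings estimate, assuming (optimistically) that requests could shrink all the way down to actual usage; the figures come from the example output above:

```python
# Sketch: rough right-sizing savings from the efficiency figures above.
# Assumes cost scales with requests and requests could shrink to usage,
# which is optimistic; real right-sizing keeps headroom.

def rightsizing_savings(cost: float, efficiency: float) -> float:
    """Portion of cost attributable to requested-but-unused capacity."""
    return cost * (1 - efficiency)

cpu_savings = rightsizing_savings(cost=18.30, efficiency=0.35)
ram_savings = rightsizing_savings(cost=15.85, efficiency=0.42)
print(round(cpu_savings + ram_savings, 2))  # rough upper bound on savings
```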
Create a Grafana dashboard for cost visibility:
# grafana/cost-dashboard.json (simplified)
# Useful Prometheus queries for cost panels:
# Monthly cost estimate by namespace
sum by (namespace) (
container_cpu_allocation * on(node) group_left()
node_cpu_hourly_cost * 730
) +
sum by (namespace) (
container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left()
node_ram_hourly_cost * 730
)
# Idle cost (resources requested but not used)
sum by (namespace) (
(kube_pod_container_resource_requests{resource="cpu"} -
rate(container_cpu_usage_seconds_total[1h]))
* on(node) group_left() node_cpu_hourly_cost * 730
)
# Cost per request (useful for cost-per-SLI tracking)
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[1h])
* on(node) group_left() node_cpu_hourly_cost)
/
sum(rate(http_requests_total{namespace="default"}[1h]))
Idle resource detection
Idle resources are the low-hanging fruit of cost optimization. These are things you are paying for but nobody is using. In a typical Kubernetes cluster, 20-30% of spend goes to idle resources.
Here is a script to find common idle resources:
#!/bin/bash
# idle-resource-audit.sh
# Find idle and wasted resources in your cluster
echo "=== Unused PersistentVolumeClaims ==="
# PVCs not mounted by any pod
kubectl get pvc -A -o json | jq -r '
.items[] |
select(.status.phase == "Bound") |
.metadata.namespace + "/" + .metadata.name
' | while read pvc; do
ns=$(echo $pvc | cut -d/ -f1)
name=$(echo $pvc | cut -d/ -f2)
# Check if any pod references this PVC
used=$(kubectl get pods -n $ns -o json | jq -r \
--arg pvc "$name" \
'.items[].spec.volumes[]? | select(.persistentVolumeClaim.claimName == $pvc) | .name' \
2>/dev/null)
if [ -z "$used" ]; then
size=$(kubectl get pvc $name -n $ns -o jsonpath='{.spec.resources.requests.storage}')
echo " UNUSED: $pvc ($size)"
fi
done
echo ""
echo "=== LoadBalancer Services ==="
# Each LB costs money even if no traffic flows through it
kubectl get svc -A --field-selector spec.type=LoadBalancer \
-o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.loadBalancer.ingress[0].ip,AGE:.metadata.creationTimestamp'
echo ""
echo "=== Deployments with 0 replicas ==="
# Scaled to 0 but still have PVCs, configmaps, secrets attached
kubectl get deploy -A -o json | jq -r '
.items[] |
select(.spec.replicas == 0) |
.metadata.namespace + "/" + .metadata.name
'
echo ""
echo "=== Pods in CrashLoopBackOff ==="
# Burning CPU on restart loops (CrashLoopBackOff pods still report phase
# Running, so filter on the container waiting reason, not the pod phase)
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.status.containerStatuses[]?.state.waiting.reason == "CrashLoopBackOff") |
  .metadata.namespace + "/" + .metadata.name +
    " (restarts: " + (.status.containerStatuses[0].restartCount | tostring) + ")"
'
echo ""
echo "=== Unattached Persistent Volumes ==="
kubectl get pv -o json | jq -r '
.items[] |
select(.status.phase == "Available" or .status.phase == "Released") |
.metadata.name + " (" + .spec.capacity.storage + ") - " + .status.phase
'
For a more automated approach, set up a CronJob that runs this audit weekly and sends results to Slack:
# cronjob/idle-resource-audit.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: idle-resource-audit
namespace: monitoring
spec:
schedule: "0 9 * * 1" # Every Monday at 9am
jobTemplate:
spec:
template:
spec:
serviceAccountName: resource-auditor
containers:
- name: auditor
image: bitnami/kubectl:latest
command:
- /bin/bash
- -c
- |
# Run audit and post to Slack
TOTAL_PVCS=$(kubectl get pvc -A -o json | jq '.items | length')
LB_COUNT=$(kubectl get svc -A --field-selector spec.type=LoadBalancer -o json | jq '.items | length')
ZERO_REPLICAS=$(kubectl get deploy -A -o json | jq '[.items[] | select(.spec.replicas == 0)] | length')
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-type: application/json' \
-d "{
\"text\": \"Weekly Idle Resource Report\",
\"blocks\": [{
\"type\": \"section\",
\"text\": {
\"type\": \"mrkdwn\",
\"text\": \"*Weekly Idle Resource Audit*\n- PVCs: $TOTAL_PVCS total\n- LoadBalancers: $LB_COUNT active\n- Zero-replica deployments: $ZERO_REPLICAS\"
}
}]
}"
env:
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: slack-webhook
key: url
restartPolicy: OnFailure
Storage tiering
Storage costs can sneak up on you, especially if everything defaults to high-performance SSD. Not all data needs fast storage. Logs, backups, and archived data can live on cheaper storage tiers.
Define multiple StorageClasses for different tiers:
# storage/storageclass-fast.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
labels:
cost-tier: high
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "5000"
throughput: "250"
encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage/storageclass-standard.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: standard
labels:
cost-tier: medium
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "3000"
throughput: "125"
encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# storage/storageclass-cold.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cold-storage
labels:
cost-tier: low
provisioner: ebs.csi.aws.com
parameters:
type: sc1
encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Use the right tier for each workload:
# Database: fast SSD for low latency
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgresql-data
namespace: default
spec:
storageClassName: fast-ssd
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
---
# Application logs: standard storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: app-logs
namespace: default
spec:
storageClassName: standard
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
---
# Backups and archives: cold storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: backup-archive
namespace: default
spec:
storageClassName: cold-storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 200Gi
For object storage (S3, GCS), set up lifecycle policies to move data to cheaper tiers automatically:
# terraform/s3-lifecycle.tf
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "archive-old-logs"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # ~45% cheaper
}
transition {
days = 90
storage_class = "GLACIER" # ~80% cheaper
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE" # ~95% cheaper
}
expiration {
days = 730 # Delete after 2 years
}
}
}
The cost difference between tiers is significant. For AWS EBS, gp3 costs about $0.08/GB/month while sc1 costs $0.015/GB/month. For S3, Standard is $0.023/GB/month while Deep Archive is $0.00099/GB/month. Moving 1TB of archive data from Standard to Deep Archive saves about $264/year.
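The $264/year figure works out as follows, using the quoted prices and 1 TB = 1000 GB:

```python
# Sketch: the S3 tiering savings quoted above, using the listed
# per-GB-month prices and 1 TB = 1000 GB.

STANDARD = 0.023        # $/GB/month
DEEP_ARCHIVE = 0.00099  # $/GB/month

def annual_savings(gb: float, from_price: float, to_price: float) -> float:
    """Yearly savings from moving `gb` of data between storage tiers."""
    return gb * (from_price - to_price) * 12

print(round(annual_savings(1000, STANDARD, DEEP_ARCHIVE), 2))  # ~264
```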
Reserved vs on-demand
If you know you will need a certain amount of compute for the next 1-3 years, reserved instances or savings plans offer 30-60% discounts compared to on-demand. The tradeoff is commitment: you pay whether you use it or not.
The key is to only commit to your baseline: the minimum compute you always need. Let on-demand and spot handle the peaks.
Here is how to analyze your reservation coverage:
# Prometheus query: average CPU utilization over 30 days
# This shows your baseline compute needs
avg_over_time(
sum(
rate(container_cpu_usage_seconds_total[5m])
)[30d:1h]
)
# Compare against your reserved capacity
# If reserved < baseline, you are under-committed (paying too much on-demand)
# If reserved > baseline, you are over-committed (paying for unused reservations)
A practical approach to reservation planning:
- Measure your baseline for at least 3 months. Look at the minimum sustained usage, not the average.
- Reserve 70-80% of baseline. This gives you a safety margin for workload changes.
- Use savings plans over reserved instances when possible. Savings plans are more flexible because they apply to any instance family.
- Review quarterly. If your baseline has shifted, adjust your commitments at renewal time.
- Consider 1-year terms first. The savings gap between 1-year and 3-year is often not worth the risk of being locked in.
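The measurement steps above reduce to a small calculation; the usage series here is invented for illustration:

```python
# Sketch: reservation sizing from hourly usage samples, per the guidance:
# baseline = minimum sustained usage, commit to ~75% of it.
# The usage series is invented for illustration.

def reservation_target(hourly_cores: list[float],
                       commit_ratio: float = 0.75) -> float:
    """Cores to reserve: a fraction of the minimum sustained usage."""
    baseline = min(hourly_cores)  # minimum sustained usage, not the average
    return baseline * commit_ratio

usage = [12.0, 14.5, 11.0, 18.0, 13.2, 11.5]  # cores per hour over a window
print(reservation_target(usage))  # reserve 75% of the 11-core baseline
```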
For Kubernetes specifically, you can use Karpenter (AWS) or the cluster autoscaler with mixed instance policies to automatically choose the cheapest available instance types:
# karpenter/provisioner.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: default
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- spot
- key: node.kubernetes.io/instance-type
operator: In
values:
- t3.medium
- t3.large
- t3a.medium
- t3a.large
- m5.large
- m5a.large
- m6i.large
- m6a.large
- key: kubernetes.io/arch
operator: In
values:
- amd64
- arm64 # ARM instances are ~20% cheaper
limits:
resources:
cpu: "64"
memory: 128Gi
providerRef:
name: default
  # Consolidation: Karpenter will replace underutilized nodes with
  # smaller ones to save money (in the v1alpha5 API this is mutually
  # exclusive with ttlSecondsAfterEmpty, so only consolidation is set)
  consolidation:
    enabled: true
Notice the arm64 architecture option. ARM instances (like AWS Graviton) are typically 20% cheaper and offer
comparable or better performance for most workloads. If your container images support multi-arch builds (which
they should), this is an easy win.
Cost alerts tied to SLOs
Here is where SRE and FinOps intersect beautifully: using your error budget as a cost control mechanism. The idea is that if you are spending more than necessary to maintain your SLOs, you have room to optimize.
Think about it this way. If your availability SLO is 99.9% and you are running at 99.99%, you are probably over-provisioned. That extra “9” is costing you money and it is not required by the SLO. You could reduce capacity until availability drops to around 99.95% and still have plenty of error budget left.
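That reasoning reduces to a simple predicate; the thresholds and figures below are illustrative:

```python
# Sketch: the "over-provisioned for the SLO" signal in plain code.
# The 10% budget-consumption threshold and the figures are illustrative.

def can_reduce_capacity(slo_target: float, measured_availability: float,
                        budget_consumed: float) -> bool:
    """True when availability comfortably exceeds the SLO and little
    error budget has been spent, i.e. there is headroom to cut capacity."""
    return measured_availability > slo_target and budget_consumed < 0.10

# 99.9% SLO, running at 99.99% with 5% of the error budget consumed
print(can_reduce_capacity(0.999, 0.9999, 0.05))  # True
```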
Set up cost-per-request as a metric:
# prometheus/cost-per-request-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cost-metrics
namespace: monitoring
spec:
groups:
- name: cost.rules
interval: 5m
rules:
# Cost per request (estimated)
- record: cost:per_request:ratio
expr: |
(
sum(container_cpu_allocation{namespace="default"} *
on(node) group_left() node_cpu_hourly_cost)
+
sum(container_memory_allocation_bytes{namespace="default"} / 1024 / 1024 / 1024 *
on(node) group_left() node_ram_hourly_cost)
)
/
sum(rate(http_requests_total{namespace="default"}[1h]))
# Monthly cost estimate
- record: cost:monthly:estimate
expr: |
sum(
container_cpu_allocation * on(node) group_left()
node_cpu_hourly_cost * 730
) +
sum(
container_memory_allocation_bytes / 1024 / 1024 / 1024 *
on(node) group_left() node_ram_hourly_cost * 730
)
# Cost efficiency: value delivered per dollar
- record: cost:efficiency:ratio
expr: |
sum(rate(http_requests_total{status=~"2.."}[1h]))
/
(
sum(container_cpu_allocation{namespace="default"} *
on(node) group_left() node_cpu_hourly_cost)
+
sum(container_memory_allocation_bytes{namespace="default"} / 1024 / 1024 / 1024 *
on(node) group_left() node_ram_hourly_cost)
)
Now create alerts that fire when costs exceed thresholds:
# prometheus/cost-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cost-alerts
namespace: monitoring
spec:
groups:
- name: cost.alerts
rules:
# Alert when monthly cost estimate exceeds budget
- alert: MonthlyCostExceedsBudget
expr: cost:monthly:estimate > 500
for: 6h
labels:
severity: warning
team: platform
annotations:
summary: "Monthly cost estimate exceeds $500 budget"
description: "Current estimated monthly cost is ${{ $value | printf \"%.2f\" }}. Budget is $500."
runbook_url: "https://wiki.internal/runbooks/cost-overrun"
# Alert when cost per request spikes
- alert: CostPerRequestSpike
expr: cost:per_request:ratio > 0.001
for: 1h
labels:
severity: warning
team: platform
annotations:
summary: "Cost per request exceeds $0.001"
description: "Current cost per request is ${{ $value | printf \"%.6f\" }}. This may indicate over-provisioning or a traffic drop."
# Alert when CPU efficiency drops (over-provisioning)
- alert: LowCPUEfficiency
expr: |
sum by (namespace) (rate(container_cpu_usage_seconds_total[24h]))
/
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
< 0.2
for: 24h
labels:
severity: info
team: platform
annotations:
summary: "Namespace {{ $labels.namespace }} CPU utilization below 20%"
description: "The namespace {{ $labels.namespace }} is only using {{ $value | humanizePercentage }} of requested CPU. Consider right-sizing."
# Alert when error budget is healthy but costs are high
# This is the key FinOps+SRE integration
- alert: OverProvisionedForSLO
expr: |
(1 - slo:error_budget:remaining_ratio) < 0.1
and
cost:monthly:estimate > 400
for: 24h
labels:
severity: info
team: platform
annotations:
summary: "Over-provisioned: SLO healthy but costs high"
description: "Error budget consumed is only {{ $value | humanizePercentage }} but monthly cost is high. Consider reducing capacity to save costs while maintaining the SLO."
The OverProvisionedForSLO alert is the most interesting one. It fires when your error budget is barely
touched (meaning you are way above your SLO target) AND your costs are high. This is a signal that you can
safely reduce capacity.
Tagging strategies
Without proper tagging, your cost data is just a big number with no context. You need to know which team, project, and environment is responsible for each cost.
In Kubernetes, labels serve as tags for cost allocation. Define a consistent labeling standard:
# labels/standard-labels.yaml
# Every resource should have these labels
metadata:
labels:
# Who owns this?
app.kubernetes.io/name: tr-web
app.kubernetes.io/component: frontend
app.kubernetes.io/part-of: tr-blog
app.kubernetes.io/managed-by: argocd
# Cost allocation
cost-center: engineering
team: platform
environment: production
project: tr-blog
# Lifecycle
lifecycle: permanent # or: temporary, ephemeral, review
expiry: "none" # or: "2026-04-01" for temporary resources
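Once workloads carry these labels consistently, a per-team cost report is just a group-by. A sketch with invented cost records (real data would come from a tool like OpenCost):

```python
# Sketch: cost allocation by the `team` label, as a plain group-by.
# The cost records are invented for illustration.
from collections import defaultdict

records = [
    {"workload": "tr-web", "team": "platform", "cost": 42.25},
    {"workload": "api", "team": "backend", "cost": 30.00},
    {"workload": "tr-worker", "team": "platform", "cost": 12.75},
]

by_team: dict[str, float] = defaultdict(float)
for r in records:
    by_team[r["team"]] += r["cost"]

print(dict(by_team))  # {'platform': 55.0, 'backend': 30.0}
```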
Enforce these labels with a policy engine like Kyverno:
# kyverno/require-cost-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-cost-labels
annotations:
policies.kyverno.io/title: Require Cost Allocation Labels
policies.kyverno.io/description: >-
All deployments must have cost allocation labels for
tracking and chargeback purposes.
spec:
validationFailureAction: Enforce
background: true
rules:
- name: check-cost-labels
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
- DaemonSet
- Job
- CronJob
validate:
message: >-
All workloads must have cost allocation labels:
cost-center, team, environment, and project.
pattern:
metadata:
labels:
cost-center: "?*"
team: "?*"
environment: "?*"
project: "?*"
- name: check-pvc-labels
match:
any:
- resources:
kinds:
- PersistentVolumeClaim
validate:
message: "PVCs must have cost-center and team labels."
pattern:
metadata:
labels:
cost-center: "?*"
team: "?*"
- name: check-service-labels
match:
any:
- resources:
kinds:
- Service
validate:
message: "Services must have cost-center and team labels."
pattern:
metadata:
labels:
cost-center: "?*"
team: "?*"
With this policy in place, any deployment without cost allocation labels is rejected at admission time. This ensures 100% label coverage, which means your cost reports are accurate.
For cloud resources outside Kubernetes (S3 buckets, RDS instances, etc.), use Terraform to enforce tags:
# terraform/provider.tf
provider "aws" {
region = "us-east-1"
default_tags {
tags = {
Environment = "production"
Team = "platform"
Project = "tr-blog"
ManagedBy = "terraform"
CostCenter = "engineering"
}
}
}
# terraform/tag-policy.tf
resource "aws_organizations_policy" "require_tags" {
name = "require-cost-tags"
description = "Require cost allocation tags on all resources"
type = "TAG"
content = jsonencode({
tags = {
CostCenter = {
tag_key = {
"@@assign" = "CostCenter"
}
enforced_for = {
"@@assign" = [
"ec2:instance",
"ec2:volume",
"s3:bucket",
"rds:db",
"elasticloadbalancing:loadbalancer"
]
}
}
Team = {
tag_key = {
"@@assign" = "Team"
}
enforced_for = {
"@@assign" = [
"ec2:instance",
"ec2:volume",
"s3:bucket",
"rds:db"
]
}
}
}
})
}
Once tagging is consistent, you can generate cost reports per team:
# Query Kubecost for cost by team label
curl -s "http://kubecost.kubecost.svc:9090/model/allocation?window=30d&aggregate=label:team" \
| jq '.data[0] | to_entries[] | {
team: .key,
monthlyCost: (.value.totalCost | . * 100 | round / 100),
cpuEfficiency: (.value.cpuEfficiency | . * 100 | round),
ramEfficiency: (.value.ramEfficiency | . * 100 | round)
}'
# Example output:
# { "team": "platform", "monthlyCost": 285.42, "cpuEfficiency": 45, "ramEfficiency": 52 }
# { "team": "backend", "monthlyCost": 156.78, "cpuEfficiency": 62, "ramEfficiency": 58 }
# { "team": "data", "monthlyCost": 412.33, "cpuEfficiency": 78, "ramEfficiency": 71 }
This data makes cost conversations productive. Instead of “we need to cut costs,” you can say “the platform team has 45% CPU efficiency, let’s right-size those workloads to save an estimated $128/month.”
Closing notes
Cost optimization in the cloud is not a one-time project. It is an ongoing practice that requires visibility, accountability, and continuous improvement. The good news is that as an SRE team, you already have most of the skills and tooling you need. You know how to measure things (SLIs), set targets (SLOs), and automate responses (alerts and runbooks). Apply those same patterns to cost.
Start with the quick wins: run VPA in recommendation mode and right-size your top 10 over-provisioned workloads. Install OpenCost to get visibility into where your money goes. Set up a weekly cost review alongside your SLO review. Then gradually adopt spot instances, storage tiering, and cost-aware alerting.
The key takeaway is that reliability and cost efficiency are not in conflict. With the right approach, you can reduce spending while maintaining or even improving your SLOs. Every dollar saved on over-provisioning is a dollar you can invest in better tooling, more reliability features, or your team.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here