SRE: Capacity Planning, Autoscaling, and Load Testing

2026-03-05 | Gabriel Garrido | 17 min read


Introduction

Throughout this SRE series we have covered SLIs and SLOs, incident management, observability, and chaos engineering. All of that assumes your system has enough capacity to serve traffic. But how do you know if it does? And what happens when traffic doubles overnight?


Capacity planning is the art of ensuring your infrastructure can handle current and future demand without over-provisioning (wasting money) or under-provisioning (degrading service). In Kubernetes, this means getting resource requests and limits right, configuring autoscalers properly, and validating your setup with load tests.


In this article we will cover resource requests and limits, the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA), KEDA for event-driven scaling, and load testing with k6 to make sure everything works under pressure.


Let’s get into it.


Resource requests and limits: the foundation

Before you can autoscale anything, you need to understand resource requests and limits. These are the most misunderstood concepts in Kubernetes, and getting them wrong causes more outages than most people realize.


  • Requests: The minimum resources a pod needs. The scheduler uses this to decide where to place the pod.
  • Limits: The maximum resources a pod can use. If the pod exceeds its memory limit, it gets OOM-killed; if it hits its CPU limit, it gets throttled.
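How you combine requests and limits also determines the pod's QoS class, which controls eviction order under node pressure. A simplified sketch of the classification rules (the real kubelet logic handles a few more edge cases):

```python
def qos_class(containers):
    """Classify a pod's QoS class the way Kubernetes does (simplified).

    Each container is a dict like {"requests": {...}, "limits": {...}}
    with optional "cpu"/"memory" keys.
    """
    specs = [(c.get("requests", {}), c.get("limits", {})) for c in containers]

    # BestEffort: no container sets any request or limit
    if all(not req and not lim for req, lim in specs):
        return "BestEffort"

    # Guaranteed: every container sets cpu+memory limits,
    # and requests (defaulting to limits) equal the limits
    guaranteed = all(
        lim.get("cpu") and lim.get("memory")
        and all(req.get(k, lim[k]) == lim[k] for k in lim)
        for req, lim in specs
    )
    return "Guaranteed" if guaranteed else "Burstable"

# A container requesting 250m/256Mi with limits 1000m/512Mi
# has requests != limits, so the pod is Burstable:
print(qos_class([{
    "requests": {"cpu": "250m", "memory": "256Mi"},
    "limits": {"cpu": "1000m", "memory": "512Mi"},
}]))  # -> Burstable
```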

Here is how to set them for our Elixir application:


# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tr-web
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
      containers:
        - name: tr-web
          image: kainlite/tr:latest
          resources:
            requests:
              cpu: "250m"      # 0.25 CPU cores
              memory: "256Mi"  # 256 MB RAM
            limits:
              cpu: "1000m"     # 1 CPU core
              memory: "512Mi"  # 512 MB RAM
          ports:
            - containerPort: 4000

Common mistakes:


  • Setting requests = limits: This gives you the Guaranteed QoS class but wastes headroom. Reserve it for critical workloads like databases.
  • Not setting requests: Pods get BestEffort QoS and are the first to be evicted under pressure.
  • Setting memory limits too low: BEAM applications use memory for ETS tables, process heaps, and binary data. Too-tight limits cause random OOM kills.
  • Setting CPU limits at all: There is a growing consensus that CPU limits cause more harm than good due to throttling. Consider setting only CPU requests and no CPU limits.

The CPU limits debate:


CPU limits in Kubernetes use CFS (Completely Fair Scheduler) throttling. Even if the node has idle CPU, a pod hitting its CPU limit will be throttled. This causes latency spikes that are hard to debug because everything looks fine from a resource usage perspective.
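To make the throttling concrete, here is a rough sketch of the quota math: a CPU limit translates into a quota of CPU time per scheduling period (100ms by default), and once a container burns its quota, it sits idle until the next period. The thread count below is an illustrative assumption:

```python
CFS_PERIOD_MS = 100  # default cfs_period_us = 100000

def cfs_quota_ms(cpu_limit_millicores):
    """CPU time (ms) a container may use per 100ms CFS period."""
    return CFS_PERIOD_MS * cpu_limit_millicores / 1000

# A pod limited to 1000m gets 100ms of CPU time per 100ms period...
print(cfs_quota_ms(1000))  # 100.0

# ...but with, say, 4 busy BEAM scheduler threads it can burn that
# quota in ~25ms of wall time, then be throttled for the remaining
# ~75ms of every period, even if the node has idle cores.
threads = 4
busy_ms = cfs_quota_ms(1000) / threads
print(busy_ms)                   # 25.0
print(CFS_PERIOD_MS - busy_ms)   # 75.0 (throttled window per period)
```

That stop-start pattern is exactly the kind of tail-latency spike that does not show up in average CPU graphs.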


# Option A: With CPU limits (safe but can cause throttling)
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"

# Option B: Without CPU limits (better performance, requires good monitoring)
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Keep memory limits, drop CPU limits

If you go with Option B, make sure you have good monitoring to detect noisy neighbor issues.


Right-sizing with VPA

The Vertical Pod Autoscaler (VPA) watches your pod’s actual resource usage and recommends, or automatically adjusts, its requests (scaling limits proportionally). This is incredibly useful because guessing the right values is hard.


Install VPA:


git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

Create a VPA resource in recommendation mode (safest to start):


# vpa/tr-web-vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: tr-web-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tr-web
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: tr-web
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "1Gi"

After a few days of collecting data, check the recommendations:


kubectl describe vpa tr-web-vpa

# Output will look like:
# Recommendation:
#   Container Recommendations:
#     Container Name: tr-web
#     Lower Bound:
#       Cpu:     150m
#       Memory:  200Mi
#     Target:
#       Cpu:     280m
#       Memory:  310Mi
#     Upper Bound:
#       Cpu:     500m
#       Memory:  450Mi

The “Target” values are what VPA recommends for your requests. Use them as a starting point and validate with load testing.


Once you trust the recommendations, you can switch to updateMode: "Auto" to let VPA adjust resources automatically. Be aware that VPA does this by evicting and recreating pods with new resource values, so make sure you have enough replicas to handle the disruption.
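One way to bound that disruption is a PodDisruptionBudget, so VPA evictions (and node drains) never take down too many replicas at once. A minimal sketch, reusing the `tr-web` labels (the file path is hypothetical):

```yaml
# pdb/tr-web-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tr-web-pdb
  namespace: default
spec:
  minAvailable: 1  # always keep at least one tr-web pod running
  selector:
    matchLabels:
      app: tr-web
```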


Horizontal Pod Autoscaler (HPA)

HPA scales the number of pods based on metrics. The most common setup scales on CPU or memory usage, but you can also scale on custom metrics from Prometheus.


Basic HPA on CPU:


# hpa/tr-web-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tr-web-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tr-web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Key settings:


  • minReplicas: 2: Always have at least 2 pods for redundancy
  • maxReplicas: 10: Cap to prevent runaway scaling (and runaway costs)
  • averageUtilization: 70: Scale up when average CPU across pods exceeds 70%
  • scaleUp stabilization: 60s: Wait 60 seconds before scaling up to avoid flapping
  • scaleDown stabilization: 300s: Wait 5 minutes before scaling down to handle traffic oscillations
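Under the hood, HPA computes the desired replica count with the documented ratio formula, then clamps it to the min/max bounds. A sketch in Python for illustration:

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=10):
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# 2 pods averaging 95% CPU against a 70% target:
print(desired_replicas(2, 95, 70))  # ceil(2 * 95/70) = 3

# 3 pods averaging 40% CPU scale back down to the floor:
print(desired_replicas(3, 40, 70))  # ceil(3 * 40/70) = 2
```

The stabilization windows then control when that computed value is actually applied.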

HPA on custom Prometheus metrics:


Scaling on CPU is a blunt instrument. For a web service, scaling on requests-per-second or latency is much more responsive. This requires the Prometheus Adapter:


# Install Prometheus Adapter
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus:9090

Configure the adapter to expose your SLI metrics to the Kubernetes metrics API:


# prometheus-adapter/config.yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

  - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "http_request_duration_p99"
    metricsQuery: 'histogram_quantile(0.99, sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (le, <<.GroupBy>>))'

Then create an HPA that scales on requests per second per pod:


# hpa/tr-web-hpa-custom.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tr-web-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tr-web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale on requests per second per pod
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"  # Scale up if any pod handles more than 100 rps

    # Also consider latency
    - type: Pods
      pods:
        metric:
          name: http_request_duration_p99
        target:
          type: AverageValue
          averageValue: "300m"  # Scale up if p99 > 300ms

This is much better than CPU-based scaling because it reacts to actual traffic patterns rather than resource consumption, which can be misleading (the BEAM VM manages memory differently than most runtimes).
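When an HPA has multiple metrics, as above, it evaluates the scaling formula for each metric independently and takes the highest proposed replica count, so whichever signal (RPS or p99 latency) is under the most pressure wins. A sketch with illustrative values:

```python
import math

def desired_for_metric(current_replicas, current, target):
    """Per-metric proposal: ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current / target)

def hpa_desired(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    HPA takes the max of the per-metric proposals."""
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# 3 pods averaging 80 rps (target 100) but p99 at 450ms (target 300ms):
# rps proposes 3, latency proposes ceil(3 * 1.5) = 5 -> latency wins.
print(hpa_desired(3, [(80, 100), (0.45, 0.30)]))  # 5
```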


KEDA: event-driven autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) takes HPA to the next level by supporting dozens of event sources. It is particularly useful for scaling based on queue depth, cron schedules, or external metrics.


Install KEDA:


helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

Scale based on Prometheus metrics (SLO-aware scaling):


# keda/tr-web-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tr-web-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: tr-web
  minReplicaCount: 2
  maxReplicaCount: 15
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    # Scale based on request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_rate
        query: sum(rate(http_requests_total{service="tr-web"}[2m]))
        threshold: "200"  # Scale up when total RPS exceeds 200
        activationThreshold: "50"  # Start scaling at 50 RPS

    # Scale based on error budget burn rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: error_budget_burn_rate
        query: sli:availability:burn_rate5m{service="tr-web"}
        threshold: "5"  # Scale up when burning error budget 5x faster than normal

The error budget trigger is particularly clever. When your service is burning through error budget faster than expected (which means reliability is degrading), KEDA adds more replicas to absorb the load. This ties capacity planning directly to your SLOs.
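To see why a 5x threshold is reasonable, recall how burn rate relates to the error budget: with a 99.9% availability SLO the budget is 0.1%, so a burn rate of 5 means the observed error rate is 0.5%, which would exhaust a 30-day budget in about six days. A quick sketch:

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'sustainable' we consume error budget."""
    budget = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def days_to_exhaust_budget(rate, window_days=30):
    """At a constant burn rate, days until a 30-day budget is gone."""
    return window_days / rate

print(round(burn_rate(0.005, 0.999), 3))  # 5.0
print(days_to_exhaust_budget(5))          # 6.0 days
```

Of course, extra replicas only help when the degradation is load-related; an error budget burning due to a bug will not be fixed by scaling.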


Cron-based scaling for predictable traffic patterns:


# keda/tr-web-cron-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tr-web-cron
  namespace: default
spec:
  scaleTargetRef:
    name: tr-web
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    # Scale up during business hours
    - type: cron
      metadata:
        timezone: America/Argentina/Buenos_Aires
        start: "0 8 * * 1-5"   # Monday-Friday 8am
        end: "0 20 * * 1-5"    # Monday-Friday 8pm
        desiredReplicas: "4"

    # Scale up for newsletter sends (if applicable)
    - type: cron
      metadata:
        timezone: America/Argentina/Buenos_Aires
        start: "0 10 * * 2"    # Tuesday 10am (newsletter day)
        end: "0 12 * * 2"      # Tuesday 12pm
        desiredReplicas: "6"

If you know your traffic patterns (and you should, from your observability data), proactive scaling avoids the lag that reactive autoscaling introduces. Why wait for CPU to spike when you know traffic increases every morning at 8am?


Cluster autoscaling

Pod autoscaling is only useful if there are nodes with capacity to schedule new pods. The Cluster Autoscaler adds and removes nodes based on pending pod requests.


For a K3s cluster (like the one this blog runs on), you can use a combination of the Cluster Autoscaler and cloud provider integration. For self-managed clusters, you need to think about this differently.


Key considerations:


  • Node provisioning time: Cloud nodes take 2-5 minutes to provision. Plan your HPA to give enough headroom for this delay.
  • Over-provisioning: Keep a buffer pod (a low-priority deployment that consumes resources but can be preempted) to ensure there is always room for quick scale-ups.
  • Pod Disruption Budgets: Ensure the cluster autoscaler does not remove nodes with critical pods.

# Buffer pod to keep spare capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: low-priority
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: -10
globalDefault: false
description: "Low priority for buffer pods that can be preempted"

The buffer pod reserves capacity that can be instantly freed when a real workload needs it. The real pods have default priority and preempt the buffer pod, which then triggers the cluster autoscaler to add a new node for the buffer.


Load testing with k6

All the autoscaling configuration in the world is useless if you have not validated it under real load. k6 is an excellent load testing tool that makes it easy to define test scenarios.


Install k6:


# On Arch Linux
sudo pacman -S k6

# Or via Docker
docker run --rm -i grafana/k6 run - <script.js

Basic load test for the blog:


// load-tests/blog-basic.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up to 50 users over 2 minutes
    { duration: '5m', target: 50 },   // Stay at 50 users for 5 minutes
    { duration: '2m', target: 100 },  // Ramp up to 100 users
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 0 },    // Ramp down to 0
  ],
  thresholds: {
    // SLO-aligned thresholds
    http_req_duration: ['p(99)<300'],    // 99% of requests under 300ms
    http_req_failed: ['rate<0.001'],     // Less than 0.1% error rate
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:4000';

export default function () {
  // Simulate a typical user browsing the blog
  const pages = [
    '/blog',
    '/blog/sre-slis-slos-and-automations-that-actually-help',
    '/blog/debugging-distroless-containers-when-your-container-has-no-shell',
    '/blog/kubernetes-rbac-deep-dive-understanding-authorization-with-kubectl-and-curl',
  ];

  const page = pages[Math.floor(Math.random() * pages.length)];
  const res = http.get(`${BASE_URL}${page}`);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
    'body contains content': (r) => r.body.length > 1000,
  });

  sleep(Math.random() * 3 + 1); // Random think time between 1-4 seconds
}

Load test for search (LiveView WebSocket):


// load-tests/search-load.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 20 },
    { duration: '3m', target: 20 },
    { duration: '1m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:4000';

export default function () {
  // Load the search page
  const searchPage = http.get(`${BASE_URL}/search`);
  check(searchPage, {
    'search page loads': (r) => r.status === 200,
  });

  sleep(1);

  // Note: k6 does not natively support WebSocket LiveView connections
  // For full LiveView load testing, use the k6 WebSocket extension
  // or test the HTTP fallback mode
}

Autoscaling validation test:


This is the most important load test. It validates that your HPA kicks in correctly under load:


// load-tests/autoscale-validation.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter, Trend } from 'k6/metrics';

const errorCount = new Counter('errors');
const scalingLatency = new Trend('scaling_latency');

export const options = {
  stages: [
    // Phase 1: Baseline (should run with min replicas)
    { duration: '2m', target: 10 },

    // Phase 2: Ramp to trigger scale-up
    { duration: '1m', target: 200 },

    // Phase 3: Sustained high load (HPA should scale up)
    { duration: '10m', target: 200 },

    // Phase 4: Ramp down (HPA should eventually scale down)
    { duration: '2m', target: 10 },

    // Phase 5: Sustained low load
    { duration: '5m', target: 10 },
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // Even under load, p99 < 500ms
    http_req_failed: ['rate<0.005'],   // Less than 0.5% errors
    errors: ['count<50'],              // Less than 50 total errors
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:4000';

export default function () {
  const res = http.get(`${BASE_URL}/blog`);

  const success = check(res, {
    'status is 200': (r) => r.status === 200,
    'latency < 500ms': (r) => r.timings.duration < 500,
  });

  if (!success) {
    errorCount.add(1);
  }

  sleep(Math.random() * 2 + 0.5);
}

Run the test while watching your HPA:


# Terminal 1: Run the load test
k6 run --env BASE_URL=https://your-app.example.com load-tests/autoscale-validation.js

# Terminal 2: Watch HPA
kubectl get hpa tr-web-hpa --watch

# Terminal 3: Watch pods scaling
kubectl get pods -l app=tr-web --watch

What you want to see:


  • During Phase 1: 2 replicas (minReplicas), low resource usage
  • During Phase 2-3: HPA scales up to 4-6 replicas within 2-3 minutes of high load
  • During Phase 3: Latency stays within SLO even at high load
  • During Phase 4-5: HPA gradually scales back down to 2 replicas after the stabilization window

If the scale-up takes too long and latency spikes during Phase 2, you have a few options:

  • Lower the averageUtilization threshold
  • Reduce the scaleUp stabilization window
  • Use proactive cron-based scaling for predictable traffic patterns
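A useful back-of-envelope check is to add up the worst-case lag between load arriving and new pods serving traffic; if the total exceeds how fast your traffic ramps, reactive scaling alone will miss. The numbers below are assumptions to replace with measurements from your own cluster:

```python
# Hypothetical worst-case delays (seconds); measure your own.
lag = {
    "metrics_scrape": 30,           # Prometheus scrape interval
    "hpa_sync": 15,                 # HPA evaluation period
    "scale_up_stabilization": 60,   # from the HPA behavior block
    "pod_startup": 20,              # image pull + BEAM boot + readiness
    "node_provisioning": 0,         # 120-300 if a new node is needed
}

total = sum(lag.values())
print(f"worst-case scale-up lag: {total}s")  # 125s with node headroom
```

With two minutes of lag even in the best case, a one-minute traffic ramp (Phase 2 above) will always outrun a purely reactive setup, which is exactly why the cron-based option exists.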

Capacity planning as a practice

Beyond autoscaling, capacity planning is an ongoing practice:


1. Track resource utilization trends


# Grafana query: CPU utilization trend over 30 days
avg_over_time(
  (
    sum(rate(container_cpu_usage_seconds_total{pod=~"tr-web.*"}[5m]))
    /
    sum(kube_pod_container_resource_requests{pod=~"tr-web.*", resource="cpu"})
  )[30d:1h]
)

If your average utilization is consistently above 80%, you need more capacity. If it is consistently below 20%, you are over-provisioned and wasting money.
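That rule of thumb is easy to encode; a sketch using the 20%/80% bands from above:

```python
def capacity_verdict(used_cores, requested_cores, low=0.20, high=0.80):
    """Classify average utilization of requests over a trend window."""
    utilization = used_cores / requested_cores
    if utilization > high:
        return utilization, "under-provisioned: add capacity"
    if utilization < low:
        return utilization, "over-provisioned: shrink requests"
    return utilization, "healthy"

# e.g. pods averaging 0.45 cores of usage against 0.5 cores requested:
print(capacity_verdict(0.45, 0.5))  # (0.9, 'under-provisioned: add capacity')
```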


2. Review after every major change

After launching a new feature, check if resource usage patterns changed. A new background job might increase memory usage. A new API endpoint might increase CPU usage during peak hours.


3. Plan for growth

If your traffic is growing 10% month-over-month, your autoscaler maxReplicas needs to accommodate that growth. Review your max limits quarterly and adjust.
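Growth like that compounds faster than intuition suggests, which is worth projecting forward at each quarterly review. A quick sketch (the starting replica count is an illustrative assumption):

```python
import math

def replicas_needed(current_peak_replicas, monthly_growth, months):
    """Project peak replica demand under compounding traffic growth."""
    return math.ceil(current_peak_replicas * (1 + monthly_growth) ** months)

# Peaking at 6 replicas today with 10% month-over-month growth:
for months in (3, 6, 12):
    print(months, replicas_needed(6, 0.10, months))
# After a year that is ~19 replicas, well past a maxReplicas of 10.
```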


Putting it all together

Here is the complete capacity management setup:


  1. VPA in recommendation mode tells you what resources your pods actually need
  2. Resource requests are set based on VPA recommendations and validated with load tests
  3. HPA with custom metrics scales pods based on traffic (not just CPU)
  4. KEDA cron triggers proactively scale for known traffic patterns
  5. Cluster Autoscaler adds nodes when pods cannot be scheduled
  6. Buffer pods ensure instant capacity for scale-up events
  7. k6 load tests validate the entire scaling pipeline regularly

This gives you a system that handles traffic spikes automatically, right-sizes resources based on actual usage, and gives you confidence that your SLOs will hold under load.


Closing notes

Capacity planning does not have to be guesswork. With VPA recommendations, SLO-aligned autoscaling, and regular load testing, you can be confident that your infrastructure handles whatever traffic comes its way.


The most important takeaway: autoscaling is not a substitute for understanding your workload. Know your traffic patterns, test your scaling limits, and always have a plan for what happens when traffic exceeds your maximum capacity. The answer to “what happens if we get 10x traffic?” should never be “I don’t know.”


This concludes our five-part SRE series. From SLIs/SLOs to incident management, observability, chaos engineering, and now capacity planning, you have the tools and practices to run reliable systems at any scale.


Hope you found this useful and enjoyed reading it, until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

Also, you can check the source code and changes in the sources here


