SegFault - sre

SRE: SLIs, SLOs, and Automations That Actually Help

2026-02-06 | 11 min read

We will explore how to define SLIs and SLOs as code, deploy them with ArgoCD, and use MCP servers to automate SRE workflows...

[sre] [kubernetes] [argocd] [observability] [automation]

SRE: Incident Management, On-Call, and Postmortems as Code

2026-02-23 | 16 min read

We will explore how to build an effective incident management workflow, set up on-call rotations that don't burn people out, write runbooks as code, and run blameless postmortems...

[sre] [kubernetes] [observability] [automation] [incidents]

SRE: Observability Deep Dive: Traces, Logs, and Metrics

2026-02-28 | 12 min read

We will explore the three pillars of observability, how to instrument your applications with OpenTelemetry, build useful dashboards in Grafana, and set up log aggregation that actually helps during incidents...

[sre] [kubernetes] [observability] [opentelemetry] [grafana]

SRE: Chaos Engineering, Breaking Things on Purpose

2026-03-02 | 12 min read

We will explore chaos engineering in Kubernetes using Litmus and Chaos Mesh, how to plan and run game days, and why breaking things on purpose is the best way to build reliable systems...

[sre] [kubernetes] [chaos-engineering] [reliability] [testing]

SRE: Capacity Planning, Autoscaling, and Load Testing

2026-03-05 | 13 min read

We will explore how to right-size your Kubernetes workloads, configure HPA and VPA for automatic scaling, use KEDA for event-driven scaling, and load test with k6 to validate your capacity...

[sre] [kubernetes] [scaling] [load-testing] [performance]

SRE: Secrets Management in Kubernetes

2026-03-07 | 21 min read

We will explore secrets management in Kubernetes, from Sealed Secrets and External Secrets Operator to HashiCorp Vault integration, secret rotation strategies, and SOPS for encrypting secrets in Git...

[sre] [kubernetes] [security] [secrets] [vault]

SRE: GitOps with ArgoCD

2026-03-09 | 11 min read

We will explore GitOps principles with ArgoCD, from Application CRDs and App of Apps patterns to sync strategies, multi-cluster management with ApplicationSets, and monitoring your GitOps workflows...

[sre] [kubernetes] [argocd] [gitops] [ci-cd]

SRE: Cost Optimization in the Cloud

2026-03-13 | 20 min read

We will explore FinOps principles and cost optimization strategies for Kubernetes and cloud infrastructure, from right-sizing workloads and spot instances to Kubecost visibility and cost-aware SLOs...

[sre] [kubernetes] [cloud] [cost-optimization] [finops]

SRE: Dependency Management and Graceful Degradation

2026-03-17 | 24 min read

We will explore how to manage service dependencies reliably, from circuit breakers and bulkhead patterns to graceful degradation strategies and dependency SLOs, with practical Elixir and Kubernetes examples...

[sre] [reliability] [patterns] [elixir] [kubernetes]

SRE: Release Engineering and Progressive Delivery

2026-03-21 | 11 min read

We will explore release engineering practices for reliable deployments, from canary releases with Argo Rollouts and blue-green deployments to feature flags, rollback automation, and deployment SLOs...

[sre] [kubernetes] [deployment] [ci-cd] [argocd]

SRE: Database Reliability

2026-03-23 | 25 min read

We will explore database reliability patterns for PostgreSQL in Kubernetes, from connection pooling and backup strategies to zero-downtime migrations, the CloudNativePG operator, and failover automation...

[sre] [database] [postgresql] [kubernetes] [reliability]

SRE: Security as Code

2026-03-29 | 23 min read

We will explore security as code practices for Kubernetes, from OPA Gatekeeper policies and Pod Security Standards to image scanning with Trivy, network policies, runtime security with Falco, and supply chain security...

[sre] [kubernetes] [security] [opa] [policy]

SRE: Disaster Recovery and Business Continuity

2026-04-03 | 27 min read

We will explore disaster recovery planning for Kubernetes, from RPO and RTO targets to Velero backups, etcd recovery, multi-region strategies, DR drills, and runbooks for full cluster recovery...

[sre] [kubernetes] [disaster-recovery] [backup] [reliability]

SRE: Toil Reduction and Automation

2026-04-09 | 20 min read

We will explore toil reduction strategies from the Google SRE book, covering everything from identifying and measuring toil to building self-healing systems, internal tooling with Elixir, automation safety patterns, and the 50 percent rule...

[sre] [automation] [platform-engineering] [toil] [elixir]