SegFault - reliability

SRE: Chaos Engineering, Breaking Things on Purpose

2026-03-02 | 12 min read

We will explore chaos engineering in Kubernetes using Litmus and Chaos Mesh, how to plan and run game days, and why breaking things on purpose is the best way to build reliable systems...

[sre] [kubernetes] [chaos-engineering] [reliability] [testing]

SRE: Dependency Management and Graceful Degradation

2026-03-17 | 25 min read

We will explore how to manage service dependencies reliably, from circuit breakers and bulkhead patterns to graceful degradation strategies and dependency SLOs, with practical Elixir and Kubernetes examples...

[sre] [reliability] [patterns] [elixir] [kubernetes]

SRE: Database Reliability

2026-03-23 | 25 min read

We will explore database reliability patterns for PostgreSQL in Kubernetes, from connection pooling and backup strategies to zero-downtime migrations, the CloudNativePG operator, and failover automation...

[sre] [database] [postgresql] [kubernetes] [reliability]

SRE: Disaster Recovery and Business Continuity

2026-04-03 | 27 min read

We will explore disaster recovery planning for Kubernetes, from RPO and RTO targets to Velero backups, etcd recovery, multi-region strategies, DR drills, and runbooks for full cluster recovery...

[sre] [kubernetes] [disaster-recovery] [backup] [reliability]

DevOps from Zero to Hero: Incident Response and On-Call

2026-06-14 | 25 min read

We will cover the fundamentals of incident response, severity levels, on-call rotations, alerting tools, runbooks, blameless postmortems, and how to build a healthy on-call culture that does not burn people out...

[devops] [incident-response] [on-call] [reliability] [beginners]