SRE: Database Reliability
Introduction
In the previous articles we covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, and dependency management. We have covered a lot of ground, but there is one critical piece we have not yet tackled: the database.
Your database is probably the hardest single point of failure in your entire stack. You can scale stateless services horizontally, you can restart crashed pods, you can even lose a whole node and recover in seconds. But if your database goes down, everything stops. If you lose data, it might be gone forever. And if your migrations lock a table for five minutes during peak traffic, your users will notice.
In this article we will walk through the patterns and tools that make PostgreSQL reliable in Kubernetes. We will cover connection pooling, read replicas, backup strategies, zero-downtime migrations, monitoring, the CloudNativePG operator, and failover automation. These are the building blocks that let you sleep at night even when your app handles real traffic and real data.
Let’s get into it.
Connection pooling with PgBouncer
PostgreSQL creates a new process for every connection. That works fine when you have 10 connections, but
in Kubernetes where you might have dozens of pods each running multiple processes, you can easily exhaust
the server’s connection limit. The default max_connections in PostgreSQL is 100, and each connection
consumes around 5-10MB of RAM.
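To make that concrete, here is a quick back-of-the-envelope estimate sketched in Python (the 7 MB per connection figure is an assumption in the middle of that range, not a measured value):

```python
def connection_memory_mb(connections, mb_per_connection=7):
    """Estimate RAM consumed by PostgreSQL backend processes.

    mb_per_connection is a rough assumption; actual usage depends on
    work_mem, catalog caches, and query complexity.
    """
    return connections * mb_per_connection

# 100 connections at ~7 MB each is ~700 MB before a single query runs
print(connection_memory_mb(100))  # 700
```

Dozens of pods each holding a pool of 10 connections blows past the default limit quickly, which is exactly the problem PgBouncer solves.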
PgBouncer sits between your application and PostgreSQL, multiplexing many client connections onto a smaller number of server connections. It is lightweight (a single PgBouncer process can handle thousands of client connections) and battle-tested in production at massive scale.
Here is a basic PgBouncer configuration:
# pgbouncer.ini
[databases]
myapp = host=postgresql-primary port=5432 dbname=myapp_production
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Pool mode: transaction is the most common for web apps
pool_mode = transaction
# Pool sizing
default_pool_size = 20
min_pool_size = 5
max_client_conn = 1000
max_db_connections = 50
# Timeouts
server_idle_timeout = 300
client_idle_timeout = 0
query_timeout = 30
# Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 60
The pool_mode setting is crucial:
- transaction mode releases the server connection back to the pool after each transaction completes. This is what you want for most web applications because it maximizes connection reuse. However, it does not support session-level features like prepared statements, advisory locks, or LISTEN/NOTIFY.
- session mode keeps the server connection assigned for the entire client session. This supports all PostgreSQL features but provides less connection multiplexing. Use this if your app relies on session-level features.
- statement mode releases the connection after each statement. This provides the best multiplexing but only works for simple read-only queries with no multi-statement transactions.
For Ecto (the database layer in Phoenix/Elixir), transaction mode works well with one caveat: Postgrex uses named prepared statements by default, which transaction pooling does not support. Set prepare: :unnamed in your repo config (or run PgBouncer 1.21+, which can track prepared statements via max_prepared_statements), and transaction mode works cleanly because Ecto wraps each request in an explicit transaction or uses simple queries.
Running PgBouncer as a sidecar in Kubernetes
The sidecar pattern puts a PgBouncer container in the same pod as your application. This means each application pod gets its own PgBouncer instance, and the connection to PgBouncer is over localhost (zero network latency). Here is the pod spec:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tr-web
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: tr-web
template:
metadata:
labels:
app: tr-web
spec:
containers:
- name: app
image: kainlite/tr:latest
ports:
- containerPort: 4000
env:
# Point the app at PgBouncer on localhost instead of directly at PostgreSQL
- name: DATABASE_URL
value: "ecto://myapp:password@localhost:6432/myapp_production"
resources:
requests:
memory: "256Mi"
cpu: "200m"
limits:
memory: "512Mi"
- name: pgbouncer
image: bitnami/pgbouncer:latest
ports:
- containerPort: 6432
env:
- name: POSTGRESQL_HOST
value: "postgresql-primary.database.svc.cluster.local"
- name: POSTGRESQL_PORT
value: "5432"
- name: PGBOUNCER_DATABASE
value: "myapp_production"
- name: PGBOUNCER_POOL_MODE
value: "transaction"
- name: PGBOUNCER_DEFAULT_POOL_SIZE
value: "20"
- name: PGBOUNCER_MAX_CLIENT_CONN
value: "500"
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
How to size your connection pool
A common mistake is setting the pool too large. More connections does not mean better performance. In fact, too many connections cause contention on locks and shared buffers, which hurts throughput.
A good starting formula:
# Total server connections = number of CPU cores on the DB server * 2 + effective_spindle_count
# For a 4-core server with SSD:
# max_useful_connections = 4 * 2 + 1 = 9 (in practice, allow ~20 for headroom)
# Then distribute across your application pods:
# per_pod_pool_size = total_server_connections / number_of_pods
# For 3 pods with 50 max DB connections:
# per_pod_pool_size = 50 / 3 ≈ 16
# In your Ecto repo config:
config :myapp, MyApp.Repo,
pool_size: 16,
queue_target: 500, # Target queue time in ms
queue_interval: 1000 # How often to check queue health
Monitor the actual connection usage with SHOW POOLS; in PgBouncer or with PostgreSQL’s
pg_stat_activity view. Adjust based on real data, not guesses.
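The arithmetic above is easy to get wrong under pressure, so here is the same sizing logic as a small Python sketch (the numbers mirror the example; the function names are mine):

```python
def max_useful_connections(cpu_cores, spindle_count=1):
    """PostgreSQL wiki heuristic: cores * 2 + effective spindle count."""
    return cpu_cores * 2 + spindle_count

def per_pod_pool_size(total_db_connections, pod_count):
    """Split the server-side connection budget evenly across app pods,
    rounding down so the combined pools never exceed the server limit."""
    return total_db_connections // pod_count

print(max_useful_connections(4))   # 9
print(per_pod_pool_size(50, 3))    # 16
```

Integer division rounds down on purpose: three pods with a pool of 16 use at most 48 of the 50 server connections, leaving slots free for migrations and maintenance.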
Read replicas and load balancing
For read-heavy workloads (which most web applications are), you can offload read queries to replicas while keeping writes on the primary. PostgreSQL’s streaming replication makes this straightforward.
Setting up streaming replication
On the primary, enable replication in postgresql.conf:
# postgresql.conf on primary
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
hot_standby = on
# Archive WAL for PITR (more on this in the backup section)
archive_mode = on
archive_command = 'pgbackrest --stanza=myapp archive-push %p'
On the replica, set up the recovery configuration:
# postgresql.conf on replica
primary_conninfo = 'host=postgresql-primary port=5432 user=replicator password=secret'
primary_slot_name = 'replica_1'
hot_standby = on
hot_standby_feedback = on
Read/write splitting in Ecto
Ecto supports multiple repositories, so you can define a read-only repo that points to replicas:
# lib/myapp/repo.ex
defmodule MyApp.Repo do
use Ecto.Repo,
otp_app: :myapp,
adapter: Ecto.Adapters.Postgres
end
# lib/myapp/read_repo.ex
defmodule MyApp.ReadRepo do
use Ecto.Repo,
otp_app: :myapp,
adapter: Ecto.Adapters.Postgres,
read_only: true
end
# config/runtime.exs
config :myapp, MyApp.Repo,
url: System.get_env("DATABASE_URL"),
pool_size: 16
config :myapp, MyApp.ReadRepo,
url: System.get_env("DATABASE_REPLICA_URL"),
pool_size: 20 # Replicas can handle more connections since they only serve reads
Then in your application code, use the appropriate repo:
# Write operations go to the primary
MyApp.Repo.insert(%User{name: "Gabriel"})
MyApp.Repo.update(changeset)
MyApp.Repo.delete(user)
# Read operations go to replicas
MyApp.ReadRepo.all(User)
MyApp.ReadRepo.get(User, 1)
# For operations that need to read their own writes (e.g., after an insert),
# use the primary repo to avoid replication lag issues
def create_and_return_user(attrs) do
{:ok, user} = MyApp.Repo.insert(%User{} |> User.changeset(attrs))
# Read from primary, not replica, to avoid stale data
MyApp.Repo.get!(User, user.id)
end
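The read-your-own-writes rule can also be enforced centrally instead of at each call site. A minimal sketch in Python (the 5-second stickiness window and the repo handles are illustrative assumptions, not Ecto features):

```python
import time

class RepoRouter:
    """Route reads to a replica unless the caller wrote recently.

    The stickiness window is a policy choice: it should comfortably
    exceed your typical replication lag.
    """
    def __init__(self, stickiness_seconds=5):
        self.stickiness = stickiness_seconds
        self.last_write = {}  # user_id -> timestamp of the last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def repo_for_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.stickiness:
            return "primary"  # avoid stale reads from a lagging replica
        return "replica"

router = RepoRouter()
router.record_write(42)
print(router.repo_for_read(42))  # primary (this user just wrote)
print(router.repo_for_read(7))   # replica (no recent write)
```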
Load balancing across replicas with pgpool-II
If you have multiple replicas, pgpool-II can distribute read queries across them:
# pgpool.conf
backend_hostname0 = 'postgresql-primary'
backend_port0 = 5432
backend_weight0 = 0
backend_flag0 = 'ALWAYS_PRIMARY'
backend_hostname1 = 'postgresql-replica-1'
backend_port1 = 5432
backend_weight1 = 1
backend_hostname2 = 'postgresql-replica-2'
backend_port2 = 5432
backend_weight2 = 1
# Load balancing
load_balance_mode = on
statement_level_load_balance = on
# Health check
health_check_period = 10
health_check_timeout = 5
health_check_max_retries = 3
With this setup, pgpool-II sends all writes to the primary and distributes reads across the two replicas with equal weight. The health check ensures that unhealthy replicas are removed from the pool automatically.
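The weight-based selection pgpool-II performs can be sketched in a few lines of Python; the point is that a backend_weight of 0 keeps the primary out of read traffic entirely (the function and backend names here are mine):

```python
import random

def pick_backend(backends, rng):
    """Weighted random choice over read backends, mimicking pgpool-II
    backend_weight semantics: weight-0 backends never serve reads."""
    candidates = [(name, weight) for name, weight in backends if weight > 0]
    total = sum(weight for _, weight in candidates)
    point = rng.uniform(0, total)
    cumulative = 0.0
    for name, weight in candidates:
        cumulative += weight
        if point <= cumulative:
            return name
    return candidates[-1][0]  # guard against float rounding at the edge

backends = [("primary", 0), ("replica-1", 1), ("replica-2", 1)]
rng = random.Random(0)
picks = [pick_backend(backends, rng) for _ in range(1000)]
print(picks.count("primary"))  # 0: all reads land on the replicas
```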
Backup strategies
There are three main types of PostgreSQL backups, and a solid strategy uses at least two of them:
- Logical backups (pg_dump): Export the database as SQL statements. Great for portability, selective restoration, and small to medium databases. Slow for large databases.
- WAL archiving (PITR): Continuously archive Write-Ahead Log files. Allows Point-in-Time Recovery to any moment in time. Essential for production databases.
- Physical backups (pgBackRest/pg_basebackup): Copy the raw data files. Fast for large databases, supports incremental backups, and works with WAL archiving for PITR.
Logical backups with pg_dump
The simplest backup approach. Good for small databases or when you need to migrate between PostgreSQL versions:
# Full database dump in custom format (compressed, supports parallel restore)
pg_dump -h postgresql-primary -U myapp -Fc -f /backups/myapp_$(date +%Y%m%d_%H%M%S).dump myapp_production
# Restore from a dump
pg_restore -h postgresql-primary -U myapp -d myapp_production --clean --if-exists /backups/myapp_20260318_030000.dump
# For very large databases, use parallel dump/restore
pg_dump -h postgresql-primary -U myapp -Fd -j 4 -f /backups/myapp_parallel/ myapp_production
pg_restore -h postgresql-primary -U myapp -d myapp_production -j 4 /backups/myapp_parallel/
WAL archiving for Point-in-Time Recovery
WAL archiving captures every change made to the database. Combined with a base backup, you can restore to any point in time. This is how you recover from “oops, someone ran DELETE without a WHERE clause”:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
# For S3-based archiving (recommended for production):
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f --sse AES256'
To restore to a specific point in time:
# postgresql.conf plus an empty recovery.signal file (PG12+); recovery.conf in PG11 and earlier
restore_command = 'aws s3 cp s3://my-wal-archive/%f %p'
recovery_target_time = '2026-03-18 14:30:00 UTC'
recovery_target_action = 'promote'
Physical backups with pgBackRest
pgBackRest is the gold standard for PostgreSQL physical backups. It supports full, incremental, and differential backups, parallel backup and restore, compression, encryption, and S3-compatible storage:
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-bucket=myapp-pg-backups
repo1-s3-region=us-east-1
repo1-s3-key=AKIAIOSFODNN7EXAMPLE
repo1-s3-key-secret=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
repo1-retention-full=4
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=your-encryption-passphrase
process-max=4
compress-type=zst
compress-level=6
[myapp]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
# Create the stanza (one-time setup)
pgbackrest --stanza=myapp stanza-create
# Full backup
pgbackrest --stanza=myapp --type=full backup
# Incremental backup (only changes since last full or incremental)
pgbackrest --stanza=myapp --type=incr backup
# Differential backup (only changes since last full)
pgbackrest --stanza=myapp --type=diff backup
# List backups
pgbackrest --stanza=myapp info
# Restore to latest
pgbackrest --stanza=myapp --delta restore
# Restore to a specific point in time
pgbackrest --stanza=myapp --delta --type=time --target="2026-03-18 14:30:00" restore
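The difference between incremental and differential backups matters most at restore time, because it determines how many backups must be replayed. A simplified Python model of the dependency rules (an illustration of the concept, not how pgBackRest is implemented):

```python
def restore_chain(backups, target_index=None):
    """Return the minimal chain of backups needed to restore the target.

    Rules modeled here:
    - full: self-contained
    - diff: depends on the most recent full
    - incr: depends on the immediately preceding backup of any type
    """
    if target_index is None:
        target_index = len(backups) - 1
    chain = []
    i = target_index
    while i >= 0:
        kind = backups[i]
        chain.append(i)
        if kind == "full":
            break
        if kind == "diff":
            i -= 1
            while backups[i] != "full":  # jump back to the last full
                i -= 1
        else:  # incr
            i -= 1
    return list(reversed(chain))

backups = ["full", "incr", "incr", "diff", "incr"]
print(restore_chain(backups))  # [0, 3, 4]: full, then diff, then the incr
```

This is why differentials cap restore-chain length: the incr at index 4 needs only the diff at 3 and the full at 0, not every incremental in between.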
Scheduling backups with Kubernetes CronJobs
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup-full
namespace: database
spec:
schedule: "0 2 * * 0" # Full backup every Sunday at 2am
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 2
template:
spec:
containers:
- name: backup
image: pgbackrest/pgbackrest:latest
command:
- /bin/sh
- -c
- |
pgbackrest --stanza=myapp --type=full backup
RESULT=$?
if [ $RESULT -ne 0 ]; then
echo "Backup failed with exit code $RESULT"
# Send alert to Slack or PagerDuty
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-type: application/json' \
-d '{"text":"ALERT: PostgreSQL full backup failed!"}'
fi
exit $RESULT
envFrom:
- secretRef:
name: pgbackrest-credentials
- secretRef:
name: slack-webhook
volumeMounts:
- name: pgbackrest-config
mountPath: /etc/pgbackrest
volumes:
- name: pgbackrest-config
configMap:
name: pgbackrest-config
restartPolicy: Never
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup-incremental
namespace: database
spec:
schedule: "0 2 * * 1-6" # Incremental backup every other day at 2am
concurrencyPolicy: Forbid
jobTemplate:
spec:
backoffLimit: 2
template:
spec:
containers:
- name: backup
image: pgbackrest/pgbackrest:latest
command:
- /bin/sh
- -c
- |
pgbackrest --stanza=myapp --type=incr backup
RESULT=$?
if [ $RESULT -ne 0 ]; then
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-type: application/json' \
-d '{"text":"ALERT: PostgreSQL incremental backup failed!"}'
fi
exit $RESULT
envFrom:
- secretRef:
name: pgbackrest-credentials
- secretRef:
name: slack-webhook
volumeMounts:
- name: pgbackrest-config
mountPath: /etc/pgbackrest
volumes:
- name: pgbackrest-config
configMap:
name: pgbackrest-config
restartPolicy: Never
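Cron day-of-week fields are a classic source of confusion (0 is Sunday, so 1-6 means Monday through Saturday). A minimal Python expansion of the fields used in the schedules above, without step (/) syntax:

```python
def expand_cron_field(field, lo, hi):
    """Expand a single cron field ("*", "3", "1-6", "0,6") into a set.

    Deliberately minimal: no step (/) syntax, no 7-as-Sunday aliasing.
    """
    if field == "*":
        return set(range(lo, hi + 1))
    values = set()
    for part in field.split(","):
        if "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values

# Day-of-week fields of the two CronJobs (0 = Sunday):
print(sorted(expand_cron_field("0", 0, 6)))    # [0] -> full backup on Sunday
print(sorted(expand_cron_field("1-6", 0, 6)))  # [1, 2, 3, 4, 5, 6] -> incrementals Mon-Sat
```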
Backup validation and restore testing
Here is a hard truth: a backup that has never been tested is not a backup. It is a hope. And hope is not a strategy.
You need to regularly restore your backups to a temporary database and verify that the data is intact. This should be automated, not something you do manually once a year when someone remembers.
Automated restore testing with a CronJob
# restore-test-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-restore-test
namespace: database
spec:
schedule: "0 6 * * 3" # Every Wednesday at 6am
concurrencyPolicy: Forbid
jobTemplate:
spec:
backoffLimit: 1
activeDeadlineSeconds: 3600 # Timeout after 1 hour
template:
spec:
containers:
- name: restore-test
image: pgbackrest/pgbackrest:latest
command:
- /bin/sh
- -c
- |
set -e
echo "Starting restore test at $(date)"
# Initialize a temporary PostgreSQL data directory
export PGDATA=/tmp/pg_restore_test
mkdir -p $PGDATA
# Restore the latest backup to the temp directory
pgbackrest --stanza=myapp --delta \
--pg1-path=$PGDATA \
--target-action=promote \
restore
# Start PostgreSQL on a non-standard port
pg_ctl -D $PGDATA -o "-p 5433" -w start
# Run validation queries
USERS_COUNT=$(psql -p 5433 -d myapp_production -tAc "SELECT count(*) FROM users;")
POSTS_COUNT=$(psql -p 5433 -d myapp_production -tAc "SELECT count(*) FROM posts;")
LATEST_RECORD=$(psql -p 5433 -d myapp_production -tAc \
"SELECT max(inserted_at) FROM users;")
echo "Validation results:"
echo " Users count: $USERS_COUNT"
echo " Posts count: $POSTS_COUNT"
echo " Latest record: $LATEST_RECORD"
# Verify data is recent (not older than 48 hours)
IS_RECENT=$(psql -p 5433 -d myapp_production -tAc \
"SELECT max(inserted_at) > now() - interval '48 hours' FROM users;")
# Stop the temp PostgreSQL
pg_ctl -D $PGDATA -w stop
# Clean up
rm -rf $PGDATA
if [ "$IS_RECENT" = "t" ]; then
echo "RESTORE TEST PASSED: Data is recent and valid"
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-type: application/json' \
-d "{\"text\":\"Restore test PASSED. Users: $USERS_COUNT, Posts: $POSTS_COUNT, Latest: $LATEST_RECORD\"}"
else
echo "RESTORE TEST FAILED: Data is stale or missing"
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-type: application/json' \
-d '{"text":"ALERT: Restore test FAILED. Data is stale or missing!"}'
exit 1
fi
envFrom:
- secretRef:
name: pgbackrest-credentials
- secretRef:
name: slack-webhook
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
restartPolicy: Never
The key things this restore test validates:
- The backup is restorable: If pgBackRest cannot restore, you know immediately
- The data is recent: If the latest record is older than 48 hours, something is wrong with your backup pipeline
- Core tables exist and have data: A basic sanity check that the schema and data are intact
- Notification on success and failure: You want to know both when it works and when it does not
You should also track restore test results as a metric and set up an SLO for it. Something like “99% of weekly restore tests should succeed” is a good starting point.
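Evaluating that SLO is a one-liner worth automating. A sketch, assuming one boolean result per weekly test and a 52-week rolling window (both are policy choices, not fixed rules):

```python
def restore_test_slo_ok(results, target=0.99):
    """Check restore-test results against a success-ratio SLO.

    results is a list of booleans, True meaning the test passed.
    An empty window counts as failing: no evidence is not success.
    """
    if not results:
        return False
    return sum(results) / len(results) >= target

# 51 of 52 weekly tests passing is ~98.1%, which misses a 99% target
print(restore_test_slo_ok([True] * 51 + [False]))  # False
print(restore_test_slo_ok([True] * 52))            # True
```

Note how strict 99% is on a weekly cadence: a single failed test in a year puts you out of budget, so you may prefer 95% or a shorter window when starting out.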
Zero-downtime migrations
Database migrations are one of the most common causes of downtime. A migration that locks a table can block all queries to that table, which means your application hangs until the migration completes. In PostgreSQL, even seemingly innocent operations like adding a column with a default value used to require rewriting the entire table under an ACCESS EXCLUSIVE lock (PostgreSQL 11 fixed this by storing the default in the catalog instead of writing it to every row).
Here are the safe migration patterns for Ecto:
Safe: Adding a nullable column without a default
# This is always safe. It takes a brief ACCESS EXCLUSIVE lock but completes almost instantly.
defmodule MyApp.Repo.Migrations.AddAvatarToUsers do
use Ecto.Migration
def change do
alter table(:users) do
add :avatar_url, :string
end
end
end
Safe: Adding a column with a default (PostgreSQL 11+)
# In PostgreSQL 11+, this is safe because the default is stored in the catalog,
# not written to every row. The lock is brief.
defmodule MyApp.Repo.Migrations.AddRoleToUsers do
use Ecto.Migration
def change do
alter table(:users) do
add :role, :string, default: "user"
end
end
end
Dangerous: Adding an index on a large table
A regular CREATE INDEX locks the table for writes. On a table with millions of rows, this can take
minutes. Use CREATE INDEX CONCURRENTLY instead:
# WRONG: This locks the table for writes
defmodule MyApp.Repo.Migrations.AddIndexToPostsTitle do
use Ecto.Migration
def change do
create index(:posts, [:title])
end
end
# RIGHT: Use concurrently to avoid locking
defmodule MyApp.Repo.Migrations.AddIndexToPostsTitle do
use Ecto.Migration
# disable_ddl_transaction is required for concurrent index creation
@disable_ddl_transaction true
@disable_migration_lock true
def change do
create index(:posts, [:title], concurrently: true)
end
end
Safe pattern: Backfilling data in batches
Never backfill data in a migration that runs inside a transaction. Instead, write a separate migration or task that processes rows in batches:
# Step 1: Add the new column (fast, safe)
defmodule MyApp.Repo.Migrations.AddSlugToPosts do
use Ecto.Migration
def change do
alter table(:posts) do
add :slug, :string
end
end
end
# Step 2: Backfill in batches (separate migration or mix task)
defmodule MyApp.Repo.Migrations.BackfillPostSlugs do
use Ecto.Migration
@disable_ddl_transaction true
@disable_migration_lock true
@batch_size 1000
def up do
backfill_batch(0)
end
defp backfill_batch(last_id) do
repo().query!("""
UPDATE posts
SET slug = lower(replace(title, ' ', '-'))
WHERE id > $1
AND id <= $1 + $2
AND slug IS NULL
""", [last_id, @batch_size])
# query!/2 returns a %Postgrex.Result{}, and a batch can update 0 rows
# (deleted ids, or rows already slugged), so check whether any work remains
# instead of stopping at the first empty batch
%{rows: [[remaining]]} =
repo().query!("SELECT count(*) FROM posts WHERE slug IS NULL AND id > $1", [last_id + @batch_size])
if remaining > 0 do
# Small sleep to avoid overwhelming the database
Process.sleep(100)
backfill_batch(last_id + @batch_size)
end
end
def down do
# Nothing to undo, the column drop will handle cleanup
:ok
end
end
# Step 3: Add the NOT NULL constraint and index (after backfill is complete)
defmodule MyApp.Repo.Migrations.MakePostSlugRequired do
use Ecto.Migration
@disable_ddl_transaction true
@disable_migration_lock true
def up do
# Add the check constraint as NOT VALID first (takes only a brief lock, no table scan)
execute "ALTER TABLE posts ADD CONSTRAINT posts_slug_not_null CHECK (slug IS NOT NULL) NOT VALID"
# Then validate it (scans the table, but with a lock that does not block reads or writes)
execute "ALTER TABLE posts VALIDATE CONSTRAINT posts_slug_not_null"
# Create unique index concurrently
create unique_index(:posts, [:slug], concurrently: true)
end
def down do
drop_if_exists index(:posts, [:slug])
execute "ALTER TABLE posts DROP CONSTRAINT IF EXISTS posts_slug_not_null"
end
end
Migration safety checklist
Before running any migration in production:
- Check the lock type: Will the migration take an ACCESS EXCLUSIVE lock? For how long?
- Test on a copy of production data: Never test migrations on an empty database. A migration that runs instantly on 100 rows might lock the table for minutes on 10 million rows.
- Use statement_timeout: Set a statement timeout so that if a migration takes too long, it fails instead of locking the table indefinitely.
- Run during low traffic: Even “safe” migrations are safer during off-peak hours.
- Have a rollback plan: Know how to undo the migration before you run it.
# Set a statement timeout for migrations to prevent long locks
# config/runtime.exs
config :myapp, MyApp.Repo,
migration_lock: nil,
migration_default_prefix: "public",
after_connect: {Postgrex, :query!, ["SET statement_timeout TO '5s'", []]}
Database monitoring
You cannot fix what you cannot see. PostgreSQL comes with excellent built-in monitoring views, and combining them with Prometheus gives you a comprehensive picture of your database health.
pg_stat_statements: finding slow queries
pg_stat_statements is the most important PostgreSQL extension for performance monitoring. It tracks
execution statistics for every query that runs on the server:
-- Enable the extension (requires pg_stat_statements in shared_preload_libraries
-- and a server restart before this takes effect)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by total execution time
SELECT
queryid,
calls,
round(total_exec_time::numeric, 2) AS total_time_ms,
round(mean_exec_time::numeric, 2) AS mean_time_ms,
round(max_exec_time::numeric, 2) AS max_time_ms,
rows,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percent_total,
left(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Top 10 queries by average execution time (slow queries)
SELECT
queryid,
calls,
round(mean_exec_time::numeric, 2) AS mean_time_ms,
round(max_exec_time::numeric, 2) AS max_time_ms,
rows / NULLIF(calls, 0) AS avg_rows,
left(query, 100) AS query_preview
FROM pg_stat_statements
WHERE calls > 10 -- Filter out rarely executed queries
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Queries with the most I/O
SELECT
queryid,
calls,
shared_blks_read + shared_blks_written AS total_blocks,
round(mean_exec_time::numeric, 2) AS mean_time_ms,
left(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY (shared_blks_read + shared_blks_written) DESC
LIMIT 10;
Connection monitoring
Knowing how your connections are being used is critical for sizing your pool correctly and detecting connection leaks:
-- Current connection count by state
SELECT
state,
count(*) AS connections,
max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY state
ORDER BY connections DESC;
-- Connections by application name (useful for identifying which service uses the most)
SELECT
application_name,
state,
count(*) AS connections
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY application_name, state
ORDER BY connections DESC;
-- Find long-running queries (potential problems)
SELECT
pid,
now() - query_start AS duration,
state,
left(query, 80) AS query_preview,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
-- Find blocked queries (waiting for locks)
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
left(blocked_activity.query, 60) AS blocked_query,
left(blocking_activity.query, 60) AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Replication lag monitoring
If you are using read replicas, monitoring replication lag is essential. A replica that is too far behind can serve stale data:
-- On the primary: check replication status
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag_pretty
FROM pg_stat_replication;
-- On the replica: check how far behind it is
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_delay,
pg_is_in_recovery() AS is_replica,
pg_last_wal_receive_lsn() AS last_received,
pg_last_wal_replay_lsn() AS last_replayed;
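Those two queries give you lag in bytes and in seconds; what your application actually needs is a yes/no decision about whether a replica is safe to read from. A Python sketch with illustrative thresholds (tune both to your staleness tolerance):

```python
def replica_usable(replay_lag_bytes, delay_seconds,
                   max_lag_bytes=16 * 1024 * 1024, max_delay_seconds=30):
    """Decide whether a replica is fresh enough to serve reads.

    The 16 MB / 30 s defaults are assumptions for illustration; a
    reporting dashboard can tolerate far more lag than a user-facing page.
    """
    return replay_lag_bytes <= max_lag_bytes and delay_seconds <= max_delay_seconds

print(replica_usable(replay_lag_bytes=1024, delay_seconds=2))              # True
print(replica_usable(replay_lag_bytes=64 * 1024 * 1024, delay_seconds=5))  # False
```

Checking both dimensions matters: byte lag catches a replica drowning in WAL volume, while time lag catches a replica that is quietly stuck with little traffic.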
Prometheus exporter for PostgreSQL
The postgres_exporter from Prometheus Community exposes all these metrics in Prometheus format.
Deploy it alongside your PostgreSQL instances:
# postgres-exporter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres-exporter
namespace: database
spec:
replicas: 1
selector:
matchLabels:
app: postgres-exporter
template:
metadata:
labels:
app: postgres-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9187"
spec:
containers:
- name: exporter
image: prometheuscommunity/postgres-exporter:latest
ports:
- containerPort: 9187
env:
- name: DATA_SOURCE_URI
value: "postgresql-primary.database.svc:5432/myapp_production?sslmode=disable"
- name: DATA_SOURCE_USER
valueFrom:
secretKeyRef:
name: postgres-exporter-credentials
key: username
- name: DATA_SOURCE_PASS
valueFrom:
secretKeyRef:
name: postgres-exporter-credentials
key: password
- name: PG_EXPORTER_EXTEND_QUERY_PATH
value: /etc/postgres-exporter/queries.yaml
volumeMounts:
- name: custom-queries
mountPath: /etc/postgres-exporter
volumes:
- name: custom-queries
configMap:
name: postgres-exporter-queries
---
# Custom queries for the exporter
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-exporter-queries
namespace: database
data:
queries.yaml: |
pg_slow_queries:
query: |
SELECT count(*) AS count
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '30 seconds'
metrics:
- count:
usage: "GAUGE"
description: "Number of queries running longer than 30 seconds"
pg_connection_count:
query: |
SELECT state, count(*) AS count
FROM pg_stat_activity
GROUP BY state
metrics:
- count:
usage: "GAUGE"
description: "Number of connections by state"
master: true
pg_database_size:
query: |
SELECT pg_database.datname,
pg_database_size(pg_database.datname) AS size_bytes
FROM pg_database
WHERE datistemplate = false
metrics:
- datname:
usage: "LABEL"
description: "Database name"
- size_bytes:
usage: "GAUGE"
description: "Database size in bytes"
With this setup, you can create Prometheus alerts for:
- High replication lag: Alert when a replica is more than 30 seconds behind
- Connection exhaustion: Alert when connections are above 80% of max_connections
- Slow queries: Alert when there are queries running longer than 60 seconds
- Database size growth: Alert when the database is growing faster than expected
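The connection exhaustion alert is worth spelling out, because the threshold should leave headroom below max_connections (superuser_reserved_connections and maintenance jobs need slots too). A sketch with an assumed 80% threshold:

```python
def connection_alert(current_connections, max_connections, threshold=0.8):
    """Fire when connection usage crosses a fraction of max_connections.

    The 0.8 threshold is an assumption; the goal is to alert while
    there is still room to log in and investigate.
    """
    return current_connections >= max_connections * threshold

print(connection_alert(85, 100))  # True: 85 >= 80
print(connection_alert(60, 100))  # False
```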
CloudNativePG operator
CloudNativePG (CNPG) is a Kubernetes operator that manages the full lifecycle of PostgreSQL clusters. It handles provisioning, scaling, failover, backups, and monitoring. If you are running PostgreSQL in Kubernetes, this is the operator you should be using.
Installation
# Install with Helm
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
helm install cnpg cnpg/cloudnative-pg \
--namespace cnpg-system \
--create-namespace
Creating a PostgreSQL cluster
Here is a production-ready Cluster CRD:
# postgresql-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: myapp-db
namespace: database
spec:
instances: 3 # 1 primary + 2 replicas
imageName: ghcr.io/cloudnative-pg/postgresql:16.2
postgresql:
parameters:
max_connections: "200"
shared_buffers: "512MB"
effective_cache_size: "1536MB"
maintenance_work_mem: "128MB"
checkpoint_completion_target: "0.9"
wal_buffers: "16MB"
default_statistics_target: "100"
random_page_cost: "1.1" # SSD optimized
effective_io_concurrency: "200"
work_mem: "4MB"
min_wal_size: "1GB"
max_wal_size: "4GB"
max_worker_processes: "4"
max_parallel_workers_per_gather: "2"
max_parallel_workers: "4"
max_parallel_maintenance_workers: "2"
# Enable pg_stat_statements
shared_preload_libraries: "pg_stat_statements"
pg_stat_statements.track: "all"
pg_stat_statements.max: "10000"
pg_hba:
- "host all all 10.0.0.0/8 scram-sha-256"
- "host replication streaming_replica 10.0.0.0/8 scram-sha-256"
bootstrap:
initdb:
database: myapp_production
owner: myapp
secret:
name: myapp-db-credentials
storage:
size: 50Gi
storageClass: longhorn # Or your preferred storage class
resources:
requests:
memory: "2Gi"
cpu: "1"
limits:
memory: "4Gi"
# Enable monitoring
monitoring:
enablePodMonitor: true
customQueriesConfigMap:
- name: cnpg-default-monitoring
key: queries
# Anti-affinity to spread instances across nodes
affinity:
enablePodAntiAffinity: true
topologyKey: kubernetes.io/hostname
# Backup configuration to S3
backup:
barmanObjectStore:
destinationPath: "s3://myapp-pg-backups/cnpg/"
s3Credentials:
accessKeyId:
name: aws-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: aws-s3-credentials
key: SECRET_ACCESS_KEY
wal:
compression: gzip
maxParallel: 4
data:
compression: gzip
immediateCheckpoint: true
retentionPolicy: "30d"
This creates a 3-instance PostgreSQL cluster with:
- Automatic replication: CNPG handles streaming replication between primary and replicas
- Tuned parameters: Optimized PostgreSQL configuration for a typical web workload
- Pod anti-affinity: Instances are spread across different Kubernetes nodes for resilience
- Monitoring: Pod monitors for Prometheus integration
- WAL archiving to S3: Continuous backup of WAL files for PITR
Scheduled backups
# scheduled-backup.yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: myapp-db-daily-backup
namespace: database
spec:
schedule: "0 2 * * *" # Every day at 2am
backupOwnerReference: self
cluster:
name: myapp-db
immediate: false
target: prefer-standby # Take backup from a replica to avoid impacting the primary
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: myapp-db-weekly-full
namespace: database
spec:
schedule: "0 3 * * 0" # Every Sunday at 3am
backupOwnerReference: self
cluster:
name: myapp-db
immediate: false
target: prefer-standby
Connecting your application
CNPG creates Kubernetes services for read-write and read-only access:
# The operator creates these services automatically:
# myapp-db-rw -> points to the primary (read-write)
# myapp-db-ro -> points to replicas (read-only, load balanced)
# myapp-db-r -> points to any instance (for reads that can tolerate lag)
# In your Ecto configuration:
config :myapp, MyApp.Repo,
hostname: "myapp-db-rw.database.svc.cluster.local",
database: "myapp_production",
username: "myapp",
password: System.get_env("DB_PASSWORD"),
pool_size: 16
config :myapp, MyApp.ReadRepo,
hostname: "myapp-db-ro.database.svc.cluster.local",
database: "myapp_production",
username: "myapp",
password: System.get_env("DB_PASSWORD"),
pool_size: 20
Monitoring the CNPG cluster
CNPG exposes a rich set of metrics. Here are some useful PromQL queries:
# Replication lag in seconds
cnpg_pg_replication_lag{cluster="myapp-db"}
# Number of connections by state
cnpg_pg_stat_activity_count{cluster="myapp-db"}
# Transaction rate
rate(cnpg_pg_stat_database_xact_commit{cluster="myapp-db"}[5m])
+ rate(cnpg_pg_stat_database_xact_rollback{cluster="myapp-db"}[5m])
# Cache hit ratio (should be > 99%)
cnpg_pg_stat_database_blks_hit{cluster="myapp-db"}
/ (cnpg_pg_stat_database_blks_hit{cluster="myapp-db"}
+ cnpg_pg_stat_database_blks_read{cluster="myapp-db"}) * 100
# WAL generation rate
rate(cnpg_pg_stat_archiver_archived_count{cluster="myapp-db"}[5m])
# Database size
cnpg_pg_database_size_bytes{cluster="myapp-db", datname="myapp_production"}
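These queries become far more useful when wired into alerts rather than dashboards you have to remember to look at. Here is a sketch of a replication-lag rule built on the first metric above — the 30-second threshold and 5-minute window are assumptions you should tune to your workload:

```yaml
# PrometheusRule for sustained replication lag (thresholds are illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pg-replication-lag-alert
  namespace: database
spec:
  groups:
    - name: postgresql.replication
      rules:
        - alert: PostgreSQLReplicationLagHigh
          expr: cnpg_pg_replication_lag{cluster="myapp-db"} > 30
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Replication lag on myapp-db has exceeded 30s for 5 minutes"
```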
Failover and high availability
The whole point of running multiple instances is that when the primary fails, a replica takes over automatically. This is where CloudNativePG really shines.
Automatic failover with CloudNativePG
CNPG monitors the health of all instances continuously. When it detects that the primary is unhealthy, it:
- Detects the failure: The operator checks instance health via health probes and replication status
- Selects the best replica: Chooses the replica with the least replication lag
- Promotes the replica: Runs pg_promote() to make the replica the new primary
- Updates the services: The myapp-db-rw service now points to the new primary
- Reconfigures remaining replicas: They start replicating from the new primary
- Fences the old primary: Prevents the old primary from accepting writes (split-brain prevention)
This entire process typically completes in 10-30 seconds. Your application might see a brief connection error during the switchover, so make sure your Ecto configuration has proper retry logic:
# config/runtime.exs
config :myapp, MyApp.Repo,
  hostname: "myapp-db-rw.database.svc.cluster.local",
  database: "myapp_production",
  pool_size: 16,
  # Ecto/DBConnection will retry failed checkouts
  queue_target: 5000,
  queue_interval: 5000,
  # Configure the socket options for faster failure detection
  socket_options: [
    # Send keepalive probes once the connection goes idle
    # (probe timing is platform-dependent; works on Linux)
    keepalive: true
  ],
  parameters: [
    application_name: "myapp-web"
  ]
Testing failover
You should regularly test that failover works. With CNPG, you can trigger a controlled switchover:
# Trigger a switchover (controlled failover)
kubectl cnpg promote myapp-db myapp-db-2 --namespace database
# Or use the plugin to trigger a restart of the primary (simulates a crash)
kubectl cnpg restart myapp-db myapp-db-1 --namespace database
# Check the cluster status during and after failover
kubectl cnpg status myapp-db --namespace database
The output shows you which instance is the primary, replication lag, and the overall cluster health:
# Example output of kubectl cnpg status myapp-db
Cluster Summary
Name: myapp-db
Namespace: database
PostgreSQL Image: ghcr.io/cloudnative-pg/postgresql:16.2
Primary instance: myapp-db-2 # This was promoted
Status: Cluster in healthy state
Instances: 3
Certificates Status
...
Instances Status
Name        Role     Status  Node       Timeline  LSN
----        ----     ------  ----       --------  ---
myapp-db-1  Replica  OK      worker-01  2         0/5000060
myapp-db-2  Primary  OK      worker-02  2         0/5000060
myapp-db-3  Replica  OK      worker-03  2         0/5000060
Patroni as an alternative
If you are not using CloudNativePG (maybe you are running PostgreSQL on VMs or using a different Kubernetes approach), Patroni is the go-to solution for PostgreSQL high availability. It uses a distributed consensus store (etcd, Consul, or ZooKeeper) to manage leader election and failover:
# patroni.yml
scope: myapp-cluster
name: postgresql-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgresql-node-1:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 512MB
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgresql-node-1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    superuser:
      username: postgres
      password: secret
    replication:
      username: replicator
      password: secret
The key difference is that CNPG is Kubernetes-native (it uses the Kubernetes API for coordination) while Patroni requires a separate consensus store. If you are already running in Kubernetes, CNPG is the simpler choice.
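If you do go the Patroni route, the operational equivalent of the cnpg kubectl plugin is patronictl. A sketch of the most common commands — the config path is an assumption, and postgresql-node-2 stands in for whichever replica you want to promote:

```shell
# Show cluster members, their roles, and replication lag
patronictl -c /etc/patroni.yml list myapp-cluster

# Controlled switchover to a specific replica (the healthy-primary case)
patronictl -c /etc/patroni.yml switchover myapp-cluster --candidate postgresql-node-2

# Failover (use when the primary is already down)
patronictl -c /etc/patroni.yml failover myapp-cluster
```

As with CNPG, practice the controlled switchover regularly so the first real failover is not also your first rehearsal.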
Split-brain prevention
Split-brain is the worst thing that can happen in a database cluster: two instances both think they are the primary and accept writes independently. When they reconnect, the data is inconsistent and potentially unrecoverable.
Both CNPG and Patroni have built-in split-brain prevention:
- CNPG uses fencing. When a failover happens, the old primary is fenced (its data directory is marked as invalid) so even if it comes back, it cannot serve writes. It must be reinitialized as a replica.
- Patroni uses the consensus store (etcd) as the source of truth. Only the node that holds the leader key in etcd can be the primary. If a node loses contact with etcd, it demotes itself.
Additional safeguards you should have:
- Network policies: Ensure that only the operator or Patroni can modify the service endpoints
- Monitoring: Alert on any instance that reports itself as primary when it should not be
- pg_rewind: Enable pg_rewind so that a former primary can be quickly resynchronized as a replica without a full base backup
# PrometheusRule for split-brain detection
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pg-split-brain-alert
  namespace: database
spec:
  groups:
    - name: postgresql.split-brain
      rules:
        - alert: PostgreSQLSplitBrain
          expr: |
            count(cnpg_pg_replication_is_replica{cluster="myapp-db"} == 0) > 1
          for: 30s
          labels:
            severity: critical
          annotations:
            summary: "CRITICAL: Multiple primary instances detected in myapp-db cluster"
            description: "There are {{ $value }} instances reporting as primary. This is a split-brain situation that requires immediate attention."
Closing notes
Database reliability is not a single thing you set up and forget. It is a combination of patterns that work together: connection pooling keeps your connections healthy, replicas distribute the read load, backups protect your data, safe migrations prevent self-inflicted outages, monitoring tells you when something is wrong, and automated failover keeps things running when hardware fails.
The good news is that tools like CloudNativePG make most of this much easier than it used to be. Instead of hand-configuring replication, failover scripts, and backup cron jobs, you declare your desired state in a Kubernetes manifest and the operator handles the rest. That is a massive improvement over the “artisanal PostgreSQL” approach many of us grew up with.
Start with the basics: get PgBouncer in front of your database, set up automated backups with restore testing, and add pg_stat_statements for query monitoring. Then when you are ready, move to CloudNativePG for a fully managed cluster with automated failover. Each layer builds on the previous one.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here