SRE: Dependency Management and Graceful Degradation
Introduction
In the previous articles we covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, and cost optimization. All of those focus on your own systems, your own code, your own infrastructure. But here is the thing: your service does not exist in isolation.
Every HTTP call to another service, every database query, every message published to a queue, every third-party API integration is a dependency. And every dependency is a potential failure point. When that payment gateway goes down at 2am or that internal auth service starts returning 500s under load, what happens to your service? Does it crash? Does it hang? Or does it gracefully handle the situation and keep serving users with reduced functionality?
In this article we will cover how to think about dependencies as risk, implement circuit breakers, apply the bulkhead pattern, handle timeouts and retries properly, build fallback strategies, set up dependency health checks, map your dependency graph, define SLOs for your dependencies, and implement graceful degradation using feature flags. All with practical Elixir and Kubernetes examples.
Let’s get into it.
Dependencies as risk
Not all dependencies are created equal. The first step in managing them is understanding what kind of dependency you are dealing with and what happens when it fails.
There are two fundamental types of dependencies:
- Hard dependencies: Your service cannot function at all without them. If your database is down, you probably cannot serve any requests. If the auth service is unreachable, nobody can log in.
- Soft dependencies: Your service can still function without them, possibly in a degraded state. If the recommendation engine is down, you can still show the product page without recommendations. If the analytics service is slow, you can fire and forget.
The danger comes from cascading failures. Consider this scenario: Service A calls Service B, which calls Service C. Service C starts responding slowly because of a database issue. Service B’s threads are now blocked waiting for Service C. Service B’s response times increase. Service A’s threads are now blocked waiting for Service B. Pretty soon, all three services are effectively down because of one slow database query in Service C.
This is why dependency management matters so much. A single misbehaving dependency can take down your entire system if you do not have the right protections in place. Let’s look at the patterns that prevent this.
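A first line of defense against this kind of cascade is to bound every cross-service call with a hard deadline, so a slow Service C can never hold Service B's callers hostage. Here is a minimal sketch using `Task.yield/2` plus `Task.shutdown/2` (the pattern from the Task docs); note that `Task.async` links to the caller, so a crash in the dependency call still propagates, and `Task.Supervisor.async_nolink` is the heavier-duty alternative:

```elixir
defmodule MyApp.BoundedCall do
  @moduledoc "Run a dependency call in its own process with a hard deadline."

  def call_with_deadline(fun, timeout_ms \\ 1_000) when is_function(fun, 0) do
    task = Task.async(fun)

    # If the result has not arrived by the deadline, kill the task so the
    # slow callee cannot keep consuming resources on our side.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:error, {:dependency_crashed, reason}}
      nil -> {:error, :deadline_exceeded}
    end
  end
end
```

With this in place, a call into a degraded Service C returns `{:error, :deadline_exceeded}` after roughly `timeout_ms` instead of blocking the caller indefinitely.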
Circuit breakers
The circuit breaker pattern is borrowed from electrical engineering. When too much current flows through a circuit, the breaker trips and stops the flow to prevent damage. In software, when a dependency starts failing, the circuit breaker trips and stops sending requests to it, giving it time to recover.
A circuit breaker has three states:
- Closed: Everything is normal. Requests flow through to the dependency. The breaker monitors failure rates.
- Open: The dependency is failing. Requests are immediately rejected without calling the dependency. A timer starts.
- Half-open: The timer has expired. A limited number of test requests are sent through. If they succeed, the breaker closes. If they fail, the breaker opens again.
Here is a practical implementation in Elixir using a GenServer:
defmodule MyApp.CircuitBreaker do
use GenServer
@failure_threshold 5
@reset_timeout_ms 30_000
@half_open_max_calls 3
defstruct [
:name,
state: :closed,
failure_count: 0,
success_count: 0,
last_failure_time: nil,
half_open_calls: 0
]
# Client API
def start_link(opts) do
name = Keyword.fetch!(opts, :name)
GenServer.start_link(__MODULE__, %__MODULE__{name: name}, name: name)
end
def call(name, func) when is_function(func, 0) do
case GenServer.call(name, :check_state) do
:ok ->
      try do
        result = func.()
        GenServer.cast(name, :record_success)
        {:ok, result}
      rescue
        error ->
          GenServer.cast(name, :record_failure)
          {:error, :dependency_error, error}
      catch
        # Record exits (e.g. GenServer call timeouts inside func) as failures too
        :exit, reason ->
          GenServer.cast(name, :record_failure)
          {:error, :dependency_error, reason}
      end
:open ->
{:error, :circuit_open}
end
end
# Server callbacks
@impl true
def init(state) do
{:ok, state}
end
@impl true
def handle_call(:check_state, _from, %{state: :closed} = state) do
{:reply, :ok, state}
end
def handle_call(:check_state, _from, %{state: :open} = state) do
  if time_since_last_failure(state) >= @reset_timeout_ms do
    # Let this call through as the first half-open probe
    {:reply, :ok, %{state | state: :half_open, half_open_calls: 1}}
else
{:reply, :open, state}
end
end
def handle_call(:check_state, _from, %{state: :half_open} = state) do
if state.half_open_calls < @half_open_max_calls do
{:reply, :ok, %{state | half_open_calls: state.half_open_calls + 1}}
else
{:reply, :open, state}
end
end
@impl true
def handle_cast(:record_success, %{state: :half_open} = state) do
{:noreply, %{state | state: :closed, failure_count: 0, success_count: 0}}
end
def handle_cast(:record_success, state) do
{:noreply, %{state | success_count: state.success_count + 1}}
end
# A failure while half-open re-trips the breaker immediately
def handle_cast(:record_failure, %{state: :half_open} = state) do
  now = System.monotonic_time(:millisecond)
  {:noreply, %{state | state: :open, last_failure_time: now}}
end

def handle_cast(:record_failure, state) do
  new_count = state.failure_count + 1
  now = System.monotonic_time(:millisecond)

  new_state =
    if new_count >= @failure_threshold do
      %{state | state: :open, failure_count: new_count, last_failure_time: now}
    else
      %{state | failure_count: new_count, last_failure_time: now}
    end

  {:noreply, new_state}
end
defp time_since_last_failure(%{last_failure_time: nil}), do: :infinity
defp time_since_last_failure(%{last_failure_time: time}) do
System.monotonic_time(:millisecond) - time
end
end
And here is how you would use it in your application:
# In your application supervision tree
children = [
{MyApp.CircuitBreaker, name: :payment_service},
{MyApp.CircuitBreaker, name: :auth_service},
{MyApp.CircuitBreaker, name: :recommendation_engine}
]
# When making a call to a dependency
case MyApp.CircuitBreaker.call(:payment_service, fn ->
       # Raise transport errors so the breaker's rescue clause records the
       # failure (HTTPoison.Error is an exception struct, so it can be raised)
       case HTTPoison.post("https://payments.internal/charge", body, headers) do
         {:ok, response} -> response
         {:error, error} -> raise error
       end
     end) do
  {:ok, %{status_code: 200, body: body}} ->
    {:ok, Jason.decode!(body)}

  {:ok, %{status_code: status}} ->
    Logger.error("Payment service returned #{status}")
    {:error, :payment_failed}

  {:error, :circuit_open} ->
    Logger.warning("Payment service circuit is open, using fallback")
    {:error, :service_unavailable}

  {:error, :dependency_error, error} ->
    Logger.error("Payment service error: #{inspect(error)}")
    {:error, :payment_failed}
end
The key insight here is that when the circuit is open, you fail fast. Instead of waiting 30 seconds for a timeout from a dead service, you get an immediate response and can execute your fallback logic. This protects both your service and the failing dependency, since you are not piling on more requests while it is trying to recover.
Bulkhead pattern
The bulkhead pattern comes from ship design. Ships have compartments (bulkheads) so that if one section floods, the rest of the ship stays afloat. In software, the idea is to isolate failure domains so that a problem in one area does not affect everything else.
Elixir and the BEAM VM are particularly good at this because of process isolation. Each process is independent, has its own memory, and if it crashes, other processes are unaffected. You can use this to create natural bulkheads:
defmodule MyApp.DependencyPool do
@moduledoc """
Manages separate process pools for each dependency,
preventing one slow dependency from consuming all resources.
"""
def child_spec(_opts) do
children = [
# Each dependency gets its own pool with its own limits
pool_spec(:payment_pool, MyApp.PaymentWorker, size: 10, max_overflow: 5),
pool_spec(:auth_pool, MyApp.AuthWorker, size: 20, max_overflow: 10),
pool_spec(:recommendation_pool, MyApp.RecommendationWorker, size: 5, max_overflow: 2),
pool_spec(:notification_pool, MyApp.NotificationWorker, size: 5, max_overflow: 5)
]
%{
id: __MODULE__,
type: :supervisor,
start: {Supervisor, :start_link, [children, [strategy: :one_for_one]]}
}
end
defp pool_spec(name, worker, opts) do
pool_opts = [
name: {:local, name},
worker_module: worker,
size: Keyword.fetch!(opts, :size),
max_overflow: Keyword.fetch!(opts, :max_overflow)
]
:poolboy.child_spec(name, pool_opts)
end
def call_dependency(pool_name, request, timeout \\ 5_000) do
try do
:poolboy.transaction(
pool_name,
fn worker -> GenServer.call(worker, {:request, request}, timeout) end,
timeout
)
catch
:exit, {:timeout, _} ->
{:error, :pool_timeout}
:exit, {:noproc, _} ->
{:error, :pool_unavailable}
end
end
end
In Kubernetes, you get another layer of bulkheading through resource limits. Each service gets its own CPU and memory budget, so a runaway dependency cannot starve other services:
# k8s/deployment-with-bulkheads.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:latest
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
---
# Separate resource quotas per namespace act as bulkheads
apiVersion: v1
kind: ResourceQuota
metadata:
name: dependency-quota
namespace: payment-service
spec:
hard:
requests.cpu: "2"
requests.memory: "4Gi"
limits.cpu: "4"
limits.memory: "8Gi"
pods: "20"
---
# Network policies as another form of bulkhead
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: payment-service-policy
namespace: payment-service
spec:
podSelector:
matchLabels:
app: payment-service
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: my-app
ports:
- port: 8080
protocol: TCP
The combination of Elixir process isolation, connection pools, Kubernetes resource limits, and network policies gives you multiple layers of bulkheading. If the payment service goes haywire, it cannot consume all the CPU on the node, cannot exhaust your application’s connection pool for other services, and cannot affect processes handling requests that do not need payments.
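If you would rather not pull in poolboy, recent Elixir gives you a lighter in-VM bulkhead: one `Task.Supervisor` per dependency with a `max_children` cap, so in-flight calls to each dependency are bounded and rejected rather than queued once the cap is hit. A sketch, assuming supervisor names like `MyApp.NotificationTasks` that you would define yourself:

```elixir
# In the supervision tree: one task supervisor per dependency,
# each with a hard cap on concurrent in-flight work
children = [
  {Task.Supervisor, name: MyApp.PaymentTasks, max_children: 10},
  {Task.Supervisor, name: MyApp.NotificationTasks, max_children: 20}
]

# Fire-and-forget work is rejected, not queued, once the cap is reached,
# so a slow notification service cannot accumulate unbounded processes
case Task.Supervisor.start_child(MyApp.NotificationTasks, fn ->
       MyApp.Notifications.deliver(email)
     end) do
  {:ok, _pid} -> :ok
  {:error, :max_children} -> {:error, :notification_capacity_exhausted}
end
```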
Timeouts and retries
Timeouts and retries seem straightforward, but getting them wrong is one of the most common causes of cascading failures. Let’s start with what not to do.
The naive approach looks like this:
# DON'T do this - unbounded retries with no backoff
def fetch_user(user_id) do
case HTTPoison.get("https://auth.internal/users/#{user_id}") do
{:ok, response} -> {:ok, response}
{:error, _} -> fetch_user(user_id) # infinite retry loop!
end
end
This creates a retry storm. If the auth service is down, every single request to your service will generate infinite retries, making the problem worse. Here is the right way to do it with exponential backoff and jitter:
defmodule MyApp.Retry do
@moduledoc """
Retry with exponential backoff and jitter.
"""
@default_opts [
max_retries: 3,
base_delay_ms: 100,
max_delay_ms: 5_000,
jitter: true
]
def with_retry(func, opts \\ []) when is_function(func, 0) do
opts = Keyword.merge(@default_opts, opts)
do_retry(func, 0, opts)
end
defp do_retry(func, attempt, opts) do
case func.() do
{:ok, result} ->
{:ok, result}
{:error, reason} when attempt < opts[:max_retries] ->
delay = calculate_delay(attempt, opts)
Logger.info("Retry attempt #{attempt + 1} after #{delay}ms, reason: #{inspect(reason)}")
Process.sleep(delay)
do_retry(func, attempt + 1, opts)
    {:error, reason} ->
      Logger.warning("All #{opts[:max_retries]} retries exhausted, reason: #{inspect(reason)}")
      {:error, {:retries_exhausted, reason}}
end
end
defp calculate_delay(attempt, opts) do
# Exponential backoff: base * 2^attempt
base_delay = opts[:base_delay_ms] * Integer.pow(2, attempt)
# Cap at max delay
capped_delay = min(base_delay, opts[:max_delay_ms])
# Add jitter to prevent thundering herd
if opts[:jitter] do
jitter_range = div(capped_delay, 2)
capped_delay - jitter_range + :rand.uniform(jitter_range * 2)
else
capped_delay
end
end
end
And here is how you combine retries with the circuit breaker:
defmodule MyApp.ResilientClient do
  alias MyApp.{CircuitBreaker, Retry}

  def call_service(circuit_name, request_fn, opts \\ []) do
    result =
      CircuitBreaker.call(circuit_name, fn ->
        # Raise when retries are exhausted so the breaker's rescue clause
        # records the failure; returning an error tuple here would look
        # like success to the breaker
        case Retry.with_retry(fn -> classify_response(request_fn.()) end, opts) do
          {:ok, resp} -> resp
          error -> raise "dependency failed after retries: #{inspect(error)}"
        end
      end)

    # The breaker wraps the retry result in another {:ok, _}; flatten it so
    # callers see a single {:ok, response} | {:error, reason} shape
    case result do
      {:ok, resp} -> {:ok, resp}
      {:error, :circuit_open} -> {:error, :circuit_open}
      {:error, :dependency_error, error} -> {:error, {:dependency_error, error}}
    end
  end

  defp classify_response(response) do
    case response do
      {:ok, %{status_code: status} = resp} when status in 200..299 ->
        {:ok, resp}

      {:ok, %{status_code: status}} when status in [429, 503] ->
        # Transient: rate limited or temporarily unavailable, worth retrying
        {:error, :retryable}

      {:ok, %{status_code: _} = resp} ->
        # Other statuses (400s, unexpected errors) are returned as-is, not retried
        {:ok, resp}

      {:error, %HTTPoison.Error{reason: reason}} ->
        {:error, reason}
    end
  end
end
# Usage
MyApp.ResilientClient.call_service(:payment_service, fn ->
HTTPoison.post(url, body, headers, recv_timeout: 5_000)
end, max_retries: 2, base_delay_ms: 200)
There are a few important things to note here:
- Always set timeouts: Never make a network call without a timeout. A default timeout of 5 seconds is a reasonable starting point.
- Jitter is essential: Without jitter, all retries happen at the same time, creating a thundering herd. Adding randomness spreads them out.
- Not everything is retryable: Only retry on transient errors (timeouts, 503s, connection resets). Do not retry on 400s or 404s.
- Set a retry budget: Limit the total number of retries across all requests, not just per request. If 50% of your requests are retrying, something is very wrong.
- Combine with circuit breakers: Retries without a circuit breaker can make a bad situation worse. The circuit breaker stops the bleeding when retries are not helping.
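The retry module above enforces a per-request cap but not the global budget from the fourth point. A minimal process-wide budget can live in a `:counters` ref stashed in `:persistent_term`; this sketch assumes something else (a periodic GenServer tick, for instance) calls `refill/0` once a second, and ignores the small check-then-decrement race, which is acceptable for a budget:

```elixir
defmodule MyApp.RetryBudget do
  @moduledoc "Global retry budget: if too many requests are retrying, stop retrying."

  @max_tokens 100

  # Call once at application start
  def init do
    ref = :counters.new(1, [:atomics])
    :counters.put(ref, 1, @max_tokens)
    :persistent_term.put(__MODULE__, ref)
  end

  # Take one token; returns false when the budget is exhausted
  def take do
    ref = :persistent_term.get(__MODULE__)

    if :counters.get(ref, 1) > 0 do
      :counters.sub(ref, 1, 1)
      true
    else
      false
    end
  end

  # Expected to run periodically, e.g. every second
  def refill, do: :counters.put(:persistent_term.get(__MODULE__), 1, @max_tokens)
end
```

In `Retry.do_retry/3` you would then gate each retry on `MyApp.RetryBudget.take()` and return an error instead of sleeping when the budget is spent.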
Fallback strategies
When a dependency fails and the circuit breaker is open, you need a plan B. Fallback strategies define what your service does when it cannot reach a dependency. The right strategy depends on the dependency and what your users expect.
Here are the most common fallback patterns:
1. Cache fallback
Serve stale data from a local cache when the source is unavailable:
defmodule MyApp.CacheFallback do
  # The :profile_cache ETS table is created once at application start, e.g.
  # :ets.new(:profile_cache, [:named_table, :set, :public, read_concurrency: true])

  @cache_ttl_ms 300_000     # 5 minutes
  @stale_ttl_ms 3_600_000   # 1 hour - stale data is better than no data
def get_user_profile(user_id) do
case MyApp.ResilientClient.call_service(:user_service, fn ->
HTTPoison.get("https://users.internal/profiles/#{user_id}", [],
recv_timeout: 3_000
)
end) do
{:ok, %{status_code: 200, body: body}} ->
profile = Jason.decode!(body)
cache_put(user_id, profile)
{:ok, profile}
{:error, _reason} ->
case cache_get(user_id) do
{:ok, profile, :fresh} ->
{:ok, profile}
{:ok, profile, :stale} ->
Logger.info("Serving stale profile for user #{user_id}")
{:ok, Map.put(profile, :_stale, true)}
:miss ->
{:error, :unavailable}
end
end
end
defp cache_put(key, value) do
:ets.insert(:profile_cache, {key, value, System.monotonic_time(:millisecond)})
end
defp cache_get(key) do
case :ets.lookup(:profile_cache, key) do
[{^key, value, cached_at}] ->
age = System.monotonic_time(:millisecond) - cached_at
cond do
age < @cache_ttl_ms -> {:ok, value, :fresh}
age < @stale_ttl_ms -> {:ok, value, :stale}
true -> :miss
end
[] ->
:miss
end
end
end
2. Default response fallback
Return a sensible default when the dependency is unavailable:
defmodule MyApp.RecommendationService do
@default_recommendations [
%{id: "popular-1", title: "Most Popular Item", reason: "trending"},
%{id: "popular-2", title: "Editor's Pick", reason: "curated"},
%{id: "popular-3", title: "New Arrival", reason: "new"}
]
def get_recommendations(user_id) do
case MyApp.ResilientClient.call_service(:recommendation_engine, fn ->
HTTPoison.get("https://recommendations.internal/for/#{user_id}", [],
recv_timeout: 2_000
)
end) do
{:ok, %{status_code: 200, body: body}} ->
{:ok, Jason.decode!(body)}
{:error, _reason} ->
Logger.info("Recommendation engine unavailable, using defaults")
{:ok, @default_recommendations}
end
end
end
3. Degraded mode fallback
Disable non-essential features and communicate the degraded state to users:
defmodule MyApp.DegradedMode do
@moduledoc """
Tracks which features are operating in degraded mode
and provides appropriate responses.
"""
use GenServer
def start_link(_opts) do
GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
end
def mark_degraded(feature, reason) do
GenServer.cast(__MODULE__, {:mark_degraded, feature, reason})
end
def mark_healthy(feature) do
GenServer.cast(__MODULE__, {:mark_healthy, feature})
end
def degraded?(feature) do
GenServer.call(__MODULE__, {:degraded?, feature})
end
def status do
GenServer.call(__MODULE__, :status)
end
@impl true
def init(_) do
{:ok, %{}}
end
@impl true
def handle_cast({:mark_degraded, feature, reason}, state) do
Logger.warning("Feature #{feature} entering degraded mode: #{reason}")
{:noreply, Map.put(state, feature, %{reason: reason, since: DateTime.utc_now()})}
end
def handle_cast({:mark_healthy, feature}, state) do
if Map.has_key?(state, feature) do
Logger.info("Feature #{feature} recovered from degraded mode")
end
{:noreply, Map.delete(state, feature)}
end
@impl true
def handle_call({:degraded?, feature}, _from, state) do
{:reply, Map.has_key?(state, feature), state}
end
def handle_call(:status, _from, state) do
{:reply, state, state}
end
end
4. Static fallback
For read-heavy services, pre-compute static responses that can be served when everything else fails:
defmodule MyApp.StaticFallback do
@moduledoc """
Serves pre-computed static content when dynamic services fail.
Updated periodically by a background job.
"""
@static_dir "priv/static/fallbacks"
  # fetch_dynamic_homepage/0 calls the live service; implementation omitted
  def get_homepage_data do
    case fetch_dynamic_homepage() do
      {:ok, data} -> {:ok, data}
      {:error, _} -> load_static_fallback("homepage.json")
    end
  end
defp load_static_fallback(filename) do
path = Path.join(@static_dir, filename)
case File.read(path) do
{:ok, content} ->
Logger.info("Serving static fallback: #{filename}")
{:ok, Jason.decode!(content)}
{:error, _} ->
{:error, :no_fallback_available}
end
end
end
The important thing is to plan your fallbacks before you need them. During an incident is not the time to figure out what your service should do when the recommendation engine is down. Document your fallback strategy for each dependency and test it regularly.
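Testing a fallback can be as simple as forcing the failure path in ExUnit and asserting that the degraded response is still served. A sketch against the breaker and recommendation fallback shown earlier; tripping the breaker through its cast API is a test shortcut, not something you would do in production code:

```elixir
defmodule MyApp.RecommendationFallbackTest do
  use ExUnit.Case, async: false

  test "serves default recommendations when the engine is unreachable" do
    # Push the breaker past its failure threshold (5 in the earlier module)
    for _ <- 1..5, do: GenServer.cast(:recommendation_engine, :record_failure)

    # With the circuit open, the call must fall back to the static defaults
    assert {:ok, recs} = MyApp.RecommendationService.get_recommendations("user-123")
    assert Enum.any?(recs, &(&1.id == "popular-1"))
  end
end
```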
Health checks for dependencies
Kubernetes gives you three types of probes, and understanding when to use each one is critical for dependency management:
- Liveness probes: “Is this process alive?” If it fails, Kubernetes restarts the container. This should check your process, not your dependencies. If your database is down, restarting your app will not fix it.
- Readiness probes: “Can this pod serve traffic?” If it fails, Kubernetes removes the pod from the service endpoints. This is where you check dependencies. If you cannot reach the database, you should not receive traffic.
- Startup probes: “Has this pod finished starting up?” Gives slow-starting containers time to initialize before liveness and readiness checks kick in.
Here is a dependency-aware health check implementation:
defmodule MyAppWeb.HealthController do
use MyAppWeb, :controller
@hard_dependencies [:database, :cache]
@soft_dependencies [:recommendation_engine, :notification_service]
# Liveness: only checks if the process is alive
def liveness(conn, _params) do
json(conn, %{status: "alive", timestamp: DateTime.utc_now()})
end
# Readiness: checks hard dependencies
def readiness(conn, _params) do
checks =
@hard_dependencies
|> Enum.map(fn dep -> {dep, check_dependency(dep)} end)
|> Map.new()
all_healthy = Enum.all?(checks, fn {_dep, status} -> status == :ok end)
if all_healthy do
conn
|> put_status(200)
|> json(%{status: "ready", checks: format_checks(checks)})
else
conn
|> put_status(503)
|> json(%{status: "not_ready", checks: format_checks(checks)})
end
end
# Full status: checks everything including soft dependencies
def status(conn, _params) do
hard_checks =
@hard_dependencies
|> Enum.map(fn dep -> {dep, check_dependency(dep)} end)
|> Map.new()
soft_checks =
@soft_dependencies
|> Enum.map(fn dep -> {dep, check_dependency(dep)} end)
|> Map.new()
degraded_features = MyApp.DegradedMode.status()
all_hard_healthy = Enum.all?(hard_checks, fn {_dep, s} -> s == :ok end)
all_soft_healthy = Enum.all?(soft_checks, fn {_dep, s} -> s == :ok end)
overall =
cond do
not all_hard_healthy -> "unhealthy"
not all_soft_healthy -> "degraded"
true -> "healthy"
end
conn
|> put_status(if(all_hard_healthy, do: 200, else: 503))
|> json(%{
status: overall,
hard_dependencies: format_checks(hard_checks),
soft_dependencies: format_checks(soft_checks),
degraded_features: degraded_features
})
end
defp check_dependency(:database) do
case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", []) do
{:ok, _} -> :ok
{:error, _} -> :error
end
end
defp check_dependency(:cache) do
case Redix.command(:redix, ["PING"]) do
{:ok, "PONG"} -> :ok
_ -> :error
end
end
  defp check_dependency(name) do
    case MyApp.CircuitBreaker.call(name, fn ->
           # Raise transport errors so the breaker records them as failures
           case HTTPoison.get("https://#{name}.internal/health", [], recv_timeout: 2_000) do
             {:ok, response} -> response
             {:error, error} -> raise error
           end
         end) do
      {:ok, %{status_code: 200}} -> :ok
      _ -> :error
    end
  end
defp format_checks(checks) do
Map.new(checks, fn {dep, status} ->
{dep, %{status: status, checked_at: DateTime.utc_now()}}
end)
end
end
And the corresponding Kubernetes probe configuration:
# k8s/deployment-with-probes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 4000
livenessProbe:
httpGet:
path: /health/live
port: 4000
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 4000
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 2
startupProbe:
httpGet:
path: /health/live
port: 4000
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
The critical mistake people make is putting dependency checks in liveness probes. If your database goes down and your liveness probe checks the database, Kubernetes will restart all your pods. Now you have a database outage and an application restart storm happening at the same time. Keep liveness probes simple and use readiness probes for dependency checks.
Dependency mapping
Before you can manage your dependencies, you need to see them. A dependency map is a visual representation of all the services in your system and how they connect. This sounds obvious, but you would be surprised how many teams do not have a clear picture of their dependency graph.
Here is a simple way to document your dependencies:
defmodule MyApp.DependencyMap do
@moduledoc """
Declares all service dependencies with their properties.
This serves as living documentation and powers runtime decisions.
"""
@dependencies %{
database: %{
type: :hard,
url: "postgresql://db.internal:5432/myapp",
timeout_ms: 5_000,
circuit_breaker: false, # managed by Ecto pool
fallback: :none,
slo_target: 0.999,
owner_team: "platform",
criticality: :critical
},
cache: %{
type: :hard,
url: "redis://cache.internal:6379",
timeout_ms: 1_000,
circuit_breaker: true,
fallback: :bypass, # skip cache, hit database directly
slo_target: 0.999,
owner_team: "platform",
criticality: :critical
},
auth_service: %{
type: :hard,
url: "https://auth.internal:8443",
timeout_ms: 3_000,
circuit_breaker: true,
fallback: :cached_tokens,
slo_target: 0.999,
owner_team: "identity",
criticality: :critical
},
payment_service: %{
type: :hard,
url: "https://payments.internal:8080",
timeout_ms: 10_000,
circuit_breaker: true,
fallback: :queue_for_retry,
slo_target: 0.999,
owner_team: "payments",
criticality: :high
},
recommendation_engine: %{
type: :soft,
url: "https://recommendations.internal:8080",
timeout_ms: 2_000,
circuit_breaker: true,
fallback: :static_defaults,
slo_target: 0.99,
owner_team: "ml",
criticality: :low
},
notification_service: %{
type: :soft,
url: "https://notifications.internal:8080",
timeout_ms: 5_000,
circuit_breaker: true,
fallback: :queue_for_retry,
slo_target: 0.99,
owner_team: "comms",
criticality: :medium
},
analytics_service: %{
type: :soft,
url: "https://analytics.internal:8080",
timeout_ms: 1_000,
circuit_breaker: true,
fallback: :fire_and_forget,
slo_target: 0.95,
owner_team: "data",
criticality: :low
}
}
def all, do: @dependencies
def hard_dependencies do
@dependencies
|> Enum.filter(fn {_name, config} -> config.type == :hard end)
|> Map.new()
end
def soft_dependencies do
@dependencies
|> Enum.filter(fn {_name, config} -> config.type == :soft end)
|> Map.new()
end
def get(name), do: Map.get(@dependencies, name)
  def critical_path do
    @dependencies
    |> Enum.filter(fn {_name, config} -> config.criticality in [:critical, :high] end)
    # Maps do not preserve order, so keep the sorted result as a list
    |> Enum.sort_by(fn {_name, config} -> config.criticality end)
  end
end
This kind of declarative dependency map serves multiple purposes: it documents what you depend on, it powers your circuit breaker configuration, it informs your health checks, and it tells on-call engineers which team to contact when a dependency fails.
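For example, the supervision tree can derive its circuit breakers straight from the map instead of listing them by hand, so the two can never drift apart (`MyAppWeb.Endpoint` below is just a placeholder for the rest of your tree):

```elixir
# Start one breaker per dependency that opts into circuit breaking
circuit_breaker_children =
  MyApp.DependencyMap.all()
  |> Enum.filter(fn {_name, config} -> config.circuit_breaker end)
  |> Enum.map(fn {name, _config} ->
    # Distinct ids are required because every breaker uses the same module
    Supervisor.child_spec({MyApp.CircuitBreaker, name: name}, id: name)
  end)

children = circuit_breaker_children ++ [MyAppWeb.Endpoint]
Supervisor.start_link(children, strategy: :one_for_one)
```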
You can also generate a visual graph from this data:
defmodule MyApp.DependencyGraph do
@moduledoc """
Generates a Mermaid diagram from the dependency map.
"""
def to_mermaid do
deps = MyApp.DependencyMap.all()
nodes =
deps
|> Enum.map(fn {name, config} ->
style = if config.type == :hard, do: ":::critical", else: ":::optional"
" #{name}[#{name}]#{style}"
end)
|> Enum.join("\n")
edges =
deps
|> Enum.map(fn {name, config} ->
arrow = if config.type == :hard, do: "==>", else: "-->"
" my_app #{arrow} #{name}"
end)
|> Enum.join("\n")
"""
graph LR
my_app[My App]
#{nodes}
#{edges}
classDef critical fill:#ff6b6b,stroke:#333
classDef optional fill:#4ecdc4,stroke:#333
"""
end
end
SLOs for dependencies
Just as you set SLOs for your own services, you should track the reliability of your dependencies. This gives you data to make decisions about architecture, fallback strategies, and even vendor selection.
Here is how to think about dependency SLOs:
- Internal dependencies: You can usually negotiate SLOs with the team that owns the service. “We need your auth service to have 99.9% availability and p99 latency under 200ms.”
- External dependencies: You are at the mercy of the provider’s SLA. Track actual performance against their stated SLA, because reality often differs.
- Your effective SLO: Your service’s SLO cannot be higher than the SLO of your weakest hard dependency. If your database SLO is 99.9%, your service SLO cannot realistically be 99.95%.
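The arithmetic behind that last point: hard dependencies sit in series, so every one of them must be up for a request to succeed, and their availabilities multiply. You can compute your ceiling straight from the dependency map:

```elixir
# Serial composition: the best achievable availability is the product
# of the hard dependencies' SLO targets
ceiling =
  MyApp.DependencyMap.hard_dependencies()
  |> Enum.map(fn {_name, config} -> config.slo_target end)
  |> Enum.product()

# With four hard dependencies at 99.9% each:
# 0.999 * 0.999 * 0.999 * 0.999 ≈ 0.996, i.e. at best ~99.6% for your service
```

This is why shrinking the set of hard dependencies (or demoting them to soft ones with fallbacks) is often the cheapest way to raise your own SLO.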
Here is a Prometheus-based approach to tracking dependency SLOs:
# prometheus-rules-dependency-slos.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: dependency-slos
namespace: monitoring
spec:
groups:
- name: dependency.slos
interval: 30s
rules:
# Track success rate per dependency
- record: dependency:requests:success_rate5m
expr: |
sum by (dependency) (
rate(dependency_requests_total{status="success"}[5m])
) /
sum by (dependency) (
rate(dependency_requests_total[5m])
)
# Track latency per dependency
- record: dependency:latency:p99_5m
expr: |
histogram_quantile(0.99,
sum by (dependency, le) (
rate(dependency_request_duration_seconds_bucket[5m])
)
)
# Dependency error budget remaining (30-day window)
- record: dependency:error_budget:remaining
expr: |
1 - (
(1 - avg_over_time(dependency:requests:success_rate5m[30d]))
/
(1 - 0.999)
)
- name: dependency.alerts
rules:
- alert: DependencyErrorBudgetBurning
expr: dependency:error_budget:remaining < 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "Dependency {{ $labels.dependency }} has consumed 50% of error budget"
description: "Error budget remaining: {{ $value | humanizePercentage }}"
- alert: DependencyErrorBudgetExhausted
expr: dependency:error_budget:remaining < 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "Dependency {{ $labels.dependency }} error budget nearly exhausted"
description: "Error budget remaining: {{ $value | humanizePercentage }}"
To emit these metrics from your Elixir application, instrument your dependency calls:
defmodule MyApp.DependencyTelemetry do
@moduledoc """
Emits telemetry events for all dependency calls,
which are then exposed as Prometheus metrics.
"""
def track_call(dependency, func) when is_function(func, 0) do
start_time = System.monotonic_time()
result =
try do
func.()
rescue
error ->
duration = System.monotonic_time() - start_time
:telemetry.execute(
[:dependency, :call, :exception],
%{duration: duration},
%{dependency: dependency, error: inspect(error)}
)
reraise error, __STACKTRACE__
end
duration = System.monotonic_time() - start_time
status = if match?({:ok, _}, result), do: "success", else: "failure"
:telemetry.execute(
[:dependency, :call, :stop],
%{duration: duration},
%{dependency: dependency, status: status}
)
result
end
end
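To turn those telemetry events into the `dependency_requests_total` counter and duration histogram the Prometheus rules above expect, you can declare matching metrics with the `telemetry_metrics` and `telemetry_metrics_prometheus` libraries. A sketch; the metric names are assumptions that must line up with your scrape config, and the Prometheus reporter additionally wants histogram buckets via `:reporter_options`:

```elixir
defmodule MyApp.Telemetry do
  import Telemetry.Metrics

  def metrics do
    [
      # Exposed as dependency_requests_total{dependency=..., status=...}
      counter("dependency.requests.total",
        event_name: [:dependency, :call, :stop],
        tags: [:dependency, :status]
      ),
      # Exposed as dependency_request_duration_seconds_bucket{dependency=...}
      distribution("dependency.request.duration.seconds",
        event_name: [:dependency, :call, :stop],
        measurement: :duration,
        unit: {:native, :second},
        tags: [:dependency]
      )
    ]
  end
end
```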
When you track dependency SLOs over time, you start seeing patterns. Maybe your recommendation engine drops below its SLO every Monday morning when the ML team runs batch jobs. Maybe the payment gateway has reliability dips on the last day of the month. These patterns help you plan better fallback strategies and have informed conversations with dependency owners.
Graceful degradation patterns
Graceful degradation is the art of doing less, well, instead of doing everything, poorly. When your system is under stress or a dependency is failing, you intentionally reduce functionality to protect the core user experience.
Think of it as progressive levels of degradation:
- Level 0 - Normal: All features working, all dependencies healthy
- Level 1 - Reduced: Non-essential features disabled (recommendations, analytics, personalization)
- Level 2 - Core only: Only critical path features remain (browse, search, purchase)
- Level 3 - Minimal: Read-only mode or static content only
- Level 4 - Maintenance: Service is down, show a maintenance page
Here is how to implement progressive degradation:
defmodule MyApp.DegradationLevel do
@moduledoc """
Manages the current degradation level based on
dependency health and system load.
"""
use GenServer
@levels [:normal, :reduced, :core_only, :minimal, :maintenance]
def start_link(_opts) do
GenServer.start_link(__MODULE__, :normal, name: __MODULE__)
end
def current_level do
GenServer.call(__MODULE__, :current_level)
end
def set_level(level) when level in @levels do
GenServer.call(__MODULE__, {:set_level, level})
end
  def feature_available?(feature) do
    level = current_level()
    feature_level = feature_minimum_level(feature)
    # Available while we have not degraded past the deepest level
    # at which the feature is still offered
    level_index(level) <= level_index(feature_level)
  end
@impl true
def init(level), do: {:ok, level}
@impl true
def handle_call(:current_level, _from, level), do: {:reply, level, level}
def handle_call({:set_level, new_level}, _from, old_level) do
if new_level != old_level do
Logger.warning(
"Degradation level changed: #{old_level} -> #{new_level}"
)
:telemetry.execute(
[:app, :degradation, :level_change],
%{},
%{old_level: old_level, new_level: new_level}
)
end
{:reply, :ok, new_level}
end
# The deepest degradation level at which each feature remains available
defp feature_minimum_level(:recommendations), do: :normal
defp feature_minimum_level(:analytics_tracking), do: :normal
defp feature_minimum_level(:personalization), do: :normal
defp feature_minimum_level(:search_suggestions), do: :reduced
defp feature_minimum_level(:user_reviews), do: :reduced
defp feature_minimum_level(:search), do: :core_only
defp feature_minimum_level(:browse_catalog), do: :core_only
defp feature_minimum_level(:checkout), do: :core_only
defp feature_minimum_level(:static_content), do: :minimal
defp feature_minimum_level(_), do: :normal
defp level_index(:normal), do: 0
defp level_index(:reduced), do: 1
defp level_index(:core_only), do: 2
defp level_index(:minimal), do: 3
defp level_index(:maintenance), do: 4
end
You can then use this in your controllers and LiveViews:
defmodule MyAppWeb.ProductLive do
use MyAppWeb, :live_view
alias MyApp.DegradationLevel
def mount(%{"id" => id}, _session, socket) do
product = MyApp.Catalog.get_product!(id)
socket =
socket
|> assign(:product, product)
|> assign(:degradation_level, DegradationLevel.current_level())
|> maybe_load_recommendations(id)
|> maybe_load_reviews(id)
{:ok, socket}
end
defp maybe_load_recommendations(socket, product_id) do
if DegradationLevel.feature_available?(:recommendations) do
case MyApp.RecommendationService.get_recommendations(product_id) do
{:ok, recs} -> assign(socket, :recommendations, recs)
{:error, _} -> assign(socket, :recommendations, [])
end
else
assign(socket, :recommendations, [])
end
end
defp maybe_load_reviews(socket, product_id) do
if DegradationLevel.feature_available?(:user_reviews) do
case MyApp.Reviews.list_for_product(product_id) do
{:ok, reviews} -> assign(socket, :reviews, reviews)
{:error, _} -> assign(socket, :reviews, [])
end
else
assign(socket, :reviews, [])
end
end
end
Feature flags for degradation
Feature flags are the mechanism that makes graceful degradation practical at runtime. Instead of deploying new code to disable a feature, you flip a flag and the change takes effect immediately.
Here is a simple but effective feature flag implementation in Elixir:
defmodule MyApp.FeatureFlags do
@moduledoc """
Simple ETS-based feature flags for runtime toggling.
Supports boolean flags and percentage rollouts.
"""
use GenServer
require Logger
@table :feature_flags
def start_link(_opts) do
GenServer.start_link(__MODULE__, [], name: __MODULE__)
end
@impl true
def init(_) do
:ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
# Load default flags
load_defaults()
{:ok, %{}}
end
# Check if a feature is enabled
def enabled?(flag) do
case :ets.lookup(@table, flag) do
[{^flag, true}] -> true
[{^flag, false}] -> false
[{^flag, percentage}] when is_integer(percentage) ->
:rand.uniform(100) <= percentage
[] -> true # fail open: unknown flags default to enabled
end
end
# Enable a feature
def enable(flag) do
:ets.insert(@table, {flag, true})
Logger.info("Feature flag enabled: #{flag}")
:ok
end
# Disable a feature
def disable(flag) do
:ets.insert(@table, {flag, false})
Logger.warning("Feature flag disabled: #{flag}")
:ok
end
# Set percentage rollout
def set_percentage(flag, percentage) when percentage in 0..100 do
:ets.insert(@table, {flag, percentage})
Logger.info("Feature flag #{flag} set to #{percentage}%")
:ok
end
# List all flags and their states
def list_all do
:ets.tab2list(@table)
|> Map.new()
end
defp load_defaults do
defaults = [
{:recommendations, true},
{:analytics_tracking, true},
{:personalization, true},
{:search_suggestions, true},
{:user_reviews, true},
{:new_checkout_flow, false},
{:experimental_search, 10} # 10% rollout
]
Enum.each(defaults, fn {flag, value} ->
:ets.insert(@table, {flag, value})
end)
end
end
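One caveat with the `:rand.uniform/1` approach above: every call re-rolls the dice, so a user in a 10% rollout can see the feature flicker on and off between requests. If you need sticky rollouts, a common trick is to bucket deterministically on a stable identifier with `:erlang.phash2/2`. A sketch (the `StickyRollout` module is an illustrative extension, not part of the module above):

```elixir
defmodule StickyRollout do
  @moduledoc """
  Deterministic percentage bucketing: the same {flag, user_id} pair
  always hashes to the same bucket, so rollouts do not flicker.
  """

  # phash2/2 with range 100 yields a stable bucket in 0..99; the user
  # is in the rollout when their bucket falls below the percentage.
  def in_rollout?(flag, user_id, percentage) when percentage in 0..100 do
    :erlang.phash2({flag, user_id}, 100) < percentage
  end
end

# A given user's answer never changes between calls:
answer = StickyRollout.in_rollout?(:experimental_search, "user-42", 10)
^answer = StickyRollout.in_rollout?(:experimental_search, "user-42", 10)
```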
And a simple Phoenix controller exposing a JSON API to manage flags at runtime:
defmodule MyAppWeb.FeatureFlagController do
use MyAppWeb, :controller
plug :require_admin
def index(conn, _params) do
flags = MyApp.FeatureFlags.list_all()
json(conn, %{flags: flags})
end
def update(conn, %{"flag" => flag, "value" => "true"}) do
MyApp.FeatureFlags.enable(String.to_existing_atom(flag))
json(conn, %{status: "ok", flag: flag, value: true})
end
def update(conn, %{"flag" => flag, "value" => "false"}) do
MyApp.FeatureFlags.disable(String.to_existing_atom(flag))
json(conn, %{status: "ok", flag: flag, value: false})
end
def update(conn, %{"flag" => flag, "value" => value}) do
case Integer.parse(value) do
{percentage, ""} when percentage in 0..100 ->
MyApp.FeatureFlags.set_percentage(
String.to_existing_atom(flag),
percentage
)
json(conn, %{status: "ok", flag: flag, value: percentage})
_ ->
conn
|> put_status(400)
|> json(%{error: "Invalid value"})
end
end
defp require_admin(conn, _opts) do
# Your admin authentication logic here
conn
end
end
The beauty of combining feature flags with the degradation level system is that you can automate the response to dependency failures. When the circuit breaker for the recommendation engine opens, you automatically disable the recommendations feature flag. When it recovers, you re-enable it:
defmodule MyApp.DegradationAutomation do
@moduledoc """
Automatically adjusts feature flags and degradation level
based on dependency health signals.
"""
use GenServer
@check_interval_ms 10_000
def start_link(_opts) do
GenServer.start_link(__MODULE__, [], name: __MODULE__)
end
@impl true
def init(_) do
schedule_check()
{:ok, %{}}
end
@impl true
def handle_info(:check_dependencies, state) do
deps = MyApp.DependencyMap.all()
Enum.each(deps, fn {name, config} ->
case check_health(name) do
:healthy ->
maybe_restore_features(name, config)
:unhealthy ->
maybe_degrade_features(name, config)
end
end)
update_overall_degradation_level()
schedule_check()
{:noreply, state}
end
defp check_health(dep_name) do
# The function body is a deliberate no-op: all we care about is
# whether the breaker lets the call through, so the circuit state
# itself is the health signal.
case MyApp.CircuitBreaker.call(dep_name, fn -> :ok end) do
{:ok, _} -> :healthy
{:error, :circuit_open} -> :unhealthy
{:error, _, _} -> :unhealthy
end
end
defp maybe_degrade_features(dep_name, _config) do
features_for_dependency(dep_name)
|> Enum.each(fn feature ->
MyApp.FeatureFlags.disable(feature)
MyApp.DegradedMode.mark_degraded(feature, "dependency #{dep_name} unhealthy")
end)
end
defp maybe_restore_features(dep_name, _config) do
features_for_dependency(dep_name)
|> Enum.each(fn feature ->
MyApp.FeatureFlags.enable(feature)
MyApp.DegradedMode.mark_healthy(feature)
end)
end
defp features_for_dependency(:recommendation_engine), do: [:recommendations]
defp features_for_dependency(:notification_service), do: [:email_notifications]
defp features_for_dependency(:analytics_service), do: [:analytics_tracking]
defp features_for_dependency(_), do: []
defp update_overall_degradation_level do
hard_deps = MyApp.DependencyMap.hard_dependencies()
soft_deps = MyApp.DependencyMap.soft_dependencies()
hard_healthy = Enum.all?(hard_deps, fn {name, _} -> check_health(name) == :healthy end)
soft_healthy = Enum.all?(soft_deps, fn {name, _} -> check_health(name) == :healthy end)
level =
cond do
not hard_healthy -> :core_only
not soft_healthy -> :reduced
true -> :normal
end
MyApp.DegradationLevel.set_level(level)
end
defp schedule_check do
Process.send_after(self(), :check_dependencies, @check_interval_ms)
end
end
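One refinement worth considering: the automation above restores features on the very first healthy check, so a dependency that oscillates will make your feature flags flap every ten seconds. A simple hysteresis helps, where you degrade immediately but only restore after several consecutive healthy checks. A standalone sketch of that state transition (the `HealthHysteresis` module and the threshold of 3 are illustrative choices, not part of the module above):

```elixir
defmodule HealthHysteresis do
  @moduledoc """
  Flap damping: degrade on any single unhealthy check, but only
  restore after @required_healthy consecutive healthy checks.
  """
  @required_healthy 3

  # State is {:healthy | :degraded, consecutive_healthy_count}.
  # Any unhealthy check resets the streak and degrades immediately.
  def step({_status, _count}, :unhealthy), do: {:degraded, 0}

  def step({:degraded, count}, :healthy) when count + 1 >= @required_healthy,
    do: {:healthy, count + 1}

  def step({:degraded, count}, :healthy), do: {:degraded, count + 1}
  def step({:healthy, count}, :healthy), do: {:healthy, count + 1}
end

# Two healthy checks are not enough; the third flips the state:
state = {:degraded, 0}
state = HealthHysteresis.step(state, :healthy)
state = HealthHysteresis.step(state, :healthy)
{:healthy, 3} = HealthHysteresis.step(state, :healthy)
```

Wiring this in is a matter of keeping the per-dependency state in the `DegradationAutomation` GenServer state map and only calling `maybe_restore_features/2` when `step/2` transitions back to `:healthy`.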
Closing notes
Dependency management and graceful degradation are not optional for any service that aims to be reliable. Every external call is a risk, and the patterns we covered (circuit breakers, bulkheads, timeouts with backoff, fallback strategies, dependency health checks, dependency mapping, dependency SLOs, progressive degradation levels, and feature flags) give you a comprehensive toolkit to manage that risk.
The key takeaways are:
- Know your dependencies: Map them, classify them as hard or soft, and document your fallback strategy for each one
- Fail fast: Use circuit breakers and timeouts so that a slow dependency does not become your problem
- Isolate failures: Use bulkheads (process pools, resource limits, network policies) to contain the blast radius
- Have a plan B: Implement fallback strategies before you need them, not during an incident
- Degrade gracefully: It is better to serve a product page without recommendations than to serve a 500 error
- Automate the response: Use feature flags and automation to respond to dependency failures in seconds, not minutes
Start with the most critical path in your system. Identify the hard dependencies, add circuit breakers and timeouts, implement one fallback strategy, and test it. You do not need to implement everything at once. Incremental improvements compound over time.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here