SRE: Dependency Management and Graceful Degradation
Introduction
In the previous articles we covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, and cost optimization. All of those focus on your own systems, your own code, your own infrastructure. But here is the thing: your service does not exist in isolation.
Every HTTP call to another service, every database query, every message published to a queue, every third-party API integration is a dependency. And every dependency is a potential failure point. When that payment gateway goes down at 2am or that internal auth service starts returning 500s under load, what happens to your service? Does it crash? Does it hang? Or does it gracefully handle the situation and keep serving users with reduced functionality?
In this article we will cover how to think about dependencies as risk, implement circuit breakers, apply the bulkhead pattern, handle timeouts and retries properly, build fallback strategies, set up dependency health checks, map your dependency graph, define SLOs for your dependencies, and implement graceful degradation using feature flags. All with practical Elixir and Kubernetes examples.
Let’s get into it.
Dependencies as risk
Not all dependencies are created equal. The first step in managing them is understanding what kind of dependency you are dealing with and what happens when it fails.
There are two fundamental types of dependencies:
- Hard dependencies: Your service cannot function at all without them. If your database is down, you probably cannot serve any requests. If the auth service is unreachable, nobody can log in.
- Soft dependencies: Your service can still function without them, possibly in a degraded state. If the recommendation engine is down, you can still show the product page without recommendations. If the analytics service is slow, you can fire and forget.
The danger comes from cascading failures. Consider this scenario: Service A calls Service B, which calls Service C. Service C starts responding slowly because of a database issue. Service B’s threads are now blocked waiting for Service C. Service B’s response times increase. Service A’s threads are now blocked waiting for Service B. Pretty soon, all three services are effectively down because of one slow database query in Service C.
This is why dependency management matters so much. A single misbehaving dependency can take down your entire system if you do not have the right protections in place. Let’s look at the patterns that prevent this.
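A first line of defense against this kind of cascade is to bound every cross-service call with a hard deadline, so a slow Service C can never hold Service B's callers hostage. Here is a minimal sketch using `Task.yield/2` plus `Task.shutdown/2` (the pattern from the Task docs); note that `Task.async` links to the caller, so a crash in the dependency call still propagates, and `Task.Supervisor.async_nolink` is the heavier-duty alternative:

```elixir
defmodule MyApp.BoundedCall do
  @moduledoc "Run a dependency call in its own process with a hard deadline."

  def call_with_deadline(fun, timeout_ms \\ 1_000) when is_function(fun, 0) do
    task = Task.async(fun)

    # If the result has not arrived by the deadline, kill the task so the
    # slow callee cannot keep consuming resources on our side.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:error, {:dependency_crashed, reason}}
      nil -> {:error, :deadline_exceeded}
    end
  end
end
```

With this in place, a call into a degraded Service C returns `{:error, :deadline_exceeded}` after roughly `timeout_ms` instead of blocking the caller indefinitely.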
Circuit breakers
The circuit breaker pattern is borrowed from electrical engineering. When too much current flows through a circuit, the breaker trips and stops the flow to prevent damage. In software, when a dependency starts failing, the circuit breaker trips and stops sending requests to it, giving it time to recover.
A circuit breaker has three states:
- Closed: Everything is normal. Requests flow through to the dependency. The breaker monitors failure rates.
- Open: The dependency is failing. Requests are immediately rejected without calling the dependency. A timer starts.
- Half-open: The timer has expired. A limited number of test requests are sent through. If they succeed, the breaker closes. If they fail, the breaker opens again.
Here is a practical implementation in Elixir using a GenServer:
defmodule MyApp.CircuitBreaker do
use GenServer
@failure_threshold 5
@reset_timeout_ms 30_000
@half_open_max_calls 3
defstruct [
:name,
state: :closed,
failure_count: 0,
success_count: 0,
last_failure_time: nil,
half_open_calls: 0
]
# Client API
def start_link(opts) do
name = Keyword.fetch!(opts, :name)
GenServer.start_link(__MODULE__, %__MODULE__{name: name}, name: name)
end
def call(name, func) when is_function(func, 0) do
case GenServer.call(name, :check_state) do
:ok ->
      try do
        result = func.()
        GenServer.cast(name, :record_success)
        {:ok, result}
      rescue
        error ->
          GenServer.cast(name, :record_failure)
          {:error, :dependency_error, error}
      catch
        # Record exits (e.g. GenServer call timeouts inside func) as failures too
        :exit, reason ->
          GenServer.cast(name, :record_failure)
          {:error, :dependency_error, reason}
      end
:open ->
{:error, :circuit_open}
end
end
# Server callbacks
@impl true
def init(state) do
{:ok, state}
end
@impl true
def handle_call(:check_state, _from, %{state: :closed} = state) do
{:reply, :ok, state}
end
def handle_call(:check_state, _from, %{state: :open} = state) do
  if time_since_last_failure(state) >= @reset_timeout_ms do
    # Let this call through as the first half-open probe
    {:reply, :ok, %{state | state: :half_open, half_open_calls: 1}}
else
{:reply, :open, state}
end
end
def handle_call(:check_state, _from, %{state: :half_open} = state) do
if state.half_open_calls < @half_open_max_calls do
{:reply, :ok, %{state | half_open_calls: state.half_open_calls + 1}}
else
{:reply, :open, state}
end
end
@impl true
def handle_cast(:record_success, %{state: :half_open} = state) do
{:noreply, %{state | state: :closed, failure_count: 0, success_count: 0}}
end
def handle_cast(:record_success, state) do
{:noreply, %{state | success_count: state.success_count + 1}}
end
# A failure while half-open re-trips the breaker immediately
def handle_cast(:record_failure, %{state: :half_open} = state) do
  now = System.monotonic_time(:millisecond)
  {:noreply, %{state | state: :open, last_failure_time: now}}
end

def handle_cast(:record_failure, state) do
  new_count = state.failure_count + 1
  now = System.monotonic_time(:millisecond)

  new_state =
    if new_count >= @failure_threshold do
      %{state | state: :open, failure_count: new_count, last_failure_time: now}
    else
      %{state | failure_count: new_count, last_failure_time: now}
    end

  {:noreply, new_state}
end
defp time_since_last_failure(%{last_failure_time: nil}), do: :infinity
defp time_since_last_failure(%{last_failure_time: time}) do
System.monotonic_time(:millisecond) - time
end
end
And here is how you would use it in your application:
# In your application supervision tree
children = [
{MyApp.CircuitBreaker, name: :payment_service},
{MyApp.CircuitBreaker, name: :auth_service},
{MyApp.CircuitBreaker, name: :recommendation_engine}
]
# When making a call to a dependency
case MyApp.CircuitBreaker.call(:payment_service, fn ->
       # Raise transport errors so the breaker's rescue clause records the
       # failure (HTTPoison.Error is an exception struct, so it can be raised)
       case HTTPoison.post("https://payments.internal/charge", body, headers) do
         {:ok, response} -> response
         {:error, error} -> raise error
       end
     end) do
  {:ok, %{status_code: 200, body: body}} ->
    {:ok, Jason.decode!(body)}

  {:ok, %{status_code: status}} ->
    Logger.error("Payment service returned #{status}")
    {:error, :payment_failed}

  {:error, :circuit_open} ->
    Logger.warning("Payment service circuit is open, using fallback")
    {:error, :service_unavailable}

  {:error, :dependency_error, error} ->
    Logger.error("Payment service error: #{inspect(error)}")
    {:error, :payment_failed}
end
The key insight here is that when the circuit is open, you fail fast. Instead of waiting 30 seconds for a timeout from a dead service, you get an immediate response and can execute your fallback logic. This protects both your service and the failing dependency, since you are not piling on more requests while it is trying to recover.
Bulkhead pattern
The bulkhead pattern comes from ship design. Ships have compartments (bulkheads) so that if one section floods, the rest of the ship stays afloat. In software, the idea is to isolate failure domains so that a problem in one area does not affect everything else.
Elixir and the BEAM VM are particularly good at this because of process isolation. Each process is independent, has its own memory, and if it crashes, other processes are unaffected. You can use this to create natural bulkheads:
defmodule MyApp.DependencyPool do
@moduledoc """
Manages separate process pools for each dependency,
preventing one slow dependency from consuming all resources.
"""
def child_spec(_opts) do
children = [
# Each dependency gets its own pool with its own limits
pool_spec(:payment_pool, MyApp.PaymentWorker, size: 10, max_overflow: 5),
pool_spec(:auth_pool, MyApp.AuthWorker, size: 20, max_overflow: 10),
pool_spec(:recommendation_pool, MyApp.RecommendationWorker, size: 5, max_overflow: 2),
pool_spec(:notification_pool, MyApp.NotificationWorker, size: 5, max_overflow: 5)
]
%{
id: __MODULE__,
type: :supervisor,
start: {Supervisor, :start_link, [children, [strategy: :one_for_one]]}
}
end
defp pool_spec(name, worker, opts) do
pool_opts = [
name: {:local, name},
worker_module: worker,
size: Keyword.fetch!(opts, :size),
max_overflow: Keyword.fetch!(opts, :max_overflow)
]
:poolboy.child_spec(name, pool_opts)
end
def call_dependency(pool_name, request, timeout \\ 5_000) do
try do
:poolboy.transaction(
pool_name,
fn worker -> GenServer.call(worker, {:request, request}, timeout) end,
timeout
)
catch
:exit, {:timeout, _} ->
{:error, :pool_timeout}
:exit, {:noproc, _} ->
{:error, :pool_unavailable}
end
end
end
In Kubernetes, you get another layer of bulkheading through resource limits. Each service gets its own CPU and memory budget, so a runaway dependency cannot starve other services:
# k8s/deployment-with-bulkheads.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:latest
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
---
# Separate resource quotas per namespace act as bulkheads
apiVersion: v1
kind: ResourceQuota
metadata:
name: dependency-quota
namespace: payment-service
spec:
hard:
requests.cpu: "2"
requests.memory: "4Gi"
limits.cpu: "4"
limits.memory: "8Gi"
pods: "20"
---
# Network policies as another form of bulkhead
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: payment-service-policy
namespace: payment-service
spec:
podSelector:
matchLabels:
app: payment-service
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: my-app
ports:
- port: 8080
protocol: TCP
The combination of Elixir process isolation, connection pools, Kubernetes resource limits, and network policies gives you multiple layers of bulkheading. If the payment service goes haywire, it cannot consume all the CPU on the node, cannot exhaust your application’s connection pool for other services, and cannot affect processes handling requests that do not need payments.
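If you would rather not pull in poolboy, recent Elixir gives you a lighter in-VM bulkhead: one `Task.Supervisor` per dependency with a `max_children` cap, so in-flight calls to each dependency are bounded and rejected rather than queued once the cap is hit. A sketch, assuming supervisor names like `MyApp.NotificationTasks` that you would define yourself:

```elixir
# In the supervision tree: one task supervisor per dependency,
# each with a hard cap on concurrent in-flight work
children = [
  {Task.Supervisor, name: MyApp.PaymentTasks, max_children: 10},
  {Task.Supervisor, name: MyApp.NotificationTasks, max_children: 20}
]

# Fire-and-forget work is rejected, not queued, once the cap is reached,
# so a slow notification service cannot accumulate unbounded processes
case Task.Supervisor.start_child(MyApp.NotificationTasks, fn ->
       MyApp.Notifications.deliver(email)
     end) do
  {:ok, _pid} -> :ok
  {:error, :max_children} -> {:error, :notification_capacity_exhausted}
end
```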
Timeouts and retries
Timeouts and retries seem straightforward, but getting them wrong is one of the most common causes of cascading failures. Let’s start with what not to do.
The naive approach looks like this:
# DON'T do this - unbounded retries with no backoff
def fetch_user(user_id) do
case HTTPoison.get("https://auth.internal/users/#{user_id}") do
{:ok, response} -> {:ok, response}
{:error, _} -> fetch_user(user_id) # infinite retry loop!
end
end
This creates a retry storm. If the auth service is down, every single request to your service will generate infinite retries, making the problem worse. Here is the right way to do it with exponential backoff and jitter:
defmodule MyApp.Retry do
@moduledoc """
Retry with exponential backoff and jitter.
"""
@default_opts [
max_retries: 3,
base_delay_ms: 100,
max_delay_ms: 5_000,
jitter: true
]
def with_retry(func, opts \\ []) when is_function(func, 0) do
opts = Keyword.merge(@default_opts, opts)
do_retry(func, 0, opts)
end
defp do_retry(func, attempt, opts) do
case func.() do
{:ok, result} ->
{:ok, result}
{:error, reason} when attempt < opts[:max_retries] ->
delay = calculate_delay(attempt, opts)
Logger.info("Retry attempt #{attempt + 1} after #{delay}ms, reason: #{inspect(reason)}")
Process.sleep(delay)
do_retry(func, attempt + 1, opts)
    {:error, reason} ->
      Logger.warning("All #{opts[:max_retries]} retries exhausted, reason: #{inspect(reason)}")
      {:error, {:retries_exhausted, reason}}
end
end
defp calculate_delay(attempt, opts) do
# Exponential backoff: base * 2^attempt
base_delay = opts[:base_delay_ms] * Integer.pow(2, attempt)
# Cap at max delay
capped_delay = min(base_delay, opts[:max_delay_ms])
# Add jitter to prevent thundering herd
if opts[:jitter] do
jitter_range = div(capped_delay, 2)
capped_delay - jitter_range + :rand.uniform(jitter_range * 2)
else
capped_delay
end
end
end
And here is how you combine retries with the circuit breaker:
defmodule MyApp.ResilientClient do
  alias MyApp.{CircuitBreaker, Retry}

  def call_service(circuit_name, request_fn, opts \\ []) do
    result =
      CircuitBreaker.call(circuit_name, fn ->
        # Raise when retries are exhausted so the breaker's rescue clause
        # records the failure; returning an error tuple here would look
        # like success to the breaker
        case Retry.with_retry(fn -> classify_response(request_fn.()) end, opts) do
          {:ok, resp} -> resp
          error -> raise "dependency failed after retries: #{inspect(error)}"
        end
      end)

    # The breaker wraps the retry result in another {:ok, _}; flatten it so
    # callers see a single {:ok, response} | {:error, reason} shape
    case result do
      {:ok, resp} -> {:ok, resp}
      {:error, :circuit_open} -> {:error, :circuit_open}
      {:error, :dependency_error, error} -> {:error, {:dependency_error, error}}
    end
  end

  defp classify_response(response) do
    case response do
      {:ok, %{status_code: status} = resp} when status in 200..299 ->
        {:ok, resp}

      {:ok, %{status_code: status}} when status in [429, 503] ->
        # Transient: rate limited or temporarily unavailable, worth retrying
        {:error, :retryable}

      {:ok, %{status_code: _} = resp} ->
        # Other statuses (400s, unexpected errors) are returned as-is, not retried
        {:ok, resp}

      {:error, %HTTPoison.Error{reason: reason}} ->
        {:error, reason}
    end
  end
end
# Usage
MyApp.ResilientClient.call_service(:payment_service, fn ->
HTTPoison.post(url, body, headers, recv_timeout: 5_000)
end, max_retries: 2, base_delay_ms: 200)
There are a few important things to note here:
- Always set timeouts: Never make a network call without a timeout. A default timeout of 5 seconds is a reasonable starting point.
- Jitter is essential: Without jitter, all retries happen at the same time, creating a thundering herd. Adding randomness spreads them out.
- Not everything is retryable: Only retry on transient errors (timeouts, 503s, connection resets). Do not retry on 400s or 404s.
- Set a retry budget: Limit the total number of retries across all requests, not just per request. If 50% of your requests are retrying, something is very wrong.
- Combine with circuit breakers: Retries without a circuit breaker can make a bad situation worse. The circuit breaker stops the bleeding when retries are not helping.
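The retry module above enforces a per-request cap but not the global budget from the fourth point. A minimal process-wide budget can live in a `:counters` ref stashed in `:persistent_term`; this sketch assumes something else (a periodic GenServer tick, for instance) calls `refill/0` once a second, and ignores the small check-then-decrement race, which is acceptable for a budget:

```elixir
defmodule MyApp.RetryBudget do
  @moduledoc "Global retry budget: if too many requests are retrying, stop retrying."

  @max_tokens 100

  # Call once at application start
  def init do
    ref = :counters.new(1, [:atomics])
    :counters.put(ref, 1, @max_tokens)
    :persistent_term.put(__MODULE__, ref)
  end

  # Take one token; returns false when the budget is exhausted
  def take do
    ref = :persistent_term.get(__MODULE__)

    if :counters.get(ref, 1) > 0 do
      :counters.sub(ref, 1, 1)
      true
    else
      false
    end
  end

  # Expected to run periodically, e.g. every second
  def refill, do: :counters.put(:persistent_term.get(__MODULE__), 1, @max_tokens)
end
```

In `Retry.do_retry/3` you would then gate each retry on `MyApp.RetryBudget.take()` and return an error instead of sleeping when the budget is spent.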
Fallback strategies
When a dependency fails and the circuit breaker is open, you need a plan B. Fallback strategies define what your service does when it cannot reach a dependency. The right strategy depends on the dependency and what your users expect.
Here are the most common fallback patterns:
1. Cache fallback
Serve stale data from a local cache when the source is unavailable:
defmodule MyApp.CacheFallback do
  # The :profile_cache ETS table is created once at application start, e.g.
  # :ets.new(:profile_cache, [:named_table, :set, :public, read_concurrency: true])

  @cache_ttl_ms 300_000     # 5 minutes
  @stale_ttl_ms 3_600_000   # 1 hour - stale data is better than no data
def get_user_profile(user_id) do
case MyApp.ResilientClient.call_service(:user_service, fn ->
HTTPoison.get("https://users.internal/profiles/#{user_id}", [],
recv_timeout: 3_000
)
end) do
{:ok, %{status_code: 200, body: body}} ->
profile = Jason.decode!(body)
cache_put(user_id, profile)
{:ok, profile}
{:error, _reason} ->
case cache_get(user_id) do
{:ok, profile, :fresh} ->
{:ok, profile}
{:ok, profile, :stale} ->
Logger.info("Serving stale profile for user #{user_id}")
{:ok, Map.put(profile, :_stale, true)}
:miss ->
{:error, :unavailable}
end
end
end
defp cache_put(key, value) do
:ets.insert(:profile_cache, {key, value, System.monotonic_time(:millisecond)})
end
defp cache_get(key) do
case :ets.lookup(:profile_cache, key) do
[{^key, value, cached_at}] ->
age = System.monotonic_time(:millisecond) - cached_at
cond do
age < @cache_ttl_ms -> {:ok, value, :fresh}
age < @stale_ttl_ms -> {:ok, value, :stale}
true -> :miss
end
[] ->
:miss
end
end
end
2. Default response fallback
Return a sensible default when the dependency is unavailable:
defmodule MyApp.RecommendationService do
@default_recommendations [
%{id: "popular-1", title: "Most Popular Item", reason: "trending"},
%{id: "popular-2", title: "Editor's Pick", reason: "curated"},
%{id: "popular-3", title: "New Arrival", reason: "new"}
]
def get_recommendations(user_id) do
case MyApp.ResilientClient.call_service(:recommendation_engine, fn ->
HTTPoison.get("https://recommendations.internal/for/#{user_id}", [],
recv_timeout: 2_000
)
end) do
{:ok, %{status_code: 200, body: body}} ->
{:ok, Jason.decode!(body)}
{:error, _reason} ->
Logger.info("Recommendation engine unavailable, using defaults")
{:ok, @default_recommendations}
end
end
end
3. Degraded mode fallback
Disable non-essential features and communicate the degraded state to users:
defmodule MyApp.DegradedMode do
@moduledoc """
Tracks which features are operating in degraded mode
and provides appropriate responses.
"""
use GenServer
def start_link(_opts) do
GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
end
def mark_degraded(feature, reason) do
GenServer.cast(__MODULE__, {:mark_degraded, feature, reason})
end
def mark_healthy(feature) do
GenServer.cast(__MODULE__, {:mark_healthy, feature})
end
def degraded?(feature) do
GenServer.call(__MODULE__, {:degraded?, feature})
end
def status do
GenServer.call(__MODULE__, :status)
end
@impl true
def init(_) do
{:ok, %{}}
end
@impl true
def handle_cast({:mark_degraded, feature, reason}, state) do
Logger.warning("Feature #{feature} entering degraded mode: #{reason}")
{:noreply, Map.put(state, feature, %{reason: reason, since: DateTime.utc_now()})}
end
def handle_cast({:mark_healthy, feature}, state) do
if Map.has_key?(state, feature) do
Logger.info("Feature #{feature} recovered from degraded mode")
end
{:noreply, Map.delete(state, feature)}
end
@impl true
def handle_call({:degraded?, feature}, _from, state) do
{:reply, Map.has_key?(state, feature), state}
end
def handle_call(:status, _from, state) do
{:reply, state, state}
end
end
4. Static fallback
For read-heavy services, pre-compute static responses that can be served when everything else fails:
defmodule MyApp.StaticFallback do
@moduledoc """
Serves pre-computed static content when dynamic services fail.
Updated periodically by a background job.
"""
@static_dir "priv/static/fallbacks"
  # fetch_dynamic_homepage/0 calls the live service; implementation omitted
  def get_homepage_data do
    case fetch_dynamic_homepage() do
      {:ok, data} -> {:ok, data}
      {:error, _} -> load_static_fallback("homepage.json")
    end
  end
defp load_static_fallback(filename) do
path = Path.join(@static_dir, filename)
case File.read(path) do
{:ok, content} ->
Logger.info("Serving static fallback: #{filename}")
{:ok, Jason.decode!(content)}
{:error, _} ->
{:error, :no_fallback_available}
end
end
end
The important thing is to plan your fallbacks before you need them. During an incident is not the time to figure out what your service should do when the recommendation engine is down. Document your fallback strategy for each dependency and test it regularly.
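Testing a fallback can be as simple as forcing the failure path in ExUnit and asserting that the degraded response is still served. A sketch against the breaker and recommendation fallback shown earlier; tripping the breaker through its cast API is a test shortcut, not something you would do in production code:

```elixir
defmodule MyApp.RecommendationFallbackTest do
  use ExUnit.Case, async: false

  test "serves default recommendations when the engine is unreachable" do
    # Push the breaker past its failure threshold (5 in the earlier module)
    for _ <- 1..5, do: GenServer.cast(:recommendation_engine, :record_failure)

    # With the circuit open, the call must fall back to the static defaults
    assert {:ok, recs} = MyApp.RecommendationService.get_recommendations("user-123")
    assert Enum.any?(recs, &(&1.id == "popular-1"))
  end
end
```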
Health checks for dependencies
Kubernetes gives you three types of probes, and understanding when to use each one is critical for dependency management:
- Liveness probes: “Is this process alive?” If it fails, Kubernetes restarts the container. This should check your process, not your dependencies. If your database is down, restarting your app will not fix it.
- Readiness probes: “Can this pod serve traffic?” If it fails, Kubernetes removes the pod from the service endpoints. This is where you check dependencies. If you cannot reach the database, you should not receive traffic.
- Startup probes: “Has this pod finished starting up?” Gives slow-starting containers time to initialize before liveness and readiness checks kick in.
Here is a dependency-aware health check implementation:
defmodule MyAppWeb.HealthController do
use MyAppWeb, :controller
@hard_dependencies [:database, :cache]
@soft_dependencies [:recommendation_engine, :notification_service]
# Liveness: only checks if the process is alive
def liveness(conn, _params) do
json(conn, %{status: "alive", timestamp: DateTime.utc_now()})
end
# Readiness: checks hard dependencies
def readiness(conn, _params) do
checks =
@hard_dependencies
|> Enum.map(fn dep -> {dep, check_dependency(dep)} end)
|> Map.new()
all_healthy = Enum.all?(checks, fn {_dep, status} -> status == :ok end)
if all_healthy do
conn
|> put_status(200)
|> json(%{status: "ready", checks: format_checks(checks)})
else
conn
|> put_status(503)
|> json(%{status: "not_ready", checks: format_checks(checks)})
end
end
# Full status: checks everything including soft dependencies
def status(conn, _params) do
hard_checks =
@hard_dependencies
|> Enum.map(fn dep -> {dep, check_dependency(dep)} end)
|> Map.new()
soft_checks =
@soft_dependencies
|> Enum.map(fn dep -> {dep, check_dependency(dep)} end)
|> Map.new()
degraded_features = MyApp.DegradedMode.status()
all_hard_healthy = Enum.all?(hard_checks, fn {_dep, s} -> s == :ok end)
all_soft_healthy = Enum.all?(soft_checks, fn {_dep, s} -> s == :ok end)
overall =
cond do
not all_hard_healthy -> "unhealthy"
not all_soft_healthy -> "degraded"
true -> "healthy"
end
conn
|> put_status(if(all_hard_healthy, do: 200, else: 503))
|> json(%{
status: overall,
hard_dependencies: format_checks(hard_checks),
soft_dependencies: format_checks(soft_checks),
degraded_features: degraded_features
})
end
defp check_dependency(:database) do
case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", []) do
{:ok, _} -> :ok
{:error, _} -> :error
end
end
defp check_dependency(:cache) do
case Redix.command(:redix, ["PING"]) do
{:ok, "PONG"} -> :ok
_ -> :error
end
end
  defp check_dependency(name) do
    case MyApp.CircuitBreaker.call(name, fn ->
           # Raise transport errors so the breaker records them as failures
           case HTTPoison.get("https://#{name}.internal/health", [], recv_timeout: 2_000) do
             {:ok, response} -> response
             {:error, error} -> raise error
           end
         end) do
      {:ok, %{status_code: 200}} -> :ok
      _ -> :error
    end
  end
defp format_checks(checks) do
Map.new(checks, fn {dep, status} ->
{dep, %{status: status, checked_at: DateTime.utc_now()}}
end)
end
end
And the corresponding Kubernetes probe configuration:
# k8s/deployment-with-probes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 4000
livenessProbe:
httpGet:
path: /health/live
port: 4000
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 4000
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 2
startupProbe:
httpGet:
path: /health/live
port: 4000
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
The critical mistake people make is putting dependency checks in liveness probes. If your database goes down and your liveness probe checks the database, Kubernetes will restart all your pods. Now you have a database outage and an application restart storm happening at the same time. Keep liveness probes simple and use readiness probes for dependency checks.
Dependency mapping
Before you can manage your dependencies, you need to see them. A dependency map is a visual representation of all the services in your system and how they connect. This sounds obvious, but you would be surprised how many teams do not have a clear picture of their dependency graph.
Here is a simple way to document your dependencies:
defmodule MyApp.DependencyMap do
@moduledoc """
Declares all service dependencies with their properties.
This serves as living documentation and powers runtime decisions.
"""
@dependencies %{
database: %{
type: :hard,
url: "postgresql://db.internal:5432/myapp",
timeout_ms: 5_000,
circuit_breaker: false, # managed by Ecto pool
fallback: :none,
slo_target: 0.999,
owner_team: "platform",
criticality: :critical
},
cache: %{
type: :hard,
url: "redis://cache.internal:6379",
timeout_ms: 1_000,
circuit_breaker: true,
fallback: :bypass, # skip cache, hit database directly
slo_target: 0.999,
owner_team: "platform",
criticality: :critical
},
auth_service: %{
type: :hard,
url: "https://auth.internal:8443",
timeout_ms: 3_000,
circuit_breaker: true,
fallback: :cached_tokens,
slo_target: 0.999,
owner_team: "identity",
criticality: :critical
},
payment_service: %{
type: :hard,
url: "https://payments.internal:8080",
timeout_ms: 10_000,
circuit_breaker: true,
fallback: :queue_for_retry,
slo_target: 0.999,
owner_team: "payments",
criticality: :high
},
recommendation_engine: %{
type: :soft,
url: "https://recommendations.internal:8080",
timeout_ms: 2_000,
circuit_breaker: true,
fallback: :static_defaults,
slo_target: 0.99,
owner_team: "ml",
criticality: :low
},
notification_service: %{
type: :soft,
url: "https://notifications.internal:8080",
timeout_ms: 5_000,
circuit_breaker: true,
fallback: :queue_for_retry,
slo_target: 0.99,
owner_team: "comms",
criticality: :medium
},
analytics_service: %{
type: :soft,
url: "https://analytics.internal:8080",
timeout_ms: 1_000,
circuit_breaker: true,
fallback: :fire_and_forget,
slo_target: 0.95,
owner_team: "data",
criticality: :low
}
}
def all, do: @dependencies
def hard_dependencies do
@dependencies
|> Enum.filter(fn {_name, config} -> config.type == :hard end)
|> Map.new()
end
def soft_dependencies do
@dependencies
|> Enum.filter(fn {_name, config} -> config.type == :soft end)
|> Map.new()
end
def get(name), do: Map.get(@dependencies, name)
  def critical_path do
    @dependencies
    |> Enum.filter(fn {_name, config} -> config.criticality in [:critical, :high] end)
    # Maps do not preserve order, so keep the sorted result as a list
    |> Enum.sort_by(fn {_name, config} -> config.criticality end)
  end
end
This kind of declarative dependency map serves multiple purposes: it documents what you depend on, it powers your circuit breaker configuration, it informs your health checks, and it tells on-call engineers which team to contact when a dependency fails.
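For example, the supervision tree can derive its circuit breakers straight from the map instead of listing them by hand, so the two can never drift apart (`MyAppWeb.Endpoint` below is just a placeholder for the rest of your tree):

```elixir
# Start one breaker per dependency that opts into circuit breaking
circuit_breaker_children =
  MyApp.DependencyMap.all()
  |> Enum.filter(fn {_name, config} -> config.circuit_breaker end)
  |> Enum.map(fn {name, _config} ->
    # Distinct ids are required because every breaker uses the same module
    Supervisor.child_spec({MyApp.CircuitBreaker, name: name}, id: name)
  end)

children = circuit_breaker_children ++ [MyAppWeb.Endpoint]
Supervisor.start_link(children, strategy: :one_for_one)
```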
You can also generate a visual graph from this data:
defmodule MyApp.DependencyGraph do
@moduledoc """
Generates a Mermaid diagram from the dependency map.
"""
def to_mermaid do
deps = MyApp.DependencyMap.all()
nodes =
deps
|> Enum.map(fn {name, config} ->
style = if config.type == :hard, do: ":::critical", else: ":::optional"
" #{name}[#{name}]#{style}"
end)
|> Enum.join("\n")
edges =
deps
|> Enum.map(fn {name, config} ->
arrow = if config.type == :hard, do: "==>", else: "-->"
" my_app #{arrow} #{name}"
end)
|> Enum.join("\n")
"""
graph LR
my_app[My App]
#{nodes}
#{edges}
classDef critical fill:#ff6b6b,stroke:#333
classDef optional fill:#4ecdc4,stroke:#333
"""
end
end
SLOs for dependencies
Just as you set SLOs for your own services, you should track the reliability of your dependencies. This gives you data to make decisions about architecture, fallback strategies, and even vendor selection.
Here is how to think about dependency SLOs:
- Internal dependencies: You can usually negotiate SLOs with the team that owns the service. “We need your auth service to have 99.9% availability and p99 latency under 200ms.”
- External dependencies: You are at the mercy of the provider’s SLA. Track actual performance against their stated SLA, because reality often differs.
- Your effective SLO: Your service’s SLO cannot be higher than the SLO of your weakest hard dependency. If your database SLO is 99.9%, your service SLO cannot realistically be 99.95%.
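The arithmetic behind that last point: hard dependencies sit in series, so every one of them must be up for a request to succeed, and their availabilities multiply. You can compute your ceiling straight from the dependency map:

```elixir
# Serial composition: the best achievable availability is the product
# of the hard dependencies' SLO targets
ceiling =
  MyApp.DependencyMap.hard_dependencies()
  |> Enum.map(fn {_name, config} -> config.slo_target end)
  |> Enum.product()

# With four hard dependencies at 99.9% each:
# 0.999 * 0.999 * 0.999 * 0.999 ≈ 0.996, i.e. at best ~99.6% for your service
```

This is why shrinking the set of hard dependencies (or demoting them to soft ones with fallbacks) is often the cheapest way to raise your own SLO.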
Here is a Prometheus-based approach to tracking dependency SLOs:
# prometheus-rules-dependency-slos.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: dependency-slos
namespace: monitoring
spec:
groups:
- name: dependency.slos
interval: 30s
rules:
# Track success rate per dependency
- record: dependency:requests:success_rate5m
expr: |
sum by (dependency) (
rate(dependency_requests_total{status="success"}[5m])
) /
sum by (dependency) (
rate(dependency_requests_total[5m])
)
# Track latency per dependency
- record: dependency:latency:p99_5m
expr: |
histogram_quantile(0.99,
sum by (dependency, le) (
rate(dependency_request_duration_seconds_bucket[5m])
)
)
# Dependency error budget remaining (30-day window)
- record: dependency:error_budget:remaining
expr: |
1 - (
(1 - avg_over_time(dependency:requests:success_rate5m[30d]))
/
(1 - 0.999)
)
- name: dependency.alerts
rules:
- alert: DependencyErrorBudgetBurning
expr: dependency:error_budget:remaining < 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "Dependency {{ $labels.dependency }} has consumed 50% of error budget"
description: "Error budget remaining: {{ $value | humanizePercentage }}"
- alert: DependencyErrorBudgetExhausted
expr: dependency:error_budget:remaining < 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "Dependency {{ $labels.dependency }} error budget nearly exhausted"
description: "Error budget remaining: {{ $value | humanizePercentage }}"
To emit these metrics from your Elixir application, instrument your dependency calls:
defmodule MyApp.DependencyTelemetry do
@moduledoc """
Emits telemetry events for all dependency calls,
which are then exposed as Prometheus metrics.
"""
def track_call(dependency, func) when is_function(func, 0) do
start_time = System.monotonic_time()
result =
try do
func.()
rescue
error ->
duration = System.monotonic_time() - start_time
:telemetry.execute(
[:dependency, :call, :exception],
%{duration: duration},
%{dependency: dependency, error: inspect(error)}
)
reraise error, __STACKTRACE__
end
duration = System.monotonic_time() - start_time
status = if match?({:ok, _}, result), do: "success", else: "failure"
:telemetry.execute(
[:dependency, :call, :stop],
%{duration: duration},
%{dependency: dependency, status: status}
)
result
end
end
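To turn those telemetry events into the `dependency_requests_total` counter and duration histogram the Prometheus rules above expect, you can declare matching metrics with the `telemetry_metrics` and `telemetry_metrics_prometheus` libraries. A sketch; the metric names are assumptions that must line up with your scrape config, and the Prometheus reporter additionally wants histogram buckets via `:reporter_options`:

```elixir
defmodule MyApp.Telemetry do
  import Telemetry.Metrics

  def metrics do
    [
      # Exposed as dependency_requests_total{dependency=..., status=...}
      counter("dependency.requests.total",
        event_name: [:dependency, :call, :stop],
        tags: [:dependency, :status]
      ),
      # Exposed as dependency_request_duration_seconds_bucket{dependency=...}
      distribution("dependency.request.duration.seconds",
        event_name: [:dependency, :call, :stop],
        measurement: :duration,
        unit: {:native, :second},
        tags: [:dependency]
      )
    ]
  end
end
```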
When you track dependency SLOs over time, you start seeing patterns. Maybe your recommendation engine drops below its SLO every Monday morning when the ML team runs batch jobs. Maybe the payment gateway has reliability dips on the last day of the month. These patterns help you plan better fallback strategies and have informed conversations with dependency owners.
Graceful degradation patterns
Graceful degradation is the art of doing less, well, instead of doing everything, poorly. When your system is under stress or a dependency is failing, you intentionally reduce functionality to protect the core user experience.
Think of it as progressive levels of degradation:
- Level 0 - Normal: All features working, all dependencies healthy
- Level 1 - Reduced: Non-essential features disabled (recommendations, analytics, personalization)
- Level 2 - Core only: Only critical path features remain (browse, search, purchase)
- Level 3 - Minimal: Read-only mode or static content only
- Level 4 - Maintenance: Service is down, show a maintenance page
Here is how to implement progressive degradation:
defmodule MyApp.DegradationLevel do
@moduledoc """
Manages the current degradation level based on
dependency health and system load.
"""
use GenServer
@levels [:normal, :reduced, :core_only, :minimal, :maintenance]
def start_link(_opts) do
GenServer.start_link(__MODULE__, :normal, name: __MODULE__)
end
def current_level do
GenServer.call(__MODULE__, :current_level)
end
def set_level(level) when level in @levels do
GenServer.call(__MODULE__, {:set_level, level})
end
  def feature_available?(feature) do
    level = current_level()
    feature_level = feature_minimum_level(feature)
    # Available while we have not degraded past the deepest level
    # at which the feature is still offered
    level_index(level) <= level_index(feature_level)
  end
@impl true
def init(level), do: {:ok, level}
@impl true
def handle_call(:current_level, _from, level), do: {:reply, level, level}
def handle_call({:set_level, new_level}, _from, old_level) do
if new_level != old_level do
Logger.warning(
"Degradation level changed: #{old_level} -> #{new_level}"
)
:telemetry.execute(
[:app, :degradation, :level_change],
%{},
%{old_level: old_level, new_level: new_level}
)
end
{:reply, :ok, new_level}
end
# The deepest degradation level at which each feature remains available
defp feature_minimum_level(:recommendations), do: :normal
defp feature_minimum_level(:analytics_tracking), do: :normal
defp feature_minimum_level(:personalization), do: :normal
defp feature_minimum_level(:search_suggestions), do: :reduced
defp feature_minimum_level(:user_reviews), do: :reduced
defp feature_minimum_level(:search), do: :core_only
defp feature_minimum_level(:browse_catalog), do: :core_only
defp feature_minimum_level(:checkout), do: :core_only
defp feature_minimum_level(:static_content), do: :minimal
defp feature_minimum_level(_), do: :normal
defp level_index(:normal), do: 0
defp level_index(:reduced), do: 1
defp level_index(:core_only), do: 2
defp level_index(:minimal), do: 3
defp level_index(:maintenance), do: 4
end
You can then use this in your controllers and LiveViews:
defmodule MyAppWeb.ProductLive do
use MyAppWeb, :live_view
alias MyApp.DegradationLevel
def mount(%{"id" => id}, _session, socket) do
product = MyApp.Catalog.get_product!(id)
socket =
socket
|> assign(:product, product)
|> assign(:degradation_level, DegradationLevel.current_level())
|> maybe_load_recommendations(id)
|> maybe_load_reviews(id)
{:ok, socket}
end
defp maybe_load_recommendations(socket, product_id) do
if DegradationLevel.feature_available?(:recommendations) do
case MyApp.RecommendationService.get_recommendations(product_id) do
{:ok, recs} -> assign(socket, :recommendations, recs)
{:error, _} -> assign(socket, :recommendations, [])
end
else
assign(socket, :recommendations, [])
end
end
defp maybe_load_reviews(socket, product_id) do
if DegradationLevel.feature_available?(:user_reviews) do
case MyApp.Reviews.list_for_product(product_id) do
{:ok, reviews} -> assign(socket, :reviews, reviews)
{:error, _} -> assign(socket, :reviews, [])
end
else
assign(socket, :reviews, [])
end
end
end
Feature flags for degradation
Feature flags are the mechanism that makes graceful degradation practical at runtime. Instead of deploying new code to disable a feature, you flip a flag and the change takes effect immediately.
Here is a simple but effective feature flag implementation in Elixir:
defmodule MyApp.FeatureFlags do
@moduledoc """
Simple ETS-based feature flags for runtime toggling.
Supports boolean flags and percentage rollouts.
"""
use GenServer
require Logger
@table :feature_flags
def start_link(_opts) do
GenServer.start_link(__MODULE__, [], name: __MODULE__)
end
@impl true
def init(_) do
:ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
# Load default flags
load_defaults()
{:ok, %{}}
end
# Check if a feature is enabled
def enabled?(flag) do
case :ets.lookup(@table, flag) do
[{^flag, true}] -> true
[{^flag, false}] -> false
[{^flag, percentage}] when is_integer(percentage) ->
:rand.uniform(100) <= percentage
[] -> true # fail open: unknown flags default to enabled
end
end
# Enable a feature
def enable(flag) do
:ets.insert(@table, {flag, true})
Logger.info("Feature flag enabled: #{flag}")
:ok
end
# Disable a feature
def disable(flag) do
:ets.insert(@table, {flag, false})
Logger.warning("Feature flag disabled: #{flag}")
:ok
end
# Set percentage rollout
def set_percentage(flag, percentage) when percentage in 0..100 do
:ets.insert(@table, {flag, percentage})
Logger.info("Feature flag #{flag} set to #{percentage}%")
:ok
end
# List all flags and their states
def list_all do
:ets.tab2list(@table)
|> Map.new()
end
defp load_defaults do
defaults = [
{:recommendations, true},
{:analytics_tracking, true},
{:personalization, true},
{:search_suggestions, true},
{:user_reviews, true},
{:new_checkout_flow, false},
{:experimental_search, 10} # 10% rollout
]
Enum.each(defaults, fn {flag, value} ->
:ets.insert(@table, {flag, value})
end)
end
end
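One caveat with the `:rand.uniform/1` approach above: every call re-rolls the dice, so a user in a 10% rollout can see the feature flicker on and off between requests. If you need sticky rollouts, a common trick is to bucket deterministically on a stable identifier with `:erlang.phash2/2`. A sketch (the `StickyRollout` module is an illustrative extension, not part of the module above):

```elixir
defmodule StickyRollout do
  @moduledoc """
  Deterministic percentage bucketing: the same {flag, user_id} pair
  always hashes to the same bucket, so rollouts do not flicker.
  """

  # phash2/2 with range 100 yields a stable bucket in 0..99; the user
  # is in the rollout when their bucket falls below the percentage.
  def in_rollout?(flag, user_id, percentage) when percentage in 0..100 do
    :erlang.phash2({flag, user_id}, 100) < percentage
  end
end

# A given user's answer never changes between calls:
answer = StickyRollout.in_rollout?(:experimental_search, "user-42", 10)
^answer = StickyRollout.in_rollout?(:experimental_search, "user-42", 10)
```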
And a simple Phoenix controller exposing a JSON API to manage flags at runtime:
defmodule MyAppWeb.FeatureFlagController do
use MyAppWeb, :controller
plug :require_admin
def index(conn, _params) do
flags = MyApp.FeatureFlags.list_all()
json(conn, %{flags: flags})
end
def update(conn, %{"flag" => flag, "value" => "true"}) do
MyApp.FeatureFlags.enable(String.to_existing_atom(flag))
json(conn, %{status: "ok", flag: flag, value: true})
end
def update(conn, %{"flag" => flag, "value" => "false"}) do
MyApp.FeatureFlags.disable(String.to_existing_atom(flag))
json(conn, %{status: "ok", flag: flag, value: false})
end
def update(conn, %{"flag" => flag, "value" => value}) do
case Integer.parse(value) do
{percentage, ""} when percentage in 0..100 ->
MyApp.FeatureFlags.set_percentage(
String.to_existing_atom(flag),
percentage
)
json(conn, %{status: "ok", flag: flag, value: percentage})
_ ->
conn
|> put_status(400)
|> json(%{error: "Invalid value"})
end
end
defp require_admin(conn, _opts) do
# Your admin authentication logic here
conn
end
end
The beauty of combining feature flags with the degradation level system is that you can automate the response to dependency failures. When the circuit breaker for the recommendation engine opens, you automatically disable the recommendations feature flag. When it recovers, you re-enable it:
defmodule MyApp.DegradationAutomation do
@moduledoc """
Automatically adjusts feature flags and degradation level
based on dependency health signals.
"""
use GenServer
@check_interval_ms 10_000
def start_link(_opts) do
GenServer.start_link(__MODULE__, [], name: __MODULE__)
end
@impl true
def init(_) do
schedule_check()
{:ok, %{}}
end
@impl true
def handle_info(:check_dependencies, state) do
deps = MyApp.DependencyMap.all()
Enum.each(deps, fn {name, config} ->
case check_health(name) do
:healthy ->
maybe_restore_features(name, config)
:unhealthy ->
maybe_degrade_features(name, config)
end
end)
update_overall_degradation_level()
schedule_check()
{:noreply, state}
end
defp check_health(dep_name) do
# The function body is a deliberate no-op: all we care about is
# whether the breaker lets the call through, so the circuit state
# itself is the health signal.
case MyApp.CircuitBreaker.call(dep_name, fn -> :ok end) do
{:ok, _} -> :healthy
{:error, :circuit_open} -> :unhealthy
{:error, _, _} -> :unhealthy
end
end
defp maybe_degrade_features(dep_name, _config) do
features_for_dependency(dep_name)
|> Enum.each(fn feature ->
MyApp.FeatureFlags.disable(feature)
MyApp.DegradedMode.mark_degraded(feature, "dependency #{dep_name} unhealthy")
end)
end
defp maybe_restore_features(dep_name, _config) do
features_for_dependency(dep_name)
|> Enum.each(fn feature ->
MyApp.FeatureFlags.enable(feature)
MyApp.DegradedMode.mark_healthy(feature)
end)
end
defp features_for_dependency(:recommendation_engine), do: [:recommendations]
defp features_for_dependency(:notification_service), do: [:email_notifications]
defp features_for_dependency(:analytics_service), do: [:analytics_tracking]
defp features_for_dependency(_), do: []
defp update_overall_degradation_level do
hard_deps = MyApp.DependencyMap.hard_dependencies()
soft_deps = MyApp.DependencyMap.soft_dependencies()
hard_healthy = Enum.all?(hard_deps, fn {name, _} -> check_health(name) == :healthy end)
soft_healthy = Enum.all?(soft_deps, fn {name, _} -> check_health(name) == :healthy end)
level =
cond do
not hard_healthy -> :core_only
not soft_healthy -> :reduced
true -> :normal
end
MyApp.DegradationLevel.set_level(level)
end
defp schedule_check do
Process.send_after(self(), :check_dependencies, @check_interval_ms)
end
end
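One refinement worth considering: the automation above restores features on the very first healthy check, so a dependency that oscillates will make your feature flags flap every ten seconds. A simple hysteresis helps, where you degrade immediately but only restore after several consecutive healthy checks. A standalone sketch of that state transition (the `HealthHysteresis` module and the threshold of 3 are illustrative choices, not part of the module above):

```elixir
defmodule HealthHysteresis do
  @moduledoc """
  Flap damping: degrade on any single unhealthy check, but only
  restore after @required_healthy consecutive healthy checks.
  """
  @required_healthy 3

  # State is {:healthy | :degraded, consecutive_healthy_count}.
  # Any unhealthy check resets the streak and degrades immediately.
  def step({_status, _count}, :unhealthy), do: {:degraded, 0}

  def step({:degraded, count}, :healthy) when count + 1 >= @required_healthy,
    do: {:healthy, count + 1}

  def step({:degraded, count}, :healthy), do: {:degraded, count + 1}
  def step({:healthy, count}, :healthy), do: {:healthy, count + 1}
end

# Two healthy checks are not enough; the third flips the state:
state = {:degraded, 0}
state = HealthHysteresis.step(state, :healthy)
state = HealthHysteresis.step(state, :healthy)
{:healthy, 3} = HealthHysteresis.step(state, :healthy)
```

Wiring this in is a matter of keeping the per-dependency state in the `DegradationAutomation` GenServer state map and only calling `maybe_restore_features/2` when `step/2` transitions back to `:healthy`.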
Closing notes
Dependency management and graceful degradation are not optional for any service that aims to be reliable. Every external call is a risk, and the patterns we covered (circuit breakers, bulkheads, timeouts with backoff, fallback strategies, dependency health checks, dependency mapping, dependency SLOs, progressive degradation levels, and feature flags) give you a comprehensive toolkit to manage that risk.
The key takeaways are:
- Know your dependencies: Map them, classify them as hard or soft, and document your fallback strategy for each one
- Fail fast: Use circuit breakers and timeouts so that a slow dependency does not become your problem
- Isolate failures: Use bulkheads (process pools, resource limits, network policies) to contain the blast radius
- Have a plan B: Implement fallback strategies before you need them, not during an incident
- Degrade gracefully: It is better to serve a product page without recommendations than to serve a 500 error
- Automate the response: Use feature flags and automation to respond to dependency failures in seconds, not minutes
Start with the most critical path in your system. Identify the hard dependencies, add circuit breakers and timeouts, implement one fallback strategy, and test it. You do not need to implement everything at once. Incremental improvements compound over time.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here