---
id: 4
title: "Incident tooling must not depend on what it recovers"
status: current
date: 2026-06-12
authors:
  - "Theo Zourzouvillys"
tags: [principles, reliability, incident, security, infra]
summary: "Anything you need to respond to an incident — deploy/rollback, kill switches, observability, break-glass access — must not depend, directly or transitively, on the systems likely to be down during it. Never gate incident tooling behind a system it might need to recover."
supersedes: null
superseded_by: null
aliases: []
---

## TL;DR

Any capability you rely on to respond to an incident or outage — deploying, releasing, and rolling
back; flipping feature flags and kill switches; the observability needed to diagnose; break-glass
access and the credentials it needs; runbooks; incident comms — **must not depend, directly or
transitively, on the systems that are likely to be unavailable during that incident.** If
recovering from an outage requires the thing that's down, you have a deadlock. The canonical trap:
gating the release/deploy path (or any incident tool) behind *the very system it might be needed to
recover* — for example, authenticating your deploy pipeline with your own auth product, so when
that product is the outage, no one can ship the fix. Incident-critical tooling uses independent
mechanisms, and those break-glass paths are tested on the assumption that production is down.

## Context

As a system decomposes into services, tooling increasingly runs on your own systems — you dogfood,
which is healthy and worth doing. The dangerous exception is the small set of tools you reach for
*precisely when your systems are broken*. For those, dogfooding can create a circular dependency:
you need the system to fix the system.

Concrete failure modes to rule out:

- The deploy/release/rollback pipeline authenticates operators with **the same auth system that's
  down** — so during that outage, no one can log in to ship the fix.
- The incident dashboard or alerting runs on the same cluster, database, or region as the service
  that's down, so it goes dark exactly when it's needed.
- The break-glass credentials for recovery are stored behind the SSO/secrets path the outage has
  taken out.
- The runbook describing the recovery lives only in a tool that depends on the failed system.

These are all the same shape — an incident-response capability with a hidden transitive dependency
on its own blast radius. The cost of getting it wrong lands at the worst possible time:
mid-incident, with the clock running. A priority ordering that puts availability above performance
([ZFN-2](/zfn/2-engineering-priority-ordering/)) accepts real cost to protect availability; keeping
outages recoverable is the most basic form of that protection.

## Recommendation

**Any capability on the incident-response critical path must remain operable when your own
production systems are degraded or down.** The critical path includes, at minimum: deploy /
release / rollback; feature-flag and kill-switch control; the observability and alerting needed to
detect and diagnose; break-glass access and the credentials/secrets it requires; the runbooks and
docs needed to act; and incident communications.

Concretely:

1. **No incident-critical tool may depend — directly or transitively — on a system it might need to
   recover.** Trace the dependency chain (auth, network, data stores, regions, third parties), not
   just the first hop. In particular, **don't authenticate incident tooling with a system you might
   plausibly need to recover**; use an independent identity provider instead.
2. **Independence is explicit and documented.** Each incident-critical tool states what it depends
   on and why that set stays available during the incidents it's meant to address — the break-glass
   path is written down, not assumed.
3. **Prefer independent, well-understood break-glass mechanisms:** a separate identity provider,
   out-of-band comms, statically hosted runbooks, and recovery credentials held in an independent
   vault or offline.
4. **Test the break-glass path on the assumption production is down.** Exercise it periodically
   (e.g. game days) so it doesn't quietly rot — an untested recovery path is a guess.

**Scope:** this binds the incident/outage-response critical path. Normal-operation tooling can and
should dogfood your systems freely; the constraint applies specifically to what you need *when
things are on fire*. When in doubt about whether a tool is on the critical path, assume it is and
trace its dependencies.
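
Tracing the chain past the first hop can be sketched as a breadth-first walk over a declared
dependency graph. This is a minimal illustration, not a prescribed tool: the graph contents, the
system names, and the `find_circular_paths` helper are all hypothetical, and a real inventory would
come from wherever your dependencies are actually recorded.

```python
from collections import deque

# Hypothetical dependency graph: each key depends directly on the listed systems.
# Every name here is illustrative, not taken from a real inventory.
DEPENDS_ON = {
    "deploy-pipeline": ["ci-runner", "corp-sso"],
    "ci-runner": ["artifact-store"],
    "corp-sso": ["auth-product"],
    "artifact-store": [],
    "auth-product": ["primary-db"],
    "primary-db": [],
    "static-runbooks": ["external-cdn"],
    "external-cdn": [],
}


def find_circular_paths(tool, may_need_to_recover, graph):
    """Return one shortest dependency path from `tool` to each system in
    `may_need_to_recover` that it reaches, directly or transitively."""
    paths = {}
    queue = deque([[tool]])
    seen = {tool}
    while queue:
        path = queue.popleft()
        for dep in graph.get(path[-1], []):
            if dep in seen:
                continue  # already visited via a path at least as short
            seen.add(dep)
            new_path = path + [dep]
            if dep in may_need_to_recover:
                paths[dep] = new_path
            queue.append(new_path)
    return paths


# The deploy pipeline never names the auth product directly, but reaches it via SSO.
hits = find_circular_paths("deploy-pipeline", {"auth-product", "primary-db"}, DEPENDS_ON)
for system, path in hits.items():
    print(f"{system}: {' -> '.join(path)}")
```

Reporting the full path, rather than a yes/no answer, is a deliberate choice in this sketch: it
shows which hop introduces the dependency, which is usually the hop you need to replace.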

## Consequences

**Easier:**

- Outages stay recoverable: you never deadlock on "you need the system to fix the system."
- Incident response is calmer and faster — the tools work when needed, and their dependencies are
  known ahead of time, not discovered mid-incident.

**Harder:**

- Some deliberate duplication: a second identity path, separate hosting, an independent secrets
  store. You can't fully dogfood the recovery tools, and that's the point.
- Independent paths rot if unused, so they carry an ongoing testing obligation. That maintenance
  cost is the price of a working break-glass.
- Tracing transitive dependencies takes real effort and judgment, and the chain changes as systems
  evolve, so independence is a property to re-check, not to establish once.

**New obligations:**

- Any new or changed incident-critical tool declares and documents its independence, and reviewers
  check it for circular dependencies on its own blast radius.
- Break-glass paths are tested on a regular cadence under "production is down" assumptions.
- When a dependency change would pull an incident-critical tool back onto a system it might need to
  recover, that's a blocking concern — treat it like any security-first trade-off under
  [ZFN-2](/zfn/2-engineering-priority-ordering/).
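
The testing obligation can be made checkable by recording each break-glass exercise and flagging
paths with no recent qualifying run. The sketch below is one possible shape, under stated
assumptions: the 90-day cadence, the path names, and the `BreakGlassRun` and `overdue_paths` names
are all hypothetical. A run counts only if it was performed with production assumed down.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumed cadence; choose your own


@dataclass
class BreakGlassRun:
    path: str                # which break-glass path was exercised
    run_on: date             # when the exercise happened
    prod_assumed_down: bool  # True only if the run did not lean on production


def overdue_paths(required, runs, today, max_age=MAX_AGE):
    """Return the required paths with no qualifying run within `max_age`."""
    latest = {}
    for run in runs:
        if not run.prod_assumed_down:
            continue  # a run that relied on production does not count
        if run.path not in latest or run.run_on > latest[run.path]:
            latest[run.path] = run.run_on
    return [p for p in required
            if p not in latest or today - latest[p] > max_age]


# Hypothetical records for three break-glass paths.
required = ["rollback", "vault-access", "status-page"]
runs = [
    BreakGlassRun("rollback", date(2026, 6, 1), True),
    BreakGlassRun("vault-access", date(2026, 6, 5), False),  # recent, but used prod SSO
    BreakGlassRun("vault-access", date(2026, 1, 10), True),  # qualifying, but too old
]
print(overdue_paths(required, runs, today=date(2026, 6, 12)))
```

In this example `rollback` passes, while `vault-access` and `status-page` are flagged: the first
has only a stale qualifying run, and the second has never been exercised.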

## References

- [ZFN-2](/zfn/2-engineering-priority-ordering/) — priority ordering; availability is
  protected at real cost.

## Changelog

- **2026-06-12**: First published as a Field Note.
