Theo Zourzouvillys

Field Note 4 current

Incident tooling must not depend on what it recovers

By Theo Zourzouvillys
Tags: principles, reliability, incident, security, infra

TL;DR

Any capability you rely on to respond to an incident or outage — deploying, releasing, and rolling back; flipping feature flags and kill switches; the observability needed to diagnose; break-glass access and the credentials it needs; runbooks; incident comms — must not depend, directly or transitively, on the systems that are likely to be unavailable during that incident. If recovering from an outage requires the thing that’s down, you have a deadlock. The canonical trap: gating the release/deploy path (or any incident tool) behind the very system it might be needed to recover — for example, authenticating your deploy pipeline with your own auth product, so when that product is the outage, no one can ship the fix. Incident-critical tooling uses independent mechanisms, and those break-glass paths are tested on the assumption that production is down.

Context

As a system decomposes into services, tooling increasingly runs on your own systems — you dogfood, which is healthy and worth doing. The dangerous exception is the small set of tools you reach for precisely when your systems are broken. For those, dogfooding can create a circular dependency: you need the system to fix the system.

Concrete failure modes to rule out:

  • The deploy/release/rollback pipeline authenticates operators with the same auth system that’s down — so during that outage, no one can log in to ship the fix.
  • The incident dashboard or alerting runs on the same cluster, database, or region as the service that’s down, so it goes dark exactly when it’s needed.
  • The break-glass credentials for recovery are stored behind the SSO/secrets path the outage has taken out.
  • The runbook describing the recovery lives only in a tool that depends on the failed system.

These are all the same shape: an incident-response capability with a hidden transitive dependency on its own blast radius. The cost of getting it wrong lands at the worst possible time, mid-incident, with the clock running. A priority ordering that puts availability above performance (ZFN-2) accepts real cost to protect it; a recoverable outage is the most basic form of that.

Recommendation

Any capability on the incident-response critical path must remain operable when your own production systems are degraded or down. The critical path includes, at minimum: deploy / release / rollback; feature-flag and kill-switch control; the observability and alerting needed to detect and diagnose; break-glass access and the credentials/secrets it requires; the runbooks and docs needed to act; and incident communications.

Concretely:

  1. No incident-critical tool may depend, directly or transitively, on a system it might need to recover. Trace the whole dependency chain (auth, network, data stores, regions, third parties), not just the first hop. In particular, don’t authenticate incident tooling with a system you might plausibly need to recover; use an independent identity provider.
  2. Independence is explicit and documented. Each incident-critical tool states what it depends on and why that set stays available during the incidents it’s meant to address — the break-glass path is written down, not assumed.
  3. Prefer independent, well-understood break-glass mechanisms: a separate identity provider, out-of-band comms, statically hosted runbooks, and recovery credentials held in an independent vault or offline.
  4. Test the break-glass path on the assumption production is down. Exercise it periodically (e.g. game days) so it doesn’t quietly rot — an untested recovery path is a guess.
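
The transitive trace in point 1 can be sketched as a graph search. This is a minimal illustration, assuming you can write down each system's direct dependencies; the system names, the DEPENDS_ON map, and the helper functions are all hypothetical, not taken from any real inventory or tool.

```python
from collections import deque

# Hypothetical dependency map: each system lists what it needs directly.
# Every name here is illustrative only.
DEPENDS_ON = {
    "deploy-pipeline": ["ci-runner", "corp-sso"],
    "ci-runner": ["artifact-store"],
    "corp-sso": ["auth-product"],
    "artifact-store": [],
    "auth-product": ["primary-db"],
    "primary-db": [],
    "status-page": ["static-host"],
    "static-host": [],
}


def find_path(tool, target, graph):
    """Return one shortest dependency chain from tool to target, or None."""
    queue = deque([[tool]])
    seen = {tool}
    while queue:
        path = queue.popleft()
        for dep in graph.get(path[-1], []):
            if dep == target:
                return path + [dep]
            if dep not in seen:  # guards against cycles in the graph
                seen.add(dep)
                queue.append(path + [dep])
    return None


def circular_risks(tool, recovers, graph):
    """Map each system the tool may need to recover to the chain reaching it."""
    risks = {}
    for target in recovers:
        path = find_path(tool, target, graph)
        if path is not None:
            risks[target] = path
    return risks


# The pipeline is meant to ship fixes for these two systems.
risks = circular_risks("deploy-pipeline", ["auth-product", "primary-db"], DEPENDS_ON)
for target, path in risks.items():
    print(f"{target}: " + " -> ".join(path))
```

In this made-up graph the pipeline never names the auth product directly, yet the search finds it two hops away through the SSO layer, which is exactly the kind of dependency a first-hop review misses. A real check is only as good as the dependency map feeding it, so the map itself needs the same re-checking as the systems it describes.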

Scope: this binds the incident/outage-response critical path. Normal-operation tooling can and should dogfood your systems freely; the constraint applies specifically to what you need when things are on fire. When in doubt about whether a tool is on the critical path, assume it is and trace its dependencies.

Consequences

Easier:

  • Outages stay recoverable: you never deadlock on “you need the system to fix the system.”
  • Incident response is calmer and faster — the tools work when needed, and their dependencies are known ahead of time, not discovered mid-incident.

Harder:

  • Some deliberate duplication: a second identity path, separate hosting, an independent secrets store. You can’t fully dogfood the recovery tools, and that’s the point.
  • Independent paths rot if unused, so they carry an ongoing testing obligation. That maintenance cost is the price of a working break-glass.
  • Tracing transitive dependencies takes real effort and judgment, and the chain changes as systems evolve — independence is a property to re-check, not establish once.

New obligations:

  • Any new or changed incident-critical tool declares and documents its independence, and reviewers check it for circular dependencies on its own blast radius.
  • Break-glass paths are tested on a regular cadence under “production is down” assumptions.
  • When a dependency change would pull an incident-critical tool back onto a system it might need to recover, that’s a blocking concern — treat it like any security-first trade-off under ZFN-2.
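
The testing-cadence obligation above can be made mechanical with a small staleness check. This is a rough sketch under stated assumptions: the path names, the dates, and the 90-day limit are hypothetical placeholders, and the record of when each path was last exercised is assumed to be kept somewhere that does not itself depend on production.

```python
from datetime import date, timedelta

# Hypothetical record of when each break-glass path was last exercised
# with production assumed down. Names and dates are illustrative only.
LAST_EXERCISED = {
    "independent-idp-login": date(2026, 5, 20),
    "offline-recovery-credentials": date(2026, 1, 10),
    "static-runbook-host": None,  # never exercised
}

MAX_AGE = timedelta(days=90)  # assumed cadence; choose your own


def overdue_paths(last_exercised, today, max_age=MAX_AGE):
    """Return sorted names of paths never tested or tested too long ago."""
    overdue = []
    for name, last in last_exercised.items():
        if last is None or today - last > max_age:
            overdue.append(name)
    return sorted(overdue)


print(overdue_paths(LAST_EXERCISED, today=date(2026, 6, 12)))
# prints ['offline-recovery-credentials', 'static-runbook-host']
```

Treating a never-exercised path as overdue is a deliberate choice here: an untested path is reported the same way as a stale one, rather than being silently skipped.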

References

  • ZFN-2 — priority ordering; availability is protected at real cost.

Changelog

  • 2026-06-12: First published as a Field Note.