Theo Zourzouvillys

Field Note 36 (current)

An untested backup is not a backup — test it by restoring

By Theo Zourzouvillys
Tags: reliability, infra, incident, operations

TL;DR

A backup you have never restored is not a backup — it’s a hope. The only thing that actually counts is a successful restore, and you do not know you have one until you’ve done it. So treat restoring, not backing up, as the thing you invest in:

  • Rehearse restores on a regular cadence (game days), into a clean environment, as if production were gone.
  • Measure and meet your RTO and RPO — how long recovery takes, and how much data you lose — by actually timing a real restore, not by assuming.
  • Automate the restore as a tested, scripted path rather than a heroic manual reconstruction invented mid-incident.
  • Cover the whole recovery path, not just the data dump: data and schema, config, secrets, dependencies, the app coming back up and serving correct results, and the cutover (DNS, traffic). A restored database that nothing can talk to is not a recovery.
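
To make "measure and meet your RTO and RPO" concrete, here is a minimal sketch of how the two numbers might be derived from timestamps recorded during a rehearsal. The `RestoreDrill` record, its field names, and the example times are all hypothetical; the only point is that both figures come from a timed restore rather than from a target written down in advance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    """Timestamps recorded during one rehearsal (illustrative field names)."""
    outage_declared: datetime         # when the drill treated production as gone
    serving_verified: datetime        # when the restored system passed correctness checks
    last_recoverable_write: datetime  # newest data actually present after the restore


def measured_rto(drill: RestoreDrill) -> timedelta:
    # Recovery time: from declared loss to verified, correct serving.
    return drill.serving_verified - drill.outage_declared


def measured_rpo(drill: RestoreDrill) -> timedelta:
    # Data-loss window: writes made after the newest recovered data, up to the loss.
    return drill.outage_declared - drill.last_recoverable_write


def meets_targets(drill: RestoreDrill, rto_target: timedelta, rpo_target: timedelta) -> bool:
    return measured_rto(drill) <= rto_target and measured_rpo(drill) <= rpo_target


if __name__ == "__main__":
    # Made-up example values, not real measurements.
    drill = RestoreDrill(
        outage_declared=datetime(2026, 1, 13, 10, 0),
        serving_verified=datetime(2026, 1, 13, 12, 30),
        last_recoverable_write=datetime(2026, 1, 13, 9, 45),
    )
    print(measured_rto(drill))  # 2:30:00
    print(measured_rpo(drill))  # 0:15:00
    print(meets_targets(drill, timedelta(hours=1), timedelta(minutes=30)))  # False
```

Note that the clock stops at verified serving, not at "the database process started", which matches the whole-path definition of recovery used in this note.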

Context

Backups fail silently and recovery fails loudly. The backup job reports success while quietly writing truncated, corrupt, or incomplete data; a retention change drops the snapshot you needed; an extension, a sequence, or a piece of config wasn’t in scope. None of this surfaces until you try to restore — and if the first time you try is during the incident, with the clock running and customers down, you discover all of it at the worst possible moment.
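
A cheap early signal for the "reports success while writing truncated or corrupt data" case is to compare each backup artifact against a size and checksum recorded by the writer. The sketch below assumes such a manifest exists (the function names and the manifest itself are hypothetical). It catches truncation and bit-level corruption only; it says nothing about whether the artifact restores into a working system, so it complements a restore test rather than replacing one.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dumps do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_size: int, expected_sha256: str) -> list[str]:
    """Return a list of problems; an empty list means these cheap checks passed."""
    if not path.exists():
        return [f"{path} is missing"]
    problems = []
    actual_size = path.stat().st_size
    if actual_size != expected_size:
        problems.append(f"size {actual_size} != expected {expected_size} (truncated?)")
    if sha256_of(path) != expected_sha256:
        problems.append("sha256 mismatch (corrupt or incomplete?)")
    return problems
```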

And “restore the data” is rarely the actual job. Recovery is the whole path back to serving: the data, yes, but also the schema and migrations, the configuration and the secrets to reach it (ZFN-35), the dependencies the app needs, the DNS and traffic cutover, and the verification that what came back is correct, not just up. Teams back up the database and call it disaster recovery, then find in the real disaster that they can’t actually reconstitute a working system from what they kept.

Your priority ordering puts availability high (ZFN-2). Being able to recover from an outage is the most basic way of protecting it, and an unrecoverable outage is the worst failure there is.

Recommendation

Invest in restores, not just backups. Prove recovery; don’t assume it.

  • Define RTO and RPO, then verify them by restoring. Targets on a wiki are aspirations. A timed, end-to-end restore is the only evidence you can actually meet them — and it routinely reveals the real numbers are far worse than assumed.
  • Rehearse on a cadence, to a clean target. Restore into a fresh, isolated environment on a regular schedule (game days), as though production no longer exists. If it only works because some surviving prod resource was reused, it isn’t a real test.
  • Automate the restore path. A scripted, repeatable restore that anyone can run beats a heroic manual rebuild remembered by one person. The runbook is code, exercised regularly so it doesn’t rot.
  • Restore the whole system, and verify correctness. Data + schema/migrations + config + secrets + dependencies + the app serving + the cutover — and then check the result is right (run real queries, reconcile counts), not just that a process started. “It booted” is not “it recovered.”
  • Cover the small disasters too. Point-in-time recovery, single-tenant restore, and undoing an accidental delete are far more common than total loss — test restoring part of the system, not only full DR.
  • Monitor restore success, not just backup success, and make the restore path independent of what it recovers (ZFN-4) — the runbook, credentials, and tooling you need to recover must not live only inside the system that’s down.
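
As one way to picture "the runbook is code" together with "verify correctness", the sketch below runs named restore steps in order, times each one, and fails loudly if restored row counts do not match counts recorded at backup time. It uses an in-memory SQLite database purely as a stand-in for a clean target. `run_restore`, `reconcile_counts`, the `orders` table, and the expected-counts manifest are hypothetical names chosen for illustration, and a real restore would have many more steps (config, secrets, dependencies, cutover).

```python
import sqlite3
import time
from typing import Callable


def run_restore(steps: list[tuple[str, Callable[[], object]]]) -> dict[str, float]:
    """Run named steps in order, timing each; any exception aborts the restore."""
    timings: dict[str, float] = {}
    for name, step in steps:
        started = time.monotonic()
        step()  # raises on failure, so a broken step cannot pass silently
        timings[name] = time.monotonic() - started
    return timings


def reconcile_counts(conn: sqlite3.Connection, expected: dict[str, int]) -> None:
    """Compare restored row counts against counts recorded at backup time."""
    for table, want in expected.items():
        # Table names come from our own manifest here, never from user input.
        (got,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        if got != want:
            raise RuntimeError(f"{table}: restored {got} rows, expected {want}")


if __name__ == "__main__":
    # Stand-in for a fresh, isolated target: an empty in-memory database.
    target = sqlite3.connect(":memory:")
    dump = (
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);"
        "INSERT INTO orders VALUES (1, 9.5), (2, 20.0);"
    )
    timings = run_restore([
        ("load schema and data", lambda: target.executescript(dump)),
        ("reconcile counts", lambda: reconcile_counts(target, {"orders": 2})),
    ])
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f}s")
```

The per-step timings feed directly into the measured RTO, and because the steps are ordinary functions the same script can run on a schedule, which is what keeps it from rotting.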

Consequences

Easier:

  • You know — with evidence — that you can recover, how long it takes, and how much you’d lose, before the incident instead of during it.
  • Silent backup rot, missing pieces, and broken restore steps are found on a calm Tuesday, not at 3am mid-outage.
  • Recovery becomes a routine, automated, calm procedure rather than a panicked one-off reconstruction.

Harder:

  • Real work: a clean restore environment, automation, and a recurring game-day cadence that competes with feature work — and is easy to let slide because backups look fine.
  • Restoring large datasets to test is slow and can be costly; you have to budget the time and compute, and you may need to sample or scope rehearsals rather than always restoring at full scale.
  • Keeping the whole-path restore current as the system evolves is ongoing — new dependencies, config, and secrets must keep getting into scope, or the test quietly stops matching reality.
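
One hedged way to notice when the rehearsal has stopped matching reality is to diff an inventory of what the system needs against what the restore actually exercises. The sketch assumes you maintain both lists somewhere; the `scope_drift` function and the component names are hypothetical.

```python
def scope_drift(inventory: set[str], restore_scope: set[str]) -> dict[str, list[str]]:
    """Compare what the system needs with what the rehearsal restores."""
    return {
        # Needed in production but never exercised by the rehearsal.
        "not_rehearsed": sorted(inventory - restore_scope),
        # Still rehearsed but no longer part of the system.
        "obsolete": sorted(restore_scope - inventory),
    }


if __name__ == "__main__":
    # Illustrative component names only.
    inventory = {"orders-db", "app-config", "db-credentials", "search-index"}
    restore_scope = {"orders-db", "app-config", "legacy-cache"}
    print(scope_drift(inventory, restore_scope))
    # {'not_rehearsed': ['db-credentials', 'search-index'], 'obsolete': ['legacy-cache']}
```

A check like this is only as good as the inventory behind it, so it narrows the gap rather than closing it; the rehearsal itself remains the real test.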

References

  • ZFN-4 — the recovery path must not depend on the system it recovers; test it assuming production is down.
  • ZFN-2 — availability is protected at real cost; a recoverable outage is the floor.
  • ZFN-35 — recovery needs the secrets/config to reach the data, not just the data.
  • ZFN-24 — know what your source of truth is; that’s what you must be able to restore.

Changelog

  • 2026-06-12: First published as a Field Note.