Field Note 36 (current)

An untested backup is not a backup — test it by restoring

By: Theo Zourzouvillys
Published: June 12, 2026
Tags: reliability, infra, incident, operations

TL;DR

A backup you have never restored is not a backup — it’s a hope. The only thing that actually counts is a successful restore, and you do not know you have one until you’ve done it. So treat restoring, not backing up, as the thing you invest in:

Rehearse restores on a regular cadence (game days), into a clean environment, as if production were gone.
Measure and meet your RTO (recovery time objective: how long recovery takes) and RPO (recovery point objective: how much data you lose) by actually timing a real restore, not by assuming.
Automate the restore into a tested, scripted path, not a heroic manual reconstruction invented mid-incident.
Cover the whole recovery path, not just the data dump: data and schema, config, secrets, dependencies, the app coming back up and serving correct results, and the cutover (DNS, traffic). A restored database that nothing can talk to is not a recovery.
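As a rough illustration of measuring RTO and RPO from a rehearsal rather than assuming them, the sketch below compares timestamps captured during one drill against the targets. The `DrillResult` fields, the `evaluate_drill` function, and the sample times and targets are all hypothetical; how you capture the timestamps depends on your own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps captured during one restore rehearsal (hypothetical fields)."""
    failure_declared_at: datetime   # when the drill "lost" production
    last_recoverable_at: datetime   # newest data point present after the restore
    serving_again_at: datetime      # when the restored system passed verification


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery time and data loss against the targets."""
    measured_rto = result.serving_again_at - result.failure_declared_at
    measured_rpo = result.failure_declared_at - result.last_recoverable_at
    return {
        "measured_rto": measured_rto,
        "measured_rpo": measured_rpo,
        "rto_met": measured_rto <= rto,
        "rpo_met": measured_rpo <= rpo,
    }


# Made-up drill: 12 minutes of data lost, 2.5 hours to serve again.
drill = DrillResult(
    failure_declared_at=datetime(2026, 6, 1, 10, 0),
    last_recoverable_at=datetime(2026, 6, 1, 9, 48),
    serving_again_at=datetime(2026, 6, 1, 12, 30),
)
report = evaluate_drill(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=15))
print(report["rto_met"], report["rpo_met"])  # False True
```

In this made-up drill the RPO target holds but the RTO target is missed by a wide margin, which is the kind of gap a timed restore is meant to surface.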

Context

Backups fail silently and recovery fails loudly. The backup job reports success while quietly writing truncated, corrupt, or incomplete data; a retention change drops the snapshot you needed; an extension, a sequence, or a piece of config wasn’t in scope. None of this surfaces until you try to restore — and if the first time you try is during the incident, with the clock running and customers down, you discover all of it at the worst possible moment.

And “restore the data” is rarely the actual job. Recovery is the whole path back to serving: the data, yes, but also the schema and migrations, the configuration and the secrets to reach it (ZFN-35), the dependencies the app needs, the DNS and traffic cutover, and the verification that what came back is correct, not just up. Teams back up the database and call it disaster recovery, then find in the real disaster that they can’t actually reconstitute a working system from what they kept.
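One simple way to check "correct, not just up" is to reconcile row counts in the restored database against a manifest recorded when the backup was taken. The sketch below assumes such a manifest exists and uses an in-memory SQLite database as a stand-in for the restored store; the table names and the `reconcile_counts` helper are hypothetical. Matching counts are a minimum check, not proof that the contents are right.

```python
import sqlite3


def reconcile_counts(conn: sqlite3.Connection, expected: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {table: (expected, actual)} for every table whose row count differs."""
    mismatches = {}
    for table, want in expected.items():
        # Table names come from our own manifest, not user input; quoted anyway.
        got = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if got != want:
            mismatches[table] = (want, got)
    return mismatches


# Stand-in for a restored database: an in-memory SQLite instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO accounts (id) VALUES (?)", [(i,) for i in range(3)])
conn.executemany("INSERT INTO invoices (id) VALUES (?)", [(i,) for i in range(4)])

# Hypothetical manifest recorded at backup time.
manifest = {"accounts": 3, "invoices": 5}
mismatches = reconcile_counts(conn, manifest)
print(mismatches)  # {'invoices': (5, 4)}
```

A table that is missing entirely raises `sqlite3.OperationalError` here, which is a reasonable outcome for a drill: a missing table should fail loudly rather than be reported as a count of zero.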

Your priority ordering puts availability high (ZFN-2); a recoverable outage is the most basic form of protecting it, and an unrecoverable one is the worst failure there is.

Recommendation

Invest in restores, not just backups. Prove recovery; don’t assume it.

Define RTO and RPO, then verify them by restoring. Targets on a wiki are aspirations. A timed, end-to-end restore is the only evidence you can actually meet them — and it routinely reveals the real numbers are far worse than assumed.
Rehearse on a cadence, to a clean target. Restore into a fresh, isolated environment on a regular schedule (game days), as though production no longer exists. If it only works because some surviving prod resource was reused, it isn’t a real test.
Automate the restore path. A scripted, repeatable restore that anyone can run beats a heroic manual rebuild remembered by one person. The runbook is code, exercised regularly so it doesn’t rot.
Restore the whole system, and verify correctness. Data + schema/migrations + config + secrets + dependencies + the app serving + the cutover — and then check the result is right (run real queries, reconcile counts), not just that a process started. “It booted” is not “it recovered.”
Cover the small disasters too. Point-in-time recovery, single-tenant restore, and undoing an accidental delete are far more common than total loss — test restoring part of the system, not only full DR.
Monitor restore success, not just backup success, and make the restore path independent of what it recovers (ZFN-4): the runbook, credentials, and tooling you need to recover must not live only inside the system that’s down.
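The recommendations above can be sketched as a runbook expressed in code: an ordered list of steps, each timed, with the run stopping at the first failure so that a broken verification never leads to a cutover. This is a minimal sketch, not a prescribed implementation; `run_restore`, `StepResult`, `RestoreReport`, and the step names are hypothetical, and the placeholder lambdas stand in for real provisioning, restore, and verification logic.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


@dataclass
class RestoreReport:
    steps: list[StepResult] = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # An empty run is not a success: nothing was proven.
        return bool(self.steps) and all(s.ok for s in self.steps)

    @property
    def total_seconds(self) -> float:
        return sum(s.seconds for s in self.steps)


def run_restore(steps: list[tuple[str, Callable[[], None]]]) -> RestoreReport:
    """Run steps in order, timing each one; stop at the first failure."""
    report = RestoreReport()
    for name, action in steps:
        started = time.monotonic()
        try:
            action()
        except Exception as exc:  # a failed step ends the drill; record why
            report.steps.append(StepResult(name, False, time.monotonic() - started, str(exc)))
            break
        report.steps.append(StepResult(name, True, time.monotonic() - started))
    return report


def fail_verification() -> None:
    raise RuntimeError("row counts do not match manifest")


# Placeholder steps; real ones would provision, restore, and verify.
steps = [
    ("provision clean environment", lambda: None),
    ("restore data and schema", lambda: None),
    ("load config and secrets", lambda: None),
    ("start app", lambda: None),
    ("verify correctness", fail_verification),
    ("cut over traffic", lambda: None),
]
report = run_restore(steps)
print(report.succeeded, [s.name for s in report.steps if not s.ok])
# False ['verify correctness']
```

Because the report records per-step durations and the failing step, the same structure can feed a "last successful restore" signal for monitoring, separate from whatever the backup job reports.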

Consequences

Easier:

You know — with evidence — that you can recover, how long it takes, and how much you’d lose, before the incident instead of during it.
Silent backup rot, missing pieces, and broken restore steps are found on a calm Tuesday, not at 3am mid-outage.
Recovery becomes a routine, automated, calm procedure rather than a panicked one-off reconstruction.

Harder:

Real work: a clean restore environment, automation, and a recurring game-day cadence that competes with feature work — and is easy to let slide because backups look fine.
Restoring large datasets to test is slow and can be costly; you have to budget the time and compute, and may need to sample or scope rehearsals rather than always running them at full scale.
Keeping the whole-path restore current as the system evolves is ongoing — new dependencies, config, and secrets must keep getting into scope, or the test quietly stops matching reality.
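One way to notice the restore test drifting from reality is to compare an inventory of what the system depends on with the list of what the rehearsal actually restores. The sketch below assumes both lists are available as sets of names; the component names and the `scope_drift` helper are hypothetical.

```python
def scope_drift(inventory: set[str], restore_scope: set[str]) -> dict[str, set[str]]:
    """Compare what the system depends on with what the restore rehearsal covers."""
    return {
        "uncovered": inventory - restore_scope,  # in production, never restored in a drill
        "stale": restore_scope - inventory,      # still rehearsed, no longer in production
    }


# Hypothetical component names, for illustration only.
inventory = {"orders-db", "app-config", "db-credentials", "search-index"}
restore_scope = {"orders-db", "app-config", "legacy-cache"}

drift = scope_drift(inventory, restore_scope)
print(sorted(drift["uncovered"]))  # ['db-credentials', 'search-index']
print(sorted(drift["stale"]))      # ['legacy-cache']
```

Running a check like this on a schedule, and failing it when `uncovered` is non-empty, turns the slow drift into a visible failure; it is only as good as the inventory behind it.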

References

ZFN-4, "Incident tooling must not depend on what it recovers": the recovery path must not depend on the system it recovers; test it assuming production is down.
ZFN-2, "Engineering priority ordering": availability is protected at real cost; a recoverable outage is the floor.
ZFN-35, "Reference secrets in config; dereference, refresh, and re-fetch": recovery needs the secrets/config to reach the data, not just the data.
ZFN-24, "One transactional store per write; propagate changes asynchronously": know what your source of truth is; that’s what you must be able to restore.

Changelog

2026-06-12: First published as a Field Note.