---
id: 36
title: "An untested backup is not a backup — test it by restoring"
status: current
date: 2026-06-12
authors:
  - "Theo Zourzouvillys"
tags: [reliability, infra, incident, operations]
summary: "An untested backup is a hope, not a backup — the only thing that counts is a restore. Rehearse restores regularly (game days), measure and meet your RTO/RPO, automate them, and cover the whole recovery path — data, schema, config, secrets, cutover — not just the dump."
supersedes: null
superseded_by: null
aliases: []
---

## TL;DR

A backup you have never restored is not a backup — it's a **hope**. The only thing that actually
counts is a **successful restore**, and you do not know you have one until you've done it. So treat
restoring, not backing up, as the thing you invest in:

- **Rehearse restores on a regular cadence** (game days), into a clean environment, as if production
  were gone.
- **Measure and meet your RTO and RPO** — how long recovery takes, and how much data you lose — by
  actually timing a real restore, not by assuming.
- **Automate the restore** into a tested, scripted path, not a heroic manual reconstruction invented
  mid-incident.
- **Cover the whole recovery path**, not just the data dump: data *and* schema, config, secrets,
  dependencies, the app coming back up and **serving correct results**, and the cutover (DNS, traffic).
  A restored database that nothing can talk to is not a recovery.

## Context

Backups fail silently and recovery fails loudly. The backup job reports success while quietly writing
truncated, corrupt, or incomplete data; a retention change drops the snapshot you needed; an extension,
a sequence, or a piece of config wasn't in scope. None of this surfaces until you try to restore — and
if the first time you try is during the incident, with the clock running and customers down, you
discover all of it at the worst possible moment.

And "restore the data" is rarely the actual job. Recovery is the *whole path* back to serving: the data,
yes, but also the schema and migrations, the configuration and the secrets to reach it
([ZFN-35](/zfn/35-dereference-secrets-not-store-in-config/)), the dependencies the app needs, the DNS
and traffic cutover, and the verification that what came back is *correct*, not just *up*. Teams back up
the database and call it disaster recovery, then find in the real disaster that they can't actually
reconstitute a working system from what they kept.

> [!aside] The pattern I keep seeing
>
> The recurring story isn't "we had no backups." It's "we had backups, the job was green for months,
> and the first time anyone restored one was during the outage — when we learned it had been silently
> truncating, or that we'd never captured the one config file that made it all work." Green backup
> dashboards are comfortable and meaningless on their own.

Your priority ordering puts availability high ([ZFN-2](/zfn/2-engineering-priority-ordering/)). Being
able to recover from an outage is the most basic form of protecting it, and an outage you cannot
recover from is the worst failure there is.

## Recommendation

**Invest in restores, not just backups. Prove recovery; don't assume it.**

- **Define RTO and RPO, then verify them by restoring.** Targets on a wiki are aspirations. A timed,
  end-to-end restore is the only evidence that you can actually meet them, and it routinely reveals that
  the real numbers are far worse than assumed.
- **Rehearse on a cadence, to a clean target.** Restore into a fresh, isolated environment on a regular
  schedule (game days), as though production no longer exists. If it only works because some surviving
  prod resource was reused, it isn't a real test.
- **Automate the restore path.** A scripted, repeatable restore that anyone can run beats a heroic manual
  rebuild remembered by one person. The runbook is code, exercised regularly so it doesn't rot.
- **Restore the *whole* system, and verify correctness.** Data + schema/migrations + config + secrets +
  dependencies + the app serving + the cutover — and then *check the result is right* (run real queries,
  reconcile counts), not just that a process started. "It booted" is not "it recovered."
- **Cover the small disasters too.** Point-in-time recovery, single-tenant restore, and undoing an
  accidental delete are far more common than total loss — test restoring *part* of the system, not only
  full DR.
- **Monitor restore success, not just backup success**, and make the restore path **independent of what
  it recovers** ([ZFN-4](/zfn/4-incident-tooling-independence/)) — the runbook, credentials, and tooling
  you need to recover must not live only inside the system that's down.
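
As a minimal sketch of the "verify RTO/RPO by restoring" and "verify correctness" points above, a
rehearsal harness might time the restore and reconcile row counts roughly like this. Everything named
here is an assumption for illustration: `evaluate_rehearsal`, `RehearsalResult`, and the injected
callables (`restore`, `newest_restored_record`, `restored_counts`) are hypothetical hooks you would
wire to your own restore script and datastore, and `expected_counts` is assumed to have been recorded
at backup time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List


@dataclass
class RehearsalResult:
    rto_actual: timedelta  # wall-clock time from starting the restore to finishing verification
    rpo_actual: timedelta  # gap between the simulated disaster and the newest restored record
    problems: List[str]    # empty list means the rehearsal met its targets

    @property
    def passed(self) -> bool:
        return not self.problems


def evaluate_rehearsal(
    restore: Callable[[], None],                         # hypothetical: runs the scripted restore
    newest_restored_record: Callable[[], datetime],      # hypothetical: newest timestamp in restored data
    restored_counts: Callable[[], Dict[str, int]],       # hypothetical: row counts per table after restore
    expected_counts: Dict[str, int],                     # assumed to be captured when the backup was taken
    disaster_time: datetime,
    rto_target: timedelta,
    rpo_target: timedelta,
    clock: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> RehearsalResult:
    started = clock()
    restore()
    # Verification queries run before the clock stops, so they count toward the measured RTO.
    counts = restored_counts()
    newest = newest_restored_record()
    finished = clock()

    rto_actual = finished - started
    rpo_actual = disaster_time - newest

    problems: List[str] = []
    if rto_actual > rto_target:
        problems.append(f"RTO missed: took {rto_actual}, target {rto_target}")
    if rpo_actual > rpo_target:
        problems.append(f"RPO missed: lost {rpo_actual} of data, target {rpo_target}")
    for table, expected in sorted(expected_counts.items()):
        actual = counts.get(table)
        if actual != expected:
            problems.append(f"{table}: expected {expected} rows, restored {actual}")

    return RehearsalResult(rto_actual, rpo_actual, problems)
```

Injecting the restore and the queries as callables keeps the harness independent of any particular
database or backup tool, and injecting `clock` makes the harness itself testable. A scheduled game day
would call this against a fresh environment and alert when `passed` is false, which is one way to
monitor restore success rather than backup success.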

## Consequences

**Easier:**

- You know — with evidence — that you can recover, how long it takes, and how much you'd lose, before
  the incident instead of during it.
- Silent backup rot, missing pieces, and broken restore steps are found on a calm Tuesday, not at 3am
  mid-outage.
- Recovery becomes a routine, automated, calm procedure rather than a panicked one-off reconstruction.

**Harder:**

- Real work: a clean restore environment, automation, and a recurring game-day cadence that competes
  with feature work — and is easy to let slide because backups *look* fine.
- Restoring large datasets to test is slow and can be costly; you have to budget the time and compute,
  and may need to sample or scope rehearsals rather than always running them at full scale.
- Keeping the whole-path restore current as the system evolves is ongoing — new dependencies, config,
  and secrets must keep getting into scope, or the test quietly stops matching reality.
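
One way to catch the scope drift described in the last point is to compare what the system currently
depends on against what the most recent rehearsal actually restored. The sketch below assumes you
maintain both inventories yourself as plain sets keyed by category; `restore_scope_gaps` and the
category names are hypothetical, chosen to mirror the "whole path" list in this note.

```python
from typing import Dict, Set


def restore_scope_gaps(
    system_inventory: Dict[str, Set[str]],  # what the running system depends on, per category
    rehearsed: Dict[str, Set[str]],         # what the last rehearsal actually restored, per category
) -> Dict[str, Set[str]]:
    """Return, per category, the items the system needs that the rehearsal did not cover."""
    gaps: Dict[str, Set[str]] = {}
    for category in sorted(system_inventory):
        missing = system_inventory[category] - rehearsed.get(category, set())
        if missing:
            gaps[category] = missing
    return gaps


# Illustrative data only: a new secret and a new dependency were added after the last game day.
inventory = {
    "data": {"orders_db"},
    "config": {"app.yaml"},
    "secrets": {"db_password", "payments_api_key"},
    "dependencies": {"queue", "search_index"},
}
last_rehearsal = {
    "data": {"orders_db"},
    "config": {"app.yaml"},
    "secrets": {"db_password"},
    "dependencies": {"queue"},
}
gaps = restore_scope_gaps(inventory, last_rehearsal)
```

Here `gaps` reports `payments_api_key` under `secrets` and `search_index` under `dependencies`. Failing a
CI check or a game-day preflight when `gaps` is non-empty keeps the rehearsal from quietly diverging
from the real system. The check is only as good as the inventory, which still has to be kept current.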

## References

- [ZFN-4](/zfn/4-incident-tooling-independence/) — the recovery path must not depend on the system it
  recovers; test it assuming production is down.
- [ZFN-2](/zfn/2-engineering-priority-ordering/) — availability is protected at real cost; a recoverable
  outage is the floor.
- [ZFN-35](/zfn/35-dereference-secrets-not-store-in-config/) — recovery needs the secrets/config to reach
  the data, not just the data.
- [ZFN-24](/zfn/24-one-transactional-store-per-write/) — know what your source of truth is; that's what
  you must be able to restore.

## Changelog

- **2026-06-12**: First published as a Field Note.