---
id: 13
title: "Fail fast and push back: retries, load shedding, and flow control"
status: current
date: 2026-06-12
authors:
  - "Theo Zourzouvillys"
tags: [reliability, architecture, infra, resilience]
summary: "Build client retries (backoff, jitter, Retry-After) from day one. Under overload, shed fast and push the failure back to the source to retry — don't retry internally and amplify it. Flow-control everywhere, bound every queue, and don't take more work than you can finish in time."
supersedes: null
superseded_by: null
aliases: []
references:
  - id: metastable
    title: "Metastable Failures in Distributed Systems (HotOS 2021)"
    url: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf
    abstract: "Names and characterises metastable failures: a trigger pushes a system into a degraded state that then sustains itself through a feedback loop — often retries amplifying load — and persists even after the original trigger is gone, so the system won't recover on its own until the load is removed or capacity is added."
---

## TL;DR

How a system behaves *at the edge of its capacity* decides whether a load spike is a blip or an
outage — and the behaviors that save you have to be designed in from the start, because they're a
contract between caller and callee, not a feature you bolt on later. The rules:

- **Add client-side retries from day one** — exponential backoff, **jitter**, idempotency, and a
  **retry budget/circuit breaker** so retries can't become a storm. Honor **`Retry-After`**.
- **Load-shed quickly.** When you're over capacity, reject *fast and cheap* at the front door (e.g.
  `429`/`503` with `Retry-After`) rather than accepting the work and failing slowly. A fast failure
  the caller can act on beats a slow one that has already timed out.
- **Push shed failures back to the source to retry — don't retry internally.** Internal retries deep
  in the stack multiply load (retries compound at every layer) and hide the backpressure signal. The
  *original caller* has the context to retry sanely; let the failure propagate to it.
- **Flow-control everywhere.** Every boundary applies backpressure: bounded concurrency, admission
  control, and **bounded queues**. A full bounded queue *is* the "slow down" signal — let it shed,
  don't let it grow.
- **Don't take more work than you can realistically finish in reasonable time.** Accept work only if
  you can complete it within its deadline; drop work whose deadline has already passed instead of
  doing doomed work.

## Context

Most systems are tested where they have headroom, so their *overload* behavior is whatever fell out
by accident — and what usually falls out is the worst option: accept everything, queue it without
bound, slow down, time out, and retry internally. That combination is how a brief spike becomes a
**[metastable failure](ref:metastable)** — the system stays down even after the original trigger is gone, because it's
now generating its own load:

- **Unbounded queues** absorb the overload invisibly until latency and memory explode; by the time an
  item is processed, the caller has long since given up, so the work is wasted *and* it pushed out
  work that still mattered.
- **Slow failures** hold connections, threads, and memory while they fail, so overload in one place
  becomes resource exhaustion everywhere — a cascading failure.
- **Retries layered at every hop** turn one client request into a multiplicative fan-out: if each of
  three layers makes up to three attempts, one request can become 3 × 3 × 3 = 27 downstream calls,
  precisely when the downstream is already drowning. This is the classic retry storm.
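
The fan-out in the last bullet is just a product of per-layer attempt counts. A minimal sketch of that arithmetic, counting *attempts* (first try plus retries) per layer; the function name is illustrative, not from any library:

```python
import math
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Worst-case calls reaching the bottom dependency for one original request.

    Each entry is the maximum number of attempts (first try + retries)
    that one layer makes against the layer below it.
    """
    return math.prod(attempts_per_layer)


print(worst_case_calls([3, 3, 3]))  # 27: three layers, three attempts each
print(worst_case_calls([4, 4, 4]))  # 64: one try plus three retries at each layer
print(worst_case_calls([3, 1, 1]))  # 3: only the outermost layer retries
```

The last line is the case this note argues for: with retries confined to one layer, the worst case grows linearly with the retry count instead of multiplying across hops.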

You can't retrofit your way out of this cheaply, because retries, idempotency, deadlines, and
backpressure are part of the *interface* between services. If clients were written without retries
and backoff, every caller is already wrong; if a queue was unbounded, every producer feeding it
assumed it would always accept. These properties have to be there from the start.

## Recommendation

**Design for the overloaded case explicitly, and make the whole path push back.**

**Build retries into clients from the start — and make them safe.**

- **Backoff with jitter.** Exponential backoff so retries space out; jitter so a thousand clients
  don't retry in lockstep and re-synchronize the spike.
- **Honor `Retry-After`.** When a server sheds or rate-limits, it should *say when to come back*
  (`Retry-After` on `429`/`503`); clients obey it. This converts blind retry into coordinated retry
  and is one of the cheapest defenses against retry storms.
- **Bound retries.** A **retry budget** (retries capped as a fraction of total requests) and/or a
  **circuit breaker** so a struggling dependency gets *less* traffic, not more. Retrying forever is
  how you keep a downstream dead.
- **Idempotency first.** Retries are only safe if the operation is idempotent (idempotency keys for
  writes). Build that in alongside the retry, not after the first double-charge.
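
As a rough illustration of the first three bullets, the sketch below computes a jittered backoff delay that respects a server-supplied `Retry-After`, and gates retries behind a simple budget. Everything here is an assumption made for illustration: the names `retry_delay` and `RetryBudget`, the default `base`, `cap`, and `ratio` values, and the choice of "full jitter". It also assumes `Retry-After` has already been parsed into seconds (the header may instead carry an HTTP date, which is not handled here).

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 0.1, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0 = first retry)."""
    # Exponential ceiling, capped so the wait cannot grow without bound.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: a uniform draw in [0, ceiling] de-synchronizes clients.
    jitter = random.uniform(0.0, ceiling)
    if retry_after is not None:
        # Never come back earlier than the server asked; the jitter on top
        # keeps clients from all returning at the same instant.
        return retry_after + jitter
    return jitter


class RetryBudget:
    """Allow retries only while they stay within `ratio` of requests seen.

    Simplification: counts are cumulative. A production budget would
    normally use a sliding window or a token bucket.
    """

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Return True and count the retry if the budget allows one more."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```

A caller would check `budget.try_spend()` before sleeping for `retry_delay(...)` and retrying; when the budget says no, the failure is surfaced instead of retried, so a struggling dependency sees less traffic rather than more.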

**Shed load fast, at admission.** Decide whether you can serve a request *before* doing expensive work
— cheaply, at the front door. If you're over your concurrency or queue limit, reject immediately with
a clear, retryable signal and a `Retry-After`. Fast rejection lets the caller back off and try
elsewhere or later; slow rejection burns resources on both sides and usually ends in a timeout anyway. Shed
the *least important* work first where you can (load-shedding by priority — see
[ZFN-2](/zfn/2-engineering-priority-ordering/)).
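
One way to picture "reject at the front door" is a non-blocking concurrency gate checked before any real work starts. This is a minimal sketch, not a prescribed implementation: `AdmissionGate`, `Rejection`, and `handle` are hypothetical names, the `(status, headers, body)` tuple stands in for whatever response type your framework uses, and the fixed `Retry-After` of one second is a placeholder. Priority-based shedding is left out.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Rejection:
    status: int
    retry_after_seconds: int


class AdmissionGate:
    """Caps in-flight requests; admission checks never block."""

    def __init__(self, max_in_flight: int, retry_after_seconds: int = 1):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._retry_after = retry_after_seconds

    def try_admit(self) -> Optional[Rejection]:
        """Return None if admitted, otherwise a Rejection to send back."""
        if self._slots.acquire(blocking=False):
            return None
        return Rejection(status=503, retry_after_seconds=self._retry_after)

    def release(self) -> None:
        self._slots.release()


def handle(gate: AdmissionGate,
           work: Callable[[], bytes]) -> Tuple[int, Dict[str, str], bytes]:
    # Decide before doing anything expensive.
    rejection = gate.try_admit()
    if rejection is not None:
        headers = {"Retry-After": str(rejection.retry_after_seconds)}
        return rejection.status, headers, b""
    try:
        return 200, {}, work()
    finally:
        gate.release()  # free the slot even if the work raised
```

The rejection path allocates almost nothing and takes no locks beyond the semaphore check, which is the point: shedding has to stay cheap when the service is at its worst.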

**Retry at the source, not in the middle.** When a layer sheds, propagate the failure up to the
original caller and let *it* decide whether and when to retry. Don't bury retries inside intermediate
services: they compound across layers, they retry work the caller may no longer want, and they
suppress the backpressure that should reach the edge. Retry at one level — the outermost one that owns
the request and its budget.
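
A small sketch of what "retry at one level" can look like in code. The `Overloaded` exception, `middle_tier`, and `edge_call` are hypothetical names invented for this example; the injected `sleep` parameter exists only to keep the sketch testable, and real code would combine the wait with jittered backoff.

```python
from typing import Callable


class Overloaded(Exception):
    """Hypothetical error a layer raises when it sheds a request."""

    def __init__(self, retry_after: float):
        super().__init__(f"overloaded, retry after {retry_after}s")
        self.retry_after = retry_after


def middle_tier(call_downstream: Callable[[], str]) -> str:
    # Deliberately no retry loop: if the downstream sheds, Overloaded
    # propagates unchanged so the backpressure signal reaches the edge.
    return call_downstream()


def edge_call(call: Callable[[], str], max_attempts: int,
              sleep: Callable[[float], None]) -> str:
    """The only retry loop on the path, owned by the original caller."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Overloaded as shed:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure
            sleep(shed.retry_after)  # wait as long as the server asked
    raise AssertionError("unreachable")
```

With this shape, a downstream that sheds every request sees at most `max_attempts` calls per original request, however many intermediate tiers sit in between.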

**Flow-control everywhere; bound every queue.** Every boundary needs backpressure, not silent
buffering:

- **Bound every queue** ([ZFN-12](/zfn/12-queues-topics-journals/)). An unbounded queue is a latent
  outage; a bounded one that rejects when full is a working backpressure signal.
- **Bound concurrency / admission** at each tier (max in-flight, connection limits) so you process at a
  sustainable rate instead of accepting everything and thrashing.
- **Propagate deadlines and cancel doomed work.** Carry a deadline with each request; if it's already
  expired by the time you'd start (a stale queue item, a caller that's gone), **drop it** rather than
  spend capacity on a result no one will use.
- **Take only what you can finish in time.** Admission control means accepting work only when you can
  realistically complete it within its deadline. Promising more than you can deliver just turns into
  timeouts and wasted work under load.
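
The bullets above can be combined into one small sketch: a bounded queue that rejects when full, an admission check that refuses work it cannot finish before the deadline, and a worker step that drops items whose deadline has already passed. The names `Job`, `submit`, and `drain_once` are illustrative, and the completion estimate (queue depth times a fixed per-job service time, single worker) is a deliberately crude assumption.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    deadline: float            # absolute time on the time.monotonic() clock
    run: Callable[[], None]


def submit(q: "queue.Queue[Job]", job: Job, service_seconds: float,
           now: Callable[[], float] = time.monotonic) -> bool:
    """Accept the job only if it can plausibly finish in time; never block."""
    # Rough estimate: everything already queued, plus this job, one worker.
    estimated_finish = now() + (q.qsize() + 1) * service_seconds
    if estimated_finish > job.deadline:
        return False           # would miss its deadline: refuse up front
    try:
        q.put_nowait(job)
    except queue.Full:
        return False           # full bounded queue is the "slow down" signal
    return True


def drain_once(q: "queue.Queue[Job]",
               now: Callable[[], float] = time.monotonic) -> str:
    """Process one queued job, dropping it if its deadline has passed."""
    job = q.get_nowait()
    if now() >= job.deadline:
        return "dropped"       # nobody is waiting for this result any more
    job.run()
    return "ran"
```

A `False` from `submit` is what the caller turns into a fast rejection; creating the queue as `queue.Queue(maxsize=N)` is what makes it bounded, since the default `maxsize=0` means unbounded.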

## Consequences

**Easier:**

- Spikes degrade gracefully: you serve what you can at full speed and cleanly reject the rest, instead
  of slowing everything to a crawl and toppling over.
- No metastable lock-up — bounded retries, `Retry-After`, and edge-only retries stop the system from
  feeding its own overload, so it recovers when the trigger passes.
- Backpressure reaches the source, where the real decision lives: slow down, retry later, or drop.

**Harder:**

- Callers must handle rejection and retry properly — this only works if clients cooperate, which is
  why it has to be in the SDK/contract from the start.
- Idempotency, retry budgets, deadline propagation, and admission control are real engineering with
  real edge cases; "just retry on error" is easier to write and is exactly the trap.
- Load shedding means deliberately failing some requests *now* to keep the system alive — a trade you
  have to be willing to make explicit (and shed by priority, not at random, where you can).
- Bounded queues and admission limits need sizing and tuning, and a too-tight limit sheds work you
  could have served.

## References

- [ZFN-12](/zfn/12-queues-topics-journals/) — bound every queue; a full bounded queue is the
  backpressure signal.
- [ZFN-2](/zfn/2-engineering-priority-ordering/) — shed the least-important work first; bulkheads keep
  one workload from taking the rest down.
- [ZFN-11](/zfn/11-outbound-http-egress-proxy/) — third parties have their own limits; honor their
  `Retry-After` and back off rather than hammering them.
- [Google SRE Book — Handling Overload](https://sre.google/sre-book/handling-overload/) and [Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/).
- [Amazon Builders' Library — Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) and [Using load shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/).
- [Metastable Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf) — why retry amplification keeps systems down after the trigger is gone.

## Changelog

- **2026-06-12**: First published as a Field Note.