---
id: 37
title: "Every lock is a lease"
status: current
date: 2026-06-12
authors:
  - "Theo Zourzouvillys"
tags: [architecture, infra, reliability, consistency]
summary: "A lock that can outlive its holder is a deadlock scheduled for later. Give every lock — including informal ones like claimed_by columns — a TTL, a named owner, and a heartbeat; make expiry automatic and server-side; fence the side effects so a stale holder can't corrupt anything."
supersedes: null
superseded_by: null
aliases: []
references:
  - id: leases
    title: "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency (SOSP 1989)"
    url: https://dl.acm.org/doi/10.1145/74850.74870
    abstract: "Gray & Cheriton's original lease paper: a lease is a time-bounded grant of authority that the holder must actively renew, so a crashed or partitioned holder loses its rights automatically when the term expires — recovery becomes the passage of time rather than an intervention."
  - id: chubby
    title: "The Chubby lock service for loosely-coupled distributed systems (OSDI 2006)"
    url: https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
    abstract: "Burrows' account of Google's production lock service. Chubby locks ride on sessions with leases and keep-alives — clients that stop renewing lose their locks — and lock sequence numbers (fencing) protect resources from actions by stale holders."
  - id: kleppmann-locking
    title: "How to do distributed locking — Martin Kleppmann (2016)"
    url: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
    abstract: "Analyses why a timed lock alone cannot guarantee mutual exclusion for external side effects (GC pauses, clock skew, in-flight requests) and argues for fencing tokens: a monotonically increasing number issued at each acquisition that downstream systems check, rejecting writes from stale holders."
---

## TL;DR

A lock that can outlive its holder is a deadlock scheduled for later. Treat every lock as a
**lease**:

- **A TTL on every lock.** No infinite holds. Expiry is automatic and server-side — not a cleanup
  job, not the holder's good intentions.
- **A named owner, twice over.** Record the *identity* that may renew it **and** the *task* it
  serves. Those answer different questions, and during an incident you need both.
- **A heartbeat.** Holding is an activity, not a state. Renew at a fraction of the TTL; a missed
  renewal is a crash report.
- **Fencing on the side effects.** Pair the lease with a monotonically increasing token that
  everything the lock protects must check, and make the protected work idempotent — because a lease
  *will* expire mid-flight eventually.

The bar to hold yourself to: you can `kill -9` any holder at any moment, and nobody — human or
machine — has to clean anything up.

## Context

The locks that bite are rarely the ones from a textbook. They're the informal ones: a `claimed_by`
column, a `status = 'processing'` row, a Redis `SETNX` without a TTL, a "current owner" field on a
work item. The happy path works perfectly — acquire, do the work, release — so the design ships. The
failure path is the holder dying *between* acquire and release, and that path strands the lock
precisely because release was the holder's job and the holder is gone.

This is old, settled art. [Leases](ref:leases) — time-bounded grants that the holder must keep
renewing — date to 1989, and every serious lock service is built on them: [Chubby](ref:chubby)
sessions with keep-alives, etcd leases, Consul sessions, DynamoDB's lock client. Yet hand-rolled
application locks skip expiry routinely, because expiry only matters on the path nobody tested.

Two things are raising the stakes. First, orchestrators kill processes as a matter of course —
deploys, autoscaling, OOM kills; "graceful shutdown" is an optimization, not a guarantee. Second,
fleets of short-lived LLM agents and sub-agents now claim work, hold it for minutes, and die
constantly — crashed, cancelled, timed out, or killed by a supervisor. In an agent system, a
stranded claim doesn't just stall one job; it silently parks the subject of the claim, which nobody else will touch.

> [!aside] The 2 a.m. tell
>
> If a runbook anywhere contains `DELETE FROM locks WHERE ...`, that system has deadlocks with extra
> steps — the TTL is a human being woken up at 2 a.m. The subtler cost is hesitation: when killing a
> worker can wedge the system, operators learn to hesitate before killing workers, and hesitation is
> exactly what you can't afford mid-incident.

## Recommendation

**If it grants exclusivity, it's a lock; if it's a lock, it's a lease.** Apply the rule to the
informal locks too — claim columns, assignment rows, "in progress" flags, leader election, cron
overlap guards. The moment one party proceeding excludes another, all of the below applies.

**Expire automatically, server-side.** The store that grants the lock enforces the TTL. Expiry must
not depend on the holder doing anything (it's dead), on a cleanup job someone remembers to run, or
on an operator. Recovery from a crashed holder should be indistinguishable from the passage of time.
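
As a minimal sketch of what "the store enforces the TTL" can look like for an informal claim column, assume a hypothetical `work_items` table with `claimed_by` and `lease_expires_at` columns (names invented for illustration). The claim is one conditional `UPDATE`, so an expired claim is simply claimable again and nothing has to clean it up. `now` is passed in here only to keep the example deterministic; a real store would read its own clock rather than trust the caller's.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical work_items table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE work_items ("
        " id TEXT PRIMARY KEY,"
        " claimed_by TEXT,"
        " lease_expires_at REAL)"
    )
    return db


def try_claim(db: sqlite3.Connection, item_id: str, owner: str,
              now: float, ttl_s: float) -> bool:
    """Claim the item if it is unclaimed or its lease has expired.

    The expiry check lives inside the UPDATE's WHERE clause, so the
    previous holder does not need to do anything for its claim to lapse.
    """
    cur = db.execute(
        "UPDATE work_items SET claimed_by = ?, lease_expires_at = ? "
        "WHERE id = ? AND (claimed_by IS NULL OR lease_expires_at <= ?)",
        (owner, now + ttl_s, item_id, now),
    )
    db.commit()
    return cur.rowcount == 1  # exactly one row changed means we hold the lease


db = open_store()
db.execute("INSERT INTO work_items (id) VALUES ('job-1')")
first = try_claim(db, "job-1", "worker-a", now=100.0, ttl_s=30.0)
blocked = try_claim(db, "job-1", "worker-b", now=110.0, ttl_s=30.0)
after_expiry = try_claim(db, "job-1", "worker-b", now=130.0, ttl_s=30.0)
```

In this sketch `worker-a` never releases anything: `worker-b` is refused while the lease is live and succeeds once the term has passed, which is the whole recovery procedure.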

**Record the owner and the work.** Two fields, not one: the identity allowed to renew the lease, and
the task or run it serves. The first tells you who can act; the second tells you what was
interrupted and lets a supervisor inspect its descendants' claims and force-release the ones whose
work is gone. Keep force-release as an explicit, recorded act — it's an override, and overrides
should leave a trace.
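
One way to picture the two fields and a recorded override is the in-memory sketch below. `Lease`, `LeaseTable`, `claims_for_task`, and `force_release` are hypothetical names chosen for illustration, not an existing API; a real implementation would keep both the leases and the audit trail in the lock store itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Lease:
    resource: str
    owner: str        # identity allowed to renew the lease
    task: str         # task or run the lease serves
    expires_at: float


@dataclass
class LeaseTable:
    leases: Dict[str, Lease] = field(default_factory=dict)
    audit: List[dict] = field(default_factory=list)

    def grant(self, resource: str, owner: str, task: str,
              now: float, ttl_s: float) -> Optional[Lease]:
        """Grant the lease unless a live one already exists."""
        held = self.leases.get(resource)
        if held is not None and held.expires_at > now:
            return None
        lease = Lease(resource, owner, task, now + ttl_s)
        self.leases[resource] = lease
        return lease

    def claims_for_task(self, task: str) -> List[Lease]:
        """What a supervisor inspects: every claim serving a given task."""
        return [l for l in self.leases.values() if l.task == task]

    def force_release(self, resource: str, actor: str,
                      reason: str, now: float) -> bool:
        """Override a lease and leave a trace of who did it and why."""
        lease = self.leases.pop(resource, None)
        if lease is None:
            return False
        self.audit.append({
            "resource": resource, "owner": lease.owner, "task": lease.task,
            "released_by": actor, "reason": reason, "at": now,
        })
        return True
```

The owner answers "who may renew this", the task answers "what was interrupted", and the audit entry keeps both alongside the identity that performed the override.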

**Heartbeat at a fraction of the TTL.** Renew at roughly a third of the term, so one missed beat
isn't fatal but a dead holder ages out quickly. Then watch the ratio of *expired* to *released*
leases: a clean system releases; expirations are crash reports, and a rising rate means work is
dying mid-task somewhere.
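
A renewal loop along these lines might look like the following sketch. `run_heartbeat` and `expired_fraction` are hypothetical helpers, and `renew` stands in for whatever call your lock store provides; the sleep function is injectable so the loop can be exercised without waiting. The metric is computed as the expired share of all ended leases rather than a raw expired-to-released ratio, purely to avoid dividing by zero.

```python
import time
from typing import Callable


def run_heartbeat(renew: Callable[[], bool], ttl_s: float,
                  should_stop: Callable[[], bool],
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Renew every ttl_s / 3 until should_stop() returns True.

    Returns False as soon as a renewal is refused, which means the lease
    is lost and the caller must stop the protected work.
    """
    interval_s = ttl_s / 3.0
    while not should_stop():
        if not renew():
            return False
        sleep(interval_s)
    return True


def expired_fraction(expired: int, released: int) -> float:
    """Share of ended leases that expired instead of being released."""
    total = expired + released
    return expired / total if total else 0.0


# Usage: two renewals succeed, the third is refused by the store.
answers = iter([True, True, False])
sleeps = []
still_held = run_heartbeat(lambda: next(answers), ttl_s=30.0,
                           should_stop=lambda: False, sleep=sleeps.append)
```

Building this once as a shared helper keeps the "lease lost" branch in one place, which is the branch per-callsite loops tend to forget.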

**Fence the side effects.** A TTL bounds how long a lock stays stranded; it cannot make mutual
exclusion safe for external writes on its own. A paused or partitioned holder can wake up *believing
it still holds the lease* and finish its write after expiry ([the classic
analysis](ref:kleppmann-locking)). Issue a monotonically increasing token at each acquisition, send
it with everything the lock protects, and have the downstream reject stale tokens. Make the
protected mutations idempotent ([ZFN-19](/zfn/19-annotate-readonly-idempotent-endpoints/)) — when a
lease expires mid-flight, the work will be retried under a new holder and a new token.
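
The token check can be sketched as below. `TokenIssuer` and `FencedStore` are hypothetical stand-ins for the lock service and the protected downstream; in a real system the comparison and the write would need to happen atomically inside the downstream (for example in one transaction), which this single-threaded sketch does not model.

```python
class TokenIssuer:
    """Stand-in for the lock service: a new, larger token per acquisition."""

    def __init__(self) -> None:
        self._last = 0

    def acquire(self) -> int:
        self._last += 1
        return self._last


class FencedStore:
    """Stand-in for a downstream that rejects writes from stale holders."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token older than the newest one seen belongs to a holder
        # whose lease has since been granted to someone else.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


issuer = TokenIssuer()
store = FencedStore()

old = issuer.acquire()   # holder A acquires, then pauses past its TTL
new = issuer.acquire()   # lease expires; holder B acquires

b_ok = store.write(new, "report", "written by B")
a_ok = store.write(old, "report", "written by A")  # A wakes up and tries anyway
```

Holder A still believes it owns the lease, but the downstream has already seen a newer token, so its late write is rejected and B's data stands.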

**Choose the TTL honestly, then bias short.** A short TTL bounds how long a crash stalls the
subject; a long one tolerates GC pauses and slow networks without false expiry. With fencing and
idempotency in place, a false expiry is an efficiency problem, not a correctness problem — so bias
toward short and let renewal do the work. Long-running jobs are many renewals of a short lease,
never one long TTL.
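
To make the trade-off concrete, here is a small worked calculation. `lease_plan` is a hypothetical helper, and the 30-second TTL and 20-minute job are illustrative numbers, not recommendations.

```python
import math


def lease_plan(job_s: float, ttl_s: float, beats_per_ttl: int = 3) -> dict:
    """Renewal cadence and crash exposure for a job held under a short lease."""
    interval_s = ttl_s / beats_per_ttl
    return {
        "renew_interval_s": interval_s,
        "renewals": math.ceil(job_s / interval_s),
        # After a crash the lease lapses at most one TTL after the last renewal.
        "worst_case_stall_s": ttl_s,
    }


plan = lease_plan(job_s=20 * 60, ttl_s=30)
```

Under these assumptions the job costs 120 renewals at 10-second intervals, and a crash stalls the work for at most 30 seconds. Holding the same job under a single 20-minute TTL would need no renewals but could leave a crash stranded for up to 20 minutes.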

**Surface contention; don't bury it.** A failed acquisition and an expired lease are both signals
worth recording — they tell you where work piles up and where holders die. Retry with backoff and a
budget ([ZFN-13](/zfn/13-load-shedding-and-flow-control/)) rather than hammering the lock, and
resist hard-coding one contention policy into the primitive; record the events and let the caller
decide.
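
A sketch of acquisition with backoff, a budget, and recorded events follows. `acquire_with_budget` and its event names are hypothetical; `try_acquire` stands in for the store's non-blocking acquire call, and `on_event` is wherever your metrics or logs go. The retry policy is a parameter of the caller's helper here, not something baked into the lock primitive.

```python
import random
import time
from typing import Callable


def acquire_with_budget(
    try_acquire: Callable[[], bool],
    on_event: Callable[[str, int], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Try to acquire up to max_attempts times, recording every outcome."""
    for attempt in range(1, max_attempts + 1):
        if try_acquire():
            on_event("acquired", attempt)
            return True
        on_event("contended", attempt)
        if attempt < max_attempts:
            # Exponential backoff with full jitter, capped at max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng() * cap)
    on_event("budget_exhausted", max_attempts)
    return False


# Usage with deterministic jitter so the delays are visible.
outcomes = iter([False, False, True])
events, delays = [], []
got_it = acquire_with_budget(
    lambda: next(outcomes),
    on_event=lambda name, attempt: events.append((name, attempt)),
    sleep=delays.append,
    rng=lambda: 1.0,
)
```

Every failed attempt is reported before the caller backs off, so contention shows up in the event stream even when the acquisition eventually succeeds.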

## Consequences

**Easier:**

- Crash recovery is nobody's job — it's the TTL's. Stuck workers can be killed safely and routinely,
  including by automation, because nothing they hold survives them.
- On-call never performs lock surgery; the `DELETE FROM locks` runbook page ceases to exist.
- The released-vs-expired split gives you a free, honest signal of where work dies mid-task.

**Harder:**

- Holders must run renewal loops — real client code with failure handling. Build it once as a
  library; per-callsite heartbeat logic is where the bugs live.
- Fencing means every downstream the lock protects has to carry and check tokens. That's genuine
  plumbing, and it's the part most designs skip — which is why their locks are unsafe under pause
  and partition, TTL or not.
- TTL selection is a judgment call you now have to make explicitly, and renewal traffic isn't free
  at very high lock counts.

**New obligations:**

- The lock API records owner identity *and* owning task, exposes both, and supports inspected,
  recorded force-release.
- Dashboards distinguish clean release from expiry, and someone looks at the ratio.

## References

- [ZFN-13](/zfn/13-load-shedding-and-flow-control/) — backoff, budgets, and not hammering a
  contended resource; the same discipline applies to lock acquisition.
- [ZFN-19](/zfn/19-annotate-readonly-idempotent-endpoints/) — effect-idempotent mutations, which
  lease expiry mid-flight makes mandatory rather than nice-to-have.
- [Gray & Cheriton — Leases (SOSP 1989)](https://dl.acm.org/doi/10.1145/74850.74870) — the original
  statement of time-bounded authority with renewal.
- [Burrows — The Chubby lock service (OSDI 2006)](https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf) —
  leases, keep-alives, and sequence numbers in a production lock service.
- [Kleppmann — How to do distributed locking (2016)](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html) —
  why expiry alone can't protect external side effects, and fencing tokens.

## Changelog

- **2026-06-12**: First published as a Field Note.
