Theo Zourzouvillys

Field Note 37 (current)

Every lock is a lease

By Theo Zourzouvillys
Tags: architecture, infra, reliability, consistency

TL;DR

A lock that can outlive its holder is a deadlock scheduled for later. Treat every lock as a lease:

  • A TTL on every lock. No infinite holds. Expiry is automatic and server-side — not a cleanup job, not the holder’s good intentions.
  • A named owner, twice over. Record the identity that may renew it and the task it serves. Those answer different questions, and during an incident you need both.
  • A heartbeat. Holding is an activity, not a state. Renew at a fraction of the TTL; a missed renewal is a crash report.
  • Fencing on the side effects. Pair the lease with a monotonically increasing token that everything the lock protects must check, and make the protected work idempotent — because a lease will expire mid-flight eventually.

The bar to hold yourself to: you can kill -9 any holder at any moment, and nobody — human or machine — has to clean anything up.

Context

The locks that bite are rarely the ones from a textbook. They’re the informal ones: a claimed_by column, a status = 'processing' row, a Redis SETNX without a TTL, a “current owner” field on a work item. The happy path works perfectly — acquire, do the work, release — so the design ships. The failure path is the holder dying between acquire and release, and that path strands the lock precisely because release was the holder’s job and the holder is gone.

This is old, settled art. Leases, time-bounded grants that the holder must keep renewing, date to 1989 (Gray & Cheriton, “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency”, SOSP 1989; dl.acm.org), and every serious lock service is built on them: Chubby sessions with keep-alives (Burrows, “The Chubby lock service for loosely-coupled distributed systems”, OSDI 2006; static.googleusercontent.com), etcd leases, Consul sessions, DynamoDB’s lock client. Yet hand-rolled application locks routinely skip expiry, because expiry only matters on the path nobody tested.

Two things are raising the stakes. First, orchestrators kill processes as a matter of course — deploys, autoscaling, OOM kills; “graceful shutdown” is an optimization, not a guarantee. Second, fleets of short-lived LLM agents and sub-agents now claim work, hold it for minutes, and die constantly — crashed, cancelled, timed out, or killed by a supervisor. In an agent system, a stranded claim doesn’t just stall one job; it silently parks a subject nobody else will touch.

Recommendation

If it grants exclusivity, it’s a lock; if it’s a lock, it’s a lease. Apply the rule to the informal locks too — claim columns, assignment rows, “in progress” flags, leader election, cron overlap guards. The moment one party proceeding excludes another, all of the below applies.

Expire automatically, server-side. The store that grants the lock enforces the TTL. Expiry must not depend on the holder doing anything (it’s dead), on a cleanup job someone remembers to run, or on an operator. Recovery from a crashed holder should be indistinguishable from the passage of time.
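
A minimal sketch of store-enforced expiry, assuming a hypothetical in-memory LeaseStore standing in for whatever store grants the lock (the class, its methods, and the injectable clock are illustrative, not any real service’s API). The point it shows: expiry is decided by the store, against the store’s clock, at the moment someone else asks. Nothing sweeps, and the dead holder does nothing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """Hypothetical in-memory stand-in for the store that grants locks."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # only the store's clock decides expiry
        self._leases: Dict[str, Lease] = {}

    def acquire(self, key: str, holder: str, ttl: float) -> bool:
        if ttl <= 0:
            raise ValueError("every lock needs a finite, positive TTL")
        now = self._clock()
        current = self._leases.get(key)
        # An expired lease is treated as absent: no cleanup job, no operator.
        if current is not None and current.expires_at > now:
            return False
        self._leases[key] = Lease(holder, now + ttl)
        return True

    def release(self, key: str, holder: str) -> bool:
        current = self._leases.get(key)
        # Only the live holder can release; a stale holder's release is a no-op.
        if (current is None or current.holder != holder
                or current.expires_at <= self._clock()):
            return False
        del self._leases[key]
        return True
```

With a fake clock, a holder that takes a 30-second lease and is then killed blocks a second holder until the 30 seconds pass, after which the second acquire simply succeeds. The kill itself needs no handling anywhere.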

Record the owner and the work. Two fields, not one: the identity allowed to renew the lease, and the task or run it serves. The first tells you who can act; the second tells you what was interrupted and lets a supervisor inspect its descendants’ claims and force-release the ones whose work is gone. Keep force-release as an explicit, recorded act — it’s an override, and overrides should leave a trace.
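
One way the two fields and the recorded override might look, as a sketch. ClaimRegistry, Claim, ForceRelease and every field name are hypothetical, and the TTL is deliberately left out here to keep the focus on ownership; a real lease would carry one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Claim:
    key: str
    owner_id: str  # identity allowed to renew: who can act
    task_id: str   # task or run the lease serves: what was interrupted


@dataclass(frozen=True)
class ForceRelease:
    key: str
    owner_id: str
    task_id: str
    released_by: str
    reason: str


@dataclass
class ClaimRegistry:
    claims: Dict[str, Claim] = field(default_factory=dict)
    audit_log: List[ForceRelease] = field(default_factory=list)

    def claim(self, key: str, owner_id: str, task_id: str) -> bool:
        if key in self.claims:
            return False
        self.claims[key] = Claim(key, owner_id, task_id)
        return True

    def may_renew(self, key: str, owner_id: str) -> bool:
        current = self.claims.get(key)
        return current is not None and current.owner_id == owner_id

    def claims_for_task(self, task_id: str) -> List[Claim]:
        """What a supervisor inspects when a task's work is gone."""
        return [c for c in self.claims.values() if c.task_id == task_id]

    def force_release(self, key: str, released_by: str,
                      reason: str) -> Optional[ForceRelease]:
        current = self.claims.pop(key, None)
        if current is None:
            return None
        # The override always leaves a trace: who, what, and why.
        record = ForceRelease(key, current.owner_id, current.task_id,
                              released_by, reason)
        self.audit_log.append(record)
        return record
```

A supervisor that cancels a run can list that run’s claims by task_id and release each one, and the audit log afterwards shows which owner lost which claim and on whose authority.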

Heartbeat at a fraction of the TTL. Renew at roughly a third of the term, so one missed beat isn’t fatal but a dead holder ages out quickly. Then watch the ratio of expired to released leases: a clean system releases; expirations are crash reports, and a rising rate means work is dying mid-task somewhere.
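
The arithmetic and the loop, sketched under assumptions. All names below are hypothetical. The renew callable is assumed to return False only when the store reports the lease gone; transient network failures would need their own retry handling, which is not shown. The expired share is one possible way to express the expired-versus-released ratio.

```python
import threading
from dataclasses import dataclass
from typing import Callable


def renew_interval(ttl: float, beats_per_term: int = 3) -> float:
    """Seconds between renewals: the TTL split into equal beats."""
    return ttl / beats_per_term


def survives_missed_beats(ttl: float, interval: float, missed: int) -> bool:
    """True if the lease outlives `missed` consecutive failed renewals.

    After the last good renewal the lease has `ttl` seconds left, and the
    next good renewal lands (missed + 1) intervals later.
    """
    return (missed + 1) * interval < ttl


def heartbeat(renew: Callable[[], bool], ttl: float,
              stop: threading.Event) -> str:
    """Renew until told to stop or until the store says the lease is gone."""
    interval = renew_interval(ttl)
    while not stop.wait(interval):  # wait() returns True once stop is set
        if not renew():
            return "lost"  # caller should stop the protected work now
    return "stopped"


@dataclass
class LeaseOutcomes:
    released: int = 0
    expired: int = 0

    def expired_share(self) -> float:
        """Fraction of ended leases that expired instead of being released."""
        ended = self.released + self.expired
        return self.expired / ended if ended else 0.0
```

With a 30-second TTL the beat is every 10 seconds: one missed beat leaves the next renewal at 20 seconds, inside the term, while two in a row push it to 30 and the lease is lost.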

Fence the side effects. A TTL bounds how long a lock stays stranded; it cannot make mutual exclusion safe for external writes on its own. A paused or partitioned holder can wake up believing it still holds the lease and finish its write after expiry (the classic analysis is Martin Kleppmann’s “How to do distributed locking”, 2016; martin.kleppmann.com). Issue a monotonically increasing token at each acquisition, send it with everything the lock protects, and have the downstream reject stale tokens. Make the protected mutations idempotent (ZFN-19): when a lease expires mid-flight, the work will be retried under a new holder and a new token.
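
A sketch of the downstream half, which is the half that usually gets skipped. TokenIssuer and FencedStore are hypothetical names; in practice the token would come from the lock service at acquisition and the check would live inside whatever system the lock protects. Writes here are keyed sets, so replaying one under the same token changes nothing.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict


class StaleToken(Exception):
    """Raised when a write carries a token older than one already seen."""


class TokenIssuer:
    """Hands out a strictly increasing token per acquisition."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def issue(self) -> int:
        return next(self._counter)


@dataclass
class FencedStore:
    """Downstream resource that remembers the highest token it has accepted."""

    highest_token: int = 0
    data: Dict[str, str] = field(default_factory=dict)

    def write(self, token: int, key: str, value: str) -> None:
        # A holder that woke up after its lease expired carries an old token.
        if token < self.highest_token:
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value  # keyed set: safe to repeat under retry
```

If holder A acquires with token 1, pauses, and holder B acquires with token 2 and writes, then A’s late write is rejected by the store itself, no matter what A believes about its lease.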

Choose the TTL honestly, then bias short. A short TTL bounds how long a crash stalls the subject; a long one tolerates GC pauses and slow networks without false expiry. With fencing and idempotency in place, a false expiry is an efficiency problem, not a correctness problem — so bias toward short and let renewal do the work. Long-running jobs are many renewals of a short lease, never one long TTL.

Surface contention; don’t bury it. A failed acquisition and an expired lease are both signals worth recording — they tell you where work piles up and where holders die. Retry with backoff and a budget (ZFN-13) rather than hammering the lock, and resist hard-coding one contention policy into the primitive; record the events and let the caller decide.
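
A sketch of acquisition with backoff and a budget, assuming a caller-supplied try_acquire callable. Every name and default below is illustrative. Contention is reported through a callback instead of being swallowed, and the policy (how long to wait, what to do on failure) stays with the caller.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AcquireResult:
    acquired: bool
    attempts: int
    waited: float  # total seconds spent backing off


def acquire_with_budget(
    try_acquire: Callable[[], bool],
    budget_seconds: float,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
    on_contention: Optional[Callable[[int], None]] = None,
) -> AcquireResult:
    waited = 0.0
    attempt = 0
    while True:
        attempt += 1
        if try_acquire():
            return AcquireResult(True, attempt, waited)
        if on_contention is not None:
            on_contention(attempt)  # record the event; caller owns the policy
        # Exponential backoff, capped, with jitter in the upper half of the range.
        cap = min(max_delay, base_delay * 2 ** (attempt - 1))
        delay = cap * (0.5 + rng() / 2)
        if waited + delay > budget_seconds:
            return AcquireResult(False, attempt, waited)  # budget spent
        sleep(delay)
        waited += delay
```

Giving up is a normal return value here, not an exception, so the caller can queue the work, skip it, or escalate, and the recorded attempts show where work piles up.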

Consequences

Easier:

  • Crash recovery is nobody’s job — it’s the TTL’s. Stuck workers can be killed safely and routinely, including by automation, because nothing they hold survives them.
  • On-call never performs lock surgery; the DELETE FROM locks runbook page ceases to exist.
  • The released-vs-expired split gives you a free, honest signal of where work dies mid-task.

Harder:

  • Holders must run renewal loops — real client code with failure handling. Build it once as a library; per-callsite heartbeat logic is where the bugs live.
  • Fencing means every downstream the lock protects has to carry and check tokens. That’s genuine plumbing, and it’s the part most designs skip — which is why their locks are unsafe under pause and partition, TTL or not.
  • TTL selection is a judgment call you now have to make explicitly, and renewal traffic isn’t free at very high lock counts.

New obligations:

  • The lock API records owner identity and owning task, exposes both, and supports inspected, recorded force-release.
  • Dashboards distinguish clean release from expiry, and someone looks at the ratio.

References

  • Gray & Cheriton, “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency” (SOSP 1989), dl.acm.org. The original lease paper: a lease is a time-bounded grant of authority that the holder must actively renew, so a crashed or partitioned holder loses its rights automatically when the term expires; recovery becomes the passage of time rather than an intervention.
  • Burrows, “The Chubby lock service for loosely-coupled distributed systems” (OSDI 2006), static.googleusercontent.com. An account of Google’s production lock service. Chubby locks ride on sessions with leases and keep-alives, so clients that stop renewing lose their locks, and lock sequence numbers (fencing) protect resources from actions by stale holders.
  • Kleppmann, “How to do distributed locking” (2016), martin.kleppmann.com. Analyses why a timed lock alone cannot guarantee mutual exclusion for external side effects (GC pauses, clock skew, in-flight requests) and argues for fencing tokens: a monotonically increasing number issued at each acquisition that downstream systems check, rejecting writes from stale holders.

Changelog

  • 2026-06-12: First published as a Field Note.