Field Note 37 · current

Every lock is a lease

By: Theo Zourzouvillys
Published: June 12, 2026
Tags: architecture, infra, reliability, consistency

TL;DR

A lock that can outlive its holder is a deadlock scheduled for later. Treat every lock as a lease:

A TTL on every lock. No infinite holds. Expiry is automatic and server-side — not a cleanup job, not the holder’s good intentions.
A named owner, twice over. Record the identity that may renew it and the task it serves. Those answer different questions, and during an incident you need both.
A heartbeat. Holding is an activity, not a state. Renew at a fraction of the TTL; a missed renewal is a crash report.
Fencing on the side effects. Pair the lease with a monotonically increasing token that everything the lock protects must check, and make the protected work idempotent — because a lease will expire mid-flight eventually.

The bar to hold yourself to: you can kill -9 any holder at any moment, and nobody — human or machine — has to clean anything up.

Context

The locks that bite are rarely the ones from a textbook. They’re the informal ones: a claimed_by column, a status = 'processing' row, a Redis SETNX without a TTL, a “current owner” field on a work item. The happy path works perfectly — acquire, do the work, release — so the design ships. The failure path is the holder dying between acquire and release, and that path strands the lock precisely because release was the holder’s job and the holder is gone.

This is old, settled art. Leases, time-bounded grants that the holder must keep renewing, date to 1989 (Gray & Cheriton, SOSP 1989), and every serious lock service is built on them: Chubby sessions with keep-alives (Burrows, OSDI 2006), etcd leases, Consul sessions, DynamoDB’s lock client. Yet hand-rolled application locks skip expiry routinely, because expiry only matters on the path nobody tested.

Two things are raising the stakes. First, orchestrators kill processes as a matter of course — deploys, autoscaling, OOM kills; “graceful shutdown” is an optimization, not a guarantee. Second, fleets of short-lived LLM agents and sub-agents now claim work, hold it for minutes, and die constantly — crashed, cancelled, timed out, or killed by a supervisor. In an agent system, a stranded claim doesn’t just stall one job; it silently parks a subject nobody else will touch.

Recommendation

If it grants exclusivity, it’s a lock; if it’s a lock, it’s a lease. Apply the rule to the informal locks too — claim columns, assignment rows, “in progress” flags, leader election, cron overlap guards. The moment one party proceeding excludes another, all of the below applies.
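As a minimal sketch of the rule applied to an informal lock, here is a claim column turned into a lease using Python's built-in sqlite3 module. The work_items table, the lease_expires_at column, and the claim function are hypothetical names chosen for illustration. The now parameter is passed in only to keep the example deterministic; in a real deployment the comparison should use the database's own clock so that expiry stays server-side.

```python
import sqlite3


def claim(conn: sqlite3.Connection, item_id: int, worker: str,
          now: float, ttl: float) -> bool:
    """Claim a work item only if it is unclaimed or its lease has expired."""
    cur = conn.execute(
        "UPDATE work_items "
        "SET claimed_by = ?, lease_expires_at = ? "
        "WHERE id = ? AND (claimed_by IS NULL OR lease_expires_at <= ?)",
        (worker, now + ttl, item_id, now),
    )
    conn.commit()
    # Exactly one updated row means this worker now holds the claim.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_items ("
    "id INTEGER PRIMARY KEY, claimed_by TEXT, lease_expires_at REAL)"
)
conn.execute("INSERT INTO work_items (id) VALUES (1)")

first = claim(conn, 1, "worker-a", now=0.0, ttl=30.0)         # succeeds
blocked = claim(conn, 1, "worker-b", now=10.0, ttl=30.0)      # lease still live
after_expiry = claim(conn, 1, "worker-b", now=31.0, ttl=30.0)  # worker-a aged out
```

The single conditional UPDATE does both jobs: it takes a free item and it takes over an expired one, so no separate cleanup pass is needed for a crashed claimant.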

Expire automatically, server-side. The store that grants the lock enforces the TTL. Expiry must not depend on the holder doing anything (it’s dead), on a cleanup job someone remembers to run, or on an operator. Recovery from a crashed holder should be indistinguishable from the passage of time.
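One way to picture server-side expiry is a store that checks its own clock on every access, so a dead holder needs no cooperation and no sweeper. The LeaseStore class below is a hypothetical in-memory stand-in for whatever service grants the lock, not any real product's API; the injectable clock exists only so the example can simulate time passing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """Hypothetical in-memory stand-in for the store that grants locks."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def _live(self, key: str) -> Optional[Lease]:
        # Expiry is decided here, by the store's clock, on every access.
        lease = self._leases.get(key)
        if lease is not None and lease.expires_at <= self._clock():
            del self._leases[key]
            return None
        return lease

    def acquire(self, key: str, holder: str, ttl: float) -> bool:
        if ttl <= 0:
            raise ValueError("every lease needs a finite, positive TTL")
        if self._live(key) is not None:
            return False
        self._leases[key] = Lease(holder, self._clock() + ttl)
        return True

    def release(self, key: str, holder: str) -> bool:
        lease = self._live(key)
        if lease is None or lease.holder != holder:
            return False
        del self._leases[key]
        return True


now = [0.0]  # fake clock so the example does not sleep
store = LeaseStore(clock=lambda: now[0])
got_a = store.acquire("job:42", "worker-a", ttl=30.0)
got_b_early = store.acquire("job:42", "worker-b", ttl=30.0)
now[0] = 31.0  # worker-a was killed and never released
got_b_late = store.acquire("job:42", "worker-b", ttl=30.0)
```

Nothing in the takeover path mentions worker-a: recovery happens because time passed, which is the property the paragraph above asks for.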

Record the owner and the work. Two fields, not one: the identity allowed to renew the lease, and the task or run it serves. The first tells you who can act; the second tells you what was interrupted and lets a supervisor inspect its descendants’ claims and force-release the ones whose work is gone. Keep force-release as an explicit, recorded act — it’s an override, and overrides should leave a trace.
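A sketch of the two-field record, assuming a hypothetical ClaimRegistry that a supervisor can query by task. The field names owner_id and task_id and the shape of the audit log are illustrative assumptions; the point is only that force-release is a distinct call that always leaves a trace.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    key: str
    owner_id: str  # identity allowed to renew the lease
    task_id: str   # task or run the lease serves


@dataclass
class ClaimRegistry:
    claims: Dict[str, Claim] = field(default_factory=dict)
    # Each entry: (key, previous owner, operator, reason).
    audit_log: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def record(self, claim: Claim) -> None:
        self.claims[claim.key] = claim

    def claims_for_task(self, task_id: str) -> List[Claim]:
        """What a supervisor inspects when a task or run has gone away."""
        return [c for c in self.claims.values() if c.task_id == task_id]

    def force_release(self, key: str, operator: str, reason: str) -> bool:
        claim = self.claims.pop(key, None)
        if claim is None:
            return False
        # An override always leaves a trace.
        self.audit_log.append((key, claim.owner_id, operator, reason))
        return True


registry = ClaimRegistry()
registry.record(Claim("doc:7", owner_id="agent-31", task_id="run-9"))
registry.record(Claim("doc:8", owner_id="agent-32", task_id="run-9"))
registry.record(Claim("doc:9", owner_id="agent-40", task_id="run-10"))

# run-9 was cancelled: release everything its agents still hold.
for claim in registry.claims_for_task("run-9"):
    registry.force_release(claim.key, operator="supervisor-1",
                           reason="run-9 cancelled")
```

Looking up by task_id is what makes the second field worth storing: the supervisor does not need to know which agents existed, only which run is gone.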

Heartbeat at a fraction of the TTL. Renew at roughly a third of the term, so one missed beat isn’t fatal but a dead holder ages out quickly. Then watch the ratio of expired to released leases: a clean system releases; expirations are crash reports, and a rising rate means work is dying mid-task somewhere.
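The renewal cadence and the expired-to-released signal can be sketched as follows. The renew callable is a hypothetical hook into the lock store, assumed to return False only when the store reports that the lease is no longer held; handling of transient network errors is left out to keep the sketch short.

```python
import threading
from typing import Callable, List


def renew_interval(ttl: float, divisor: int = 3) -> float:
    """Renew at roughly a third of the term."""
    return ttl / divisor


def heartbeat(renew: Callable[[], bool], ttl: float,
              stop: threading.Event) -> bool:
    """Renew until stopped.

    Returns True on a clean stop, False if the lease was lost, in which
    case the caller must stop the protected work.
    """
    interval = renew_interval(ttl)
    # Event.wait returns True once stop is set, False on timeout.
    while not stop.wait(interval):
        if not renew():
            return False
    return True


def expiry_ratio(released: int, expired: int) -> float:
    """Share of leases that ended by expiry rather than clean release."""
    total = released + expired
    return expired / total if total else 0.0


# Simulated store: two renewals succeed, the third reports the lease gone.
calls: List[int] = []


def fake_renew() -> bool:
    calls.append(1)
    return len(calls) < 3


kept = heartbeat(fake_renew, ttl=0.03, stop=threading.Event())
```

In a real holder the heartbeat would typically run on its own thread or task, with the stop event set just before a clean release.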

Fence the side effects. A TTL bounds how long a lock stays stranded; it cannot make mutual exclusion safe for external writes on its own. A paused or partitioned holder can wake up believing it still holds the lease and finish its write after expiry (the classic analysis is Kleppmann’s How to do distributed locking, 2016). Issue a monotonically increasing token at each acquisition, send it with everything the lock protects, and have the downstream reject stale tokens. Make the protected mutations idempotent (ZFN-19): when a lease expires mid-flight, the work will be retried under a new holder and a new token.
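A minimal sketch of fencing, assuming a hypothetical TokenIssuer on the lock side and a FencedResource downstream. The class and method names are illustrative; in practice the check lives inside whatever storage or service the lock protects.

```python
import itertools
from typing import Dict


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class TokenIssuer:
    """Hands out a strictly increasing token at each acquisition."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def next_token(self) -> int:
        return next(self._counter)


class FencedResource:
    """Hypothetical downstream that checks the token before every write."""

    def __init__(self) -> None:
        self._highest_token = 0
        self.data: Dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}")
        # An equal token is the current holder writing again.
        self._highest_token = token
        self.data[key] = value


issuer = TokenIssuer()
resource = FencedResource()

token_a = issuer.next_token()  # holder A acquires, then stalls in a long pause
token_b = issuer.next_token()  # A's lease expires; holder B acquires
resource.write(token_b, "report", "written by B")

try:
    resource.write(token_a, "report", "late write by A")
    stale_write_rejected = False
except StaleTokenError:
    stale_write_rejected = True
```

The rejection happens at the resource, not at the lock service, which is why it still works when holder A cannot be reached or does not know its lease is gone.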

Choose the TTL honestly, then bias short. A short TTL bounds how long a crash stalls the subject; a long one tolerates GC pauses and slow networks without false expiry. With fencing and idempotency in place, a false expiry is an efficiency problem, not a correctness problem — so bias toward short and let renewal do the work. Long-running jobs are many renewals of a short lease, never one long TTL.
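The trade-off can be put in rough numbers. The figures below (a two-hour job, a 30-second TTL, 10,000 concurrent leases, renewal at a third of the term) are hypothetical and chosen only to make the arithmetic concrete.

```python
import math


def renewals_needed(job_seconds: float, ttl: float, divisor: int = 3) -> int:
    """Approximate renewals a job sends when renewing every ttl/divisor seconds."""
    return math.ceil(job_seconds / (ttl / divisor))


def fleet_renewal_rate(lock_count: int, ttl: float, divisor: int = 3) -> float:
    """Approximate renewals per second across all live leases."""
    return lock_count / (ttl / divisor)


job_seconds = 2 * 60 * 60  # hypothetical two-hour job
ttl = 30.0                 # hypothetical short lease

# Short lease: a crash stalls the subject for at most about one TTL.
worst_stall_short = ttl
# One long TTL sized to the job: a crash right after acquire stalls it
# for nearly the whole term.
worst_stall_long = float(job_seconds)

renewals = renewals_needed(job_seconds, ttl)
rate = fleet_renewal_rate(10_000, ttl)
```

Under these assumptions the short lease costs about 720 renewals for the job and about 1,000 renewals per second across the fleet, in exchange for cutting the worst-case stall from two hours to 30 seconds.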

Surface contention; don’t bury it. A failed acquisition and an expired lease are both signals worth recording: they tell you where work piles up and where holders die. Retry with backoff and a budget (ZFN-13) rather than hammering the lock, and resist hard-coding one contention policy into the primitive; record the events and let the caller decide.
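A sketch of acquisition with backoff, jitter, and a time budget, where every outcome is reported as an event and the final decision stays with the caller. The function name, the event strings, and the injectable sleep, clock, and rng parameters are assumptions made for illustration and testability.

```python
import random
import time
from typing import Callable, List


def acquire_with_budget(
    try_acquire: Callable[[], bool],
    budget_seconds: float,
    on_event: Callable[[str], None],
    base_delay: float = 0.05,
    max_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Try to acquire until it succeeds or the time budget runs out."""
    deadline = clock() + budget_seconds
    delay = base_delay
    while True:
        if try_acquire():
            on_event("acquired")
            return True
        on_event("contended")
        pause = rng() * delay  # full jitter: uniform in [0, delay)
        if clock() + pause > deadline:
            on_event("budget_exhausted")
            return False  # no policy baked in: the caller decides what next
        sleep(pause)
        delay = min(delay * 2, max_delay)


# Simulated run: the lock frees up on the third attempt.
now = [0.0]
events: List[str] = []
attempts: List[int] = []


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


def fake_try_acquire() -> bool:
    attempts.append(1)
    return len(attempts) >= 3


acquired = acquire_with_budget(
    fake_try_acquire, budget_seconds=10.0, on_event=events.append,
    sleep=fake_sleep, clock=lambda: now[0],
)
```

The events list is the part worth wiring to metrics: counts of "contended" show where work piles up, independent of whether any individual caller eventually got the lock.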

Consequences

Easier:

Crash recovery is nobody’s job — it’s the TTL’s. Stuck workers can be killed safely and routinely, including by automation, because nothing they hold survives them.
On-call never performs lock surgery; the DELETE FROM locks runbook page ceases to exist.
The released-vs-expired split gives you a free, honest signal of where work dies mid-task.

Harder:

Holders must run renewal loops — real client code with failure handling. Build it once as a library; per-callsite heartbeat logic is where the bugs live.
Fencing means every downstream the lock protects has to carry and check tokens. That’s genuine plumbing, and it’s the part most designs skip — which is why their locks are unsafe under pause and partition, TTL or not.
TTL selection is a judgment call you now have to make explicitly, and renewal traffic isn’t free at very high lock counts.

New obligations:

The lock API records owner identity and owning task, exposes both, and supports inspected, recorded force-release.
Dashboards distinguish clean release from expiry, and someone looks at the ratio.

References

ZFN-13, Fail fast and push back: retries, load shedding, and flow control. Backoff, budgets, and not hammering a contended resource; the same discipline applies to lock acquisition.
ZFN-19, Annotate read-only and idempotent endpoints; make every mutation idempotent. Effect-idempotent mutations, which lease expiry mid-flight makes mandatory rather than nice-to-have.
Gray & Cheriton, Leases (SOSP 1989). The original statement of time-bounded authority with renewal.
Burrows, The Chubby lock service (OSDI 2006). Leases, keep-alives, and sequence numbers in a production lock service.
Kleppmann, How to do distributed locking (2016). Why expiry alone can’t protect external side effects, and fencing tokens.

Changelog

2026-06-12: First published as a Field Note.