---
id: 18
title: "Enforce a quota at ingress on every endpoint — even unabused ones"
status: current
date: 2026-06-12
authors:
  - "Theo Zourzouvillys"
tags: [reliability, security, api, infra, multi-tenancy]
summary: "Put a quota on every endpoint and enforce it at ingress from day one — per tenant, principal, IP — even for endpoints nobody abuses yet. Unlimited-by-default means the first runaway client or compromised key is an outage. Return 429 + Retry-After; retrofitting limits is painful."
supersedes: null
superseded_by: null
aliases: []
---

## TL;DR

Every endpoint has a **quota**, and it's enforced at the **ingress** (the gateway/edge, before the
request reaches application logic), from day one — **even for endpoints nobody is abusing yet**. No
endpoint is unlimited by default. Limits are keyed by the dimensions that matter — **per tenant, per
principal/API key, per IP**, and a global cap — and rejected requests get a clear, retryable signal
(`429` with `Retry-After`, per [ZFN-13](/zfn/13-load-shedding-and-flow-control/)).

The reason to do this *before* there's a problem: "unlimited" is a promise you didn't mean to make.
The first runaway client, retry storm, infinite loop, or compromised credential turns an unmetered
endpoint into an outage or a bill, and by then clients depend on no limit existing — so adding one
breaks them. A quota that's present from the start is just part of the contract.

## Context

Rate limiting tends to get added *reactively*: an endpoint gets hammered, there's an incident, and a
limit is bolted on afterward. By then you're in the worst position to add it. You don't know a safe
default, because real traffic has been shaped by the absence of any limit. Some client has built a batch job
that fires ten thousand requests in a burst and considers that normal. Adding a limit now is a breaking
change you have to negotiate, announce, and stage — instead of a property the API always had.

And the failure modes a quota guards against don't require malice:

- A buggy client in a tight loop, or a retry storm with no backoff
  ([ZFN-13](/zfn/13-load-shedding-and-flow-control/)), aimed at your cheapest-looking endpoint.
- One tenant's traffic starving everyone else on shared capacity — the noisy-neighbour problem that
  per-tenant limits and bulkheads ([ZFN-2](/zfn/2-engineering-priority-ordering/),
  [ZFN-15](/zfn/15-partition-customer-data-by-tenant/)) exist to contain.
- A leaked API key used to exfiltrate data or rack up cost as fast as the key will allow — a quota caps
  the blast radius of a compromise.
- An expensive endpoint (a report, a search, a fan-out) that's fine at low volume and falls over the
  first time someone scripts it.

None of these announce themselves. The quota is cheap insurance you want already in place when they
arrive, which is why "even if it's not being abused" is the whole point.

## Recommendation

**Make a quota a default property of every endpoint, enforced at the front door.**

- **No endpoint ships unlimited.** Every endpoint has an explicit limit, even if generous. A sane
  default that you tighten later beats a missing limit that you scramble to add during an incident.
- **Enforce at ingress, cheaply, before the work.** Check the limit at the gateway/edge before the
  request reaches expensive application logic — admission control, the same fail-fast-at-the-door move
  as load shedding ([ZFN-13](/zfn/13-load-shedding-and-flow-control/)). One central mechanism, applied
  consistently, not re-implemented per service.
- **Key limits by the right dimensions.** Per **tenant** (fairness and blast radius), per
  **principal/API key** (compromise containment), per **IP** (crude abuse), per **endpoint** (protect
  the expensive ones), plus a **global** ceiling. Distinguish the kinds of limit: **rate** (requests per
  second), **quota** (requests per day/month), and **concurrency** (in-flight at once) — you usually
  want all three.
- **Reject clearly and retryably.** Return `429 Too Many Requests` (or the protocol's equivalent) with
  **`Retry-After`**, so well-behaved clients back off instead of hammering — the coordinated-retry
  contract from [ZFN-13](/zfn/13-load-shedding-and-flow-control/). Surface remaining quota in response
  headers where it helps.
- **Treat limits as per-tenant configuration.** Limits are control-plane config
  ([ZFN-16](/zfn/16-separate-data-plane-control-plane/), [ZFN-17](/zfn/17-separate-config-state-ephemeral/)):
  set per plan/tier, adjustable per tenant, pushed to and enforced at the ingress data plane. This is
  also how you sell tiers and grant a trusted customer more headroom without code changes.
- **Observe usage so you can set and tune limits.** Measure per-tenant/per-endpoint usage from the
  start; it's how you pick non-arbitrary defaults, spot abuse, and right-size limits before they either
  bite real users or fail to protect you.
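
As a minimal sketch of the rate check described above, assuming an in-memory, single-process limiter: a token bucket per dimension (tenant, principal, IP, global), with a request admitted only if every bucket has a token. The names `TokenBucket`, `IngressLimiter`, and `check`, and the shape of the `limits` mapping, are hypothetical and chosen for illustration; a real gateway would typically use its own built-in mechanism or a shared store.

```python
import math
import time


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate, capacity, now):
        self.rate = rate          # sustained requests per second
        self.capacity = capacity  # largest burst admitted at once
        self.tokens = capacity    # start full
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def seconds_until_token(self):
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate


class IngressLimiter:
    """Hypothetical limiter: one bucket per (dimension, value) pair.

    `limits` maps each dimension name to a (rate, capacity) tuple.
    Not thread-safe and not distributed; illustration only.
    """

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.buckets = {}

    def check(self, tenant, principal, ip):
        """Return (admitted, headers); headers carry Retry-After on rejection."""
        now = self.clock()
        keys = {"tenant": tenant, "principal": principal, "ip": ip, "global": "*"}
        buckets = []
        for dimension, value in keys.items():
            key = (dimension, value)
            if key not in self.buckets:
                rate, capacity = self.limits[dimension]
                self.buckets[key] = TokenBucket(rate, capacity, now)
            bucket = self.buckets[key]
            bucket.refill(now)
            buckets.append(bucket)

        # The slowest bucket decides how long the client should wait.
        wait = max(b.seconds_until_token() for b in buckets)
        if wait > 0:
            # Rejected requests consume nothing, so one exhausted dimension
            # does not drain the others.
            return False, {"Retry-After": str(math.ceil(wait))}

        for bucket in buckets:
            bucket.tokens -= 1.0
        return True, {}
```

On rejection the ingress would answer `429 Too Many Requests` with the returned headers; on admission it forwards the request. Checking all buckets before consuming from any is a design choice in this sketch, not something the note prescribes.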

**Scope.** This is ingress quota/rate enforcement for inbound requests. It complements — doesn't replace
— internal concurrency bounds and bounded queues ([ZFN-13](/zfn/13-load-shedding-and-flow-control/)),
which protect the tiers behind the front door.
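
The Recommendation distinguishes rate, quota, and concurrency limits. The other two kinds can be sketched as simple counters, assuming a single process and in-memory state. `DailyQuota` and `ConcurrencyLimit` are hypothetical names, and the fixed UTC-day window is an assumption; a billing-period or rolling window would work the same way.

```python
import math

SECONDS_PER_DAY = 86_400


class DailyQuota:
    """Fixed-window counter: at most `limit` requests per UTC day per key."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}  # (key, day index) -> requests used; old days are never pruned here

    def try_consume(self, key, now_epoch_s):
        """Return (admitted, retry_after_seconds)."""
        day = int(now_epoch_s // SECONDS_PER_DAY)
        used = self.counts.get((key, day), 0)
        if used >= self.limit:
            # The quota resets when the next UTC day begins.
            next_reset = (day + 1) * SECONDS_PER_DAY
            return False, math.ceil(next_reset - now_epoch_s)
        self.counts[(key, day)] = used + 1
        return True, 0


class ConcurrencyLimit:
    """Caps requests in flight at once per key. Not thread-safe."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = {}

    def try_acquire(self, key):
        current = self.in_flight.get(key, 0)
        if current >= self.max_in_flight:
            return False
        self.in_flight[key] = current + 1
        return True

    def release(self, key):
        # Call exactly once for every successful try_acquire, including on errors.
        current = self.in_flight.get(key, 0)
        if current > 0:
            self.in_flight[key] = current - 1
```

A caller would pair `try_acquire` with `release` in a `try`/`finally` around the proxied request, so a failed upstream call still frees its slot.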

## Consequences

**Easier:**

- A runaway client, retry storm, or leaked key is capped automatically instead of becoming an outage or
  a surprise bill — the limit was always there.
- Multi-tenant fairness is enforced: one tenant can't exhaust the shared budget.
- Limits exist from day one, so they're part of the contract clients build against, not a breaking
  change you have to retrofit and negotiate.
- Per-tenant quotas double as a product lever (plans, tiers, trusted-customer headroom).
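
One way the plan/tier lever could look in control-plane config is sketched below: plan defaults plus optional per-tenant overrides, resolved to the effective limits pushed to the ingress. The plan names, numbers, and the `Limits` and `resolve_limits` names are all hypothetical, not taken from any real system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Limits:
    requests_per_second: float
    requests_per_day: int
    max_in_flight: int


# Hypothetical plan defaults; real values should come from observed usage.
PLAN_DEFAULTS = {
    "free": Limits(requests_per_second=5, requests_per_day=10_000, max_in_flight=2),
    "pro": Limits(requests_per_second=50, requests_per_day=1_000_000, max_in_flight=20),
}

# Unknown plans fall back to the tightest tier, so nothing resolves to "unlimited".
FALLBACK = PLAN_DEFAULTS["free"]


def resolve_limits(plan, overrides=None):
    """Return the effective limits for a tenant: plan defaults, then overrides.

    `overrides` is a dict of Limits field names to values. An unknown field
    name raises TypeError, which surfaces config typos early.
    """
    base = PLAN_DEFAULTS.get(plan, FALLBACK)
    return replace(base, **(overrides or {}))
```

Granting a trusted customer more headroom is then a config change such as `resolve_limits("pro", {"requests_per_second": 200})`, with no code deployed.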

**Harder:**

- This takes real infrastructure: a consistent ingress enforcement mechanism, a place to store and push
  per-tenant limits, and the usage accounting behind them (distributed rate limiting has its own
  subtleties, such as keeping counters consistent across ingress nodes).
- Picking defaults takes data and judgment; too tight rejects legitimate use, too loose protects
  nothing. (Start with generous limits plus monitoring, then tighten.)
- Legitimate bursty workloads need accommodation — burst allowances, higher tiers, or a batch path — so
  the limit doesn't punish good use.
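
To make "generous limits plus monitoring, then tighten" concrete, one possible heuristic is sketched below: take a high percentile of observed per-tenant peak usage and multiply by a headroom factor. The heuristic itself, the function name `suggest_limit`, and the default percentile and headroom values are assumptions for illustration, not a rule from this note.

```python
import math


def suggest_limit(per_tenant_peaks, percentile=0.99, headroom=2.0):
    """Suggest a default limit from observed usage.

    `per_tenant_peaks` holds one number per tenant: that tenant's observed
    peak request count in the window being limited (for example, per minute).
    Uses the nearest-rank percentile, then multiplies by `headroom`.
    """
    if not per_tenant_peaks:
        raise ValueError("need at least one observation")
    if not 0 < percentile <= 1:
        raise ValueError("percentile must be in (0, 1]")
    ordered = sorted(per_tenant_peaks)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)
```

A lower percentile deliberately excludes the heaviest tenants from the default; those are the candidates for burst allowances, a higher tier, or a batch path.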

## References

- [ZFN-13](/zfn/13-load-shedding-and-flow-control/) — quotas are admission control at the front door;
  reject with `429` + `Retry-After` and let clients back off.
- [ZFN-2](/zfn/2-engineering-priority-ordering/) and [ZFN-15](/zfn/15-partition-customer-data-by-tenant/)
  — per-tenant limits enforce fairness and contain blast radius.
- [ZFN-16](/zfn/16-separate-data-plane-control-plane/) / [ZFN-17](/zfn/17-separate-config-state-ephemeral/)
  — limits are control-plane config pushed to the enforcing data plane.
- [Amazon Builders' Library — Fairness in multi-tenant systems](https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems/); token-bucket / leaky-bucket rate limiting as the usual primitives.

## Changelog

- **2026-06-12**: First published as a Field Note.
