Theo Zourzouvillys

Field Note 18 (current)

Enforce a quota at ingress on every endpoint — even unabused ones

By Theo Zourzouvillys
Tags: reliability, security, api, infra, multi-tenancy

TL;DR

Every endpoint has a quota, and it’s enforced at the ingress (the gateway/edge, before the request reaches application logic), from day one — even for endpoints nobody is abusing yet. No endpoint is unlimited by default. Limits are keyed by the dimensions that matter — per tenant, per principal/API key, per IP, and a global cap — and rejected requests get a clear, retryable signal (429 with Retry-After, per ZFN-13).

The reason to do this before there’s a problem: “unlimited” is a promise you didn’t mean to make. The first runaway client, retry storm, infinite loop, or compromised credential turns an unmetered endpoint into an outage or a bill, and by then clients depend on no limit existing — so adding one breaks them. A quota that’s present from the start is just part of the contract.

Context

Rate limiting tends to get added reactively: an endpoint gets hammered, there’s an incident, and a limit is bolted on afterward. By then you’re in the worst position to add it. You don’t know a safe default, because real traffic has been shaped by the absence of one. Some client has built a batch job that fires ten thousand requests in a burst and considers that normal. Adding a limit now is a breaking change you have to negotiate, announce, and stage — instead of a property the API always had.

And the failure modes a quota guards against don’t require malice:

  • A buggy client in a tight loop, or a retry storm with no backoff (ZFN-13), aimed at your cheapest-looking endpoint.
  • One tenant’s traffic starving everyone else on shared capacity — the noisy-neighbour problem that per-tenant limits and bulkheads (ZFN-2, ZFN-15) exist to contain.
  • A leaked API key used to exfiltrate data or rack up cost as fast as the key will allow — a quota caps the blast radius of a compromise.
  • An expensive endpoint (a report, a search, a fan-out) that’s fine at low volume and falls over the first time someone scripts it.

None of these announce themselves. The quota is cheap insurance you want already in place when they arrive, which is why “even if it’s not being abused” is the whole point.

Recommendation

Make a quota a default property of every endpoint, enforced at the front door.

  • No endpoint ships unlimited. Every endpoint has an explicit limit, even if generous. A sane default that you tighten later beats a missing limit that you scramble to add during an incident.
  • Enforce at ingress, cheaply, before the work. Check the limit at the gateway/edge before the request reaches expensive application logic — admission control, the same fail-fast-at-the-door move as load shedding (ZFN-13). One central mechanism, applied consistently, not re-implemented per service.
  • Key limits by the right dimensions. Per tenant (fairness and blast radius), per principal/API key (compromise containment), per IP (crude abuse), per endpoint (protect the expensive ones), plus a global ceiling. Distinguish the kinds of limit: rate (requests per second), quota (requests per day/month), and concurrency (in-flight at once) — you usually want all three.
  • Reject clearly and retryably. Return 429 Too Many Requests (or the protocol’s equivalent) with Retry-After, so well-behaved clients back off instead of hammering — the coordinated-retry contract from ZFN-13. Surface remaining quota in response headers where it helps.
  • Treat limits as per-tenant configuration. Limits are control-plane config (ZFN-16, ZFN-17): set per plan/tier, adjustable per tenant, pushed to and enforced at the ingress data plane. This is also how you sell tiers and grant a trusted customer more headroom without code changes.
  • Observe usage so you can set and tune limits. Measure per-tenant/per-endpoint usage from the start; it’s how you pick non-arbitrary defaults, spot abuse, and right-size limits before they either bite real users or fail to protect you.
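
To make the admission check concrete, here is a minimal sketch of an ingress limiter that keys token buckets by dimension and rejects with 429 plus Retry-After. It is an illustration under assumptions, not a reference implementation: the names TokenBucket, IngressLimiter, and admit are hypothetical, the buckets live in one process's memory (a real gateway fleet would need shared or approximated state), and only the rate kind of limit is shown.

```python
import math
import time


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # a new bucket starts full
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def wait_for_token(self):
        """Seconds until one token is available (0.0 if one is available now)."""
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate


class IngressLimiter:
    """Hypothetical in-memory limiter: one bucket per (dimension, key) pair."""

    def __init__(self, limits, clock=time.monotonic):
        # limits maps a dimension name to (tokens per second, burst capacity).
        self.limits = limits
        self.clock = clock
        self.buckets = {}

    def check(self, keys):
        """keys maps dimension -> key, e.g. {"tenant": "acme", "global": "*"}.

        Returns (allowed, retry_after_seconds). A token is taken from every
        bucket only if all of them can admit the request.
        """
        now = self.clock()
        touched = []
        for dim, key in keys.items():
            rate, capacity = self.limits[dim]
            bucket = self.buckets.get((dim, key))
            if bucket is None:
                bucket = self.buckets[(dim, key)] = TokenBucket(rate, capacity, now)
            bucket.refill(now)
            touched.append(bucket)

        wait = max(b.wait_for_token() for b in touched)
        if wait > 0:
            # Retry-After takes whole seconds, so round up.
            return False, math.ceil(wait)
        for b in touched:
            b.tokens -= 1.0
        return True, 0


def admit(limiter, keys):
    """Return None to pass the request on, or (status, headers) to reject it."""
    allowed, retry_after = limiter.check(keys)
    if allowed:
        return None
    return 429, {"Retry-After": str(retry_after)}
```

A gateway would call admit with the keys it extracted from the request (tenant, API key, client IP, endpoint, and a constant key for the global ceiling) before forwarding anything to the application. Checking all dimensions before taking any token means a request rejected by the tenant limit does not also burn global budget.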

Scope. This is ingress quota/rate enforcement for inbound requests. It complements — doesn’t replace — internal concurrency bounds and bounded queues (ZFN-13), which protect the tiers behind the front door.

Consequences

Easier:

  • A runaway client, retry storm, or leaked key is capped automatically instead of becoming an outage or a surprise bill — the limit was always there.
  • Multi-tenant fairness is enforced: one tenant can’t consume the shared budget.
  • Limits exist from day one, so they’re part of the contract clients build against, not a breaking change you have to retrofit and negotiate.
  • Per-tenant quotas double as a product lever (plans, tiers, trusted-customer headroom).
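
As a sketch of how limits-as-configuration can serve as that product lever: plan defaults plus per-tenant overrides, resolved into the limits the ingress enforces. Everything named here (Limits, PLAN_DEFAULTS, TENANT_OVERRIDES, effective_limits, the plan names, and the numbers) is hypothetical and only illustrates the shape of the data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Limits:
    requests_per_second: float  # rate
    requests_per_day: int       # quota
    max_in_flight: int          # concurrency


# Hypothetical plan tiers; in practice this would live in the control plane.
PLAN_DEFAULTS = {
    "free": Limits(requests_per_second=5, requests_per_day=10_000, max_in_flight=2),
    "pro": Limits(requests_per_second=50, requests_per_day=1_000_000, max_in_flight=20),
}

# Hypothetical per-tenant adjustments, e.g. headroom for a trusted customer.
TENANT_OVERRIDES = {
    "acme": {"requests_per_second": 200},
}


def effective_limits(tenant_id, plan):
    """Plan defaults with any tenant overrides applied on top.

    An unknown plan falls back to the most restrictive tier rather than to
    "no limit", so nothing ends up unlimited by accident.
    """
    base = PLAN_DEFAULTS.get(plan, PLAN_DEFAULTS["free"])
    return replace(base, **TENANT_OVERRIDES.get(tenant_id, {}))
```

Granting a customer more headroom is then an edit to the override data, pushed to the ingress, with no code change.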

Harder:

  • It takes real infrastructure: a consistent ingress enforcement mechanism, a place to store and push per-tenant limits, and the usage accounting behind them. Distributed rate limiting has its own subtleties, since counters shared across gateway instances trade accuracy against latency.
  • Picking defaults takes data and judgment; too tight rejects legitimate use, too loose protects nothing. (Start with generous limits plus monitoring, then tighten.)
  • Legitimate bursty workloads need accommodation — burst allowances, higher tiers, or a batch path — so the limit doesn’t punish good use.
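
A burst allowance can be reasoned about with simple arithmetic, assuming the limit is a token bucket (an assumption for illustration; this note does not prescribe an algorithm). With a refill rate and a burst capacity, the two hypothetical helpers below give the most a client can get through in a window and how long a batch of a given size takes.

```python
import math


def max_admitted(rate, burst, seconds):
    """Upper bound on requests admitted in any window of `seconds`,
    for a bucket refilling at `rate` per second with capacity `burst`."""
    return math.floor(burst + rate * seconds)


def seconds_to_admit(n, rate, burst):
    """Minimum time to get `n` requests through, starting from a full bucket."""
    return max(0.0, (n - burst) / rate)


# The ten-thousand-request batch job from the Context section, against an
# illustrative limit of 100 requests/second with a burst capacity of 2,000:
print(seconds_to_admit(10_000, rate=100, burst=2_000))  # 80.0
print(max_admitted(rate=100, burst=2_000, seconds=60))  # 8000
```

Numbers like these help decide whether a workload fits a larger burst allowance, needs a higher tier, or belongs on a batch path.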

Changelog

  • 2026-06-12: First published as a Field Note.