Field Note 16 · current

Separate the data plane from the control plane

By: Theo Zourzouvillys
Published: June 12, 2026
Tags: architecture, infra, reliability, scalability

TL;DR

Draw a hard line between the data plane — the high-volume, latency-sensitive path that does the actual work (serving requests, moving data, forwarding traffic) — and the control plane — the management and orchestration path that configures, schedules, provisions, and decides policy. They have different reliability requirements, different scaling characteristics, and different change rates, and the single most important property of the split is this: the data plane must keep working on last-known-good state when the control plane is unavailable. Never make a synchronous call to the control plane on the hot path. The control plane owns desired state and pushes it down; the data plane caches what it needs and fails static — it continues with the configuration it already has rather than failing because it couldn’t reach the controller.

If you couple them — the serving path calling the management service per request, or sharing its fate — then every control-plane bug, deploy, or overload becomes a serving outage. Keep the thing that has to be rock-solid free of the thing that changes constantly.

Context

The data plane and the control plane look like one system early on, so they get built as one: the same service that serves traffic also reads its own dynamic configuration live, calls the scheduler in-line, or looks up policy from the management database on every request. It works at low scale, and then the coupling turns into the dominant failure mode:

They have opposite reliability profiles. The data plane must be simple, fast, and almost always up. The control plane is where the complexity lives — orchestration logic, expensive decisions, rich dependencies — so it’s where the bugs and the frequent deploys are. Couple them and you force the reliable thing to inherit the unreliable thing’s failure rate.
They scale on different axes. Data-plane load scales with traffic; control-plane load scales with the number of resources and the rate of change. A burst of config changes or a reconciliation storm shouldn’t be able to starve request serving, and vice versa.
A control-plane outage shouldn’t be a data-plane outage. If serving requires the management service to answer on every request, then the moment the control plane is impaired — a bad deploy, an overloaded API server, a dependency outage — serving stops, even though nothing was wrong with the data path itself.
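The first point can be made concrete with a little arithmetic. If every request needs both planes to answer, and their failures are independent, the availabilities multiply, so serving can never be better than the control plane. A minimal sketch; the function names and the availability figures are illustrative assumptions, not measurements:

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to answer.

    Assumes the components fail independently of each other.
    """
    return prod(components)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical figures: a simple data plane and a busier, more complex control plane.
DATA_PLANE = 0.9999
CONTROL_PLANE = 0.999

decoupled = serial_availability(DATA_PLANE)                # serving needs only the data plane
coupled = serial_availability(DATA_PLANE, CONTROL_PLANE)   # serving calls the controller per request

print(f"decoupled: {downtime_minutes_per_year(decoupled):.0f} min/year of serving downtime")
print(f"coupled:   {downtime_minutes_per_year(coupled):.0f} min/year of serving downtime")
```

With these assumed numbers, coupling moves serving from roughly 53 to roughly 578 minutes of expected downtime per year, almost all of it inherited from the control plane.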

This is the same independence principle as ZFN-4 (Incident tooling must not depend on what it recovers): don’t put a hard dependency on the hot path to something that can be down. The well-run systems you rely on are built this way: a load balancer keeps forwarding on its last config if its controller dies; nodes keep running pods when the cluster control plane is unreachable; a service-mesh proxy keeps proxying on its last-pushed config when the control plane can’t be reached. In each case a control-plane outage degrades management (you can’t make changes), not serving.

Recommendation

Architect the two planes as separate systems with a one-way, asynchronous dependency: control plane → data plane, never the reverse on the hot path.

Name the split explicitly. Decide which components are data plane (serving, forwarding, processing) and which are control plane (config, scheduling, provisioning, policy, metadata, coordination), and keep the responsibilities from bleeding across.
Push config down; cache it; fail static. The control plane is the source of truth for desired state and pushes (or the data plane pulls and caches) the configuration the data plane needs. The data plane runs on that local copy and, when the control plane is unreachable, keeps operating on last-known-good — it does not fail because it couldn’t refresh. This is static stability: the system holds its current state through control-plane impairment.
No synchronous control-plane call on the hot path. Per-request lookups to the management service, scheduler, or config database are the coupling to eliminate. Resolve identity, policy, and routing from cached/pushed state. (ZFN-5’s “verify against a cached trust root, no per-request key-distribution endpoint” is exactly this move for auth.)
Minimal dependencies on the data plane; richer ones on the control plane. Keep the data path’s dependency set small and boring. Put the complex logic, the third-party calls, and the expensive decisions in the control plane, off the request path.
Separate fate: deploy, scale, and shed independently. Different deploy cadences (the data plane changes rarely and carefully; the control plane often), independent scaling, and independent overload behavior: a control-plane overload must not take serving down, and the data plane sheds load (ZFN-13) on its own terms.
Make staleness explicit and bounded. Fail-static means the data plane can run on slightly stale config; design for that — version the config, surface how stale each data-plane instance is, and bound how long divergence is acceptable before it’s an alert (not an outage).
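The cache-and-fail-static items above can be sketched as a small data-plane component. This is a minimal illustration under stated assumptions, not a reference implementation: `FailStaticConfig`, `Snapshot`, the routing-table payload, and the `fetch` callable are all hypothetical names, and the sketch is single-threaded.

```python
import time
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    version: int                 # config version assigned by the control plane
    routes: Mapping[str, str]    # hypothetical payload: path prefix -> backend
    fetched_at: float            # local clock time of the last successful refresh


class FailStaticConfig:
    """Data-plane view of config: refreshed off the hot path, read locally."""

    def __init__(self, fetch: Callable[[], Tuple[int, Mapping[str, str]]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # the only code path that talks to the control plane
        self._clock = clock
        self._snapshot: Optional[Snapshot] = None

    def refresh(self) -> bool:
        """Call from a timer or push handler, never from a request. Does not raise."""
        try:
            version, routes = self._fetch()
        except Exception:
            return False         # fail static: keep last-known-good
        if self._snapshot is not None and version < self._snapshot.version:
            return False         # ignore an out-of-order, older version
        self._snapshot = Snapshot(version, dict(routes), self._clock())
        return True

    def route(self, path: str) -> Optional[str]:
        """Hot path: local lookup only, no network call."""
        snap = self._snapshot
        if snap is None:
            return None
        for prefix, backend in snap.routes.items():
            if path.startswith(prefix):
                return backend
        return None

    def staleness_seconds(self) -> float:
        """Age of the local copy; export as a metric and alert on a bound."""
        snap = self._snapshot
        return float("inf") if snap is None else self._clock() - snap.fetched_at


# Usage with a fake control plane that goes down after the first fetch.
state = {"up": True}


def fake_fetch() -> Tuple[int, Mapping[str, str]]:
    if not state["up"]:
        raise ConnectionError("control plane unreachable")
    return 7, {"/api": "backend-a"}


cfg = FailStaticConfig(fake_fetch)
cfg.refresh()
state["up"] = False
cfg.refresh()                    # fails; the version-7 snapshot is retained
print(cfg.route("/api/users"))   # prints "backend-a"
```

The request path only ever touches `route`, so a control-plane outage shows up as a growing `staleness_seconds()` value rather than as failed requests. An instance that has never completed a fetch has nothing to fail static on, so cold start needs its own answer.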

Scope. The split is about coupling on the hot path. It’s fine — expected — for the data plane to receive state from the control plane and to report status back asynchronously; what you’re avoiding is the data plane being unable to serve because the control plane isn’t answering right now.
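The asynchronous status path in the scope note can be sketched the same way: the request path records status into a bounded in-memory buffer, and a background task delivers it when the control plane is reachable. `StatusReporter` and its `send` callable are hypothetical names; the sketch is single-threaded, and a real implementation would need a lock or a thread-safe queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Status = Dict[str, object]


class StatusReporter:
    """Buffers status for the control plane without ever blocking serving."""

    def __init__(self, send: Callable[[List[Status]], None], capacity: int = 1000) -> None:
        self._send = send                                      # hypothetical control-plane client call
        self._buffer: Deque[Status] = deque(maxlen=capacity)   # bounded: oldest entries fall off
        self.dropped = 0

    def record(self, status: Status) -> None:
        """Hot path: constant-time, in-memory only."""
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1                                  # count what the bound discards
        self._buffer.append(status)

    def flush(self) -> bool:
        """Background path: best-effort delivery; a failure only delays the report."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        try:
            self._send(batch)
        except Exception:
            return False                                       # keep the batch for the next attempt
        for _ in batch:
            self._buffer.popleft()
        return True
```

The design choice is that reporting loses data under a long outage instead of applying back-pressure to serving. The `dropped` counter keeps that loss visible.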

Consequences

Easier:

A control-plane outage degrades management, not serving: you temporarily can’t push changes, but traffic keeps flowing on last-known-good. This is often the difference between a non-event and an incident.
Each plane scales and deploys on its own terms — frequent, complex control-plane changes don’t risk the data path, and traffic growth doesn’t destabilize orchestration.
The data plane stays simple and auditable; complexity is concentrated where it can fail safely.

Harder:

Two systems and an asynchronous state-distribution mechanism between them — more to build than one service that just looks things up live.
Fail-static means operating on stale state, which you must reason about: bounded staleness, config versioning, and “how divergent is too divergent?” become real design questions.
Some decisions that genuinely need fresh state (e.g. a hard real-time revocation) require deliberate design to fit a push/cache model rather than a live lookup.
The boundary takes discipline to hold; it’s easy to “just call the control plane here” and quietly reintroduce the coupling.
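One possible way to fit the revocation case into a push/cache model, offered as an assumption rather than as this note’s prescribed design, is to combine short-lived tokens with a pushed, versioned deny list. A revoked token then stays usable on a given instance only until the push arrives or the token expires, whichever comes first. `Token` and `RevocationCache` are hypothetical names:

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass(frozen=True)
class Token:
    token_id: str
    expires_at: float   # seconds, on the same clock as `now` below


@dataclass
class RevocationCache:
    """Local deny list, updated by pushes from the control plane."""
    version: int = 0
    revoked: Set[str] = field(default_factory=set)

    def apply_push(self, version: int, revoked_ids: Iterable[str]) -> bool:
        """Push handler, off the hot path. Older or duplicate versions are ignored."""
        if version <= self.version:
            return False
        self.version = version
        self.revoked = set(revoked_ids)
        return True

    def is_allowed(self, token: Token, now: float) -> bool:
        """Hot path: expiry check plus a local set lookup, no live call."""
        return now < token.expires_at and token.token_id not in self.revoked
```

Under this scheme the token lifetime is the hard upper bound on exposure during a control-plane outage, which turns “how fresh must revocation be?” into a choice of TTL.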

References

ZFN-4 (Incident tooling must not depend on what it recovers): the same don’t-depend-on-what-can-be-down principle, here applied to the serving path rather than recovery tooling.
ZFN-5 (Make workload identity a platform-owned service): verifying against a cached trust root with no per-request control-plane lookup is a data-plane-independence pattern for auth.
ZFN-13 (Fail fast and push back: retries, load shedding, and flow control): each plane handles overload on its own terms; a control-plane storm must not topple serving.
Amazon Builders’ Library — Static stability using Availability Zones — the canonical write-up of fail-static / data-plane independence.
Kubernetes control plane vs. nodes, and the service-mesh data plane (Envoy/xDS) vs. control plane, as worked examples of the split.

Changelog

2026-06-12: First published as a Field Note.