Field Note 16 current
Separate the data plane from the control plane
TL;DR
Draw a hard line between the data plane — the high-volume, latency-sensitive path that does the actual work (serving requests, moving data, forwarding traffic) — and the control plane — the management and orchestration path that configures, schedules, provisions, and decides policy. They have different reliability requirements, different scaling characteristics, and different change rates, and the single most important property of the split is this: the data plane must keep working on last-known-good state when the control plane is unavailable. Never make a synchronous call to the control plane on the hot path. The control plane owns desired state and pushes it down; the data plane caches what it needs and fails static — it continues with the configuration it already has rather than failing because it couldn’t reach the controller.
If you couple them — the serving path calling the management service per request, or sharing its fate — then every control-plane bug, deploy, or overload becomes a serving outage. Keep the thing that has to be rock-solid free of the thing that changes constantly.
Context
The data plane and the control plane look like one system early on, so they get built as one: the same service that serves traffic also reads its own dynamic configuration live, calls the scheduler in-line, or looks up policy from the management database on every request. It works at low scale, and then the coupling turns into the dominant failure mode:
- They have opposite reliability profiles. The data plane must be simple, fast, and almost always up. The control plane is where the complexity lives — orchestration logic, expensive decisions, rich dependencies — so it’s where the bugs and the frequent deploys are. Couple them and you force the reliable thing to inherit the unreliable thing’s failure rate.
- They scale on different axes. Data-plane load scales with traffic; control-plane load scales with the number of resources and the rate of change. A burst of config changes or a reconciliation storm shouldn’t be able to starve request serving, and vice versa.
- A control-plane outage shouldn’t be a data-plane outage. If serving requires the management service to answer on every request, then the moment the control plane is impaired — a bad deploy, an overloaded API server, a dependency outage — serving stops, even though nothing was wrong with the data path itself.
This is the same independence principle as ZFN-4: don’t give the hot path a hard dependency on something that can be down. The well-run systems you rely on are built this way: a load balancer keeps forwarding on its last config if its controller dies; nodes keep running pods when the cluster control plane is unreachable; a service-mesh proxy keeps proxying on its last-pushed config when the control plane can’t be reached. In each case a control-plane outage degrades management (you can’t make changes), not serving.
Recommendation
Architect the two planes as separate systems with a one-way, asynchronous dependency: control plane → data plane, never the reverse on the hot path.
- Name the split explicitly. Decide which components are data plane (serving, forwarding, processing) and which are control plane (config, scheduling, provisioning, policy, metadata, coordination), and keep the responsibilities from bleeding across.
- Push config down; cache it; fail static. The control plane is the source of truth for desired state and pushes (or the data plane pulls and caches) the configuration the data plane needs. The data plane runs on that local copy and, when the control plane is unreachable, keeps operating on last-known-good — it does not fail because it couldn’t refresh. This is static stability: the system holds its current state through control-plane impairment.
- No synchronous control-plane call on the hot path. Per-request lookups to the management service, scheduler, or config database are the coupling to eliminate. Resolve identity, policy, and routing from cached/pushed state. (ZFN-5’s “verify against a cached trust root, no per-request key-distribution endpoint” is exactly this move for auth.)
- Minimal dependencies on the data plane; richer ones on the control plane. Keep the data path’s dependency set small and boring. Put the complex logic, the third-party calls, and the expensive decisions in the control plane, off the request path.
- Separate fate: deploy, scale, and shed independently. Different deploy cadences (the data plane changes rarely and carefully; the control plane often), independent scaling, and independent overload behavior — a control-plane overload must not take serving down, and the data plane sheds load (ZFN-13) on its own terms.
- Make staleness explicit and bounded. Fail-static means the data plane can run on slightly stale config, so design for that: version the config, surface how stale each data-plane instance is, and bound how long divergence is acceptable before it raises an alert (an alert, not an outage).
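The push/cache/fail-static and bounded-staleness bullets above can be sketched as a small local cache. This is a minimal illustration under stated assumptions, not a reference implementation: `Config`, `FailStaticCache`, `fetch`, and `max_staleness_s` are hypothetical names, and the transport to the control plane is abstracted as a callable that may raise.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Config:
    version: int   # assumed: a monotonically increasing version set by the control plane
    routes: dict   # whatever the data plane needs in order to serve


class FailStaticCache:
    """Holds last-known-good config; refresh failures never reach the hot path."""

    def __init__(self, fetch: Callable[[], Config], initial: Config,
                 max_staleness_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch            # hypothetical call to the control plane
        self._clock = clock
        self._lock = threading.Lock()
        self._current = initial
        self._last_ok = clock()        # time of the last successful refresh
        self.max_staleness_s = max_staleness_s

    def get(self) -> Config:
        # Hot path: a local read only, never a network call.
        with self._lock:
            return self._current

    def refresh_once(self) -> bool:
        # Runs off the hot path, e.g. from a background thread or timer.
        try:
            new = self._fetch()
        except Exception:
            return False               # fail static: keep serving last-known-good
        with self._lock:
            if new.version >= self._current.version:
                self._current = new    # never roll back to an older version
            self._last_ok = self._clock()
        return True

    def staleness_s(self) -> float:
        # Seconds since the control plane was last reached successfully.
        with self._lock:
            return self._clock() - self._last_ok

    def is_too_stale(self) -> bool:
        # Meant to drive an alert, not to stop serving.
        return self.staleness_s() > self.max_staleness_s
```

A request handler would call only `get()`; a background loop would call `refresh_once()` on an interval and export `staleness_s()` and the current `version` as metrics. Note that `is_too_stale()` deliberately does not change what `get()` returns.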
Scope. The split is about coupling on the hot path. It’s fine — expected — for the data plane to receive state from the control plane and to report status back asynchronously; what you’re avoiding is the data plane being unable to serve because the control plane isn’t answering right now.
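Reporting status back asynchronously can be sketched as a bounded, non-blocking queue drained off the request path. This is one possible shape, assuming a hypothetical `send` callable that delivers a status record to the control plane and may raise; `StatusReporter` and `max_pending` are illustrative names.

```python
import queue
from typing import Any, Callable


class StatusReporter:
    """Buffers status for the control plane without ever blocking serving."""

    def __init__(self, send: Callable[[Any], None], max_pending: int = 1000):
        self._send = send                       # hypothetical control-plane call
        self._q: queue.Queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0                        # approximate counter, for metrics

    def report(self, status: Any) -> None:
        # Hot path: enqueue or drop; never wait on the control plane.
        try:
            self._q.put_nowait(status)
        except queue.Full:
            self.dropped += 1

    def flush_once(self) -> int:
        # Background path: drain the queue; stop at the first send failure.
        sent = 0
        while True:
            try:
                item = self._q.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._send(item)
                sent += 1
            except Exception:
                self.dropped += 1               # the failed item is discarded
                return sent
```

The design choice here is that status is best-effort: when the control plane is slow or down, the data plane loses some telemetry rather than serving capacity.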
Consequences
Easier:
- A control-plane outage degrades management, not serving: you temporarily can’t push changes, but traffic keeps flowing on last-known-good. This is often the difference between a non-event and an incident.
- Each plane scales and deploys on its own terms — frequent, complex control-plane changes don’t risk the data path, and traffic growth doesn’t destabilize orchestration.
- The data plane stays simple and auditable; complexity is concentrated where it can fail safely.
Harder:
- Two systems and an asynchronous state-distribution mechanism between them — more to build than one service that just looks things up live.
- Fail-static means operating on stale state, which you must reason about: bounded staleness, config versioning, and “how divergent is too divergent?” become real design questions.
- Some decisions that genuinely need fresh state (e.g. a hard real-time revocation) require deliberate design to fit a push/cache model rather than a live lookup.
- The boundary takes discipline to hold; it’s easy to “just call the control plane here” and quietly reintroduce the coupling.
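As an illustration of the revocation point above, one possible design (an assumption for this sketch, not something this note prescribes) combines short-lived tokens with a pushed revocation set, and fails closed only for sensitive operations when the pushed view is older than a bound. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevocationView:
    revoked_ids: frozenset = frozenset()  # pushed by the control plane
    synced_at: float = 0.0                # time of the last successful push


def is_token_accepted(token_id: str, expires_at: float, now: float,
                      view: RevocationView, max_view_age_s: float,
                      sensitive: bool = False) -> bool:
    # Short token lifetimes bound exposure even if no push ever arrives.
    if now >= expires_at:
        return False
    # Local check against the pushed set; no per-request control-plane call.
    if token_id in view.revoked_ids:
        return False
    # Only high-risk operations refuse to run on an over-age view.
    if sensitive and now - view.synced_at > max_view_age_s:
        return False
    return True
```

Ordinary requests stay fail-static on a stale view, while the small set of operations marked `sensitive` trades availability for freshness; where that line sits is a policy decision.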
References
- ZFN-4 — the same don’t-depend-on-what-can-be-down principle; here applied to the serving path rather than recovery tooling.
- ZFN-5 — verifying against a cached trust root with no per-request control-plane lookup is a data-plane-independence pattern for auth.
- ZFN-13 — each plane handles overload on its own terms; a control-plane storm must not topple serving.
- Amazon Builders’ Library — Static stability using Availability Zones — the canonical write-up of fail-static / data-plane independence.
- Kubernetes control plane vs. nodes, and the service-mesh data plane (Envoy/xDS) vs. control plane, as worked examples of the split.
Changelog
- 2026-06-12: First published as a Field Note.