---
id: 16
title: "Separate the data plane from the control plane"
status: current
date: 2026-06-12
authors:
  - "Theo Zourzouvillys"
tags: [architecture, infra, reliability, scalability]
summary: "Split the serving path (data plane) from the management path (control plane). The data plane keeps serving on last-known-good config when the control plane is down — never call it on the hot path. Coupling them turns a control-plane bug into a serving outage."
supersedes: null
superseded_by: null
aliases: []
---

## TL;DR

Draw a hard line between the **data plane** — the high-volume, latency-sensitive path that does the
actual work (serving requests, moving data, forwarding traffic) — and the **control plane** — the
management and orchestration path that configures, schedules, provisions, and decides policy. They
have **different reliability requirements, different scaling characteristics, and different change
rates**, and the single most important property of the split is this: **the data plane must keep
working on last-known-good state when the control plane is unavailable.** Never make a synchronous
call to the control plane on the hot path. The control plane owns desired state and pushes it down;
the data plane caches what it needs and **fails static** — it continues with the configuration it
already has rather than failing because it couldn't reach the controller.

If you couple them — the serving path calling the management service per request, or sharing its fate —
then every control-plane bug, deploy, or overload becomes a serving outage. Keep the thing that has to
be rock-solid free of the thing that changes constantly.

## Context

The data plane and the control plane look like one system early on, so they get built as one: the same
service that serves traffic also reads its own dynamic configuration live, calls the scheduler
in-line, or looks up policy from the management database on every request. It works at low scale, and
then the coupling turns into the dominant failure mode:

- **They have opposite reliability profiles.** The data plane must be simple, fast, and almost always
  up. The control plane is where the *complexity* lives — orchestration logic, expensive decisions,
  rich dependencies — so it's where the bugs and the frequent deploys are. Couple them and you force
  the reliable thing to inherit the unreliable thing's failure rate.
- **They scale on different axes.** Data-plane load scales with *traffic*; control-plane load scales
  with the *number of resources and the rate of change*. A burst of config changes or a reconciliation
  storm shouldn't be able to starve request serving, and vice versa.
- **A control-plane outage shouldn't be a data-plane outage.** If serving requires the management
  service to answer on every request, then the moment the control plane is impaired — a bad deploy, an
  overloaded API server, a dependency outage — serving stops, even though nothing was wrong with the
  data path itself.
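
To put rough numbers on that last point, here is a minimal sketch. It assumes independent failures, and both availability figures are illustrative assumptions rather than measurements from any real system.

```python
# Illustrative only: both availability figures are assumptions, not measurements.

HOURS_PER_YEAR = 24 * 365


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every listed component to answer."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_hours(availability: float) -> float:
    """Expected unavailable hours per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


DATA_PLANE = 0.9999     # assumed: simple, rarely changed serving path
CONTROL_PLANE = 0.995   # assumed: complex, frequently deployed management path

# Coupled: every request also needs the control plane to answer.
coupled = serial_availability(DATA_PLANE, CONTROL_PLANE)

# Decoupled: requests are served from local state; the control plane is off the hot path.
decoupled = serial_availability(DATA_PLANE)

print(f"coupled:   {coupled:.5f} ({downtime_hours(coupled):.1f} h/year down)")
print(f"decoupled: {decoupled:.5f} ({downtime_hours(decoupled):.1f} h/year down)")
```

With these assumed figures the coupled path is less available than its weakest component: roughly 45 hours of downtime a year against under one hour for the decoupled path.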

This is the same independence principle as [ZFN-4](/zfn/4-incident-tooling-independence/): don't put
a hard dependency on the hot path to something that can be down. The well-run systems you rely on are
built this way — a load balancer keeps forwarding on its last config if its controller dies; nodes keep
running pods when the cluster control plane is unreachable; a service-mesh proxy keeps proxying on its
last-pushed config when the control plane can't be reached. In each case a control-plane outage degrades
*management* (you can't make changes), not *serving*.

## Recommendation

**Architect the two planes as separate systems with a one-way, asynchronous dependency: control plane
→ data plane, never the reverse on the hot path.**

- **Name the split explicitly.** Decide which components are data plane (serving, forwarding,
  processing) and which are control plane (config, scheduling, provisioning, policy, metadata,
  coordination), and keep the responsibilities from bleeding across.
- **Push config down; cache it; fail static.** The control plane is the source of truth for desired
  state. It *pushes* the configuration the data plane needs (or the data plane *pulls* it), and either
  way the data plane **caches** it locally.
  The data plane runs on that local copy and, when the control plane is unreachable, keeps operating on
  **last-known-good** — it does not fail because it couldn't refresh. This is **static stability**: the
  system holds its current state through control-plane impairment.
- **No synchronous control-plane call on the hot path.** Per-request lookups to the management service,
  scheduler, or config database are the coupling to eliminate. Resolve identity, policy, and routing
  from cached/pushed state. ([ZFN-5](/zfn/5-platform-workload-identity-service/)'s "verify against a
  cached trust root, no per-request key-distribution endpoint" is exactly this move for auth.)
- **Minimal dependencies on the data plane; richer ones on the control plane.** Keep the data path's
  dependency set small and boring. Put the complex logic, the third-party calls, and the expensive
  decisions in the control plane, off the request path.
- **Separate fate: deploy, scale, and shed independently.** Different deploy cadences (the data plane
  changes rarely and carefully; the control plane often), independent scaling, and independent overload
  behavior — a control-plane overload must not take serving down, and the data plane sheds load
  ([ZFN-13](/zfn/13-load-shedding-and-flow-control/)) on its own terms.
- **Make staleness explicit and bounded.** Fail-static means the data plane can run on slightly stale
  config; design for that — version the config, surface how stale each data-plane instance is, and bound
  how long divergence is acceptable before it's an alert (not an outage).
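
The push/cache, fail-static, and bounded-staleness points above can be sketched together as a small local config cache. This is a minimal illustration under stated assumptions, not a reference implementation: `FailStaticConfig`, `ConfigSnapshot`, the `fetch` callable, and the route-table shape are all hypothetical names introduced here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ConfigSnapshot:
    version: int              # monotonically increasing config version (assumed)
    routes: Dict[str, str]    # hypothetical payload: path prefix -> backend
    fetched_at: float         # clock reading when this snapshot was last confirmed


class FailStaticConfig:
    """Local config cache: refreshed off the hot path, read from memory on it."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[int, Dict[str, str]]],  # hypothetical control-plane client
        initial_version: int,
        initial_routes: Dict[str, str],
        max_staleness_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._max_staleness_s = max_staleness_s
        self._snapshot = ConfigSnapshot(initial_version, initial_routes, clock())

    def refresh(self) -> bool:
        """Run from a background loop. Returns whether the control plane answered."""
        try:
            version, routes = self._fetch()
        except Exception:
            # Fail static: keep serving on last-known-good.
            return False
        if version >= self._snapshot.version:
            # Swap in one assignment so readers never see a half-updated config.
            self._snapshot = ConfigSnapshot(version, routes, self._clock())
        return True

    def current(self) -> ConfigSnapshot:
        """Hot path: a memory read, never a network call."""
        return self._snapshot

    def staleness_s(self) -> float:
        """Seconds since the snapshot was last confirmed; export this as a metric."""
        return self._clock() - self._snapshot.fetched_at

    def is_too_stale(self) -> bool:
        """Alerting signal only. Serving is deliberately not gated on it."""
        return self.staleness_s() > self._max_staleness_s
```

Request handlers call only `current()`. A background thread or timer calls `refresh()`, and `is_too_stale()` feeds an alert rather than a request-time check, so a long control-plane outage pages someone without stopping traffic.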

**Scope.** The split is about coupling on the *hot path*. It's fine — expected — for the data plane to
*receive* state from the control plane and to *report* status back asynchronously; what you're avoiding
is the data plane being unable to serve because the control plane isn't answering right now.
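
The reporting direction can follow the same rule. The sketch below shows one way to keep status reporting asynchronous and best-effort: the hot path only appends to a bounded in-memory buffer, and a background task drains it. `StatusReporter` and its `send` callable are hypothetical names, and dropping events when the buffer is full is an assumption about what is acceptable for status data.

```python
import queue
from typing import Callable, Dict


class StatusReporter:
    """Best-effort status reporting that never blocks request handling."""

    def __init__(self, send: Callable[[Dict], None], max_pending: int = 1000) -> None:
        self._send = send  # hypothetical control-plane client call
        self._pending: "queue.Queue[Dict]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0   # approximate under concurrency; good enough for a metric

    def record(self, event: Dict) -> None:
        """Hot path: enqueue without blocking; drop the event if the buffer is full."""
        try:
            self._pending.put_nowait(event)
        except queue.Full:
            self.dropped += 1

    def flush(self) -> int:
        """Background task: send buffered events. Returns how many were sent."""
        sent = 0
        while True:
            try:
                event = self._pending.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._send(event)
            except Exception:
                # Control plane unreachable: requeue (ordering is not preserved)
                # and stop until the next flush. Serving is unaffected.
                self.record(event)
                return sent
            sent += 1
```

The bounded buffer is the design choice that matters: when the control plane is down for a long time, the data plane loses some status events instead of growing memory or stalling requests.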

## Consequences

**Easier:**

- A control-plane outage degrades management, not serving: you temporarily can't push changes, but
  traffic keeps flowing on last-known-good. This is often the difference between a non-event and an
  incident.
- Each plane scales and deploys on its own terms — frequent, complex control-plane changes don't risk
  the data path, and traffic growth doesn't destabilize orchestration.
- The data plane stays simple and auditable; complexity is concentrated where it can fail safely.

**Harder:**

- Two systems and an asynchronous state-distribution mechanism between them — more to build than one
  service that just looks things up live.
- Fail-static means operating on stale state, which you must reason about: bounded staleness, config
  versioning, and "how divergent is too divergent?" become real design questions.
- Some decisions that genuinely need fresh state (e.g. a hard real-time revocation) require deliberate
  design to fit a push/cache model rather than a live lookup.
- The boundary takes discipline to hold; it's easy to "just call the control plane here" and quietly
  reintroduce the coupling.
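
For the freshness-sensitive case, one possible shape (an assumption, not something this note prescribes) is to push revocations like any other config and apply a freshness bound only to the decisions that need it. Ordinary requests stay fail-static; a designated sensitive class is refused when the pushed list is older than its bound. `PushedRevocations` and the `sensitive` flag are hypothetical.

```python
import time
from typing import Callable, FrozenSet, Iterable, Optional


class PushedRevocations:
    """Revocation list pushed by the control plane and checked locally."""

    def __init__(
        self,
        sensitive_max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._clock = clock
        self._sensitive_max_age_s = sensitive_max_age_s
        self._revoked: FrozenSet[str] = frozenset()
        self._updated_at: Optional[float] = None  # None until the first push arrives

    def apply_push(self, revoked_ids: Iterable[str]) -> None:
        """Called when the control plane pushes a full revocation list."""
        self._revoked = frozenset(revoked_ids)
        self._updated_at = self._clock()

    def allows(self, credential_id: str, sensitive: bool = False) -> bool:
        """Hot path: local check only, no call to the control plane."""
        if credential_id in self._revoked:
            return False
        if not sensitive:
            # Ordinary traffic fails static: a stale list is still used.
            return True
        if self._updated_at is None:
            return False
        # Sensitive decisions fail closed once the list exceeds its freshness bound.
        return self._clock() - self._updated_at <= self._sensitive_max_age_s
```

This trades availability for freshness on a narrow, explicitly named set of operations, and leaves the rest of the serving path independent of the control plane.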

## References

- [ZFN-4](/zfn/4-incident-tooling-independence/) — the same don't-depend-on-what-can-be-down principle;
  here applied to the serving path rather than recovery tooling.
- [ZFN-5](/zfn/5-platform-workload-identity-service/) — verifying against a cached trust root with no
  per-request control-plane lookup is a data-plane-independence pattern for auth.
- [ZFN-13](/zfn/13-load-shedding-and-flow-control/) — each plane handles overload on its own terms; a
  control-plane storm must not topple serving.
- [Amazon Builders' Library — Static stability using Availability Zones](https://aws.amazon.com/builders-library/static-stability-using-availability-zones/) — the canonical write-up of fail-static / data-plane independence.
- Kubernetes control plane vs. nodes, and the service-mesh data plane (Envoy/xDS) vs. control plane, as
  worked examples of the split.

## Changelog

- **2026-06-12**: First published as a Field Note.
