Field Note 5 current
Make workload identity a platform-owned service
TL;DR
How one service proves its identity to another is a problem every service in a decomposed system has. It should be solved once, as platform infrastructure, not reimplemented per team and not owned by whichever product happened to build it first. The shape I recommend: a small security-token service (STS), plus a shared library, that mints short-lived identity tokens for workloads and that any service uses to mint and to verify them.
Start simple; don’t let “we don’t have PKI” block you. A perfectly good first step is tokens signed with a shared symmetric key (HMAC) that the issuer and verifiers both hold. That is far better than per-service schemes or long-lived static secrets, and it is adoptable in an afternoon. The better end-state is asymmetric signing, where only the issuer can mint and verifiers hold just a public key. The ideal is a signing key rooted in a KMS/HSM, with verifiers anchoring to that root, so there is no public key-distribution endpoint (no JWKS) to operate. Pick the rung you can run well today and climb later; the important move is having one shared mechanism, not having PKI on day one.
New peer-authentication code consumes the platform module; introducing a new per-service identity scheme is a deliberate, reviewed exception, not a local choice.
Context
Authentication and authorization decide whether a workload may act. They don’t, on their own, give you a good answer to how a workload proves which workload it is to its peers — and once a system is decomposed into many services, every one of them needs that answer. Left unsolved at the platform level, it gets solved many times: each team picks its own peer-identity mechanism, the schemes interoperate poorly, they rotate differently, they fail differently, and you end up with an archipelago of auth code that is impossible to audit as one thing.
A few forces make a shared mechanism the right call:
- More than one service needs it. A capability the whole platform authenticates with should not be owned by one product, and should not be copy-pasted into every consumer.
- One mechanism beats many. Convergence on a single, well-understood mechanism means one thing to audit, one rotation story, one verifier, and one set of cross-language ergonomics.
- Verification shouldn’t require a key-distribution service. If every verifier has to fetch and cache a rotating public-key set from an endpoint, that endpoint becomes critical infrastructure with its own availability and trust problems. Anchoring trust to a root you already hold avoids it.
Recommendation
Treat workload identity as platform infrastructure with a single shared contract. Concretely:
- A platform-owned module (and a service where a boundary needs it). The minting, certificate management, verification, and token-shape logic live in one library, owned by the platform or security function, not by a product team. An STS service form exists for callers that can’t or shouldn’t hold signing material directly (it issues delegated, short-lived credentials); the in-process library covers the rest. Both are the same capability.
- Short-lived tokens, signed by a key the platform controls. Tokens are short-lived; how they’re signed is a progression, not a prerequisite:
  - Shared key (HMAC): the fine first step. The issuer and verifiers hold a shared symmetric secret. Simple, no certificate machinery, and already a large improvement. Its weakness is that every verifier can also mint (it holds the signing secret), so a verifier compromise is a minting compromise. Manage and rotate the shared key accordingly.
  - Asymmetric: the better step. Only the issuer holds the private key; verifiers hold the public key and can verify but not mint. A verifier compromise no longer lets an attacker forge identities.
  - KMS/HSM-rooted, CA-anchored: the ideal. A locally held, frequently rotated leaf key whose certificate chains to a CA private key that is held in a KMS/HSM and never extractable; the token carries the chain (e.g. an `x5c` header) and a verifier validates by anchoring to the CA it already trusts. No JWKS endpoint to publish or keep available; verifiers need only the CA (or the KMS public key).

  Choose the highest rung you can operate well now; the shared contract should let you raise it later without every consumer changing how it calls the library.
- Producers and verifiers use the shared code, not reimplementations. A service that needs a token shape or claim that the contract doesn’t offer proposes a change to the contract; it does not fork its own.
- The contract is a seam. Once many services mint and verify against it, the token shape, claims, and roles are hard to change. Evolve it additively, review changes the way you’d review any cross-team interface, and guard the module with code owners.
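The first rung and the "raise it later" seam can be sketched together. What follows is a minimal sketch, not the platform contract: the names `Signer`, `HmacSigner`, `mint`, and `verify`, the claim names (`sub`, `aud`, `iat`, `exp`), and the two-part `payload.signature` token format are all assumptions chosen for illustration. A real module would more likely emit a standard JWT/JWS through a maintained library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional, Protocol


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


class Signer(Protocol):
    """Hypothetical seam: a higher rung swaps this implementation, not the callers."""

    def sign(self, message: bytes) -> bytes: ...
    def verify(self, message: bytes, signature: bytes) -> bool: ...


class HmacSigner:
    """First rung: issuer and verifiers share one secret, so any holder can mint."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret

    def sign(self, message: bytes) -> bytes:
        return hmac.new(self._secret, message, hashlib.sha256).digest()

    def verify(self, message: bytes, signature: bytes) -> bool:
        # Constant-time comparison avoids leaking how many bytes matched.
        return hmac.compare_digest(self.sign(message), signature)


def mint(signer: Signer, workload: str, audience: str,
         ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """Mint a short-lived token naming `workload`, valid only for `audience`."""
    issued = int(time.time() if now is None else now)
    claims = {"sub": workload, "aud": audience,
              "iat": issued, "exp": issued + ttl_seconds}
    body = json.dumps(claims, sort_keys=True, separators=(",", ":"))
    payload = _b64url(body.encode("utf-8"))
    return payload + "." + _b64url(signer.sign(payload.encode("ascii")))


def verify(signer: Signer, token: str, audience: str,
           now: Optional[float] = None) -> dict:
    """Return the claims if signature, audience, and expiry all check out."""
    try:
        payload, signature = token.split(".")
        message = payload.encode("ascii")
        sig_bytes = _b64url_decode(signature)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Check the signature before parsing or trusting any claim.
    if not signer.verify(message, sig_bytes):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if claims["aud"] != audience:
        raise ValueError("wrong audience")
    if (time.time() if now is None else now) >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

A caller would write `token = mint(signer, "billing", "ledger")` on one side and `verify(signer, token, "ledger")` on the other. Because both functions take a `Signer`, moving to an asymmetric or KMS-backed implementation would change what is constructed at startup, not the call sites.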
Pair it with sender constraint. Identity tokens are still bearer tokens unless you bind them to a holder key, so a stolen token is replayable. Combine this with proof-of-possession binding (ZFN-6) so theft of a token alone isn’t enough, or sign the request/message itself (ZFN-7).
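One piece of that binding can be sketched with the standard library alone: checking that the key presented with a request is the key the token was issued to. The sketch assumes the DPoP convention of a `cnf.jkt` claim holding the SHA-256 JWK thumbprint of the holder's public key; whether ZFN-6 uses exactly this claim is an assumption, and the function names are hypothetical. It deliberately does not verify the proof's signature, which a real verifier must do before this comparison means anything.

```python
import base64
import hashlib
import hmac
import json

# Members that enter the thumbprint, per key type (JWK thumbprint convention).
_REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
    "OKP": ("crv", "kty", "x"),
}


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint over the required members, sorted, with no whitespace."""
    members = _REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps({name: jwk[name] for name in members},
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def is_bound_to(claims: dict, proof_jwk: dict) -> bool:
    """True only if the token's confirmation claim names the presented key.

    Assumes the caller has already verified the token's signature and the
    proof's signature; this covers the binding comparison only.
    """
    confirmation = claims.get("cnf")
    expected = confirmation.get("jkt") if isinstance(confirmation, dict) else None
    if not isinstance(expected, str):
        return False  # an unbound token is a plain bearer token: reject it here
    actual = jwk_thumbprint(proof_jwk)
    return hmac.compare_digest(expected.encode("utf-8"), actual.encode("utf-8"))
```

With this check in place, a token copied from a log or a proxy fails verification unless the attacker also holds the private key matching the thumbprint.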
Scope. This note is about workload (service) identity — one service proving which service it is to another. It’s complementary to how a workload authenticates to a cloud provider (ZFN-9); a single service typically uses both.
Consequences
Easier:
- One workload-identity mechanism across the platform — one rotation story, one verifier, one audit surface — instead of per-team schemes.
- New services get authenticated identity by consuming a module, not by building crypto.
- At the top rung, verification needs no key-distribution service: trust anchors to a KMS-held root, so there is no JWKS endpoint to operate. You can still start far simpler with a shared key.
Harder:
- Whoever owns the module takes on a capability the whole system depends on — with the on-call, versioning, and cross-language maintenance burden that implies. This is a real organizational shift, not just a code move.
- The contract becomes load-bearing; evolution must be additive and reviewed as a seam.
- Consumers that hold signing or private-key material inherit a blast-radius obligation: bound key residency, handle rotation/eviction carefully, never log it.
- Picking one mechanism forecloses others (mTLS-only, SPIFFE) for the cases they might have fit better. A single well-supported path beats a best-fit-per-case patchwork.
New obligations:
- New peer-authentication code uses the platform module; a new per-service scheme requires a reviewed exception, not a local decision.
- Contract changes (claims, roles, trust distribution) are code-owner-guarded and additive where possible.
- Consumers holding key material document and bound their custody (residency, rotation, eviction, no-logging).
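The custody obligations (bounded residency, eviction, no logging) can be made concrete in a small wrapper. This is a sketch under stated assumptions: `HeldKey` and `KeyEvicted` are hypothetical names, not part of any platform module, and zeroing memory in Python is best effort only, since the runtime may keep other copies of the bytes.

```python
import time
from typing import Callable, Optional


class KeyEvicted(Exception):
    """Raised when key material is requested after eviction or expiry."""


class HeldKey:
    """Hypothetical holder that bounds how long key bytes stay resident."""

    def __init__(self, material: bytes, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._material: Optional[bytearray] = bytearray(material)
        self._clock = clock
        self._expires_at = clock() + max_age_seconds

    def use(self) -> bytes:
        """Return the key bytes, or evict and raise once the residency bound passes."""
        if self._material is None or self._clock() >= self._expires_at:
            self.evict()
            raise KeyEvicted("key evicted; fetch a fresh one")
        return bytes(self._material)

    def evict(self) -> None:
        """Overwrite and drop the local copy (best effort in Python)."""
        if self._material is not None:
            for index in range(len(self._material)):
                self._material[index] = 0
            self._material = None

    def __repr__(self) -> str:
        # Never expose key bytes through logging or string formatting.
        return "HeldKey(<redacted>)"

    __str__ = __repr__
```

Rotation then becomes "construct a new `HeldKey` and call `evict()` on the old one", and an accidental `logger.info("key=%s", key)` prints only the redacted placeholder.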
References
- ZFN-6 — sender-constrained tokens (DPoP): bind these identity tokens to a holder key so a stolen one is useless.
- ZFN-7 — signing the request (and ideally response) itself, an alternative or complement to bearer identity tokens.
- ZFN-9 — authenticating to a cloud provider (the other half of a service’s identity story).
Changelog
- 2026-06-12: First published as a Field Note.