Engineering Field Notes

50 on record. Field Notes are how I write down the engineering practices I keep reaching for — the context behind each one, the approach I'd recommend, and the trade-offs it asks you to accept. They're observations from the field, not commandments: opinions formed over time, kept current as I learn, and meant to be argued with. I've spent my career in high-scale SaaS and the infrastructure under it, so some lean that way — though the reasoning is meant to generalize. Some are open problems — questions I don't have a good answer to yet. New here? Start with ZFN-0 for how these work.

Every note from the original import is dated June 12, 2026 — the day I imported this collection and made it public. That's not when each was written: the thinking was built and refined privately over years of doing the work; this is just where it became public. Notes published from here on carry their real date.

A note argues a position; a Blueprint specifies one. Where a practice needs more than reasoning to act on — a wire format, a validation order, numbered requirements you can hold an implementation to — it gets written up as a complete, versioned implementation spec instead, built to be handed to whoever (or whatever) is doing the building.

notes.json · blueprints · RSS

ZFN-49 · July 7, 2026 · Verify by computation, not lookup; store revocations, not issuances · Verification should be a computation, not a query: HMACs, signatures, hashes, and pass-by-value claims let any node verify locally. When revocation is rarer than issuance, invert the state — keep the few revocations for the lifetime of what they revoke, not a row per grant. · security architecture auth scalability
ZFN-48 · June 29, 2026 · Emit async work into the WAL, not a job table · When a DB write should trigger async work, ride the WAL instead of dual-writing or polling a job table. pg_logical_emit_message emits the event transactionally — outbox semantics, no table. A WAL listener consumes it statefully and fans out, keeping load off the primary. · architecture infra data messaging reliability
ZFN-47 · June 29, 2026 · Govern the contract between teams, not the code inside them · Teams own services end to end; one team owns the gateway that dispatches to them. Govern exactly one thing centrally — the contract at the boundary (schema, identity, errors, idempotency) — and enforce it at runtime. Don't mandate libraries; ship them as an opt-in blueprint. · architecture platform api culture leadership
ZFN-46 · June 13, 2026 · Extract the reusable component from the product — and open-source it · Product work keeps producing general infrastructure — a sync engine, a bus bridge. When it isn't your moat, pull it out behind a clean, zero-opinion boundary; better yet, open-source it. Extraction forces honest design; don't do it before it's earned its generality. · architecture open-source design process
ZFN-45 · June 12, 2026 · Read the standards; better yet, help write them · Learn to read standards docs — RFCs, W3C recs — fluently; they're the primary source, not a last resort. Even better, get involved: reading them well makes you a sharper builder, and helping write them is the best protocol education there is. · interop learning industry principles
ZFN-44 · June 12, 2026 · Burnout is a conditions problem, not a workload problem · Burnout was never about the hours. I've woken up excited on 100-hour weeks and drowned in thirty-hour ones; the variable is the conditions, not the workload. The red flags I watch for, the six pillars that make intensity sustainable, and the levers I pull at the first tremor. · culture ic principles
ZFN-43 · June 12, 2026 · Name your systems as archetypes, not machinery · Names shape thought: a good name compresses a system's purpose, boundaries, and temperament into a word. Name the long-lived cast as archetypes — a Sentinel, a Seer — not Rule Engine and Decision Service. Bureaucratic names flatten judgment; a character must live up to its role. · architecture design culture communication
ZFN-42 · conviction · June 12, 2026 · My one cloud is AWS · Applying the one-cloud principle (ZFN-32), my pick is AWS — 100%, a league of its own. Go native: skip Kubernetes, use ECS; lean into SQS, SNS, Kinesis, IAM, ALB, RDS. The one exception to native — provision with OpenTofu, not CloudFormation, now that LLMs write .tf so well. · architecture infra cloud aws
ZFN-41 · June 12, 2026 · Make the technical choice you can put your name behind · When two options would both work, it's fine to choose the one your experience lets you put your name behind, and to resist pressure that isn't about the merits — vendor incentives, politics, top-down preference. Owning the call costs capital up front; being right repays it. · leadership culture process decisions
ZFN-40 · June 12, 2026 · No anonymous "system" actor · If "system" appears as an actor in your audit log, attribution is already broken. Every automated action — cron job, cleanup task, migration, agent — runs as a named identity with its own credentials and scope, so "who did this?" has an answer and revocation is surgical. · security auth infra operations
ZFN-39 · June 12, 2026 · Break loops, not spirals · Event-driven and agentic systems echo. A loop — a lineage recurring with the same data — produces nothing new; detect it with runtime-stamped provenance and break it loudly. A spiral — recurring with new data — is legitimate work; bound it with budgets, never loop-breakers. · architecture events reliability llm
ZFN-38 · June 12, 2026 · Agents are principals: delegate, never impersonate · An agent acting with a copied user credential is impersonation — untraceable by design. Give agents their own identities and keys; let them act for a human only through explicit, scoped, time-bounded, revocable delegation; and record both actor and principal on every action. · security auth llm architecture
ZFN-37 · June 12, 2026 · Every lock is a lease · A lock that can outlive its holder is a deadlock scheduled for later. Give every lock — including informal ones like claimed_by columns — a TTL, a named owner, and a heartbeat; make expiry automatic and server-side; fence the side effects so a stale holder can't corrupt anything. · architecture infra reliability consistency
ZFN-36 · June 12, 2026 · An untested backup is not a backup — test it by restoring · An untested backup is a hope, not a backup — the only thing that counts is a restore. Rehearse restores regularly (game days), measure and meet your RTO/RPO, automate them, and cover the whole recovery path — data, schema, config, secrets, cutover — not just the dump. · reliability infra incident operations
ZFN-35 · June 12, 2026 · Reference secrets in config; dereference, refresh, and re-fetch · Don't put secret values in config — store a reference (a path in a secret store) and dereference it at runtime via your workload identity. Refresh on a signal or expiry so rotation needs no redeploy; re-fetch on auth failure so a rotated secret self-heals. · security infra config reliability
ZFN-34 · June 12, 2026 · A resource-free 'bouncer' account: the single gateway to customer resources · Funnel access to customer resources through one dedicated account that holds no resources. Customer trust names only its role; the role is denied from your own org (aws:ResourceOrgID). Fenced both ways, it shrinks the confused-deputy surface to one audited gateway. · security infra cloud aws multi-tenancy
ZFN-33 · signal · June 12, 2026 · Are LLMs swinging us away from prepackaged services? · A signal: LLMs may be swinging build-vs-buy away from prepackaged services. A generic complex service used to beat building until real scale; now building what you need is cheap. Maybe open source becomes shared architectures an LLM implements per user — like C replaced ASM. · llm architecture open-source industry
ZFN-32 · June 12, 2026 · Commit to one cloud, and go all-in native · Cloud-independence is a false benefit: portability is the least common denominator, costing you the native services that are the point. For end products you run: commit to one cloud, go all-in native. Libraries, software others run, LLM-embeddables, and edge code are exempt. · architecture infra cloud
ZFN-31 · June 12, 2026 · Own your components — when you deeply understand the domain · Owning your own components rather than generic off-the-shelf services is often the better path as you grow: own what's core, lean on small vetted libraries for the hard parts. LLMs make it attainable at smaller scale — but only when you truly understand the domain, or it hurts. · architecture process llm design
ZFN-30 · June 12, 2026 · Use the standard; don't reinvent the protocol · When a standard exists for a common or complex problem, use it — don't reinvent the protocol. Standards encode huge adversarial expertise, especially in auth and crypto; a partial implementation beats rolling your own. You're not that special, and your problem isn't either. · architecture security api interop
ZFN-29 · June 12, 2026 · Blameless culture, taken seriously — and its one hard line · When something breaks, support the person, don't blame. Run post-mortems with ceremony and learn at every level — software, org, culture, process, even solo. Blameless protects honest mistakes, not dishonesty: evading or blaming gets coached; hiding evidence is a firing offense. · culture leadership incident reliability process
ZFN-28 · open problem · June 12, 2026 · Capability without understanding: brute-force LLM PRs · An open problem: people brute-force PRs with LLMs in domains they don't understand, taking on more than their knowledge supports — and the struggle that used to teach them is smoothed away. How do we stop un-understood code without killing learning or banning a good tool? · llm culture process learning
ZFN-27 · June 12, 2026 · Don't tolerate assholes — but be strict about what one is · Don't tolerate assholes — people who demean, belittle, punch down. But filter hard on the word: disagreeing, raising ideas, or opening a competing PR isn't being an asshole, it's the work. Assholes attack people; colleagues attack problems. Don't let the label silence dissent. · culture leadership communication ic
ZFN-26 · June 12, 2026 · AI-assisted content needs no disclaimer, only a human who can back it · Using an LLM to draft engineering content — chat, commits, PRs, docs, comments — is fine and needs no disclaimer. The obligation is human co-signing: every word under your name is one you drove, reviewed, and can defend. Disclose when the ideas are the model's, not yours. · principles process llm
ZFN-25 · June 12, 2026 · Track the version a client has seen for read-your-writes · For read-your-writes across backends, track the latest version a client has seen — a token or vector clock. Return it on write; reads then go to a backend at least that fresh. Hold it client-side (a token they present) or server-side (a gateway tracks the session and routes). · architecture data consistency api
ZFN-24 · June 12, 2026 · One transactional store per write; propagate changes asynchronously · Commit each logical write to exactly one transactional store; update other systems via reliable ordered async events — never a synchronous write across two stores, and never 2PC. With a relational primary the WAL is your replayable journal; write events into the same transaction. · architecture data reliability consistency
ZFN-23 · June 12, 2026 · Rewriting an implementation is fine — refactoring isn't always the answer · Refactoring isn't always right. When the structure is wrong at the root, it's fine — often better — to rewrite an implementation from scratch. Clean interfaces and data models make the implementation disposable: stable contract, swappable internals. LLMs make it cheaper still. · architecture process refactoring llm
ZFN-22 · June 12, 2026 · Quarantine bad architecture behind an interface, then replace it · When a subsystem is complex and badly architected, quarantine it at its seam: write a clean adapter interface over the mess so the rest of the system depends on the contract, then build a better implementation behind it and expose the new interface directly. · architecture refactoring design process
ZFN-21 · June 12, 2026 · Cache only immutable objects; treat caches as tech debt · Use caches sparingly, only for immutable addressed objects — never for mutable DB results, where invalidation bugs and stale reads live; use projections instead. A cache in the data path is usually a patch over an architectural gap that trades correctness for performance. · architecture data performance reliability
ZFN-20 · June 12, 2026 · The simplest-looking system is often the most complex to live with · The system that's simplest to stand up often isn't simplest to live with — it skips the correctness edge cases, so bugs and inconsistency surface fast. A more deliberate design has more parts but fewer surprises, and is often the simpler one over time. · architecture data design philosophy
ZFN-19 · June 12, 2026 · Annotate read-only and idempotent endpoints; make every mutation idempotent · Annotate every endpoint as read-only (safe) or idempotent, in the schema, so infrastructure can retry, route to replicas, and cache safely. Make every state-changing endpoint idempotent (idempotency keys for create/charge/send); a non-idempotent retry double-applies. · api reliability architecture correctness
ZFN-18 · June 12, 2026 · Enforce a quota at ingress on every endpoint — even unabused ones · Put a quota on every endpoint and enforce it at ingress from day one — per tenant, principal, IP — even for endpoints nobody abuses yet. Unlimited-by-default means the first runaway client or compromised key is an outage. Return 429 + Retry-After; retrofitting limits is painful. · reliability security api infra multi-tenancy
ZFN-17 · June 12, 2026 · Separate configuration, state, and ephemeral data · Customer data splits into mostly-static config, durable state, and ephemeral sessions — different access, durability, and change rates. Model and store each separately. For bounded static config, prefer loading one validated snapshot held in memory over fetching on demand. · architecture data multi-tenancy design
ZFN-16 · June 12, 2026 · Separate the data plane from the control plane · Split the serving path (data plane) from the management path (control plane). The data plane keeps serving on last-known-good config when the control plane is down — never call it on the hot path. Coupling them turns a control-plane bug into a serving outage. · architecture infra reliability scalability
ZFN-15 · June 12, 2026 · Partition customer data by tenant from day one · Make customer data tenant-partitioned from day one: tenant-scope every query, never join across tenants, route through a tenant→location directory. Run one physical database at first — but keep the model shardable. Retrofitting isolation onto a shared DB is brutal. · architecture data multi-tenancy scalability security
ZFN-14 · June 12, 2026 · Define every API with a schema, and generate the clients · Define every API with a machine-readable schema (OpenAPI, Protobuf, GraphQL) as the source of truth, and generate clients and server stubs from it — never hand-roll request-building and JSON parsing. Hand-written clients drift and break silently; check schema compatibility in CI. · architecture api process correctness
ZFN-13 · June 12, 2026 · Fail fast and push back: retries, load shedding, and flow control · Build client retries (backoff, jitter, Retry-After) from day one. Under overload, shed fast and push the failure back to the source to retry — don't retry internally and amplify it. Flow-control everywhere, bound every queue, and don't take more work than you can finish in time. · reliability architecture infra resilience
ZFN-12 · June 12, 2026 · Queues, topics, and journals are different tools — don't conflate them · Queues (competing consumers), topics (fan-out), and journals (ordered, replayable logs) give different guarantees. Don't conflate them; a pipeline often uses several. Prefer journals over topics, but not where head-of-line blocking hurts. With queues, bound the concurrency. · architecture infra messaging events reliability
ZFN-11 · June 12, 2026 · Route outbound HTTP through an isolated egress proxy · Application compute shouldn't make arbitrary outbound HTTP — it's an SSRF pivot to internal services and the cloud metadata endpoint. Route all egress through a proxy (SOCKS, or a gRPC egress service) on isolated compute with no route inward. The proxy's network is the boundary. · security infra network ssrf
ZFN-10 · June 12, 2026 · Pin the expected owner on cross-account resource calls (confused-deputy defense) · Authority to call a resource isn't proof it's the one you meant. Any call crossing an account boundary must assert the expected owner: ExpectedBucketOwner on S3, aws:ResourceAccount conditions, validation of untrusted ARNs, plus inbound trust pinned with SourceArn/ExternalId. · security infra cloud auth
ZFN-9 · June 12, 2026 · No long-lived cloud keys; workloads authenticate by federated identity · No static AWS or GCP keys anywhere — not in code, secret stores, or env. Workloads use their runtime's own identity and cross clouds by exchanging it (OIDC) for short-lived credentials via federation. Static keys are a documented carve-out only. · security auth infra cloud
ZFN-8 · June 12, 2026 · Don't hide behind anonymous 'people' · Never invoke unnamed 'people' to carry weight — 'a few people are concerned', 'some think'. It launders one view as phantom consensus and makes the listener argue a crowd they can't see. Name them and bring them in, or own it. If they can't speak up, fix the culture. · culture leadership communication ic
ZFN-7 · June 12, 2026 · Sign the message, not just the session (HTTP Message Signatures) · A bearer token proves nothing about the request it rides on. Sign the message itself (HTTP Message Signatures, RFC 9421) — request, and ideally response — so the recipient can prove who sent this exact message and not a byte changed. Shared keys first; asymmetric better. · security auth http
ZFN-6 · June 12, 2026 · Bind tokens to a key: sender-constrained tokens (DPoP) · A bearer token grants access to whoever holds it — steal it, replay it. Bind the token to a holder key (DPoP, RFC 9449) so using it requires proving possession of a private key the token names. A stolen token alone becomes useless. · security auth http
ZFN-5 · June 12, 2026 · Make workload identity a platform-owned service · Workload identity belongs in shared platform infrastructure, not reimplemented per service. A small token service mints short-lived tokens any service verifies. Shared keys are a fine first step; asymmetric signing the better end-state — don't let 'no PKI' block it. · security auth infra platform
ZFN-4 · June 12, 2026 · Incident tooling must not depend on what it recovers · Anything you need to respond to an incident — deploy/rollback, kill switches, observability, break-glass access — must not depend, directly or transitively, on the systems likely to be down during it. Never gate incident tooling behind a system it might need to recover. · principles reliability incident security infra
ZFN-3 · June 12, 2026 · Default-encrypt internal service traffic · All external traffic is TLS, no exceptions. Internal traffic is encrypted by default; an internal call site may skip transport encryption (never authentication) only under a documented, audited carve-out anchored to a network-perimeter guarantee. · security infra transport
ZFN-2 · June 12, 2026 · Engineering priority ordering · When concerns conflict, prioritize security > correctness > availability > performance — and never trade a higher-ranked concern for a lower one. The rule binds the moment you must choose. Cite it instead of re-arguing it. · principles process
ZFN-1 · June 12, 2026 · Keep engineering decision records · Record significant engineering decisions as short, versioned markdown files — context, decision, consequences. Write one for cross-team contracts, directional principles, hard-to-reverse choices, and conventions others must follow. Cite them instead of re-arguing. · process meta
ZFN-0 · June 12, 2026 · Engineering Field Notes · What Field Notes are, why I write them, and how they work — numbered, status-tracked notes on building software well, published openly and importable as JSON and raw markdown. · meta process
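
The ZFN-49 summary in the list above (verify by computation, store revocations rather than issuances) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the note's own implementation: the claim names `jti` and `exp`, the `body.tag` token layout, the hard-coded `SECRET`, and every function name here are hypothetical choices made for the example.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared secret for the sketch; a real service would fetch it
# from a secret store rather than hard-coding it.
SECRET = b"demo-secret-not-for-production"


def _tag(body: str) -> str:
    """HMAC-SHA256 over the encoded claims, as a hex string."""
    return hmac.new(SECRET, body.encode("utf-8"), hashlib.sha256).hexdigest()


def issue(claims: dict) -> str:
    """Mint a token. The claims travel by value; nothing is stored per grant."""
    raw = json.dumps(claims, sort_keys=True).encode("utf-8")
    body = base64.urlsafe_b64encode(raw).decode("ascii")
    return f"{body}.{_tag(body)}"


def verify(token: str, now: int, revoked: dict) -> Optional[dict]:
    """Verify locally: recompute the MAC, check expiry, consult the small
    revocation map. Returns the claims, or None if the token is rejected."""
    body, sep, tag = token.rpartition(".")
    if not sep:
        return None
    # Constant-time comparison of the recomputed tag and the presented one.
    if not hmac.compare_digest(_tag(body).encode("ascii"), tag.encode("utf-8")):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] <= now:
        return None
    if claims["jti"] in revoked:
        return None
    return claims


def revoke(claims: dict, revoked: dict) -> None:
    """Record a revocation, keyed by token id, remembering when it can be dropped."""
    revoked[claims["jti"]] = claims["exp"]


def prune(revoked: dict, now: int) -> None:
    """Drop revocations whose tokens have expired anyway."""
    for jti in [j for j, exp in revoked.items() if exp <= now]:
        del revoked[jti]
```

In this sketch the only stored state is the `revoked` map, which holds one entry per revoked token and only until that token would have expired. Issuing a token writes nothing, and any node holding the key can run `verify` without a database query.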