Engineering Field Notes
46 on record. Field Notes are how I write down the engineering practices I keep reaching for — the context behind each one, the approach I'd recommend, and the trade-offs it asks you to accept. They're observations from the field, not commandments: opinions formed over time, kept current as I learn, and meant to be argued with. I've spent my career in high-scale SaaS and the infrastructure under it, so some lean that way — though the reasoning is meant to generalize. Some are open problems — questions I don't have a good answer to yet. New here? Start with ZFN-0 for how these work.
Every note here is dated June 12, 2026 — the day I imported this collection and made it public. That's not when each was written: the thinking was built and refined privately over years of doing the work; this is just where it became public. Notes published from here on carry their real date.
- Burnout is a conditions problem, not a workload problem
  Burnout was never about the hours. I've woken up excited on 100-hour weeks and drowned in thirty-hour ones; the variable is the conditions, not the workload. The red flags I watch for, the six pillars that make intensity sustainable, and the levers I pull at the first tremor.
- Name your systems as archetypes, not machinery
  Names shape thought: a good name compresses a system's purpose, boundaries, and temperament into a word. Name the long-lived cast as archetypes — a Sentinel, a Seer — not Rule Engine and Decision Service. Bureaucratic names flatten judgment; a character must live up to its role.
- Make the technical choice you can put your name behind
  When two options would both work, it's fine to choose the one your experience lets you put your name behind, and to resist pressure that isn't about the merits — vendor incentives, politics, top-down preference. Owning the call costs capital up front; being right repays it.
- Agents are principals: delegate, never impersonate
  An agent acting with a copied user credential is impersonation — untraceable by design. Give agents their own identities and keys; let them act for a human only through explicit, scoped, time-bounded, revocable delegation; and record both actor and principal on every action.
- Every lock is a lease
  A lock that can outlive its holder is a deadlock scheduled for later. Give every lock — including informal ones like claimed_by columns — a TTL, a named owner, and a heartbeat; make expiry automatic and server-side; fence the side effects so a stale holder can't corrupt anything.
- An untested backup is not a backup — test it by restoring
  An untested backup is a hope, not a backup — the only thing that counts is a restore. Rehearse restores regularly (game days), measure and meet your RTO/RPO, automate them, and cover the whole recovery path — data, schema, config, secrets, cutover — not just the dump.
- Reference secrets in config; dereference, refresh, and re-fetch
  Don't put secret values in config — store a reference (a path in a secret store) and dereference it at runtime via your workload identity. Refresh on a signal or expiry so rotation needs no redeploy; re-fetch on auth failure so a rotated secret self-heals.
- A resource-free 'bouncer' account: the single gateway to customer resources
  Funnel access to customer resources through one dedicated account that holds no resources. Customer trust policies name only its role; the role is denied access to your own org's resources (aws:ResourceOrgID). Fenced both ways, it shrinks the confused-deputy surface to one audited gateway.
- Are LLMs swinging us away from prepackaged services?
  A signal: LLMs may be swinging build-vs-buy away from prepackaged services. A generic complex service used to beat building until real scale; now building what you need is cheap. Maybe open source becomes shared architectures an LLM implements per user — like C replaced ASM.
- Commit to one cloud, and go all-in native
  Cloud-independence is a false benefit: portability is the least common denominator, costing you the native services that are the point. For end products you run: commit to one cloud, go all-in native. Libraries, software others run, LLM-embeddables, and edge code are exempt.
- Own your components — when you deeply understand the domain
  Owning your own components rather than generic off-the-shelf services is often the better path as you grow: own what's core, lean on small vetted libraries for the hard parts. LLMs make it attainable at smaller scale — but only when you truly understand the domain, or it hurts.
- Use the standard; don't reinvent the protocol
  When a standard exists for a common or complex problem, use it — don't reinvent the protocol. Standards encode huge adversarial expertise, especially in auth and crypto; a partial implementation beats rolling your own. You're not that special, and your problem isn't either.
- Blameless culture, taken seriously — and its one hard line
  When something breaks, support the person, don't blame. Run post-mortems with ceremony and learn at every level — software, org, culture, process, even solo. Blameless protects honest mistakes, not dishonesty: evading or blaming gets coached; hiding evidence is a firing offense.
- Capability without understanding: brute-force LLM PRs
  An open problem: people brute-force PRs with LLMs in domains they don't understand, taking on more than their knowledge supports — and the struggle that used to teach them is smoothed away. How do we stop un-understood code without killing learning or banning a good tool?
- Don't tolerate assholes — but be strict about what one is
  Don't tolerate assholes — people who demean, belittle, punch down. But filter hard on the word: disagreeing, raising ideas, or opening a competing PR isn't being an asshole, it's the work. Assholes attack people; colleagues attack problems. Don't let the label silence dissent.
- AI-assisted content needs no disclaimer, only a human who can back it
  Using an LLM to draft engineering content — chat, commits, PRs, docs, comments — is fine and needs no disclaimer. The obligation is human co-signing: every word under your name is one you drove, reviewed, and can defend. Disclose when the ideas are the model's, not yours.
- Track the version a client has seen for read-your-writes
  For read-your-writes across backends, track the latest version a client has seen — a token or vector clock. Return it on write; reads then go to a backend at least that fresh. Hold it client-side (a token they present) or server-side (a gateway tracks the session and routes).
- One transactional store per write; propagate changes asynchronously
  Commit each logical write to exactly one transactional store; update other systems via reliable ordered async events — never a synchronous write across two stores, and never 2PC. With a relational primary the WAL is your replayable journal; write events into the same transaction.
- Rewriting an implementation is fine — refactoring isn't always the answer
  Refactoring isn't always right. When the structure is wrong at the root, it's fine — often better — to rewrite an implementation from scratch. Clean interfaces and data models make the implementation disposable: stable contract, swappable internals. LLMs make it cheaper still.
- Quarantine bad architecture behind an interface, then replace it
  When a subsystem is complex and badly architected, quarantine it at its seam: write a clean adapter interface over the mess so the rest of the system depends on the contract, then build a better implementation behind it and expose the new interface directly.
- Cache only immutable objects; treat caches as tech debt
  Use caches sparingly, only for immutable addressed objects — never for mutable DB results, where invalidation bugs and stale reads live; use projections instead. A cache in the data path is usually a patch over an architectural gap that trades correctness for performance.
- The simplest-looking system is often the most complex to live with
  The system that's simplest to stand up often isn't simplest to live with — it skips the correctness edge cases, so bugs and inconsistency surface fast. A more deliberate design has more parts but fewer surprises, and is often the simpler one over time.
- Annotate read-only and idempotent endpoints; make every mutation idempotent
  Annotate every endpoint as read-only (safe) or idempotent, in the schema, so infrastructure can retry, route to replicas, and cache safely. Make every state-changing endpoint idempotent (idempotency keys for create/charge/send); a non-idempotent retry double-applies.
- Enforce a quota at ingress on every endpoint — even unabused ones
  Put a quota on every endpoint and enforce it at ingress from day one — per tenant, principal, IP — even for endpoints nobody abuses yet. Unlimited-by-default means the first runaway client or compromised key is an outage. Return 429 + Retry-After; retrofitting limits is painful.
- Separate configuration, state, and ephemeral data
  Customer data splits into mostly-static config, durable state, and ephemeral sessions — different access, durability, and change rates. Model and store each separately. For bounded static config, prefer loading one validated snapshot held in memory over fetching on demand.
- Separate the data plane from the control plane
  Split the serving path (data plane) from the management path (control plane). The data plane keeps serving on last-known-good config when the control plane is down — never call the control plane on the hot path. Coupling them turns a control-plane bug into a serving outage.
- Partition customer data by tenant from day one
  Make customer data tenant-partitioned from day one: tenant-scope every query, never join across tenants, route through a tenant→location directory. Run one physical database at first — but keep the model shardable. Retrofitting isolation onto a shared DB is brutal.
- Define every API with a schema, and generate the clients
  Define every API with a machine-readable schema (OpenAPI, Protobuf, GraphQL) as the source of truth, and generate clients and server stubs from it — never hand-roll request-building and JSON parsing. Hand-written clients drift and break silently; check schema compatibility in CI.
- Fail fast and push back: retries, load shedding, and flow control
  Build client retries (backoff, jitter, Retry-After) from day one. Under overload, shed fast and push the failure back to the source to retry — don't retry internally and amplify it. Flow-control everywhere, bound every queue, and don't take more work than you can finish in time.
- Queues, topics, and journals are different tools — don't conflate them
  Queues (competing consumers), topics (fan-out), and journals (ordered, replayable logs) give different guarantees. Don't conflate them; a pipeline often uses several. Prefer journals over topics, but not where head-of-line blocking hurts. With queues, bound the concurrency.
- Route outbound HTTP through an isolated egress proxy
  Application compute shouldn't make arbitrary outbound HTTP — it's an SSRF pivot to internal services and the cloud metadata endpoint. Route all egress through a proxy (SOCKS, or a gRPC egress service) on isolated compute with no route inward. The proxy's network is the boundary.
- Pin the expected owner on cross-account resource calls (confused-deputy defense)
  Authority to call a resource isn't proof it's the one you meant. Any call crossing an account boundary must assert the expected owner: ExpectedBucketOwner on S3, aws:ResourceAccount conditions, validation of untrusted ARNs, plus inbound trust pinned with SourceArn/ExternalId.
- No long-lived cloud keys; workloads authenticate by federated identity
  No static AWS or GCP keys anywhere — not in code, secret stores, or env. Workloads use their runtime's own identity and cross clouds by exchanging it (OIDC) for short-lived credentials via federation. Static keys are allowed only as a documented carve-out.
- Don't hide behind anonymous 'people'
  Never invoke unnamed 'people' to carry weight — 'a few people are concerned', 'some think'. It launders one view as phantom consensus and makes the listener argue a crowd they can't see. Name them and bring them in, or own it. If they can't speak up, fix the culture.
- Sign the message, not just the session (HTTP Message Signatures)
  A bearer token proves nothing about the request it rides on. Sign the message itself (HTTP Message Signatures, RFC 9421) — request, and ideally response — so the recipient can prove who sent this exact message and that not a byte changed. Shared keys first; asymmetric better.
- Make workload identity a platform-owned service
  Workload identity belongs in shared platform infrastructure, not reimplemented per service. A small token service mints short-lived tokens any service verifies. Shared keys are a fine first step; asymmetric signing is the better end-state — don't let 'no PKI' block it.
- Incident tooling must not depend on what it recovers
  Anything you need to respond to an incident — deploy/rollback, kill switches, observability, break-glass access — must not depend, directly or transitively, on the systems likely to be down during it. Never gate incident tooling behind a system it might need to recover.
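
To make one of the notes above concrete, here is a minimal sketch of "Every lock is a lease": a TTL, a named owner, a heartbeat, server-side expiry, and a fencing token checked by the protected resource. It is an in-memory illustration only, not the implementation from the note; LeaseManager, FencedResource, and every method and parameter name are hypothetical, and a real system would keep the lease state in a shared store.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    owner: str         # named owner of the lock
    token: int         # fencing token, strictly increasing per grant
    expires_at: float  # expiry time on the manager's clock


class LeaseManager:
    """Hypothetical in-memory stand-in for the server side of a lease lock."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._leases: Dict[str, Lease] = {}
        self._next_token = 0

    def acquire(self, name: str, owner: str) -> Optional[Lease]:
        """Grant the lease if it is free or the previous holder has expired."""
        now = self.clock()
        current = self._leases.get(name)
        if current is not None and current.expires_at > now:
            return None  # still held by a live owner
        self._next_token += 1
        lease = Lease(owner, self._next_token, now + self.ttl)
        self._leases[name] = lease
        return lease

    def heartbeat(self, name: str, lease: Lease) -> bool:
        """Extend the lease; fails if it expired or was granted to someone else."""
        now = self.clock()
        current = self._leases.get(name)
        if current is None or current.token != lease.token:
            return False
        if current.expires_at <= now:
            return False
        current.expires_at = now + self.ttl
        return True


class FencedResource:
    """Hypothetical resource that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        if token < self.highest_token:
            return False  # an older holder woke up late; refuse the side effect
        self.highest_token = token
        self.value = value
        return True
```

In this sketch a holder that stops heartbeating simply loses the lease once the TTL passes, with no cleanup step required, and any write it attempts afterwards is refused because a newer token has already been seen by the resource.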