Snapshot Writer Runbook

Operational procedures for laredo-snapshotter. See the architecture & design for how it works.

:::note Design-phase document
The Snapshot Writer is a proposed feature (EDR-0001). This runbook describes the operational model the design commits to; commands and metric names are finalized as each implementation phase lands.
:::

Health & readiness

| Probe | Meaning |
| --- | --- |
| `GET /health/live` | Process is up. |
| `GET /health/ready` | The initial base snapshot is durable on all destinations. Use for load-balancer / orchestrator readiness. |
| `GET /status` | Current source position, last snapshot/diff, epoch, buffer depth, manifest head. |

A writer that is live but not ready is still bootstrapping (subscribing, loading the first snapshot, or writing it). If it never becomes ready, check the subscription (can it reach laredo-server?) and the destinations (can it write?).
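The live/ready distinction above can be sketched as a small probe helper. This is a minimal illustration, not part of the writer: it assumes the probes are served on the same host and port as the `/snapshot` endpoint and that a healthy probe answers with a 2xx status, neither of which this runbook states. The function names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: same host/port as the /snapshot endpoint used in this runbook.
BASE_URL = "http://snapshotter:8080"


def probe(path: str, base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if the probe answers with a 2xx status (assumed convention)."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response (HTTPError).
        return False


def classify(live: bool, ready: bool) -> str:
    """Map the two probe results onto the states described in this section."""
    if not live:
        return "down"
    if not ready:
        # Live but not ready: subscribing, loading, or writing the first snapshot.
        return "bootstrapping"
    return "ready"
```

Calling `classify(probe("/health/live"), probe("/health/ready"))` in a loop gives a quick view of whether the writer is stuck in bootstrap; a long-lived `"bootstrapping"` result points at the subscription or the destinations.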

Forcing a snapshot

curl -XPOST http://snapshotter:8080/snapshot

Forces an immediate re-base (new base snapshot, new epoch, diff counters reset). Use it before a schema migration or a consumer cutover, or to shorten a cold-read chain that has grown long.

Common incidents

Manifest CAS keeps failing

Symptom: logs show repeated "manifest write conflict / precondition failed"; snapshotter_manifest_cas_retries_total climbing.

Cause: two writers are pointed at the same table + prefix, or a previous writer did not shut down cleanly.

Action: there must be exactly one writer per (table, destination prefix). Identify the duplicate (the manifest's updated_at and the writer client_id in events help) and stop it. The CAS guarantees no corruption — the loser simply retries — but two writers will fight and double-write artifacts.
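To see why a lost CAS is safe but wasteful, consider the following sketch of a compare-and-swap manifest update. It uses a hypothetical in-memory store with an integer version in place of the real destination's conditional write; it illustrates the pattern only and is not the writer's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManifestStore:
    """In-memory stand-in for a store that supports conditional writes."""
    version: int = 0
    entries: List[str] = field(default_factory=list)

    def put_if_version(self, expected: int, entries: List[str]) -> bool:
        """Replace the manifest only if nobody else has written since `expected`."""
        if expected != self.version:
            return False  # precondition failed: the caller's view is stale
        self.entries = list(entries)
        self.version += 1
        return True


def append_with_retry(store: ManifestStore, artifact: str, max_attempts: int = 5) -> int:
    """Read the head, append one artifact, write back; re-read on conflict.

    Returns the number of retries that were needed.
    """
    for retries in range(max_attempts):
        expected, current = store.version, list(store.entries)
        if store.put_if_version(expected, current + [artifact]):
            return retries
    raise RuntimeError("manifest CAS did not converge")
```

A writer holding a stale version is rejected and the manifest keeps the winner's content, so nothing is corrupted. The cost is in the retries and in any artifacts the loser already uploaded before its commit failed.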

A destination is failing writes

Symptom: snapshotter_destination_errors_total{dest=…} rising; readiness may drop if it is the only destination.

Behavior: an artifact is durable only once written to all destinations, so a single failing destination stalls the manifest commit. Buffered changes continue to accumulate in memory (watch snapshotter_buffer_depth and process RSS).

Action: restore the destination (bucket policy, credentials, network). The writer retries with backoff and resumes from its in-memory buffer — no data is lost as long as the process stays up and the fan-out journal still covers its position. If memory pressure is a risk during a long outage, remove the broken destination from config and restart (you can backfill it later by forcing a snapshot once it returns).
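The stall behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the class names are hypothetical, destinations are plain callables, and the manifest commit is reduced to a counter. It shows only that the buffer is retained, and the commit skipped, while any destination fails.

```python
from typing import Callable, Dict, List


class FlakyDestination:
    """Hypothetical destination that can be switched between healthy and failing."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy
        self.artifacts: List[bytes] = []

    def __call__(self, artifact: bytes) -> None:
        if not self.healthy:
            raise IOError("destination unavailable")
        self.artifacts.append(artifact)


class BufferingWriter:
    def __init__(self, destinations: Dict[str, Callable[[bytes], None]]) -> None:
        self.destinations = destinations
        self.buffer: List[bytes] = []  # changes not yet durable everywhere
        self.committed = 0             # stand-in for manifest commits

    def record(self, change: bytes) -> None:
        self.buffer.append(change)

    def flush(self) -> bool:
        """Write the buffered diff everywhere; commit only if all writes succeed."""
        if not self.buffer:
            return True
        artifact = b"".join(self.buffer)
        failed = []
        for name, write in self.destinations.items():
            try:
                write(artifact)
            except Exception:
                failed.append(name)
        if failed:
            return False  # keep the buffer, skip the manifest commit
        self.buffer.clear()
        self.committed += 1
        return True
```

In this model the buffer grows for as long as one destination is down, which is the memory pressure the section warns about. The sketch rewrites the artifact to healthy destinations on every retry; whether the real writer does so is not specified here.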

Credentials expired / access denied

Symptom: AccessDenied / ExpiredToken from S3 or an event sink.

Action: credential profiles resolve per action group. Check the specific profile named by the failing component (an S3 destination vs. a Kinesis sink may use different roles). For assume-role profiles, verify the trust policy, the external_id, and that the base (ambient) identity may assume the target role. IRSA/instance-role token rotation is automatic; a persistent failure means a policy/trust problem, not rotation.
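For assume-role profiles, a quick offline check of the trust policy document can rule out the most common mistakes. The sketch below assumes an AWS IAM trust policy in its standard JSON shape and deliberately handles only the simple case: exact principal ARNs, `sts:AssumeRole`, and a `StringEquals` condition on `sts:ExternalId`. It ignores wildcards, account-root principals, `Deny` statements, and other condition operators, so treat a `False` result as a hint to look closer rather than a verdict. The ARN and external ID are placeholders.

```python
def _as_list(value):
    """IAM policy fields may hold a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trust_policy_allows(policy: dict, base_arn: str, external_id: str) -> bool:
    """Return True if some Allow statement lets base_arn assume the role."""
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action")):
            continue
        principal = stmt.get("Principal")
        if not isinstance(principal, dict):
            continue  # e.g. "*"; out of scope for this simplified check
        if base_arn not in _as_list(principal.get("AWS")):
            continue
        required = _as_list(
            stmt.get("Condition", {}).get("StringEquals", {}).get("sts:ExternalId")
        )
        # No ExternalId condition means any (or no) external ID is accepted.
        if not required or external_id in required:
            return True
    return False


# Placeholder policy for illustration only.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/base"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "example-id"}},
        }
    ],
}
```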

Cold-read chains are too long

Symptom: consumers report many diffs to replay; snapshotter_diffs_since_snapshot high.

Action: the re-base thresholds are too loose for this workload. Lower snapshot.max_interval, snapshot.max_churn_records/max_churn_fraction, or snapshot.max_diff_bytes/max_diff_fraction. Conversely, if snapshots are too frequent (storage cost, CPU), raise them or raise snapshot.min_interval.
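When tuning these settings it can help to reason about them as a single decision function. The sketch below is one plausible reading, not the confirmed semantics: it assumes a re-base fires when any `max_*` threshold is reached, that `min_interval` acts as a floor that suppresses re-bases, and that the fractions are relative to the table's record count and the last base snapshot's size. Field names mirror the config keys with unit suffixes added for clarity.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    min_interval_s: float
    max_interval_s: float
    max_churn_records: int
    max_churn_fraction: float
    max_diff_bytes: int
    max_diff_fraction: float


@dataclass
class State:
    seconds_since_snapshot: float
    churn_records: int    # records changed since the last base snapshot
    table_records: int
    diff_bytes: int       # total size of diffs since the last base snapshot
    snapshot_bytes: int   # size of the last base snapshot


def should_rebase(s: State, t: Thresholds) -> bool:
    # Assumed: min_interval is a hard floor, regardless of churn.
    if s.seconds_since_snapshot < t.min_interval_s:
        return False
    churn_fraction = s.churn_records / s.table_records if s.table_records else 0.0
    diff_fraction = s.diff_bytes / s.snapshot_bytes if s.snapshot_bytes else 0.0
    # Assumed: any single threshold being reached triggers a re-base.
    return (
        s.seconds_since_snapshot >= t.max_interval_s
        or s.churn_records >= t.max_churn_records
        or churn_fraction >= t.max_churn_fraction
        or s.diff_bytes >= t.max_diff_bytes
        or diff_fraction >= t.max_diff_fraction
    )
```

Under this reading, lowering any one `max_*` value shortens chains for workloads that hit that dimension first, while raising `min_interval` caps snapshot frequency no matter how high the churn is.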

Events missing or duplicated

Expected. Events are at-least-once and advisory. Consumers must poll the manifest as the source of truth and tolerate gaps/duplicates. If a sink is down, snapshotter_event_errors_total{sink=…} rises but artifacts and the manifest are unaffected — pollers still see the new head.
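A consumer that follows this guidance treats every event as a hint to poll, never as data. The sketch below illustrates that shape with a hypothetical `ManifestPoller`; `fetch_head` stands in for whatever call reads the manifest head, and the event payload is deliberately ignored.

```python
from typing import Callable, List, Optional


class ManifestPoller:
    def __init__(self, fetch_head: Callable[[], str]) -> None:
        self.fetch_head = fetch_head          # reads the current manifest head
        self.seen_head: Optional[str] = None
        self.applied: List[str] = []          # heads processed, in order

    def poll(self) -> bool:
        """Check the manifest; return True if a new head was processed."""
        head = self.fetch_head()
        if head == self.seen_head:
            return False  # nothing new; a duplicate event lands here harmlessly
        self.seen_head = head
        self.applied.append(head)
        return True

    def on_event(self, _event: object) -> bool:
        # The event only says "something may have changed"; the manifest decides.
        return self.poll()
```

Because `poll` also runs on a timer in a real consumer, a dropped event only delays pickup until the next poll, and a duplicated event results in a no-op.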

Capacity

Memory: the writer holds the full table in memory (like any client/fanout consumer) plus the diff buffer between flushes. Size for table size + peak buffered churn.
Storage: base snapshots dominate; retention prunes artifacts that precede the newest base snapshot. Multiply by the number of formats emitted.
One writer per table. Scale by running more processes, not by widening one.
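The sizing rules above reduce to simple arithmetic, sketched here with hypothetical helper names. The storage estimate counts base snapshots only, on the stated basis that they dominate, and assumes each emitted format is roughly the same size; adjust if your formats differ significantly.

```python
GiB = 1024 ** 3


def memory_estimate_bytes(table_bytes: int, peak_buffered_churn_bytes: int) -> int:
    """Full table in memory plus the largest diff buffer held between flushes."""
    return table_bytes + peak_buffered_churn_bytes


def snapshot_storage_bytes(snapshot_bytes: int, retained_snapshots: int, formats: int) -> int:
    """Base-snapshot storage per destination, ignoring the smaller diff artifacts."""
    return snapshot_bytes * retained_snapshots * formats
```

For example, an 8 GiB table with up to 1 GiB of buffered churn needs about 9 GiB of memory before runtime overhead, and keeping three 2 GiB snapshots in two formats takes about 12 GiB per destination.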

Shutdown

On SIGTERM/SIGINT the writer flushes a final diff (so buffered changes are durable) and updates the manifest before exiting. Give it a shutdown grace period long enough to write to all destinations.