I’m easing back into long-form writing with something practical and surprisingly absent from open source: a logging system that can handle tens of millions of low-frequency, high-cardinality log lines. Think forensic contexts like ETL pipelines, CI jobs, and data workflows, where a single missing line can sink your investigation.
The Problem With Commodity Logging Tools
Tools like Fluentd, Logstash, and Loki are excellent at what they’re designed for: observability and alerting on infrastructure logs (web servers, VMs, Kubernetes Pods). Performance in that world means minimizing what gets shipped and stored, using techniques like:
- Sampling (e.g., keep 20% of events)
- Heuristic filters (e.g., drop `DEBUG` unless tied to an error)
- Aggregation (e.g., reduce many logs into one metric)
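To make that drop-on-purpose behavior concrete, here's a rough sketch in Python of what a sampling-and-filtering stage does conceptually. The function and field names (`should_ship`, `error_id`, the 20% threshold) are illustrative assumptions, not any specific tool's API:

```python
import random

SAMPLE_RATE = 0.20  # keep roughly 20% of events (illustrative threshold)


def should_ship(record: dict) -> bool:
    """Decide whether a log record survives a typical observability pipeline.

    Mirrors the techniques above: level-based heuristic filtering plus
    probabilistic sampling; the rest is left to aggregated metrics.
    """
    level = record.get("level", "INFO")

    # Heuristic filter: drop DEBUG unless it is tied to an error context.
    if level == "DEBUG" and not record.get("error_id"):
        return False

    # Sampling: keep only a fraction of the remaining events.
    return random.random() < SAMPLE_RATE


# Everything that returns False here is gone forever: fine for trend
# detection, fatal for forensics.
```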
That’s perfect when your goal is trend detection. The more frequent an error, the more likely it gets surfaced. The individual line is disposable; the pattern is what matters.
But What If Every Line Matters?
When an ETL job fails or a CI step breaks, you don’t want a sample; you need the exact sequence of events and inputs. These jobs are often:
- Low-frequency (daily or hourly)
- Stateful (carry context across steps)
- Non-retryable (re-runs can violate assumptions or cause side effects)
In this world, dropping even one log line can erase the breadcrumb that explains what happened.
Observability vs. Forensics
These are different problems.
- Observability asks: Is something wrong, and how often?
- Forensics asks: What exactly happened, when, and why?
Forensics isn't about trends; it's about truth.
This maps to the classic “Pets vs. Cattle” tradeoff. In commodity infrastructure, logs are cattle: valuable in the herd, rarely as individuals. In forensics, each log is a pet: unique, irreplaceable, and often the only clue you’ll get.
Why There’s No Off-the-Shelf Tool
There isn’t a commodity system that delivers complete, durable, unsampled, and efficiently queryable logs for these workloads. If you need that, you end up building it.
Consider CI platforms: GitHub Actions, GitLab CI, Buildkite, CircleCI. Each ships bespoke log capture, storage, and serving. If the big vendors had to roll their own, it’s a signal that the wheel you want doesn’t exist yet.
And even those vendor systems are typically tuned for throughput and UX, not for internal traceability guarantees. The gap remains.
It Looks Like Event Sourcing (But Isn’t)
The shape of the solution resembles a domain-aggregate append-only event store: a complete, ordered record for a job, pipeline, or workflow.
But it isn’t just event sourcing. Event sourcing focuses on state reconstruction; forensic logging emphasizes immutable, lossless replay with rich query over payloads, indices, and timelines. Systems like Kafka, Pulsar, and EventStoreDB excel at streaming and pub/sub, but you still need to bolt on:
- Durable object storage for long-term retention
- Indexes for fast point and range lookups
- Retention and compaction policies that don’t lose detail
- Retrieval APIs optimized for ordered, partial reads per aggregate
You can’t just throw JSON into Kafka and call it done.
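To sketch what that "bolting on" amounts to, here is a hypothetical contract for the access pattern a forensic store needs: per-aggregate, gap-detectable, ordered, partial reads over durably retained records. The names and shapes below are my own assumptions, not an API from Kafka, Pulsar, or EventStoreDB:

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class LogRecord:
    aggregate_id: str   # the job, pipeline, or workflow run this line belongs to
    seq: int            # strictly increasing per aggregate; a gap signals loss
    timestamp_ms: int
    payload: bytes      # the raw line, never sampled, rewritten, or aggregated


class ForensicLogStore:
    """Hypothetical contract layered on top of a stream (ingest, ordering)
    and an object store (long-term durable retention)."""

    def append(self, record: LogRecord) -> None:
        """Durably persist the record; idempotent on (aggregate_id, seq)."""
        raise NotImplementedError

    def read(
        self,
        aggregate_id: str,
        from_seq: int = 0,
        to_seq: Optional[int] = None,
    ) -> Iterator[LogRecord]:
        """Yield records in seq order: the ordered, partial, per-aggregate
        reads that a bare message broker doesn't give you cheaply."""
        raise NotImplementedError
```

The interesting work hides behind those two methods: indexing for point and range lookups, retention that compacts storage without losing detail, and pagination over very long runs. That is what the rest of the series digs into.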
Why This Series Exists
This post kicks off a series on building a logging system for low-frequency, stateful, non-retryable workloads where you cannot lose a single line. It’s a niche use case with no great off-the-shelf solution, but a powerful one if you’re building CI platforms, workflow engines, ETL orchestrators, or data pipelines.
Up next in the series:
- Architecture and constraints: guarantees, failure models, and cost ceilings
- Storage and indexing: append paths, page layouts, and efficient fan-out
- Ingest and ordering: idempotency, sequencing, and backpressure
- Retrieval and APIs: partial reads, pagination, and cursor design
- Operating at 10B+: shard strategies, quotas, and retention without regret