Six months of documented failures revealed what six years of quick fixes had buried
For 14 months — the average gap we observe between recognising a problem and taking meaningful action — a data team at a mid-sized logistics firm had been fighting the same pipeline fires every week. Alerts at 2 a.m. A Slack thread. A patch. Silence. Then the same fire, slightly differently dressed, three weeks later.
One engineer decided to stop putting out fires and start studying them instead.
Her name is unimportant. What matters is the discipline. For six months, she documented every pipeline failure: timestamp, upstream source, transformation logic in play, the fix applied, time to resolution, and — critically — whether she had seen anything structurally similar before. She built no dashboards. She fixed every incident as it came. But she also wrote everything down, in plain language, in a private log that her manager didn't know existed.
At month six, she had 94 documented failures. When she mapped them, 67 of the 94 traced back to two architectural decisions made in 2019: a fan-out pattern on a high-volume event stream, and an implicit type assumption baked into a transformation function that had never been formally reviewed. Two decisions. Seventy-one percent of all failures for the better part of a year.
The fix took three weeks. The result: 200 hours per quarter returned to the team. Not estimated. Measured.
In Meditations, Marcus writes repeatedly about the discipline of distinguishing appearance from reality — phantasia from logos. The pipeline alert is the appearance. It tells you something broke. It tells you almost nothing about why, and nothing at all about the pattern underneath.
Epictetus, whose lectures shaped much of Marcus's inner architecture, was blunter: we suffer not from events, but from our judgements about events. The judgement that a pipeline failure is an incident to be closed — rather than data to be collected — is precisely what keeps teams circling the same problem indefinitely.
The engineer who documented instead of just fixing was not being slow. She was being rigorous in a way that speed actively prevents.
The pressure against systematic data pipeline failure analysis is structural, not personal. On-call rotations reward resolution time. Incident post-mortems, when they exist at all, are written under time pressure, often by the person most exhausted from the fix. The cognitive incentive is to close the ticket, not to study it.
We observe in conversations with analytics teams that 67% of users describing themselves as "stuck" report that the root condition predated their awareness of it by six months or more. Pipeline fragility works exactly this way. The architectural flaw exists long before anyone names it. Quick fixes paper over the signal.
This is the examined work life inverted: acting constantly, reflecting never, and wondering why the same problems keep returning wearing different clothes.
What made this engineer's log useful was not its volume but its schema. Every entry captured the same five elements:
1. The failure signature. Not just "API timeout" but which API, which caller, what payload size, what time of day. Specificity is what makes patterns visible.
2. The immediate fix. Recorded without judgment. Sometimes the fix is a one-line patch. Write it down anyway.
3. The structural hypothesis. A single sentence: This might be related to X. No proof required. Just an honest guess at what's underneath.
4. The recurrence marker. Have you written a hypothesis like this before? If yes, link the entries. This is where the pattern lives.
5. The cost estimate. Time to detect plus time to fix plus downstream impact. Rough numbers are fine. The point is to accumulate a total that eventually becomes impossible to ignore in a planning conversation.
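The five fields above can be sketched as a minimal Python record. This is an illustrative shape, not a prescribed tool; the field names and the sample entry are hypothetical, chosen to mirror the schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FailureEntry:
    """One failure-log entry following the five-field schema."""
    signature: str            # specific: which API, which caller, payload size, time of day
    immediate_fix: str        # what was done, recorded without judgment
    hypothesis: str           # one honest sentence: "this might be related to X"
    related_entries: list     # IDs of earlier entries with a similar hypothesis
    cost_hours: float         # time to detect + time to fix + downstream impact (rough is fine)
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# A hypothetical entry, written in about eight minutes:
entry = FailureEntry(
    signature="orders-api timeout, nightly batch caller, ~40 MB payload, 02:10 UTC",
    immediate_fix="retried with smaller batch size; one-line patch to chunking",
    hypothesis="fan-out on the event stream amplifies large payloads",
    related_entries=[12, 31],   # the recurrence marker: this has happened before
    cost_hours=2.5,
)
```

The recurrence marker is deliberately just a list of links; the pattern emerges from accumulation, not from any one entry.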
This takes eight minutes per incident. For a team resolving two to three failures per week, that is roughly 25 minutes of documentation per week in exchange for the kind of architectural clarity that no amount of reactive patching will ever produce.
The broader pattern holds beyond a single engineer's log. Teams that practise systematic failure tracking — rather than incident-by-incident resolution — typically reduce total pipeline downtime by 60%. The mechanism is not mysterious: when 70% of failures originate from two or three architectural decisions, fixing those decisions eliminates the class of problem, not just the instance.
The difference between fixing an instance and eliminating a class is the difference between working harder and working with greater precision.
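Once a log exists, finding the dominant class is a few lines of analysis. A minimal sketch, assuming each entry has been tagged with its structural hypothesis (the tags and costs below are invented for illustration):

```python
from collections import Counter

# Hypothetical log: (hypothesis tag, cost in hours) per incident.
log = [
    ("fan-out amplification", 3.0), ("implicit type cast", 1.5),
    ("fan-out amplification", 2.0), ("credential rotation", 4.0),
    ("implicit type cast", 2.5),    ("fan-out amplification", 1.0),
]

counts = Counter(tag for tag, _ in log)
costs: dict[str, float] = {}
for tag, hours in log:
    costs[tag] = costs.get(tag, 0.0) + hours

# Rank classes by total cost: the top one or two usually dominate.
ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
for tag, total in ranked:
    share = counts[tag] / len(log)
    print(f"{tag}: {counts[tag]} failures ({share:.0%} of all), {total:.1f} hours")
```

The output of a grouping like this is what turns "things break too often" into a planning-meeting sentence: one named decision, a failure count, and an hours total.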
A mature data governance framework captures this thinking at the organisational level. When failure patterns are visible, architectural debt becomes a named, prioritised item — not a vague sense that things break too often. The Build Your First Data Governance Framework prompt is a structured starting point for teams ready to make that shift.
For leaders managing larger analytics operations, Scalable Enterprise Data Workflows with AI addresses how systematic process design — not just faster tooling — reduces processing time and failure surface simultaneously.
We observe that users who take a concrete action within 48 hours are 3.2 times more likely to sustain the practice. The window is short by design. Not because urgency is a virtue, but because the gap between recognition and action, left open, tends to widen into the 14-month average we see again and again.
This week, open a blank document. Title it Failure Log — [Your Pipeline Name] — [Date]. Copy the five-field schema above. Then wait for the next incident — which, if your team resembles most, will arrive before Thursday.
Write down all five fields. That is the entire action. One entry.
The engineer who saved 200 hours per quarter did not begin by redesigning her architecture. She began by writing one honest sentence about what she thought might be underneath a single failure. Six months later, she had enough evidence to act with precision instead of urgency.
The Stoics called this prosoche — attention to what is actually happening, rather than to the noise it makes. Your pipeline failures are not problems to extinguish. They are data. Start treating them that way.
Go deeper with Aurelius
Apply this to your actual situation. Aurelius will meet you where you are.
Start a session