
The A/B Test That Ran 3 Months Past Statistical Significance

What experimental discipline reveals about attachment to outcomes

Marcus Aurelius · April 23, 2026 · 5 min read



Forty-seven days in, a product team's experiment crossed the threshold of statistical significance. They kept the test running for another 91 days. They were not gathering more information. They were waiting for the data to change its mind.

This is not a story about methodology. It is a story about what happens when the will to be right overwhelms the discipline to see clearly — and how that failure compounds silently across quarters until it is visible only in the rearview.

The same force that held this team captive lives inside almost every analytics organization I have encountered. It does not announce itself. It arrives dressed as diligence.


The Discipline That Has to Be Decided Before the Data Arrives

Experimental integrity is not something you exercise in the moment of uncomfortable results. It cannot be. By then, the hypothesis you fell in love with is already whispering reasons to wait one more week.

The teams that move cleanly through experimentation share one habit: they determine their stopping rules — sample size, significance threshold, minimum detectable effect — before the test begins, when reason is uncontaminated by preference. The discipline is not about statistics. It is about deciding, in a calm hour, what you will do when a difficult hour arrives.
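
To make that concrete, here is a minimal sketch of the calm-hour calculation in Python, using scipy. The baseline rate and minimum detectable effect are hypothetical placeholders, not numbers from this story; the point is that the sample size is fixed before launch, and the test ends when it is reached.

    # Standard two-proportion sample-size calculation, fixed before launch.
    # Baseline rate and minimum detectable effect are hypothetical placeholders.
    import math
    from scipy.stats import norm

    def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
        """Per-variant N needed to detect an absolute lift of `mde`."""
        p1, p2 = baseline, baseline + mde
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
        z_beta = norm.ppf(power)            # desired statistical power
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

    # Decided in the calm hour: the test ends when each variant reaches this N.
    N_PER_VARIANT = required_sample_size(baseline=0.04, mde=0.006)
    print(N_PER_VARIANT)  # 17940 per variant under these assumptions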

"Optional stopping" is what statisticians call the opposite practice: ending an experiment only when the numbers finally say what you hoped they would say. The name is clinical. The experience is more familiar than most teams admit. One week of unfavorable results becomes two. Two becomes eight. The test continues not because there is methodological reason to extend it, but because closing it would require accepting that the checkout redesign, the pricing change, the feature the roadmap was built around — did not work.

The product manager who hypothesized a 15% lift and sees a 3% decline in week two is not looking at a preliminary result. She is looking at a result. The test that follows is not research. It is negotiation with reality — and reality does not negotiate.

Organizations that allow experiments to drift this way make poor product decisions at significantly higher rates than those with disciplined stopping procedures. The financial cost is real. But the deeper cost is structural: every extended test is a signal to the team that the rules bend when the results are inconvenient. That signal is heard. It shapes what people propose next, how boldly they commit to stopping rules they expect to be overridden, how seriously they take the next pre-registration document. The rot is quiet and cumulative.

If your team wants to build experiments that hold up under real scrutiny — including the scrutiny of a finance review — the course Run Experiments That Survive Your CFO's Budget Scrutiny Next Quarter covers the structural habits that make the difference.


What Aurelius Sees in This

In Book IV of the Meditations, Marcus Aurelius writes: "Confine yourself to the present." It reads, at first glance, like a note toward calm. It is actually a command toward clarity — a direct instruction about where the rational mind must anchor itself when the world offers results it did not want.

The Stoic principle at work here is the hegemonikon: the ruling faculty, the seat of judgment. For Aurelius, the hegemonikon is the only thing truly under our control. Not outcomes. Not whether the checkout redesign converts. Not whether the pricing test vindicates the roadmap. What is under our control is how we receive the information reality offers us — whether we receive it honestly or whether we corrupt the faculty of judgment by letting desire filter what the data is allowed to mean.

This reveals something most teams never name directly: extending a significant experiment is not a statistical error. It is a philosophical one. It is the hegemonikon abdicating its function. The team is no longer reasoning from evidence — they are reasoning toward a preferred conclusion and using evidence as cover. Aurelius would recognize this immediately. He spent decades in command, receiving dispatches from campaigns that were not going as planned, and he understood that the commander who cannot read a bad report clearly is more dangerous than the enemy in the field.

The Stoic distinction that illuminates this situation is the dichotomy of control. What is in your control: the quality of your experimental design, the honesty of your stopping rules, the courage to close a test when it has spoken. What is not in your control: what the test finds. Most conventional advice about experimentation focuses entirely on the first category — better power calculations, cleaner holdout groups, tighter significance thresholds. This is useful, but it misses the harder truth.

The harder truth is this: pre-commitment frameworks fail not because they are poorly designed but because the people inside them have not genuinely accepted what they cannot control. They pre-register stopping rules as a bureaucratic gesture while privately trusting that a longer test will eventually confirm what they already believe. That private belief is the problem. No process document fixes it. It requires what Aurelius spent his examined life attempting: the honest separation of what you want to be true from what is true.

Therefore, the question for anyone in this situation is not "how do we build better stopping rules?" It is "have we actually accepted that the test might tell us we were wrong — and that we will act on it?" A team that has genuinely accepted this builds stopping rules that stick. A team that has not will find ways around any process, however rigorously documented.

Aurelius failed at this himself. The Meditations are not a record of mastery. They are a record of a man reminding himself, repeatedly, to do what he already knew was right. The test running 91 days past significance is not a story about a bad team. It is a story about a human tendency that requires active, ongoing resistance — not a one-time process fix.


What to Do This Week

Before you close this tab, pull up the list of experiments currently running in your organization. For each one, ask three questions:

1. Was a stopping rule documented before the test launched?

Not written after the first results came in. Before. If the answer is no, you are not running an experiment. You are running a preference confirmation engine with a statistics layer on top. (A sketch of what a pre-committed record can look like follows this list.)

2. Has any test crossed its predetermined threshold and kept running?

If yes, name the reason out loud. If the reason is "we want to see if it stabilizes" or "leadership wants more data" or "the results surprised us," those are not methodological justifications. They are the sound of the hegemonikon stepping aside.

3. Who has the authority to call the test — and have they pre-committed to acting on the result?

Authority without pre-commitment is the mechanism by which optional stopping becomes organizational habit. The person closing the test must have decided, before seeing results, that they will close it when the rules say to close it.
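
One lightweight way to hold all three answers in one place is a pre-registration record committed to version control before launch — the stopping rule next to the action its owner has already agreed to take. A minimal sketch, with illustrative field names and values rather than any standard schema:

    # A pre-registration record committed before launch. Field names and
    # values are illustrative assumptions, not a standard schema.
    from dataclasses import dataclass

    @dataclass(frozen=True)  # frozen: the rule cannot be edited mid-test
    class Preregistration:
        hypothesis: str
        minimum_detectable_effect: float   # absolute lift
        alpha: float
        power: float
        n_per_variant: int                 # from the calm-hour calculation
        decision_owner: str
        action_if_null_or_negative: str    # pre-committed, in writing

    CHECKOUT_TEST = Preregistration(
        hypothesis="New checkout flow lifts conversion by >= 0.6pp",
        minimum_detectable_effect=0.006,
        alpha=0.05,
        power=0.80,
        n_per_variant=17_940,
        decision_owner="growth PM",
        action_if_null_or_negative="ship control, archive the branch, write it up",
    )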

One practical tool: Run the Decision Test on Every Metric in Your Current Report. It is a structured prompt that forces the question: is this metric actually informing a decision, or are we collecting it because we hope it will eventually say something useful?

If your reports are arriving after the window to act has already closed, Diagnose Whether Your Metrics Report Arrives Too Late to Matter will help you trace where the delay lives.

The flourishing of an analytics function is not measured in dashboards shipped or tests launched. It is measured in decisions made clearly, on honest evidence, without negotiation. That is a harder standard. It is also the only one that compounds in the right direction.



Frequently Asked Questions

Why do organizations continue A/B tests past statistical significance?
Teams often extend experiments hoping results will eventually confirm their hypotheses, rather than accepting unwelcome data. This stems from attachment to desired outcomes rather than commitment to truth-seeking.
What are stopping rules and why do they matter?
Stopping rules are predetermined criteria for ending experiments, established before testing begins. They prevent optional stopping bias and ensure statistical validity by removing emotional decision-making from the process.
How can teams build better experimental discipline?
Document hypotheses and stopping conditions beforehand, implement automated stopping rules, celebrate negative results equally with positive ones, and practice accepting disappointing data in low-stakes situations.