Confidence Interval

A range of values within which the true effect of an A/B test variant is likely to fall — more informative than a single point estimate.

By Mario Kuren

A confidence interval (CI) is a range of values within which the true effect of an A/B test is likely to fall, with a specified level of confidence.

Example: “Variant B shows a 15% improvement in conversion rate, or CVR (95% CI: 4.2%–25.8%)”

This means: with 95% confidence, the true effect of the variant is somewhere between a 4.2% and 25.8% improvement over control. The 15% is the point estimate (most likely value); the CI tells you how certain you should be about that estimate.
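A result like this can be computed directly from raw counts. The sketch below uses a normal approximation on the difference of proportions and expresses the bounds relative to the control rate (a common simplification); the visitor and conversion counts are illustrative, not taken from the example above:

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Relative lift of B over A with an approximate 95% CI.

    Normal approximation on the difference of proportions; bounds
    are divided by the control rate to express them as relative lift.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    lo, hi = diff - z * se, diff + z * se
    return diff / p_a, lo / p_a, hi / p_a

# Illustrative counts: 5.00% control CVR vs 5.75% variant CVR
lift, lo, hi = lift_ci(conv_a=1000, n_a=20000, conv_b=1150, n_b=20000)
print(f"Lift {lift:.1%} (95% CI: {lo:.1%} to {hi:.1%})")
```

With these counts the point estimate is a 15% lift, but the interval spans roughly 6% to 24% — the CI is what tells you how much trust to place in that 15%.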

What Confidence Interval Actually Means

A 95% confidence interval does not mean “there is a 95% probability that the true value is in this range.” (This is the most common misinterpretation.)

What it actually means: if you conducted this experiment 100 times with fresh samples each time and calculated a 95% CI from each, approximately 95 of those 100 CIs would contain the true value.

Practically: A 95% CI gives you a plausible range for the true effect, constructed using a method that captures the true value 95% of the time.
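This operational definition can be checked by simulation: generate many experiments with a known true conversion rate, compute a 95% CI from each, and count how often the interval actually captures the truth. A minimal sketch with illustrative parameters:

```python
import math
import random

random.seed(0)  # reproducible illustration
TRUE_P, N, Z = 0.05, 2000, 1.96  # known true rate, sample size, 95% z
runs, covered = 500, 0

for _ in range(runs):
    # Simulate one experiment: N visitors, each converting with prob TRUE_P
    conversions = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = conversions / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)
    # Does this run's 95% CI contain the true rate?
    if p_hat - Z * se <= TRUE_P <= p_hat + Z * se:
        covered += 1

print(f"{covered} of {runs} intervals contain the true rate")  # roughly 95%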

CI Width and What It Tells You

| CI Width | Interpretation | Cause | Action |
| --- | --- | --- | --- |
| Narrow (e.g., 13%–17%) | Precise estimate, high confidence in magnitude | Large sample size | Strong basis for decision |
| Moderate (e.g., 8%–22%) | Reasonable estimate with expected uncertainty | Adequate sample | Proceed with informed caution |
| Wide (e.g., −2% to 32%) | Highly uncertain estimate | Underpowered test | Extend test or collect more data |
| CI includes zero | No detected effect | True null or underpowered | Do not ship |
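The driver behind these widths is sample size: the half-width of a normal-approximation CI scales roughly with 1/√n, so 10× the traffic narrows the interval by about 3.16×. A numeric sketch, assuming a fixed observed conversion rate of 5%:

```python
import math

P_HAT, Z = 0.05, 1.96  # assumed observed rate, 95% z-value
for n in (1_000, 10_000, 100_000):
    half = Z * math.sqrt(P_HAT * (1 - P_HAT) / n)  # half-width of the CI
    print(f"n={n:>7,}: {P_HAT - half:.4f} to {P_HAT + half:.4f} "
          f"(width {2 * half:.4f})")
```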

CI vs P-Value: The Full Picture

P-value and confidence interval are complementary:

| Metric | Tells you | Doesn’t tell you |
| --- | --- | --- |
| P-value | Is the result likely due to chance? | How large or precise the effect is |
| Confidence interval | The plausible range of the effect | Whether the result is “significant” |

Best practice: Always report both.

  • Significant p-value + narrow CI = high confidence in the effect size → ship
  • Significant p-value + very wide CI = significant but imprecise → consider extending
  • Non-significant p-value + CI that includes zero = no detectable effect → null result
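These three rules can be expressed as a small decision helper. The significance and width thresholds below are illustrative assumptions, not standards — calibrate them to your own implementation cost and MDE:

```python
def decide(p_value, ci_low, ci_high, alpha=0.05, max_width=0.25):
    """Combine p-value and CI into one of the three verdicts above.

    alpha and max_width are assumed example thresholds.
    """
    if p_value >= alpha or ci_low <= 0 <= ci_high:
        return "null result - do not ship"
    if ci_high - ci_low > max_width:
        return "significant but imprecise - consider extending"
    return "high confidence in the effect size - ship"

print(decide(0.01, 0.13, 0.17))   # narrow CI -> ship
print(decide(0.04, 0.02, 0.40))   # wide CI -> consider extending
print(decide(0.30, -0.02, 0.32))  # not significant -> null result
```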

Practical Application: “Is This Worth Shipping?”

Use the lower bound of the confidence interval for conservative decision-making:

Example: Test shows 15% CVR improvement, 95% CI: 4%–26%.

Question: “Even in the worst case (4% improvement), is this variant worth implementing?”

If yes → ship. The minimum realistic benefit (lower CI bound) still exceeds your implementation cost.

If no → the uncertainty means the risk of implementing isn’t justified by the potential reward.
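Reduced to code, this check is a single comparison between the CI's lower bound and a break-even lift (the numbers below are assumed for illustration):

```python
def worth_shipping(ci_lower, break_even_lift):
    """Ship only if even the worst plausible lift covers the cost."""
    return ci_lower >= break_even_lift

# Example from the text: 15% lift, 95% CI 4%-26%
print(worth_shipping(ci_lower=0.04, break_even_lift=0.02))  # True: ship
print(worth_shipping(ci_lower=0.04, break_even_lift=0.10))  # False: risk not justified
```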

Confidence Intervals and Sequential Testing

For teams using sequential testing (making ongoing decisions as data comes in), confidence intervals are especially important because:

  • Point estimates fluctuate wildly early in a test
  • CI width decreases as sample size grows
  • Decisions made when CI is still wide are far more likely to be wrong

The standard: wait until the CI has narrowed enough that the lower bound exceeds your minimum acceptable effect size.
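A sketch of that stopping rule, checked at successive interim looks (the checkpoint intervals are made up for illustration). Note that a real sequential design also needs corrections for repeated peeking, such as alpha spending or always-valid intervals, beyond this naive check:

```python
def ready_to_stop(ci_lower, min_effect):
    """Stop once the CI's lower bound exceeds the minimum acceptable effect."""
    return ci_lower > min_effect

MIN_EFFECT = 0.03  # assumed minimum acceptable lift
checkpoints = [(-0.05, 0.35), (0.01, 0.29), (0.06, 0.24)]  # (lo, hi) narrowing over time
for lo, hi in checkpoints:
    print(f"CI {lo:+.2f} to {hi:+.2f}: stop={ready_to_stop(lo, MIN_EFFECT)}")
```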

For the complete framework on when to stop a test, see How Long Should You Run an A/B Test?.

Frequently Asked Questions

What is a confidence interval in A/B testing?

A confidence interval (CI) is a range of values within which the true conversion rate effect is likely to fall, with a specified level of confidence. A 95% confidence interval means: if you ran this exact test 100 times, approximately 95 of those tests would produce a confidence interval that contains the true effect. In A/B testing, a result might show “Variant B improves CVR by 15% (95% CI: 3% to 27%)” — meaning the true improvement could plausibly be anywhere from 3% to 27%.

What does a wide vs narrow confidence interval mean?

The width of a confidence interval reflects the precision of your estimate — which is primarily determined by sample size. A narrow CI (e.g., 14%–16% improvement) means you have a precise estimate of the true effect. A wide CI (e.g., 2%–28%) means high uncertainty — the true effect could be anywhere in that range. Wide confidence intervals are typical in underpowered tests (insufficient sample size). They don't necessarily mean the result is wrong, but they mean you should be cautious about acting on it.

Should I use p-value or confidence interval to judge A/B test results?

Use both — they provide complementary information. P-value tells you whether the result is statistically significant (unlikely to be random noise). Confidence interval tells you the magnitude and precision of the effect. A statistically significant result with a very wide CI might technically be significant but practically uncertain. A result where the lower bound of the CI is still above your MDE is more compelling than one where the CI spans from near-zero to very large. Together, they give you significance AND practical meaning.