A/B Testing · Intermediate

Statistical Significance

A statistical threshold indicating that a measured difference between A/B test variants is unlikely to be due to random chance.

By Mario Kuren

Statistical significance is a threshold that indicates whether the observed difference between two A/B test variants is unlikely to be the result of random chance. At the standard 95% significance level, a difference this large would arise from random noise alone at most 5% of the time if there were no real effect.

It is expressed as a p-value: the probability of observing the measured result (or a more extreme result) if there were actually no difference between variants. A p-value below 0.05 corresponds to 95% significance.
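That definition can be checked numerically. Below is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library; the conversion counts are hypothetical:

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under "no difference"
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # probability of a result at least this extreme if there is no true difference
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical test: 300/10,000 conversions for A vs 360/10,000 for B
p = two_proportion_p_value(300, 10_000, 360, 10_000)
print(f"p-value: {p:.3f}")   # below 0.05, so significant at the 95% level
```

A p-value below 0.05 here corresponds to the 95% significance threshold described above.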

Why Statistical Significance Matters in CRO

Without statistical rigour, A/B testing produces false positives — you implement changes that appeared to win but actually had no real effect. Over time, a library of false positives produces no cumulative lift.

The cost of false positives compounds:

  • You ship changes that don’t help (or hurt)
  • Your test log fills with misleading learnings
  • You build wrong mental models about your audience

Statistical significance is the gatekeeper between noise and actionable insight.

The Relationship Between Key Statistical Concepts

Concept — What It Controls

  • Significance level (α) — Maximum acceptable false positive rate (typically 5%)
  • Statistical power (1−β) — Probability of detecting a real effect (typically 80%)
  • Sample size — Determined by significance level, power, baseline CVR, and MDE
  • P-value — Probability of a result at least this extreme if there is no true difference
  • Confidence interval — Range within which the true effect likely falls

All five are connected. Changing one changes the others.

Common Misconceptions

“95% significant means I’m 95% sure the variant is better.” Wrong. It means that if there were no real difference, a result this extreme would appear only 5% of the time. It says nothing about how much better the variant is, or how certain the improvement is in real-world conditions.

“I should stop as soon as I hit 95% significance.” Wrong. This is the peeking problem. Significance fluctuates during a test. A variant showing 97% significance on day 4 may drop to 75% by day 14.
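The peeking problem can be demonstrated with a small Monte Carlo sketch. The traffic numbers are hypothetical, and both variants share the same true conversion rate, so every “significant” result below is a false positive:

```python
import math
import random
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test, equal sample size n per variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = abs(conv_a - conv_b) / n / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(42)
TRIALS, CHECKS, BATCH = 400, 14, 500   # 14 daily checks, 500 visitors/variant/day
TRUE_RATE = 0.05                        # identical for A and B: no real effect exists
peeking_fp = fixed_fp = 0
for _ in range(TRIALS):
    conv_a = conv_b = 0
    stopped_early = False
    for day in range(1, CHECKS + 1):
        conv_a += sum(random.random() < TRUE_RATE for _ in range(BATCH))
        conv_b += sum(random.random() < TRUE_RATE for _ in range(BATCH))
        if p_value(conv_a, conv_b, day * BATCH) < 0.05:
            stopped_early = True        # a peeker would have declared a winner here
    peeking_fp += stopped_early
    fixed_fp += p_value(conv_a, conv_b, CHECKS * BATCH) < 0.05
print(f"false-positive rate when peeking daily:   {peeking_fp / TRIALS:.0%}")
print(f"false-positive rate at the fixed horizon: {fixed_fp / TRIALS:.0%}")
```

Checking significance once at the pre-set horizon keeps the false-positive rate near the nominal 5%; checking every day inflates it several-fold.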

“A 95% significant result will hold after implementation.” Not guaranteed. Site traffic seasonality, regression to the mean, and novelty effects all cause real-world performance to differ from test results.

How Much Traffic Do You Need?

Minimum sample size depends on:

  • Baseline CVR — Lower baseline requires more visitors
  • Minimum detectable effect (MDE) — Smaller effect requires more visitors
  • Power — Higher power (90% vs 80%) requires more visitors
  • Significance level — Higher threshold (99% vs 95%) requires more visitors

At a 3% baseline CVR with a 15% relative MDE, 80% power, and 95% significance, the standard two-proportion formula gives roughly 24,000 visitors per variant. At a 1% baseline with a 10% MDE, it is closer to 160,000 visitors per variant.

Use a sample size calculator before starting any test. Never determine sample size after the fact.
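A sketch of such a calculator, using the pooled two-proportion approximation (exact figures vary slightly between calculators and formula variants):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Minimum visitors per variant for a two-sided two-proportion z-test."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

print(sample_size_per_variant(0.03, 0.15))   # 3% baseline, 15% relative MDE
print(sample_size_per_variant(0.01, 0.10))   # 1% baseline, 10% relative MDE
```

Tightening any one input (smaller MDE, higher power, stricter significance, lower baseline) raises the required sample size, which is why all five concepts in the table above move together.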

Bayesian vs Frequentist Significance

The standard approach described above is frequentist. Some A/B testing platforms (including VWO and Convert) offer Bayesian significance testing, which expresses results differently:

  • Frequentist: “If there were no real difference, a result this extreme would occur less than 5% of the time”
  • Bayesian: “There is an 85% probability that B is better than A by at least X%”

Bayesian testing can handle sequential testing without the peeking problem, but requires careful calibration of priors. For most teams, frequentist testing with strict pre-set sample sizes produces reliable results.
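A minimal sketch of the Bayesian calculation, assuming uniform Beta(1, 1) priors and hypothetical conversion counts (real platforms calibrate priors more carefully):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # posterior for each variant is Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 300/10,000 conversions for A vs 330/10,000 for B
print(f"P(B > A) = {prob_b_beats_a(300, 10_000, 330, 10_000):.0%}")
```

The output is a direct probability statement about the variants, which is what makes Bayesian results easier to communicate to stakeholders.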

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance (typically set at 95%) means that, if there were no real difference between the A/B test variants, a result as extreme as the observed one would occur only 5% of the time. It does not mean the result is large, important, or guaranteed to hold — only that the measured difference is unlikely to be noise. A statistically significant result with a tiny effect size may not be worth implementing.

What significance level should I use for A/B tests?

95% statistical significance (p < 0.05) is the standard for most A/B tests. Some teams use 90% for low-stakes tests or 99% for high-stakes decisions (major redesigns, pricing changes). Using a lower threshold increases the risk of false positives; using a higher threshold requires more traffic and time per test.

Can I stop an A/B test when it reaches 95% significance?

No. Stopping a test the moment it reaches 95% significance is the peeking problem — one of the most common A/B testing errors. Statistical significance fluctuates constantly during a test. You should only stop when both conditions are met: (1) your pre-calculated minimum sample size has been reached, AND (2) at least 14 days have passed to capture weekly behavioral cycles.
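Those two stopping conditions are straightforward to encode; a sketch, with an illustrative function name and figures:

```python
from datetime import date

def can_stop_test(start_date, visitors_per_variant, required_sample_size,
                  today, min_days=14):
    """Stop only when the pre-calculated sample size AND minimum duration are both met."""
    days_elapsed = (today - start_date).days
    return (visitors_per_variant >= required_sample_size
            and days_elapsed >= min_days)

print(can_stop_test(date(2026, 3, 1), 26_000, 24_000, today=date(2026, 3, 16)))  # True
print(can_stop_test(date(2026, 3, 1), 26_000, 24_000, today=date(2026, 3, 10)))  # False: only 9 days
```

Reaching significance early satisfies neither condition, which is exactly why it is not a valid stopping rule.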