A/B Testing Advanced

Statistical Power

The probability that an A/B test correctly detects a real effect when one exists — typically set at 80%, meaning 20% chance of missing a real winner.

By Mario Kuren

Statistical power is the probability that an A/B test correctly detects a real difference between variants when one actually exists. It is the test’s sensitivity — its ability to find a true signal rather than missing it.

Notation: Power = 1 − β, where β is the Type II error rate (probability of a false negative — failing to detect a real effect).

Standard setting: 80% power → 20% chance of missing a real improvement.

The Two Types of Error in A/B Testing

| Error type | Name | What it means | Controlled by |
|---|---|---|---|
| Type I (false positive) | Alpha (α) | Declaring a winner when variants are equal | Significance level (typically 0.05) |
| Type II (false negative) | Beta (β) | Missing a real winner, declaring no result | Power (typically 0.80, β = 0.20) |

Most A/B testing discussion focuses on Type I errors (false positives) — but underpowered tests that miss real winners are equally damaging, just less visible.

Why 80% Power?

The 80% power standard means that if Variant B truly beats control by your minimum detectable effect, there’s a 20% chance your test will fail to detect it and produce an inconclusive result.

Why accept 20% risk of missing real improvements?

  • Sample size economics: Each incremental power gain requires a disproportionate sample size increase
  • Practical trade-off: Running a larger test costs more time; 80% power balances sensitivity with feasibility
  • Multiple tests: Running 5 well-powered tests produces more learning than running 2 overpowered tests

When to use higher power (90%+):

  • The decision is irreversible (major redesign you can’t quickly roll back)
  • The test is site-wide and affects all revenue
  • You have the traffic to afford it without extending the run time significantly

Power and Sample Size

Power is one of four interrelated parameters in sample size calculation:

| Parameter | Controls | Typical value |
|---|---|---|
| Alpha (α) | Type I error / false positive rate | 0.05 (95% significance) |
| Power (1 − β) | Type II error / sensitivity | 0.80 |
| Baseline CVR | Baseline conversion rate | Your actual current rate |
| MDE | Minimum effect you want to detect | 10–20% relative improvement |

Change any one of these and the required sample size changes. The most common mistake: setting MDE too low, which requires impractically large samples. See Minimum Detectable Effect.
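To make the relationship between the four parameters concrete, here is a minimal sketch of the standard sample size formula for a two-sided, two-proportion z-test under the normal approximation. The function name and the 5% baseline / 15% relative MDE inputs are illustrative, not from any particular calculator:

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test
    (normal approximation)."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)         # variant rate at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Illustrative inputs: 5% baseline CVR, 15% relative MDE (target rate 5.75%)
n = required_sample_size(0.05, 0.15)
print(n)  # ≈ 14,190 visitors per variant
```

Changing any single input moves the output: halving the MDE roughly quadruples the required sample, which is why an over-ambitious MDE is the most common planning mistake.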

Underpowered Tests: A Hidden Problem

An underpowered test (below 80% power) doesn’t just miss more winners — it also produces noisier results that are harder to interpret:

  • A test at 50% power has a coin-flip chance of detecting a real improvement
  • Many “inconclusive” results from underpowered tests are real effects the test simply lacked the sensitivity to detect
  • Underpowered tests are especially common on low-traffic pages — see How to Do CRO With Low Traffic
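The coin-flip claim can be checked directly. This sketch computes the approximate achieved power of a two-proportion z-test at a given per-variant sample size (normal approximation, dominant tail only); the 5% baseline, 15% relative MDE, and sample sizes are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(baseline_cvr, relative_mde, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test at a given
    per-variant sample size (normal approximation)."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# Illustrative: 5% baseline, 15% relative MDE.
# ~14,200 visitors/variant reaches ~80% power; stopping at roughly half
# that sample leaves only about a coin-flip chance of detection.
print(round(achieved_power(0.05, 0.15, 14200), 2))  # ~0.80
print(round(achieved_power(0.05, 0.15, 7000), 2))   # ~0.50
```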

Calculating Power Before You Test

Use any standard sample size calculator with these inputs:

  • Baseline CVR
  • Minimum Detectable Effect (MDE)
  • Alpha = 0.05
  • Power = 0.80

The calculator outputs required sample size per variant. Divide by daily traffic to get minimum test duration.
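The duration arithmetic is simple but easy to get wrong (the calculator's number is per variant, not total). A sketch with hypothetical inputs, assuming a standard two-variant test:

```python
from math import ceil

# Hypothetical inputs: a calculator returns ~14,200 visitors per variant,
# the test has two variants (control + B), and the page gets 1,500 visitors/day.
n_per_variant = 14200
variants = 2
daily_visitors = 1500

total_needed = n_per_variant * variants
min_days = ceil(total_needed / daily_visitors)
print(min_days)  # 19 days minimum run time
```

In practice you would round the duration up to full weeks to avoid day-of-week bias.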

Free calculators: Evan Miller, VWO, Optimizely.

Frequently Asked Questions

What is statistical power in A/B testing?

Statistical power is the probability that an A/B test correctly identifies a true effect — i.e., detects a real difference between variants when one actually exists. It's also described as 1 − β, where β is the Type II error rate (the probability of missing a real effect, or a 'false negative'). Standard A/B testing uses 80% power, meaning there's a 20% chance of failing to detect a real improvement. Higher power requires larger sample sizes.

What is the difference between statistical significance and statistical power?

Statistical significance (alpha, α) controls the false positive rate — how often you declare a winner when variants are actually equal. Statistical power (1 − β) controls the false negative rate — how often you correctly detect a real winner when one exists. Significance and power are complementary: significance prevents you from shipping losers, power prevents you from dismissing real winners. Both must be set before a test starts — significance at 95% (α = 0.05) and power at 80% (β = 0.20) is the standard.

How does statistical power affect sample size?

Higher statistical power requires larger sample sizes. Holding alpha and the effect size fixed, going from 80% to 90% power increases the required sample size by roughly a third (~34%), and going from 80% to 95% power increases it by roughly two-thirds (~66%). This is why 80% power is the practical standard: the incremental sample size cost of higher power often isn't worth it for most CRO tests. If you need higher confidence (e.g., shipping a site-wide change for a large retailer), 90% power is justified.
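Under the normal approximation, required sample size is proportional to (z₁₋α/₂ + z_power)² when alpha and the effect size are held fixed, so the scaling can be verified in a few lines. This is a sketch, not any calculator's internals:

```python
from statistics import NormalDist

def z_sum_sq(power, alpha=0.05):
    """Sample size scales with (z_{1-alpha/2} + z_power)^2 when alpha and
    the effect size are held fixed (normal approximation)."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

base = z_sum_sq(0.80)
print(round(z_sum_sq(0.90) / base, 2))  # 1.34 -> ~34% more sample at 90% power
print(round(z_sum_sq(0.95) / base, 2))  # 1.66 -> ~66% more sample at 95% power
```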