A/B Testing
A controlled experiment comparing two versions of a webpage to determine which produces more conversions.
A/B testing is a randomised controlled experiment that compares two versions of a webpage, email, or interface element to determine which produces a higher conversion rate. Version A (the control) represents the current design; Version B (the variant) contains a single proposed change.
Traffic is randomly split between both versions. After collecting sufficient data, statistical analysis determines whether the observed difference in conversion rate is likely real or the result of random variation.
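One common way to implement the random split is deterministic hashing, so that a returning visitor always sees the same version. The sketch below is illustrative only: the function name, the experiment key, and the visitor ID format are assumptions, not part of any specific testing platform.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "homepage-headline") -> str:
    """Deterministically bucket a visitor into 'A' or 'B' with a 50/50 split.

    Hashing the experiment name together with the visitor ID keeps the
    assignment stable across visits and independent between experiments.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer in 0..99
    return "A" if bucket < 50 else "B"


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_variant(f"visitor-{i}")] += 1
    print(counts)  # the two counts should be close to each other
```

Hosted testing tools handle assignment for you; the point of the sketch is only that the split should be random with respect to visitor behaviour and sticky per visitor.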
How A/B Testing Works
- Identify a conversion problem — Use analytics, heatmaps, and session recordings to find pages where visitors drop off or fail to convert
- Form a hypothesis — “Because we observed [data], we believe [change] will improve [metric] for [segment]”
- Calculate required sample size — Before the test starts, determine how many visitors per variant you need at your chosen significance level and power
- Run the test — Split traffic 50/50, collect data until sample size and minimum duration are met
- Analyse results — Check statistical significance, effect size, and segment by device/source/audience
- Implement or discard — Ship winners, log learnings from both outcomes
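The "Analyse results" step can be sketched as a two-proportion z-test that reports the relative lift, a two-sided p-value, and a confidence interval for the absolute difference. This is a minimal illustration using only the Python standard library; the function name and the example counts are hypothetical.

```python
from statistics import NormalDist


def analyse(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05) -> dict:
    """Two-sided, two-proportion z-test plus effect size for one A/B test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # Pooled standard error is used for the significance test (null: no difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error is used for the confidence interval of the difference.
    se_diff = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff

    return {
        "relative_lift": diff / rate_a,
        "p_value": p_value,
        "diff_ci": (diff - margin, diff + margin),
    }


if __name__ == "__main__":
    # Hypothetical counts: 300/10,000 conversions on A, 360/10,000 on B.
    print(analyse(300, 10_000, 360, 10_000))
```

Reporting the interval alongside the p-value shows how large or small the true effect could plausibly be, which matters when deciding whether a "winner" is worth shipping.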
The hypothesis step is often skipped — and skipping it is what separates random tinkering from systematic CRO. Every test should be connected to a specific observation from user research.
What to A/B Test (and What Not to)
| Element | Impact Potential | Notes |
|---|---|---|
| Headlines and value proposition | Very high — often 20–50% lift | Start here |
| CTA copy and placement | High | First-person copy typically wins |
| Hero section / above the fold | High | Affects first impression and bounce |
| Social proof placement and type | Medium | Specific testimonials beat generic |
| Form length | Medium | Remove unnecessary fields |
| Page layout | Medium | Requires design resources |
| Button colour | Low | Only matters if current has no contrast |
The biggest A/B testing gains come from copy, offer framing, and trust architecture — not cosmetic changes. For full guidance on test prioritisation, see A/B Testing Best Practices.
A/B Test Benchmarks and Effect Sizes
Most CRO practitioners report that well-researched A/B tests produce these results over time:
| Outcome | Frequency |
|---|---|
| Statistically significant winner | ~25–30% of tests |
| Inconclusive (insufficient data) | ~40–50% of tests |
| Control wins (variant loses) | ~15–20% of tests |
| Statistically significant loser | ~10% of tests |
This means most tests don’t produce clear winners — and that’s expected. The value of A/B testing is cumulative: the tests that do win compound into significant long-term CVR improvement. Microsoft Research (Kohavi et al.) found that only about 1 in 3 tests at top tech companies produces a statistically significant positive result.
The Peeking Problem
The most common A/B testing mistake: checking results before hitting your sample size and stopping when you see a winning variant.
Statistical significance fluctuates constantly during a test. A variant showing 95% confidence on day 3 may drop to 60% by day 14. If you stop on day 3, you’ve shipped a false positive.
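Each extra check is another chance for random noise to cross the significance threshold. Checking 5 times at evenly spaced intervals and stopping at the first significant reading inflates the false positive rate from the nominal 5% to roughly 14%; with 20 checks it reaches about 25%. The fix: decide when the test ends before it starts, and don't open the dashboard until then.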
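The effect of peeking can be estimated with a small A/A simulation, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. The sketch below is an illustration under stated assumptions (evenly spaced checks, a two-sided z-test, a 5% true conversion rate, hypothetical function names); the exact inflation depends on how many checks are made and when.

```python
import random
from statistics import NormalDist


def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic (0.0 if there is no variance yet)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_peeking(looks=5, n_per_arm=2000, rate=0.05, alpha=0.05, trials=500, seed=1):
    """Return (false positive rate testing once at the end,
               false positive rate stopping at the first significant look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Evenly spaced checkpoints; the last one is always the full sample.
    checkpoints = {round(n_per_arm * k / looks) for k in range(1, looks + 1)}
    single_hits = peeking_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        seen_significant = significant_at_end = False
        for n in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate  # both arms have the same true rate
            conv_b += rng.random() < rate
            if n in checkpoints:
                significant = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
                seen_significant = seen_significant or significant
                significant_at_end = significant
        single_hits += significant_at_end
        peeking_hits += seen_significant
    return single_hits / trials, peeking_hits / trials


if __name__ == "__main__":
    single, peeking = simulate_peeking()
    print(f"one look at the end: {single:.1%}, stop at first significant look: {peeking:.1%}")
```

Because both strategies are evaluated on the same simulated data, any gap between the two rates is attributable to the stopping rule alone. Raising `trials` gives a steadier estimate at the cost of run time.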
For the full list of statistical mistakes that invalidate test results, see A/B Testing Mistakes.
A/B Testing vs Multivariate Testing
| | A/B Test | Multivariate Test |
|---|---|---|
| What’s tested | One element, two variations | Multiple elements simultaneously |
| Traffic needed | Lower | Much higher (5–10× more) |
| Results | Which version wins | Which combination of elements wins |
| Best for | 90%+ of all tests | High-traffic pages with multiple hypotheses |
A/B testing is the right tool for the vast majority of CRO scenarios. Multivariate testing requires enough traffic to support many variant combinations simultaneously — typically 100,000+ monthly sessions. See Multivariate Testing for when to escalate.
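To see why multivariate tests need so much more traffic, it helps to count the combinations in a full-factorial design. The sketch below is a hypothetical example (the element names and variations are invented) that enumerates every cell; each cell needs its own adequately sized sample.

```python
from itertools import product


def multivariate_cells(elements: dict) -> list:
    """Full-factorial design: every variation of each element crossed with all others."""
    names = list(elements)
    return [dict(zip(names, combo)) for combo in product(*(elements[n] for n in names))]


if __name__ == "__main__":
    # Hypothetical page elements, two variations each.
    cells = multivariate_cells({
        "headline": ["current", "benefit-led"],
        "cta": ["Start free trial", "Start my free trial"],
        "hero": ["product shot", "customer photo"],
    })
    print(len(cells))  # 2 x 2 x 2 = 8 combinations, versus 2 arms in an A/B test
```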
When to Use A/B Testing vs Other Methods
Not every conversion problem requires an A/B test. Use this framework to decide:
| Situation | Recommended approach |
|---|---|
| Clear UX problem identified in usability testing | Fix without testing |
| Hypothesis based on analytics + qualitative data | A/B test |
| Site with under 5,000 monthly conversions | Focus on qualitative research, test only high-confidence hypotheses |
| Multiple competing hypotheses on same page | A/B test sequentially, not simultaneously |
| Critical bug or legal requirement | Fix immediately, no test needed |
For teams with limited traffic, read CRO for Low-Traffic Sites — A/B testing requires adequate sample sizes and there are more appropriate research methods below certain traffic thresholds.
Sample Size by Baseline Conversion Rate
Pre-calculating sample size is non-negotiable. Tests stopped before reaching the required sample produce false positives at a dramatically elevated rate.
| Baseline CVR | MDE (relative) | Visitors per variant needed |
|---|---|---|
| 1% | 20% | ~43,000 |
| 2% | 15% | ~37,000 |
| 3% | 15% | ~24,000 |
| 5% | 10% | ~31,000 |
| 10% | 10% | ~15,000 |
At 80% statistical power and a 95% confidence level (two-sided), using the standard two-proportion z-test approximation. Figures are rounded, and individual calculators may differ by a few percent.
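A per-variant sample size can be estimated with the standard two-proportion formula. The sketch below assumes a two-sided test and an unpooled variance approximation; the function name is hypothetical. Calculators differ in these choices (one-sided versus two-sided, pooled versus unpooled variance), so the output need not match any particular published table exactly.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in EACH variant to detect a relative lift of `relative_mde`.

    baseline      -- control conversion rate, e.g. 0.03 for 3%
    relative_mde  -- minimum detectable effect as a relative change, e.g. 0.15 for 15%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for cvr, mde in [(0.01, 0.20), (0.03, 0.15), (0.10, 0.10)]:
        print(f"baseline {cvr:.0%}, MDE {mde:.0%}: {sample_size_per_variant(cvr, mde):,}")
```

Note how the requirement grows as the baseline rate or the detectable effect shrinks: halving the MDE roughly quadruples the sample needed.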
For the exact calculation methodology, see How Long to Run an A/B Test.
Common A/B Testing Mistakes
- Testing without a hypothesis — Changes made without research backing are random guesses
- Running too many tests simultaneously — Overlapping tests pollute each other’s data
- Stopping at significance without hitting sample size — The peeking problem in practice
- Not segmenting results — A test that “loses” overall may win on mobile or for paid traffic
- Ignoring interaction effects — A winning headline may perform differently with a different hero image
- Treating null results as failures — A test that shows no difference is still valuable learning
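The segmentation point in the list above can be illustrated with a small aggregation. The counts below are invented purely for illustration, and the function name is hypothetical; the sketch shows a variant that is slightly behind overall while ahead on one segment.

```python
from collections import defaultdict


def conversion_by_segment(rows):
    """rows: iterable of (segment, variant, visitors, conversions).

    Returns {(segment, variant): conversion rate}, including an "all" segment.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [visitors, conversions]
    for segment, variant, visitors, conversions in rows:
        for key in ((segment, variant), ("all", variant)):
            totals[key][0] += visitors
            totals[key][1] += conversions
    return {key: conv / vis for key, (vis, conv) in totals.items() if vis}


if __name__ == "__main__":
    # Hypothetical data, not real results.
    rows = [
        ("desktop", "A", 6000, 300), ("desktop", "B", 6000, 270),
        ("mobile", "A", 4000, 80), ("mobile", "B", 4000, 100),
    ]
    for key, rate in sorted(conversion_by_segment(rows).items()):
        print(key, f"{rate:.2%}")
```

Segment-level differences rest on smaller samples and multiple comparisons, so a segment "win" is best treated as a hypothesis for a follow-up test rather than a result in its own right.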
Tools for A/B Testing
Popular platforms include VWO, Optimizely, AB Tasty, and Convert. Statistical analysis can also be done manually using a chi-squared test or a dedicated significance calculator.
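For the manual route, a Pearson chi-squared test on the 2×2 table of conversions and non-conversions can be computed with the standard library alone. This is a minimal sketch without continuity correction; the function name and example counts are hypothetical, and it assumes every cell has a non-zero expected count.

```python
from math import erfc, sqrt


def chi_squared_2x2(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction).

    Returns (chi-squared statistic, p-value).
    """
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 300/10,000 conversions on A, 360/10,000 on B.
    chi2, p = chi_squared_2x2(300, 10_000, 360, 10_000)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

For a 2×2 table this test is equivalent to the two-sided two-proportion z-test: the chi-squared statistic is the square of the z statistic.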
Running tests correctly requires more than a tool — it requires a structured testing methodology that prevents common statistical errors. A/B testing is the primary delivery mechanism of any CRO programme — every insight from research eventually becomes a test hypothesis.
A working understanding of statistical significance and confidence intervals is a prerequisite for interpreting results.
Frequently Asked Questions
What is A/B testing?
A/B testing (also called split testing) is a controlled experiment that compares two versions of a webpage, email, or interface element, Version A (control) and Version B (variant), to determine which produces more conversions. Visitor traffic is randomly split between both versions and statistical analysis determines whether the difference is real or due to chance. Split testing has a long history in direct-response marketing; online controlled experiments were popularised at scale in the 2000s by practitioners such as Ron Kohavi at Microsoft, and the method is now the gold standard for evidence-based conversion optimisation.
How long should an A/B test run?
An A/B test should run for a minimum of 14 days (two full weekly cycles) and until each variant reaches the pre-calculated minimum sample size, whichever takes longer. Stopping tests early, even when results look significant, leads to false positives: as Ronny Kohavi and Roger Longbotham, among others, have discussed, repeatedly checking results and stopping at the first significant reading pushes the false positive rate well above the nominal 5%. The 14-day minimum accounts for weekly behavioural cycles: Tuesday traffic converts differently from Sunday traffic.
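The duration rule can be sketched as a small calculation. The function name and the example traffic figures are hypothetical, and rounding up to whole weeks is an added assumption (so that every weekday is represented equally), not a rule stated above.

```python
from math import ceil


def required_duration_days(sample_per_variant: int, daily_visitors: int,
                           variants: int = 2, min_days: int = 14) -> int:
    """Days to run: the longer of the minimum duration and the time to reach sample size.

    daily_visitors is the total daily traffic entering the test, split across variants.
    """
    days_for_sample = ceil(sample_per_variant * variants / daily_visitors)
    days = max(min_days, days_for_sample)
    return ceil(days / 7) * 7  # assumption: round up to complete weeks


if __name__ == "__main__":
    print(required_duration_days(12_000, 3_000))  # 14 (sample reached in 8 days, minimum applies)
    print(required_duration_days(12_000, 1_000))  # 28 (24 days needed, rounded up to 4 weeks)
```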
How many visitors do I need for an A/B test?
Sample size requirements depend on your baseline conversion rate, minimum detectable effect (MDE), statistical power (typically 80%), and significance level (typically 5%, which corresponds to 95% confidence). With a two-sided test at those settings, a 3% baseline CVR targeting a 15% relative improvement needs approximately 24,000 visitors per variant. At a 1% baseline with the same MDE, you need roughly 70,000 to 75,000 per variant, depending on the calculator. Always calculate sample size before starting, not after, using Evan Miller's sample size calculator (evanmiller.org/ab-testing/sample-size.html) or VWO's duration calculator.
What is the most important element to A/B test first?
The highest-impact A/B tests, in order of typical effect size, are: (1) headlines and value proposition copy, which often produce 20–50% CVR differences, (2) CTA copy and placement, (3) hero section or above-the-fold layout, (4) social proof type and position, (5) form length. Button colour is frequently tested but rarely moves the needle meaningfully. Start with the hypothesis most grounded in research, such as a customer interview insight or a session recording observation, not with cosmetic changes.
What is the peeking problem in A/B testing?
The peeking problem is the practice of checking A/B test results before reaching the pre-set sample size and stopping the test early when a 'winner' appears. Statistical significance fluctuates constantly during a test: a variant showing 95% confidence on day 3 may drop to 60% by day 14. If you stop on day 3, you may have shipped a false positive. Every additional check is another chance for noise to cross the threshold, so repeated checking pushes the false positive rate well above the nominal 5% (Kohavi et al., 2014). The solution: decide the stopping conditions before the test starts and do not open the dashboard until those conditions are met.
What is the difference between A/B testing and split testing?
A/B testing and split testing are synonymous terms for the same method. Both describe randomly splitting traffic between a control (original) and variant (changed) version of a page, then using statistical analysis to determine which performs better. The distinction sometimes drawn is between 'A/B testing' (comparing two page variants at the same URL using JavaScript injection) and 'split URL testing' (redirecting visitors to entirely different URLs). For CRO purposes, both methods apply the same statistical framework — the difference is technical implementation.
What does a 95% confidence level mean in an A/B test?
A 95% confidence level means the test is designed to have a 5% false positive rate: if there were truly no difference between the versions and you ran the same test 100 times with fresh samples, about 5 of those tests would still show a statistically significant result purely by chance. It does not mean you are 95% certain the variant is better. For most business decisions, 95% confidence is the standard threshold. For high-stakes changes (site-wide, pricing), some teams require 99% confidence, which reduces false positives but requires larger sample sizes.