You’re running an A/B test. Day 4, your variant is up 22%. Your dashboard shows 94% confidence. Someone on your team says “that’s basically significant — let’s call it.”
Don’t.
Ending an A/B test early is one of the most expensive mistakes in conversion optimization. It produces false winners — variants that appear to beat the control, then underperform when you roll them out at 100% traffic.
Here’s the exact framework for calculating how long to run your test, and why “it reached significance” is not the green light you think it is.
Already know how long to run tests? Read A/B Testing Mistakes That Invalidate Your Results to make sure your setup is valid before you launch.
The Short Answer
Run your A/B test for a minimum of 2 full business cycles (usually 2 weeks), regardless of when statistical significance is reached.
This is the floor, not the ceiling. The actual minimum depends on your traffic volume, baseline conversion rate, and minimum detectable effect (MDE). More on how to calculate all three below.
Why “It Reached Significance” Isn’t Enough
Statistical significance tells you one thing: how unlikely your observed result would be if there were no real difference between the variants. It does not tell you:
- Whether the result will hold at 100% traffic
- Whether it was influenced by day-of-week effects
- Whether a traffic spike or dip skewed the result
- Whether your sample was representative of your full audience
The peeking problem: When you check your test results daily and stop the moment it reaches significance, you inflate your false positive rate from 5% to over 20%. You’re not reading a result — you’re cherry-picking the moment the data happened to look good.
“The longer you peek, the more likely you’ll see a false positive. A test that was checked daily for 14 days has a 54% chance of showing at least one false significant result — even if the variants are identical.” — Evan Miller, “How Not To Run an A/B Test”
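You can demonstrate the peeking problem with a short A/A simulation: both "variants" share the same true conversion rate, yet checking a z-test daily and stopping at the first significant result trips far more often than the nominal 5%. This is a stdlib-only sketch; the exact rate it prints depends on traffic, conversion rate, and number of peeks:

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(days=14, visitors_per_day=400,
                                cvr=0.032, runs=300, alpha=0.05, seed=7):
    """Simulate A/A tests: both arms have the SAME true conversion rate,
    and we 'peek' with a two-proportion z-test once per day, stopping at
    the first significant result. Returns the share of runs that stopped
    early, i.e. the false positive rate under daily peeking."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # ~1.96 for alpha = 0.05
    rng = random.Random(seed)
    stopped_early = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        for _day in range(days):
            for _visit in range(visitors_per_day):
                conv_a += rng.random() < cvr
                conv_b += rng.random() < cvr
            n += visitors_per_day                   # per-arm sample so far
            pooled = (conv_a + conv_b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                stopped_early += 1                  # a "winner" that isn't
                break
    return stopped_early / runs

print(f"false positive rate with daily peeking: {peeking_false_positive_rate():.0%}")
```

Run it and you'll see a rate several times the 5% you thought you were accepting. A single test at the planned end date would trip about 5% of the time; fourteen peeks at correlated cumulative data trip far more often.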
How to Calculate Your A/B Test Duration
Before you launch a test, calculate the required sample size. Then divide by your daily traffic to get the minimum runtime.
Step 1: Define your baseline conversion rate
Pull your current conversion rate for the page or element you’re testing. Use at least 30 days of data. Don’t use the last 7 days — it may not be representative.
Step 2: Define your Minimum Detectable Effect (MDE)
MDE is the smallest improvement you’d consider worth implementing. This is a business decision, not a statistics question.
Rule of thumb by business context:
| Context | Recommended MDE |
|---|---|
| High-traffic e-commerce | 5–10% relative improvement |
| Low-traffic B2B site | 20–30% relative improvement |
| Email subject line test | 2–5% improvement in open rate |
| Checkout optimization | 10–15% improvement in completion |
| Paid landing page | 15–25% improvement in CVR |
Setting MDE too low (e.g., 1%) means you need enormous sample sizes. Setting it too high (e.g., 50%) risks missing real but moderate improvements.
Step 3: Calculate required sample size
Use any of these calculators:
- Evan Miller Sample Size Calculator
- VWO A/B Test Duration Calculator
- Optimizely Sample Size Calculator
Default settings to use:
- Statistical significance: 95% (α = 0.05)
- Statistical power: 80% (β = 0.20)
- Tails: 2-tailed test (unless you are 100% certain the variant can only improve, never hurt)
Step 4: Divide by daily unique visitors to the test
Test Duration (days) = (Required Sample Size per Variant × Number of Variants) ÷ Daily Unique Visitors to Test Element
Multiply by the number of variants because each visitor is bucketed into exactly one of them: with a 50/50 split, only half your daily traffic accrues to each variant's sample.
Example calculation:
- Baseline CVR: 3.2%
- MDE: 15% relative improvement → target CVR: 3.68%
- Required sample size per variant: ~22,600 (at 95% significance, 80% power, 2-tailed)
- Daily unique visitors to the page: 520
- Duration: (22,600 × 2 variants) ÷ 520 ≈ 87 days
Since you’re running two variants (control + one challenger), you need roughly 45,200 total visitors. At 520 visitors/day, that’s close to three months. Painful, but far better to know before launch than to discover it mid-test.
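The calculation the calculators perform can be sketched with the standard two-proportion formula, using only the Python stdlib. Calculators differ slightly in their approximations (continuity corrections and rounding), so treat the output as an estimate rather than a match for any one tool:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-proportion z-test
    (normal approximation), the formula family the calculators
    listed above implement."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # target CVR
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return ceil(n)

n = sample_size_per_variant(0.032, 0.15)   # baseline 3.2%, MDE 15% relative
days = ceil(2 * n / 520)                   # two variants, 520 visitors/day
print(f"{n:,} per variant, ~{days} days")
```

Swap in your own baseline, MDE, and daily traffic; the two lines at the bottom are the whole duration calculation.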
The 2-Week Minimum Rule (And Why It Exists)
Even if your sample size calculation says you only need 5 days, run for at least 2 full weeks.
The reason: Day-of-week effects are real and significant. B2B sites see drastically different behavior on weekdays vs weekends. E-commerce sites spike on Thursdays and Sundays. If your test runs Monday to Friday, you’re only measuring weekday visitors — and your winner may perform differently on Saturday.
Two full weeks ensures each variant sees the same distribution of days at least twice.
Exception: If your site has zero weekend traffic (pure B2B, corporate tool), 10 business days (2 work weeks) is sufficient.
When Can You Stop a Test Early?
There are three legitimate reasons to stop a test before reaching your planned duration:
- The variant is causing measurable harm. Revenue is dropping, bounce rate spiked, or calls to support are increasing. If a variant is clearly hurting you, stopping is the right call. Document why.
- A critical external event occurred. A PR crisis, a major sale, a server outage, or an algorithm update that renders your test data uninterpretable. Stop, record the reason, and relaunch after the event passes.
- You’ve reached planned sample size AND 2-week minimum. Both conditions must be met, not just one.
Never stop because:
- “It looks like the variant is winning”
- “We need to ship this feature”
- “The boss wants to see results”
These are pressure-driven decisions, not data-driven ones. They produce false winners.
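One way to keep these stopping rules honest is to write them down as a checklist your team runs before ending any test. This is a hypothetical helper (`may_stop_test` is not from any library), a sketch of the rules above: harm or a disruptive external event ends a test immediately, otherwise BOTH conditions must hold:

```python
from datetime import date

def may_stop_test(start: date, today: date,
                  visitors_per_variant: int, required_per_variant: int,
                  variant_causing_harm: bool = False,
                  external_event: bool = False) -> bool:
    """Encode the three legitimate stopping reasons. Note the AND:
    reaching sample size alone, or two weeks alone, is not enough."""
    if variant_causing_harm or external_event:
        return True   # stop now, document why; relaunch after an event passes
    days_run = (today - start).days
    return days_run >= 14 and visitors_per_variant >= required_per_variant

# Sample size reached on day 9? Keep running until the 2-week minimum.
print(may_stop_test(date(2025, 3, 3), date(2025, 3, 12), 8_000, 7_800))
```

"It looks like it's winning" is deliberately not a parameter.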
How Many Variants Can You Test at Once?
This is where many teams go wrong by running too many variants simultaneously.
| Variants | Required Sample Size Multiplier | Risk |
|---|---|---|
| A/B (1 variant) | 1× | Low |
| A/B/C (2 variants) | 1.5–2× | Medium |
| A/B/C/D (3 variants) | 2–3× | High |
| 4+ variants | 3×+ | Very High (insufficient traffic for most sites) |
Practical rule: If your monthly traffic is under 20,000 sessions, run one variant at a time. Period. Multiple variants fragment your already-limited traffic and produce tests that take months to conclude — or never reach significance.
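To see where multipliers like those in the table come from, here is a sketch that combines traffic fragmentation (each arm gets 1/(k+1) of traffic) with a Bonferroni-adjusted significance threshold. The Bonferroni correction is an assumption on my part, since testing tools handle multiple comparisons differently, but the direction of the effect is the same everywhere:

```python
from math import ceil, sqrt
from statistics import NormalDist

def duration_days(n_challengers, daily_visitors, baseline=0.03,
                  relative_mde=0.15, alpha=0.05, power=0.80):
    """Days to conclusion when traffic splits evenly across control +
    challengers, with alpha divided by the number of comparisons
    (Bonferroni) to keep the overall false positive rate near 5%."""
    adj_alpha = alpha / n_challengers
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z_a = NormalDist().inv_cdf(1 - adj_alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    arms = n_challengers + 1          # every arm needs its own n visitors
    return ceil(n * arms / daily_visitors)

for k in (1, 2, 3):                   # A/B, A/B/C, A/B/C/D
    print(f"{k} challenger(s): ~{duration_days(k, daily_visitors=800)} days")
```

Each extra variant hurts twice: the stricter threshold raises the per-arm sample size, and the traffic split means every arm fills up more slowly.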
A/B Test Duration by Traffic Volume
Use this table to estimate minimum test duration based on your monthly traffic and a 3% baseline CVR:
| Monthly Sessions | MDE 20% | MDE 15% | MDE 10% |
|---|---|---|---|
| 5,000 | ~165 days | Impractical | Impractical |
| 10,000 | ~85 days | Impractical | Impractical |
| 25,000 | ~35 days | ~60 days | Impractical |
| 50,000 | ~17 days | ~30 days | ~65 days |
| 100,000+ | ~9 days | ~15 days | ~32 days |
Assumes 95% significance, 80% power, 2-tailed test, 50/50 traffic split
The uncomfortable truth for low-traffic sites: If you have under 5,000 sessions/month, classic A/B testing isn’t your tool. More on that in the next section.
What to Do If You Have Low Traffic
Low-traffic sites (under 5,000 sessions/month) have two options:
Option 1: Increase your MDE
Accept that you’ll only detect large improvements. Set MDE at 25–40% and only test changes that could plausibly deliver that — major redesigns, new copy angles, removing friction points.
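The payoff of a larger MDE is roughly quadratic: under the normal approximation, required sample size scales with 1/δ², so doubling the detectable effect cuts the sample to about a quarter. A quick sketch of the ratio:

```python
def relative_sample(mde_a, mde_b):
    """Approximate ratio of required sample size when moving from
    mde_a to mde_b (sample size scales ~ 1/delta^2)."""
    return (mde_a / mde_b) ** 2

# Moving from a 15% MDE to a 30% MDE needs ~1/4 the sample:
print(relative_sample(0.15, 0.30))   # 0.25
```

This is why "only test big swings" is a viable strategy for small sites: the sample size requirement collapses fast as the effect you're hunting gets bigger.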
Option 2: Use qualitative methods instead
- Session recordings (Hotjar, Microsoft Clarity): watch how real users behave
- Heatmaps: identify ignored CTAs and rage clicks
- User testing: 5 users typically surface most usability issues (~85%, per Nielsen Norman Group’s rule of thumb)
- Customer surveys: ask why they didn’t convert
These methods give you directional insight without the sample size problem. For a complete guide on optimizing with low traffic, see How to Do CRO With Low Traffic (Under 1,000 Visitors/Month).
Wondering if your current A/B tests are set up correctly?
A bad test produces bad data. And bad data produces bad decisions — which is worse than no data at all. I audit A/B testing programs and give you a plain-English report on what’s valid, what needs to be rethought, and what to test next.
Also read: A/B Testing Best Practices — the full framework for running tests that produce reliable results. And A/B Testing Mistakes for the structural errors that silently invalidate most tests.