7 A/B Testing Mistakes That Invalidate Your Results (And How to Fix Them)

I had a conversation with a product manager about six months ago. She was frustrated because her team had been running A/B tests for months, and every single time they shipped a winner, it didn’t hold up in production. She asked me what they were doing wrong.

The answer was simple: almost everything.

I’ve reviewed hundreds of A/B testing programs across e-commerce, SaaS, and B2B companies. The pattern is always the same. Teams don’t end up with invalid A/B tests through bad luck. They get there through deliberate habits whose consequences they don’t see. They check results every day. They ship the moment they see p less than 0.05. They test five things at once. Then they wonder why their winners disappear after launch.

The good news is that these A/B testing mistakes are completely fixable. But they require discipline, not complexity.

Before you read this, make sure you understand how long to run an A/B test. That’s the foundation everything else builds on.

The Real Cost of Invalid A/B Tests

Here’s what most companies don’t realize. When you ship a false winner, you’re not just wasting time. You’re actually making your product worse. You’ve locked in a change that hurts your users, and you can’t test that area again because you already have a winner. That’s months of lost optimization potential.

One product manager on Reddit described it perfectly. A false winner doesn’t just fail to lift revenue. It locks in a worse experience and stops you from testing that area for months. That’s the real cost.

The seven mistakes I’m about to walk you through are responsible for most of these false winners. Fix these, and your A/B testing program will actually work.

Mistake 1: Stopping the Test as Soon as It Hits Significance

This is the peeking problem, and it’s the most expensive A/B testing mistake I see.

Picture this. It’s Monday. Your A/B test has been running for three days. You check your dashboard. The variant is up 12 percent with 96 percent confidence. You call the team. We have a winner. By Friday, you’ve shipped it.

By the following Wednesday, the lift has vanished.

Here’s why this happens. Every time you check your A/B test results, you increase the probability of seeing a false positive by random chance. Think of it like flipping a coin. If you want to prove the coin is lucky, you could just flip it until you get three heads in a row, then stop. If you keep flipping and only stop when you see that specific pattern, you aren’t measuring the coin. You’re just waiting for a random pattern to appear.

That’s peeking.

A 95 percent confidence level means this: if the null hypothesis is true and the variants are identical, you would see a result at least this extreme about 5 percent of the time by random chance. But that 5 percent only holds if you commit to a fixed sample size and analyze once, at the end. Every extra look is another chance for noise to cross the threshold, so your overall false positive rate climbs with each check.

Spotify’s research on the peeking problem shows that checking results daily and stopping at the first significant result inflates your false positive rate from 5 percent to over 30 percent. That’s not a small difference. That’s the difference between a reliable testing program and one that’s mostly noise.
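If you want to see the inflation for yourself, a small simulation makes it concrete. The sketch below runs many A/A tests (both arms share the same true conversion rate) and compares two policies: stopping at the first daily check that crosses 95 percent confidence, versus reading the result once at the end. The traffic numbers, the 14-day window, and the function names are my own assumptions for illustration. The exact inflation depends on the number of looks and the traffic, so don't expect it to match any published figure.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 when it is undefined)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def simulate_aa(days, daily_n, rate, rng, z_crit=1.96):
    """One A/A test. Returns (significant at any daily peek, significant at the end)."""
    conv_a = conv_b = n = 0
    z = 0.0
    peeked_winner = False
    for _ in range(days):
        # Both arms draw from the SAME true rate, so any "winner" is a false positive.
        conv_a += sum(rng.random() < rate for _ in range(daily_n))
        conv_b += sum(rng.random() < rate for _ in range(daily_n))
        n += daily_n
        z = z_stat(conv_a, n, conv_b, n)
        if abs(z) > z_crit:
            peeked_winner = True
    return peeked_winner, abs(z) > z_crit

def false_positive_rates(runs=300, days=14, daily_n=200, rate=0.10, seed=7):
    """Share of A/A tests that produce a 'winner' under each policy."""
    rng = random.Random(seed)
    results = [simulate_aa(days, daily_n, rate, rng) for _ in range(runs)]
    peek_rate = sum(p for p, _ in results) / runs
    final_rate = sum(f for _, f in results) / runs
    return peek_rate, final_rate

if __name__ == "__main__":
    peek, final = false_positive_rates()
    print(f"stop at first significant peek: {peek:.1%}")
    print(f"read once at the end:           {final:.1%}")
```

The read-once rate should land near the nominal 5 percent, while the peeking rate comes out several times higher, even though nothing about the pages differs.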

The fix is straightforward. Calculate your required sample size before the A/B test starts using Evan Miller’s sample size calculator. Write down the number. Write down your stop date. Don’t look at the results until you’ve hit both. This removes the temptation to call it when you see a green spike.
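If you'd rather compute the number in code than in a web form, here is a minimal sketch using the standard two-proportion sample size formula. It is not the code behind any particular calculator, so expect small differences from the tool you use. The function name and the example baseline and lift are assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Sessions needed in EACH variant to detect the given relative lift.

    baseline: current conversion rate, e.g. 0.05 for 5 percent
    relative_lift: minimum detectable effect, e.g. 0.20 for a 20 percent lift
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

if __name__ == "__main__":
    print(sample_size_per_variant(baseline=0.05, relative_lift=0.20))
```

With a 5 percent baseline and a 20 percent relative lift, this comes out a little over 8,000 sessions per variant. Halve the lift you want to detect and the requirement roughly quadruples, which is why small expected effects need long tests.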

If you absolutely must peek at results during a test, consider sequential testing. Spotify’s research shows that sequential tests allow you to monitor results continuously without inflating false positive rates. The tradeoff is wider confidence intervals early on, meaning A/B tests may take longer to reach significance.
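To show the idea of "paying" for extra looks, here is a deliberately crude sketch. It is not Spotify's framework or any of the methods their article compares. It simply splits the 5 percent error budget evenly across a fixed number of pre-planned looks (a Bonferroni split), which keeps the overall false positive rate at or below 5 percent at the cost of a much stricter bar at every look. Proper sequential methods are more efficient. The function names are hypothetical.

```python
from statistics import NormalDist

def per_look_threshold(planned_looks, alpha=0.05):
    """|z| required at each look when alpha is split evenly across all looks."""
    per_look_alpha = alpha / planned_looks
    return NormalDist().inv_cdf(1 - per_look_alpha / 2)  # two-sided

def may_stop(z, planned_looks, alpha=0.05):
    """True if the observed z statistic clears the stricter per-look bar."""
    return abs(z) > per_look_threshold(planned_looks, alpha)

if __name__ == "__main__":
    for looks in (1, 5, 14):
        print(looks, round(per_look_threshold(looks), 2))
```

With one look the bar is the familiar 1.96. With five planned looks it rises to about 2.58, so a result that would count as a winner in a fixed-sample test no longer qualifies at an interim check. The number of looks has to be fixed before the test starts, or the guarantee is gone.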

Mistake 2: Testing Too Many Elements at Once

You redesign the headline. You change the hero image. You update the CTA button color. You add social proof. The variant wins by 12 percent. Everyone celebrates.

Six months later, you’re still trying to figure out which of those four changes actually drove the improvement. Spoiler: you never will.

This is noise-driven development. When you change multiple things in an A/B test and the variant wins, you have zero information about what actually moved the needle. When you try to iterate from there (test the next version, build on the winner), you’re building on a foundation you don’t understand. Eventually the optimizations plateau, and you can’t diagnose why.

The problem gets worse when you try to replicate results. You ship the winning variant, but the improvement doesn’t hold. You can’t debug it, because you don’t know which of the four changes was actually responsible. Maybe it was the headline. Maybe it was the image. Maybe it was the combination of all four. Now you’re stuck.

There is an exception to this rule. Multivariate testing is designed for testing multiple element combinations simultaneously. But it requires 5 to 10 times more traffic than a standard A/B test and is only suitable for high-traffic pages with 100,000 or more monthly sessions. For everyone else, it’s a trap.

The fix is to isolate your variable. One headline. One CTA. One layout. This isn’t boring. It’s the fastest path to compounding growth. You understand why something works, so you can build on that knowledge. Next month you test something else. The month after that, something else. Over time, these small wins compound into massive improvements.

Mistake 3: Not Accounting for Day-of-Week Effects

Your A/B test runs Monday to Friday. The variant shows a clear winner. You roll it out Monday morning. By Wednesday, the improvement has shrunk by half.

This is day-of-week bias, and it’s invisible until you look for it. User behavior varies dramatically by day of week. B2B sites see 30 to 50 percent higher engagement on Tuesday through Thursday. E-commerce spikes on Thursday evening and Sunday afternoon. If your A/B test doesn’t capture a full week on each side, you’ve measured a biased sample.

A 5-day test might reach its required sample size, but behaviorally it’s only measuring one type of visitor. You’re not testing your product. You’re testing Monday-to-Friday users.

Here’s the thing about day-of-week effects. They’re not random. They’re predictable. B2B users check email on weekday mornings. E-commerce users shop on weekends. If your A/B test only captures weekday traffic, you’re missing your weekend users entirely.

Always run your A/B tests for complete calendar weeks. The minimum is 2 full weeks. Check your analytics for day-of-week CVR variation before launching. If it’s significant, weight your required runtime accordingly. This isn’t negotiable.
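The runtime rule is easy to encode so nobody has to argue about it later. This sketch takes the required sample size and your daily traffic, then rounds the runtime up to whole weeks with a 2-week floor. The function name and the example traffic figures are assumptions for illustration.

```python
import math

def runtime_days(sample_per_variant, variants, daily_sessions, min_weeks=2):
    """Days to run: enough for the sample, rounded UP to full calendar weeks."""
    days_for_sample = math.ceil(sample_per_variant * variants / daily_sessions)
    weeks = max(min_weeks, math.ceil(days_for_sample / 7))
    return weeks * 7

if __name__ == "__main__":
    # 8,000 sessions per variant, 2 variants, 1,000 eligible sessions per day
    print(runtime_days(8000, 2, 1000))
```

In that example the sample alone needs 16 days, so the test runs 21 days, not 16. A test that would hit its sample in 2 days still runs the full 14.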

Mistake 4: Ignoring the SRM (Sample Ratio Mismatch)

You expect a 50/50 traffic split in your A/B test. Your test shows 52.3 percent control, 47.7 percent variant. You proceed anyway.

This is a Sample Ratio Mismatch, and it’s a red flag that something is broken in your setup. Even a small imbalance like 52/48 instead of 50/50 can indicate that the variants aren’t receiving comparable traffic. Common causes include bot traffic hitting one variant disproportionately, redirect issues affecting one URL, caching serving one variant more frequently, or mobile versus desktop rendering differently for each variant.

An SRM can invalidate your entire A/B test because the populations being compared are no longer equivalent. You’re not comparing apples to apples. You’re comparing apples to a different type of apple.

Here’s why this matters. If one variant is getting more bot traffic, your conversion rate is artificially inflated or deflated. If one variant is cached differently, some users see a faster loading experience. If mobile and desktop render differently, you’re comparing two different products.

Check for SRM before analyzing your A/B test results. Use a chi-squared test on your traffic split. Many A/B testing tools flag this automatically, but don’t assume yours does. If your p-value for the traffic ratio is under 0.05, treat it as an SRM problem. Investigate the cause before reading results. Don’t try to save the data. Rerun the test.
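The chi-squared check takes a few lines and needs nothing beyond the standard library. This is a minimal sketch for a two-arm test. The function name is mine, and the session counts are made up to mirror the 52.3/47.7 split from the example above.

```python
import math

def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """p-value of a chi-squared goodness-of-fit test on the traffic split (1 df)."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (variant_n - expected_variant) ** 2 / expected_variant
    )
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

if __name__ == "__main__":
    print(srm_p_value(523, 477))    # 52.3/47.7 on 1,000 sessions
    print(srm_p_value(5230, 4770))  # the same split on 10,000 sessions
```

The same 52.3/47.7 split gives a p-value of about 0.15 on 1,000 sessions, which is within normal random variation, and a p-value far below 0.001 on 10,000 sessions, which is not. The percentage alone tells you nothing. The test on the raw counts does.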

Mistake 5: Running Tests on Polluted Traffic

Your A/B test includes internal team traffic. It includes bot traffic. It includes QA sessions. The sample size looks healthy. The results are noise.

When your team clicks through the test during QA, they generate sessions that don’t represent real customer behavior. Bot traffic creates artificial session counts without conversion intent. Both inflate your denominator and dilute real conversion signals.

Think about what happens. You have 10,000 sessions. 500 of them are from your team or bots, and none of them convert. That’s 5 percent of your data that’s garbage. Your measured conversion rate is now about 5 percent lower, in relative terms, than your real one. And because that junk adds noise without adding signal, real differences between variants get harder to detect.

Signs your A/B test data is polluted: CVR that is wildly inconsistent day to day, traffic spikes that don’t correlate with any campaign activity, and suspiciously high bounce rates on specific variants.

Here’s how to fix it. Filter internal IP addresses in your analytics and testing tool. In GA4, traffic from known bots and spiders is excluded automatically, so the setting to get right is internal traffic: define it in your data stream’s tag settings under Admin, then Data Streams, and activate the matching data filter. Exclude known crawler user agents in your testing tool. Run a clean baseline for 2 weeks before launching a test to establish a pollution-free CVR baseline.
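If you export raw sessions for your own analysis, the same filtering can be applied there. The sketch below is only an illustration: the session fields (`ip`, `user_agent`, `converted`), the office IP range, and the list of bot markers are all hypothetical, and a real bot list would be much longer.

```python
import ipaddress

# Hypothetical office range (a documentation-only network); replace with your own.
INTERNAL_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
# Illustrative substrings only, not an exhaustive bot list.
BOT_MARKERS = ("bot", "crawler", "spider", "headless")

def is_polluted(session):
    """True for sessions from internal IPs or with bot-like user agents."""
    ip = ipaddress.ip_address(session["ip"])
    agent = session["user_agent"].lower()
    internal = any(ip in network for network in INTERNAL_NETWORKS)
    bot = any(marker in agent for marker in BOT_MARKERS)
    return internal or bot

def clean_cvr(sessions):
    """Conversion rate computed on real customer sessions only."""
    real = [s for s in sessions if not is_polluted(s)]
    return sum(s["converted"] for s in real) / len(real)

if __name__ == "__main__":
    sessions = [
        {"ip": "198.51.100.10", "user_agent": "Mozilla/5.0", "converted": True},
        {"ip": "198.51.100.11", "user_agent": "Mozilla/5.0", "converted": False},
        {"ip": "203.0.113.7", "user_agent": "Mozilla/5.0", "converted": True},   # QA click-through
        {"ip": "198.51.100.12", "user_agent": "Googlebot/2.1", "converted": False},
    ]
    print(clean_cvr(sessions))
```

Apply the same filter to both variants before you compute anything. Filtering only one side creates exactly the kind of imbalance you were trying to remove.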

Mistake 6: Measuring the Wrong Metric

You optimize for clicks on the CTA button. Clicks go up 18 percent. Revenue stays flat.

This is the proxy metric trap, and it’s one of the most common A/B testing mistakes. Clicks don’t equal conversions. Opens don’t equal revenue. Micro-metrics are useful diagnostics, but they’re not business outcomes. A test that increases CTA clicks by 18 percent but doesn’t move the downstream conversion is measuring the wrong thing.

This mistake is especially common in multipage funnels. A change on page 1 might increase page 2 arrivals, but if page 2 is the friction point, you’ll see a CVR improvement on step 1 and no change in final conversion. You’ve optimized the wrong thing.

Here’s the problem. You ship the change because the dashboard shows a win. But in production, nothing happens. Revenue doesn’t move. Users don’t convert. You wasted weeks on an optimization that doesn’t matter.

Define your primary metric before the A/B test starts. Make it a business outcome like purchase, trial signup, or demo booked. Not a behavioral proxy like click, scroll depth, or time on page. Track secondary metrics as diagnostics, but don’t make shipping decisions based on them.
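One habit that helps: always report the proxy and the business outcome side by side, per visitor, so a click win can't hide a flat result. Here is a small sketch of that comparison. The field names and the numbers are invented to mirror the 18 percent click lift from the example above.

```python
def relative_lift(control_rate, variant_rate):
    """Relative change of the variant versus control, e.g. 0.18 for +18 percent."""
    return (variant_rate - control_rate) / control_rate

def compare(control, variant):
    """Lift on the proxy (CTA clicks) and on the primary metric (purchases)."""
    return {
        "proxy_lift": relative_lift(
            control["cta_clicks"] / control["visitors"],
            variant["cta_clicks"] / variant["visitors"],
        ),
        "primary_lift": relative_lift(
            control["purchases"] / control["visitors"],
            variant["purchases"] / variant["visitors"],
        ),
    }

if __name__ == "__main__":
    control = {"visitors": 10_000, "cta_clicks": 2_000, "purchases": 200}
    variant = {"visitors": 10_000, "cta_clicks": 2_360, "purchases": 201}
    print(compare(control, variant))
```

In this made-up data the proxy is up 18 percent while purchases moved by half a percent. The shipping decision belongs to the second number.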

Mistake 7: Not Running a Pre-Test A/A Test (When Traffic Allows)

Imagine you had split traffic 50/50 between two identical pages before your real test, and the “variant” won by 8 percent with 91 percent confidence. You never ran that check, so you never noticed. You launched the real test. You trusted results that were already poisoned.

An A/A test sends equal traffic to two identical variants. It should show no significant difference. If it does, suspect your testing setup. The tracking pixel fires differently on each variant. The randomization algorithm is biased. Or there’s a technical implementation error. Keep in mind that at 95 percent confidence, about 1 in 20 A/A tests will cross the threshold by pure chance, so confirm a surprising result with a rerun while you investigate.

If your A/A test shows a winner, your A/B test results cannot be trusted until the underlying issue is fixed.

Here’s why this matters. If your A/A test shows a false winner, that means your testing infrastructure is broken. Every A/B test you run after that is suspect. You could be shipping false winners left and right without realizing it.

Run an A/A test for 1 week before any major test program launches. Or after setting up new testing infrastructure. If it shows a significant result, debug before continuing. This step is skipped by the vast majority of teams. And it’s why so many winning A/B tests fail to hold in production.
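Reading an A/A test is the same significance check you would run on any two conversion rates. Here is a minimal sketch using a pooled two-proportion z-test. The function name and the session counts are assumptions for illustration, and your testing tool may use a different test.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

if __name__ == "__main__":
    # Two identical pages, 10,000 sessions each: 500 vs 540 conversions.
    print(two_proportion_p_value(500, 10_000, 540, 10_000))
```

In that example one arm looks 8 percent better in relative terms, yet the p-value is roughly 0.2. That is ordinary noise between identical pages. A gap of 500 versus 600 conversions on the same traffic would give a p-value well under 0.01, and that is the kind of A/A result worth debugging.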

The P-Hacking Problem: Why Running Many Tests Guarantees False Positives

Here’s a statistical reality that most teams ignore. If you run 20 A/B tests with a true effect size of zero, on average one of them will show statistical significance at p less than 0.05 purely by random chance.

This is the multiple comparisons problem, and it’s a close cousin of p-hacking. It’s especially dangerous for teams running lots of A/B tests without proper statistical controls. Many product teams are unknowingly p-hacking their way to false winners.

Think about what happens in a typical company. You run 20 A/B tests per month. By random chance, you should expect about one of them to show significance even if none has a real effect. You ship it. It fails in production. You wonder why.

The fix is simple but requires discipline. Either run fewer A/B tests and make each one count, or use statistical corrections like Bonferroni correction when running multiple simultaneous tests. If you’re running 2 to 3 concurrent tests on different pages, you’re fine. If you’re running 20 or more tests per month, you need to adjust your statistical significance threshold downward.
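The arithmetic behind this is short enough to keep in a notebook. The sketch below assumes the tests are independent and that none has a real effect, which is the worst case described above. The function names are mine.

```python
def chance_of_false_winner(tests, alpha=0.05):
    """Probability that at least one of `tests` null tests looks significant."""
    return 1 - (1 - alpha) ** tests

def bonferroni_alpha(tests, alpha=0.05):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / tests

if __name__ == "__main__":
    for n in (1, 3, 20):
        print(n, round(chance_of_false_winner(n), 3), bonferroni_alpha(n))
```

With 20 tests, the chance of at least one false winner is about 64 percent, and the Bonferroni threshold drops from 0.05 to 0.0025 per test. With 3 tests the chance is about 14 percent, which is why a handful of concurrent tests is manageable and 20 a month is not.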

How to Diagnose A/B Test Problems Before You Read Results

Before you look at CVR numbers, run through this diagnostic checklist. Most A/B test failures are detectable at setup, not after the fact.

| Symptom | Likely Mistake | How to Confirm |
| --- | --- | --- |
| Winner doesn’t hold after shipping | Stopped too early or day-of-week bias | Did you hit pre-planned sample size? Did test run 2 or more full weeks? |
| Multiple things changed, can’t attribute the win | Testing multiple elements | Review change log. Was it really one variable? |
| CTR up, revenue flat | Wrong primary metric | Is your primary metric a business outcome or a proxy? |
| Traffic split is 53/47 instead of 50/50 | Sample Ratio Mismatch | Chi-squared test on the split. SRM confirmed if p less than 0.05 |
| CVR varies 3 to 5 times between days | Polluted traffic | Compare bot versus human sessions. Check internal IP filtering |
| A/A test shows a winner | Broken testing infrastructure | Randomization or tracking error. Fix before any real test |

A/B Testing Mistakes: Quick Reference

| Mistake | Risk Level | Symptom | Fix |
| --- | --- | --- | --- |
| Stopping at significance | Very High | False winners | Pre-calculate sample size, commit to it |
| Testing multiple elements | High | Can’t replicate results | One variable per test |
| Ignoring day-of-week effects | High | Results don’t hold post-launch | Minimum 2 full weeks |
| Ignoring SRM | Very High | Biased population comparison | Chi-squared check on traffic split |
| Polluted traffic | Medium | Noisy, inconsistent results | Filter bots and internal IPs |
| Wrong metric | High | No revenue impact despite winning | Define business outcome upfront |
| No A/A test | Medium | Broken tracking produces false results | Run A/A before major test programs |

What a Valid A/B Test Looks Like

A well-run A/B test has five things defined before it launches.

First, a hypothesis. “We believe changing X to Y will increase metric Z because of reason R.”

Second, a primary metric. A business outcome, not a click.

Third, a required sample size. Calculated with Evan Miller’s sample size calculator at 95 percent confidence and 80 percent power.

Fourth, a minimum runtime. 2 full business cycles minimum.

Fifth, stopping rules. Pre-committed. Don’t change them mid-test based on early results.
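One way to make the pre-commitment stick is to write the plan down as data before launch and let code decide when results may be read. This is a sketch of that idea, not a feature of any testing tool. The class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)  # frozen: the plan cannot be edited mid-test
class ExperimentPlan:
    hypothesis: str
    primary_metric: str
    sample_size_per_variant: int
    start: date
    min_weeks: int = 2

    @property
    def earliest_stop(self):
        """First date on which the minimum runtime has elapsed."""
        return self.start + timedelta(weeks=self.min_weeks)

    def may_read_results(self, today, sessions_per_variant):
        """True only when BOTH the runtime and the sample size are met."""
        return (
            today >= self.earliest_stop
            and sessions_per_variant >= self.sample_size_per_variant
        )

if __name__ == "__main__":
    plan = ExperimentPlan(
        hypothesis="Changing the headline to a benefit statement will increase purchases",
        primary_metric="purchase",
        sample_size_per_variant=8000,
        start=date(2025, 1, 6),
    )
    print(plan.earliest_stop)
    print(plan.may_read_results(date(2025, 1, 15), 9000))  # sample met, runtime not
```

The frozen dataclass is the point of the design: changing the sample size or runtime after launch means creating a new plan, which is a visible decision instead of a quiet edit.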

Run the A/B test. Don’t peek. When it finishes, read the result once. Make your decision. Document everything.

That’s it. It sounds boring. It is boring. Boring, rigorous A/B testing produces compound results over time. Exciting, intuition-driven testing produces one good story and a lot of wasted traffic.

Why Most Companies Get A/B Testing Wrong

The reason most companies fail at A/B testing isn’t because the concept is hard. It’s because discipline is hard. It’s easy to check results on day 3. It’s easy to ship when you see a green number. It’s easy to test five things at once.

What’s hard is waiting. Waiting for your pre-planned sample size. Waiting for two full weeks. Waiting for significance at the sample size you planned, not the first time the dashboard turns green.

But that’s what separates companies that actually optimize their products from companies that think they’re optimizing their products.

Resources for A/B Testing

Evan Miller’s A/B Testing Tools - Comprehensive collection of statistical calculators for A/B testing. https://www.evanmiller.org/ab-testing/

Spotify’s Sequential Testing Framework - How to peek at results without inflating false positives. https://engineering.atspotify.com/2023/03/choosing-sequential-testing-framework-comparisons-and-discussions/

Reddit discussion on A/B test wins being fake - Real product managers discussing peeking and false positives. https://www.reddit.com/r/ProductManagement/comments/1snuq6z/a_lot_of_ab_test_wins_are_just_fake/

Trustworthy Online Controlled Experiments - Academic reference by Ron Kohavi, Diane Tang, and Ya Xu. https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC781F87C461D9F1B

Are Your Tests Set Up to Produce Valid Results?

I review A/B testing programs and give you an honest assessment. What’s producing real signal, what’s noise, and what tests you should actually be running next. Most programs have at least 3 structural issues that are silently invalidating results.

Also read: A/B Testing Best Practices. The full framework for running your first valid test. And Best CRO Tools in 2026. The testing tools worth using, reviewed honestly.

Get a Free CRO Audit at /contact/


Frequently Asked Questions

What is the most common A/B testing mistake?

Stopping tests early, also called peeking. Checking results daily and stopping at the moment significance is reached inflates false positive rates from 5% to over 30%. Pre-commit to a sample size and runtime before launch.

Can I run multiple A/B tests at the same time?

Yes, if they test different pages or non-overlapping elements on the same page and your traffic supports it. Under 20,000 sessions per month, run one test at a time to avoid interaction effects and insufficient sample sizes per variant.

What is a Sample Ratio Mismatch in A/B testing?

An SRM occurs when the actual traffic split between variants differs from the intended split, like 53/47 instead of 50/50. Even small mismatches indicate a technical problem with randomization or caching that can bias results. Always check for SRM before reading results.

How do I know if my A/B test results are reliable?

Check five things: Did you reach pre-planned sample size? Did you run for 2 or more business cycles? Is the traffic split close to 50/50? Is traffic filtered for bots and internal sessions? Are you measuring a business outcome, not a proxy metric?

How many A/B tests should I run per month?

Quality over quantity. One rigorous, well-structured test per month compounds into real optimization gains over time. Five poorly structured tests produce noise and false winners. With 50,000 or more sessions per month, running 2 to 3 concurrent tests on different pages is achievable.

What is an A/A test and why does it matter?

An A/A test sends equal traffic to two identical variants. It should show no significant difference. If it does, your testing infrastructure is likely broken: the randomization algorithm is biased or tracking fires differently on each variant. Run an A/A test before launching any major test program.

Mario Kuren

CRO Specialist & Founder

Mario has been running A/B tests and conversion optimization programs since 2018. He's helped 50+ businesses grow revenue without increasing ad spend.
