Dirty Secret Of A/B Split Testing Software…

 

There are lots of great A/B split testing tools at our disposal these days, and my favourites include…


But there’s a HUGE flaw in almost every single one of these tools, and it’s causing countless Internet marketers to make decisions that could actually be losing them money when they’re expecting the opposite.

Here’s the problem…

When someone runs a test using any of these tools they’re typically looking at a few key metrics including…

  • % change in conversion
  • Confidence score

For those of you who are new to split testing, the “confidence score” is basically a measure of how confident you can be that the projected increase/decrease is real. The more visitors and conversions measured, the higher the confidence score and the more accurate the results.
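For the curious, here is a rough sketch (in Python) of the kind of math that typically sits behind a confidence score: a simple two-proportion z-test comparing the two conversion rates. Every tool does this a little differently, so treat the formula and the visitor/conversion numbers below as an illustration, not any vendor’s actual implementation.

    # Rough sketch of how a split-testing tool might compute a "confidence score"
    # using a two-proportion z-test. The exact math varies by tool; this is only
    # an illustration, and the example numbers are hypothetical.
    from math import sqrt, erf

    def confidence_score(visitors_a, conversions_a, visitors_b, conversions_b):
        """Return (relative lift of B over A in %, two-sided confidence in %)."""
        rate_a = conversions_a / visitors_a
        rate_b = conversions_b / visitors_b
        lift = (rate_b - rate_a) / rate_a * 100          # % change in conversion

        # Pooled standard error of the difference between the two rates
        pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
        se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
        z = (rate_b - rate_a) / se

        confidence = erf(abs(z) / sqrt(2)) * 100         # two-sided
        return lift, confidence

    # Hypothetical example: 10,000 visitors per version, 500 vs 530 conversions
    print(confidence_score(10_000, 500, 10_000, 530))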

For example, if I ran an A/B test on a landing page and version B showed a 6% increase in conversion with a 95% confidence score, then version B should be the winner.

Unfortunately that is not necessarily true…

You see, not too long ago I ran a VERY interesting experiment on a very high volume ecommerce website. Because it was a client’s website I cannot disclose the actual site but I can share the details that matter…

Over the past year I’d been working with my client to set up numerous split tests on their landing pages. In many cases a test would beat the control, so we would roll it out and it would become the new control. But after running quite a few of these tests and rolling out the winners, I noticed we were not seeing the expected bump in sales.

Something wasn’t right.

Then, to fuel the fire, I brought this up with a few other Internet buddies to see what they had to say, and the feedback was the same… they would run a test, find a winner that showed an increase, but then after rolling it out their conversion stayed flat.

What was going on? Were these testing solutions wrong? 

Now, I’m fortunate to have access to some very high volume websites to run tests on, so I set up an A/B split test experiment.

The goal of my experiment was to determine the margin of error in the predicted conversion increase/decrease when running a split test, and the only way to do this was to take a website and set up a split test where the control and the test version were identical.

So version A and version B were identical. No changes.

In theory, if these testing tools really work and I ran the test until our confidence score was 90%+, the reported conversion increase should be close to zero, right?

Wrong.

This is the dirty secret that these split testing companies either don’t want to tell us or don’t actually know…

After running this test until we reached 90%+ confidence, it showed the test version outperforming the control by 6%.

WTF?

How is that possible? They were identical.

But I did not stop there. I let this test run for weeks, and I watched the confidence score continue to creep up and up while the ‘winner’ bounced back and forth between the two versions. Some weeks the control showed a higher conversion, some weeks the test showed a higher conversion… even when the confidence score was almost 100%.
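One plausible explanation for this is “peeking”: checking the confidence score over and over and acting the first time it looks good. Here is a quick simulation sketch of an A/A setup like mine, with made-up traffic numbers and a made-up 5% baseline conversion rate, showing how often pure noise can crown a phantom ‘winner’ when you keep peeking. Treat it as an illustration under those assumptions, not a reproduction of my client’s data.

    # Simulation of an A/A test: both versions are identical, but we check the
    # confidence score every "week" and stop the first time it crosses 90%.
    # Traffic volume and the 5% baseline conversion rate are hypothetical.
    import random
    from math import sqrt, erf

    def confidence(va, ca, vb, cb):
        pooled = (ca + cb) / (va + vb)
        se = sqrt(pooled * (1 - pooled) * (1 / va + 1 / vb))
        z = ((cb / vb) - (ca / va)) / se if se else 0.0
        return erf(abs(z) / sqrt(2)) * 100

    def run_aa_test(weeks=12, visitors_per_week=5_000, rate=0.05):
        va = vb = ca = cb = 0
        for week in range(1, weeks + 1):
            va += visitors_per_week
            vb += visitors_per_week
            ca += sum(random.random() < rate for _ in range(visitors_per_week))
            cb += sum(random.random() < rate for _ in range(visitors_per_week))
            if confidence(va, ca, vb, cb) >= 90:
                lift = ((cb / vb) - (ca / va)) / (ca / va) * 100
                return week, lift            # stopped early on a phantom "winner"
        return None, 0.0                     # no winner ever declared

    random.seed(1)
    results = [run_aa_test() for _ in range(100)]
    winners = [lift for week, lift in results if week is not None]
    print(f"{len(winners)} of 100 identical A/A tests hit 90%+ confidence; "
          f"largest apparent lift: {max((abs(l) for l in winners), default=0):.1f}%")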

So does that mean these tools are now useless?

No.

What it means is there’s a margin of error we need to account for when reading the results from these testing platforms. Based on the results of this experiment, my conversion needs to increase by 10% or higher before I can confidently consider it a real winner, because I know there’s a margin of error of around 8% I need to account for.

So the moral of the story is that A/B split testing tools like Google Website Optimizer are not perfect. There’s a margin of error, which I estimate to be around 8%, that needs to be accounted for.
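If you want to put a number on that margin of error for your own traffic rather than borrowing my 8%, a rough confidence interval around the measured lift does the job. The sketch below is back-of-the-envelope math with hypothetical visitor and conversion counts; if the interval still straddles 0%, your ‘winner’ may just be noise.

    # Back-of-the-envelope 95% confidence interval around a measured lift,
    # to judge whether an observed increase clears the noise on your own
    # traffic. The visitor and conversion counts below are hypothetical.
    from math import sqrt

    def lift_interval(visitors_a, conversions_a, visitors_b, conversions_b, z=1.96):
        rate_a = conversions_a / visitors_a
        rate_b = conversions_b / visitors_b
        # Standard error of the difference between the two conversion rates
        se = sqrt(rate_a * (1 - rate_a) / visitors_a +
                  rate_b * (1 - rate_b) / visitors_b)
        diff = rate_b - rate_a
        # Express everything relative to the control's conversion rate
        return (diff / rate_a * 100,
                (diff - z * se) / rate_a * 100,
                (diff + z * se) / rate_a * 100)

    # A "6% winner" measured on 20,000 visitors per version
    lift, low, high = lift_interval(20_000, 1_000, 20_000, 1_060)
    print(f"Measured lift {lift:+.1f}%, 95% interval {low:+.1f}% to {high:+.1f}%")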

So if you have been running tests that predict increased conversions/sales, only to be disappointed by the results when the winner is rolled out, this might be the problem.

Happy Testing!

Derek

 

5 Responses to Dirty Secret Of A/B Split Testing Software…

  1. Great post Derek and a discussion that we could have for hours. Unfortunately your conclusions only make the testing decisions for small businesses without high volume traffic even harder. You say that until you see a conversion increase of 10% or higher you can’t be confident of a winner. To me, a conversion increase of 8%, 6% or even 5% would be a great success and so that page/version would be the winner…but from your post I’m running a big risk of betting on the wrong horse in the long term.

    Paul

  2. You’re right, but the key is the number of visitors. The fewer the visitors, the more you need to account for a margin of error.

    You need around 1000 “actions”, not hits but “actions”, to have a margin of error of 4%. Actions in most cases will be orders or subscribers.

  3. There would always need to be a margin of error because we are dealing with human beings and we can never 100% duplicate nor predict the responses of humans. We aren’t computers. There are too many variables for any system to be completely accurate, even over time. God is the only split tester who can predict with 100% accuracy every time.

  4. Hi Derek,

    You raise a valid point, but a few things to remember.

    If X is outperforming Y at 95% confidence, you still have a 5% chance that Y will beat X in the future. For this reason I wait until I have 99.9% confidence before declaring a winner.

    Confidence levels report on which variation is the winner, NOT on the conversion rates themselves.

    Make sure the two variations are identical in load time, not just user experience. If you are redirecting users to another page you are introducing a small disadvantage.

    Lastly, results are not always reproducible. It sucks but it’s true, not only for split tests. For more details see:
    http://analyticsimpact.com/2012/01/17/test-fatigue-why-it-happens/

    Take care
    Ophir
