In a previous post on experimentation at Microsoft, I linked to a recent presentation by Ron Kohavi (GM of their experimentation platform). One point he raised is that split tests can actually give you the wrong answer because of a phenomenon called Simpson’s Paradox. You read that right: your test might tell you that version A is the best bet when in reality the better-performing version is B.
That should send a shiver down the spine of anyone tasked with improving a website’s ROI.
Simpson’s Paradox can occur in any setting where the proportion of people allocated to the split groups (e.g., control and test) varies according to some important attribute of the study population. It is easiest to understand the paradox by example. Thankfully, the Wall Street Journal presented one a couple of days ago in an article on the Flaw of Averages. Essentially, it showed that although current aggregate unemployment rates in the US (expressed as % jobless) don’t appear as bad as they were during the 80s recession, they are actually consistently worse when the figures are examined by educational subgroup. This is because the proportion of people in each educational subgroup has shifted between the 1980s and now, and each subgroup has a different susceptibility to unemployment.
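To see the mechanism in miniature, here is a short sketch with made-up numbers (not the WSJ figures): every subgroup’s jobless rate gets worse, yet the aggregate rate improves, purely because the mix of subgroups has shifted.

```python
def jobless_rate(groups):
    """Aggregate jobless rate from (population, rate) pairs."""
    total = sum(pop for pop, _ in groups)
    return sum(pop * rate for pop, rate in groups) / total

# (population in millions, jobless rate) for a high-risk vs low-risk
# educational subgroup. Hypothetical figures for illustration only.
then = [(60, 0.12), (40, 0.04)]   # 1980s: more people in the high-risk group
now  = [(30, 0.14), (70, 0.05)]   # today: BOTH subgroup rates are worse ...

print(f"1980s aggregate: {jobless_rate(then):.1%}")  # 8.8%
print(f"Today aggregate: {jobless_rate(now):.1%}")   # ... yet aggregate is 7.7%
```

The aggregate flips because it is a weighted average, and the weights (subgroup sizes) changed between the two periods.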
The WSJ article also presents two other examples (gender bias in UC Berkeley admissions and kidney stone treatment efficacy). If you are still scratching your head after reading through the narrative explanations, try having a look at the data-based explanations of the same examples in this Wikipedia entry.
Turning to a web-based scenario, in a recent paper outlining pitfalls to avoid in online experimentation, the folks at Microsoft showed how Simpson’s Paradox can occur when a test is ‘ramped up’ over time. Their example involves a page design test run over two days, with a 1% sample of users assigned to the test group on the first day (Friday) and then a 50% sample assigned to the test group on the second day (Saturday). Here is the data from the paper:
(Note: The percentage in the version B ‘total’ cell is different here due to an error in the original)
On both test days, ‘B’ was the winning version. However, the result is reversed in the aggregated total: version A is the winner. This is essentially because both the split allocation and the underlying response rate varied by day.
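The flip is easy to reproduce in a few lines. The conversions/visitors figures below are illustrative numbers of the same shape as the paper’s ramp-up example, not a quotation of its table:

```python
# Illustrative (conversions, visitors) per day for each version.
data = {
    "Friday":   {"A": (20_000, 990_000), "B": (230, 10_000)},    # 1% in test
    "Saturday": {"A": (5_000, 500_000),  "B": (6_000, 500_000)}, # 50% in test
}

def rate(conversions, visitors):
    return conversions / visitors

# Within each day, B beats A ...
for day, arms in data.items():
    a, b = rate(*arms["A"]), rate(*arms["B"])
    print(f"{day}: A={a:.2%}  B={b:.2%}  winner={'B' if b > a else 'A'}")

# ... but pool the two days and the winner flips to A.
tot = {v: tuple(map(sum, zip(*(arms[v] for arms in data.values()))))
       for v in "AB"}
a, b = rate(*tot["A"]), rate(*tot["B"])
print(f"Total: A={a:.2%}  B={b:.2%}  winner={'B' if b > a else 'A'}")
```

Version A’s total is dominated by Friday, when it had 99% of the traffic and a high conversion rate, while version B’s total is dominated by the low-converting Saturday.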
Test ‘ramp-ups’ are quite common: it is good practice to pilot a test on a small sample to make sure everything is working OK before unleashing it on a larger one. So the potential for Simpson’s Paradox to occur is very real. If you are analysing split test results, you can avoid the problem by re-weighting the results from periods with different allocation proportions, or simply by discarding the results from the pilot phase.
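The re-weighting fix can be sketched as follows: average each version’s daily rates using the same weights for both versions (here, each day’s total traffic), so the two are compared over an identical mix of days. This is direct standardisation; the helper name and data are mine, not from the paper.

```python
data = {
    "Friday":   {"A": (20_000, 990_000), "B": (230, 10_000)},
    "Saturday": {"A": (5_000, 500_000),  "B": (6_000, 500_000)},
}

def standardised_rate(daily, arm):
    """Average one arm's daily rates, weighted by each day's total traffic."""
    traffic = {day: sum(vis for _, vis in arms.values())
               for day, arms in daily.items()}
    total = sum(traffic.values())
    return sum((arms[arm][0] / arms[arm][1]) * traffic[day] / total
               for day, arms in daily.items())

print(f"A (reweighted): {standardised_rate(data, 'A'):.2%}")  # 1.51%
print(f"B (reweighted): {standardised_rate(data, 'B'):.2%}")  # 1.75% -- B wins
```

With equal weights on the two days, the reweighted comparison agrees with the per-day results (B wins), instead of being distorted by where each version’s traffic happened to land.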