Tagged: Microsoft

  • Ben 11:15 on Sunday, December 6, 2009
    Tags: Experimental Design, Microsoft, Simpson's Paradox

    Have You Fallen Prey to Simpson’s Paradox? 

    In a previous post on experimentation at Microsoft I linked to a recent presentation by Ron Kohavi (GM of their experimentation platform).  One point he raised was that you can actually get the wrong answers from split tests because of a phenomenon called Simpson’s Paradox.  You read right; your test might tell you that version A is the best bet when in reality the better performing version is B.

    That should send a shiver down the spine of anyone tasked with improving a website’s ROI.

    Simpson’s paradox can occur in any setting where the proportion of people allocated to split groups (e.g., control and test) varies according to some important attribute in the study.  It is easiest to understand the paradox by example.  Thankfully, the Wall Street Journal presented one a couple of days ago in an article on the Flaw of Averages.  Essentially, it showed that although current aggregate unemployment rates in the US (expressed as % jobless) don’t appear as bad as they were during the 80s recession, they are actually consistently worse when the figures are examined by educational subgroup.  This is because the proportion of people in each educational subgroup has shifted between the 1980s and now, and each subgroup has a different susceptibility to unemployment.

    The WSJ article also presents two other examples (gender bias in UC Berkeley admissions and kidney stone treatment efficacy).  If you are still scratching your head after reading through the narrative explanations, try having a look at the data-based explanations of the same examples on this Wikipedia entry.

    Turning to a web-based scenario, in a recent paper outlining pitfalls to avoid in online experimentation, the folks at Microsoft showed how Simpson’s Paradox can occur when a test is ‘ramped up’ over time.  Their example involves a page design test run over two days, with a 1% sample of users assigned to the test group on the first day (Friday) and then a 50% sample assigned to the test group on the second day (Saturday).  Here is the data from the paper:

    (Note: The percentage in the version B ‘total’ cell is different here due to an error in the original)

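    On both test days ‘B’ was the winning version.  However, the result is reversed in the aggregated total, where version A is the winner.  This happens because both the test split allocation and the response levels varied by day: nearly all of the traffic on the higher-response day went to A, so A’s total is weighted heavily toward that day, while B’s total is dominated by the lower-response day.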
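
    To make the reversal concrete, here is a minimal Python sketch.  The counts are illustrative assumptions chosen to match the ramp-up described above (1% of users in the test group on Friday, 50% on Saturday); they are not presented as the paper's table, and the function names are my own.

```python
# Hypothetical counts for illustration only.
# Layout: day -> version -> (conversions, visitors)
data = {
    "Friday":   {"A": (20_000, 990_000), "B": (230, 10_000)},    # 99% / 1% split
    "Saturday": {"A": (5_000, 500_000),  "B": (6_000, 500_000)},  # 50% / 50% split
}


def daily_winners(data):
    """Return the version with the highest conversion rate on each day."""
    return {
        day: max(versions, key=lambda v: versions[v][0] / versions[v][1])
        for day, versions in data.items()
    }


def pooled_rates(data):
    """Naively pool conversions and visitors across days, then divide."""
    totals = {}
    for versions in data.values():
        for name, (conversions, visitors) in versions.items():
            c, n = totals.get(name, (0, 0))
            totals[name] = (c + conversions, n + visitors)
    return {name: c / n for name, (c, n) in totals.items()}


winners = daily_winners(data)  # B wins on both days
pooled = pooled_rates(data)    # yet A has the higher pooled rate
print(winners)
print({name: f"{rate:.2%}" for name, rate in pooled.items()})
```

    With these assumed counts B beats A on each day, yet the naive pooled rate favours A, because A's pooled figure is dominated by the high-converting Friday traffic.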

    Test ‘ramp ups’ are quite common.  It is good practice to do a pilot of the test on a small sample to make sure everything is working OK before unleashing it on a larger sample. So, the potential for Simpson’s Paradox to occur is very real.  If you are analysing split test results, you can make sure your analysis avoids the problem by re-weighting the results from periods with different allocation procedures or by simply discarding the results from the pilot phase.
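
    Both remedies can be sketched in a few lines of Python.  This is one possible approach rather than the method used in the Microsoft paper: the re-weighting here weights each day's conversion rate by that day's share of total traffic (the same weights for both versions), and the counts are the same kind of hypothetical figures, assumed purely for illustration.

```python
# Hypothetical counts for illustration only.
# Layout: day -> version -> (conversions, visitors)
data = {
    "Friday":   {"A": (20_000, 990_000), "B": (230, 10_000)},    # pilot: 1% in test
    "Saturday": {"A": (5_000, 500_000),  "B": (6_000, 500_000)},  # full: 50% in test
}


def stratified_rates(data):
    """Weight each day's rate by that day's share of all traffic.

    Both versions get the same per-day weights, so an uneven
    allocation on one day can no longer tilt the comparison.
    """
    day_traffic = {
        day: sum(visitors for _, visitors in versions.values())
        for day, versions in data.items()
    }
    total = sum(day_traffic.values())
    names = next(iter(data.values())).keys()
    return {
        name: sum(
            day_traffic[day] / total * data[day][name][0] / data[day][name][1]
            for day in data
        )
        for name in names
    }


def drop_pilot(data, pilot_days):
    """Pool results after discarding the days that belong to the pilot."""
    pooled = {}
    for day, versions in data.items():
        if day in pilot_days:
            continue
        for name, (conversions, visitors) in versions.items():
            c, n = pooled.get(name, (0, 0))
            pooled[name] = (c + conversions, n + visitors)
    return {name: c / n for name, (c, n) in pooled.items()}


reweighted = stratified_rates(data)
without_pilot = drop_pilot(data, {"Friday"})
print(reweighted)
print(without_pilot)
```

    Under these assumptions both approaches rank B above A, which agrees with the day-by-day results.  Discarding the pilot is simpler but throws data away; re-weighting keeps it, at the cost of having to choose the weights.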

    _____

    ShortURL for this post: http://wp.me/pnqr9-3X

     
  • Ben 18:17 on Saturday, November 28, 2009
    Tags: Microsoft

    Online Experimentation at Microsoft 

    Over the last three years Microsoft has embraced experimentation as a mechanism for testing changes to its various online products.  That the company is only now formally adopting a data-driven approach to design was a little surprising to me, but it is certainly better late than never!

    As part of the process of making the shift away from simply following the Highest Paid Person’s Opinion (HiPPO) to actually testing the ROI of different ideas, the team in charge of experimentation has been disseminating some of their experiences. You can see a recent talk on the topic, presented at a September meeting of Seattle Tech Startups, at the URL below (sorry, the quality isn’t great and I can’t embed because of WordPress.com restrictions).  Alternatively, go to the Microsoft experimentation portal to see other work from this group.

    http://www.ustream.tv/flash/video/2134721

    The talk presents a number of interesting insights, ranging from the results of some tests (winning versions are often different to what you’d think) through to the cultural hurdles arising from an increased reliance on data for decision making (e.g., people with strong opinions get their egos bruised).

    Amazon.com is also mentioned a couple of times.  I think a few of the current Microsoft team originally cut their teeth there, so those of you interested in this topic might also like to see this eMetrics Summit 2004 presentation (pdf).  It showcases the Amazonian approach to deciding on site changes and resolving bitter political disputes over whose pet area should get highly coveted slots on the home page.  Interesting stuff that more and more organisations are going to have to grapple with as their products and services become increasingly digitized.

     