Two things struck me when I saw this AdSense optimization ad…
First, the age profile of the first two cases was toward the high end. Google is probably trying to target older website owners, but it is still worth remembering that the urge to test empirically does not necessarily discriminate by age. They did manage to find two cases for their ad, after all :)
Second, the fact that Google is running ads focussed on getting people to try out different ad formats suggests they are pretty sure a large proportion of current placements are sub-optimal. More testing and optimisation of AdSense placements by website owners means more revenue for both parties. Unless, of course, the website owner has other revenue streams that may be cannibalised by changes to the ad formats they have on their site…
It is not uncommon to see news stories celebrating the success of some initiative or individual as being due to some bright idea or moment of inspiration. This phenomenon is not new; every child is taught that Archimedes had his ‘Eureka moment’ and can recite the story of Newton’s falling apple. It is these flashes of insight that we remember and strive to emulate.
However, the focus on creativity is unfortunate because it only paints half the picture. For instance, the ‘file drawer problem’ means we see those flashes of inspiration that led to success, rather than the countless others that didn’t. And it is easy to forget that people like Archimedes and Newton were old-school split testers. They subjected idea after idea to the brutal scientific method and learned from the many failures they no doubt had. It is their perseverance and commitment to testing, not just their creativity, that we should remember them for.
Fortunately, more news is starting to bubble to the surface about the interplay between the creative and scientific processes. For instance, this Wired story shows how the gaming industry (typically considered a bastion of creativity and design) is embracing split testing to drive development decisions. I also recently saw the following talk shared widely on Twitter about the testing that went into the success of Obama’s 2008 fundraising campaigns.
These stories highlight the fact that creative ideas are like the random mutations that drive the evolutionary process. They are necessary, but certainly not sufficient, for progress to occur. And another interesting recurrent theme is that the mental models underlying our creativity – the source of our ‘gut feelings’ about what will work – are often wrong. Indeed, testing is essential to updating these models and is an underappreciated input to the creative process. Together, they form an iterative learning cycle.
This interplay has implications for organisational and personal development in that as much effort should be put into developing the testing and learning process as goes into supporting the creative process.
A great write-up on determining sample sizes for, and avoiding common traps in, split testing. Yet another good testing post from the folks at 37signals. R code and discussion of power calcs included. http://37signals.com/svn/posts/3004-ab-testing-tech-note-determining-sample-size
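The standard power calculation behind posts like this can be sketched in a few lines. Here is a minimal Python version (rather than the R code at the link), assuming a two-sided, two-proportion z-test; the function name and default 5% significance / 80% power settings are my own conventions, not anything from the 37signals post.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.8):
    """Approximate visitors needed in each variant to detect a shift
    from a baseline conversion rate p_base to p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    effect = p_target - p_base
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# e.g. detecting a lift from a 3% to a 4% conversion rate
n = sample_size_per_group(0.03, 0.04)
```

Note how quickly the required sample grows as the effect you want to detect shrinks: the denominator is the squared difference in rates, so halving the detectable lift roughly quadruples the traffic you need.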
How one extra word in an email subject line improved end point conversions by 279%. This stuff never ceases to amaze me. http://whichtestwon.com/archives/7353
Here is a link worth keeping. Google recently updated the look and feel of its search user interface. This article describes the behind-the-scenes process Googlers followed to get to the end point we are all seeing today. Unsurprisingly, they followed a thorough research process, incorporating extensive qualitative and quantitative feedback before settling on an optimal solution.
How Google got its New Look.
In a previous post on experimentation at Microsoft I linked to a recent presentation by Ron Kohavi (GM of their experimentation platform). One point he raised was that you can actually get the wrong answers from split tests because of a phenomenon called Simpson’s Paradox. You read that right; your test might tell you that version A is the best bet when in reality the better performing version is B.
That should send a shiver down the spine of anyone tasked with improving a website’s ROI.
Simpson’s paradox can occur in any setting where the proportion of people allocated to split groups (e.g., control and test) varies according to some important attribute in the study. It is easiest to understand the paradox by example. Thankfully, the Wall Street Journal presented one a couple of days ago in an article on the Flaw of Averages. Essentially, it showed that although current aggregate unemployment rates in the US (expressed as % jobless) don’t appear as bad as they were during the 80s recession, they are actually consistently worse when the figures are examined by educational subgroup. This is because the proportion of people in each educational subgroup has shifted between the 1980s and now, and each subgroup has a different susceptibility to unemployment.
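The unemployment example boils down to a share-weighted average. The sketch below uses hypothetical figures chosen only to illustrate the mechanism (they are not the WSJ’s actual numbers): every subgroup’s jobless rate is worse today, yet the aggregate rate looks better, because the mix of people has shifted toward the lower-unemployment subgroup.

```python
# Hypothetical (rate, labour-force share) per education subgroup.
periods = {
    "1980s": {"no_degree": (0.10, 0.60), "degree": (0.04, 0.40)},
    "today": {"no_degree": (0.11, 0.30), "degree": (0.05, 0.70)},
}

def aggregate_rate(groups):
    # The overall jobless rate is the share-weighted average
    # of the subgroup rates.
    return sum(rate * share for rate, share in groups.values())

then_rate = aggregate_rate(periods["1980s"])  # 0.6*10% + 0.4*4% = 7.6%
now_rate = aggregate_rate(periods["today"])   # 0.3*11% + 0.7*5% = 6.8%
```

Each subgroup rate rose (10% to 11%, 4% to 5%), but the aggregate fell from 7.6% to 6.8%. The paradox is entirely in the shifting weights.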
The WSJ article also presents two other examples (UC Berkeley admissions gender bias and kidney stone treatment efficacy). If you are still scratching your head after reading through the narrative explanations, try having a look at the data-based explanations of the same examples on this Wikipedia entry.
Turning to a web-based scenario, in a recent paper outlining pitfalls to avoid in online experimentation, the folks at Microsoft showed how Simpson’s Paradox can occur when a test is ‘ramped up’ over time. Their example involves a page design test run over two days, with a 1% sample of users assigned to the test group on the first day (Friday) and then a 50% sample assigned to the test group on the second day (Saturday). Here is the data from the paper:
(Note: The percentage in the version B ‘total’ cell is different here due to an error in the original)
On both test days ‘B’ was the winning version. However, the result is reversed in the aggregated total; Version A is the winner. This is essentially because both the test split allocations and response levels varied by day.
Test ‘ramp ups’ are quite common. It is good practice to do a pilot of the test on a small sample to make sure everything is working OK before unleashing it on a larger sample. So, the potential for Simpson’s Paradox to occur is very real. If you are analysing split test results, you can make sure your analysis avoids the problem by re-weighting the results from periods that used different allocation proportions, or by simply discarding the results from the pilot phase.
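Both the ramp-up trap and the re-weighting fix are easy to demonstrate. The visitor and conversion counts below are illustrative (they are not the figures from the Microsoft paper’s table): a 1%/99% split on Friday is ramped up to 50/50 on Saturday.

```python
# (visitors, conversions) per version per day.
data = {
    "Fri": {"A": (9900, 198), "B": (100, 3)},    # B: 3.0% vs A: 2.0%
    "Sat": {"A": (5000, 50),  "B": (5000, 55)},  # B: 1.1% vs A: 1.0%
}

def naive_rate(version):
    # Pool the raw counts across days -- this is where the paradox bites,
    # because B's Friday sample is tiny relative to A's.
    visitors = sum(data[d][version][0] for d in data)
    conversions = sum(data[d][version][1] for d in data)
    return conversions / visitors

def reweighted_rate(version):
    # Weight each day's conversion rate by that day's share of total
    # traffic, so unequal allocations no longer distort the comparison.
    total = sum(v for d in data for v, _ in data[d].values())
    return sum(
        (sum(v for v, _ in data[d].values()) / total)
        * (data[d][version][1] / data[d][version][0])
        for d in data
    )
```

Naively pooled, A looks better (1.66% vs 1.14%); re-weighted by daily traffic, B’s advantage on both days survives (2.05% vs 1.50%), matching the per-day comparisons.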
Short URL for this post: http://wp.me/pnqr9-3X
Over the last three years Microsoft has embraced experimentation as a mechanism for testing changes to their various online products. That they have only recently formally adopted a data-driven approach to design surprised me a little, but it is certainly better late than never!
As part of the process of making the shift away from simply following the Highest Paid Person’s Opinion (HiPPO) to actually testing the ROI of different ideas, the team in charge of experimentation has been disseminating some of their experiences. You can see a recent talk on the topic, presented at a September meeting of Seattle Tech Startups, at the URL below (sorry, the quality isn’t great and I can’t embed because of WordPress.com restrictions). Alternatively, go to the Microsoft experimentation portal to see other work from this group.
The talk presents a number of interesting insights, ranging from the results of some tests (winning versions are often different to what you’d think) through to the cultural hurdles arising from an increased reliance on data for decision making (e.g., people with strong opinions get their egos bruised).
Amazon.com is also mentioned a couple of times. I think a few of the current Microsoft team originally cut their teeth there, so those of you interested in this topic might also like to see this eMetrics Summit 2004 presentation (pdf). It showcases the Amazonian approach to deciding on site changes and resolving bitter political disputes over whose pet area should get highly coveted slots on the home page. Interesting stuff that more and more organisations are going to have to grapple with as their products and services become increasingly digitized.
“If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.” Hal Varian, Chief Economist at Google
Me suffer from confirmation bias? Never!
Short URL for this post: http://wp.me/pnqr9-2K