Tagged: Bias

  • Ben 20:40 on Sunday, October 18, 2009
    Tags: Bias, Five Star Rating, Rating Scales

    The Five-Point Rating House of Cards 

    The web is awash with 5-point rating schemes.  Netflix, Amazon, YouTube, WordPress (via PollDaddy ratings), Apple’s App Store, the Android Market and countless blogs use them to gauge people’s experience with various items.  It’s not hard to see why 5-point schemes are so popular: they are really simple to implement, familiar to most people, and can be made to look all kinds of pretty using icons for stars, hearts or smileys.

    Unfortunately, they often don’t gather very useful data.  And that’s a big problem for sites that intend to use ratings as the backbone of their recommendation systems.

    The 5-point schemes common on the web suffer from two core problems: nonresponse and measurement bias.  First, many people choose not to rate many items, and those who do tend to have had a positive experience.  Data from YouTube supports this, as does data from Netflix.  Second, the scales are usually not labelled, meaning people answer under a wide variety of interpretations of what each ‘point’ means.  This comment from a YouTube user suggests ambiguity in the scale can also exacerbate the nonresponse issue…

    Ratings on YouTube have always been somewhat confusing for me: should I rate the content of the video or the quality? There are some wonderfully shot videos on YouTube that really don’t have any meaningful content, and there are also a lot of videos that have wonderful content but are shot very poorly. I think a dual content vs. quality rating would add too much complexity to the system, but I often don’t rate a video for that very reason.

    I don’t find ratings all that helpful, probably due to the fact that there are millions of people using YouTube, each with a different opinion. It doesn’t influence whether I watch a video, but then again, I usually find videos from friends or other channels I respect. [comment found here]

    Probably the best way to get around these problems is to measure a person’s preferences indirectly by recording their behaviour: how much of the video did they watch?  did they share the content? did they look for related items?  It is fairly well established that what people say and what they do can be very different things, so users’ actions may be much more useful than their words.  Certainly, the ‘popular’ and ‘most viewed’ categories in YouTube appear to rely on behavioural metrics, so perhaps their rating metric is redundant.

    However, the ‘measure behaviour’ solution is best suited to organisations that deliver interactive material consumed on-site (YouTube, StumbleUpon).  So, what can you do if you are dealing with items that aren’t consumed on-site?  Collapsing the scale to “liked it”/“didn’t like it” won’t solve the core issues; if anything, it will just mean you give up what little discriminative power the 5-point scale might have had.  Another suggestion is to expand the scale to 10 points.  While this may increase the discriminative power of the scale and is a format people are familiar with, it won’t solve the ambiguity problem.  For that you need to construct clear labels that are likely to be interpreted in much the same way by most people.  Ideally, the scale will also relate as directly as possible to whatever it is you want to use the data for.  This is much easier said than done, but here is an example that might work for a site recommending local restaurants:

    0 – I will definitely not (0%) eat there again soon

    1 – It is unlikely (20% chance) I will eat there again soon

    2 – There is some chance (40%) I will eat there again soon

    3 – There is a good chance (60%) I will eat there again soon

    4 – It is quite likely (80%) I will eat there again soon

    5 – I will definitely (100%) eat there again soon

    This is actually a heavily butchered version of a probability-based predictive instrument called the Juster Scale.  It would have to be tested, but it at least serves to demonstrate the qualities I outlined above.  The scale could also easily be extended to more points (in fact, the Juster Scale is an 11-point scale).
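
    To show how a labelled scale like this relates directly to what you want to use the data for, here is a rough Python sketch that turns the six scale points above into an expected repeat-visit rate.  The mapping simply reuses the percentages in the labels; the function name and the sample ratings are made up for illustration, and none of this is a tested instrument.

```python
# Hypothetical mapping from the six scale points above to the stated
# probability of eating there again (percentages taken from the labels).
SCALE_PROBABILITY = {0: 0.0, 1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8, 5: 1.0}


def expected_return_rate(ratings):
    """Average stated probability of returning across a list of ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(SCALE_PROBABILITY[r] for r in ratings) / len(ratings)


# Made-up ratings for one restaurant, for illustration only.
sample_ratings = [5, 4, 4, 2, 0]
print(f"{expected_return_rate(sample_ratings):.2f}")  # prints 0.60
```

    The point of the design is that the average is a number you can act on (a predicted share of repeat visits) rather than an unlabelled ‘3.0 stars’.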

    Finally, there is the issue of nonresponse.  A good scale will help resolve this, but ultimately you need to follow up with users to increase rating participation.  TradeMe and TravelBug are two local examples that do this well.  You’ll never get every user giving a rating for the products they’ve tried, but at least you’ll bump the proportion up, which will provide a more solid foundation for any recommendation or imputation algorithms you want to run over the data.
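
    A small simulation can illustrate why participation matters.  The Python sketch below assumes (purely for illustration) that true opinions are spread evenly over 1 to 5 and that people who would give a 4 or 5 are more likely to rate than everyone else.  The response probabilities are invented, and the ‘follow-up’ scenario simply assumes that chasing users raises participation, especially among the less satisfied.

```python
import random


def simulate_observed_mean(n_users, p_rate_if_happy, p_rate_if_unhappy, seed=0):
    """Return (true_mean, observed_mean) for a simulated 5-point rating."""
    rng = random.Random(seed)
    # Assumption: true opinions are uniform over the five points.
    true_scores = [rng.randint(1, 5) for _ in range(n_users)]
    observed = []
    for score in true_scores:
        # Assumption: users who would give 4 or 5 are more likely to rate.
        p = p_rate_if_happy if score >= 4 else p_rate_if_unhappy
        if rng.random() < p:
            observed.append(score)
    true_mean = sum(true_scores) / len(true_scores)
    observed_mean = sum(observed) / len(observed) if observed else float("nan")
    return true_mean, observed_mean


# Same seed, so both scenarios share the same underlying opinions.
true_mean, biased = simulate_observed_mean(10_000, 0.30, 0.05)
_, followed_up = simulate_observed_mean(10_000, 0.60, 0.40)
print(f"true {true_mean:.2f}, no follow-up {biased:.2f}, follow-up {followed_up:.2f}")
```

    Under these assumptions the average of the ratings you actually collect sits well above the true average, and the gap narrows as more of the less satisfied users are brought in.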

    So, if you are at the early stages of developing a rating function for your site, give some careful thought to how your scheme will work.  Test it out before you commit to it longer term.  Doing so will give you much better data to work with down the track.

    One final point: you can probably forget all this if your core reason for implementing ratings is to generate reassuring sales cues for prospective buyers (i.e., in the same way sites put testimonials up to reassure users).  In that case, you are likely to be better off with an unlabelled 5-point scale.  As the folks at YouTube found, most of the ratings you will get with such a scale will be positive!

    _____

    Short URL for this post: http://wp.me/pnqr9-2u

     
  • Ben 22:57 on Tuesday, September 15, 2009
    Tags: Bias, Coverage Error, Probability Sampling, Your Mother

    Why some Internet Surveys are a bit like your Mother 

    Most of us love our mothers dearly.  But that doesn’t mean we go to them for advice every time we need answers to some important problem in our lives.  Sure, it would be easy, quick, and cheap to get a few words of wisdom, but it is just not realistic to expect them to be objective.  Thousands of bitterly disappointed American Idol contestants learnt this fact the hard way.

    And so it is with some online surveys.  It is now easy to throw together a web-based questionnaire, get it sent to a bunch of people, and have answers back all within a fortnight.  But if the people who received the survey are skewed on some important dimension (e.g., technologically literate, mostly young, mostly employed, etc.) you can’t expect the results to accurately reflect the opinions or likely behaviours of a more diverse group.  The technical term for this sort of bias is coverage error and it is one of the key reasons to think carefully about how you select the sample for a survey.

    There are two very general categories of survey sampling techniques:

    1. Non-probability sampling: You don’t specifically go out to get a random selection of people from your target group.  Instead, you let all comers complete your survey.  Perhaps you send an invitation out to your friends and ask them to invite their friends, etc.  Or perhaps you advertise the survey and let anyone who happens to see the ad fill in a questionnaire.  These surveys have all the objectivity of talkback radio.  They might be entertaining, but you wouldn’t usually base a policy or business decision on them.
    2. Probability sampling: You make an attempt to get a random selection of people from your target group completing your survey.  In the ‘holy grail’ version of this approach, you’d have a list of all the people in your target group, take a simple random sample from the list, send the invitations to the sampled people and then follow up to get as many of them answering as possible.  Survey researchers and statisticians have developed lots of variations on this theme to take account of practical issues, but the aim is always to get a wide mix of people from the target group responding.  Although your results under this approach won’t be perfectly accurate, you can be confident that you’ll come close to reflecting the opinions and behaviours of the full group.
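
    The ‘holy grail’ version of probability sampling described above can be sketched in a few lines of Python.  The sampling frame here is a made-up list of email addresses and the function name is hypothetical; in practice, getting a complete list of the target group is the hard part.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the frame, each equally likely."""
    if n > len(frame):
        raise ValueError("sample size exceeds the size of the frame")
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical sampling frame: a full list of the target group.
frame = [f"member{i:04d}@example.com" for i in range(1, 2001)]
invited = simple_random_sample(frame, 200, seed=42)
print(len(invited), len(set(invited)))  # prints 200 200
```

    Every member of the frame has the same chance of being invited, which is exactly what an ‘all comers’ survey cannot guarantee.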

    Sounds clear enough, doesn’t it?

    And it is.  Until we enter the wild world of internet survey respondent panels.  You see, it is possible to order up a random selection of people from a panel, which makes you feel like you are taking a probability sample, when really your results may be subject to the sorts of coverage errors inherent in a non-probability sample.  This is because many panel providers build up their lists of eager members by non-probability methods.  Few providers source members via a random (or pseudo-random) process like Random Digit Dialling or Address Based Sampling because it is so expensive to do so.  Even fewer provide internet access to those households that don’t have it.  Knowledge Networks is one company that does these things.

    Predictably, a recent study titled Study Finds Trouble for Internet Surveys highlights the differences in accuracy that arise from the different panel recruitment approaches (probability vs non-probability).  Here are some selected excerpts:

    In the most extensive such analysis to date, David Yeager and Prof. Jon Krosnick compared seven non-random internet surveys with two others based instead on random or so-called probability samples. The non-probability internet surveys were less accurate, and customary adjustments did not uniformly improve them.

    While the random-sample surveys were “consistently highly accurate,” the internet surveys based on self-selected or “opt-in” panels “were always less accurate, on average, than probability sample surveys, and were less consistent in their level of accuracy,” the researchers said. Further, they said, adjusting these samples to known population values had no effect on accuracy (and in one case even worsened it) as often as that process, known as weighting, improved it.

    So, be wary when purchasing a “random” or “representative” sample from an opt-in panel provider.  Such a sample might be fine for your particular purpose or target group, but you need to at least consider the risks of coverage error you are taking.  And don’t expect weighting to magically solve any coverage error you do have!
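
    A made-up example shows why weighting is no cure-all.  In the Python sketch below, every number is invented: an opt-in panel over-represents the under-40s, and within each age group the panel members are also more likely to shop online than the wider population.  Weighting the panel back to the population’s age mix fixes the first problem but not the second.

```python
def weighted_rate(rates, shares):
    """Combine per-group rates using the given group shares as weights."""
    return sum(shares[group] * rates[group] for group in shares)


# All figures are hypothetical, for illustration only.
population_shares = {"under_40": 0.4, "40_plus": 0.6}
population_rates = {"under_40": 0.7, "40_plus": 0.3}   # true share shopping online
panel_shares = {"under_40": 0.8, "40_plus": 0.2}       # panel skews young
panel_rates = {"under_40": 0.9, "40_plus": 0.7}        # panel skews online

truth = weighted_rate(population_rates, population_shares)
unweighted = weighted_rate(panel_rates, panel_shares)
weighted = weighted_rate(panel_rates, population_shares)
print(f"truth {truth:.2f}, unweighted {unweighted:.2f}, weighted {weighted:.2f}")
# prints: truth 0.46, unweighted 0.86, weighted 0.78
```

    The weighted estimate moves in the right direction, but it stays far from the truth because the people who join the panel differ from their age-mates who don’t.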

    _____

    Short URL for this post:  http://wp.me/pnqr9-1t

     