Just keeping for later: Public datasets hosted on Amazon AWS. https://aws.amazon.com/datasets
Updates from May, 2011
-
Ben
Confidence bias in action
I’ve dabbled a little with crowdsourcing for my own projects, but never used it as a primary research tool. It isn’t hard to see how the major crowdsourcing platforms like Mechanical Turk could be used to undertake quick and cost-effective behavioural research (potential for bias notwithstanding!). So, the following study by crowdsourcing firm CrowdFlower on its own worker base was interesting in itself. That it related to another interest of mine, human bias, made it even more intriguing :)
Confidence Bias: Evidence from Crowdsourcing
The key take-out: over 75% of contributors overestimated their ability to answer multiple choice questions correctly. The Dunning-Kruger effect is alive and well!
-
Ben
KiwiPycon 2011: Document Classification with the Natural Language Toolkit
I’m heading to KiwiPycon in Welly this weekend to meet some fellow Python fans and give a presentation on using the Python-based Natural Language Toolkit (NLTK) to classify documents. I’ll be using the Enron emails as an example document set.
If you’ve travelled here from the future because you saw the presentation and want the files I referred to, here they are.
- KiwiPycon Presentation with Full Notes (pptx, 1.7MB)
- KiwiPycon Import Enron Data (Python Code)
- KiwiPycon Enron Classifiers (Python Code)
There is a missing link between the two code files: changes I made to the dataset to enable training of the classifier and analysis of the results. If you are interested in getting the final dataset, just get in touch.
_____
Update: Here is the slideshare version of the presentation with audio. And here is a text-to-speech video version, with some extra content.
-
Ben
Links: Using SQL ‘With’ statements, and a great example of A/B Testing
Two links worth keeping:
- Michael Berry at Data Miners Blog describes how to use SQL common table expressions (i.e., WITH statements) to simplify complex queries by creating on-the-fly temp tables (named subqueries) prior to the full query definition.
- Designer Andrei from 37Signals describes the outcomes of a number of A/B tests they did on their Highrise signup page, along with the variants they tested.
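For future reference, here is a quick sketch of the WITH idea from the first link. It is my own toy example rather than anything taken from Michael's post: the `orders` table, its rows, and the names `customer_totals` and `big_spenders` are all made up for illustration, and I'm running it through Python's built-in `sqlite3` module just so it is self-contained. Each named subquery is defined up front and then read like an ordinary table by whatever follows it.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("alice", 70.0), ("bob", 20.0), ("carol", 150.0)],
)

# Two named subqueries are declared before the main SELECT.
# The second one reads from the first as if it were a table.
query = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
),
big_spenders AS (
    SELECT customer, total
    FROM customer_totals
    WHERE total >= 100
)
SELECT customer, total
FROM big_spenders
ORDER BY total DESC
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Without the WITH clause the same query would need a subquery nested inside another subquery, which gets hard to read quickly as the number of steps grows.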
-
Ben
Big/Open Data and the Social Sciences
Another great interview and article from Audrey Watters. You can pretty much replace ‘Political Science’ with the social science subject of your choosing.
-
Ben
Kaggle: An interesting source of sample data to model
Good real-world datasets used to be quite hard to come by for those interested in playing with different modelling approaches. However, sites like Kaggle, which expand the crowdsourcing approach to model improvement initiated by events such as the Netflix prize and KDD Cup, are opening up more datasets for statistical modellers to use. Worth keeping an eye on.
-
Ben
Google rolls out more Web Font goodness
The new version of Google’s web font archive and viewer tool makes it very simple to choose a web font for your site. The selection of fonts has also expanded to a point where you’d be hard pressed to not find one or two that suit.
Nice.
-
Ben
Case-level Pew Internet Survey Data Released in Various Formats
Originally found over at FlowingData: The Pew Research Center (a well-respected American non-profit social survey organisation) has released case-level data for the surveys relating to its Internet and American Life project. This is great news for anyone interested in looking at correlations or subsections of the data that aren’t already produced and published by the center.
-
Ben
Beware a Statistician with Dating Data
It’s no secret that as we interact with more web services we are creating a larger and deeper footprint with respect to our digital behaviours. I think we are also volunteering more personal information when asked online. The result has been an explosion in individual-level data available to data wranglers in organisations with a digital presence. Often, it is the negative sides of this that are reported in the media: the decline of privacy and the risks of data abuse to individuals. However, it also allows for some fascinating aggregate-level analysis that just hasn’t been possible before.
For instance, Google Flu trends shows how aggregate search behaviours can be used as an early warning signal for potential public health issues.
And then there is a post I recently found which examines correlations across answers to a questionnaire completed by users of a popular dating site… The aim: to identify first-date questions that “(a) most people were comfortable discussing publicly, and (b) were mathematically likely to tell you something you couldn’t just guess”. The analysis isn’t exactly in the interest of public health, but it is hilarious, well thought through, and accessible. And no individual’s data is exposed in the process.
(Note, the content at this link isn’t really safe for work; if it were a TV show there would be a ‘contains explicit language and sexual themes’ disclaimer before it started.)
OKCupid: The best questions for first dates.
A couple of gems from the post that apply across the sexes (go to the post for the direction and strength of relationship):
To predict: Will my date have sex on the first date?
Ask: Do you like the taste of beer?
To predict: Is my date religious?
Ask: Do spelling and grammar mistakes annoy you?
And one that shows just how bad we are at judging our common ground with others:
“‘Which describes you better, normal or weird?’ might be fine to ask, but doing so is of little value because almost everyone has the same answer. 79% of people think they are weird.”
Disclaimer: The OkCupid sample is large, but probably doesn’t reflect the general population of people looking for partners. So, if you attempt to apply these nuggets of wisdom, your mileage may vary. That said, the differences presented are substantial enough that I’d be surprised if they didn’t hold to at least a small degree outside of OkCupid’s target market!
-
Ben
Revealed! The Bearded Rowan Simpson
Quick disclaimer: Apologies to those who stumble upon this blog expecting to find something useful and/or enlightening. Thankfully there are other blogs (or other posts) for that. No apologies to Rowan since, well, he should have just come up with the pic himself ;)
Those who follow @RowanSimpson will know he recently professed to growing a beard. There must be something in the water because at least three of my current colleagues have done the same in the last month and, unless I’ve stepped into an alternate universe, Movember is still way off.
Anyway, despite repeated requests from his audience, Mr Simpson has yet to reveal pics of said beard. I for one am inclined to think it hasn’t happened. Don’t even get me started on recent events in Abbottabad.
It does leave ample room for speculation, though. Thankfully the folks at the CIA Identi-Kit lab have some time on their hands at the moment and managed to come up with the following artistic impressions of Simpson in the wild:
If anyone sees him, please send him home to his wife. Apparently she’s still waiting on her anniversary present.
_____
ShortURL: http://wp.me/pnqr9-8m


