Just keeping for later: Public datasets hosted on Amazon AWS. https://aws.amazon.com/datasets
-
Ben
A few select pics from a recent trip. By a fluke of nature we managed to catch 11 days of sun out of the 13 we were away. The rest of the country wasn’t so lucky. It was great to get out and see more of the homeland. Like many New Zealanders, prior to this road trip I’d seen more foreign soil than I had of my own.
The Marina at Picton.

A seal playing. Royal Albatross centre, Otago Peninsula.

Mark of the seagull. Royal Albatross centre, Otago Peninsula.

New Year Rodeo, Wanaka.

View over Wanaka from Mt. Iron.

An earnest Dork impression. Fox Glacier.

Franz Josef Glacier.

Inside an abandoned Gold Mine. Near Greymouth.

Steps carved into the Pancake Rocks. Punakaiki.

Sea-spray through a blow hole. Punakaiki.

There are no photos from the ferry crossing at the end of the trip, but it was eventful enough to remember without them. We crossed in 50-55 knot gales, so at least half of the passengers got seasick, myself included.
-
Ben
Confidence bias in action
I’ve dabbled a little with crowdsourcing for my own projects, but never used it as a primary research tool. It isn’t hard to see how the major crowdsourcing platforms like Mechanical Turk could be used to undertake quick and cost-effective behavioural research (potential for bias notwithstanding!). So, the following study by crowdsourcing firm Crowdflower on its own worker base was interesting in itself. That it related to another interest of mine, human bias, made it even more intriguing :)
Confidence Bias: Evidence from Crowdsourcing
The key take-out: over 75% of contributors overestimated their ability to answer multiple choice questions correctly. The Dunning-Kruger effect is alive and well!
-
Ben
Do not therefore consider this life as an object of any moment. Look back on the immense gulf of time already past; and forwards, to that infinite duration yet to come, and you will find how trifling the difference is between a life of three days and of three ages. Let us then employ properly this moment of time allotted us by fate, and leave the world contentedly; like a ripe olive dropping from its stalk, speaking well of the soil that produced it, and of the tree that bore it.
Marcus Aurelius, Meditations
-
Ben
Looks like Google is getting into the Survey Business
From Nieman Journalism Lab:
Google appears to be experimenting with a new paywall-esque content roadblock for publishers, and it’s not One Pass. For lack of a better name, let’s call it a “survey wall,” because instead of dollars the system asks readers a question before they can move on to continue reading what they like.
This could get interesting. Instead of a standard paywall, people may be able to ‘pay’ for content by answering survey questions. The publisher gets valuable information it can on-sell to advertisers, and Google dulls the old-media knives that are increasingly aimed at its vital organs. A natural extension of this would be that the publisher becomes a survey panel provider of sorts. Survey companies would be able to buy access to the survey-wall to ask their own questions for a fee-per-answer. There is also no reason why independent panel companies could not attempt to step into the role Google appears to be playing as the third-party technology provider.
Of course, there are big questions about the quality of data that may come from these distributed surveys.
- Would people answer honestly?
- What can reasonably be done with one or two answers from each visitor? (e.g., it would be difficult to examine relationships between more than a couple of variables)
- Why would we expect people who visit survey-wall sites to be representative of a given population?
These, and other questions, will keep survey methodologists in business for a while :)
-
Ben
A great write-up on determining sample sizes for, and avoiding common traps in, split testing. Yet another good testing post from the folks at 37signals. R code and discussion of power calcs included. http://37signals.com/svn/posts/3004-ab-testing-tech-note-determining-sample-size
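For a rough feel of the numbers involved, here is a minimal Python sketch of a sample-size calculation for a split test. It is not the R code from the linked post; it assumes the standard normal-approximation formula for comparing two proportions with a two-sided test, and the function name, the default alpha of 0.05 and the default power of 0.8 are my own illustrative choices.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate visitors needed in EACH variant of a two-variant split test.

    Assumes a two-sided test of two proportions using the normal
    approximation. The function name and defaults are illustrative only.
    """
    if p_control == p_variant:
        raise ValueError("conversion rates must differ to size the test")

    # Critical values for the significance level and the desired power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Pooled rate under the null hypothesis of no difference.
    p_bar = (p_control + p_variant) / 2

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_control - p_variant) ** 2)


if __name__ == "__main__":
    # Hypothetical rates: 10% baseline, hoping to detect a lift to 12%.
    print(sample_size_per_variant(0.10, 0.12))
```

The main thing the formula makes obvious is that required sample size grows with the inverse square of the difference you want to detect, so halving the detectable lift roughly quadruples the traffic needed.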
-
Ben
The easy way to find creative commons content: http://search.creativecommons.org
-
Ben
KiwiPycon 2011: Document Classification with the Natural Language Toolkit
I’m heading to KiwiPycon in Welly this weekend to meet some fellow Python fans and give a presentation on using the Python-based Natural Language Toolkit (NLTK) to classify documents. I’ll be using the Enron emails as an example document set.
If you’ve travelled here from the future because you saw the presentation and want the files I referred to, here they are.
- KiwiPycon Presentation with Full Notes (pptx, 1.7MB)
- KiwiPycon Import Enron Data (Python Code)
- KiwiPycon Enron Classifiers (Python Code)
There is a missing link between the two code files: changes I made to the dataset to enable training of the classifier and analysis of the results. If you are interested in getting the final dataset, just get in touch.
Update: Here is the slideshare version of the presentation with audio. And here is a text-to-speech video version, with some extra content.
-
Ben
Links: Using SQL ‘With’ statements, and a great example of A/B Testing
Two links worth keeping:
- Michael Berry at Data Miners Blog describes how to use SQL common table expressions (i.e., WITH statements) to simplify complex queries by creating on-the-fly temp tables (named subqueries) prior to the full query definition.
- Designer Andrei from 37signals describes the outcomes of a number of A/B tests they did on their Highrise signup page, along with the variants they tested.
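To make the first link concrete, here is a minimal sketch of a WITH statement run from Python against an in-memory SQLite database. It is not taken from Michael Berry's post: the orders table, its columns, the sample rows and both function names are hypothetical, and it assumes a SQLite build recent enough to support common table expressions (3.8.3 or later).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [
            ("alice", 30.0),
            ("bob", 5.0),
            ("alice", 20.0),
            ("carol", 12.5),
            ("bob", 2.5),
        ],
    )
    return conn


def top_customers(conn, min_total):
    """Return (customer, total) rows whose summed orders reach min_total."""
    query = """
    WITH customer_totals AS (          -- named subquery, defined up front
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total             -- main query reads it like a table
    FROM customer_totals
    WHERE total >= ?
    ORDER BY total DESC, customer
    """
    return conn.execute(query, (min_total,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for customer, total in top_customers(conn, 10):
        print(customer, total)
```

The aggregation gets a name once, at the top, and the main query then filters it as if it were an ordinary table, which is what keeps longer queries readable.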
davidwallacefleming 9:00 on Friday, November 11, 2011 Permalink
Valuable information to stay apprised of. Thank you. I hope this does not get implemented.