Another article on the Obama team’s use of data: http://goo.gl/mIPsv . Also, http://www.propublica.org/article/everything-we-know-so-far-about-obamas-big-data-operation
That was just one of several ways that Mr. Obama’s campaign operations, some unnoticed by Mr. Romney’s aides in Boston, helped save the president’s candidacy. In Chicago, the campaign recruited a team of behavioral scientists to build an extraordinarily sophisticated database packed with names of millions of undecided voters and potential supporters. The ever-expanding list let the campaign find and register new voters who fit the demographic pattern of Obama backers and methodically track their views through thousands of telephone calls every night.
That allowed the Obama campaign not only to alter the very nature of the electorate, making it younger and less white, but also to create a portrait of shifting voter allegiances. The power of this operation stunned Mr. Romney’s aides on election night, as they saw voters they never even knew existed turn out in places like Osceola County, Fla. “It’s one thing to say you are going to do it; it’s another thing to actually get out there and do it,” said Brian Jones, a senior adviser.
From: this NYT article
I’m heading to KiwiPycon in Welly this weekend to meet some fellow Python fans and give a presentation on using the Python-based Natural Language Toolkit (NLTK) to classify documents. I’ll be using the Enron emails as an example document set.
If you’ve travelled here from the future because you saw the presentation and want the files I referred to, here they are.
There is a missing link between the two code files: changes I made to the dataset to enable training of the classifier and analysis of the results. If you are interested in getting the final dataset, just get in touch.
- KiwiPycon Presentation with Full Notes (pptx, 1.7MB)
- KiwiPycon Import Enron Data (Python Code)
- KiwiPycon Enron Classifiers (Python Code)
Two links worth keeping:
- Michael Berry at Data Miners Blog describes how to use SQL common table expressions (i.e., WITH statements) to simplify complex queries by creating on-the-fly temp tables (named subqueries) prior to the full query definition.
- Designer Andrei from 37Signals describes the outcomes of a number of A/B tests they did on their Highrise signup page, along with the variants they tested.
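For anyone who hasn't met common table expressions before, here is a minimal sketch of the idea from the first link, run through Python's built-in sqlite3 module. It is my own illustration, not Michael Berry's example: the `orders` table, its columns, and the helper names `build_demo_db` and `top_customers` are all made up for the demo. The WITH clause names a subquery (`customer_totals`) up front, so the main query can read from it as if it were a temp table.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 30), ("bob", 5), ("alice", 25), ("carol", 40), ("bob", 10)],
    )
    conn.commit()
    return conn


def top_customers(conn, min_total):
    """Return (customer, total) rows whose summed orders reach min_total."""
    query = """
        WITH customer_totals AS (
            -- Named subquery: one row per customer with their total spend.
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
        )
        SELECT customer, total
        FROM customer_totals
        WHERE total >= ?
        ORDER BY total DESC
    """
    return conn.execute(query, (min_total,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for customer, total in top_customers(conn, 20):
        print(customer, total)
```

The same result could be had with a nested subquery in the FROM clause; the WITH form just keeps the aggregation step readable and lets several later steps reuse it by name.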
Good real-world datasets used to be quite hard to come by for those interested in playing with different modelling approaches. However, sites like Kaggle, which extend the crowdsourcing approach to model improvement pioneered by events such as the Netflix Prize and KDD Cup, are opening up more datasets for statistical modellers to use. Worth keeping an eye on.
It’s no secret that as we interact with more web services we are creating a larger and deeper footprint with respect to our digital behaviours. I think we are also volunteering more personal information when asked online. The result has been an explosion in individual-level data available to data wranglers in organisations with a digital presence. Often, it is the negative sides of this that are reported in the media: the decline of privacy and the risks of data abuse to individuals. However, it also allows for some fascinating aggregate-level analysis that simply wasn’t possible before.
For instance, Google Flu Trends shows how aggregate search behaviours can be used as an early warning signal for potential public health issues.
And then there is a post I recently found which examines correlations across answers to a questionnaire completed by users of a popular dating site… The aim: to identify first-date questions that “(a) most people were comfortable discussing publicly, and (b) were mathematically likely to tell you something you couldn’t just guess”. The analysis isn’t exactly in the interest of public health, but it is hilarious, well thought through, and accessible. And no individual’s data is exposed in the process.
(Note, the content at this link isn’t really safe for work; if it were a TV show there would be a ‘contains explicit language and sexual themes’ disclaimer before it started.)
A couple of gems from the post that apply across the sexes (go to the post for the direction and strength of relationship):
To predict: Will my date have sex on the first date?
Ask: Do you like the taste of beer?
To predict: Is my date religious?
Ask: Do spelling and grammar mistakes annoy you?
And one that shows just how bad we are at judging our common ground with others:
“Which describes you better, normal or weird? might be fine to ask, but doing so is of little value because almost everyone has the same answer. 79% of people think they are weird.”
Disclaimer: The OkCupid sample is large, but probably doesn’t reflect the general population of people looking for partners. So, if you attempt to apply these nuggets of wisdom, your mileage may vary. That said, the differences presented are substantial enough that I’d be surprised if they don’t hold to at least a small degree outside of OkCupid’s target market!