What is data mining? Is there a little tiny prospector living in my computer?
According to theatlantic.com:
Discovering information from data takes two major forms: description and prediction. At the scale we are talking about, it is hard to know what the data shows. Data mining is used to simplify and summarize the data in a manner that we can understand, and then allow us to infer things about specific cases based on the patterns we have observed. Of course, specific applications of data mining methods are limited by the data and computing power available, and are tailored for specific needs and goals. However, there are several main types of pattern detection that are commonly used. These general forms illustrate what data mining can do.
Anomaly detection: In a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.
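To make the tax-return idea concrete, here is a minimal sketch of one simple way to flag outliers: a z-score test, which measures how many standard deviations a value sits from the average. The function name, the deduction figures, and the cutoff of 2.0 are all made up for illustration; nothing here describes how the IRS actually does it.

```python
import statistics

def flag_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # Every value is identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical deduction amounts claimed on eight returns (illustrative only).
deductions = [5200, 4800, 5100, 4950, 5050, 5000, 4900, 25000]
flagged = flag_anomalies(deductions)
print(flagged)
```

With a sample this small, a single extreme value inflates the standard deviation, so real systems would likely use more robust statistics. The sketch only shows the basic shape of the idea.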
Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising. Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.
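As a rough sketch of what an "association" looks like in practice, the snippet below computes two standard measures for a rule such as "shaker and recipe book, therefore martini glasses": support (how often the whole combination appears) and confidence (how often the glasses appear given the other two). The baskets and the function name are invented for illustration and say nothing about how Amazon or Netflix actually work.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain everything on the left-hand side of the rule.
    with_antecedent = [b for b in baskets if antecedent <= b]
    # Of those, the baskets that also contain the right-hand side.
    with_both = [b for b in with_antecedent if consequent <= b]
    support = len(with_both) / len(baskets) if baskets else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical shopping baskets (illustrative only).
baskets = [
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book", "martini glasses", "ice bucket"},
    {"cocktail shaker", "recipe book"},
    {"recipe book", "apron"},
    {"martini glasses"},
]
support, confidence = rule_stats(
    baskets, {"cocktail shaker", "recipe book"}, {"martini glasses"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A retailer would scan for rules whose support and confidence both clear some cutoff, then use those rules to decide which coupon or ad to show.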
Cluster detection: one type of pattern recognition that is particularly useful is recognizing distinct clusters or sub-categories within the data. Without data mining, an analyst would have to look at the data and decide on a set of categories which they believe captures the relevant distinctions between apparent groups in the data. This would risk missing important categories. With data mining it is possible to let the data itself determine the groups. This is one of the black-box type of algorithms that are hard to understand. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.
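One common clustering method is k-means, sketched below in plain Python. This is an assumption on my part: the quoted article does not name an algorithm. Note also that k-means has to be told how many groups to look for, and the spending figures here are invented. Each shopper is a pair of (gardening spend, fishing spend).

```python
from math import dist

def k_means(points, k, iterations=20):
    """Group points into k clusters; returns a list of k lists of points."""
    # Naive initialisation: use the first k points as starting centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # Nothing moved, so the grouping is stable.
        centroids = new_centroids
    return clusters

# Hypothetical (gardening spend, fishing spend) per shopper.
shoppers = [(90, 5), (5, 90), (85, 10), (10, 85), (95, 8), (8, 95)]
gardeners, anglers = k_means(shoppers, k=2)
print(gardeners)
print(anglers)
```

Nobody told the code which shoppers were gardeners. The grouping falls out of the distances between the points, which is the sense in which the data "determines the groups."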
Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. Learning from a large set of pre-classified examples, algorithms can detect persistent systemic differences between items in each group and apply these rules to new classification problems. Spam filters are a great example of this – large sets of emails that have been identified as spam have enabled filters to notice differences in word usage between legitimate and spam messages, and classify incoming messages according to these rules with a high degree of accuracy.
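The spam-filter idea can be sketched with a tiny naive Bayes classifier, a classic approach to this problem, though the quoted article does not say which method real filters use. The training messages, labels, and function names below are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        words_in_label = sum(word_counts[label].values())
        # Start from how common the label is overall.
        score = log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += log(count / (words_in_label + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical pre-classified messages (illustrative only).
examples = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, label_counts = train(examples)
print(classify("claim your free money", word_counts, label_counts))
print(classify("agenda for the team meeting", word_counts, label_counts))
```

The "rules" the filter learns are just word frequencies: "free" and "claim" show up in the spam pile, "agenda" and "team" in the legitimate pile.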
Regression: Data mining can be used to construct predictive models based on many variables. Facebook, for example, might be interested in predicting future engagement for a user based on past behavior. Factors like the amount of personal information shared, number of photos tagged, friend requests initiated or accepted, comments, likes etc. could all be included in such a model. Over time, this model could be honed to include or weight things differently as Facebook compares how the predictions differ from observed behavior. Ultimately these findings could be used to guide design in order to encourage more of the behaviors that seem to lead to increased engagement over time.
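For a feel of what such a predictive model looks like, here is a sketch of the simplest case: ordinary least squares with a single predictor. The article describes models with many variables. This one uses only a made-up "photos tagged" count to predict a made-up "weekly sessions" number, and the function name is my own.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x varies on its own, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: photos tagged vs. weekly sessions (illustrative only).
photos_tagged = [0, 2, 4, 6, 8]
weekly_sessions = [1, 5, 9, 13, 17]
slope, intercept = fit_line(photos_tagged, weekly_sessions)

# Predict engagement for a user who tagged 10 photos.
predicted = slope * 10 + intercept
print(slope, intercept, predicted)
```

"Honing" the model, in the article's words, amounts to refitting as new observations arrive and comparing the predictions against what users actually did.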
Predictive behavior. Hmmm. Like, which presidential candidate an American is going to vote for?
The Obama Administration is betting the house that they can predict the mood of the country through data mining.
The depth and breadth of the Obama campaign’s 2012 digital operation, from data mining to online organizing, reaches so far beyond anything politics has ever seen, experts maintain, that it could impact the outcome of a close presidential election. It makes the president’s much-heralded 2008 social media juggernaut, which raised half a billion dollars and revolutionized politics, look like cavemen with stone tablets.
The article goes on to say that what sets the 2012 campaign apart from 2008’s is that Obama’s campaign staff have:
• Created a holistic, totally in-house digital operation that is the largest department at campaign headquarters. In 2008, much of the social media and video was generated organically from supporters. As one campaign official put it, “digital is no longer a part of the campaign. It is the campaign.”
• Hired a number of nonpolitical tech innovators, software engineers and statisticians. “It has been incredibly freeing, because all election campaigns are a slave to history, and the history here is just nonexistent,” says Obama campaign manager Jim Messina. “So, we’ve been able to kind of reinvent it.”
• Invested mightily in cutting-edge technology that scales the website to fit the screen of any device. With nearly half of the U.S. population using smart phones, “responsive design” allows a user to give money and volunteer without bifocals. “More than 40 percent of all our donors are new, and a lot of them are coming in because of things like this,” says Messina. “Call up our website and try to donate on your phone and then do Romney’s. … Those things are important, because people are busy and people want to help us and they think about — ‘Oh, yeah, I saw the president on TV. I want to give them money. How hard is it?’ ”
• Developed a more complex symbiosis between the campaign and Facebook, which is 10 times bigger than it was four years ago, and has far more personal information available to mine. “Facebook was just a site to see friends four years ago; now it is part of people’s DNA,” notes a senior campaign adviser. Obama invites supporters to log on to the campaign through their Facebook accounts, which gives the campaign one more avenue for data.
• Opened the first all-volunteer, all-digital office in San Francisco where knowledgeable techies drop in for a few hours and strive to develop new software for the campaign under the supervision of paid staff.
• Staffed a full-time digital director in each of about a dozen battleground states to effectively run mini-general election campaigns in those states.
Now, it seems to me that analyzing results gleaned from a database is a poor substitute for actually listening to the American people.
But, hey, I guess that’s just my opinion.