Wired Magazine: (Brief) Dispatches from the “Petabyte Age”
I tend to geek hard when a publication drops a series of articles about petabyte-scale data analysis. The usual culprits are the SIGMOD Record and Teradata Magazine, but this week Wired Magazine splashed into the pool.
Chris Anderson (the long tail one) starts things off by declaring that the data deluge spells trouble for the scientific method. This claim is plainly false: Google, the poster child for data-driven business, owes its success in large part to rapid product iteration driven by thousands of hypothesis tests every month, which is nothing but the scientific method running at industrial scale. To give Chris the benefit of the doubt, he seems to be asserting instead that model development should be done in close concert with the collection of empirical data. I couldn't agree more.
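Since "thousands of hypothesis tests" may sound abstract, here's a toy sketch of the statistical core of one such A/B experiment: a two-proportion z-test on click-through rates. All the numbers and function names below are made up for illustration; they're mine, not Google's.

```python
from math import sqrt, erf

def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided p-value for H0: the click-through rates are equal."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    # Pooled rate under the null hypothesis.
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical experiment: variant B bumps click-through from 10.0% to 10.6%.
p = ab_test(clicks_a=5000, views_a=50000, clicks_b=5300, views_b=50000)
print(f"p-value: {p:.4f}")  # small p => reject H0, ship variant B
```

Collect data, test a hypothesis, keep what survives: that's the scientific method, just automated.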
Near the end of the article, Chris makes a common (if frustrating) mistake: he claims that the cluster used for the NSF's CluE program will run the Google File System, and he credits Google and IBM with building the software that powers it. In actuality, Yahoo deserves almost all of the credit: they have done an incredible job scaling the Hadoop project to thousands of nodes, which is what made the NSF program possible. Just another testament to the effectiveness of Google's PR machine.
The rest of the articles briefly cover many application areas of large-scale data analysis, including agriculture, astronomy, high-energy physics, politics, epidemiology, and insurance. There's also a startlingly incoherent attempt to describe how MapReduce works that probably should have been left out.
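For the curious, a coherent description doesn't take much. Here's a minimal single-process sketch of the MapReduce model applied to word counting; the phase names are my own illustration of the idea, not Hadoop's actual API, and real frameworks distribute these phases across thousands of machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (key, value) pair for each word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for a single key."""
    return (key, sum(values))

documents = ["the data deluge", "the petabyte age"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts)  # {'the': 2, 'data': 1, 'deluge': 1, 'petabyte': 1, 'age': 1}
```

That's it: map emits key-value pairs, the shuffle groups them by key, and reduce aggregates each group. The magic is in the distributed plumbing, not the programming model.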
Wired’s intentions were noble, but the execution fell short of expectations. I’m hoping that tomorrow’s MMDS Workshop will have a bit more substance.