An exploration of people and data management, the evolution of learning and the scientific method in an era of data-intensive distributed computing, and efficient knowledge capture and distribution using the web. Probably other stuff, too.
MMDS 2008 and CIM
Last week the second Workshop on Algorithms for Modern Massive Data Sets was held at Stanford University. This workshop had an incredible density of prestigious speakers from the field of machine learning; I guess spending a weekend in California during the summer is an easy sell.
Each of the workshop’s four days had a theme. The official themes:
- Data Analysis and Data Applications
- Networked Data and Algorithmic Tools
- Statistical, Geometric, and Topological Methods
- Machine Learning and Dimensionality Reduction
After attending several talks by these machine learning luminaries over the course of four days, I tried to pull together a few common themes of my own for further exploration:
- Incomplete Dyadic Data
- Distributing Data and Computation
- Manifolds with Noise
I’ll try to put a detailed post up on each of these three topics this week.
Over the past few years I’ve become a regular conference attendee. It seems plausible that I will continue to attend KDD, VLDB, SIGMOD, and MMDS for years to come, leaving me with a major question: where can I find a longitudinal analysis of the content of these conferences?
It seems that these conferences provide an excellent yearly cross-section of the state of their respective fields, but I’m much more interested in the deltas. What are the new topics, which topics are making forward progress, and which topics are losing steam? Taking time to collect this information would provide some fascinating insight into the progress of science.
The KDD community has done some analysis of the DBLP data set, but not with the aims proposed above. From the VLDB community has come some work that is a bit closer in intent: AnHai Doan’s Cimple project. So far, they’ve produced the moderately useful DBLife, and they’ve outlined a promising research direction.
I’ve recently started a conference type on Freebase and have a simple tool to monitor conference progress using their API. I’d like to make this tool more general and robust. Any CIM students or conference geeks looking for a summer project?
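To give a flavor of what such a tool looks like, here is a minimal sketch of an MQL read against Freebase. The type id `/user/example/conference` and its property names are placeholders I’ve invented for illustration; the real ids depend on where the conference type was created.

```python
import json

def conference_query(series_name):
    """Build an MQL query for conferences in a given series.

    The type id "/user/example/conference" and its properties are
    placeholders, not the actual ids of the type on Freebase.
    """
    return [{
        "type": "/user/example/conference",
        "series": series_name,
        "name": None,        # a null asks Freebase to fill in the value
        "start_date": None,
    }]

def to_envelope(query):
    """Wrap a query the way the mqlread service expects: {"query": ...}."""
    return json.dumps({"query": query})

# The envelope would then be sent to the mqlread endpoint over HTTP;
# the network call itself is omitted here.
```

The nice property of MQL is that the query is just the shape of the answer you want, with nulls where data should be filled in, so monitoring a new conference series is a one-line change.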
Python is an amazing language for many reasons, but prototyping code in its interactive interpreter has spoiled me: my development speed is crippled in languages that lack a REPL.
I recently came across CERN’s CINT, a REPL for C/C++. Finally! I have come to dread coding in these languages, especially C++, but I’m now looking forward to my next big C/C++ project. If you’ve used CINT, drop me a line.
Wired Magazine: (Brief) Dispatches from the “Petabyte Age”
I tend to geek hard when a publication drops a series of articles about petabyte-scale data analysis. Frequent culprits include the SIGMOD Record or Teradata Magazine, but this week, Wired Magazine splashed into the pool.
Chris Anderson (the long tail one) starts things off by declaring that the data deluge spells trouble for the scientific method. Clearly this claim is false: Google, the poster child of the data deluge, owes its success almost entirely to rapid product iteration driven by thousands of hypothesis tests every month, which is the scientific method in action. To give Chris the benefit of the doubt, he seems to be asserting instead that model development should happen in close concert with the collection of empirical data. I couldn’t agree more.
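The workhorse behind those product iterations is the humble A/B test. As a sketch of the arithmetic, here is a standard two-proportion z-test in plain Python; the sample numbers in the usage below are made up for illustration.

```python
from math import sqrt, erf

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test: did variant B really move the metric?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pool the proportions under the null hypothesis that A and B are the same.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# E.g., 10,000 users per variant, click-through 10.0% vs 11.0%:
# two_proportion_z(1000, 10000, 1100, 10000) yields a significant result.
```

Run a few thousand of these a month against live traffic and you have exactly the empirical, model-light science the article claims is dying.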
Near the end of the article, Chris makes a common (if frustrating) mistake: he claims that the cluster used for the NSF’s CluE program will be running the Google File System, and he credits Google and IBM with building the software that will power it. In actuality, Yahoo deserves almost all of the credit: they have done an incredible job scaling the Hadoop project to thousands of nodes, and that work is what made the NSF program possible. Just another testament to the effectiveness of Google’s PR machine.
The rest of the articles cover many application areas of large scale data analysis in brief, including agriculture, astronomy, high-energy physics, politics, epidemiology, and insurance. There’s also a startlingly incoherent attempt to describe how MapReduce works that probably should have been left out.
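For contrast, the MapReduce programming model is simple enough to sketch in a few lines of Python. This toy runs in a single process; the real systems partition the map, shuffle, and reduce phases across thousands of machines, but the contract is the same.

```python
from collections import defaultdict

def mapreduce(inputs, mapper, reducer):
    """Single-process sketch of the MapReduce flow: map, shuffle, reduce."""
    shuffled = defaultdict(list)
    # Map: each input record yields zero or more (key, value) pairs.
    for record in inputs:
        for key, value in mapper(record):
            shuffled[key].append(value)  # Shuffle: group values by key.
    # Reduce: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in shuffled.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word, 1

def count_reducer(word, ones):
    """Sum the 1s emitted for a word to get its total count."""
    return sum(ones)

# mapreduce(["the quick fox", "the fox"], word_mapper, count_reducer)
# computes per-word counts across all the input lines.
```

That’s the whole idea: the programmer writes the mapper and reducer, and the framework handles distribution, grouping, and fault tolerance.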
Wired’s intentions were noble but their execution was not up to expectations. I’m hoping that tomorrow’s MMDS Workshop will have a bit more substance.
Bloomberg for the Web? We Need Real-Time News, Data, and Analytics.
The NYT had an article today about the battle for efficient financial services information delivery currently heating up between Bloomberg and Thomson Reuters. A few weeks ago, there were articles about the NASDAQ and NYSE making their stock quotes available in real time to multiple information outlets. And just last week, Tibco, once a wholly owned subsidiary of Reuters, purchased Insightful. Insightful makes S-Plus, a commercial implementation of the S language; the open-source R project implements that same language. Tibco already owns Spotfire, a firm that makes excellent software for data exploration.
These developments made me think back to my time in financial services. Trading floors have the highest concentration of numerate people I’ve ever been around. They also have ready access to superb information manipulation software.
Now consider the current state of analytics software for the web. I won’t bother to list the major competitors, as they are all pretty mediocre. As the online advertising space grows in sophistication and mechanisms to promote a more efficient market are introduced (see, for example, Right Media and ContextWeb’s ADSDAQ), a workbench for the real-time exploration of news and data related to the web will be a necessary tool for many quantitative marketers.
As Hal Varian points out, marketing is the next field to be overrun with quants, and I expect that the tools most useful in finance will be brought along for the invasion.
Creative Commons Technology Summit 2008
Joi Ito started the day off with a lucid delineation of CC’s major components. He pointed out that the technical side of CC hopes to create a set of standards for digital media exchange, in a similar spirit to what the IETF does for the internet as a whole. The political side of CC is more akin to the Open Internet Coalition, which is fighting to keep these standards in neutral hands.
The later sessions started to drift away from my core interests, but I was intrigued by the proliferation of digital copyright registries: Registered Commons, SafeCreative, and Noank Media, for example. It was great to see Attributor have a presence at the summit. They’re also heavy users of Hadoop—I am once again impressed by what Jim has built.
I’ve enjoyed watching Creative Commons evolve over the past several years and I’m still holding out some hope that I’ll be able to have a material impact on their success some time in the future. For now, it’s great to keep up with the team; I trust they’re in good hands with a fellow Cavalier leading the way.
In a recent post I mentioned the idea of reproducible research. It turns out that this year’s SIGMOD conference, where we’ll be presenting a new approach to structured storage, has conducted an experiment in reproducible research. You can read a fascinating account of the experiment in this month’s SIGMOD Record. While you’re there, be sure to check out the “Data Management Projects at Google” article as well!
Hadoop at Facebook
There’s a post on the Facebook Engineering blog today from one of the Data team engineers, Joydeep Sen Sarma, discussing how we use Hadoop here at Facebook. Check it out.