Mainly Data

An exploration of people and data management, the evolution of learning and the scientific method in an era of data-intensive distributed computing, and efficient knowledge capture and distribution using the web. Probably other stuff, too.

Jun 30

MMDS 2008 and CIM

Last week the second Workshop on Algorithms for Modern Massive Data Sets was held at Stanford University.  This workshop had an incredible density of prestigious speakers from the field of machine learning; I guess spending a few days in California during the summer is an easy sell.

Each of the four days of the conference had a theme.  The official themes:

  1. Data Analysis and Data Applications
  2. Networked Data and Algorithmic Tools
  3. Statistical, Geometric, and Topological Methods
  4. Machine Learning and Dimensionality Reduction

After attending several talks by these machine learning luminaries over the course of four days, I tried to pull together a few common themes of my own for further exploration:

  1. Incomplete Dyadic Data
  2. Distributing Data and Computation
  3. Manifolds with Noise

I’ll try to put a detailed post up on each of these three topics this week.

Over the past few years I’ve become a regular conference attendee.  It seems plausible that I will continue to attend KDD, VLDB, SIGMOD, and MMDS for years to come, leaving me with a major question: where can I find a longitudinal analysis of the content of these conferences?

It seems that these conferences provide an excellent yearly cross-section of the state of their respective fields, but I’m much more interested in the deltas.  What are the new topics, which topics are making forward progress, and which topics are losing steam?  Taking time to collect this information would provide some fascinating insight into the progress of science.
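To make the idea of "deltas" concrete, here is a minimal sketch of the kind of comparison I have in mind. It assumes each accepted paper has already been tagged with a single topic label, which is the hard part and is not solved here. The function names and the toy data are made up for illustration and do not come from any real conference program.

```python
from collections import Counter


def topic_shares(topics):
    """Return each topic's share of one year's papers (fractions summing to 1)."""
    counts = Counter(topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def topic_deltas(by_year, earlier, later):
    """Classify topics as new, rising, or declining between two years."""
    before = topic_shares(by_year[earlier])
    after = topic_shares(by_year[later])
    report = {"new": [], "rising": [], "declining": []}
    for topic in sorted(set(before) | set(after)):
        change = after.get(topic, 0.0) - before.get(topic, 0.0)
        if topic not in before:
            report["new"].append(topic)
        elif change > 0:
            report["rising"].append(topic)
        elif change < 0:
            # Includes topics that disappeared entirely in the later year.
            report["declining"].append(topic)
    return report


# Made-up topic labels, one per paper; NOT real conference data.
papers_by_year = {
    2007: ["clustering", "clustering", "graph mining", "privacy"],
    2008: ["clustering", "graph mining", "graph mining", "matrix completion"],
}

report = topic_deltas(papers_by_year, 2007, 2008)
```

Comparing shares rather than raw counts keeps a conference that simply accepted more papers from making every topic look like it is growing. A real version would need many years of data and a sensible way to assign topics in the first place.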

The KDD community has done some analysis of the DBLP data set, but not with the aims proposed above.  Some work from the VLDB community is a bit closer in intent: AnHai Doan’s Cimple project on community information management (CIM).  So far, they’ve produced the moderately useful DBLife, and they’ve outlined a promising research direction.

I’ve recently started a conference type on Freebase and have a simple tool to monitor conference progress using their API.  I’d like to make this tool more general and robust.  Any CIM students or conference geeks looking for a summer project?

