An exploration of people and data management, the evolution of learning and the scientific method in an era of data-intensive distributed computing, and efficient knowledge capture and distribution using the web. Probably other stuff, too.
peer reviewed journals for source code and data: narrative forms for the modern scientific method
greg wilson has a great post about a new journal devoted to source code for biology and medicine. it’s a fascinating idea, and greg asks, “why don’t we do this?”. clearly there are many reasons; one is the significant financial reward that awaits programmers who solve problems better than others over an extended period. in the research community, the prestige of publishing a result often outweighs the financial rewards of keeping that result to yourself.
even for computer science professors, the allure of financial rewards keeps them from providing their code to the open source community for criticism: see, for example, ken birman’s licensing of the astrolabe [pdf] source code to amazon for a large sum.
suppose we could align incentives appropriately and a healthy community of peer reviewed journals for source code emerged. how would you structure one of the articles in these journals? they would probably take inspiration from don knuth’s literate programming (see also knuth’s book on the topic). tools for literate programming seem to be getting more sophisticated recently (especially in the two languages i use most, python and r). another related idea is reproducible research, which involves publishing empirical data in addition to code. it’s a logical extension of literate programming to the scientific realm.
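to make the literate programming idea concrete, here’s a minimal sketch in python. the function and data are hypothetical, invented purely for illustration; the point is the form: the narrative (docstring) and a verifiable example (doctest) live alongside the implementation, so a reader of the “article” can re-run the claims directly.

```python
# a toy literate-style module: prose, example, and code travel together.
# running the module checks that the documented example still holds.

def sample_mean(xs):
    """return the arithmetic mean of a non-empty sequence of numbers.

    the example below doubles as documentation and as a test, in the
    spirit of reproducible research:

    >>> sample_mean([1, 2, 3, 4])
    2.5
    """
    return sum(xs) / float(len(xs))

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

a journal article structured this way would interleave such blocks with the surrounding argument, and a reviewer could execute the whole document to verify the published results.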
as an aside, it’s somewhat notable that these journals are emerging outside of the computer science community, where code is needed primarily to manufacture results. i suppose it’s indicative of the strange relationship computer science professors have with programming, as pointed out to me by a professor of statistics this weekend.
in fields where i meander from time to time, there has been some progress on this issue. well, sort of, though none are quite analogous to the journal highlighted by greg’s post. statistics has the journal of statistical software; machine learning has mloss; and databases have the vldb experiments and analyses papers.
i’ve always enjoyed books that use actual code to illustrate their ideas (example one, example two), and i’d love to see this trend extend to the academic literature. as we all adjust to the new tools of modern science (hypotheses, code, and data), having a unified narrative method for conveying your results to other researchers will grow in importance. it looks like medicine and biology are moving pretty quickly on this front; if you know of any other examples that illustrate how certain fields are moving forward with literate programming or reproducible research, please send them my way!
mathematicians and infrastructure
we were recently invited to give a talk at this year’s sigmod. it’s quite an honor. another talk on the same industrial track is being given by some folks from google about megastore, a layer they’ve written on top of bigtable to make it easier to build web applications. the full abstract and authors list is below.
the last author is a former classmate of mine at harvard and a fellow mathematics major. we recently had another former mathematician come by the facebook offices to present his work on distributed storage: peter braam, who architected the lustre file system.
Megastore: A Scalable Data System for User Facing Applications
JJ Furman, Jonas S Karlsson, Jean-Michel Leon, Alex Lloyd, Steve Newman, and Philip Zeyliger
Megastore provides a rich model and API that facilitates implementation of user facing applications storing data in Bigtable. Our goal is to enable Google developers to quickly build and launch highly available applications at Google scale. We extend Bigtable to provide strong consistency guarantees and higher level abstractions such as transactions, secondary indexes and synchronous replication. Megastore takes a practical approach to schema management, providing integrated declarative schemas with rich data extensions, such as logical data partitioning, which is key to achieve high performance querying and scalable massively parallel transactions.
measuring an ad bundle
in my last post, a few firms were mentioned as possible partners for the major studios and brand advertisers as they search for a consolidated and effective set of metrics for evaluating the success of their new advertising products. clearly the list was not complete; here are two more firms that could help measure the success of ad bundles:
- omniture: for a brand advertiser, web analytics is no more exotic than old school performance marketing. however, because the internet is a new technology (bear with me, we’re talking about media firms here) and each website has a different structure, it’s much more difficult to produce relevant metrics across sites and campaigns. omniture has a strong hold on the traditional web analytics market and could make a strong move into integrated ad bundle measurement as more ad spends move online. if they can complete the visual sciences integration and keep up the strong revenue growth, they’ll be well positioned to grow into the larger audience measurement space.
- spot runner: spot runner is creating a really compelling proposition for getting your campaign on television. given the rise of ad bundles, and because spot runner is native to the web, they may be in the strongest position to capture some of the integrated audience measurement business that will be emerging. it looks as though they’ll use their new war chest to expand the basic product into other markets; in addition, i hope they keep an eye on the smaller audience measurement firms mentioned in the last post. if they can start to hit a meaningful run rate, an acquisition would be a reasonable next move.
media web trail: follow up
in a previous post i poked at a few things going on in media that seem significant: the diversification of distribution channels for content producers and the proliferation of potential products to be packaged and distributed throughout the content creation process.
today fortune ran an article about the decreasing importance of upfronts. according to the article, media companies used to lock up 80% of their advertising revenues through this single source; in recent years, the sales process has evolved into a continuous engagement between advertisers and content producers. apparently the content producers are having some success selling “ad bundles”, which are “cross-promotional deals where advertisers purchase traditional spots on broadcast TV along with, say, the sponsorship of a program when it is streamed on the network’s website”.
for these ad bundles to grow, the content producers really need standardization of the various components of an ad bundle. they also need metrics to assess the effectiveness of the ad bundle, and the advertisers need a way to relate the metrics for the ad bundle to the metrics for a pure television spot.
if you’re aware of any media partnerships with these media measurement firms that have proven successful with advertisers, drop me a line! i’m especially interested to learn if any firms are using immi: their technology looks rad.
statistics and government organizations
while digging up some numbers for my previous post i came across some interesting government organizations related to statistics. given that science is measurement, let’s take a look at some government organizations that aggregate measurements.
in the eu, there’s eurostat, whose focus is on pulling together these entities from across the member states. in the uk, last month saw the creation of the uk statistics authority, as required by the statistics and registration service act of 2007. the goal of this new, independent entity is to monitor and report on all official statistics and to provide oversight for the office for national statistics. all of this reform is apparently the result of a eurobarometer survey which showed that british people trust their official statistics less than the citizens of any other member nation. in france, another nation known for its history of achievements in statistics and probability, things seem to be a bit more settled: the insee has been operating since 1946.
in the united states, we have an array of statistics-gathering entities accessible via the fedstats portal. most critical statistics are collected via the economics and statistics administration, under the department of commerce, or via the bureau of labor statistics, which is under the department of labor. the cia, an independent agency, produces its own set of worldwide statistics and makes them available via their factbook. i wonder what kind of master data management or data quality initiatives they have in place to make sure these numbers are aligned?
let’s take a look at a few other countries. in china, the national bureau of statistics has a website that is easy to use and frequently updated. israel’s central bureau of statistics seems to be a similarly modern institution, but india’s ministry of statistics and programme implementation has a website due for an overhaul.
there are a few international organizations with similar missions, as well: the un has the statistics division and the oecd has sourceoecd. for information on countries not mentioned above, both the un and the oecd have links to their respective statistics authorities.
in a future post, i hope to go deeper into what sorts of statistics these entities collect and what problems they hope to address with the intelligent application of statistics. i wonder if they have any data warehouse architects on hand to treat their organizations as giant data integration challenges; if you have any information on the history of any of these institutions or their technical underpinnings, drop me a line! i picked up schumpeter’s history of economic analysis hoping to find some leads myself.
of course the folks at gapminder and swivel, among others, have started collecting this data for you to manipulate. but while the data is interesting, it’s primarily the ways in which the data and analyses influence policy that interests me. i’d love to see a site collecting mentions of data in the deliberations of government so that we can better understand how the data we collect is ultimately used to make decisions in government.
media web trail
producing quality content for the consumption of many is a fairly involved process. at some point, a media company decides to snapshot the product and begin distributing that snapshot through whatever channels they have available: radio, television, newspapers, magazines, those television screens in elevators, tied to the back of an airplane, whatever.
the process continues after the snapshot, as consumers of that media place it in context and annotate its content. a creative mind can apprehend the value in all components of this process, both prior to and after the snapshot, and package many products in addition to the primary output of the project: “making of” shows, outtakes, soundtracks, etc. each of these secondary outputs requires rigorous control of the production process so that you can capture the secondary outputs and manipulate their distribution to generate revenue.
there are so many interaction points (distribution channels) now between the consumer and all parts of the production process that modern media companies are not able to keep up with the proliferation. as traditional distribution channels throw off less and less revenue, media companies are scrambling to restrict and better define the alternative interaction points with consumers: after all, you must define before you can control.
some interesting movements recently in the world of modern media:
- one: cbs, trying to use showtime’s recent critical success to strengthen their bargaining position with the pay station’s content partners, has seen their strategy backfire: viacom, their recently separated twin, is working with mgm, lions gate, and now blockbuster (wtf?!) to aggregate content into another premium cable station (read: distribution channel).
- two: cbs had an okay first quarter without the superbowl, even upping their dividend (is that bravado?). they break out their $3.7 billion in revenues into three operating segments (distribution channels): television ($2.6 billion), radio ($364 million), and outdoor ($497 million). given that they are creating television shows, own the rights to premium sports content, and now have a little movie-making division, i’d love to see them reconceptualize their operating segments around properties of content production and report on which content performed well in the different distribution channels. they’re sitting on almost a billion dollars in free cash flow and are clearly looking to make acquisitions in the online and outdoor distribution channels, but i’d be interested in seeing them reconsolidate around their strength as a content creator.
- three: youtube, an alternative distribution channel for video content, is probably starting to realize they can’t monetize the long tail of ugc; they’re going to need premium content to drive alternative revenue generating strategies. major media companies, of course, are trying to create yet another alternative distribution channel rather than realizing their position of strength as premium content creators. it’s cool, though; hulu is light years ahead of youtube on user experience. that’s what happens when you have to focus on scaling your infrastructure and preventing abuse instead of offering a good user experience.
- four: publicis, who knows how to create and distribute content, are starting to turn the crank on their digitas acquisition. their partnership with google is compelling; to be honest, i think they have more to teach google than to learn. i’d love to see content producers just get it over with and merge with these agencies.
- five: twx finally washes their hands of twc. some interesting quotes in here from bewkes: “in a fragmenting world, we think brands will increasingly matter more, not less” and “we believe that ultimately all packaged media will move to digital distribution”. these statements are fairly lucid predictions, but i’m not sure twx is well positioned to benefit wholly from either trend.