GAP is intended to be a long-term project, looking at an increasing number of media sources, correlating to a larger universe of data sets and entertaining theories of causality as well as correlation. Some steps likely to be taken in the future:
• More data sources. Some of the most interesting future directions for GAP come from comparing similar media sources, like BBC and CNN. It would be ideal to be able to compare the 20 largest newspapers in the US on a regular basis, and to compare US newspapers to international English-language news sources. Some of this is simple legwork – figuring out what certain search engines do and don’t support and creating appropriate configuration and keyword files.
Some of the most interesting news sources are available through information retrieval services like Lexis. Unfortunately, these services are generally configured to be “unscrapeable”, using checksums to create custom URLs that cannot be requested by automated tools. Fortunately, Lexis does include excellent facilities for performing automated searches and having the results mailed to you. One design for the next scraper takes its input from mail, rather than from the web, and relies heavily on mail pre-processing through procmail.
In the future, GAP will deal with non-English language media as well. This will require a thorough rewrite of keyword lists, but should not require major code changes.
• Database driven. Current GAP scripts have no sense of history – they’re not aware of what results they generated a week or a month ago. Those analyses need to be performed by hand. In the future, GAP scrapers will log their results into a database, making it possible to see how a particular news source represented certain search terms over a period of time. This is critical for resolving questions about the reliability of data models.
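A minimal sketch of what such logging might look like, written in Python with SQLite rather than in the Perl that GAP currently uses. The table name, column names and function names are all hypothetical and are not part of any existing GAP code; the idea is only that each scraper run stores one row per source and search term, so that a term's history can be read back later.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) a results log. The schema is an assumption for illustration."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS story_counts (
               run_date TEXT NOT NULL,     -- ISO date of the scraper run
               source   TEXT NOT NULL,     -- e.g. a news site's name
               term     TEXT NOT NULL,     -- the search term (e.g. a country name)
               stories  INTEGER NOT NULL,  -- number of stories reported
               PRIMARY KEY (run_date, source, term)
           )"""
    )
    return conn


def log_result(conn, run_date, source, term, stories):
    """Record one scraper result; re-running the same day overwrites the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO story_counts VALUES (?, ?, ?, ?)",
        (run_date, source, term, stories),
    )
    conn.commit()


def history(conn, source, term):
    """Return [(run_date, stories), ...] for one source and term, oldest first."""
    rows = conn.execute(
        "SELECT run_date, stories FROM story_counts "
        "WHERE source = ? AND term = ? ORDER BY run_date",
        (source, term),
    )
    return rows.fetchall()
```

Storing dates as ISO strings keeps them sortable as plain text, which is why the history query can simply order by run_date.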
• Influential media index. Right now all stories reported by a source like news.google.com have the same weight, whether reported by the Wall Street Journal or the Samoa Observer, despite WSJ’s significantly larger reach and influence on the international community. The Influential Media Index will attempt to identify the twenty most influential media sources and track their attention on a daily basis, providing both summarized information and information on how each individual source deviates from the mean.
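One way the "deviation from the mean" part might be computed is sketched below in Python. This is an illustration under stated assumptions, not a description of how the index will actually be built: it assumes each source's story counts are first converted to shares of that source's total coverage, and that the mean is an unweighted average of those shares across sources. The function names are hypothetical.

```python
def attention_shares(counts):
    """Convert {country: story_count} into {country: share of this source's stories}."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("source has no stories")
    return {country: n / total for country, n in counts.items()}


def deviation_from_mean(by_source):
    """by_source maps source name -> {country: story_count}.

    Returns (mean, deviation): mean[country] is the unweighted average share
    across sources, and deviation[source][country] is that source's share
    minus the mean share. A country missing from a source counts as zero.
    """
    shares = {src: attention_shares(c) for src, c in by_source.items()}
    countries = sorted({c for s in shares.values() for c in s})
    mean = {
        c: sum(s.get(c, 0.0) for s in shares.values()) / len(shares)
        for c in countries
    }
    deviation = {
        src: {c: s.get(c, 0.0) - mean[c] for c in countries}
        for src, s in shares.items()
    }
    return mean, deviation
```

Working in shares rather than raw counts keeps a high-volume source from dominating the mean simply because it publishes more stories overall.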
• Multiple Factor Correlation. All correlation studies performed on GAP data thus far consider a single factor at a time. In the next round of analysis, it will be interesting to combine factors and see if any combination gives a near-perfect estimation of media distribution for a particular media source.
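As a rough illustration of what combining factors could mean, the Python sketch below computes the share of variance in story counts explained by two factors together, using the standard two-predictor formula R² = (r_y1² + r_y2² − 2·r_y1·r_y2·r_12) / (1 − r_12²). The function names are hypothetical, and the choice of ordinary linear regression is an assumption; the analysis could equally use some other model.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def r_squared_two_factors(y, x1, x2):
    """R^2 of a linear regression of y on x1 and x2 together.

    Undefined (division by zero) when x1 and x2 are perfectly correlated,
    since the two factors then carry the same information.
    """
    r_y1 = pearson(y, x1)
    r_y2 = pearson(y, x2)
    r_12 = pearson(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
```

Comparing this value with pearson(y, x1) ** 2 and pearson(y, x2) ** 2 shows how much the combination adds over either factor alone.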
• Newswire Analysis. Andrew McLaughlin and Diane Cabell, both of the Berkman Center, independently suggested that GAP could track both large newswires like AP and Reuters, and individual newspapers that use these newswires for foreign coverage. A comparison would reveal a great deal about editorial decision-making on international coverage. Do newspapers run very little news on Africa because little is available from wire services? Or do they deprioritize available stories due to perceived lack of reader interest?
• The Media Coefficient. One of the most interesting statistics in development economics is the Gini coefficient. It measures the difference between the actual distribution of wealth in a country and the theoretical, perfectly equal distribution. The result is a number between 0 (perfect equality) and 1 (perfect inequality). Using the same technique, media inequality could be computed by comparing the actual story count of a nation to a "perfect" media distribution, where everyone in the world gets equal attention from the media. Alternatively, one could calculate differences between an individual outlet's curve and a mean curve, like the proposed Influential Media Index.
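The first of these two approaches can be sketched in Python as follows. This is one possible reading of the idea, not a settled definition: it assumes the "perfect" distribution gives each nation a share of stories equal to its share of world population, ranks nations by stories per person, and takes the coefficient as one minus twice the area under the resulting Lorenz curve. The function name and input layout are hypothetical.

```python
def media_gini(countries):
    """Gini-style coefficient of media attention.

    countries is a list of (population, stories) pairs, one per nation.
    Returns 0 when every person receives equal attention, and approaches 1
    as all stories concentrate on a vanishingly small share of people.
    """
    if any(pop <= 0 for pop, _ in countries):
        raise ValueError("populations must be positive")
    total_pop = sum(pop for pop, _ in countries)
    total_stories = sum(stories for _, stories in countries)
    if total_stories <= 0:
        raise ValueError("no stories to distribute")

    # Rank nations from least to most stories per person.
    ranked = sorted(countries, key=lambda c: c[1] / c[0])

    # Area under the Lorenz curve (cumulative population share on the
    # horizontal axis, cumulative story share on the vertical), by trapezoids.
    area = 0.0
    cum_stories = 0.0
    for pop, stories in ranked:
        prev = cum_stories
        cum_stories += stories / total_stories
        area += (pop / total_pop) * (prev + cum_stories) / 2

    return 1 - 2 * area
```

With two equally populous nations where one receives every story, this returns 0.5; with stories exactly proportional to population, it returns 0.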
• Open Source Tools. In the near future, the scripts behind GAP will be released under the GPL or a similar license. It is the author’s hope that these tools, primitive as they are, could be useful to other researchers interested in quantitative media analysis. It is further hoped that this paper is the first in a long series of studies, and that the author will not be the only one performing said studies.
Chris Warren wrote the original scraper designed to pull data from Google News – the current ScrapeNews program builds heavily on his code, and this project would not have been possible without his original code and his advice on my code.
Nate Kurz authored Mapper, an extremely useful interface to ImageMagick, which turned the mapping of GAP data from an evening-long manual task to a momentary automatic task. GAP would be far less pretty without his code and his input on my code.
Special thanks to Jesse Ross, server wrangler extraordinaire, for his help with getting GAP code to play nicely with the Berkman servers.
GAP relies heavily on two open source tools: Perl and ImageMagick. The existence of tools like these turns GAP from a year-long project to something a reasonably inept programmer can create in a few weeks. Long live Open Source!
Many thanks to colleagues who’ve read early drafts of this paper and offered critique and advice, especially Gerry Wyckoff, Kira Maginnis, Nate Kurz, Zach Yeskel, Noah Eisenkraft, Andrew McLaughlin, Diane Cabell, and Jonathan Zittrain. Special thanks to Rachel Barenblat and Andrew McLaughlin for their extensive editing.
Thanks to the Berkman Center for Internet and Society at Harvard Law School, and especially director John Palfrey, for supporting this research.
 Jonathan Zittrain of Harvard Law School points out that the analogy between media and money may well break down. While a world where everyone had the same amount of money might be a very nice place, a world where everyone were equally famous might be very strange. Or might not...