Global Attention Profiles – A working paper

First steps towards a quantitative approach to the study of media attention

Ethan Zuckerman

 

Methodology

The core of the Global Attention Profile project is a set of Perl scripts – scrapers – that query the search engines of nine news media outlets, running 183 automated searches against each. The scrapers collect a single piece of data from each search – the total number of stories available within a given time period for a given search term – and present this data in an HTML table and on a world map. Using previously calculated equations based on a nation’s population and GDP, the scrapers estimate how many stories “should” result from a given search term. The program then calculates the deviation between each prediction and the observed result, and reports this deviation on the HTML table and on the maps.

 

The output of each script is an HTML table and three maps: hit count, deviation from the GDP prediction, and deviation from the population prediction. Scripts can be run at arbitrary intervals, invoked on a Unix system via crontab. They have run daily for the past three months; in the future they will likely run weekly, as the data changes little from day to day.
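
For example, a weekly run might be scheduled with a crontab entry like the one below; the script path and the --config flag are invented for illustration, not the actual GAP invocation:

    # Hypothetical crontab entry: run the CNN scraper every Monday at 06:00
    0 6 * * 1 /usr/local/gap/scraper.pl --config cnn.conf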

 

Scrapers

“Scraper” is a generic term for a program that requests a webpage, selects certain data from it, and returns that data in a different format. Before the advent of syndication formats like RSS, for instance, programmers routinely used scrapers to retrieve news headlines from multiple sources and aggregate them into personal news sites. In this case, scrapers make it possible to rapidly query search engines and select one piece of data from the results: the total number of results the search engine claims to link to. The scraper creates a custom URL – the general URL for querying a specific search engine, plus a query term corresponding to the nation we’re searching for – and, because the website in question believes it is responding to a request from a web browser, receives an HTML file in response. The scraper then uses regular expression matching to retrieve the string that contains the total result count.
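
A minimal sketch of that approach in Perl follows; the site URL, query string and count-matching pattern are placeholders invented for illustration, not the values the GAP scrapers actually use:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Present ourselves as an ordinary browser so the site returns normal HTML.
    my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.0 (compatible)');

    my $base  = 'http://news.example.com/search?q=';    # hypothetical engine
    my $query = 'Argentina';

    my $response = $ua->get($base . $query);
    die 'Request failed: ' . $response->status_line unless $response->is_success;

    # Pull the total result count out of the returned HTML.
    my ($count) = $response->content =~ /of about ([\d,]+) results/;
    $count =~ tr/,//d if defined $count;                # strip thousands separators
    print "$query: ", defined $count ? $count : 'no count matched', "\n";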

 

GAP scrapers query seven websites: news.google.com, www.AltaVista.com/news, query.nytimes.com, pqasb.pqarchiver.com/nypost, www.bbc.co.uk, search.cnn.com, and www.washingtonpost.com, the last of which is queried for AP and Reuters results as well as for Washington Post results – nine news sources in all. Because there is no industry-standard search engine or universally accepted search protocol, each engine must be approached slightly differently. The scrapers draw on four data files, which differ only in how they pass boolean queries to the engine (NOT represented as “-”, “NOT”, “AND NOT”, or not supported at all). Each engine also has its own configuration file, which tells the scraper which data file to use, the base search URL, and the regular expression that matches the total result count.
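
A configuration along the lines described might look like the following; the field names, URL and pattern are invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical per-engine configuration for the scraper.
    my %config = (
        datafile => 'countries-minus.dat',                # boolean NOT written as '-'
        base_url => 'http://news.example.com/search?q=',
        count_re => qr{about ([\d,]+) matches},           # matches the total-results string
    );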

 

The first edition of the GAP scraper was written by Chris Warren; the author is responsible for subsequent versions.

 

Search Terms

The goal of GAP is to compare how different nations are represented by a news media outlet; the first challenge was finding search terms that reliably generate stories on a given nation. For the purposes of GAP, the interesting cases are nations and territories with populations over 100,000 for which current population and GDP statistics are available.

 

Does a search for Argentina return all the Argentina-related stories within a collection? Obviously not. Stories that reference Buenos Aires, but not Argentina, will be missed, as will stories that refer only to Argentines. A search may also be overbroad – a search for Tonga will match stories listing the football teams playing in the World Cup. Is such a story about Tonga or not?

 

Worse still, there are the inconveniently named nations. Searching for Chad will find webpages on country singers and baseball players before identifying the African nation, and Georgia will net far more articles on Atlanta than on Tbilisi.

 

GAP acknowledges all of these difficulties and then ignores most of them. Automatically determining the topic of a piece of text is a notoriously hard problem in computer science; the best automated systems rely on human judgment to construct a corpus of hand-sorted documents for the system to “learn” from. In other words, there is no way to rapidly determine a document’s subject with a high degree of accuracy without both human judgment and expensive, complex software.

 

That is not to suggest that the simple names of nations are the ideal search terms for GAP – a future version may use a string like “Great Britain” OR England OR “United Kingdom” OR Scotland OR Wales OR London OR “Downing Street” OR Welsh OR Scotch OR English OR British to match stories on the United Kingdom. But given the existence of cities named London in both the US and Canada, and of Scotch whisky, such a solution may create more problems than it solves.

 

GAP tries to address the most egregious problems through the judicious use of quotes and boolean terms to constrain overbroad searches. In the cases of Georgia, Chad and the Republic of Congo, it searches for the name of the national capital rather than the name of the nation, and compensates by multiplying the number of results returned by five. (Comparing nation-name and capital-name searches across the twenty nations closest to each of these three in total GDP suggested 5x as the appropriate multiplier.) GAP ignores the United States altogether, given the massive undercounting of stories set in major US cities.
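
A sketch of that workaround in Perl, with an invented run_search() standing in for the scraper’s query routine:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Capitals substituted for inconveniently named nations; the multiplier of 5
    # compensates for the narrower search, as described above.
    my %proxy = (
        'Chad'              => "N'Djamena",
        'Georgia'           => 'Tbilisi',
        'Republic of Congo' => 'Brazzaville',
    );

    sub run_search { return 100 }    # placeholder for the real scraper query

    my $country = 'Chad';
    my $term    = $proxy{$country} || $country;   # fall back to the nation's name
    my $hits    = run_search($term);
    $hits      *= 5 if exists $proxy{$country};
    print "$country: $hits (estimated)\n";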

 

GAP strives to be comparatively accurate, not absolutely accurate. If GAP reports that news.google.com turns up 15,000 results for Japan, one should not conclude that there are exactly 15,000 stories on Japan – there are likely more, and possibly fewer. However, GAP tries hard to be consistently inaccurate, so that when comparing 15,000 results for Japan and 3,000 for Nigeria, it is reasonable to say that there are five times as many stories on Japan as on Nigeria.

 

Search Engines

An ideal search engine, for GAP’s purposes, would support boolean queries, interpret quoted strings as literal strings, report exact and verifiable total result counts, and allow any date range to be queried. Unsurprisingly, the ideal engine does not yet seem to exist[1] – in every case some compromise is necessary, making “apples to apples” comparisons of results inexact. The table below summarizes the characteristics of the seven sites queried:

 

Site                Exact  Verifiable  Boolean          Quotes  Date ranges
AltaVista           Yes    No          Yes (“-”)        Yes     7 days, 30 days, custom range
BBC                 No     Yes         No               Yes     Full archive (back to 1997)
CNN                 Yes    No          Yes (“NOT”)      No      Full archive (back to 1996)
Google              Yes    No          Yes (“-”)        Yes     30 days
NY Post             Yes    Yes         Yes (“AND NOT”)  Yes     2 years; full archive (back to 1998)
NY Times            No     Yes         Yes (“NOT”)      Yes     7, 30, 90, 365 days; full archive (back to 1996)
WPost, AP, Reuters  Yes    Yes         Yes (“NOT”)      Yes     1–14 days

 

Exact – Meaningful comparison of result counts requires an exact number rather than a range. Many engines refuse to give an exact count, offering a string like “more than 1000 matches” for large queries. The BBC does not report a count at all when a query matches fewer than 10 results – the next version of the GAP scraper will accommodate this special case, but the current one does not.
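
Extracting these counts is a matter of pattern matching; the sketch below is illustrative only, since each engine phrases its counts differently and these patterns are invented:

    #!/usr/bin/perl
    use strict;
    use warnings;

    for my $text ('980 matches', 'more than 1000 matches') {
        if ($text =~ /^more than ([\d,]+)/) {
            # Only a lower bound; narrowing the date range can force an exact count.
            print "inexact, at least $1\n";
        } elsif ($text =~ /([\d,]+) matches/) {
            print "exact count: $1\n";
        }
    }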

 

While some problems can be worked around (the “more than” problem can be defeated by specifying a short date range, for instance), others are insurmountable: Fox News Channel (foxnews.com) returns three results for each query, plus an additional ten on each follow-on page, but never reveals how many results or pages are available. Evidently “we report, you decide” doesn’t apply to quantitative analysis of their news coverage.

 

The New York Times appears to provide an exact count, but the count is not believable for date ranges beyond 90 days: year-long and full-archive searches never return more than 1,000 results, implying manual trimming of the archive. The 90-day counts appear believable – they are roughly three times the size of the 30-day counts.

 

Verifiable – When a search engine reports 980 results, users expect to be able to view any of those 980 pages. In three cases, this isn’t possible. CNN provides access only to the top 500 stories matching any query; other matching stories may exist, but only the first 500 can be verified.

 

AltaVista and Google, the only two news aggregators in the set, have major verifiability problems. Google will routinely report 45,000 results for a search, yet paging through the results, as few as 1% of the stories are actually viewable – 45 pages of results rather than the 4,500 one would anticipate.

AltaVista shares this limitation, though it camouflages it: request a high-numbered page of search results on AltaVista and you’ll get the first page of results again!

 

Google has another peculiarity: shifting story counts. While the total number of results returned by news.google.com is constant over short periods of time, the number of stories a user can view varies from moment to moment. The variance is small – under 5% – but it implies that Google queries one server for the total story count and others for the story summaries.

 

These phenomena may be byproducts of performance optimization on the engines’ part. While many users rely on the result count as a rough measure of a search’s scope, very few request the 1000th story returned for a particular search – as a result, the engines are optimized to provide the first piece of data while making it difficult, if not impossible, to retrieve the second. (One might also conclude that AltaVista and Google behave this way to prevent scrapers and bots from spidering their sites by performing broad searches and collecting all the URLs referenced in the results.) In other words, while unverifiable results don’t necessarily imply incorrect results, they are a cause for concern in collecting valid data.

 

Boolean – A search for Ireland is likely to return stories on the Republic of Ireland (the target) and on Northern Ireland, which is part of the UK and hence not part of the target. It’s useful to be able to ask a search engine for Ireland NOT Northern. Indeed, it becomes very important when searching for information on Guinea, which tends to turn up matches on Guinea-Bissau, Equatorial Guinea and Papua New Guinea, not to mention the Gulf of Guinea, guinea hens and guinea pigs.

 

BBC is the only engine queried that does not support boolean searching. As a result, certain results (Guinea, Congo, Niger, Ireland and others) are bound to be overbroad. At the moment, GAP does not compensate for these overbroad matches.

 

Quotes – A search for the quoted string “west bank” should return stories about the Middle East, while a search for the unquoted string west bank should return all those stories, plus stories about, say, a bank opening on the west side of town. CNN does not recognize quotes, and hence searches like “Equatorial Guinea” are overbroad. Again, GAP does not compensate for these overbroad matches.

 

Date – To compare “apples to apples” between data sources, one should look at the same time period for each. Unfortunately, search engines vary widely in the date ranges they permit. Three engines – the New York Post, BBC and CNN – provide only multi-year searches. The Washington Post data sources, which include AP and Reuters, are searchable only for the past 14 days. And the New York Times, while quite flexible in the time periods it permits, allows neither a 14-day search (for easy comparison with the Washington Post) nor believable results for periods over 90 days.

 

Differing date ranges mean it’s impossible to compare CNN and Google without considering the time scope: any comparison needs to recognize that Google measures a short slice of time compared to CNN’s multi-year swath. Short-term phenomena – the war in Iraq, for instance – are likely to affect a month-long sample far more dramatically than a multi-year sample.
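
GAP does not currently normalize for window length, but a simple stories-per-day conversion illustrates why the window matters; the counts and window sizes below are invented:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative counts and search windows in days; CNN's archive reaches back years.
    my %days = (google => 30,     cnn => 365 * 7);
    my %hits = (google => 15_000, cnn => 40_000);

    for my $site (sort keys %hits) {
        printf "%-6s %8.1f stories/day\n", $site, $hits{$site} / $days{$site};
    }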

 

Correlations and estimators

The results of the GAP scraper, by themselves, are relatively unhelpful. What does it mean that CNN has 629 stories on Sudan? Is 629 a lot or a little?

 

GAP attempts to contextualize these counts by building two models for the distribution of the data and comparing actual results to the models. One model is built on population data, the other on GDP. These two statistics were chosen because they have the most thorough data sets available: the World Bank provides 2001 data[2] on population for all but one country (Mayotte), and GDP data for 90% of the countries surveyed. For the remaining nations, the CIA World Factbook’s 2000–2002 estimates[3] were used.

 

To create an estimator model, a correlation between population and result count was assumed. Taking the logarithm of both population and result count produces distributions that appear roughly normal for each data set. The log of story count was graphed against the log of population on a scatterplot, and a line was fit to the results. This linear fit to logarithmic data is equivalent to fitting a curve of the form y = mx^n, where y is the story count, x is the population, and m and n are constants specific to that particular data distribution.[4]

The chart, left, shows the relationship between population and search results on CNN.com on June 11th, 2003. Both axes are scaled logarithmically, and as a result the curve y = 0.0127x^0.6622 appears linear. This equation was used as the estimator for future CNN results. When m and n from this equation, and from the corresponding GDP equation, are plugged into a CNN-specific configuration file, the scraper can estimate how many results it expects under the population and GDP models and calculate the deviation from those expectations. For ease of visualization, it color-codes the deviations, making maps and charts easier to read: deeper shades of red signify greater positive deviations (more results than the model predicted) and deeper shades of blue signify greater negative deviations (fewer results than expected).
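
As a sketch of that calculation in Perl: m and n come from the CNN population fit quoted above, but the log-ratio deviation measure and the color ramp below are assumptions made for illustration, not GAP’s actual scheme:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($m, $n)    = (0.0127, 0.6622);     # CNN population fit from the chart
    my $population = 127_000_000;          # roughly Japan's population
    my $actual     = 15_000;               # observed result count

    my $expected  = $m * $population ** $n;              # y = m * x^n
    my $deviation = log($actual / $expected) / log(10);  # orders of magnitude off

    # Map deviation to a shade: red for positive, blue for negative.
    my $shade = int(255 * abs($deviation) / 2);          # assume +/-2 orders saturates
    $shade    = 255 if $shade > 255;
    my $rgb   = $deviation >= 0 ? sprintf('#%02x0000', $shade)
                                : sprintf('#0000%02x', $shade);
    printf "expected %.0f, observed %d, deviation %+.2f => %s\n",
           $expected, $actual, $deviation, $rgb;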

 

It is tempting to read blue spots on the map as “underrepresented” and red ones as “overrepresented”, but these generalizations come with a strong caveat: blue and red signify under- and overrepresentation relative to a specific model. As the Results section will demonstrate, each model is more or less well correlated with a given data set, and one should expect large deviations from a loosely correlated model.

 

Results from the nine media sources were then correlated with 21 World Bank data sets[5]. With the exception of the aforementioned GDP and population statistics, these sets contain 1999 data. Few are as complete as the GDP or population statistics, so these correlations consider 120–160 pairs of values rather than the 183 considered by the GDP and population correlations[6]. Missing data was not assumed to indicate a zero value; that assumption is generally untrue and would badly skew the correlations. In the case of foreign direct investment and development assistance, negative values – i.e., countries making investments or giving aid – were discarded, because this information is extremely incomplete.
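
In code, pairing scraper results with an indicator series while skipping missing and negative values might look like this; the countries and figures are invented:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %hits      = (Ghana => 120, Laos => 40,    Belize => 15);
    my %indicator = (Ghana => 96,  Laos => undef, Belize => -4);   # undef = no data

    my (@x, @y);
    for my $country (sort keys %hits) {
        next unless defined $indicator{$country};   # missing data is not zero
        next if $indicator{$country} < 0;           # drop negative FDI/aid values
        push @x, $indicator{$country};
        push @y, $hits{$country};
    }
    printf "%d usable pair(s)\n", scalar @x;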

 

Microsoft Excel was used to fit curves to the data. Because Excel performs a power-series curve fit by log-transforming the data, it cannot handle zero values. To work around this, zero values returned by the scrapers were replaced with 0.1 – one tenth of a story. Since this study uses logarithmic scales, the substitution turns the difference between zero stories and one story into a difference of one order of magnitude rather than an infinite one – probably a better representation of what is actually going on, especially for news sites that might fail to report on Vanuatu this week but carry a story on it next week.
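
The same fit can be reproduced outside Excel with ordinary least squares on the log-transformed data; a minimal sketch, with made-up values and the 0.1 substitution described above:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @population = (1e6, 5e6, 2e7, 1e8);
    my @stories    = (3, 0, 120, 900);

    my (@x, @y);
    for my $i (0 .. $#population) {
        my $count = $stories[$i] > 0 ? $stories[$i] : 0.1;   # avoid log(0)
        push @x, log($population[$i]);
        push @y, log($count);
    }

    # Ordinary least squares on the logs: log(y) = a + b * log(x).
    my $k = @x;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $#x) {
        $sx  += $x[$i];        $sy  += $y[$i];
        $sxx += $x[$i] ** 2;   $sxy += $x[$i] * $y[$i];
    }
    my $b = ($k * $sxy - $sx * $sy) / ($k * $sxx - $sx ** 2);
    my $a = ($sy - $b * $sx) / $k;

    # Back on the original scale this is stories = m * population^n.
    printf "m = %.4g, n = %.4g\n", exp($a), $b;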

 

Visualization

Some of the most interesting patterns GAP reveals are geographic. For instance, it is easier to see that the BBC focuses substantial reporting attention on former British colonies in east and southern Africa once the results are plotted on a map. For this reason, the GAP scraper outputs three maps as well as a table of values. The maps represent the result count in percentage terms, the deviation from the population estimate, and the deviation from the GDP estimate.

 

To create these maps automatically, the GAP scraper calls on mapper, a package built by programmer Nate Kurz to enable automated mapping using ImageMagick, a leading open-source imaging tool. Mapper reads a data file containing x,y coordinates for each nation represented on a specific map; some discontinuous nations (Indonesia, for instance) require multiple coordinate points. Mapper defines a region as one or more coordinate pairs, accepts commands to fill a region with a given color, coded as hexadecimal RGB values, and outputs PNGs, GIFs or JPEGs.[7]
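
Mapper has not yet been publicly released (see footnote 7), so the sketch below is an invented illustration of the kind of input the description implies – the real file formats, names and coordinates may differ entirely:

    # Hypothetical coordinates file: one region per line, one or more x,y points
    sudan       330,270
    indonesia   412,316  436,320  455,327

    # Hypothetical command stream: fill regions with hex RGB colors, then render
    fill sudan      #cc0000
    fill indonesia  #0000aa
    render worldmap.png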

 




[1] Commercial news aggregators like Lexis/Nexis come close to being this ideal search engine. Since they log all stories into their own database, one can search across multiple media outlets with a common set of keywords and time periods. Once GAP is modified to search a news aggregator, much of the complexity detailed in this section will be extraneous.

[2] From the World Development Indicators Database, via the Data Query online tool: http://www.worldbank.org/data/dataquery.html, accessed June 15, 2003.

[3] CIA World Factbook, http://www.cia.gov/cia/publications/factbook, accessed July 31, 2003. Unfortunately, the CIA World Factbook’s GDP estimates are in purchasing power parity dollars, not real US dollars. As there is no easy way to convert PPP to real dollars, the use of the Factbook introduces some error into GDP comparisons.

[4] Indeed, when Microsoft Excel fits a power series to a set of data, it appears to take the logarithm of both sets and fits a linear equation to the transformed data. As a result, Excel is unable to perform power series fits to data that includes zeros, as it’s impossible to take the logarithm of zero.

[5] World Development Indicators Database.

[6] As a result, comparing correlation coefficients is somewhat imprecise, as a data set with 120 data points will show slightly different correlations than one with 180 data points.

[7] The author and Mr. Kurz plan to release mapper to the Open Source community at some point in the future. If you need the code in the meantime, please contact ethan@geekcorps.org.