Hi. This is Ethan's project page for OverCluster, a web scraper that targets Overture's Keyword Selector Tool. OverCluster accepts a text file of search terms and feeds those terms, serially, to KST. It reports the monthly searches for each term from KST, calculates how often a secondary term was added to the primary term (in ratio to the primary term) and then clusters secondary terms, showing which primary terms led to a specific secondary term. Output is in the form of an HTML document.

Source
Nations term file
States term file
Cities term file

Recent results from OverCluster
Nations (depth 20)
Nations (depth 40)
States and Provinces (depth 20)
States and Provinces (depth 40)
Cities (depth 20)
Cities (depth 40)

Requirements:

OverCluster requires you to have Perl 5 installed on your system. I've not tested it on anything other than Perl 5.8.1, but I suspect it will run on most Perl 5 installations. It requires two libraries - Date::Calc and LWP::UserAgent - both of which are available from CPAN.

Installation:

There are two variables that require customization - $sourcedir and $destination. $sourcedir should be the directory on your system where your term files live; $destination needs to be a directory that the script can write to and create subdirectories in.

The other two variables you may wish to customize are $depth and $cluster. $depth defines how many subsidiary results from Overture the script considers - set it below 10 and you'll get very little clustering; setting it above 100 probably doesn't help you, as Overture results seem to peter out for most terms around 50 results. I tend to leave it set at 40 when I'm looking for lots of clusters, 20 when I'm looking for a few clusters. $cluster defines the size of what OverCluster considers a cluster - a cluster is any group of primary terms larger than $cluster - i.e., set $cluster to 4 and you'll get clusters of 5 or more terms. Setting $cluster low gets you lots of clusters (indeed, set it to 0 and every term will be a cluster); setting it high gets you few.
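To make the interaction between $depth and $cluster concrete, here's a rough sketch of the idea in Python. It is not the script's actual Perl code: the function name, the shape of the input data and the sample numbers are all made up for illustration, and the ratio is assumed to be the secondary term's monthly searches divided by the primary term's.

```python
from collections import defaultdict

def build_clusters(results, depth=40, cluster=4):
    """Group primary terms under the secondary terms they share.

    results is a hypothetical structure:
        {primary: (primary_count, [(secondary, count), ...])}
    with the secondary list in the order the results were returned.
    """
    groups = defaultdict(list)
    for primary, (primary_count, secondaries) in results.items():
        # Only the first `depth` subsidiary results are considered.
        for secondary, count in secondaries[:depth]:
            # Assumed ratio: secondary searches relative to primary searches.
            ratio = count / primary_count if primary_count else 0.0
            groups[secondary].append((primary, ratio))
    # A cluster is any group with MORE than `cluster` primary terms.
    return {s: members for s, members in groups.items() if len(members) > cluster}

# Invented sample data, not real results.
sample = {
    "ghana": (1000, [("food", 100), ("map", 50)]),
    "kenya": (2000, [("food", 500), ("safari", 300)]),
}
clusters = build_clusters(sample, depth=20, cluster=1)
```

With the sample above, only "food" survives a $cluster of 1, since it's the only secondary term shared by two primary terms. Drop the threshold to 0 and "map" and "safari" become clusters too.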

All four variables appear in the code shortly below the "Usage" text.

There's another option buried in the code that lets you change the clustering behavior, so that clusters are ranked by the number of results for the secondary term, or alphabetically by secondary term, rather than by the ratio. It's around line 230 and requires you to comment out one "sort" statement and uncomment another one.
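The three orderings amount to three different sort keys. Here's a small Python sketch of that idea. It's only an illustration, not the "sort" statements in the script: the field names, the sample values and the choice of descending order for the numeric rankings are my assumptions.

```python
# Each hypothetical cluster record carries the secondary term, its number of
# results, and a ratio. The field names are invented for this example.
SORT_KEYS = {
    "ratio": lambda c: -c["ratio"],      # highest ratio first (assumed)
    "results": lambda c: -c["results"],  # most results first (assumed)
    "alpha": lambda c: c["term"],        # alphabetical by secondary term
}

def rank_clusters(clusters, order="ratio"):
    """Return the clusters sorted by the chosen ordering."""
    return sorted(clusters, key=SORT_KEYS[order])

# Invented sample data.
sample_clusters = [
    {"term": "news", "results": 900, "ratio": 0.10},
    {"term": "food", "results": 300, "ratio": 0.25},
    {"term": "map", "results": 500, "ratio": 0.05},
]
by_ratio = [c["term"] for c in rank_clusters(sample_clusters, "ratio")]
by_results = [c["term"] for c in rank_clusters(sample_clusters, "results")]
by_alpha = [c["term"] for c in rank_clusters(sample_clusters, "alpha")]
```

Swapping the key is the Python equivalent of commenting out one "sort" statement and uncommenting another.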

Warnings:

OverCluster is a scraper. There are those in the web community who think scrapers are Bad Things. Indeed, they can be, if they're abused. Please don't abuse OverCluster and cause Overture to take down the Keyword Selector Tool or put it behind a CAPTCHA. Specifically, that means:

Because OC is a scraper, it will break - guaranteed - if Overture changes the format of their Keyword Selector page. When this happens, I'll do my best to rebuild the regexes as soon as I can. If I am herding yaks in Kyrgyzstan when this happens, my best will not be very good.

Bugs, missing features and other peculiarities:

Overture does a bad job of pre-clustering their search terms - if you search for "nation", you're likely to get both "news" and "news about" as secondary terms. I could write some regexes that would combine terms, but while they'd work for my purposes, they'd make the tool less generally useful. If I write a next revision, I may include the ability to load a file of regexes that do custom clustering.

The one bit of pre-clustering I do is eliminate "orphaned n's". These occur when "Indonesia" turns up "Indonesian Food" as a secondary term. The script eliminates the primary search term, resulting in "n food" as a secondary search term. So I've added elimination of orphaned "n"s, "an"s and "ian"s at the beginning of a resulting secondary search string. This means you can end up with "food" and "food" under "Ghana" if "Ghana" resulted in "Ghana Food" and "Ghanaian Food". It's a bug - I'll fix it at some point.
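Here's a minimal Python sketch of the orphan elimination described above, including the duplicate "food" problem. It is not the script's actual Perl regex: the function name and the exact pattern are assumptions made for illustration.

```python
import re

# Matches an orphaned "n", "an" or "ian" at the start of the string,
# but only when it stands alone as its own word.
ORPHAN = re.compile(r"^(?:ian|an|n)\s+")

def secondary_term(primary, phrase):
    """Strip the primary term from a result phrase, then drop any orphan."""
    rest = phrase.lower().replace(primary.lower(), "", 1)
    rest = " ".join(rest.split())  # tidy up leftover whitespace
    return ORPHAN.sub("", rest)

indonesia = secondary_term("Indonesia", "Indonesian Food")
ghana = [secondary_term("Ghana", p) for p in ("Ghana Food", "Ghanaian Food")]
news = secondary_term("Ghana", "Ghana News")
```

"Indonesian Food" becomes "n food" and then "food". Both Ghana phrases also collapse to "food", which is exactly how the duplicate entries arise. Because the pattern requires whitespace after the orphan, a real word like "news" is left alone.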

If you're consistently having trouble with a search term, feed it into the Overture interface manually. I've discovered that Overture dislikes plurals - it turns "United States" into "united state", which breaks some of OC's functionality. If the Overture interface is transforming your search term, change the term in your term file to the form Overture prefers.

Questions, comments, complaints - contact me at ezuckerman AT cyber DOT law DOT harvard DOT edu. Please do let me know if you end up trying it out.