Source
Nations term file
States term file
Cities term file
Recent results from OverCluster
Nations (depth 20)
Nations (depth 40)
States and Provinces (depth 20)
States and Provinces (depth 40)
Cities (depth 20)
Cities (depth 40)
The other two variables you may wish to customize are $depth and $cluster. $depth defines how many subsidiary results from Overture the script considers - set it below 10 and you'll get very little clustering; setting it above 100 probably doesn't help, as Overture results seem to peter out at around 50 results for most terms. I tend to leave it at 40 when I'm looking for lots of clusters and 20 when I'm looking for a few. $cluster defines the size of what OverCluster considers a cluster - a cluster is any group of primary terms larger than $cluster, i.e., set $cluster to 4 and you'll get clusters of 5 or more terms. Setting $cluster low gets you lots of clusters (indeed, set it to 0 and every term will be a cluster); setting it high gets you few.
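To make the interplay of the two settings concrete, here is a rough sketch in Python. It is not the script's actual code: the function name find_clusters and the input shape (a dict mapping each primary term to its ordered list of subsidiary terms) are made up for illustration, and it assumes a cluster is the set of primary terms that share one subsidiary term.

```python
from collections import defaultdict


def find_clusters(results, depth=40, cluster=4):
    """Group primary terms by the subsidiary terms they share.

    results: hypothetical dict of primary term -> ordered list of
             subsidiary terms (best result first).
    depth:   how many subsidiary results to consider per primary term.
    cluster: a group counts as a cluster only if it has MORE than
             this many primary terms.
    """
    members = defaultdict(set)
    for primary, secondaries in results.items():
        # Only the first `depth` subsidiary results are considered.
        for secondary in secondaries[:depth]:
            members[secondary].add(primary)
    # Keep groups strictly larger than the threshold.
    return {
        secondary: sorted(primaries)
        for secondary, primaries in members.items()
        if len(primaries) > cluster
    }


sample = {
    "ghana": ["food", "map", "flag"],
    "kenya": ["map", "safari", "food"],
    "peru": ["map", "food", "hotel"],
}
print(find_clusters(sample, depth=2, cluster=2))
```

With depth=2, only "map" is shared by more than two primary terms; raising depth to 3 also brings in "food", and dropping cluster to 0 turns every subsidiary term into a cluster.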
All four variables appear in the code shortly below the "Usage" text.
There's another option buried in the code that lets you change the clustering behavior, so that clusters are ranked by number of results for the secondary term, rather than the ratio, or alphabetically by subsidiary term - it's around line 230 and requires you to comment out one "sort" statement and uncomment another one.
Because OC is a scraper, it will break - guaranteed - if Overture changes the format of their Keyword Selector page. When this happens, I'll do my best to rebuild the regexes as soon as I can. If I am herding yaks in Kyrgyzstan when this happens, my best will not be very good.
The one bit of pre-clustering I do is eliminate "orphaned n's". These occur when "Indonesia" turns up "Indonesian Food" as a secondary term. The script eliminates the primary search term, resulting in "n food" as a secondary search term. So I've added elimination of orphaned "n"s, "an"s and "ian"s at the beginning of a resulting secondary search string. This means you can end up with "food" and "food" under "Ghana" if "Ghana" resulted in "Ghana Food" and "Ghanaian Food". It's a bug - I'll fix it at some point.
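The orphan cleanup can be sketched like this in Python. Again, this is an illustration and not the script's own regexes: the names ORPHAN_PREFIX and secondary_term are invented here, and the sketch assumes terms are lowercased before the primary term is removed.

```python
import re

# Leftover suffix fragments ("n", "an", "ian") at the start of the
# string, followed by whitespace.
ORPHAN_PREFIX = re.compile(r"^(?:ian|an|n)\s+")


def secondary_term(primary, result):
    """Remove the primary term from a result, then strip any orphan."""
    remainder = result.lower().replace(primary.lower(), "", 1).strip()
    return ORPHAN_PREFIX.sub("", remainder)


print(secondary_term("Indonesia", "Indonesian Food"))
print([secondary_term("Ghana", r) for r in ["Ghana Food", "Ghanaian Food"]])
```

The first call yields "food" rather than "n food". The second shows the duplicate described above: both Ghana results collapse to "food", so the term appears twice unless duplicates are merged afterwards.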
If you're consistently having trouble with a search term, feed it into the Overture interface manually. I've discovered that Overture dislikes plurals - it turns "United States" into "united state", which breaks some of OC's functionality. If the Overture interface is transforming your search term, change the term in your term file to the form Overture prefers.
Questions, comments, complaints - contact me at ezuckerman AT cyber DOT law DOT harvard DOT edu. Please do let me know if you end up trying it out.