Word occurrences in documents

While investigating some compression schemes, I extracted some statistics from a (random) subset of the data we have so far.

The statistic is how many times a word occurs per document, counting only documents that contain the word. The subset may be a slightly biased and incomplete representation of the complete web, but the differences are intriguing nonetheless. The counts include all occurrences of the word (in the title, keywords, description, body text, headings, …). Synonyms and variations are not counted, so this is exact-match only.
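As a rough illustration of the kind of number this is (a sketch only, not the code that produced the figures below), the statistic could be computed along these lines. The function name `occurrences_per_document`, the use of the arithmetic mean, the case-insensitive matching, and the word-boundary regex are all assumptions made for the example.

```python
import re
from statistics import mean


def occurrences_per_document(term, documents):
    """Mean number of exact-match occurrences of `term` per document,
    taken only over the documents that contain the term at least once.

    Assumptions for this sketch: matching is case-insensitive, and a
    match must sit on word boundaries, so "potatoes" does not count
    as "potato".
    """
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    counts = []
    for text in documents:
        n = len(pattern.findall(text.lower()))
        if n > 0:  # documents without the term are left out entirely
            counts.append(n)
    return mean(counts) if counts else 0.0


# Tiny made-up corpus, for illustration only.
docs = [
    "Potato soup: peel one potato, then boil the potato.",
    "A potato a day.",
    "No vegetables here.",
]
print(occurrences_per_document("potato", docs))
```

Because documents without the word are excluded, the result can never drop below 1 for a word that occurs anywhere in the corpus.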

Animal/vegetable/berry?

  • giraffe: 2
  • rabbit: 2
  • penguin: 2
  • banana: 2.5
  • cucumber: 2
  • potato: 5
  • strawberries: 2
  • lingonberries: 1.2

Why is “potato” repeated 5 times in each document while “lingonberries” barely gets past a single occurrence?

Brands

  • linux: 10
  • lego: 4
  • iphone: 5
  • mercedes: 3.5

I’m not sure if one can draw any conclusions from that, but I have a hunch that documents with “lego” in them will refer to that word only once and after that just use the word “piece”. I have no theory for the oddly high count for “linux”.

City names

City names turned out to have some intriguing differences:

  • Roskilde: 4.5
  • Copenhagen: 2.5
  • Stockholm: 2.5
  • Berlin: 3.5
  • New York: 5
  • San Diego: 5
  • Palermo: 2
  • Madrid: 2.5
  • Toronto: 6.5
  • Karachi: 2

Are well-known city names repeated more? No. Are names of big cities repeated more? No. Are northern city names repeated more than southern city names? No. I can find no pattern in these numbers. Canada, please tell us what is going on 🙂