Word occurrences in documents

While investigating compression schemes, I extracted some statistics from a (random) subset of our data so far. The statistics record how many times a word occurs in each document, counting only the documents that contain the word. The subset may be a slightly biased and incomplete representation of the complete web, but the differences are intriguing nonetheless.
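
To make the statistic concrete, here is a minimal Python sketch of how such per-document counts could be gathered. It is not the code used for the real data: the function name `occurrences_per_document`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def occurrences_per_document(documents):
    """Map each word to a list of its counts, one entry per document containing it.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which is an assumption and far cruder than a real web pipeline would use.
    """
    per_word = defaultdict(list)
    for text in documents:
        counts = Counter(text.lower().split())
        # Only words present in this document appear in `counts`, so
        # documents that lack a word never contribute a zero for it.
        for word, n in counts.items():
            per_word[word].append(n)
    return dict(per_word)


def mean_occurrences(per_word):
    """Average count of each word over the documents that contain it."""
    return {word: sum(ns) / len(ns) for word, ns in per_word.items()}


# Toy input, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog",
    "a cat and a dog and a bird",
]
stats = occurrences_per_document(docs)
means = mean_occurrences(stats)
```

With this toy input, `stats["the"]` holds one count for each of the two documents containing "the", and the third document contributes nothing to it. That is the "in documents that contain the word" restriction described above.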

