Word occurrences in documents

While investigating compression schemes I extracted some statistics from a (random) subset of our data so far. The statistics show how many times a word occurs in each document, counting only documents that contain the word. The subset may be a slightly biased and incomplete representation of the complete web, but the differences are intriguing nonetheless.
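As a rough illustration of the statistic described above, here is a minimal sketch that collects per-document occurrence counts for each word. The function name `occurrences_per_document`, the whitespace tokenization, and the sample documents are assumptions made for this example, not the method used to produce the actual statistics.

```python
from collections import Counter, defaultdict

def occurrences_per_document(documents):
    """Map each word to a list of its counts, one entry per document containing it."""
    per_word = defaultdict(list)
    for text in documents:
        # Naive tokenization (lowercase, split on whitespace); an assumption.
        counts = Counter(text.lower().split())
        for word, n in counts.items():
            per_word[word].append(n)
    return per_word

# Hypothetical sample documents.
docs = ["the cat sat on the mat", "the dog", "a cat and a cat"]
stats = occurrences_per_document(docs)

for word in ("the", "cat", "dog"):
    print(word, stats[word])
```

Documents that do not contain a word contribute nothing to that word's list, which matches the "in documents that contain the word" restriction.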


Are giraffes bad? (word variations and search widening)

When a user types the question “are giraffes bad?” into a search engine, the engine could search for exactly those words. That would most likely produce somewhat helpful results, but the results would not include documents related to giraffe badness that don’t use exactly those words.
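The limitation of exact-word matching can be sketched in a few lines. The helpers `tokenize` and `exact_match` and the sample documents are hypothetical and only illustrate the problem; they are not how any particular search engine works.

```python
def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip("?.,!").lower() for w in text.split()}

def exact_match(query, documents):
    """Return the documents that contain every query word exactly."""
    query_words = tokenize(query)
    return [d for d in documents if query_words <= tokenize(d)]

# Hypothetical documents: the second is on topic but uses different words.
docs = [
    "Are giraffes bad? Some say so.",
    "Why a giraffe is dangerous to lions",
]
hits = exact_match("are giraffes bad?", docs)
print(hits)
```

Only the first document matches. The second is about the same subject but says "giraffe" and "dangerous" rather than "giraffes" and "bad", so an exact-word search never sees it.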
