Possessive apostrophe-s and search engines

The apostrophe is normally used for the English possessive-s, but there are anomalies.

In English the apostrophe is used for the possessive s, such as

  • John’s cat

What does that have to do with search engines? Two things.

Indexing words and word combinations

Our search engine (like most other search engines) indexes words, not strings. So when we index “John’s cat” we split the text into tokens:

  • John
  • s
  • cat

and index that. In addition we also index word pairs/bigrams:

  • John + s
  • s + cat

For texts such as “my red car”, “the little mermaid” or “quick guide to handling cats”, bigrams work really well for improving search precision and result quality. They don’t work so well for the possessive apostrophe-s case. So we ignore the apostrophe and also index:

  • Johns
  • Johns + cat (bigram)

And search precision goes up because the tokens “john” and “s” are strongly connected and we consider them a single word.
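The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual indexer: the function name index_terms, the regex-based tokenizer, the lowercasing, and the choice to treat only U+0027 and U+2019 as apostrophes are all assumptions made for the example.

```python
import re

# Assumed tokenizer: a word is a run of letters/digits; anything else separates words.
WORD_RE = re.compile(r"\w+")
# Assumed set of "real" apostrophes: ASCII apostrophe and right single quotation mark.
APOSTROPHES = {"'", "\u2019"}

def index_terms(text):
    """Return (unigrams, bigrams), including the joined possessive forms."""
    matches = list(WORD_RE.finditer(text))
    words = [m.group().lower() for m in matches]

    unigrams = list(words)
    bigrams = list(zip(words, words[1:]))

    for i in range(len(matches) - 1):
        # The text between two adjacent tokens tells us what separated them.
        gap = text[matches[i].end():matches[i + 1].start()]
        if gap in APOSTROPHES and words[i + 1] == "s":
            joined = words[i] + "s"          # "john" + "s" -> "johns"
            unigrams.append(joined)
            if i + 2 < len(words):           # bigram with the word after the "s"
                bigrams.append((joined, words[i + 2]))
    return unigrams, bigrams

print(index_terms("John\u2019s cat"))
```

For “John’s cat” this yields the unigrams john, s, cat, johns and the bigrams john + s, s + cat, johns + cat. The original tokens are kept alongside the joined form, so queries written either way can still match.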

This is not limited to English. The …s suffix is used in other Common-Germanic languages (Dutch/German/Norwegian/…) but the orthography in those languages does not use the apostrophe (e.g. they just use plain “Svends kat”). However, the possessive apostrophe is seen in some company names, signs, and informal text. Whether that is unofficial orthography, anglophone influence or bad sign makers – we don’t judge. We just deal with it.

It looks like an apostrophe, but …

The blotch between the main noun and the possessive-s is normally an apostrophe. But our crawler also encounters other signs used in place of the apostrophe:

  • “John`s cat”: grave accent (U+0060)
  • “John´s cat”: acute accent (U+00B4)
  • “John‘s cat”: left single quotation mark (U+2018)
  • “John‛s cat”: single high-reversed-9 quotation mark (U+201B)

Sometimes it looks like the work of people unfamiliar with the apostrophe, or of a too-helpful word processing program. Sometimes it is a mystery how a particular not-apostrophe codepoint got involved. But the meaning is still clear. I looked through the relevant blocks in Unicode and selected the codepoints that, from a distance, look like an apostrophe.
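One way to handle these lookalikes is to fold them into a real apostrophe before tokenizing. The sketch below is an assumption about how that could be done, not the crawler's actual code: the name normalize_possessive is hypothetical, it covers only the four codepoints listed above, and it replaces a lookalike only when it sits between a word character and a final “s”, so that backticks and opening quotes elsewhere are left alone.

```python
import re

# The four apostrophe lookalikes listed above; a real list may be longer.
LOOKALIKES = (
    "\u0060"  # grave accent
    "\u00b4"  # acute accent
    "\u2018"  # left single quotation mark
    "\u201b"  # single high-reversed-9 quotation mark
)

# A lookalike preceded by a word character and followed by a word-final "s".
POSSESSIVE_RE = re.compile(r"(?<=\w)[" + LOOKALIKES + r"](?=[sS]\b)")

def normalize_possessive(text):
    """Replace apostrophe lookalikes in possessive position with U+2019."""
    return POSSESSIVE_RE.sub("\u2019", text)

for sample in ["John`s cat", "John\u00b4s cat", "John\u2018s cat", "John\u201bs cat"]:
    print(normalize_possessive(sample))
```

All four samples come out as “John’s cat”, while text such as “say ‘so’” is unchanged because the quotation mark does not follow a word character.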

Apostrophe for contractions/omissions

In English and in a few other languages the apostrophe is also used to show contractions, e.g.

  • “John’s saved the cat”
  • “The cat’s playing with John”

where the “’s” is a contraction of “has” or “is”. Distinguishing between possessive and contraction would require NLP, and we don’t do that. When we get something similar to STO for English we could make more precise bigrams.

Further reading