Are giraffes bad? (word variations and search widening)

When a user types the question “are giraffes bad?” into a search engine, the engine could search for exactly those words. That would most likely produce somewhat helpful results, but it would miss documents about giraffe badness that don’t use exactly those words.

If we consider synonyms and closely related words, we can expand the query into variations that mean the same or almost the same, thereby finding more candidate documents and hopefully producing results that are more useful to the user.

You could use a dictionary-based approach to find synonyms, but we have found that such a simple method doesn’t produce results as good as we would like, because we are looking not only for synonyms but also for closely related words. A synonym-based approach would consider “giraffe” and “giraffid” synonyms (maybe) but not find the closely related word “giraffes”. It also relies heavily on the quality of the synonym list in the dictionary.
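To make the limitation concrete, here is a minimal sketch of a dictionary-based lookup. The `SYNONYMS` table and the `dictionary_variations` function are hypothetical names invented for this illustration, not part of any real search engine.

```python
# Hypothetical, hand-made synonym dictionary; a real one would be far larger.
SYNONYMS = {
    "giraffe": ["giraffid"],
    "bad": ["evil", "malicious"],
}


def dictionary_variations(word):
    """Return the word itself plus any synonyms the dictionary lists for it."""
    return [word] + SYNONYMS.get(word.lower(), [])


print(dictionary_variations("giraffe"))   # ['giraffe', 'giraffid']
print(dictionary_variations("giraffes"))  # ['giraffes']
```

The plural “giraffes” has no entry of its own, so the lookup returns nothing beyond the word itself, even though “giraffe” is the most obvious related word.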

Variations on “are giraffes bad”

If we instead consider word variations from the point of view of grammatical analysis, we find more useful word variations and get a bit more insight.

Verb (“are”) conjugation (including incorrect ones):

  • Is giraffes bad
  • Was giraffes bad
  • Were giraffes bad
  • Giraffes would be bad

Noun (“giraffe”) declensions:

  • Are giraffe bad

Adjective (“bad”) synonyms:

  • Are giraffes evil
  • Are giraffes misbehaved
  • Are giraffes malicious
  • Are giraffes distasteful
  • Are giraffes inferior

Noun synonyms:

  • Are giraffa camelopardalis bad
  • Are G. camelopardalis bad
  • Are giraffids bad

Verb synonyms:

  • giraffes become bad

Since giraffa camelopardalis is a category, there are even more synonyms:

  • Are G. C. camelopardalis bad
  • Are G. C. reticulata bad
  • Are G. C. angolensis bad
  • Are G. C. antiquorum bad
  • Are G. C. tippelskirchi bad
  • Are G. C. rothschildi bad
  • Are G. C. giraffa bad
  • Are G. C. thornicrofti bad
  • Are G. C. peralta bad

Negated antonyms:

  • Are giraffes not good
  • Are giraffes not benign
  • Are giraffes not helpful

Negated antonyms only work for words that represent a one-dimensional continuum, such as good/bad, light/dark, white/black, or weak/strong. They are less useful for adjectives describing multi-dimensional ranges, such as colors, N/S/W/E directions, or the cosiness of an armchair. The approach also fails if there is no useful antonym (fun exercise: find the antonym of “thirsty”). Natural language analysis would be so much easier if people used Newspeak as in George Orwell’s “1984” (I mean: it would become “plusgooder”).
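The variations listed above can be combined mechanically. The following is a rough sketch, assuming small hand-written tables (`ALTERNATIVES` and `ANTONYMS` are hypothetical and contain only a few of the words from the lists); it builds every combination of per-word alternatives and adds negated antonyms for the words that have them.

```python
from itertools import product

# Hypothetical per-word alternatives, taken from a few of the lists above.
ALTERNATIVES = {
    "are": ["are", "is", "were"],
    "giraffes": ["giraffes", "giraffe", "giraffids"],
    "bad": ["bad", "evil", "malicious"],
}

# Hypothetical antonym table; only one-dimensional adjectives belong here.
ANTONYMS = {"bad": ["good", "benign"]}


def expand(query):
    """Return all query variants built from per-word alternatives."""
    options = []
    for word in query.lower().split():
        alts = list(ALTERNATIVES.get(word, [word]))
        # Add "not <antonym>" as a further alternative where one is known.
        alts += ["not " + antonym for antonym in ANTONYMS.get(word, [])]
        options.append(alts)
    return [" ".join(combo) for combo in product(*options)]


variants = expand("are giraffes bad")
print(len(variants))  # 3 * 3 * 5 = 45
print(variants[0])    # are giraffes bad
```

Even with these tiny tables, a three-word query grows into 45 variants, because the number of combinations is the product of the number of alternatives per word.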

Degrees of usefulness

The conjugation of the English verb “to be” is a bit limited due to English being the way it is. If we consider Italian instead, there are more possibilities (these are all third-person plural, so assume that neither the user nor the search engine is in fact one of the mentioned giraffes):

  • Le giraffe sono male
  • Le giraffe sono state male
  • Le giraffe erano male
  • Le giraffe erano state male
  • Le giraffe furono male
  • Le giraffe furono state male
  • Le giraffe saranno male
  • … le giraffe sarebbero male

Furthermore, not all conjugations have equal value. If the user used the present tense (“sono”), it is quite unlikely that the remote pluperfect conjugation (“furono state”) would be helpful. On the other hand, if the user used the recent past conjugation (“passato prossimo”), the remote past (“passato remoto”) conjugation could be helpful, depending on the topic and the origin of the user and the candidate documents.

Similar issues occur with the word “blue”. If the user typed “blue” then these words might be good variations:

  • azure
  • ultramarine
  • violet

But if the user typed “azure”, then the related words “ultramarine” and “dark blue” would not be good variations.
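One way to capture this asymmetry is to store variations as a directed relation rather than as a symmetric group. The sketch below is only an illustration: the `WIDENS_TO` table is hypothetical, and the choice that “azure” and “ultramarine” may widen to “blue” is an assumption, not something established above.

```python
# Hypothetical directed relation: each key maps to its acceptable widenings.
# Edges only go one way unless the reverse edge is listed explicitly.
WIDENS_TO = {
    "blue": ["azure", "ultramarine", "violet"],
    "azure": ["blue"],        # assumption: a specific shade may widen to "blue"
    "ultramarine": ["blue"],  # assumption, as above
}


def variations(word):
    """Return the words that `word` may be widened to."""
    return WIDENS_TO.get(word, [])


print(variations("blue"))   # ['azure', 'ultramarine', 'violet']
print(variations("azure"))  # ['blue']
```

Here “blue” reaches “ultramarine”, but “azure” does not, which a plain group of color words could not express.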

Bag or weighted network approach?

The number of variations of the original query is almost infinite, but not all variations are likely to generate useful results. This also gives us the insight that not all word variations are equal, and that it is not simply a matter of categorizing the words, putting them into a bag, and then using the whole bag whenever one of the words is used.

[Image: variations1]

The best results would presumably come from using a weighted, possibly directed, network of words as in the drawing above. You can see that the words “giraffes” and “giraffe” are closer than “giraffes” and “okapi”. The weights could also be adjusted so that when a user searches for “giraffa camelopardalis”, the search engine tries to widen the search with “giraffid” before it tries the plain “giraffe”.
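As a rough sketch of how such a network could drive search widening, the code below stores directed edges with made-up integer weights (lower means closer) and visits the nearest variations first, using Dijkstra's shortest-path algorithm. The `NETWORK` contents, the weights, the `widen` function and the `max_cost` cutoff are all assumptions chosen for illustration.

```python
import heapq

# Hypothetical directed network; weights are invented, lower means "closer".
NETWORK = {
    "giraffes": {"giraffe": 1, "giraffid": 5},
    "giraffe": {"giraffes": 1, "giraffid": 4},
    "giraffa camelopardalis": {"giraffid": 2, "giraffe": 3},
    "giraffid": {"okapi": 6},
}


def widen(word, max_cost=10):
    """Return (variation, cost) pairs, nearest first, within max_cost."""
    best = {word: 0}
    heap = [(0, word)]
    result = []
    while heap:
        cost, current = heapq.heappop(heap)
        if cost > best[current]:
            continue  # stale heap entry; a cheaper path was already found
        if current != word:
            result.append((current, cost))
        for neighbour, weight in NETWORK.get(current, {}).items():
            new_cost = cost + weight
            if new_cost <= max_cost and new_cost < best.get(neighbour, max_cost + 1):
                best[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return result


print(widen("giraffa camelopardalis"))
# [('giraffid', 2), ('giraffe', 3), ('giraffes', 4), ('okapi', 8)]
print(widen("giraffes"))
# [('giraffe', 1), ('giraffid', 5)]
```

With these weights, a search for “giraffa camelopardalis” tries “giraffid” before “giraffe”, while a search for “giraffes” never reaches “okapi” because its total cost exceeds the cutoff.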

 

How to generate such a weighted network of words is an interesting subject, but that will have to wait for another post.