When a user types the question “are giraffes bad?” into a search engine, the engine could search for exactly those words. That would most likely produce somewhat helpful results, but the results would not include documents related to giraffe badness if those documents don’t use exactly those words.
If we consider synonyms and closely related words, we can expand the possible queries to those that mean the same or almost the same, thereby finding more candidate documents and hopefully producing results more useful to the user.
You could use a dictionary-based approach to find synonyms, but we have found that such a simple method doesn’t generate results as good as we would like, because we are looking not only for synonyms but also for closely related words. A synonym-based approach would consider “giraffe” and “giraffid” synonyms (maybe) but would not find the closely related word “giraffes”. It also relies heavily on the quality of the synonym list in the dictionary.
If we instead consider word variations from the point of view of grammatical analysis, we find more useful word variations and get a bit more insight.
Verb (“are”) conjugations (including incorrect ones):
Noun (“giraffe”) declensions:
Adjective (“bad”) synonyms:
Since giraffa camelopardalis is a category, there are even more synonyms:
Negated antonyms only work for words that represent a one-dimensional continuum such as good/bad, light/dark, white/black, weak/strong. They are less useful for adjectives describing multi-dimensional ranges, such as colors, N/S/W/E directions or the cosiness of an armchair. They also fail if there is no useful antonym (fun exercise: find the antonym to “thirsty”). Natural language analysis would be so much easier if people used Newspeak as in George Orwell’s “1984” (I mean: It would become “plusgooder”).
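To make the limitation concrete, here is a minimal sketch of negated-antonym expansion. The antonym table and the function name `negated_antonym_variants` are made up for illustration; a real system would need a much larger, curated list. Only the one-dimensional pairs mentioned above are included, so a word like “thirsty” simply yields nothing.

```python
# Hypothetical antonym table (an assumption for this sketch).
# Only one-dimensional scales are listed on purpose.
ANTONYM_PAIRS = [
    ("good", "bad"),
    ("light", "dark"),
    ("white", "black"),
    ("weak", "strong"),
]

# Build a symmetric lookup so that either end of a scale can be negated.
OPPOSITE = {}
for left, right in ANTONYM_PAIRS:
    OPPOSITE[left] = right
    OPPOSITE[right] = left


def negated_antonym_variants(word: str) -> list[str]:
    """Return 'not <antonym>' variations, or an empty list if none is known."""
    opposite = OPPOSITE.get(word.lower())
    if opposite is None:
        # No useful antonym (e.g. "thirsty") or a multi-dimensional
        # adjective (e.g. "blue"): this technique has nothing to offer.
        return []
    return [f"not {opposite}"]


print(negated_antonym_variants("bad"))
print(negated_antonym_variants("thirsty"))
```

With this table, “bad” expands to “not good”, while “thirsty” and “blue” produce no variations at all.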
The conjugation of the English verb “to be” is a bit limited due to English being the way it is. If we consider Italian instead, there are more possibilities (these are all third-person plural, so assume that neither the user nor the search engine is in fact one of the mentioned giraffes).
Furthermore, not all conjugations have equal value. If the user used the present tense (“sono”) then it is quite unlikely that the remote pluperfect conjugation (“furono state”) would be helpful. On the other hand, if the user used the recent past (“passato prossimo”) then the remote past (“passato remoto”) conjugation could be helpful, depending on the topic and the origin of the user and the candidate documents.
Similar issues occur with the word “blue”. If the user typed “blue” then these words might be good variations:
But if the user typed “azure” then the related words “ultramarine” and “dark blue” would not be good variations.
The number of variations of the original query is almost infinite, but not all variations are likely to generate useful results. This also gives us the insight that not all word variations are equal, and that it is not simply a matter of categorizing the words, putting them into a bag, and then using the whole bag whenever one of the words is used.
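A quick way to see how fast the number of variations grows is to combine per-word variation lists with a Cartesian product. The lists below are illustrative assumptions, not the output of any real analyzer, and `expand_query` is a hypothetical helper name.

```python
from itertools import product

# Illustrative per-word variations (assumed for this sketch).
VARIATIONS = {
    "are": ["are", "is", "were"],
    "giraffes": ["giraffes", "giraffe"],
    "bad": ["bad", "not good", "evil"],
}


def expand_query(words: list[str]) -> list[str]:
    """Return every combination of the per-word variations as a query string."""
    # A word without known variations only stands for itself.
    choices = [VARIATIONS.get(word, [word]) for word in words]
    return [" ".join(combo) for combo in product(*choices)]


queries = expand_query(["are", "giraffes", "bad"])
print(len(queries))  # 3 * 2 * 3 = 18 combinations
```

Even with these tiny lists a three-word query turns into 18 queries, and every extra variation multiplies the total. That is why treating all variations as equally good does not scale.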
The best results would presumably come from using a weighted, possibly directed, network of words as in the drawing above. You can see that the words “giraffes” and “giraffe” are closer than “giraffes” and “okapi”. The weights could be adjusted so that when a user searches for “giraffa camelopardalis” the search engine tries to widen the search with “giraffid” before it tries the plain “giraffe”.
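As a rough sketch of how such a network could be used, the snippet below stores the words as a directed graph with edge costs (lower means closer) and widens a search term in order of increasing total cost, using a Dijkstra-style traversal. The graph, its costs and the function name `widen` are all invented for illustration; they are not measured values.

```python
import heapq

# Hypothetical directed word network: WORD_GRAPH[a][b] is the cost of
# widening a search for `a` to also include `b`. All costs are made up.
WORD_GRAPH = {
    "giraffa camelopardalis": {"giraffid": 2, "giraffe": 4},
    "giraffes": {"giraffe": 1, "okapi": 8},
    "giraffe": {"giraffes": 1, "giraffid": 5},
    "giraffid": {"okapi": 4},
}


def widen(start: str, max_cost: int) -> list[tuple[str, int]]:
    """Return related words reachable within max_cost, cheapest first."""
    best = {start: 0}
    heap = [(0, start)]
    order = []
    while heap:
        cost, word = heapq.heappop(heap)
        if cost > best[word]:
            continue  # stale heap entry; a cheaper path was already found
        if word != start:
            order.append((word, cost))
        for neighbor, step in WORD_GRAPH.get(word, {}).items():
            new_cost = cost + step
            if new_cost <= max_cost and new_cost < best.get(neighbor, max_cost + 1):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return order


print(widen("giraffa camelopardalis", 5))
```

With these assumed costs, a search for “giraffa camelopardalis” is widened with “giraffid” first, then “giraffe”, then “giraffes”. Because the edges are directed, “okapi” has no outgoing edges here and is not widened to anything.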
How to generate such a weighted network of words is an interesting subject, but that will have to wait for another post.