Progress on Search Widening – Danish

We have finally gotten around to implementing search widening, starting with Danish.

We are currently working on Danish-specific search widening. This means that when you search for a word, we may also search for related words such as:

  • Spelling variants
  • Compound word derivatives
  • Decompositions
  • Inflections
  • Conjugations
  • etc.

For other considerations look for other blog posts with the tag “search widening”.

Base material

We found a Danish lexicon (in the linguistic sense) covering approximately 50,000 Danish words. Each entry contains detailed information about the word class, part of speech, alternative spellings, inflections, etc. This gives us a good starting point for generating search widenings. Later on we may use it for analysing what the user’s query is about.

The lexicon we found was STO (sprogteknologisk orddatabase) from Center for Sprogteknologi (CST). It is not meant for human consumption but instead for machine use.

User visibility

The intent is to allow the user to specify whether search widening is allowed and, if so, by how much. The really nerdy users will be able to specify each type of widening individually. Search widening is not active in production yet.

Additionally, if you enclose a word in quotation marks (“”), Findx will not try to widen that word.
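As a rough illustration of the quotation-mark rule, here is a minimal sketch of how a query could be split into terms that may and may not be widened. The function name parse_query and the regular expression are our own illustration, not the actual Findx query parser.

```python
import re

# Matches either a phrase in straight or curly quotes, or a bare word.
# Hypothetical tokenizer for illustration only.
_TOKEN = re.compile(r'["“”]([^"“”]+)["“”]|(\S+)')

def parse_query(query):
    """Return (term, may_widen) pairs; quoted terms are marked as not widenable."""
    terms = []
    for quoted, bare in _TOKEN.findall(query):
        if quoted:
            terms.append((quoted, False))  # user asked for this exact form
        else:
            terms.append((bare, True))     # candidate for widening
    return terms

print(parse_query('“cirklen” radius'))
```

A quote without a matching closing quote simply falls through to the bare-word branch, so a stray quotation mark does not silently disable widening for the rest of the query.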

We are toying with the idea that when a search returns no matches or suspiciously few matches, we automatically show the user the option to widen the search a bit.

Spelling variants

The STO gives us alternate spellings for some words, e.g. “cirklen” and “cirkelen”. So when you search for “cirklen”, Findx knows that “cirkelen” is an alternate spelling and searches for that too.
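The lookup can be pictured as a symmetric index over groups of alternate spellings. The sketch below assumes the lexicon has already been reduced to such groups; the function names and the input shape are hypothetical and do not reflect the real STO data format.

```python
from collections import defaultdict

def build_variant_index(variant_groups):
    """Map every word to the other spellings in its group (both directions)."""
    index = defaultdict(set)
    for group in variant_groups:
        for word in group:
            index[word] |= set(group) - {word}
    return index

def spelling_variants(word, index):
    """Return the known alternate spellings of a word, or an empty list."""
    return sorted(index.get(word.lower(), ()))

# Assumed input: groups of interchangeable spellings extracted from the lexicon.
index = build_variant_index([("cirklen", "cirkelen")])
print(spelling_variants("cirklen", index))
```

Building the index in both directions means a search for either spelling finds the other without a second pass over the lexicon.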

We also hardcoded the rule that some proper nouns (cities, towns, places, persons, …) may use the old written double-a form instead of “å”. An example is the southern Danish city Aabenraa. So if you search for “Åbenrå” Findx will automatically look for “Aabenraa” too (and vice versa).
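A rule like this can be sketched as two plain string substitutions, one in each direction. The function below is an illustration under our own naming; it does not decide whether a word is a proper noun, so the caller would have to apply it only to words where the rule is meant to hold.

```python
def double_a_variants(word):
    """Return the double-a / å counterparts of a word, excluding the word itself."""
    to_double = word.replace("Å", "Aa").replace("å", "aa")
    to_ring = word.replace("Aa", "Å").replace("aa", "å")
    return {variant for variant in (to_double, to_ring) if variant != word}

print(double_a_variants("Åbenrå"))    # the double-a spelling
print(double_a_variants("Aabenraa"))  # the å spelling
```

This simple version does not handle all-uppercase input such as “AABENRAA”, and it rewrites every occurrence at once rather than generating mixed forms.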

We also hardcoded rules for stripping and adding the acute accent (accent aigu). E.g. the French first name René is often, but not always, written with an acute accent on the last e. So if you search for “rene” or “bangs alle”, Findx knows to also search for “rené” and “bangs allé” respectively, and vice versa.
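Stripping an accent is mechanical, but adding one requires knowing which accented forms exist. One way to sketch this is to index the known accented words by their stripped form. The names below are hypothetical, only “é” is handled, and the list of accented words is an assumed input.

```python
# Translation table covering only the acute e, as an illustrative simplification.
ACUTE = str.maketrans("éÉ", "eE")

def strip_acute(word):
    """Remove the acute accent from e/E."""
    return word.translate(ACUTE)

def build_accent_index(accented_words):
    """Map each stripped form to the accented words it can stand for."""
    index = {}
    for word in accented_words:
        index.setdefault(strip_acute(word), set()).add(word)
    return index

def accent_variants(word, index):
    """Return accented forms of an unaccented word, or the stripped form of an accented one."""
    variants = set(index.get(word, ()))
    stripped = strip_acute(word)
    if stripped != word:
        variants.add(stripped)
    return variants

index = build_accent_index(["rené", "allé"])
print(accent_variants("rene", index))
print(accent_variants("allé", index))
```

For a multi-word query such as “bangs alle”, the lookup would be applied per word.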

Noun inflections

We also support changing the form of nouns, e.g.:

  • indefinite form to definite form
  • definite form to indefinite form
  • singular to plural
  • plural to singular

Initial tests revealed that changing singular to plural mostly works, but plural to singular doesn’t work so well. So the first widening will have a higher weight than the second.
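To make the weighting idea concrete, here is a minimal sketch in which every widened term carries a weight below that of the original term. The numeric weights and the names are placeholders chosen for illustration; the only property taken from our tests is that singular to plural ranks above plural to singular.

```python
# Hypothetical weights; only their relative order reflects the test results.
WIDENING_WEIGHTS = {
    "singular_to_plural": 0.8,
    "plural_to_singular": 0.4,
}

def weighted_terms(term, widenings):
    """Return (term, weight) pairs: the original at 1.0, variants at their kind's weight."""
    scored = [(term, 1.0)]
    for variant, kind in widenings:
        scored.append((variant, WIDENING_WEIGHTS[kind]))
    return scored

# "hus" (house) widened to its plural "huse".
print(weighted_terms("hus", [("huse", "singular_to_plural")]))
```

A ranking function could then multiply each match score by the weight of the term that produced it, so that exact matches still come first.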

Verbs

Currently we only change between the two past tenses. In Danish the imperfect past (“he ate”) and the perfect past (“he has eaten”) are mostly interchangeable, especially in colloquial use.

Future work

  • Changing the genitive (possessive) noun form (“spandens låg”) to/from the part-of form (“låget på spanden”).
  • Changing verbs from present to future tense if there is a future time indication (“han kommer i morgen” vs. “han vil komme i morgen”).
  • Other languages. Norwegian seems to be the obvious choice if we can get our hands on the equivalent of STO in Norwegian (or generate it ourselves).
  • Activating this feature in production.