Eszett – search widening or normalization?

The German letter ß provides some interesting challenges for search engines.

The history of the letter ß is long and interesting, but we are mainly interested in its unique properties:

  • It has no capital form. It is always transliterated when converted to uppercase. (note*1)
  • It can be transliterated in two ways (‘ss’ and ‘sz’). (note*2)
  • It is region-specific. ß is used in Germany and Austria but not in Switzerland and Liechtenstein.
  • Usage has recently changed. The German orthography reform in 1996 changed the use of ß slightly.
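
To make the first property concrete, here is a small sketch of how Python's built-in string methods treat ß. Python is only our choice for illustration; the behaviour shown comes from its standard Unicode case mappings and says nothing about how a particular search engine handles the letter.

```python
# Illustration only: Python's standard Unicode case mappings for ß.
word = "Konradstraße"

upper = word.upper()          # uppercasing expands ß to the two letters 'SS'
folded = word.casefold()      # case folding maps ß to 'ss'
lowered = "\u1e9e".lower()    # the capital form U+1E9E lowercases to ß

print(upper)    # KONRADSTRASSE
print(folded)   # konradstrasse
print(lowered)  # ß
```

Note that the uppercase result is one character longer than the input, so the conversion cannot be reversed without knowing the original spelling.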

The juicy problem is that Konradstraße, Konradstrasse, Konradstrasze and KONRADSTRASSE are all the same word, and the German spelling “Geschoss” is the same word as the Austrian “Geschoß” (note*3). But ß cannot simply be transliterated blindly, because it not only helps with pronunciation but also disambiguates words, as in “Ich trinke Bier in Maßen” (in moderation) versus “Ich trinke Bier in Massen” (in large quantities).

 

We are currently investigating whether we should just transliterate all occurrences of ß, or whether we should use search widening. My gut feeling is that search widening will produce slightly better search results.
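
The difference between the two options can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not an actual implementation: the function names normalize and widen are hypothetical, and we assume here that "search widening" means keeping ß in the index and expanding a query containing ß into its alternative spellings, with the exact spelling listed first so it can be ranked highest.

```python
from itertools import product


def normalize(term: str) -> str:
    """Normalization (assumed variant): fold every term to one key.

    str.casefold() lowercases and maps ß to 'ss', so distinct words
    such as 'Maßen' and 'Massen' end up with the same key.
    """
    return term.casefold()


def widen(term: str) -> list[str]:
    """Search widening (assumed variant): expand each ß into alternatives.

    The lowercased original spelling comes first, followed by the
    'ss' and 'sz' transliterations.
    """
    parts = term.lower().split("ß")
    variants = []
    # One choice of spelling per ß in the term.
    for combo in product(("ß", "ss", "sz"), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for spelling, part in zip(combo, parts[1:]):
            pieces += [spelling, part]
        variants.append("".join(pieces))
    return variants


print(normalize("Maßen"), normalize("Massen"))
print(widen("Maßen"))
```

With normalization the two Bier sentences become indistinguishable, while widening keeps “maßen” as the preferred match and only adds the other spellings as fallbacks. The sketch widens in one direction only; a query typed as “Massen” is left unchanged.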

 

Note 1: There have been attempts to introduce an uppercase form of ß, and Unicode 5.1.0 includes it as U+1E9E. Time will tell whether it comes into general use.

 

Note 2: Usually ‘ss’ in Germany, but quite often ‘sz’ in Austria, especially near Hungary.

 

Note 3: The word is pronounced differently in Germany and Austria, and that influences whether it is written with a double s or with ß.