Language support in Findx

We support multiple languages, but how (and which are they)? Are Czech results useful to Slovaks? Can Italians read Spanish?

We don’t support all languages, unfortunately

The Internet is large. We did consider crawling the entire Internet but quickly decided against it, both due to space limitations and because it would require at least one person per language to verify result quality, catch spam, fix inaccurate classification, etc. The search engine we run on is also currently limited to 64 languages, a limit we plan to lift sometime in the future.

So we limit our web crawler to the languages spoken in Europe. But even then we had to leave some out, e.g. Frisian and Sami, and we do not distinguish Nynorsk/Bokmål, Romanian/Moldavian, or Italian/Sardinian/Sicilian.

Document language

Some documents specify which language they are in. Some don't. Some lie. For instance, it is not uncommon to see an HTML document with an <html lang="en"> attribute whose actual content is in something different, e.g. Norwegian, so what documents claim cannot be trusted entirely. The same applies to any Content-Language HTTP header. We do use them as hints, though, together with the inherent hint from ccTLDs (e.g. .de). But we mainly rely on third-party libraries to detect the document language.
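A minimal sketch of how such hints might be combined: the vote weights and the tiny ccTLD table below are invented for illustration, and the real pipeline relies on detection libraries rather than these numbers.

```python
# Illustrative only: weights and the ccTLD table are made up.
CCTLD_LANG = {"de": "de", "dk": "da", "fr": "fr", "it": "it"}  # tiny sample

def guess_document_language(detected, html_lang=None, http_lang=None, cctld=None):
    """Combine a content-based detection result with declared hints.
    `detected` is what a text classifier reports; declared hints
    (lang attribute, Content-Language, ccTLD) only nudge the outcome."""
    votes = {}
    def vote(lang, weight):
        if lang:
            votes[lang] = votes.get(lang, 0.0) + weight
    vote(detected, 3.0)                  # content detection dominates...
    vote(html_lang, 1.0)                 # ...declared hints are weaker
    vote(http_lang, 1.0)
    vote(CCTLD_LANG.get(cctld), 0.5)     # ccTLD is the weakest hint
    return max(votes, key=votes.get)
```

So a page declaring `lang="en"` but detected as Norwegian still comes out Norwegian, because the content vote outweighs the declared hint.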

Query language / Useful language results

Detecting which language the user's query is in can sometimes be easy, as with "rindfleisch", which is used in only one language; sometimes it is difficult, as with "computer", which is used in many. Web browsers can tell us which languages the user allegedly understands (in the Accept-Language HTTP request header), but that cannot be trusted entirely either, because some browsers lie by including extra languages the user has not chosen.
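Parsing the Accept-Language header itself is straightforward. A small sketch, which reduces subtags like "en-GB" to their primary language and ignores wildcard "*" entries and other corner cases:

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (language, q) pairs,
    highest priority first. 'en-GB' is reduced to 'en'."""
    langs = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, qpart = part.partition(";")
        tag, qpart = tag.strip(), qpart.strip()
        quality = 1.0                      # default per the HTTP spec
        if qpart.startswith("q="):
            try:
                quality = float(qpart[2:])
            except ValueError:
                quality = 0.0
        primary = tag.split("-")[0].lower()
        if primary == "*":
            continue                       # wildcard carries no language info
        # Keep the highest q seen for each primary language.
        langs[primary] = max(langs.get(primary, 0.0), quality)
    return sorted(langs.items(), key=lambda kv: -kv[1])
```

For example, `parse_accept_language("fr,en;q=0.8,it;q=0.5")` yields French first, then English, then Italian.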

So instead of trying hard to detect exactly which language a query is in (and failing for multi-lingual words) we try instead to detect which language results would be useful to the user.
We take into account:

  • The domain/TLD the user accessed findx on (findx.de, findx.dk, …).
  • The Geo-IP, but only on country level.
  • The browser’s Accept-Language request header.
  • The words in the query.
  • Specific letters in the query.

For instance, if we have:

  • TLD: findx.be
  • Geo-IP: France
  • Accept-Language: fr,en,it
  • Query: “kraftwerk”

then we adjust the language weights accordingly:

  • The TLD indicates French or Dutch, so the weights for those languages are adjusted up.
  • The Geo-IP indicates French, so that is adjusted further up. It also indicates (weakly) that the user was taught English as a foreign language in school, so the English weight is adjusted up too.
  • The Accept-Language header indicates French, English, and Italian, so those weights are adjusted up, but not equally – languages are usually listed in priority order in the header. Additionally, we consider language intelligibility / sprachbunds / dialect continua here: French and English have no obvious friends, but Italian is approximately 30% intelligible with Spanish, so the weight for Spanish is adjusted up a bit.
  • The query word indicates German, so the weight for that is adjusted up.
  • The query letters include "k" and "w", so the weights for Italian and Finnish are adjusted down.

The resulting language weights / usefulness probabilities could be:

  • French: 90% useful
  • Dutch: 50% useful
  • English: 70% useful
  • Italian: 50% useful
  • Spanish: 15% useful
  • Polish: 3% useful
  • Finnish: 1% useful
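The walkthrough above can be sketched in code. Everything below (the signal tables, the bump amounts, the clamping) is invented for illustration and does not reflect Findx's actual weights:

```python
# Illustrative tables: real data comes from the sources described below.
TLD_LANGS = {"be": ["fr", "nl"], "dk": ["da"], "de": ["de"]}
COUNTRY_LANGS = {"FR": ["fr"], "BE": ["fr", "nl"]}
FOREIGN_TAUGHT = {"FR": ["en"], "BE": ["en"]}          # first foreign language in school
INTELLIGIBLE = {"it": {"es": 0.3}, "cs": {"sk": 0.9}}  # rough mutual intelligibility
WORD_LANGS = {"kraftwerk": ["de"], "rindfleisch": ["de"]}
RARE_LETTERS = {"k": ["it"], "w": ["it", "fi"]}        # letters rare in these languages

def language_weights(tld, country, accept_langs, query):
    """Combine the signals into per-language usefulness weights in [0, 1]."""
    w = {}
    def bump(lang, amount):
        w[lang] = w.get(lang, 0.0) + amount
    for lang in TLD_LANGS.get(tld, []):
        bump(lang, 0.3)
    for lang in COUNTRY_LANGS.get(country, []):
        bump(lang, 0.3)
    for lang in FOREIGN_TAUGHT.get(country, []):
        bump(lang, 0.1)                       # weak schooling signal
    for rank, lang in enumerate(accept_langs):
        bump(lang, 0.3 - 0.05 * rank)         # earlier in header = stronger
        for friend, degree in INTELLIGIBLE.get(lang, {}).items():
            bump(friend, 0.1 * degree)        # spread to intelligible languages
    for word in query.lower().split():
        for lang in WORD_LANGS.get(word, []):
            bump(lang, 0.4)
        for letter in set(word):
            for lang in RARE_LETTERS.get(letter, []):
                bump(lang, -0.1)              # unlikely letters push weight down
    return {lang: max(0.0, min(1.0, weight)) for lang, weight in w.items()}
```

Running `language_weights("be", "FR", ["fr", "en", "it"], "kraftwerk")` reproduces the general shape of the example: French highest, English and Dutch in the middle, German boosted by the query word, Spanish slightly above zero via Italian, and Finnish pushed to the bottom.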

These weights are then factored in when we rank the results. So the presumably French-speaking Belgian searching for “kraftwerk” would primarily get results about the German band in French, but also in English and Dutch if the results were good enough.
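A minimal sketch of how such weights could be factored into ranking, assuming the language weight scales a base relevance score rather than filtering outright, so a very good result in a less-useful language can still surface. The floor value and the sample results are made up:

```python
def blended_score(relevance, language_weight, floor=0.1):
    """Scale base relevance by language usefulness. The floor keeps
    results in 'useless' languages rankable rather than discarded."""
    return relevance * (floor + (1 - floor) * language_weight)

# Hypothetical results for the Belgian "kraftwerk" searcher.
results = [
    ("Kraftwerk - biographie", "fr", 0.80),
    ("Kraftwerk (band) - encyclopedia", "en", 0.95),
    ("Kraftwerk discografie", "nl", 0.70),
]
weights = {"fr": 0.9, "en": 0.7, "nl": 0.5}
ranked = sorted(results, key=lambda r: -blended_score(r[2], weights[r[1]]))
```

Here the French result wins despite a lower raw relevance than the English one, because its language weight is higher; with a larger relevance gap the English result would win anyway.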

You can override this mechanism in the Findx frontend under findx→settings→search result language. We plan to give full access to tweaking the language weights (we're working on it), but for now all of the above happens behind the scenes unless you explicitly override it.

We are not the only search engine that doesn't treat multilingualism as a freak of nature – DuckDuckGo has multiple interface languages, Bing allows you to select multiple result languages, and Qwant supports Corsican and Breton in its interface.

But as far as we know, no other search engine considers language intelligibility or non-official country languages when ranking results. They usually pick either 1-2 languages or none at all. They don't consider that if a user issues an obviously Czech query, then perhaps Slovak results might be useful too; or that Maltese and Swedes are more likely to understand English than Poles (source: CEMFI: do you speak English?); or that the first foreign language taught in school seems to stick better in Slovenia than in Ireland (source: Eurostat: foreign language skills, Eurostat: database).

Other sources we used when constructing the current weights and interactions:

  • Official languages according to Wikipedia and the CIA World Factbook.
  • Actual languages, according to the languages actually used on government websites, language percentages from the World Factbook, and a whole lot of digging in forums answering the question (preferably answers from citizens).
  • Information on which foreign languages are taught in school (not that easy to find), both as the first foreign language and as any second foreign language. There were some surprises here. E.g. in Denmark the second foreign language (mandatory) is German. In Germany the second foreign language varies from school to school and can include Latin and Russian. In Spain there is apparently no mandatory second foreign language.
  • Statistics and surveys on language proficiency per country and language (not that easy to find).
  • Common sense. E.g. even though the only official language of the Vatican is Latin, its residents are bound to have some proficiency in Italian.
  • Television viewing habits.

Prototype for calculating language weights

If you are curious and know the programming language Python, then you can play around with the prototype I used to test out ideas: query_language.py. The prototype is close to the final algorithm we use. The differences are:

  • Only the country (geo-IP) can be specified in the prototype; the TLD/ccTLD cannot.
  • The tool has a tiny built-in dictionary. For actual query text examination we use dictionaries (hunspell and aspell), CLD2, and CLD3.

For example:

./query_language.py dk da,en,de kanelsnegl
./query_language.py dk da,en,de ålegilde
./query_language.py de de,en currywurst
./query_language.py de de,en,da currywurst
./query_language.py de de,en Maßstab
./query_language.py us en What Is the Airspeed Velocity of an Unladen Swallow

Summary

When users don't specify any language hints on findx.com/findx.de/…, we should do the right thing for them most of the time. For monolingual countries/users it shouldn't matter. For multilingual countries you should sometimes see a difference from other approaches, which consider only a single language useful.

Future work

Lots of tweaking, and possibly a more fine-grained estimation of the query words, because some queries use words exclusive to one language according to dictionaries and CLD2/CLD3, even though the phrase is international. E.g. "kerry blue terrier" is clearly English – except it isn't, because it is the name of a dog breed. We intend to handle this better automatically, but for now you have to select the result language yourself.
There are also some language intelligibility pairs that are still missing/unused, e.g. Finnish/Estonian and Latvian/Lithuanian.
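One way the phrase-level idea could work is to check whole multi-word queries against a gazetteer of international names before falling back to per-word dictionary lookups. The phrase list and helper below are hypothetical:

```python
# Hypothetical gazetteer of phrases that are international despite
# being built from language-specific words.
INTERNATIONAL_PHRASES = {"kerry blue terrier", "status quo", "kraftwerk"}

def query_language_hint(query, word_langs):
    """Return the set of languages a query hints at, or None when the
    query is international (gazetteer hit) or gives no signal.
    `word_langs` maps a word to the languages whose dictionaries contain it."""
    q = query.lower().strip()
    if q in INTERNATIONAL_PHRASES:
        return None  # a known international phrase indicates no single language
    langs = set()
    for word in q.split():
        langs.update(word_langs.get(word, []))
    return langs or None
```

With this, "kerry blue terrier" yields no language hint even though each word is in the English dictionary, while "rindfleisch" still hints German.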