Language support in Findx

We support multiple languages, but how, and which ones? Are Czech results useful to Slovaks? Can Italians read Spanish? We don’t support all languages, unfortunately. The Internet is large. We did consider crawling the entire Internet but quickly decided against it due to space limitations and because it would require at least one person …


Indexing update – summary

We have been extending some of the indexing and tokenization (what we treat as words) over the past few months. This post sums up the changes so far. Ligatures: we decompose stylistic ligatures (ĳ/ﬄ/…) into their component letters. We decompose language-specific ligatures in English (œ and æ) and French (œ). This means that if a …
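As a rough illustration of the decomposition described in this excerpt, here is a minimal Python sketch. It is not Findx's actual code: the function name `decompose_ligatures` and the `LANGUAGE_LIGATURES` table are hypothetical, and the sketch assumes that stylistic ligatures can be recognised by their Unicode compatibility decomposition, while œ and æ (which have no Unicode decomposition) need an explicit per-language table.

```python
import unicodedata

# Hypothetical per-language table. The characters œ and æ have no Unicode
# decomposition, so they are mapped explicitly and only for the listed languages.
LANGUAGE_LIGATURES = {
    "en": {"œ": "oe", "æ": "ae", "Œ": "OE", "Æ": "AE"},
    "fr": {"œ": "oe", "Œ": "OE"},
}


def decompose_ligatures(word, lang):
    """Expand stylistic ligatures always, language-specific ones per language."""
    table = LANGUAGE_LIGATURES.get(lang, {})
    out = []
    for ch in word:
        if ch in table:
            out.append(table[ch])
            continue
        # Stylistic ligatures such as U+FB04 (ffl) or U+0133 (ij) carry a
        # "<compat>" decomposition into several plain letters.
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] == "<compat>":
            expanded = "".join(chr(int(p, 16)) for p in parts[1:])
            if len(expanded) > 1 and expanded.isalpha():
                out.append(expanded)
                continue
        out.append(ch)
    return "".join(out)


print(decompose_ligatures("wa\ufb04e", "en"))    # stylistic ffl ligature
print(decompose_ligatures("œuvre", "fr"))        # language-specific, French
print(decompose_ligatures("æble", "da"))         # no rule for Danish here
```

In this sketch a word in a language without a table entry keeps its œ or æ, which is one way to read "language-specific"; the real rules may differ.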


Indexing superscripts and subscripts

Indexing and searching for H₂O and j*=σT⁴ is not straightforward. Documents can contain superscripts and subscripts, as in the examples above. In HTML this can be achieved with the <sub> and <sup> tags, viz. H<sub>2</sub>O, j*=σT<sup>4</sup>, 14 m<sup>2</sup> drywall. But superscript/subscript for some characters can also be achieved with dedicated Unicode code points, e.g.: …
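To make the "dedicated Unicode code points" case concrete, here is a small Python sketch that folds superscript and subscript characters back to their base characters using the Unicode character database. The function name `fold_scripts` is made up for this example, and whether Findx folds, keeps, or indexes both forms is not stated in the excerpt; this only shows one possible normalisation.

```python
import unicodedata


def fold_scripts(text):
    """Replace Unicode superscript/subscript characters with their base characters."""
    out = []
    for ch in text:
        # Characters such as U+2082 (subscript two) have a decomposition
        # tagged "<sub>" or "<super>" followed by the base code point(s).
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in ("<super>", "<sub>"):
            out.append("".join(chr(int(p, 16)) for p in parts[1:]))
        else:
            out.append(ch)
    return "".join(out)


print(fold_scripts("H₂O"))
print(fold_scripts("j*=σT⁴"))
print(fold_scripts("14 m²"))
```

Note that folding loses information: after folding, "m²" and "m2" are indistinguishable, which is part of why the problem is not straightforward.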


Possessive apostrophe-s and search engines

The apostrophe is normally used for the English possessive s, but there are anomalies. In English the apostrophe is used for the possessive s, as in “John’s cat”. What does that have to do with search engines? Two things. Indexing words and word combinations: our search engine (like most other search engines) indexes words, not strings. So …


Digraphs and Ligatures

Are ligatures easy to index for search engines? They mostly are, but correctly identifying and classifying them is not straightforward. It usually starts with a digraph… A digraph is a pair of letters that, when combined, do not produce the normal sounds of the individual letters. An example is the English “ng”, which represents /ŋ/ …


Citroën, Škoda, Hildisvíni, and the search for ragù alle bolognese and bûche de Noël

What challenges arise when we encounter words with diacritics? As a starting point we index documents as they are written and we search for words as you write them. This gives you exact matches, as you would expect. But in some cases that would leave out relevant results. This post series will explore some cases …


Progress on Search Widening – Danish

We have finally gotten around to implementing search widening, starting with Danish. We are currently working on Danish-specific search widening, meaning that if you search for a word we may include other words such as spelling variants, compound word derivatives, decompositions, inflections, conjugations, etc. For other considerations look for other blog posts with the tag …


To index or not to index, that is the policy question

If we are allowed to crawl a site in general, are there any exceptions we make? As a follow-up to our post about robots.txt challenges we now come to the question of what to do if we are allowed to crawl, because we may not want to or we may get a different restriction in …


Norwegian – Bokmål or Nynorsk

Just a short note: we currently don’t distinguish between Bokmål and Nynorsk. We inherited a code base (https://github.com/privacore/open-source-search-engine) which had set aside 6 bits (64 values) for language IDs. We recently discovered that it only had one code for Norwegian. But Norwegian comes in two versions, Bokmål and Nynorsk, which are similar but do have …
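The "6 bits (64 values)" constraint can be illustrated with a short Python sketch. The field layout, the names `pack` and `unpack`, and the idea of sharing one integer with other flags are assumptions made for this example; the excerpt does not describe how the inherited code base actually stores the language ID.

```python
LANG_BITS = 6
LANG_MASK = (1 << LANG_BITS) - 1  # 0b111111, so valid IDs are 0..63


def pack(lang_id, other_flags):
    """Store a language ID in the low 6 bits and other flags above it (hypothetical layout)."""
    if not 0 <= lang_id <= LANG_MASK:
        raise ValueError("language id does not fit in 6 bits")
    return (other_flags << LANG_BITS) | lang_id


def unpack(packed):
    """Return (lang_id, other_flags) from a packed value."""
    return packed & LANG_MASK, packed >> LANG_BITS


print(1 << LANG_BITS)          # number of distinct language IDs
print(unpack(pack(63, 5)))     # round trip of the largest valid ID
```

With a fixed-width field like this, adding a second Norwegian code means spending one of the 64 slots, which is why the number of available IDs matters.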


Server load

Apparently our servers are loaded 20000%. We installed a new metrics collector that puts OS metrics into a Prometheus database, and installed an off-the-shelf Grafana dashboard. We were surprised. So either the servers are really loaded, or you shouldn’t trust off-the-shelf dashboards to get everything right.
