Indexing update – summary

Over the past few months we have been extending parts of our indexing and tokenization (what we treat as words). This post sums up the changes so far. Ligatures: We decompose stylistic ligatures (ij/ffl/…) into their component letters. We decompose language-specific ligatures in English (œ and æ) and French (œ). This means that if a …
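The decomposition described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the search engine's actual code: the names `STYLISTIC_LIGATURES`, `LANGUAGE_LIGATURES` and `decompose_ligatures` are hypothetical, and using Unicode compatibility decomposition (NFKD) for the stylistic ligatures is an assumption about one way to do it.

```python
import unicodedata

# Assumption: "stylistic" ligatures are the Latin presentation forms
# U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, ...) plus the IJ/ij characters.
STYLISTIC_LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)} | {"\u0132", "\u0133"}

# Hypothetical per-language table (lowercase only, for brevity).
# The languages and letters follow the excerpt: English (œ, æ), French (œ).
LANGUAGE_LIGATURES = {
    "en": {"œ": "oe", "æ": "ae"},
    "fr": {"œ": "oe"},
}


def decompose_ligatures(text, lang):
    """Split ligatures into component letters for the given language code."""
    table = LANGUAGE_LIGATURES.get(lang, {})
    out = []
    for ch in text:
        if ch in table:
            # Language-specific ligature: only split where the language asks for it.
            out.append(table[ch])
        elif ch in STYLISTIC_LIGATURES:
            # Stylistic ligature: always split, using compatibility decomposition.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


print(decompose_ligatures("c\u0153ur", "fr"))  # œ is split for French
print(decompose_ligatures("\u00e6ble", "da"))  # æ is left alone for Danish
```

Keeping the language-specific table separate from the stylistic set means a letter such as æ is only split in languages where it is a ligature rather than a letter in its own right.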

Indexing superscripts and subscripts

Indexing and searching for H₂O and j*=σT⁴ is not straightforward. Documents can contain superscripts and subscripts, as in the examples above. In HTML this can be achieved with the <sub> and <sup> tags, viz. H<sub>2</sub>O, j*=σT<sup>4</sup>, 14 m<sup>2</sup> drywall. But superscripts and subscripts for some characters can also be written with dedicated Unicode code points, e.g.: …
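To make the point about dedicated code points concrete, here is a small Python sketch that folds Unicode superscript and subscript characters to their plain counterparts, so that H₂O and H2O produce the same token. The function name `fold_super_sub` is hypothetical, and folding is only one possible indexing choice, not necessarily the one the search engine makes.

```python
import unicodedata


def fold_super_sub(text):
    """Replace dedicated superscript/subscript code points with plain characters."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<super>", "<sub>")):
            # Example: SUBSCRIPT TWO decomposes to "<sub> 0032".
            # Keep the code points listed after the tag.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


print(fold_super_sub("H\u2082O"))       # H2O
print(fold_super_sub("j*=\u03c3T\u2074"))  # j*=σT4
```

Note that folding discards information: after it, 14 m² and "14 m2" are indistinguishable, which may or may not be desirable.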

Possessive apostrophe-s and search engines

The apostrophe is normally used for the English possessive s, as in John’s cat, but there are anomalies. What does that have to do with search engines? Two things. Indexing words and word combinations: Our search engine (like most other search engines) indexes words, not strings. So …

Digraphs and Ligatures

Are ligatures easy for search engines to index? Mostly, but correctly identifying and classifying them is not straightforward. It usually starts with a digraph… A digraph is a pair of letters that, when combined, do not produce the normal sounds of the individual letters. An example is the English “ng”, which represents /ŋ/ …

Citroën, Škoda, Hildisvíni, and the search for ragù alle bolognese and bûche de Noël

What challenges arise when we encounter words with diacritics? As a starting point, we index documents as they are written and search for words as you write them. This gives you exact matches, as you would expect. But in some cases that would leave out relevant results. This post series will explore some cases …
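As a rough illustration of why exact matching can miss relevant results, the sketch below produces an unaccented variant of a query word next to the exact form. The function names `strip_diacritics` and `query_variants` are hypothetical, and stripping combining marks after NFD normalization is an assumed technique for this example, not a description of what the search engine does.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def query_variants(word):
    """Return the exact word first, then an unaccented variant if it differs."""
    variants = [word]
    plain = strip_diacritics(word)
    if plain != word:
        variants.append(plain)
    return variants


print(query_variants("Citro\u00ebn"))  # exact form, then "Citroen"
print(query_variants("Skoda"))         # only the exact form
```

Blind stripping is not always appropriate, since in some languages an accented letter is a distinct letter rather than a decorated one.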

Progress on Search Widening – Danish

We have finally gotten around to implementing search widening, starting with Danish. We are currently working on Danish-specific search widening, meaning that if you search for a word we may include other words such as spelling variants, compound word derivatives, decompositions, inflections, conjugations, etc. For other considerations, look for other blog posts with the tag …

Norwegian – Bokmål or Nynorsk

Just a short note: we currently don’t distinguish between Bokmål and Nynorsk. We inherited a code base (https://github.com/privacore/open-source-search-engine) which had set aside 6 bits (64 values) for language IDs. We recently discovered that it only had one code for Norwegian. But Norwegian comes in two versions, Bokmål and Nynorsk, which are similar but do have …

Server load

Apparently our servers are loaded 20000%. We installed a new metrics collector that puts OS metrics into a Prometheus database, and installed an off-the-shelf Grafana dashboard. We were surprised. So either the servers are really loaded, or you shouldn’t trust off-the-shelf dashboards to get everything right.

Vagus – just for heartbeats

I recently had to rework how the search engine instances keep track of each other, and ended up developing a stand-alone tool. Background: The search engine instances need to know which instances are alive and which are dead, so they know which instances to ask and wait for sub-answers from, and which to skip and …
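The general idea of tracking which instances are alive can be sketched with a simple heartbeat table. This is a generic illustration only: the class `LivenessTable`, its methods, and the 5-second timeout are hypothetical and say nothing about how Vagus itself is implemented.

```python
import time


class LivenessTable:
    """Track the last heartbeat per instance; stale instances count as dead."""

    def __init__(self, timeout_seconds=5.0):
        # Assumption: an instance is dead if silent for longer than this.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, instance_id, now=None):
        """Record that a heartbeat from instance_id arrived at time `now`."""
        self.last_seen[instance_id] = time.monotonic() if now is None else now

    def alive(self, now=None):
        """Return the sorted ids of instances heard from within the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


table = LivenessTable(timeout_seconds=5.0)
table.heartbeat("shard-0", now=100.0)
table.heartbeat("shard-1", now=103.0)
print(table.alive(now=106.0))  # shard-0 is stale here, shard-1 is not
```

A query coordinator could then ask only the instances returned by `alive()` and skip the rest instead of waiting for answers that will never come.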

robots.txt – subtle challenges

robots.txt controls which content web crawlers may reach, but the lack of a proper specification has led to some interesting corner cases.
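For readers who have not worked with robots.txt, here is a minimal sketch of the basic check a crawler performs, using Python's standard `urllib.robotparser` module. The rules, the bot name `ExampleBot` and the URLs are made up for illustration; this shows only the simple case, not the corner cases the post refers to.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a (hypothetical) crawler may fetch each URL.
for url in ("https://example.com/index.html",
            "https://example.com/private/report.html"):
    print(url, parser.can_fetch("ExampleBot", url))
```

Different parsers may resolve overlapping Allow and Disallow rules differently, so a result from one library is not a guarantee about how every crawler will behave.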
