Digraphs and Ligatures

Are ligatures easy to index for search engines? They mostly are, but correctly identifying and classifying them is not straightforward. It usually starts with a digraph… A digraph is a pair of letters that, when combined, does not produce the normal sounds of the individual letters. An example is the English “ng”, which represents /ŋ/ …

Citroën, Škoda, Hildisvíni, and the search for ragù alle bolognese and bûche de Noël

What challenges arise when words contain diacritics? As a starting point, we index documents as they are written and we search for words as you write them. This gives you exact matches, as you would expect. But in some cases that would leave out relevant results. This post series will explore some cases …
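
To make the exact-match problem concrete, here is a minimal sketch of one common way to compare words while ignoring diacritics: decompose each character and drop the combining marks. The function name `fold_diacritics` is hypothetical, and this is not necessarily what our search engine does; it only illustrates why "Citroen" and "Citroën" fail to match exactly but can be made to match.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed (illustrative helper only)."""
    # NFD splits a precomposed letter such as "ë" into "e" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


query = "Citroen"
document_word = "Citroën"

# Exact matching misses the document word; folded matching finds it.
exact_match = query == document_word
folded_match = fold_diacritics(query) == fold_diacritics(document_word)
```

Letters without a canonical decomposition, such as "ø", pass through unchanged, so folding like this is only a partial answer and can be wrong for languages where the marked letter is a distinct letter.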

Norwegian – Bokmål or Nynorsk

Just a short note: we currently don’t distinguish between Bokmål and Nynorsk. We inherited a code base (https://github.com/privacore/open-source-search-engine) which had set aside 6 bits (64 values) for language IDs. We recently discovered that it only had one code for Norwegian. But Norwegian comes in two versions, Bokmål and Nynorsk, which are similar but do have …

Server load

Apparently our servers are loaded 20000%. We installed a new metrics collector that puts OS metrics into a Prometheus database, and installed an off-the-shelf Grafana dashboard. We were surprised. So either the servers are really loaded, or you shouldn’t trust off-the-shelf dashboards to get everything right.

Vagus – just for heartbeats

I recently had to rework how the search engine instances keep track of each other, and ended up developing a stand-alone tool. Background: the search engine instances need to know which instances are alive and which are dead, so they know which instances to ask and wait for sub-answers from and which to skip and …

robots.txt – subtle challenges

robots.txt controls which content web crawlers may reach, but the lack of a proper specification has led to some interesting corner cases.

Word occurrences in documents

While investigating some compression schemes I extracted some statistics from a (random) subset of our data so far. The statistics show how many times a word occurs in each document (counting only documents that contain the word). The subset may be a slightly biased and incomplete representation of the complete web, but the differences are intriguing nonetheless. …
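
As a rough illustration of the kind of statistic described here, the sketch below builds, for each word, a histogram of how many documents contain it exactly once, twice, and so on. The function name `occurrence_histograms` and the lowercase, whitespace-based tokenization are assumptions made for the example; the real extraction works on our index data, not on raw strings.

```python
from collections import Counter, defaultdict


def occurrence_histograms(documents):
    """Map each word to a Counter of {occurrences per document: document count}."""
    histograms = defaultdict(Counter)
    for text in documents:
        # Assumed tokenization: lowercase and split on whitespace.
        per_document = Counter(text.lower().split())
        for word, count in per_document.items():
            # Only documents that contain the word contribute to its histogram.
            histograms[word][count] += 1
    return histograms


sample_documents = [
    "the cat and the dog",
    "the bird",
    "a cat",
]
histograms = occurrence_histograms(sample_documents)
```

In this toy sample, "the" occurs twice in one document and once in another, while "cat" occurs once in each of two documents. Histograms skewed towards small counts are the sort of property a compression scheme could exploit.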

Eszett – search widening or normalization?

The German letter ß provides some interesting challenges for search engines.
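
The two options named in the title can be sketched as follows. Normalization rewrites both the indexed word and the query to one canonical form; widening keeps the word as written and searches for its variants as well. The function names `normalize` and `widen` are hypothetical, and the sketch only covers the ß to "ss" direction; it is an illustration of the two ideas, not our implementation.

```python
def normalize(term: str) -> str:
    """Normalization: map the term to one canonical form."""
    # str.casefold() lowercases and also maps "ß" to "ss".
    return term.casefold()


def widen(term: str) -> set:
    """Widening: keep the term as written and add its 'ss' spelling."""
    lowered = term.lower()  # str.lower() leaves "ß" unchanged
    return {lowered, lowered.replace("ß", "ss")}


canonical = normalize("Straße")
variants = widen("Straße")
```

With normalization, "Straße" and "Strasse" become indistinguishable in the index. With widening, the index keeps the original spelling and the query decides how far to stretch, at the cost of searching for more terms.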

Are giraffes bad? (word variations and search widening)

When a user types the question “are giraffes bad?” into a search engine it could search for exactly those words. It would most likely produce somewhat helpful results, but the results would not include documents related to giraffe badness if the documents don’t use exactly those words.

POPCNT Instruction Adventure

I stumbled upon the getNumBitsOn64() function and friends in the source code. The function counts the number of 1 bits in a byte/word/dword/… by using table lookups.
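
For readers unfamiliar with the table-lookup approach, here is a minimal sketch of the general technique, written in Python for brevity rather than in the language of the code base. The names `BYTE_POPCOUNT` and `popcount64_table` are made up for this example; this is not the actual body of getNumBitsOn64().

```python
# Precomputed number of 1 bits for every possible byte value (0..255).
BYTE_POPCOUNT = [bin(i).count("1") for i in range(256)]


def popcount64_table(value: int) -> int:
    """Count the 1 bits in a 64-bit value using one table lookup per byte."""
    value &= 0xFFFFFFFFFFFFFFFF  # restrict to 64 bits
    total = 0
    for _ in range(8):
        total += BYTE_POPCOUNT[value & 0xFF]  # look up the lowest byte
        value >>= 8                           # move on to the next byte
    return total


bits_in_all_ones = popcount64_table(0xFFFFFFFFFFFFFFFF)
bits_in_eleven = popcount64_table(0b1011)
```

A 64-bit count costs eight lookups and eight additions this way, whereas the POPCNT instruction produces the same result in a single instruction on CPUs that support it.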
