Norwegian – Bokmål or Nynorsk

Just a short note: we currently don’t distinguish between Bokmål and Nynorsk. We inherited a code base (https://github.com/privacore/open-source-search-engine) which had set aside 6 bits (64 values) for language ids. We recently discovered that it only had one code for Norwegian. But Norwegian comes in two versions, Bokmål and Nynorsk, which are similar but do have …
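To illustrate what a 6-bit language id implies, here is a minimal sketch of packing such an id into a larger word. The layout (id in the low 6 bits of a 32-bit value) and the names `kLangBits`, `setLangId` and `getLangId` are assumptions made for illustration, not taken from the code base.

```cpp
#include <cstdint>

// Hypothetical layout: the language id occupies the low 6 bits of a 32-bit word.
constexpr unsigned kLangBits = 6;
constexpr std::uint32_t kLangMask = (1u << kLangBits) - 1;  // 0x3f, i.e. ids 0..63

// Store a language id, leaving the other 26 bits untouched.
inline std::uint32_t setLangId(std::uint32_t packed, std::uint32_t langId) {
    return (packed & ~kLangMask) | (langId & kLangMask);
}

// Read the language id back out.
inline std::uint32_t getLangId(std::uint32_t packed) {
    return packed & kLangMask;
}
```

With a field this narrow, only 64 distinct ids fit. In this sketch an id of 64 or more would be silently truncated by the mask, so every additional language code has to come out of the same fixed pool.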

Server load

Apparently our servers are loaded 20000%. We installed a new metrics collector that puts OS metrics into a Prometheus database, and installed an off-the-shelf Grafana dashboard. We were surprised. So either the servers are really loaded, or you shouldn’t trust off-the-shelf dashboards to get everything right.

Vagus – just for heartbeats

I recently had to rework how the search engine instances keep track of each other, and ended up developing a stand-alone tool. Background: the search engine instances need to know which instances are alive and which are dead, so they know which instances to ask and wait for sub-answers from, and which to skip and …

robots.txt – subtle challenges

robots.txt controls which content web crawlers may reach, but the lack of a proper specification has led to some interesting corner cases.

Word occurrences in documents

While investigating some compression schemes I extracted some statistics from a (random) subset of our data so far. The statistics show how many times a word occurs in each document (counting only documents that contain the word). The subset may be a slightly biased and incomplete representation of the complete web, but the differences are intriguing nonetheless. …

Eszett – search widening or normalization?

The German letter ß provides some interesting challenges for search engines.
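As a rough illustration of the normalization option named in the title, the sketch below folds every ß into "ss" before further processing. It assumes UTF-8 input, and the function name `normalizeEszett` is hypothetical; it is not a description of how the search engine actually handles the letter.

```cpp
#include <cstddef>
#include <string>

// Replace every U+00DF (ß) in a UTF-8 string with "ss".
// Hypothetical helper for illustration only.
std::string normalizeEszett(const std::string& utf8) {
    static const std::string eszett = "\xC3\x9F";  // UTF-8 encoding of U+00DF
    std::string out;
    out.reserve(utf8.size());
    std::size_t pos = 0;
    while (true) {
        const std::size_t hit = utf8.find(eszett, pos);
        if (hit == std::string::npos) {
            out.append(utf8, pos, std::string::npos);  // copy the remaining tail
            break;
        }
        out.append(utf8, pos, hit - pos);  // copy the text before the ß
        out += "ss";
        pos = hit + eszett.size();
    }
    return out;
}
```

Normalizing this way makes "Straße" and "Strasse" indistinguishable. Widening would instead keep the original word and search for both spellings.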

Are giraffes bad? (word variations and search widening)

When a user types the question “are giraffes bad?” into a search engine, the engine could search for exactly those words. That would most likely produce somewhat helpful results, but the results would not include documents about giraffe badness that don’t use exactly those words.

POPCNT Instruction Adventure

I stumbled upon the getNumBitsOn64() function and friends in the source code. The function counts the number of 1 bits in a byte/word/dword/… by using table lookups.
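For readers unfamiliar with the technique, here is a minimal sketch of a table-lookup bit count next to a compiler builtin. The names `popcountTable64` and `popcountBuiltin64` are hypothetical, and this is not the actual getNumBitsOn64() implementation. The builtin `__builtin_popcountll` is specific to GCC and Clang, and it only compiles to the POPCNT instruction when the target enables it (for example with `-mpopcnt`).

```cpp
#include <cstdint>

// 256-entry table: count[b] is the number of 1 bits in the byte value b.
struct BitTable {
    std::uint8_t count[256];
    BitTable() {
        count[0] = 0;
        for (int i = 1; i < 256; ++i)
            count[i] = static_cast<std::uint8_t>((i & 1) + count[i / 2]);
    }
};
static const BitTable kBitTable;

// Table-lookup variant: add up the table entry for each of the 8 bytes.
inline int popcountTable64(std::uint64_t v) {
    int n = 0;
    for (int i = 0; i < 8; ++i) {
        n += kBitTable.count[v & 0xff];
        v >>= 8;
    }
    return n;
}

// Builtin variant (GCC/Clang only).
inline int popcountBuiltin64(std::uint64_t v) {
    return __builtin_popcountll(v);
}
```

Both functions return the same count for every input. The difference is that the table version does eight memory lookups per 64-bit value, while the builtin can be a single instruction on hardware that supports it.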
