Indexing update – summary

Over the past few months we have been extending parts of our indexing and tokenization (what we treat as words). This post sums up the changes so far. Ligatures: We decompose stylistic ligatures (ij/ffl/…) into their component letters. We decompose language-specific ligatures in English (œ and æ) and French (œ). This means that if a …
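
The decomposition described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not Findx's actual implementation: the function name `decompose_ligatures`, the two tables, and the handling of uppercase forms are assumptions made for the example. Only the per-language rules (œ and æ in English, œ in French) come from the post.

```python
# Stylistic ligatures are decomposed regardless of language.
# (Assumed table; covers the single-codepoint forms of ff, fi, fl, ffi, ffl, ij.)
STYLISTIC_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\u0133": "ij",
    "\u0132": "IJ",
}

# Language-specific ligatures are decomposed only for the listed languages.
# Uppercase entries are an assumption; the post only mentions œ and æ.
LANGUAGE_LIGATURES = {
    "en": {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"},
    "fr": {"œ": "oe", "Œ": "OE"},
}


def decompose_ligatures(text: str, lang: str) -> str:
    """Replace ligature characters with their component letters."""
    table = dict(STYLISTIC_LIGATURES)
    table.update(LANGUAGE_LIGATURES.get(lang, {}))
    return text.translate(str.maketrans(table))


print(decompose_ligatures("\ufb04", "en"))  # the single-character ffl ligature
print(decompose_ligatures("œuvre", "fr"))
print(decompose_ligatures("æther", "fr"))   # æ is left alone for French
```

Keeping the language-specific table separate means a letter such as æ can stay intact in languages where it is a letter in its own right.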

Indexing superscripts and subscripts

Indexing and searching for H₂O and j*=σT⁴ is not straightforward. Documents can contain superscripts and subscripts, as in the examples above. In HTML they can be produced with the <sub> and <sup> tags, viz. H<sub>2</sub>O, j*=σT<sup>4</sup>, and 14 m<sup>2</sup> drywall. But superscripts and subscripts for some characters can also be produced with dedicated Unicode code points, e.g.: …
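
One possible way to handle the dedicated Unicode code points is to fold them back to their base characters before indexing, so that H₂O and H2O produce the same token. The sketch below is an assumption about how this could be done, not a description of what Findx does. The helper name `fold_scripts` is hypothetical, and it relies on the compatibility decompositions exposed by Python's standard `unicodedata` module.

```python
import unicodedata


def fold_scripts(text: str) -> str:
    """Replace superscript/subscript code points with their base characters."""
    folded = []
    for ch in text:
        # For U+2082 (subscript two) this returns "<sub> 0032".
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith(("<super>", "<sub>")):
            code_points = decomposition.split()[1:]
            folded.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            folded.append(ch)
    return "".join(folded)


print(fold_scripts("H₂O"))
print(fold_scripts("j*=σT⁴"))
print(fold_scripts("14 m²"))
```

Checking the decomposition tag keeps the change narrow: other compatibility characters, such as ligatures or full-width forms, are left untouched.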

Possessive apostrophe-s and search engines

The apostrophe is normally used for the English possessive s, but there are anomalies. In English the apostrophe is used for the possessive s, such as John’s cat. What does that have to do with search engines? Two things. Indexing words and word combinations: Our search engine (and most other search engines too) indexes words, not strings. So …
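
To make "words, not strings" concrete, here is a small tokenizer sketch that treats a trailing possessive 's as separate from the word itself. The function name `index_terms` and the rule of simply dropping the suffix are assumptions for illustration. The post does not say this is how Findx handles it, and the sketch ignores the anomalies the post mentions.

```python
import re

# Accept both the typewriter apostrophe and the typographic one (U+2019).
APOSTROPHES = ("'", "\u2019")
WORD = re.compile(r"\w+(?:['\u2019]\w+)*")


def index_terms(text: str) -> list[str]:
    """Split text into lowercase words, dropping a trailing possessive 's."""
    terms = []
    for word in WORD.findall(text.lower()):
        for apostrophe in APOSTROPHES:
            if word.endswith(apostrophe + "s"):
                word = word[:-2]  # remove the apostrophe and the s
                break
        terms.append(word)
    return terms


print(index_terms("John’s cat"))
```

With this rule a search for "john" can match a document that only contains "John’s".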

Digraphs and Ligatures

Are ligatures easy for search engines to index? They mostly are, but correctly identifying and classifying them is not straightforward. It usually starts with a digraph… A digraph is a pair of letters that, when combined, do not produce the normal sounds of the individual letters. An example is the English “ng”, which represents /ŋ/ …

Citroën, Škoda, Hildisvíni, and the search for ragù alle bolognese and bûche de Noël

What challenges arise when we encounter words with diacritics? As a starting point, we index documents as they are written and we search for words as you write them. This gives you exact matches, as you would expect. But in some cases that would leave out relevant results. This post series will explore some cases …
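
A common way to avoid missing relevant results is to search for a diacritic-free form alongside the exact word. The sketch below shows that idea using Unicode NFD normalization from Python's standard library. The function names are hypothetical, and stripping every combining mark is a deliberate simplification rather than the approach the post series arrives at.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Decompose the word and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def query_variants(word: str) -> set[str]:
    """Return the exact word plus its diacritic-free form."""
    return {word, strip_diacritics(word)}


for word in ["Citroën", "Škoda", "Hildisvíni", "ragù", "bûche"]:
    print(word, "->", strip_diacritics(word))
```

Blanket stripping is not always appropriate, since in some languages a letter with a diacritic is a distinct letter, so a real system would likely need per-language rules.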

Online privacy events: Data Privacy Day and Safer Internet Day

This year’s Data Privacy Day and Safer Internet Day are back-to-back, and as part of our efforts to support both events, we are launching a new website aimed at helping families take better care of their online privacy. Activate Privacy provides you with a number of small, actionable steps that you can follow that will help …

Findx and Qwant: Two more good private search engines

In March 2017, Kim Elmose wrote a review of Findx and Qwant in Danish, “Findx og Qwant: Yderligere to gode private søgemaskiner.” We received permission to translate and republish his article in English here – thank you Kim! I haven’t used Google to search for information for more than a year and a half now. I …

December roundup – Danish result improvements, instant answers and more!

Improving results at Findx: We’ve been tackling the problem of the quality of results in a number of ways here at Findx. Improve results in one language to improve them all: Now that Findx has been in public beta for some time, we are focusing on improving the index. But testing various tweaks to see …

Progress on Search Widening – Danish

We have finally gotten around to implementing search widening, starting with Danish. We are currently working on Danish-specific search widening, meaning that if you search for a word we may include other words such as spelling variants, compound word derivatives, decompositions, inflections, conjugations, etc. For other considerations, look for other blog posts with the tag …
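
In its simplest form, search widening can be pictured as expanding each query term into a group of alternatives that are matched with OR. The sketch below assumes a hand-written lookup table with a single illustrative entry (inflections of the Danish noun "hus"). The table, the function name `widen`, and the output shape are all hypothetical and say nothing about how Findx generates or stores its variants.

```python
# Illustrative entry only: inflected forms of the Danish noun "hus" (house).
WIDENING_TABLE = {
    "hus": ["huset", "huse", "husene"],
}


def widen(query_terms: list[str], table: dict[str, list[str]] = WIDENING_TABLE) -> list[list[str]]:
    """Return one group per query term: the term itself plus its known variants."""
    return [[term] + table.get(term, []) for term in query_terms]


# Each inner list is a set of alternatives; a document must match one word per group.
print(widen(["hus", "nyt"]))
```

Keeping the original term first in each group makes it easy to rank exact matches above widened ones.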

To index or not to index, that is the policy question

If we are allowed to crawl a site in general, are there any exceptions we make? As a follow-up to our post about robots.txt challenges we now come to the question of what to do if we are allowed to crawl, because we may not want to or we may get a different restriction in …
