Indexing update – summary

Over the past few months we have been extending our indexing and tokenization (what we treat as words). This post sums up the changes so far. Ligatures: we decompose stylistic ligatures (ij/ffl/…) into their component letters. We decompose language-specific ligatures in English (œ and æ) and French (œ). This means that if a …
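As a rough illustration of the ligature handling described in this excerpt, here is a minimal Python sketch. It is not the engine's actual code: the function name `decompose_ligatures`, the `LANGUAGE_LIGATURES` table, and the choice of "oe"/"ae" as replacements are assumptions made for the example. Stylistic ligatures carry a Unicode compatibility decomposition, so the standard `unicodedata` module can expand them; œ and æ have no such decomposition and need an explicit per-language table.

```python
import unicodedata

# Hypothetical per-language table (an assumption for this sketch).
# œ and æ have no Unicode decomposition, so they must be mapped by hand.
LANGUAGE_LIGATURES = {
    "en": {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"},
    "fr": {"œ": "oe", "Œ": "OE"},
}

def decompose_ligatures(word, lang):
    """Expand stylistic ligatures always, language-specific ones per language."""
    table = LANGUAGE_LIGATURES.get(lang, {})
    out = []
    for ch in word:
        if ch in table:
            # Language-specific ligature: only decomposed for listed languages.
            out.append(table[ch])
        elif unicodedata.decomposition(ch).startswith("<compat>"):
            # Stylistic ligature such as U+FB04 (ffl) or U+0133 (ij).
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)

print(decompose_ligatures("wa\ufb04e", "en"))  # stylistic ffl ligature
print(decompose_ligatures("c\u0153ur", "fr"))  # language-specific œ
```

Under these assumptions, æ in a French document is left untouched, because only English lists it as a language-specific ligature.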


Indexing superscripts and subscripts

Indexing and searching for H₂O and j*=σT⁴ is not straightforward. Documents can contain superscripts and subscripts, as in the examples above. In HTML this is done with the <sub> and <sup> tags, viz. H<sub>2</sub>O, j*=σT<sup>4</sup>, 14 m<sup>2</sup> drywall. But superscripts and subscripts for some characters can also be written with dedicated Unicode code points, e.g.: …
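To make the Unicode side of this concrete, here is a small Python sketch of one possible way to fold dedicated superscript and subscript code points back to their plain characters, so that H₂O and H2O end up as the same token. The excerpt does not say how the engine actually treats them, so the function `fold_scripts` and the decision to fold at all are assumptions. The sketch relies on the `<super>` and `<sub>` tags that the standard `unicodedata` module reports for these characters.

```python
import unicodedata

def fold_scripts(text):
    """Replace superscript/subscript code points with their plain form."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<super>", "<sub>")):
            # After the tag come the hex code points of the plain characters.
            plain = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
            out.append(plain)
        else:
            out.append(ch)
    return "".join(out)

print(fold_scripts("H\u2082O"))    # subscript two
print(fold_scripts("14 m\u00b2"))  # superscript two
```

Text marked up with <sub> and <sup> tags would need separate handling in the HTML parser; this sketch only covers the dedicated code points.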


Possessive apostrophe-s and search engines

The apostrophe is normally used for the English possessive s, but there are anomalies. In English the apostrophe is used for the possessive s, as in “John’s cat”. What does that have to do with search engines? Two things. Indexing words and word combinations: our search engine (like most other search engines) indexes words, not strings. So …


Digraphs and Ligatures

Are ligatures easy to index for search engines? They mostly are, but correctly identifying and classifying them is not straightforward. It usually starts with a digraph… A digraph is a pair of letters that, when combined, do not produce the normal sounds of the individual letters. An example is the English “ng”, which represents /ŋ/ …


Citroën, Škoda, Hildisvíni, and the search for ragù alle bolognese and bûche de Noël

Which challenges arise when we encounter words with diacritics? As a starting point, we index documents as they are written and search for words as you write them. This gives you exact matches, as you would expect. But in some cases that would leave out relevant results. This post series will explore some cases …
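The trade-off between exact matches and missed results can be illustrated with a short Python sketch. It is a deliberately naive approach and not necessarily what the engine does: each word is indexed under both its exact form and a diacritic-free form. The names `strip_diacritics` and `index_keys` are made up for this example, and stripping every diacritic regardless of language is an assumption that will be wrong for some languages.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_keys(word):
    """Return the exact form plus a folded form to index the word under."""
    exact = unicodedata.normalize("NFC", word)
    return {exact, strip_diacritics(exact)}

print(strip_diacritics("Citro\u00ebn"))
print(sorted(index_keys("No\u00ebl")))
```

With both keys in the index, a search for the exact spelling still finds exact matches, while a search typed without diacritics can also reach the document.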
