To index or not to index, that is the policy question

If we are allowed to crawl a site in general, what exceptions do we make?

As a follow-up to our post about robots.txt challenges, we now come to the question of what to do when we are allowed to crawl: we may not want to index a document anyway, or the retrieved document itself may carry a further restriction.

Robots directives

Directives to robots can occur outside robots.txt

HTTP response header

A response can carry an X-Robots-Tag header with robots directives. We currently don’t support that.
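To illustrate what that would involve, here is a minimal sketch (not our production code) of reading such a header, assuming the Python requests library; the bot name is made up:

    import requests

    def robots_header_directives(url):
        # Fetch the document and look for the optional X-Robots-Tag header.
        response = requests.get(url, headers={"User-Agent": "example-bot/0.1"})
        header = response.headers.get("X-Robots-Tag", "")
        # The header may hold several comma-separated directives,
        # e.g. "noindex, nofollow".
        return [d.strip().lower() for d in header.split(",") if d.strip()]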

HTML meta tag

The <head> section of an HTML file can contain a <meta name="robots" content="..."> directive. We support some of those directives.
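Purely as an illustration (this is not our actual parser), extracting those directives could look like this, assuming BeautifulSoup is available:

    from bs4 import BeautifulSoup

    def robots_meta_directives(html):
        # Collect directives from all <meta name="robots" content="..."> tags.
        soup = BeautifulSoup(html, "html.parser")
        directives = []
        for tag in soup.find_all("meta", attrs={"name": "robots"}):
            content = tag.get("content", "") or ""
            directives.extend(d.strip().lower() for d in content.split(",") if d.strip())
        return directives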

Directives and how we interpret them

noindex: we take that as “don’t index this” and we won’t.

nosnippet: we take that as “do not display snippets or summaries”. But we do display the title.

nofollow: supported

noarchive: Not relevant for us as we don’t support “show cached version”

noodp: Not supported. It is also no longer relevant, as the ODP project shut down on 2017-03-17.

unavailable_after: Not supported
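Put together, the interpretation above roughly amounts to the following decision table. This is an illustrative sketch with made-up names, not our actual code:

    def index_decision(directives):
        return {
            # noindex: drop the document from the index entirely.
            "index": "noindex" not in directives,
            # nosnippet: keep the title but suppress snippets and summaries.
            "show_snippet": "nosnippet" not in directives,
            # nofollow: do not queue links found in this document.
            "follow_links": "nofollow" not in directives,
            # noarchive, noodp and unavailable_after are ignored, as listed above.
        }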

But do we want to?

Cloaking sites

Cloaking sites sound cool, like cloak & dagger and medieval assassins. But they aren’t. They are sites that return one version of a URL to robots and a different version to non-robots. E.g. a site could inspect the User-Agent header in the HTTP request, or the source IP address, and return a page about nice giraffes when a robot retrieves the document, but when a real user requests the same URL with a browser, the site returns a page about evil penguins.

When we detect such sites we ban them, making them effectively non-existent as far as our index is concerned.
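A very rough way to spot this kind of cloaking, sketched below purely for illustration (real detection has to tolerate ads, timestamps and other legitimate differences), is to fetch the same URL with a bot-like and a browser-like User-Agent and compare the responses:

    import requests

    BOT_UA = "example-bot/0.1"                       # made-up bot identity
    BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"   # browser-like identity

    def looks_cloaked(url):
        as_bot = requests.get(url, headers={"User-Agent": BOT_UA}).text
        as_browser = requests.get(url, headers={"User-Agent": BROWSER_UA}).text
        # Naive heuristic: flag the page if the two versions differ wildly in size.
        longer = max(len(as_bot), len(as_browser)) or 1
        return abs(len(as_bot) - len(as_browser)) / longer > 0.5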

Paywalls

Examples are newspapers, news sites and journals. We don’t take into consideration whether the access fee is reasonable, whether the creators of the information are paid fairly, or whether all profit goes to feeding giraffes. We only consider two things: (a) is it obvious that this is paid-for access? (b) does the site return the same content to both robots and users?
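Expressed as code, the policy is just those two checks. A trivial sketch with made-up names:

    def keep_paywalled_site(obviously_paid_access, same_content_for_robots_and_users):
        # Both criteria must hold for a paywalled site to stay in the index.
        return obviously_paid_access and same_content_for_robots_and_users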

Technical limits and our choices

Due to capacity concerns we currently limit our index to:

  • mostly West and Central European languages (English, Finnish, Hungarian, Spanish, …)
  • most, but not all, top-level domains
  • HTML, plain text, PDF and (old-format) MS Word documents

For example, if we encounter an AutoCAD DWG file, a Thai document, or a link to a .np site, then we don’t index it.
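For illustration, a filter built from the limits above might look roughly like this; the exact lists are assumptions for the example, not our complete configuration:

    ALLOWED_LANGUAGES = {"en", "fi", "hu", "es"}     # subset of supported languages
    BLOCKED_TLDS = {"np"}                            # example only
    ALLOWED_CONTENT_TYPES = {
        "text/html",
        "text/plain",
        "application/pdf",
        "application/msword",                        # old-format MS Word
    }

    def should_index(language, tld, content_type):
        return (
            language in ALLOWED_LANGUAGES
            and tld not in BLOCKED_TLDS
            and content_type in ALLOWED_CONTENT_TYPES
        )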