Even if we are allowed to crawl a site in general, are there exceptions we make?
As a follow-up to our post about robots.txt challenges, we now come to the question of what to do when we are allowed to crawl but may not want to, or when the retrieved document itself imposes a further restriction.
Directives to robots can occur outside robots.txt
A server can send robot directives in the X-Robots-Tag HTTP response header. We currently don't support that.
The <head> section of an HTML file can contain a <meta name="robots" content="..."> directive. We support some of those:
noindex: we take that as “don’t index this” and we won’t.
nosnippet: we take that as “do not display snippets or summaries”. But we do display the title.
noarchive: Not relevant for us, as we don't have a "show cached version" feature.
noodp: Not supported. It is also no longer relevant, as the ODP (DMOZ) project shut down on 2017-03-17.
unavailable_after: Not supported
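The directives above can be extracted while parsing the page. Here is a minimal sketch using Python's standard-library HTML parser; the class and function names are illustrative, not our actual implementation:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            content = attrs.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nosnippet"></head></html>'
print(sorted(robots_directives(page)))  # ['noindex', 'nosnippet']
```

An indexer would then check the returned set: if it contains "noindex", skip the document; if it contains "nosnippet", index it but suppress the snippet while keeping the title.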
"Cloaking" sites sound cool, like cloak-and-dagger and medieval assassins. But they aren't. They are sites that return one version of a URL for robots and another version for non-robots. E.g. a site could inspect the User-Agent header in the HTTP request, or the source IP address, and return a page about nice giraffes when a robot retrieves the document, but when a real user tries the same URL in their browser the site returns a page about evil penguins.
When we detect such sites we ban them, making them effectively non-existent in our index.
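One way to detect cloaking is to fetch the same URL twice, once with a crawler User-Agent and once with a browser-like one, and compare how similar the two bodies are. The sketch below assumes that approach; the user-agent strings and the similarity threshold are made-up placeholders, not our actual values:

```python
import difflib
import urllib.request

ROBOT_UA = "ExampleBot/1.0"  # hypothetical crawler user-agent
BROWSER_UA = "Mozilla/5.0"   # typical browser user-agent prefix


def fetch(url, user_agent):
    """Fetch a URL while presenting the given User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def looks_cloaked(robot_body, browser_body, threshold=0.5):
    """Flag the page if the robot and browser versions are mostly dissimilar."""
    similarity = difflib.SequenceMatcher(None, robot_body, browser_body).ratio()
    return similarity < threshold


# The comparison itself needs no network access:
print(looks_cloaked("aaaa aaaa", "bbbb bbbb"))  # True
print(looks_cloaked("same page", "same page"))  # False
```

In practice one would also vary the source IP address, since sites can cloak on that as well as on the User-Agent, and tolerate some difference (ads, timestamps) between the two fetches.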
For-pay sites are a related case. Examples are newspapers, news sites and journals. We don't take into consideration whether the access fee is reasonable, whether the creators of the information are paid fairly, or whether all profit goes to feeding giraffes. We only consider two questions: (a) is it obvious that this is for-pay access? (b) does the site return the same content for both robots and users?
Due to capacity concerns we currently place limits on what we index.
For example, if we encounter an AutoCAD WG file, a document in Thai, or a link to a .np site, then we don't index it.