Honouring the content of robots.txt and the intentions of the webmaster can be tricky.
Yes, we honour robots.txt. And yes, we have had a few bugs there – all fixed now (and if not please give us a shout).
That doesn’t mean that everything is rainbows and happy bunnies, because there are some faulty configurations in the wild, and there are grey areas in the specification. Well, in the non-existent specifications.
We have seen at least two examples of incorrect robots.txt content caused by less-than-perfect webmasters.
The page pinapple.example.com/boo.boo.html explained that the requested document wasn’t found. With HTTP status code 200, of course 🙁
That was from one of the larger sites on the internet. They check the user-agent string, and if it isn’t whitelisted they return an HTML page. Even for robots.txt. Even more curiously, they had whitelisted curl but not wget.
So a robots.txt was located in a place we aren’t allowed to crawl. Uhm… what does that mean?
We chose not to check penguin.example.com/robots.txt.
The rule is to search for longest prefix match, and if none is found then use the * entry:
User-Agent: *
Disallow: /funny-penguin-pictures/

User-Agent: penguin
Disallow: /funny-giraffe-pictures/

User-Agent: penguin-bot
Disallow: /funny-platypus-pictures/
So a crawler named “penguin-bot” will match the third entry, and a crawler named “penguin-spider” will match the second entry. So far so good. But what if there are duplicated matches, as in:
...
User-Agent: penguin
Disallow: /funny-giraffe-pictures/
...
User-Agent: penguin
Allow: /funny-giraffe-pictures/
Which set of directives applies? Yes, we have seen such robots.txt files in the wild.
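The longest-prefix user-agent selection described above can be sketched in a few lines of Python. This is a hypothetical illustration, not our actual crawler code; the `groups` dict mapping user-agent tokens to their directives is an assumed representation of a parsed robots.txt:

```python
def select_group(crawler_name, groups):
    """Pick the group whose User-Agent token is the longest prefix
    of the crawler's name; fall back to '*' if nothing matches."""
    best = None
    for token in groups:
        if token != "*" and crawler_name.lower().startswith(token.lower()):
            if best is None or len(token) > len(best):
                best = token
    if best is not None:
        return best
    return "*" if "*" in groups else None

# The example robots.txt from above, as a parsed structure:
groups = {
    "*": ["/funny-penguin-pictures/"],
    "penguin": ["/funny-giraffe-pictures/"],
    "penguin-bot": ["/funny-platypus-pictures/"],
}

select_group("penguin-bot", groups)     # -> "penguin-bot" (third entry)
select_group("penguin-spider", groups)  # -> "penguin" (second entry)
select_group("walrus-bot", groups)      # -> "*" (fallback)
```

Note that the sketch silently tolerates duplicated User-Agent tokens by letting the dict collapse them, which is one of several possible answers to the duplicate-entry question.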
If you have this:
Disallow: /giraffids
Allow: /giraffids/giraffe
Disallow: /giraffids/giraffe/reticulata
is /giraffids/giraffe/tippelskirchi allowed? The original specification didn’t have ‘allow’, so the order didn’t matter. With ‘allow’ the order does matter, but the question is which order is correct. Some crawlers (Bing, Google) use the longest pattern and would match the second directive (allow), while other crawlers (e.g. Yandex) use the first match and hit the first directive (disallow). Add wildcards to the mix and it becomes somewhat unclear what will match and what won’t.
We use the longest pattern and would match the second directive (allow).
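The longest-pattern rule can be sketched as follows. A minimal Python illustration, not our actual implementation; it handles plain path prefixes only (no wildcards), and breaks ties in favour of allow:

```python
def is_allowed(path, rules):
    """rules: list of (directive, pattern) pairs in file order.
    The most specific (longest) matching pattern wins; an unmatched
    path is allowed by default."""
    best = ("allow", "")  # no match at all => allowed
    for directive, pattern in rules:
        if path.startswith(pattern) and len(pattern) > len(best[1]):
            best = (directive, pattern)
    return best[0] == "allow"

# The giraffid example from above:
rules = [
    ("disallow", "/giraffids"),
    ("allow", "/giraffids/giraffe"),
    ("disallow", "/giraffids/giraffe/reticulata"),
]

is_allowed("/giraffids/giraffe/tippelskirchi", rules)  # -> True
is_allowed("/giraffids/giraffe/reticulata", rules)     # -> False
is_allowed("/giraffids/okapi", rules)                  # -> False
```

A first-match crawler would instead return the verdict of the first rule whose pattern matches, which for /giraffids/giraffe/tippelskirchi is the initial disallow.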
If you have a directive containing many ‘*’ wildcards, then checking whether a long path such as /acgfhdhbbcfbhchacchighfibhcifabnhiahnaaibcafhigbihaihj matches is not necessarily computationally cheap. The specification does not set a limit on the number of ‘*’ wildcards that can occur in a directive. See Wikipedia:ReDoS for why this can be a problem.
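The blow-up can be avoided: because robots.txt patterns only have the ‘*’ wildcard (no general regex), a greedy scan matches in linear time, with no backtracking at all. A sketch, assuming prefix-match semantics and ignoring the ‘$’ end anchor; this is an illustration, not our production matcher:

```python
def wildcard_match(path, pattern):
    """Does `pattern` (with '*' wildcards) match a prefix of `path`?
    Greedy leftmost matching of the literal segments between stars is
    sufficient and runs in roughly O(len(path) * segments)."""
    parts = pattern.split("*")
    # The segment before the first '*' must be a literal prefix.
    if not path.startswith(parts[0]):
        return False
    pos = len(parts[0])
    # Each remaining segment must occur, in order, after the previous one.
    for part in parts[1:]:
        if not part:
            continue  # consecutive or trailing '*' matches anything
        idx = path.find(part, pos)
        if idx == -1:
            return False
        pos = idx + len(part)
    return True

wildcard_match("/funny-penguin-pictures/img.png", "/funny-*-pictures/")  # -> True
wildcard_match("/abc", "/*z*")                                           # -> False
```

Feeding the same inputs to a naive backtracking matcher (or a careless regex translation of the pattern) is where the ReDoS risk comes from.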
robots.txt can have comments in them with ‘#’. We have seen a few examples of:
#robots.txt from clueless.example.org
#crawling is strictly forbidden except with prior written agreement.
#Breach of this restriction might result in prosecution bla bla bla
...
#bla bla and we also want a pony!
It might surprise those webmasters that web crawlers normally don’t have natural-language processing capabilities and are therefore unable to parse legalese in comments.
By default we crawl, as all other search engines do. We also honor robots.txt, as all search engines should do. However, only if the robots.txt file is accessible, readable and correctly formatted will the results be well-defined.
If robots.txt redirects to a location we aren’t allowed to crawl, or to a non-existent location, then we treat it as if robots.txt doesn’t exist.
If the content of robots.txt (or whatever it redirects to) isn’t a robots.txt file, then we parse it as a robots.txt file anyway. Any content that can be parsed as robots.txt directives is honoured. The only partially legitimate case of returning robots.txt as HTML we can think of is ancient wikis or CMSes that only served HTML, so if the user wanted a robots.txt they would have had to edit it like an HTML page. Probably not useful, but since we have already fetched the content of the page, and the behaviour in this case is best-effort anyway, we might as well try to do the right thing for ancient webservers.
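Such best-effort parsing can be sketched as a lenient line scanner that keeps anything resembling a directive and ignores the rest. A hypothetical illustration (the tag-stripping and the limited directive set are assumptions, and a real parser would also handle Sitemap, Crawl-delay, etc.):

```python
import re

def lenient_parse(content):
    """Best-effort robots.txt parsing: replace HTML-tag-looking spans
    with newlines, then collect lines that look like directives."""
    text = re.sub(r"<[^>]*>", "\n", content)
    directives = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # strip comments
        m = re.match(r"\s*(user-agent|allow|disallow)\s*:\s*(\S*)", line, re.I)
        if m:
            directives.append((m.group(1).lower(), m.group(2)))
    return directives

html = "<html><body>\nUser-agent: *<br>\nDisallow: /private/  # members only\n</body></html>"
lenient_parse(html)  # -> [('user-agent', '*'), ('disallow', '/private/')]
```

Everything that isn’t recognisable as a directive, including the legalese comments from earlier, simply falls on the floor.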