robots.txt – subtle challenges

Honouring the content of robots.txt and the intentions of the webmaster can be tricky.

Yes, we honour robots.txt. And yes, we have had a few bugs there – all fixed now (and if not, please give us a shout).

That doesn’t mean that everything is rainbows and happy bunnies, because there are some faulty configurations in the wild, and there are grey areas in the specification. Well, in the non-existent specification.

Incorrect content

We have seen at least two examples of incorrect robots.txt content caused by less-than-perfect webmasters.

Redirect to HTML (file-not-found)

  1. We find a link to orange.example.com/foo/foofile.html
  2. We first fetch orange.example.com/robots.txt
  3. orange.example.com/robots.txt redirects to pinapple.example.com/boo/boofile.html
  4. pinapple.example.com/boo/boofile.html is actually an HTML document

The page pinapple.example.com/boo/boofile.html explained that the requested document wasn’t found. With HTTP status code 200, of course 🙁
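One way a crawler can at least detect this kind of misconfiguration is to look at what actually came back before handing it to the robots.txt parser. A minimal sketch in Python, assuming the requests library; the heuristic is illustrative, not a description of our actual implementation:

  import requests

  def looks_like_html(body):
      # Heuristic: HTML error pages start with a tag,
      # robots.txt directives never do.
      return body.lstrip().startswith("<")

  def fetch_robots(host):
      resp = requests.get(f"http://{host}/robots.txt",
                          allow_redirects=True, timeout=10)
      content_type = resp.headers.get("Content-Type", "")
      suspicious = "text/html" in content_type or looks_like_html(resp.text)
      # What to do with a suspicious body is a policy decision;
      # see "What we do" below.
      return resp.text, suspicious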

“Helpful” user-agent check

  1. We find a link to vegetable.example.com/broccoli.html
  2. We first request vegetable.example.com/robots.txt
  3. We get an HTML document saying that “your browser is not supported”

That was from one of the larger sites on the internet. They check the user-agent string, and if it isn’t whitelisted then they return an HTML page. Even for robots.txt. Even more curiously, they had whitelisted the tool curl but not wget.

Redirect to disallowed crawl

  1. We find a link to giraffe.example.com/giraffids/reticulata.html
  2. We first request giraffe.example.com/robots.txt
  3. giraffe.example.com/robots.txt redirects to penguin.example.com/giraffes-in-disquise.html
  4. penguin.example.com/robots.txt says that crawling of /giraffes-in-disquise.html is disallowed

So a robots.txt was located at a place we aren’t allowed to crawl. Uhm… what does that mean?

We chose to not check penguin.example.com/robots.txt.

Troublesome directives in robots.txt

Multiple user-agent matches

The rule is to search for the longest prefix match, and if none is found then use the * entry:

  User-Agent: *
  Disallow: /funny-penguin-pictures/

  User-Agent: penguin
  Disallow: /funny-giraffe-pictures/

  User-Agent: penguin-bot
  Disallow: /funny-platypus-pictures/

So a crawler named “penguin-bot” will match the third entry, and a crawler named “penguin-spider” will match the second entry. So far so good. But what if there are duplicate matches, as in:

  ...
  User-Agent: penguin
  Disallow: /funny-giraffe-pictures/

  ...
  User-Agent: penguin
  Allow: /funny-giraffe-pictures/

Which set of directives applies? Yes, we have seen such robots.txt files in the wild.
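Leaving that question aside, the longest-prefix selection itself is easy to express in code. A rough sketch in Python, assuming the file has already been split into (user-agent token, directives) groups; merging duplicated groups is one possible answer to the question above (taking only the first group is another), not a claim about what any particular crawler does:

  def select_directives(groups, crawler_name):
      # groups: list of (user_agent_token, [directives]) in file order.
      # Pick the token that is the longest prefix of our crawler name,
      # falling back to '*' when nothing else matches.
      name = crawler_name.lower()
      best = None
      for token, _ in groups:
          t = token.lower()
          if t != "*" and name.startswith(t):
              if best is None or len(t) > len(best):
                  best = t
      if best is None:
          best = "*"
      # Merge directives from every group with the winning token, so a
      # duplicated "User-Agent: penguin" entry is not silently dropped.
      merged = []
      for token, directives in groups:
          if token.lower() == best:
              merged.extend(directives)
      return merged

With the example above, select_directives(groups, "penguin-bot") returns the platypus rule and select_directives(groups, "penguin-spider") returns the giraffe rule.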

First match versus longest pattern

If you have this:

  Disallow: /giraffids
  Allow: /giraffids/giraffe
  Disallow: /giraffids/giraffe/reticulata

is /giraffids/giraffe/tippelskirchi allowed? The original specification didn’t have ‘allow’, so the order didn’t matter. With ‘allow’ the order does matter, but the question is which order is correct. Some crawlers (Bing, Google) use the longest pattern and would match the second directive (allow), while other crawlers (e.g. Yandex) use the first match and hit the first directive (disallow). Add wildcards to the mix and it gets somewhat unclear what will match and what won’t.

We use the longest pattern and would match the second directive (allow).
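Ignoring wildcards and the ‘$’ end-anchor, the two interpretations differ by only a few lines. A sketch in Python; rules is the Allow/Disallow list in file order, and the tie-break that lets an allow win on equally long patterns is an assumption of this sketch, not something the (non-existent) specification settles:

  def allowed_longest_pattern(rules, path):
      # rules: list of (is_allow, prefix) in file order.
      # The longest matching pattern decides; no match means allowed.
      best_len, verdict = -1, True
      for is_allow, prefix in rules:
          if path.startswith(prefix):
              if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                  best_len, verdict = len(prefix), is_allow
      return verdict

  def allowed_first_match(rules, path):
      # The first matching rule in file order decides.
      for is_allow, prefix in rules:
          if path.startswith(prefix):
              return is_allow
      return True

  rules = [(False, "/giraffids"),
           (True,  "/giraffids/giraffe"),
           (False, "/giraffids/giraffe/reticulata")]
  allowed_longest_pattern(rules, "/giraffids/giraffe/tippelskirchi")  # True  (allowed)
  allowed_first_match(rules, "/giraffids/giraffe/tippelskirchi")      # False (disallowed)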

Excessive wildcards

If you have:

  Disallow: /*a*b*c*d*e*f*g*h*i*j

then checking if /acgfhdhbbcfbhchacchighfibhcifabnhiahnaaibcafhigbihaihj matches is not necessarily computationally cheap. The specification does not set a limit on the number of *-wildcards that can occur in a directive. See Wikipedia:ReDoS for why this can be a problem.
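The blow-up only bites a naive backtracking matcher (or a careless translation of the pattern into a regular expression). Since ‘*’ is the only repetition operator, a greedy two-pointer matcher stays at worst quadratic however many wildcards the directive contains. A sketch in Python handling just ‘*’ (the ‘$’ end-anchor is left out):

  def full_match(pattern, path):
      # Full wildcard match where '*' matches any run of characters.
      # Greedy two-pointer algorithm: worst case O(len(pattern)*len(path)),
      # never the exponential backtracking a naive matcher can hit.
      pi = si = 0       # indices into pattern and path
      star = -1         # index of the most recent '*' in the pattern
      resume = 0        # path index to retry from after a mismatch
      while si < len(path):
          if pi < len(pattern) and pattern[pi] == path[si]:
              pi += 1
              si += 1
          elif pi < len(pattern) and pattern[pi] == '*':
              star, resume = pi, si
              pi += 1
          elif star != -1:
              pi, resume = star + 1, resume + 1
              si = resume
          else:
              return False
      return all(c == '*' for c in pattern[pi:])

  def robots_match(pattern, path):
      # A robots.txt pattern is a prefix match, which is the same as a
      # full match of pattern + '*'.
      return full_match(pattern + '*', path)

  robots_match("/*a*b*c*d*e*f*g*h*i*j",
               "/acgfhdhbbcfbhchacchighfibhcifabnhiahnaaibcafhigbihaihj")
  # -> False, decided quickly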

Clueless webmasters

robots.txt files can have comments in them, introduced with ‘#’. We have seen a few examples like:

  #robots.txt from clueless.example.org
  #crawling is strictly forbidden except with prior written agreement.
  #Breach of this restriction might result in prosecution bla bla bla
  ...
  #bla bla and we also want a pony!

It might surprise those webmasters that web crawlers normally don’t have natural-language-processing capabilities and are therefore unable to parse the legalese in the comments.

What we do

By default we crawl, as all other search engines do. We also honour robots.txt, as all search engines should. However, the results are only well-defined if the robots.txt file is accessible, readable and correctly formatted.

If robots.txt redirects to a location we aren’t allowed to crawl, or to a non-existent location, then we treat it as if robots.txt doesn’t exist.
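A simplified sketch of that fallback in Python, leaving out caching, redirect-loop detection and proper error handling; disallowed_for_us() is a hypothetical helper standing in for “a location we aren’t allowed to crawl”, not a function from any real library:

  import requests

  NO_ROBOTS = ""   # an empty rule set: everything is allowed

  def get_robots(host, disallowed_for_us):
      try:
          resp = requests.get(f"http://{host}/robots.txt",
                              allow_redirects=True, timeout=10)
      except requests.RequestException:
          return NO_ROBOTS
      if resp.status_code != 200:
          # Missing (a real crawler may treat server errors differently).
          return NO_ROBOTS
      if resp.url != f"http://{host}/robots.txt" and disallowed_for_us(resp.url):
          # Redirected to somewhere we may not crawl.
          return NO_ROBOTS
      # Whatever came back is handed to the robots.txt parser (see below).
      return resp.text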

If the content of robots.txt (or of whatever it redirects to) isn’t actually a robots.txt file then we parse it as one anyway. Any content that can be parsed as robots.txt directives is honoured. The only partially legitimate case of serving robots.txt as HTML we can think of is ancient wikis or CMSs that only served HTML, so if the user wanted a robots.txt he would have had to edit it like an HTML page. Probably not useful, but since we have already fetched the content of the page and the behaviour in this case is best-effort anyway, we might as well try to do the right thing for ancient webservers.