We received many questions in our last AMA on reddit and from webmasters about Findx, about how we build the Findx search engine index, and how our crawler, Findxbot, works. We hope this post can answer some of those questions!
Because Findx is not a meta search engine, it uses its own bot to crawl the internet to find and add pages to our search index. And it's working hard: we've recently improved our algorithms and scaled up to indexing and refreshing about 17 million pages per day. Our index already contains nearly 2.5 billion pages!
From our initial crawling in 2015 until our public beta launch in May 2017, we’ve done a lot of work to make the Findxbot strictly follow the written and unwritten web crawler rules. In May 2017, when we launched our public beta, we also increased the crawling speed. As you can see, we have visited and indexed a lot of websites.
Our Findxbot stats from 20 August:
This has improved the relevancy of the search results on Findx, and it’s getting better each day. We’ve also improved our spam detection algorithms, part of our secret sauce behind Findx, and this has removed millions of spammy pages from the index.
It was during this process that we discovered some sites had blocked the Findxbot. Some did it back in 2015 when we had just started. At that time, the code Findx is based on had bugs that, in some cases, caused the crawler not to respect the robots.txt standard.
In addition to the sites visited by Findxbot on 20 August, it checked 3.2 million more pages where access was denied or explicitly disallowed in robots.txt.
In 2015, when Findx had just started to add pages to the index, Findxbot had some bugs that caused it to occasionally index pages it was not allowed to. In certain error situations it could ignore robots.txt directives, and in other cases it choked on DNS errors and other rare scenarios. We want to apologise to webmasters who were affected back then. We built Findxbot on an existing open source solution that unfortunately had some bugs we were unaware of. The bugs have been fixed, and we have added additional safeguards that will prevent our crawler from misbehaving even if we introduce new bugs ourselves. Be assured that Findxbot respects the directives in robots.txt these days.
When we saw the stats on 20 August, it puzzled us: why had so many websites blocked us? After a short investigation, it became clear that the IP addresses of Findxbot had, for unknown reasons, ended up on a Project Honey Pot list of bad crawler bots, even though Findxbot is a genuine, independent search engine bot. It is still not clear to us why we ended up on the list, or what exactly triggered this unhappy situation. We can't exclude that it's because our bot itself was poorly behaved in rare instances, but we don't know, as we haven't received any detailed information about the reasons.
We really appreciate the work that Project Honey Pot does – this distributed system helps identify and reduce spam on the internet by listing bots that do not respect robots.txt files or that engage in other forms of site ‘scraping’.
Project Honey Pot is the first and only distributed system for identifying spammers and the spambots they use to scrape addresses and content from your website. Using the Project Honey Pot system, webmasters can take action to protect the internet from spambots.
We have been in close contact with Project Honey Pot and clarified our intentions and genuine purpose. Our IP addresses were quickly removed from their blacklist, as they are confident that Findxbot is behaving well. We appreciate the quick support we have had so far, and of course we want to collaborate on the details of how we can comply with and support the project.
CDN companies use Project Honey Pot as an indicator of bad bot behaviour, and because of our blacklisting, some major CDN providers, like Cloudflare, banned our bot’s IP addresses. This is a catastrophe when running a search engine, because large numbers of popular, high-traffic websites use CDNs, and we were not allowed to crawl them.
What is even worse is that when a webpage returns a 403, Findxbot, as bot guidelines prescribe, concludes that the page no longer exists or does not want to be listed in our independent index, and we therefore delete it from Findx. Can you imagine how many important web pages we deleted along with the spam? We have our work cut out for us: we need to revisit quite a few sites in order to raise the quality of the results again.
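In rough outline, the delisting behaviour described above could be sketched like this (a minimal illustration only – the function name and status-code choices are hypothetical, not Findx’s actual code):

```python
# Illustrative sketch: how a crawler might update its index when it
# refreshes an already-known page, depending on the HTTP status code.
def update_index(status_code: int, url: str, index: set) -> None:
    if status_code == 200:
        index.add(url)          # page exists and may be indexed
    elif status_code in (403, 404, 410):
        index.discard(url)      # forbidden or gone: delist the page
    # other codes (e.g. 5xx) are treated as transient; retry later

pages = {"https://example.com/a", "https://example.com/b"}
update_index(403, "https://example.com/a", pages)
print(sorted(pages))  # ['https://example.com/b']
```

This is why a mistaken 403 from a CDN is so damaging: to a well-behaved crawler it is indistinguishable from a site that genuinely wants to be delisted.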
We remain in close contact with Cloudflare and other CDN providers, and Cloudflare has already unblocked our bot’s IP addresses.
Our IPs also ended up on another blocklist, which turned out to be caused by a single user falsely (!) reporting our bot for bad behaviour. The blocklist maintainer ended up blocking the user and whitelisting our IPs. This shows how much power these blocklist maintainers have. If Cloudflare or Project Honey Pot end up blacklisting our IPs again, we might as well close up shop – that is how big a deal it is. What is a search engine worth if hundreds of thousands of sites, including most of the biggest ones, cannot be indexed?
If you know of any CDNs or other services that, because of our prior blacklisting on Project Honey Pot, have banned Findxbot, we would very much like to hear from you.
To allow the Findxbot to index your sites’ pages, we encourage you to explicitly allow it in your robots.txt file:
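For example, a minimal robots.txt that explicitly allows Findxbot to crawl your whole site might look like this (adjust the paths to suit your site):

```
User-agent: Findxbot
Allow: /
```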
If your website is within the list of top-level domains that Findx indexes, you are welcome to add it to the indexing queue. On the Findx search engine page, you’ll find a section for webmasters in the menu, where you can submit your website to the Findx queue.
There are some subtle challenges when writing robots.txt files correctly that webmasters (and crawlers!) need to be aware of. Incorrect redirects or user-agent checks, disallowed redirects, and multiple user-agent matches with poorly placed wildcards are unfortunately common.
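One subtlety worth knowing about is how specific user-agent groups interact with the `*` wildcard group: a crawler should obey the most specific group that matches it and ignore the wildcard group entirely. This can be checked with Python’s standard-library robots.txt parser (the robots.txt content below is a made-up example, not any particular site’s file):

```python
# Check which URLs a given user-agent may fetch under a robots.txt
# with both a specific group and a '*' wildcard group.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Findxbot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Findxbot matches its own group, so only /private/ is off-limits;
# every other crawler falls through to the '*' group and is blocked.
print(parser.can_fetch("Findxbot", "https://example.com/page"))       # True
print(parser.can_fetch("Findxbot", "https://example.com/private/x"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/page"))       # False
```

A common mistake is assuming the `*` group is applied *in addition to* a bot’s own group; as the example shows, the specific group replaces the wildcard rules for that bot.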
More and more people are becoming concerned about maintaining their privacy online. That means private search will play a part in the future online landscape: users can keep their searches secret and avoid being tracked. Soon you’ll be able to get Findx for your desktop and mobile devices.
If you are a webmaster, we’d love your feedback! What can we improve? Let us know what you think: leave a message in our online community.