How we are reducing web spam in the Findx index

Indexing only the valuable, non-spam pages of the internet keeps us on our toes! Millions of spammy pages are created ad hoc, and their shady owners try to game the system with all sorts of techniques to make these pages rise to the top of the search results.

The never-ending fight against spam

We’re sure you’ve had a spam email or two in your inbox. Or, if you have published your email address on the web, perhaps you are inundated with tens or hundreds of spam emails each day. For emailed spam, automatic filters help reduce the amount that gets through to you. And of course, over time you have learned to recognise a spam or phishing email yourself, and to delete it or report it to your local authorities.

As Findx is a new search engine and is building its search index from scratch with its crawler, the Findxbot, there was a small learning curve for us in making our crawler identify web spam and index just the ‘valuable’ pages. When the Findxbot looks at a page, it adds the links it finds on that page to its indexing queue. It is a program, so it can’t be subjective – it must use fixed criteria to decide whether a site is spammy or not. So, when building a crawler-based search engine, avoiding spam sites completely is impossible, but we decided to take some action against the spammers.
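
As a rough illustration of what "fixed criteria" can mean in practice, here is a minimal Python sketch. It is not the Findxbot's actual code: the keyword list, the hyphen threshold, and the looks_spammy and enqueue_links helpers are all hypothetical, chosen only to show how a crawler can apply the same mechanical checks to every link before adding it to its indexing queue.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical fixed rules; a real crawler would combine many more signals.
SPAM_KEYWORDS = ("free-pills", "casino-bonus", "cheap-loans")
MAX_HYPHENS_IN_HOST = 3


def looks_spammy(url: str) -> bool:
    """Apply the same fixed checks to every URL (no subjective judgement)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    # Rule 1: host names stuffed with hyphenated keywords.
    if host.count("-") > MAX_HYPHENS_IN_HOST:
        return True
    # Rule 2: known spam phrases anywhere in the host or path.
    return any(word in host or word in path for word in SPAM_KEYWORDS)


def enqueue_links(links, queue, seen):
    """Add links found on a page to the queue, skipping spam and repeats."""
    for link in links:
        if link in seen or looks_spammy(link):
            continue
        seen.add(link)
        queue.append(link)


queue, seen = deque(), set()
enqueue_links(
    [
        "https://example.org/article",
        "https://best-cheap-free-pills-online.example/buy",
        "https://example.org/article",  # duplicate, skipped
    ],
    queue,
    seen,
)
print(list(queue))
```

Rules this simple will always miss some spam and occasionally flag a harmless page, which is why avoiding spam completely is not possible with fixed criteria alone.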

To free up resources, Findx is no longer indexing adult websites

Now that we have been growing the Findx index for a while, we have been able to analyse how well the Findxbot is performing. Which types of sites add too many spam links to the indexing queue?

Websites with adult content, primarily porn, are some of the worst offenders – they often contain spam content, malicious JavaScript and viruses, and even phishing links. These types of websites almost drowned our poor index queue in spam and “no-value” content.

We want to make it clear – we aren’t censoring porn sites for moral reasons. We know that many people prefer private search engines for … hrmm, “private” searches, including porn – at least, that was the feedback we saw in our Reddit AMA!

We made the decision not to index adult sites to help the Findxbot prioritise its resources, so that it can build a quality index that is as free of adult-related web spam as possible and gives us a better foundation for providing relevant and useful search results.

So, we have deleted the overwhelming number of adult websites from the Findx index and blocked them, so they are no longer indexed. The block list is based on our own adult content filter, but for now we have decided not to share the specific details of the block list.
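
To give a feel for how a block list can work, here is a small Python sketch. It assumes a plain set of blocked domains and an index stored as a dictionary keyed by URL; both are simplifications for illustration, and the domain names and the is_blocked and purge_index helpers are hypothetical, not details of the real Findx block list.

```python
from urllib.parse import urlparse

# Hypothetical block list; the real list is not published.
BLOCKED_DOMAINS = {"adult-site.example", "another-adult-site.example"}


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def purge_index(index: dict) -> dict:
    """Return a copy of the index without pages from blocked domains."""
    return {url: page for url, page in index.items() if not is_blocked(url)}


index = {
    "https://news.example/story": "page content",
    "https://www.adult-site.example/video": "page content",
}
index = purge_index(index)
print(sorted(index))
```

The same is_blocked check could also be applied to new links before they reach the indexing queue, so that blocked sites are not crawled again.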

Censorship in search results

We have had many discussions about “When does a search engine become a censor?” Do you think we are doing that now, and is this change OK with you? Let us know if you have questions or concerns by leaving a note for us in our forum.

You can help improve the quality of search results

Our algorithm and the way we classify adult content now keep out a lot of unwanted content, but it’s a computer, so it misses the boundary cases, and you are so much better than a computer!

If you do find a spammy website or one that is displaying ‘mature content’ in your Findx search results, please report the site using the quality rating feature.

Here’s how to quality rate search results on Findx.

Not child safe

Help us make Findx as useful for you as possible by giving us your feedback.