Fighting the scrapers

In early April the number of visitors to this site dropped dramatically. In two weeks' time, the site lost around 1000 daily page views. Since it gets the majority of its traffic through search, I originally thought it had to do with Google changing its algorithms once more. Earlier, in February/March, Google had made some changes to fight content farms and I thought this site might also belatedly get rated as one of those.

Graph showing the weekly pageviews and the effect of site scraping

After a while it became clear that it wasn’t the overall site that got fewer visitors. It was just a few very popular pages that lost traffic. The biggest page view loss is for a page about the EPS file format. One of the reasons for this seems to be that somebody created a separate site with a matching domain name, just targeting people interested in that file format. There isn’t much that can be done about this. When someone has the main keyword in their domain name, they will get the Google juice. It is however doubtful that 500 to 1000 page views per day can pay for the domain name and all the work involved.

Some time later I got a flash of inspiration and simply typed part of a sentence from one of the prepressure pages into a Google search box. Hey presto, a few people were simply copying entire pages from this site into their own blogs, even leaving in links to the images that I am hosting. I learned that this is a fairly common practice called site scraping. It mainly affects blogs.
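For anyone who wants to repeat that check on more than one page, here is a minimal sketch of the same idea in Python. It is not a tool I used at the time. The function names `pick_phrases` and `search_url` and the choice of eight words per phrase are my own assumptions for illustration. The sketch takes the first words of the longest sentences on a page and wraps them in quotes, so that the search engine looks for the exact phrase.

```python
import re
from urllib.parse import quote_plus


def pick_phrases(page_text, count=3, words_per_phrase=8):
    """Return up to `count` word runs taken from the longest sentences.

    Hypothetical helper: the defaults are arbitrary choices, not a rule.
    """
    # Naive sentence split: a ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    # Longer sentences are less likely to show up elsewhere by coincidence.
    sentences.sort(key=lambda s: len(s.split()), reverse=True)
    phrases = []
    for sentence in sentences[:count]:
        words = sentence.split()
        # Skip sentences too short to make a distinctive phrase.
        if len(words) >= words_per_phrase:
            phrase = " ".join(words[:words_per_phrase])
            phrases.append(phrase.rstrip(".!?,;:"))
    return phrases


def search_url(phrase):
    """Build an exact-phrase search URL; the quotes force an exact match."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')


if __name__ == "__main__":
    sample = (
        "Short one. This is a much longer sentence about the EPS file "
        "format and how it is used in prepress. Tiny."
    )
    for phrase in pick_phrases(sample):
        print(search_url(phrase))
```

The resulting URLs can be opened by hand in a browser. Any result that is not your own site is a candidate for a closer look.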

How to fight site scrapers

Some argue that trying to fight site scraping is a waste of time and that it is better to spend that time creating new content. Others are in favor of taking measures. This Problogger post was my resource for trying to fix the problem:

  • I left a comment on the blog of the main offender, asking him to stop doing this and remove those pages. At the same time I filed a support request with the blog community server he was using, informing them of the content theft. Nothing much happened. I waited a week and then took it to the next stage: you can report site scraping to Google, so I did that. I also filed a complaint with the hosting provider of the blog and left a publicly visible notification about the content theft on the forum that the blog service runs. It was presumably that last intervention that got things moving. Within a day the guy stealing content from various sites across the web lost his entire blog, or should I say ‘collection of copied content’.
  • A quick mail to another site, home of a Macintosh User Group, also got rid of a dozen pages containing content taken from my site.
  • Yet another site also removed content after I left a message saying that violating their hosting service ‘terms and conditions’ could prove to be costly.
  • A fairly big Apple dealer simply replaced a copy of one of my pages with a copy of a Wikipedia page after I sent them a mail.
  • Some sites had a fake or missing email address for contacting them. I used the Google service for reporting a DMCA violation and that did convince some to remove the stolen content.

Woohoo, it works!

By early August Google had removed those offending pages from its index. The number of daily page views moved up again. It is still too early to see if the site will regain its previous popularity.

One other thing that happened during this period is that an internal web site of Demand Media appeared in the list of referring sites. Demand Media is one of those content farms that pay authors to create hundreds of pages per day. They optimize that content for the Google search algorithm, hoping to attract sufficient visitors to earn money from the ads on their sites. If one of those Demand Media authors is using this site as a reference to create their own highly optimized content, this site may still slowly turn into a dusty and obscure spot on the web.