Now blocking 442 splog servers

We are now blocking any ping for a URL which resolves to one of 442 web servers. That’s 73,592 splog URLs in all. Check out the latest list.

In my opinion, this means that splogs are big business and the spammers are making a lot of money out of them. Four hundred and forty-two dedicated web servers, after all! These machines aren’t doing anything else (as far as we can tell) but hosting splog sites and sending out pings for them.

Number of pings (and splog URLs) dropping

Since we introduced our splog blocklist and Spam Huntress posted about it, both i) the amount of traffic we waste and ii) the number of pings we are receiving have dropped. At the moment we’re not quite sure why the second one has happened, but we think that either one of our XML-RPC ping sources (pingomatic.com being the biggest) has started filtering as well, or else one of the big blog hosting sites has started blocking splogs. Anyway, it’s all good news.

We now have 370 web server IP addresses in our blocklist. Yesterday we received 474,394 pings, of which 161,094 were blocked immediately because the web server was on the blocklist. Of the rest, 179,467 were picked up by our incremental “delay buffer”, which means that if we get another ping for the same URL, we won’t go out and index it again for up to 8 days.
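To put yesterday's figures in proportion, here is a quick back-of-the-envelope calculation in Python. It assumes that every ping which was neither blocked nor held in the delay buffer went straight on to be indexed, which the figures above imply but do not state outright. The variable and function names are just for illustration.

```python
# Figures quoted above for one day's pings.
total_pings = 474_394
blocked = 161_094      # web server IP was on the blocklist
delayed = 179_467      # held back by the delay buffer

# Assumption: whatever is left was indexed straight away.
remaining = total_pings - blocked - delayed


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


print(f"blocked:   {blocked:>7,} ({share(blocked, total_pings):.1f}%)")
print(f"delayed:   {delayed:>7,} ({share(delayed, total_pings):.1f}%)")
print(f"remaining: {remaining:>7,} ({share(remaining, total_pings):.1f}%)")
```

On those numbers, roughly a third of pings are blocked outright, a bit more than a third are delayed, and under 30% are left to index immediately.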

Here’s a rough explanation of how our splog blocklist works. I already posted this text as a comment to a Spam Huntress post a while ago:

At A2B (see http://www.a2b.cc for more) we run a search engine with a ping interface. Bloggers who have geo-located META tags in their HTML (see http://www.a2b.cc/help-searching-addurl-blogping.a2b for more) can ping us and we’ll pick up (parse) their page and index it in our geosearch engine. We receive pings from many individual bloggers, a full ping feed from pingomatic.com, and bulk pings from several other sources, usually around 700,000 to 1 million pings per day. With approximately 200 IP addresses in the blocklist, about 37% of daily pings are blocked immediately.

To generate the list, we recorded the URL of each website (read: blog) we were pinged with and resolved it to the IP address of the web server hosting it. We stored that IP address and incremented a counter every time we received a ping for a URL on the same web server. We soon noticed that we were getting many thousands of pings for the same IP addresses, so we pulled a script together to list the top IP addresses by number of pings.
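The counting step described above could be sketched roughly like this in Python. This is an illustration under assumptions, not the script actually used: the function names (`resolve_ip`, `count_pings_by_ip`, `top_ips`) are hypothetical, and the DNS lookup is passed in as a parameter so that the logic can be tried out without touching the network.

```python
import socket
from collections import Counter
from urllib.parse import urlparse


def resolve_ip(url, resolver=socket.gethostbyname):
    """Return the IP address of the web server hosting url, or None."""
    host = urlparse(url).hostname
    if host is None:
        return None
    try:
        return resolver(host)
    except OSError:  # DNS failures (socket.gaierror) are OSError subclasses
        return None


def count_pings_by_ip(pinged_urls, resolver=socket.gethostbyname):
    """Add one to the counter of the IP each pinged URL resolves to."""
    counts = Counter()
    for url in pinged_urls:
        ip = resolve_ip(url, resolver)
        if ip is not None:
            counts[ip] += 1
    return counts


def top_ips(counts, n=10):
    """List the n IP addresses with the most pings, busiest first."""
    return counts.most_common(n)
```

Feeding a day's worth of pinged URLs into `count_pings_by_ip` and then calling `top_ips` gives the kind of "top IP addresses by number of pings" listing mentioned above. A real version would probably cache lookups, since the same hostnames come up again and again.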

We built another script which showed all the URLs associated with each IP address. In order to decide which web server IP addresses are serving splogs, which we block, we open this script and manually have a look at a random sampling of the URLs - it’s usually pretty easy to tell if they’re splogs as they’re just full of advertising links or are quite random in their choice of subject matter. Any web server which has real blogs tends to stay off the blocklist (so that rules out blogspot.com even though people are using it for splogging).
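A minimal sketch of that second step, again in Python and again with hypothetical names (`urls_by_ip`, `sample_for_review`, `is_blocked`): group the recorded URLs by web server IP address, pull out a random handful for a human to look at, and check incoming pings against the resulting blocklist. The decision about what counts as a splog stays manual, as described above; the code only picks the sample.

```python
import random
from collections import defaultdict


def urls_by_ip(url_ip_pairs):
    """Group (url, ip) records into a mapping of ip -> set of URLs."""
    grouped = defaultdict(set)
    for url, ip in url_ip_pairs:
        grouped[ip].add(url)
    return grouped


def sample_for_review(grouped, ip, k=20, seed=None):
    """Pick up to k random URLs hosted on ip for a manual look."""
    urls = sorted(grouped.get(ip, ()))
    rng = random.Random(seed)
    return rng.sample(urls, min(k, len(urls)))


def is_blocked(ip, blocklist):
    """True if a ping resolving to ip should be dropped immediately."""
    return ip in blocklist
```

Keeping the blocklist as a plain set of IP address strings makes the per-ping check cheap, which matters at several hundred thousand pings a day.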

As soon as we’d blocked the first 112 IP addresses, the amount of traffic we were using parsing blogs dropped from 27GB per day (it was so high that it was costing us money in hosting charges) to 6GB/day. Of course, it began to creep up again soon after, so we’re realising it’s an ongoing effort and are beginning to think about blocking whole ranges of IP addresses.

Hope that’s of interest to everyone.

A2B blocklist makes it to Spam Huntress

Spam Huntress Ann Elisabeth has given our splog blocklist a mention on her blog. Thanks Ann Elisabeth! I left a comment explaining how we put the list together.

Ann Elisabeth also pointed out that we’d forgotten to put “nofollow” tags on the links to the splog URLs. We’ve now fixed it. Thanks again!