Posted by Sam on January 26th 2006 to techie
Since we introduced our splog blocklist and Spam Huntress posted about it, both i) the amount of traffic we waste and ii) the number of pings we are receiving have dropped. At the moment we’re not quite sure why the second one has happened, but we think that either one of our XML-RPC ping sources (pingomatic.com being the biggest) has started filtering as well, or else one of the big blog hosting sites has started blocking splogs. Anyway, it’s all good news.
We now have 370 web server IP addresses in our blocklist. Yesterday we received 474,394 pings, of which 161,094 (about 34%) were blocked immediately because the web server was on the blocklist. Of the rest, 179,467 were picked up by our incremental “delay buffer”, which means that if we get another ping for the same URL, we won’t go out and index it for up to 8 days.
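To make the two stages a bit more concrete, here is a minimal sketch of how a ping could be triaged: first against the blocklist, then against the delay buffer. It is not our production code. The class name `PingTriage`, the `handle` method and the one-day `STEP` by which the delay grows on each repeat ping are all assumptions made for illustration; only the 8-day ceiling comes from the description above.

```python
from datetime import datetime, timedelta

MAX_DELAY = timedelta(days=8)  # ceiling mentioned in the post
STEP = timedelta(days=1)       # assumption: delay grows one day per repeat ping


class PingTriage:
    """Hypothetical triage: blocklist check first, then an incremental delay buffer."""

    def __init__(self, blocklist):
        self.blocklist = set(blocklist)  # web server IP addresses we refuse
        self.delay = {}                  # url -> current delay for that URL
        self.not_before = {}             # url -> earliest time we would index it again

    def handle(self, url, server_ip, now):
        # Stage 1: drop the ping outright if the web server is blocklisted.
        if server_ip in self.blocklist:
            return "blocked"

        # Stage 2: a repeat ping inside the waiting window lengthens the delay.
        earliest = self.not_before.get(url)
        if earliest is not None and now < earliest:
            delay = min(self.delay[url] + STEP, MAX_DELAY)
            self.delay[url] = delay
            self.not_before[url] = now + delay
            return "delayed"

        # Otherwise index the page and start a fresh, short window.
        self.delay[url] = STEP
        self.not_before[url] = now + STEP
        return "index"
```

A ping for a blocklisted server never costs us any bandwidth, and a URL that keeps pinging us pushes its own next visit further away, up to the 8-day cap.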
Here’s a rough explanation of how our splog blocklist works. I already posted this text as a comment to a Spam Huntress post a while ago:
At A2B (see http://www.a2b.cc for more) we run a search engine with a ping interface. Bloggers who have geo-located META tags in their HTML (see http://www.a2b.cc/help-searching-addurl-blogping.a2b for more) can ping us and we’ll pick up (parse) their page and index it in our geosearch engine. We receive pings from many individual bloggers, a full ping feed from pingomatic.com, and bulk pings from several other sources, usually around 700,000 to 1 million pings per day. With approximately 200 IP addresses in the blocklist, about 37% of daily pings are blocked immediately.
To generate the list, we recorded the URL of each website (read: blog) we were pinged with and also resolved it to the IP address of the web server hosting that URL. We recorded the IP address and incremented a counter every time we received a ping for a URL on the same web server. We soon noticed that we were getting many thousands of pings for the same IP addresses, so we pulled a script together to list the top IP addresses by number of pings.
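The counting step can be sketched roughly like this. It is an illustration rather than the script we actually run: the function names `count_pings_by_ip` and `top_ips` are made up, and the `resolve` parameter is an assumption added so the DNS lookup can be swapped out.

```python
import socket
from collections import Counter
from urllib.parse import urlparse


def count_pings_by_ip(urls, resolve=socket.gethostbyname):
    """Count pings per web server IP address.

    `resolve` turns a host name into an IP address; it defaults to a
    real DNS lookup but can be replaced (for example in tests).
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # not a usable URL, skip it
        try:
            ip = resolve(host)
        except OSError:
            continue  # host did not resolve, skip it
        counts[ip] += 1
    return counts


def top_ips(counts, n=10):
    """Return the n IP addresses with the most pings, busiest first."""
    return counts.most_common(n)
```

Feeding a day's worth of pinged URLs through `count_pings_by_ip` and printing `top_ips(counts, 50)` gives the sort of list described above.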
We built another script which showed all the URLs associated with each IP address. In order to decide which web server IP addresses are serving splogs, which we block, we open this script and manually have a look at a random sampling of the URLs - it’s usually pretty easy to tell if they’re splogs as they’re just full of advertising links or are quite random in their choice of subject matter. Any web server which has real blogs tends to stay off the blocklist (so that rules out blogspot.com even though people are using it for splogging).
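A rough sketch of that second script, again with made-up names (`group_urls_by_ip`, `sample_urls`) and assuming the pings are already available as (IP address, URL) pairs. The judgement about whether a page is a splog stays manual; the code only picks which URLs to look at.

```python
import random
from collections import defaultdict


def group_urls_by_ip(pings):
    """Group pinged URLs by web server IP address.

    `pings` is an iterable of (ip, url) pairs; duplicate URLs collapse.
    """
    urls_by_ip = defaultdict(set)
    for ip, url in pings:
        urls_by_ip[ip].add(url)
    return urls_by_ip


def sample_urls(urls_by_ip, ip, k=5, rng=random):
    """Pick up to k random URLs served from `ip` for a manual look."""
    urls = sorted(urls_by_ip.get(ip, ()))
    return rng.sample(urls, min(k, len(urls)))
```

Sampling a handful of URLs per IP address keeps the manual review quick, even for servers hosting thousands of pages.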
As soon as we’d blocked the first 112 IP addresses, the amount of traffic we were using parsing blogs dropped from 27GB per day (it was so high that it was costing us money in hosting charges) to 6GB/day. Of course, it began to creep up again soon after, so we’re realising it’s an ongoing effort and are beginning to think about blocking whole ranges of IP addresses.
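We haven’t built range blocking yet, but as a sketch of what it might look like: Python’s standard `ipaddress` module can test whether an address falls inside a CIDR range. The helper names `build_ranges` and `is_blocked` and the example addresses are purely illustrative.

```python
import ipaddress


def build_ranges(cidrs):
    """Parse CIDR strings such as '192.0.2.0/24' into network objects."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_blocked(ip, blocked_ips, blocked_ranges):
    """True if `ip` is listed individually or falls inside a blocked range."""
    if ip in blocked_ips:
        return True
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in blocked_ranges)
```

For example, with `blocked_ranges = build_ranges(["192.0.2.0/24"])`, any server in that /24 is refused without needing an individual blocklist entry.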
Hope that’s of interest to everyone.