Fighting XML-RPC blog ping spam
As regular users of A2B (and readers of this blog) will know, we run an XML-RPC “ping” interface for bloggers. The idea is that bloggers post an entry to their blog and their blog software or service automatically notifies us of the update in the background. In our case, we then go out and parse the URL of the blog entry for geographical metadata. If we find it, we add the site to A2B so it’s searchable via our website and GPS software such as GPSCookie and Navio. Most of our pings come in via ping aggregator services such as Pingomatic, and we now receive between 700,000 and 1 million pings a day.
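For anyone who hasn’t seen a ping interface before, here’s a rough Python sketch of what one can look like. It’s an illustration rather than the code A2B actually runs: the method name weblogUpdates.ping and the flerror/message reply follow the common weblog ping convention, while pending_urls, ping, serve and the port number are names made up for this example.

```python
from urllib.parse import urlparse
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory queue of URLs waiting to be fetched and parsed
# for geographical metadata. A real service would use something durable.
pending_urls = []


def ping(site_name, site_url):
    """Handle a weblogUpdates.ping call: validate the URL and queue it."""
    parts = urlparse(site_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return {"flerror": True, "message": "Invalid URL"}
    pending_urls.append(site_url)
    return {"flerror": False, "message": "Thanks for the ping."}


def serve(host="localhost", port=8000):
    """Expose ping() over XML-RPC. Blocks until interrupted."""
    server = SimpleXMLRPCServer((host, port))
    server.register_function(ping, "weblogUpdates.ping")
    server.serve_forever()
```

Calling serve() starts the listener; the parsing of queued URLs would happen in a separate worker, which is where all the outbound traffic comes from.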
Recently the number of pings and the number of new URLs had been shooting up every day, presenting us with a problem. Because we go out and parse every new URL, we had been generating an average of about 27 GB of traffic a day (yes, that’s twenty-seven gigabytes). Time for some action.
A couple of months ago we’d already introduced a system where we wouldn’t parse a URL again for a while if the first pass didn’t find any geographical metadata. This reduced our traffic somewhat at first, but it was soon back up and climbing fast. We ran some analysis and found that a huge proportion of the URLs we were getting pings for were autogenerated spam blogs: endless variations on “Real Estate in Florida News Blog” and “Hair Loss Removal Therapy in Seattle Blog” and other typical spammer themes. We discovered that many of these URLs seemed to resolve to a smallish number of web servers set up specifically for hosting spam blogs. Time to block some IP addresses!
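The “don’t parse it again for a while” rule can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the one-week hold is an assumed value (the real interval isn’t given here), and not_before, should_parse and record_result are hypothetical names.

```python
import time

# Assumed hold period of one week; purely illustrative.
RETRY_AFTER_SECONDS = 7 * 24 * 3600

# url -> earliest time (epoch seconds) at which the URL may be parsed again
not_before = {}


def should_parse(url, now=None):
    """Return True if the URL is not currently on hold."""
    now = time.time() if now is None else now
    return now >= not_before.get(url, 0.0)


def record_result(url, had_geo_metadata, now=None):
    """After a parse, put URLs without geographical metadata on hold."""
    now = time.time() if now is None else now
    if had_geo_metadata:
        not_before.pop(url, None)
    else:
        not_before[url] = now + RETRY_AFTER_SECONDS
```

A rule like this only saves repeat fetches of the same URL. It does nothing against a spammer who pings a brand-new URL every time, which is why the traffic kept climbing.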
We came up with a blacklist of banned IP addresses, based on which web servers the largest numbers of received ping URLs resolved to, and wrote a script that blocks any ping whose URL resolves to one of those addresses. We banned 114 IP addresses, and our traffic dropped immediately from 27 GB per day to 6 GB, a cut of nearly 80%! Result! One in the eye for blog spammers! At least on A2B, anyway. You can check out which IP addresses we banned, how many pings we’re getting for each web server, and which URLs they’re sending us, here.
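The counting-and-blocking idea looks roughly like this in Python. Treat it as a sketch under assumptions rather than the script we actually run: the min_pings threshold is invented for the example (we picked our banned addresses by looking at the top of the list), and count_pings_by_ip, build_blacklist and is_blocked are hypothetical names. The resolve parameter exists so a fake resolver can be swapped in for testing.

```python
import socket
from collections import Counter
from urllib.parse import urlparse


def count_pings_by_ip(urls, resolve=socket.gethostbyname):
    """Resolve each pinged URL's hostname and count pings per server IP."""
    counts = Counter()
    cache = {}  # hostname -> IP (or None if it failed to resolve)
    for url in urls:
        host = urlparse(url).hostname
        if not host:
            continue
        if host not in cache:
            try:
                cache[host] = resolve(host)
            except OSError:  # includes socket.gaierror
                cache[host] = None
        ip = cache[host]
        if ip is not None:
            counts[ip] += 1
    return counts


def build_blacklist(counts, min_pings):
    """Return the set of IPs that received at least min_pings pings."""
    return {ip for ip, n in counts.items() if n >= min_pings}


def is_blocked(url, blacklist, resolve=socket.gethostbyname):
    """Return True if the URL's hostname resolves to a blacklisted IP."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        return resolve(host) in blacklist
    except OSError:
        return False
```

Blocking by IP rather than by domain is what makes this work against spam blogs: the spammers can register hostnames endlessly, but many of them point at the same handful of servers.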
Now we’ve got a list of the most popular web server addresses each day, and we’re adding to the blocked list all the time! But it’s a struggle, as the spammers are constantly changing IP addresses and their exact system is difficult to figure out.
Perhaps the most interesting thing is that we’ve had to ban the web server that Loïc Le Meur’s English blog sits on. He works for Six Apart, the company behind TypePad, and runs the Les Blogs conferences! The server holds only a few URLs in our database, and they all seem to be spam apart from his, yet we were getting a lot of pings resolving to it. Does he have friends who are blog spammers? I can’t imagine so, but lots of the URLs are typepad.com domain names.
****
Update: I have just found out that these spam blogs are called splogs. We’re now fighting splogs!
