The Spam Diaries

News and musings about the fight against spam.
 by Edward Falk

Monday, June 19, 2006

Google yanks millions of blog spam pages

According to Barry Schwartz of Search Engine Watch, Google has dropped about five billion pages from their index due to search engine spam — specifically blog spam. (Two years ago, I don't think they indexed that many total.)

The discussion starts at the Digital Point forums, where a member started a thread revealing and discussing a number of spam sites that had managed to acquire billions (yes, billions) of listings at Google.

(On Sunday, Adam Lasnik of Google chimed in and said that the absurd numbers were tied to a bad data push and did not actually reflect the number of pages the spammer had managed to get indexed.)

This is followed by a blogger's step-by-step explanation of how the spam works. Short explanation: search engines weight subdomain names very heavily as keywords, so the spammers build DNS servers that create effectively infinite numbers of subdomains, all of which redirect back to the main page. The web server there serves up canned content that corresponds to the keywords being searched for. By having unlimited subdomains, you get around Google's limits on pages indexed per site per day. The virtual subdomains all link to each other. Finally, you set the entire system in motion with massive blog comment spam intended to bring in the search engines.
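To make the "unlimited subdomains" trick concrete, here is a rough sketch of why a per-site limit stops working when every subdomain counts as its own site. Everything in it is an assumption made for illustration: the hostnames use the reserved .example domain, DAILY_LIMIT is a made-up number, and registered_domain() is a naive stand-in (a real crawler would consult the Public Suffix List). It says nothing about how Google actually counts pages.

```python
from collections import Counter

DAILY_LIMIT = 100  # hypothetical per-site cap, not a real Google figure


def registered_domain(hostname: str) -> str:
    """Naive guess at the registered domain: keep the last two labels.

    This is wrong for suffixes such as .co.uk; it is only good enough
    for the made-up hostnames below.
    """
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def pages_per_site(hostnames, site_key=lambda h: h):
    """Count pages per 'site', where site_key decides what a site is."""
    return Counter(site_key(h) for h in hostnames)


# Hypothetical crawl queue: one page on each keyword-stuffed subdomain,
# plus a single page from an ordinary site.
crawl_queue = [f"cheap-widgets-{i}.spam.example" for i in range(1000)]
crawl_queue.append("www.legit.example")

# Treating every hostname as its own site: nothing ever hits the cap.
per_host = pages_per_site(crawl_queue)
over_limit_by_host = [h for h, n in per_host.items() if n > DAILY_LIMIT]

# Grouping by registered domain: the spam domain stands out at once.
per_domain = pages_per_site(crawl_queue, site_key=registered_domain)
over_limit_by_domain = [d for d, n in per_domain.items() if n > DAILY_LIMIT]

print(over_limit_by_host)
print(over_limit_by_domain)
```

Counted by hostname, each of the thousand subdomains contributes exactly one page and stays under the cap; counted by registered domain, all thousand pages land on a single entry.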

This caught Google's attention, and the delisting began. I just tried it myself, and queries which were returning 5 billion results are now returning zero results.

Just a quick reminder: spam reports can be sent to Google at http://www.google.com/contact/spamreport.html

2 Comments:

Blogger my0p said...

nice to know that Google took action on this. I had no idea that BILLIONS of spam pages were indexed. Yayyee, the entries are gone.

8:01 PM  
Blogger Spam Diaries said...

It's likely that it was far fewer than billions. A Google rep informs us that there was some sort of database error that caused the search result page to misreport how many matches it had found.

11:16 PM  
