There are some HTTP clients accessing my web servers with strange request patterns. They include a header stating that they are a Googlebot, but its clear that they are fake googlebots.
I noticed some HTTP clients spoofing Google's search crawler user-agent, aka "Googlebot", like these:
404 www.neosensorz.com "GET /wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.014 404 www.neosensorz.com "GET /old/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.007 404 www.neosensorz.com "GET /cms/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.007 404 www.neosensorz.com "GET /blog/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.005 404 www.neosensorz.com "GET /blog_old/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.006 404 www.neosensorz.com "GET /blog-old/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.007 404 www.neosensorz.com "GET /blog/wp/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.006 404 www.neosensorz.com "GET /wp/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.006 404 www.neosensorz.com "GET /WP/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.004 404 www.neosensorz.com "GET /backup/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.006 404 www.neosensorz.com "GET /blog/backup/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.006 404 www.neosensorz.com "GET /wordpress/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.006 404 www.neosensorz.com "GET /Wordpress/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.005 404 www.neosensorz.com "GET /wordpress2/wp-login.php HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 67.19.102.242 0.008
I started to think about ways to automatically block them. I didn't figure that Google would provide a list of CIDR blocks, and based upon what I've read, it would appear that I was right, they don't. I read on their blog that they suggest using DNS to do a reverse lookup and see if it contains their domain name, like this:
$ dig -x 66.249.71.155 ; <<>> DiG 9.6.1-P1 <<>> -x 66.249.71.155 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63653 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: ;155.71.249.66.in-addr.arpa. IN PTR ;; ANSWER SECTION: 155.71.249.66.in-addr.arpa. 86400 IN PTR crawl-66-249-71-155.googlebot.com. ;; Query time: 157 msec ;; SERVER: 192.168.1.1#53(192.168.1.1) ;; WHEN: Tue Jun 1 02:49:14 2010 ;; MSG SIZE rcvd: 91
Since I use NGINX as the front-most web server, I could manually set non-Google ip addresses in a "map" block and check them when the user-agent contains the Googlebot string, something like this:
map $remote_addr $nogip { default 0; 67.19.102.242 1; } if ($http_user_agent ~ Googlebot) { set $noacc $nogip; } if ($http_user_agent !~ Googlebot) { set $noacc 0; } if ($noacc = 1) { return 503; }
I doubt I'll set this up though, its too specific!