Docunext


Server Based Comment Spam Protection

October 24th, 2007

I've been thinking about how to protect Wordpress, Trac, and other "community-style" web applications against comment spam, so I figured I should write up an article about my thoughts here at Docunext.com.

Spam Karma 2

Spam Karma is useful for Wordpress, but it uses up a lot of server resources. It appears to be Wordpress only, too.

Spamassassin

Spamassassin works great for emails, but I can't figure out how to get it to work for Trac (or Wordpress for that matter). I found a Perl client for connecting to spamd, but I couldn't find out what format the information should be in when it is sent to the daemon.

dnsbl

There are some promising modules in the works for Apache which appear to connect directly with dnsbl servers, but they aren't mature enough for what I'm looking for. Most are still in alpha stage.
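For reference, a dnsbl check is an ordinary DNS lookup: the octets of the client's IPv4 address are reversed and the list's zone is appended, and a listed address resolves (typically to something in 127.0.0.0/8) while an unlisted one does not. Here is a minimal sketch of building that query name in C. The function name dnsbl_query_name is hypothetical, the address is a documentation example, and the parsing is deliberately loose rather than a strict validator.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Build the DNS name used to query a DNSBL for an IPv4 address:
 * the octets are reversed and the list's zone is appended.
 * Returns 0 on success, -1 if the address does not look like
 * dotted-quad IPv4 or the output buffer is too small.
 * (Hypothetical helper for illustration, not taken from any module.)
 */
int dnsbl_query_name(const char *ipv4, const char *zone,
                     char *out, size_t out_len)
{
    unsigned int a, b, c, d;
    char trailing;
    int n;

    /* The trailing %c catches junk after the fourth octet. */
    if (sscanf(ipv4, "%u.%u.%u.%u%c", &a, &b, &c, &d, &trailing) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;

    /* Reverse the octet order, then append the zone. */
    n = snprintf(out, out_len, "%u.%u.%u.%u.%s", d, c, b, a, zone);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    return 0;
}
```

Called with "192.0.2.99" and "zen.spamhaus.org", this fills the buffer with 99.2.0.192.zen.spamhaus.org, which is the name a resolver would then be asked for.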

mod_rewrite?

mod_rewrite has a few features which could prove to be very helpful in defending against comment spam, two in particular: RewriteMap and the [F] (forbidden) flag. Here's a quick sample of how this might work:



RewriteMap badbots txt:bad_bots

RewriteEngine On

RewriteCond %{REQUEST_METHOD} POST

RewriteCond ${badbots:%{REMOTE_ADDR}} 1

RewriteRule . - [F]

Options ExecCGI FollowSymlinks

Allow from all

The nice thing about this RewriteRule is that it only targets POST requests. The problem is that it requires a map file listing IP addresses. That file would be difficult to keep updated on a regular basis, but not impossible. You could also use an external program, such as a Perl script, to query a dnsbl, but I'm not sure how well that works, especially if you have a whole bunch of scripts.
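To make the map file concrete: a txt: map holds one "key value" pair per line, with lines starting with # treated as comments, so the bad_bots file would list one IP address per line followed by the value the RewriteCond tests for (1 in the sample). A key that isn't in the map expands to an empty string, so the condition fails for unlisted addresses. The sketch below mimics that lookup in C purely to illustrate the file format. It is not how mod_rewrite implements it, and the function name and sample addresses are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up `key` in map text laid out as one "key value" pair per line,
 * where a line starting with '#' is a comment. On a match, copies the
 * value into `out` and returns 1; returns 0 if the key is not present.
 * Lines longer than the scratch buffer are skipped in this sketch.
 */
int txt_map_lookup(const char *map_text, const char *key,
                   char *out, size_t out_len)
{
    const char *line = map_text;

    while (*line != '\0') {
        char buf[160], k[64], v[64];
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len < sizeof buf) {
            memcpy(buf, line, len);
            buf[len] = '\0';
            if (buf[0] != '#' &&
                sscanf(buf, "%63s %63s", k, v) == 2 &&
                strcmp(k, key) == 0) {
                snprintf(out, out_len, "%s", v);
                return 1;
            }
        }
        /* Move to the start of the next line, or to the end of the text. */
        line = eol ? eol + 1 : line + len;
    }
    return 0;
}
```

With map text such as "# known comment spammers\n192.0.2.10 1\n198.51.100.7 1\n", a lookup for 198.51.100.7 yields "1" and a lookup for any other address finds nothing.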

Block List

Pass List

Another advantage of the access control above is that it is simple, quick, and effective, and could be used server-wide. It could easily be extended to support pass lists as well as authenticated bypasses.

Related:

http://www.spamcannibal.org/cannibal.cgi

http://unknowngenius.com/blog/wordpress/spam-karma/

http://spamassassin.apache.org/

Link / Quote of the Day

http://kerneltrap.org/Linux/Compiler_Optimization_Bugs_and_World_Domination

During the thread, Linus suggested that the optimization made by the compiler wasn't "legal", to which Alan Cox retorted, "pedant: valid. Almost all optimizations are legal, nobody has yet written laws about compilers. Sorry but I'm forever fixing misuse of the word 'illegal' in printks, docs and the like and it gets annoying after a bit." Linus playfully responded, "heh. When I'm ruler of the universe, it *will* be illegal. I'm just getting a bit ahead of myself." When asked how long until he expected to be ruler, Linus added, "I'm working on it, I'm working on it. I'm just as frustrated as you are. It turns out to be a non-trivial problem."

mod_defensible

UPDATE October 27, 2007: Found these promising modules:

* http://julien.danjou.info/mod_defensible.html
* http://www.steve.org.uk/Software/mod_ifier/index.html

UPDATE October 28, 2007: So far mod_defensible is working nicely. I had to enter an alternate DNS resolver as described on the mod_defensible homepage. Otherwise the server would hang, probably because it was unable to resolve the dnsbl. I'm really liking the idea of using mod_defensible as a first line of defense against comment spam for quick filtering.

UPDATE October 30, 2007: I'm now trying out mod_defensible on a production server. It will be interesting to find out if the volume of spam caught by Spam Karma 2 decreases.

UPDATE October 31, 2007: Overnight, mod_defensible caused Apache processes to never die, so the MaxClients limit was reached and the result was a denial of service (DoS). I've disabled mod_defensible and am emailing the author to ask if there is a timeout or something to alleviate this problem.

I wrote to the author Julien and he responded back quickly (thanks!) saying he's aware of the issue, but doesn't have time to work on it at the moment. I'm no good at C, but if you are, please help! :-)

Debugging now - I figured out that the Limit directive isn't in the scope of mod_defensible, so I'm also trying the location scope. Good, that seems to work better. To fend off the possibility of zombie processes, I think it's best to consider using a local dns server. But still that's not too good an idea in case the dns server crashes.

Interesting, apache won't become a zombie process if udns isn't used and the dnsbl is invalid. Perhaps that is the way to go. The question is whether the same or a similar effect to that of udns can be achieved by setting the system's resolv.conf to use a caching dns resolver, like dnsmasq.

Unfortunately the Limit directive is a little different from the Directory, Location, and Files directives, so I'm having to use mod_rewrite to achieve my goal. Here's what I've got going on:

RewriteEngine On

RewriteCond %{REQUEST_METHOD} POST

RewriteRule (.*) /__POST__$1 [PT]

DnsblUse On

#DnsblServers zen.spamhaus.org.

DnsblServers localhost.

Alias /__POST__/ /var/www/

While I was in the source code for mod_defensible, I tried setting a timeout for udns, but what I tried didn't work. On line 356 of mod_defensible.c:

        if(poll(&pfd, 1, dns_timeouts(0, -1, 0) * 1000))

I changed the -1 to 4, but it resulted in the same behavior. What strikes me as odd is that the first parameter is a 0, but in the reference for udns it is ctx, which I think is a reference to the udns object. Maybe if I change the 0 to the reference used throughout the rest of the code, the timeout will work?

if(poll(&pfd, 1, dns_timeouts(&dns_defctx, 4, 0) * 1000))

By the way, to reproduce the timeout error, set up a non-existent dns resolver and feed it a non-existent dnsbl.

DnsblUse On

DnsblNameserver 7.0.0.1

DnsblServers zeniy.spamhaus.org.

#DnsblServers localhost.

Nope, that didn't work either.
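One thing worth ruling out here: poll() blocks indefinitely when its timeout argument is negative, so if the value that gets multiplied by 1000 is ever negative, the wait has no upper bound no matter which context is passed in. Whether that is what actually happens with dns_timeouts() in this setup is an assumption, not something confirmed against udns. A minimal sketch of a guard, with a hypothetical helper name and cap, might look like this:

```c
/*
 * Convert a wait expressed in seconds into a poll() timeout in
 * milliseconds. The result is never negative (poll() treats a negative
 * timeout as "block indefinitely") and never larger than max_ms.
 * (Hypothetical helper; not part of mod_defensible or udns.)
 */
int bounded_poll_timeout_ms(int wait_seconds, int max_ms)
{
    if (max_ms < 0)
        max_ms = 0;
    if (wait_seconds < 0)                /* no usable wait value: use the cap */
        return max_ms;
    if (wait_seconds > max_ms / 1000)    /* would exceed the cap */
        return max_ms;
    return wait_seconds * 1000;
}
```

The idea would be to pass the result as the third argument to poll(), for example bounded_poll_timeout_ms(dns_timeouts(&dns_defctx, -1, 0), 4000), so that each individual wait is limited to four seconds.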

I'm working my way through the mod_defensible code, and I've got a few ideas. I think that the modules should have the option of setting an environment variable as opposed to strictly sending a 403 and error message. To achieve that, I changed the hook near the end of the file to:

    ap_hook_fixups(check_dnsbl_access, NULL, NULL, APR_HOOK_MIDDLE);

That seems to work OK. In trying to come up with a way to time out the udns lookup, I came up with a really cheesy method to try out:

#ifdef HAVE_LIBUDNS
    struct pollfd pfd;
    struct udns_cb_data **data_array_elts;
    int cheesy_timeout = 0;
    pfd.fd = dns_sock(0);
    pfd.events = POLLIN;
    data_array_elts = (struct udns_cb_data **) data_array->elts;
    /* While we have a queue active */
    while(dns_active(&dns_defctx) && cheesy_timeout < 99) {
        if(poll(&pfd, 1, dns_timeouts(0, -1, 0)) * 1000)
            dns_ioevent(0, 0);
        cheesy_timeout++;
    }
    dns_close(&dns_defctx);
    /* Check if one of the DNSBL server has blacklisted */
    for(i = 0; i < data_array->nelts; i++)
        if(data_array_elts[i]->blacklist)
        {
            /*
            r->status = 403;
            generate_page(r, data_array_elts[i]->dnsbl);
            */
            apr_table_setn(r->subprocess_env, "defensible","defensible");
            return OK;
        }
#endif

This is a very poor method, and might not work without the other changes I made, such as not using the denial page and instead setting an environment variable, which other access controls can then use to block, filter, or redirect various requests. Cool, huh?
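An iteration cap bounds the number of passes through the loop but not the time spent in it, since each pass can still wait as long as poll() allows. A time budget is usually easier to reason about. The sketch below shows only the waiting part, using POSIX poll() and clock_gettime(); the names wait_readable and budget_ms are hypothetical, and the udns calls from the loop above would still have to run between waits.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <poll.h>
#include <time.h>

/* Current monotonic clock reading in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait until fd is readable or budget_ms has elapsed in total,
 * whichever comes first. Returns 1 if readable, 0 on timeout,
 * -1 on error. (Hypothetical helper for illustration.)
 */
int wait_readable(int fd, int budget_ms)
{
    long long deadline = now_ms() + budget_ms;

    for (;;) {
        struct pollfd pfd;
        long long left = deadline - now_ms();
        int rc;

        if (left < 0)
            left = 0;               /* never hand poll() a negative timeout */
        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        rc = poll(&pfd, 1, (int)left);
        if (rc > 0)
            return (pfd.revents & POLLIN) ? 1 : -1;
        if (rc == 0)
            return 0;               /* time budget used up */
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: loop and wait for the remaining time. */
    }
}
```

Because the deadline is computed once up front, being interrupted and retried does not extend the total wait, which is the property the iteration counter lacks.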

I've still got some major cleanup to do, but thanks to the great work by Julien on this, I was actually able to learn a ton about Apache modules, and maybe even make one a little better! :-)

I just sent Julien this email:

I actually did some stuff. :-)

I changed the result of a dnsbl positive to set an environment variable "defensible". A better solution would have the env variable name set in the configuration file, but I don't really know what I'm doing. That way, the user can test for the env and handle it in a variety of ways, and even use a custom error file. To do this I had to change the hook to fixups.

As for the timeout, I came up with a very cheesy solution: limiting the while loop to 150 iterations! I've set it as low as 9 and it was able to resolve without timing out, but I was using a local resolver. This should also be configurable, but I don't know how to do that yet. I tried a bunch of udns functions but couldn't get a handle on it. I just happened to have an Apache book handy so that's why I stuck to the Apache API.

Obviously the timeout is not the best, but since I'm using this mainly to ward off comment spam, it isn't a big deal if there is a timeout. I may even just stick with the system dns resolver as I was also able to figure out a way to use the dnsbl for only POST requests.

Thanks again, I'm curious to learn what you think of the patch.

And the patch:

Hmm, I worked on this for several more hours tonight and I don't think that ap_hook_fixups is the appropriate hook for this, but instead think the original one was correct. I was getting some very odd behavior when accessing directories versus files, and when I switched it back to ap_hook_access_checker, it worked how I expected. There is a lot going on with how the modules talk to each other, and how they are organized in order. Very interesting. It turns out that the first way I got it to work wasn't half bad. It still has a lot of room for improvement though!

Note - the reason why I changed the hook in the first place was because I thought it was necessary to set an environment variable, but as it turns out it isn't.
