bogofilter and why HAM is so important
Just some hours ago I noticed one of my mail servers (FreeBSD 6.x-stable + Sendmail + bogofilter + bogom) became a big black hole for spam. It seems (as reported from users) that a lot of spam was getting into the users' mailboxes.
I checked and re-checked everything but everything seemed to be ok. Sendmail was running ok, bogom was up too... strange. Then I noticed one thing in my logs:
Aug 18 16:12:51 prunus sm-mta[47093]: m7IECjAn047093: Milter insert (0): header: X-Bogosity: Ham, spamicity=0.000046 Aug 18 16:13:04 prunus sm-mta[47095]: m7IECrMe047095: Milter insert (0): header: X-Bogosity: Ham, spamicity=0.000002 Aug 18 16:13:05 prunus sm-mta[47106]: m7IECxi0047106: Milter insert (0): header: X-Bogosity: Ham, spamicity=0.000002 Aug 18 16:13:06 prunus sm-mta[47107]: m7IED0IH047107: Milter insert (0): header: X-Bogosity: Ham, spamicity=0.000000 Aug 18 16:13:18 prunus sm-mta[47123]: m7IEDBtR047123: Milter insert (0): header: X-Bogosity: Unsure, spamicity=0.520000 Aug 18 16:13:18 prunus sm-mta[47128]: m7IEDCt8047128: Milter insert (0): header: X-Bogosity: Ham, spamicity=0.000440 Aug 18 16:13:23 prunus sm-mta[47108]: m7IED6Gu047108: Milter insert (0): header: X-Bogosity: Unsure, spamicity=0.520000
Everything was being marked as HAM/UNSURE! That was the reason, time to find out what was causing it.
bogoutil helped me a lot finding the problem, as with it I was able to get the ammount of both spam and ham messages known by the filter, just issuing something like:
sudo bogoutil -w /var/spool/bogofilter/wordlist.db .MSG_COUNT
In the failing server I got:
spam good
.MSG_COUNT 3457 0
While in another server I got:
spam good
.MSG_COUNT 282 3337
As this last one didn't experience the same problem as the failing one, and with my mind suddenly remembering Juanjo's recommendations, I realized myself that the problem was, in fact, that bogofilter had no idea about what's ham for that server (if you are interested in how bogofilter works, take a look at the theory of operation manpage section).
Just getting some mails from one of the users' INBOX and adding them as ham to bogofilter
sudo bogofilter -n < user_INBOX_dump
helped to get things back to work. Just in case, I checked the list of spam/ham counts again:
spam good
.MSG_COUNT 3457 52
and the sendmail logs too:
Aug 18 17:53:42 prunus sm-mta[52981]: m7IFrYvj052981: Milter insert (0): header: X-Bogosity: Spam, spamicity=1.000000 Aug 18 17:53:51 prunus sm-mta[52988]: m7IFrjYr052988: Milter insert (0): header: X-Bogosity: Spam, spamicity=1.000000 Aug 18 17:53:54 prunus sm-mta[52990]: m7IFrlKV052990: Milter insert (0): header: X-Bogosity: Spam, spamicity=0.999887 Aug 18 17:53:57 prunus sm-mta[52989]: m7IFro2f052989: Milter insert (0): header: X-Bogosity: Spam, spamicity=1.000000 Aug 18 17:54:02 prunus sm-mta[53003]: m7IFrsDG053003: Milter insert (0): header: X-Bogosity: Spam, spamicity=1.000000 Aug 18 17:54:46 prunus sm-mta[53024]: m7IFscv1053024: Milter insert (0): header: X-Bogosity: Spam, spamicity=1.000000 Aug 18 17:54:58 prunus sm-mta[53031]: m7IFsqGG053031: Milter insert (0): header: X-Bogosity: Spam, spamicity=1.000000
Of course I've no idea what happened with the HAM that was supposed to be in the wordlist.db file, now's time to do some phorensics and try to find out what happened here (but hey, the problem was solved at least!)
UPDATE: Found it! This morning, just after arriving at the office, I took a look over the MSG_COUNT value on that faulty server, just to find:
spam good
.MSG_COUNT 3460 49
It was decreased since yesterday, so it had to be related to my script that feeds bogofilter every night. Just taking a look at the script I realized where was the problem. The script was calling bogofilter to add spam messages this way:
bogofilter -Ns -B inbox_dump
And, from the man page of bogofilter:
The -N option tells bogofilter to undo a prior registration of the same message as non-spam. If a message was incorrectly entered as non-spam by -n or -u and you want to remove it and enter it as spam, then use -Ns. If -N is used for a message that wasn't registered as non-spam, the counts will still be decremented.
which means my daily feeding script was removing tokens from the ham list. As my script runs everyday (through a cron job) and the spam added everyday is a lot of spam, it was removing such tokens too much quick. Removing the -N option when feeding the beast will solve the problem. (Or I hope so)