Monday, January 28, 2008

the mystery of the borken server

Summary

The MailScanner processes on one of my servers hang, and the hang gets worse as the number of children is increased. Setting a very low number of children helps, but it does not solve the problem.


Background


Server Hardware (Dell)
  • CPU: AMD Dual Core Opteron (2210)
  • RAM: 2GB
  • 2 x 160GB SATA (configured with software RAID 1)
(Key) Server Software


The Problem

The customers (actually, they are the customers of my customer) are fairly new, less than a year.

After a recent upgrade, the customers noticed a slowdown in the performance of the email server. Outgoing emails take a long time to be sent after they hit the "Send" button in the email client. Sometimes it takes up to 5 minutes.

So, the parties involved are:
us <-> customers <-> end-customers


Some Context Information

The end-customers are actually located in another country, but the email server is hosted and administered locally.

Traffic from the end-customers to here is routed overseas (which could contribute to instability at times).

The number of customers is not high, but the network is critical to their international operations.


The Conjectures/Guesses/Hypotheses

  1. The customers' network is unstable or congested, causing upstream traffic to be slow (retrieving emails is fine though). Or their bandwidth is asymmetrical, with upload speed a fraction of the download speed.
  2. Data center network is unstable or does not have peering with customer's network provider, resulting in traffic being routed here indirectly.
  3. Server is under DDoS / spammer attack.
  4. Customer's network has p2p applications running, thereby causing bottlenecks in their internal networks. Or they are hosting web applications in-house, causing their outgoing traffic to be swamped.


Initial Observations

After logging on to the server in the dead of the night (with only some cats and cars passing on the street outside), I noticed that
  • the server load is high, with a load average above 3 (from uptime and top)
  • the email traffic is almost non-existent
  • only 1 user was accessing the server, as evidenced by the paucity of "pop3-login"s in /var/log/maillog
  • MailScanner --lint did not give any errors or warnings
The server did not seem to be under attack (after checking with netstat and lsof), although spam was coming in, at least 1 every 2-3 minutes.

I looked at the MailScanner process and found that it was using 100% of the CPU. Running ps on it showed that the processes were hanging at "starting children". Stopping the processes is very slow: the master process dies before the children do, and the children take ages to die (more than 1 minute, up to 4 minutes, at which point I ran out of patience). Starting again is the same: the processes hang at the "starting children" stage for a long time, with the load average exceeding 3. By the time the MailScanner process had started properly, the CPU time consumed was already more than 3:00.00 (as shown in top). My first guess was 3 hours, but top's TIME+ column reads minutes:seconds.hundredths, so that's about 3 minutes of pure CPU just to start up. WOW!!! :O
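To sanity-check a CPU time reading like the one above, here is a minimal Python sketch that converts a top TIME+ value into seconds. It assumes the usual minutes:seconds.hundredths layout; the function name is made up for illustration, and very large values that top scales into other units are not handled.

```python
def top_time_to_seconds(value: str) -> float:
    """Convert a top TIME+ reading such as '3:00.00' into seconds.

    Assumes the minutes:seconds.hundredths layout; other layouts
    that top may switch to for very large values are not handled.
    """
    minutes, sep, seconds = value.strip().partition(":")
    if not sep or not seconds:
        raise ValueError(f"unexpected TIME+ value: {value!r}")
    return int(minutes) * 60 + float(seconds)


if __name__ == "__main__":
    secs = top_time_to_seconds("3:00.00")
    print(f"{secs:.0f} seconds, i.e. {secs / 60:.1f} minutes of CPU time")
```

Read this way, 3:00.00 is 180 seconds of CPU time, which fits a start-up that keeps one core busy for a few minutes.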


The Constraints

  1. Obviously, I can't just take the email server offline and play with it.
  2. The actual problem is not obvious, and going through the source code to debug it is tough, if not impossible.
  3. MailScanner is a huge piece of software, and it's not easy to find out where the process is hanging (unless Linux has something like DTrace for Solaris and assuming I know how to use it).


The Experiment

The factors which I feel are likely to affect MailScanner load and processes are listed below:
  1. MailScanner, Max Children = X
  2. MailScanner, Virus Scanning = yes|no
  3. MailScanner, Use SpamAssassin = yes|no
  4. MailScanner, spam.whitelist.rules (turn off spam checking for certain domains)
Between midnight and 1am, I wasn't too awake (besides, I had been coding away the whole day), so I couldn't come up with more...

1st set of Experiments

I tried out the easiest combinations by first setting #2 and #3 to "no", and then played around with #1 from 2 to 5. Nope, the only observation was that as the number of children increased, MailScanner took an (almost) exponentially longer time to start. Actually, I couldn't be bothered to wait and time it, I just ran "killall MailScanner".

2nd set of Experiments

I kept the number of children, #1, constant and tested with #2 and #3 switched on and off alternately. That didn't help either. It seems that the problem is tied to the number of children being started.
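For the record, the first two sets of runs can be written down as a small grid. This is only a sketch in Python: the keys mirror the MailScanner settings listed above, the helper names are made up, and it assumes the second set covered all four on/off pairings of #2 and #3 (the fixed child count passed in is a placeholder, not the value actually used).

```python
from itertools import product


def first_set(children_range=range(2, 6)):
    """Virus scanning and SpamAssassin off; vary Max Children from 2 to 5."""
    return [
        {"Max Children": n, "Virus Scanning": "no", "Use SpamAssassin": "no"}
        for n in children_range
    ]


def second_set(children):
    """Hold Max Children fixed; try every yes/no pairing of the two scanners."""
    return [
        {"Max Children": children, "Virus Scanning": v, "Use SpamAssassin": s}
        for v, s in product(("yes", "no"), repeat=2)
    ]


if __name__ == "__main__":
    # children=2 is a placeholder value for illustration only
    for run in first_set() + second_set(children=2):
        print(run)
```

Writing the runs out like this makes it easier to see that only the first factor ever changed the start-up time.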

3rd Experiment

I reinstalled MailScanner, but that didn't help either.

MailScanner depends on a lot of Perl modules. The recent server upgrade might have installed/broken something. Or it could be that the CPAN-based modules (perl -MCPAN -e shell) that I installed previously are affecting MailScanner.

One of the questions that kept bugging me is, where do CPAN-installed Perl modules go, and where do RPM-installed Perl modules go? Which one does Perl use if both exist?
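What I do know is that Perl searches the directories in @INC in order and loads the first matching file, so the answer depends on which directory comes first on this particular box. Typically CPAN installs land in a site_perl directory and RPM-packaged modules in a vendor_perl directory, but that depends on how Perl was built, so treat it as an assumption to verify. Below is a minimal Python sketch that, given the @INC directories, lists every copy of a module in search order. The function name and the module name in the usage note are hypothetical.

```python
import os
from typing import List


def find_module_copies(inc_dirs: List[str], module: str) -> List[str]:
    """Return every copy of a Perl module found under inc_dirs, in search order.

    module is a Perl package name such as 'Foo::Bar' (a made-up name);
    Perl maps it to the relative path Foo/Bar.pm. The first path in the
    result is the copy Perl would load for that search order.
    """
    relative = os.path.join(*module.split("::")) + ".pm"
    hits = []
    for directory in inc_dirs:
        candidate = os.path.join(directory, relative)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits
```

The directory list can be taken from the output of perl -e 'print join("\n", @INC)' and passed in as inc_dirs, for example find_module_copies(dirs, "Foo::Bar"). More than one hit means a CPAN copy and an RPM copy may be shadowing each other.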


[nothing works... *sob* 2:15am... and it's all not working... so... gotta think of something fast before end-customers get online and it's DOWN, then I'll really have early morning calls with people screaming and shouting into my ear]

As a last resort, I configured the server with
  • Max Children = 2 (this still takes a couple of minutes to start)
  • Use SpamAssassin = off (but Spam List = spamhaus-ZEN is retained)
  • add the customers' domain to spam.whitelist.rules (so that outgoing emails will not be checked, which will hopefully increase the speed at which emails are relayed)
  • Restart Every = 28800 (restart every 8 hours), because killing and respawning the child processes triggers the hang; this could be lengthened to 12 hours, since I have 1GB of RAM free
So far, so good... it's been 15 hours since...

5 hours of sleep sucks...

To really troubleshoot the problem, I installed MailScanner on a VirtualPC with plain CentOS 4.6 (no GUI, no other services except sendmail). BlueQuartz crashes when installing into a VirtualPC environment, so I can't test it there.

And... everything works fine in the VirtualPC!!!

ARGHH... maybe I really have to remove all the CPAN-installed modules, remove all the RPM-installed modules and stick with the ones installed by MailScanner. *sigh* Will update if this really works, what else can I do? :D

2 comments:

maxsec said...

check you've got the latest clam-module (you need this to make it work with clamav 0.92).

check the perl modules are OK (MailScanner -v) and "MailScanner --debug" doesn't show any errors.

(also ask on the mailing list /irc with these outputs as well).

Unknown said...

Try "MailScanner --debug" and let me know what it outputs. Does it produce any errors? Have you upgraded any Perl modules without realising it?

If you are still stuck, contact me at mailscanner@ecs.soton.ac.uk.