Wednesday, January 30, 2008

the mystery of the borken server, SOLVED

Acknowledgements

Thanks to maxsec (from the MailScanner IRC channel) and Jules (the creator of MailScanner)!


Summary

The problem was two-fold:

  1. I did not notice that Mail::ClamAV and Mail::SpamAssassin packages were not installed properly when running the install script provided in install-Clam-0.92-SA-3.2.4.tar.gz (error information below)
  2. My system had /tmp mounted as "noexec" (is this a default BlueQuartz setting, or did I change this when the system was hardened previously?)
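
For next time, here is a minimal sketch of how the "noexec" part could be checked quickly. It assumes a Linux box where the mount table is readable at /proc/mounts; the function names are my own and not part of MailScanner or BlueQuartz.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not a separate mount."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            # If the same mount point appears twice, keep the last (topmost) entry.
            options = fields[3].split(",")
    return options


def is_noexec(mounts_text, mount_point="/tmp"):
    """True if mount_point is a separate mount carrying the noexec option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noexec" in options
```

On the server itself this would be called as is_noexec(open("/proc/mounts").read()). If /tmp is not a separate mount, the function returns False, and the options of the parent filesystem would need to be checked instead.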


MailScanner diagnosis procedure

  1. After installing MailScanner, run MailScanner --lint and check for any errors it reports.
  2. If there is any issue, run MailScanner -v to see the versions of the installed modules, and make sure that they are correct.
  3. If that is not conclusive, run MailScanner --debug or MailScanner --debug --debug-sa (if you have SpamAssassin).
  4. If problem persists, Google it and search the MS Mailing List Archive (it is active).
  5. If there is still nothing conclusive, go to the IRC channel and ask for help.
  6. Subscribe and post the problem in the MS Mailing List too.
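
The first three steps can be strung together in a small script. This is only a sketch: the flags are the ones listed above, but the function names and the idea of collecting exit codes are my own assumptions, and I have not checked what exit codes MailScanner actually returns.

```python
import subprocess


def diagnostic_commands(has_spamassassin=True):
    """Return the MailScanner checks from steps 1-3, in the order to try them."""
    commands = [
        ["MailScanner", "--lint"],  # step 1: look for reported errors
        ["MailScanner", "-v"],      # step 2: versions of the installed modules
    ]
    debug = ["MailScanner", "--debug"]  # step 3: debug run
    if has_spamassassin:
        debug.append("--debug-sa")
    commands.append(debug)
    return commands


def run_checks(commands, runner=subprocess.run):
    """Run each command in turn and return (argv, exit code) pairs."""
    results = []
    for argv in commands:
        completed = runner(argv)
        results.append((argv, completed.returncode))
    return results
```

The output of each command still has to be read by a human; the script only saves retyping the commands in the right order.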


Error Information

An error was thrown during the installation of Mail::SpamAssassin when I ran the install script in
install-Clam-0.92-SA-3.2.4.tar.gz. (Note to self: maximize the PuTTY window next time.)


Setting a soft-link from spam.assassin.prefs.conf into the SpamAssassin
site rules directory.
spam.assassin.prefs.conf is read directly by the SpamAssassin startup
code, so make sure you have a link from the site_rules directory to
this file in your MailScanner/etc directory.
Perl could not find your SpamAssassin installation.
Strange, I just installed it.
You should fix this!

Making backup of pre files to /tmp/backup.pre.3457.tar
tar: *pre: Cannot stat: No such file or directory
tar: Error exit delayed from previous errors
Now go and find your v310.pre and v320.pre files,
echo which may well be in the /etc/mail/spamassassin directory.
You need to save a copy of your old v320.pre file and rename
the v320.pre file to v320.pre.


Moving on

*sigh* :)

Now I have to keep reminding myself to be extra careful when updating this server in the future. I'm not sure why the other servers are fine. Maybe installing SpamAssassin manually from source helped there, but I didn't do that on this server because of its custom configuration.

For future updates of MailScanner, I will need to:
  1. Download and unpack the new MS package / installer.
  2. Go into the perl-tar directories and list all the Perl modules.
  3. Open up CPAN (perl -MCPAN -e shell) and compare the versions of the installed modules against those bundled with the MS package / installer.
  4. If the versions do not match, unpack the tarballs that came with the MS package / installer and manually update the modules via the usual perl Makefile.PL -> make -> make test -> make install as root.
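
Steps 2 and 3 could be partly automated. The sketch below assumes the bundled tarballs follow the usual CPAN naming convention (Dist-Name-1.23.tar.gz) and that the module name is simply the distribution name with "-" replaced by "::", which is not true for every distribution. The function names are hypothetical. It only flags versions that differ or are missing; it does not try to decide which version is newer, since Perl version ordering is trickier than a string compare.

```python
import re

# Dist-Name-1.23.tar.gz  ->  name "Dist-Name", version "1.23"
DIST_PATTERN = re.compile(r"^(?P<name>.+)-(?P<version>v?\d[\d._]*)\.(?:tar\.gz|tgz)$")


def parse_dist(filename):
    """Split a CPAN-style tarball name into (module name, version), or return None."""
    match = DIST_PATTERN.match(filename)
    if match is None:
        return None
    # Assumption: module name is the distribution name with "-" turned into "::".
    module = match.group("name").replace("-", "::")
    return module, match.group("version")


def find_mismatches(bundled_files, installed_versions):
    """List (module, bundled version, installed version) where the two differ.

    installed_versions maps module name -> version string; a module that is
    absent from the mapping is reported with an installed version of None.
    """
    report = []
    for filename in sorted(bundled_files):
        parsed = parse_dist(filename)
        if parsed is None:
            continue  # not a recognisable distribution tarball
        module, bundled = parsed
        installed = installed_versions.get(module)
        if installed != bundled:
            report.append((module, bundled, installed))
    return report
```

One way to fill in installed_versions is to ask Perl directly, e.g. perl -MMail::SpamAssassin -e 'print $Mail::SpamAssassin::VERSION', which fails loudly if the module is not installed at all.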
Ok, that's it for now!

Thanks to the advice and help from the people in MS's IRC channel, and especially to maxsec and Jules!

Monday, January 28, 2008

the mystery of the borken server

Summary

The MailScanner processes on one of my servers hang, and it gets worse as the number of children is increased. Setting a very low number of children helps, but the problem is not solved.


Background


Server Hardware (Dell)
  • CPU: AMD Dual Core Opteron (2210)
  • RAM: 2GB
  • 2 x 160GB SATA (configured with software RAID 1)
(Key) Server Software


The Problem

The customers (actually it is the customer of my customer) are fairly new, less than a year.

After a recent upgrade, the customers noticed a slowdown in the performance of the email server. Outgoing emails take a long time to be sent after they hit the "Send" button in the email client. Sometimes it takes up to 5 minutes.

So, the parties involved are:
us <-> customers <-> end-customers


Some Context Information

The end-customers are actually located in another country, but the email server is hosted and administered locally.

The network from the end-customers to here is routed overseas (which could contribute to instability at times).

Number of customers is not high, but the network is critical to their international operations.


The Conjecture/Guesses/Hypothesis

  1. Network is unstable or packed, causing upstream traffic to be slow (retrieving emails is fine though). Or their bandwidth is asymmetrical, with upload speed a fraction of the download speed.
  2. Data center network is unstable or does not have peering with customer's network provider, resulting in traffic being routed here indirectly.
  3. Server is under DDoS / spammer attack.
  4. Customer's network has p2p applications running, thereby causing bottlenecks in their internal networks. Or they are hosting web applications in-house, causing their outgoing traffic to be swamped.


Initial Observations

After logging on to the server in the dead of the night (with only some cats and cars passing on the street outside), I noticed that
  • the server load is high, with a load average of >3 (using uptime and top)
  • the email traffic is almost non-existent
  • only 1 user was accessing the server, as evidenced by the paucity of "pop3-login"s in /var/log/maillog
  • MailScanner --lint did not give any errors or warnings
It doesn't seem like the server was under attack (after checking with netstat and lsof), though there was spam coming in, at least 1 every 2-3 minutes.

I looked at the MailScanner process and found that it was using the CPU at 100%. Doing a ps on it shows that the processes are hanging at "starting children". Stopping the processes is very slow: the master process dies before the children die, and it takes ages for the children to die (>1 minute, up to 4 minutes, at which point I ran out of patience). Starting them again is the same: the processes hang at the "starting children" stage for a long time, with the load average exceeding 3. By the time the MailScanner process had started properly, the CPU time consumed was already more than 3:00.00 (as shown in top). My first guess was 3 hours, but top's TIME+ column reads minutes:seconds.hundredths, so that is 3 minutes of CPU time just to start up. Still, WOW!!! :O
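
A note on reading that CPU time figure: top's TIME+ column is formatted as minutes:seconds.hundredths, so a value like 3:00.00 is minutes rather than hours. A tiny sketch of the conversion (the function name is my own, and it only handles the plain minutes:seconds.hundredths form):

```python
def cpu_time_seconds(time_plus):
    """Convert a top TIME+ value (minutes:seconds.hundredths) into seconds."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) * 60 + float(seconds)
```

So cpu_time_seconds("3:00.00") comes to 180 seconds of CPU time, which is still a lot for a daemon that is only starting its children.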


The Constraints

  1. Obviously, I can't just take the email server offline and play with it.
  2. The actual problem is not obvious and really going through the source code and debugging is tough, if not impossible.
  3. MailScanner is a huge piece of software, and it's not easy to find out where the process is hanging (unless Linux has something like DTrace for Solaris, and assuming I know how to use it).


The Experiment

The factors which I feel are likely to affect MailScanner load and processes are listed below:
  1. MailScanner, Max Children = X
  2. MailScanner, Virus Scanning = yes|no
  3. MailScanner, Use SpamAssassin = yes|no
  4. MailScanner, spam.whitelist.rules (turn off spam checking for certain domains)
At 12-1am, I wasn't too awake (besides, I had been coding away for the whole day), so I couldn't come up with more...

1st set of Experiments

I tried out the easiest combinations by first setting #2 and #3 to "no", and then played around with #1 from 2 to 5. Nope, the only observation was that as the number of children increases, MailScanner takes an (almost) exponentially longer time to start. Actually, I couldn't be bothered to wait and time it, I just ran "killall MailScanner".

2nd set of Experiments

I tried to keep the number of children, #1, constant and tested with #2 and #3 on and off alternately. Didn't help either. It seems that the problem is tied to the number of children being started.

3rd Experiment

I reinstalled MailScanner. But that didn't work either.

MailScanner depends on a lot of Perl modules. The recent server upgrade might have installed/broken something. Or it could be that the CPAN-based modules (perl -MCPAN -e shell) that I installed previously are affecting MailScanner.

One of the questions that kept bugging me is: where do CPAN-installed Perl modules go, and where do RPM-installed Perl modules go? Which one does Perl use if both exist?
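
As far as I understand it, Perl walks the directories in @INC in order and loads the first matching .pm file it finds (perl -e 'print join("\n", @INC)' lists them). On Red Hat style systems, CPAN installs typically land under site_perl and RPM-packaged modules under vendor_perl, with site_perl usually searched first, but that depends on how Perl was built. The sketch below mimics that lookup so duplicate copies can be spotted; the function names are my own and the directory list has to be supplied by hand.

```python
import os


def module_relative_path(module):
    """Map a module name like Mail::SpamAssassin to Mail/SpamAssassin.pm."""
    return os.path.join(*module.split("::")) + ".pm"


def find_module_copies(inc_dirs, module):
    """Return every copy of module found under inc_dirs, in search order.

    The first entry is the one a first-match lookup would load; any further
    entries are shadowed copies that may be a different version.
    """
    relative = module_relative_path(module)
    copies = []
    for directory in inc_dirs:
        candidate = os.path.join(directory, relative)
        if os.path.isfile(candidate):
            copies.append(candidate)
    return copies
```

If this returns more than one path for a module such as Mail::SpamAssassin, that is a hint that a CPAN copy and an RPM copy are both present and only the first one is actually being used.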


[nothing works... *sob* 2:15am... and it's all not working... so... gotta think of something fast before end-customers get online and it's DOWN, then I'll really have early morning calls with people screaming and shouting into my ear]

As a last resort, I configured the server with
  • Max Children = 2 (this still takes a couple of minutes to start)
  • Use SpamAssassin = off (but Spam List = spamhaus-ZEN is retained)
  • insert customers' domain into spam.whitelist.rules (so that outgoing emails will not be checked, and hence, this will hopefully increase the speed at which emails are relayed)
  • Restart Every = 28800 (restart every 8 hours), since the killing and respawning of child processes is what causes the hang; this could be lengthened to 12 hours too, since I have 1GB of RAM free
So far, so good... it's been 15 hours since...

5 hours of sleep sucks...

To really troubleshoot the problem? I installed MailScanner on a VirtualPC with CentOS 4.6 plain (no GUI, no other services except sendmail). BlueQuartz crashes when installing into a VirtualPC environment so I can't test it.

And... everything works fine in the VirtualPC!!!

ARGHH... maybe I really have to remove all the CPAN-installed modules, remove all the RPM-installed modules and stick with the ones installed by MailScanner. *sigh* Will update if this really works, what else can I do? :D

system administration, you think you got what it takes?

Kudos for other system administrators out there!

Since one of my roles is system administrator, I would like to give an acknowledgment to fellow admins out there as I start this blog! :)

I picked up my passion for system administration 13 years ago, when one of my seniors conducted a talk on Linux and helped us make copies of Slackware onto a dozen floppy disks so we could play with it at home. Since then, I have also tried Gentoo, RedHat, CentOS (which is my favourite today), FreeBSD, Solaris 8 - 10 (never admin-ed live sites on Solaris yet), etc.

System administration is a tough job (you know, I know, but users don't). You have to know the hardware, software and network you are handling, and take everything into account holistically. You have to know your users, how they access the server(s), what kind of environment they have, what kind of businesses they are running (dubious email marketers are the ones to avoid), their IT literacy level, etc. Troubleshooting requires a lot of elimination and testing while users are screaming for their services to be up NOW!!!

The job demands a "big picture" understanding and a sharp eye for minor details. It's like being a car mechanic: when there is an extra rattle or squeak, you know something has got to give soon. Typically, you will have some preventive measures in place, but when it comes to the crunch, it all boils down to a quick identification of the problem and an even quicker fix. Who cares, as long as it works, right? ;) Actually, a quick fix doesn't work, and in the end we are the ones who will have to solve the actual problem anyway.

As I start this blog, my intention is to use it to take notes on the problems I have and the way I go about solving them; maybe it will be of use to others should they stumble upon this blog.

Oh well, here goes...

Why is the Internet borken?

There are a few reasons for this...

I had a phone call near midnight to troubleshoot an under-performing server. It's not the first time that I have received urgent requests ("URGENT!" "HELP!") for help, but I really hate the early morning calls or the late night ones just when I'm about to sleep, and especially the calls that come in when I'm SLEEPING. I'm not paid to be on standby 24x7, and my work / agreement clauses did not stipulate 24x7 working hours.

isBorken is influenced by my geek.. err... programming background and LOLCat speak -ICHC

:)

will update this as and when i have the inspiration :D