Archive for March, 2009

sendmail “Name server timeout”

Wed March 18, 2009

So, as it never rains but pours, after fixing the NAS problem detailed below, everything worked fine for a couple of days. Then … suddenly my email notifications from the backup process stopped. I logged into the machine in question and ran sendmail from the command line but … it just kind-of sat there for a while, and then appeared to time out.

So … I checked in /var/spool/mail/[username] and saw that the messages were not being sent; the reason appears to be the following line:

   451 xxxxxx.xxx: Name server timeout

(‘x’s added by me, for privacy!)

Some Googling (mainly for: sendmail “Name server timeout”) led me to this page. There are probably pages that better describe the problem (it’s something to do with how sendmail deals with incorrect IPv6 AAAA records), but the solution is stated on this page, clearly and concisely. I will repeat the fix here (in slightly more detail), in case the above link disappears:

  1. (as root) cd /etc/mail
  2. Edit sendmail.mc and add the following at the end of the file:
     define(`confBIND_OPTS', `WorkAroundBrokenAAAA')dnl
  3. type make
  4. restart sendmail (/sbin/service sendmail restart)
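The edit in the steps above can be sketched as a small shell function. This is only a sketch, not a tested procedure: the function name add_bind_opts and the .bak suffix are my own inventions, and it assumes you pass it the path to your sendmail.mc. It only edits the file; running make and restarting sendmail remain the manual steps listed above.

```shell
#!/bin/sh
# Hypothetical helper: append the WorkAroundBrokenAAAA option to a sendmail.mc
# file. The function name and the ".bak" suffix are illustrative only.
add_bind_opts() {
    mc_file="$1"
    # Note the m4 quoting: opening backtick, closing plain apostrophe.
    line="define(\`confBIND_OPTS', \`WorkAroundBrokenAAAA')dnl"

    # Keep a copy of the original, unless a backup already exists.
    [ -e "$mc_file.bak" ] || cp -p "$mc_file" "$mc_file.bak" || return 1

    # Append only if the option is not already present.
    if ! grep -qF "WorkAroundBrokenAAAA" "$mc_file"; then
        printf '%s\n' "$line" >> "$mc_file"
    fi
}

# Usage (as root), followed by the rebuild and restart from the steps above:
#   add_bind_opts /etc/mail/sendmail.mc
#   cd /etc/mail && make
#   /sbin/service sendmail restart
```

Running it twice leaves a single copy of the line in the file, and the backup keeps the original contents.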

Note that on the original page the quotation marks are incorrect: the apostrophes have been replaced by typographic end-quotes (presumably by the blog software). That is,

define(`confBIND_OPTS’, `WorkAroundBrokenAAAA’)dnl

is incorrect. The line should read:

define(`confBIND_OPTS', `WorkAroundBrokenAAAA')dnl

(m4 quotes open with a backtick and close with a plain apostrophe.)

Not noticing this error led to me seeing the following error when running make:

   NONE:0: m4: ERROR: end of file in argument list
   make: *** [sendmail.cf] Error 1
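A quick way to catch this before running make is to search the file for typographic quotes. A small sketch, assuming the pasted text is UTF-8; the function name find_curly_quotes is my own:

```shell
# Hypothetical check: print (with line numbers) any lines containing
# typographic single quotes, which m4 does not treat as quote characters.
# Exits 0 if any such line is found, 1 if the file is clean.
find_curly_quotes() {
    grep -n -F -e '‘' -e '’' "$1"
}

# Usage:
#   find_curly_quotes /etc/mail/sendmail.mc && echo "fix these before running make"
```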

More important note: I’m not responsible for the accuracy of this content; I do accept responsibility for the following advice: “Take a backup of your sendmail config before embarking on the above!”.

Yelp! Check your NAS backups aren’t disappearing into a black hole!

Fri March 13, 2009

So, for a while I have been copying up our nightly database backup to an NAS server provided by our co-lo. The backup file is (after gzipping & encrypting) approximately 10GB. I’m running Linux kernel 2.6.18, Oracle EL distro. I mount the NAS server locally using the cifs module. So, I have an entry in fstab looking as follows:

//X.X.X.X/username /mnt/nas cifs _netdev,user=XXX%23xxx,uid=xxx 0 0

(where the X’s indicate sensitive data!). I then just copy the local backup file to /mnt/nas, using “cp”. I would prefer to use rsync, or similar, but … NAS is all our co-lo offers.

Anyway, periodically I have been checking that I can re-import these backups into a test database. All fine. However, recently I tried again and … discovered that [for some unknown amount of time] the backups have not been making it to the NAS server in one piece. A file is created there, and it even has the correct size but … after a few GB, it is corrupted somehow (I’ve read reports that ‘\0’ bytes replace the original data).

Lesson 1: Don’t put all of your eggs in one basket. Luckily, in my case, the NAS server wasn’t the sole repository for our database backups. But the above experience serves as a timely reminder of the importance of this policy!

Lesson 2: Periodically check your backups. Again, this I do but … not as often as I should have. I think I will implement a daily process to re-import the backups into a test db. Some people don’t have this luxury. If your production database is huge, then having a test database that’s big enough to import your nightly backups on a daily basis, within a reasonable time-frame, might be a luxury you cannot afford. Still, check them as often as possible.

Lesson 3: Check checksums! I had fallen into the trap of thinking “it’s a mounted filesystem, I’m using ‘cp’, the filesizes are the same…what could possibly go wrong?” Well…it looks like things can go wrong. Kind-of obvious really. So, these days, after copying the backup to the NAS server, I double check that the remote file has the same checksum as my local backup. I use md5sum, a standard Linux tool. If I were fortunate enough to have a shell login for the remote NAS server, I would generate the checksum there. As it is, though, I have to generate it on my local machine which – effectively – means that I have to copy the whole backup back across the network again! It’s worth it though.
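The check itself is only a few lines of shell. A minimal sketch, assuming md5sum is installed; the function name same_md5 and the paths in the usage comment are made up for illustration:

```shell
# Hypothetical helper: succeed only if both files have the same MD5 checksum.
same_md5() {
    # Reading from stdin makes md5sum print "<hash>  -"; keep only the hash.
    local_sum=$(md5sum < "$1" | cut -d ' ' -f 1)
    remote_sum=$(md5sum < "$2" | cut -d ' ' -f 1)
    # An empty hash means a file could not be read, so treat that as failure.
    [ -n "$local_sum" ] && [ "$local_sum" = "$remote_sum" ]
}

# Usage (illustrative paths):
#   cp /backup/db.dmp.gz.enc /mnt/nas/ \
#     && same_md5 /backup/db.dmp.gz.enc /mnt/nas/db.dmp.gz.enc \
#     || echo "NAS copy does not match the local backup"
```

If the checksums differ, cmp on the same two files will report the offset of the first differing byte, which gives a rough idea of where the copy went wrong.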

Finally, the solution to the bug: After a bit of googling, I found the following pages:

  1. Copying large files to CIFS mount (XP) may corrupt data!?
  2. Corrupted data on write to Windows 2003 Server

I checked in /var/log/messages, which is where the cifs filesystem logs to (as it’s a kernel module), and saw messages such as:

CIFS VFS: No response to cmd 47 mid 33281
CIFS VFS: Write2 ret -11, written = 0
CIFS VFS: No response for cmd 50 mid 33289
CIFS VFS: server not responding
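To see how often this is happening, the log can simply be searched for lines from the cifs module. A small sketch; the function name is my own, and the log path in the usage comment is the one from my system:

```shell
# Hypothetical helper: count the lines logged by the cifs kernel module
# in a syslog-style file. Prints 0 if there are none.
count_cifs_messages() {
    grep -c 'CIFS VFS:' "$1"
}

# Usage:
#   count_cifs_messages /var/log/messages
```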

The problem appears to be that the client-side cifs cache can corrupt your data if you’re copying a large file and the server is unavailable at any point during the copy (e.g. if the network is a bit flaky, or the server is momentarily slow/unresponsive). The solution is to disable the client-side cache by adding the forcedirectio option to the mount options, as follows:

//X.X.X.X/username /mnt/nas cifs _netdev,user=XXX%23xxx,uid=xxx,forcedirectio 0 0

One of the links above points here, which suggests that there is a patch for the cifs kernel module, but I’m unsure whether this patch has been applied to the Linux trunk yet. If you know, feel free to leave a comment!

(The above advice comes with all the usual disclaimers. For example, this looks rather scary: CIFS option forcedirectio fails to allow the appending of text to files.)