Lessons from a SAN Failure

At 1:10 AM on Sunday morning the main SAN at one of my clients suffered a “partial” failure. “Partial” means that the SAN was still online and functioning but the LUNs attached to our two main SQL Servers “failed”. “Failed” means that SQL Server wouldn’t start and most of the MDF and LDF files showed a file size of zero. The LUNs themselves were still online and responding, and most other LUNs were available.

I’m not sure how SANs know to fail at 1 AM on a Saturday night but they seem to. From a personal standpoint this worked out poorly: I was out with friends and more than a few drinks in. From a work standpoint this was about the best time to fail you could imagine. Everything was running well before Monday morning. But it was a long, long Sunday. I started tipsy, got tired and ended up hung over later in the day. Note to self: Try not to go out drinking right before the SAN fails.

This caught us at an interesting time. We’re in the process of migrating to an entirely new set of servers, so some things were partially moved. This made it difficult to follow our procedures as cleanly as we’d like. The upside of the migration was that we had much better documentation of everything on the server. I would encourage everyone to really think through the process of implementing your DR plan and document as much as possible. Following a checklist is much easier than trying to remember everything at night, under pressure, in a hurry, after a few drinks.

I had a series of estimates on how long things would take. They were accurate for any single server failure. They weren’t accurate for a SAN failure that took two servers down. This didn’t turn out badly, but we should have communicated better about how long recovery would take.

Don’t forget how many things are outside the database. Logins, linked servers, DTS packages (yikes!), jobs, Service Broker, DTC (especially DTC), database triggers and any objects in the master database are all things you need backed up. We’d done a decent job on this and didn’t find significant problems here. That said, recreating them still took a lot of time and brought many annoyances. Small settings like a login’s default database had a big impact on whether an application could run. This is probably the single biggest area of concern when looking to recreate a server. I’d encourage everyone to go through every single node of SSMS and look for user-created objects or settings outside the database.

Script out your logins with the proper SID and the already-hashed passwords, and keep the script updated. This makes life so much easier. I used an approach based on KB246133 that worked well. I’ll get my scripts posted over the next few days.
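To give a rough idea of what such a script contains, here is a minimal Python sketch that formats a CREATE LOGIN statement carrying an existing SID and password hash. This is not the KB246133 script. The function names are hypothetical, and the hash and SID bytes in the usage example are made-up placeholders; in practice both values would have to be read from the production server.

```python
def quote_name(name: str) -> str:
    """Bracket-quote an identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def hex_literal(raw: bytes) -> str:
    """Render raw bytes as a T-SQL binary literal such as 0x01AB."""
    return "0x" + raw.hex().upper()


def create_login_statement(name: str, password_hash: bytes, sid: bytes,
                           default_database: str = "master") -> str:
    """Build a CREATE LOGIN statement that keeps the original SID and hash.

    Keeping the SID means database users stay mapped to the login after a
    restore; keeping the hash means nobody has to reset a password.
    """
    return (
        f"CREATE LOGIN {quote_name(name)} "
        f"WITH PASSWORD = {hex_literal(password_hash)} HASHED, "
        f"SID = {hex_literal(sid)}, "
        f"DEFAULT_DATABASE = {quote_name(default_database)};"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only, not real credentials.
    print(create_login_statement("app_user", bytes.fromhex("0100abcd"),
                                 bytes.fromhex("0102"), "Sales"))
```

Including the default database in the statement matters because a wrong default database alone can keep an application from connecting.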

The disaster can cause your DR process to fail in unexpected ways. We have a job that scripts out all logins and role memberships and writes them to a file. It ran on the DR server and pulled from the production server. Upon opening the file I found that the contents were a “server not found” error. Fortunately we had other copies and didn’t need to try to restore the master database. The job now runs on the production server and pushes the script to the DR site. Soon we’ll get it pushed to our version control software.
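One cheap safeguard against this kind of silent failure is to sanity-check the generated file before trusting it. The Python sketch below is an assumption about how that could be done, not what our job does: it only checks that the text contains at least one statement that creates a login. The function name and the threshold parameter are hypothetical.

```python
import re

# Matches either the CREATE LOGIN statement or the older sp_addlogin procedure.
LOGIN_STATEMENT = re.compile(r"\b(?:CREATE\s+LOGIN|sp_addlogin)\b", re.IGNORECASE)


def looks_like_login_script(text: str, min_statements: int = 1) -> bool:
    """Return True if the text contains at least min_statements login statements.

    An error message written to the file in place of the script
    (for example "server not found") contains none, so it is rejected.
    """
    return len(LOGIN_STATEMENT.findall(text)) >= min_statements


if __name__ == "__main__":
    good = "CREATE LOGIN [app_user] WITH PASSWORD = 0x01 HASHED;"
    bad = "Error: server not found"
    print(looks_like_login_script(good))
    print(looks_like_login_script(bad))
```

A check like this could run right after the scripting job and raise an alert when it fails, so a bad file is discovered long before the night it is needed.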

One of the biggest challenges is keeping your DR resources up to date. Any server change (new linked server, new SQL Server Agent job, etc.) means that your DR plan (and scripts) is out of date. It helps to automate the generation of these resources if possible.
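As a sketch of what automating this could look like, the Python snippet below compares a documented inventory of server-level objects against a fresh one and reports what has drifted. How the inventories are collected is left out, and the category names and object names in the example are hypothetical.

```python
def inventory_drift(documented: dict, current: dict) -> dict:
    """Compare two inventories of the form {category: iterable of names}.

    Returns only the categories that differ, each with the names that were
    added to or removed from the server since the documentation was written.
    """
    drift = {}
    for kind in sorted(set(documented) | set(current)):
        old = set(documented.get(kind, ()))
        new = set(current.get(kind, ()))
        if old != new:
            drift[kind] = {
                "added": sorted(new - old),
                "removed": sorted(old - new),
            }
    return drift


if __name__ == "__main__":
    # Hypothetical inventories for illustration.
    documented = {"linked_servers": ["REPORTS"], "jobs": ["NightlyBackup"]}
    current = {"linked_servers": ["REPORTS", "BILLING"], "jobs": ["NightlyBackup"]}
    print(inventory_drift(documented, current))
```

An empty result means the DR documentation still matches the server; anything else is a list of what needs to be scripted or removed.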

Take time now to test your database restore process. We test ours quarterly. If you have a large database I’d also encourage you to invest in a compressed backup solution. Restoring backups was the single largest consumer of time during our recovery. And yes, there’s a database mirroring solution planned in our new architecture.

I didn’t have much involvement in things outside SQL Server, but the failure caused many, many things to change in our environment. Many applications today aren’t just executables or web sites. They are a combination of those plus network infrastructure, reports, network ports, IP addresses, DTS and SSIS packages, batch systems and many other things. These all needed a little bit of attention to make sure they were functioning properly.

Profiler turned out to be a handy tool. I started a trace for failed logins and kept that running. That let me fix a number of problems before people were able to report them. I also ran traces to capture exceptions. This helped identify problems with linked servers.

Overall the thing that gave me the most trouble was linked servers. In order for a linked server to function properly you need to be pointed to the right server, have the proper login information, have the network routes available and have MSDTC configured properly. We have a lot of linked servers and this created many failure points. Some of the older linked servers used IP addresses and not DNS names. This meant we had to go in and touch all those linked servers when the servers moved.
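Finding the linked servers that point at raw IP addresses is easy to automate once you have a list of their names and data sources. The Python sketch below assumes that list has already been exported into a dictionary; the function names and the sample entries are hypothetical.

```python
import ipaddress


def is_ip_literal(data_source: str) -> bool:
    """Return True if the host part of a data source is an IP address.

    An instance name or port, as in 10.0.0.5\\INST,1433, is stripped first.
    """
    host = data_source.strip().split("\\", 1)[0].split(",", 1)[0].strip()
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def linked_servers_using_ip(linked_servers: dict) -> list:
    """Return the names of linked servers whose data source is an IP address."""
    return sorted(name for name, source in linked_servers.items()
                  if is_ip_literal(source))


if __name__ == "__main__":
    # Hypothetical linked server names and data sources.
    servers = {
        "REPORTS": "sqlreports.example.com",
        "LEGACY": "10.1.2.3\\SQL01,1433",
    }
    print(linked_servers_using_ip(servers))
```

Anything this reports is a linked server that will have to be edited by hand the next time its target moves, so it is a good candidate for switching to a DNS name ahead of time.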