Not 4 hours after posting my most recent blog stressing the importance of setting up systems with disaster recovery in mind, fate stepped up and thwacked me. "Oh yeah, think you're so resilient? How about I take down that critical LAN server you haven't upgraded?"
Yes, it's true. While we have really good recovery plans for all our production web servers and sites, we still have two legacy systems that we are not set up to quickly replace.
Well, make that one now. One of them just failed.
They've been on our list to upgrade and automate, and one of them is nearly done. But Monday afternoon, all of a sudden I started hearing "The Internet is down!" from a few developers. Sure enough, our internal DNS/DHCP and telephone server had locked up completely. This was a system that had not been updated in some 6 years, and aside from a hard drive replacement a couple years ago, had mostly just run without issue for what seems like forever.
With a power reset, it came back up, but was showing some hardware errors on the console, and would not accept keyboard input. And it froze up again 15 minutes later. The next reboot had the same result.
One of the tasks on my weekly list has been to plan for this system's replacement. And I've moved that to "next week" for about 6 months now. Oh, yeah, and our other systems guy was home sick that day, and unavailable. Crap. There went the rest of my task list for the day.
You know what? I actually do like being a firefighter, going in to save the day. I quickly did an inventory of what systems were running on that box that needed to be spun up elsewhere:
- DNS
- DHCP
- TFTP (for our phones to boot)
- Asterisk (our phone system)
The first two were easy to get a temporary system in place: log into the router, turn on those services, and get people to refresh their network connection. Boom, okay, at least people can work now.
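What "turn on those services" looks like depends entirely on your router. On a Linux-based box, dnsmasq can stand in for both DNS and DHCP in one command; the interface name, upstream resolver, and address range below are just illustrative examples, not our actual network:

```shell
# Emergency stand-in DNS forwarder + DHCP server using dnsmasq.
# Interface, upstream server, and address range are placeholders -- adjust
# for your network. --no-daemon keeps it in the foreground so you can
# watch queries and leases while you firefight.
dnsmasq --no-daemon \
        --interface=eth0 \
        --server=8.8.8.8 \
        --dhcp-range=192.168.1.100,192.168.1.200,12h
```

As a bonus, dnsmasq can also serve TFTP (`--enable-tftp --tftp-root=...`), which is handy when the same emergency box needs to boot phones.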
Time to grab a coffee, it's going to be a long night.
Next, figure out where to run Asterisk (and the other services) longer term. We've got a bunch of relatively new workstations that are far more powerful than our old phone server, certainly capable of doing both work and running the phones. I decided mine would work until we got replacement hardware. Ok! Let's create a profile in Salt to start automating the necessary configurations, and build it up. That went smoothly.
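For flavor, a Salt state for this kind of profile might look something like the sketch below. The file layout, package names, and state IDs are hypothetical, not our actual configuration:

```yaml
# Hypothetical salt/phone-server/init.sls -- names and paths are
# illustrative; adapt to your distro and config layout.
dnsmasq:
  pkg.installed: []
  service.running:
    - enable: True
    - watch:
      - file: /etc/dnsmasq.conf

/etc/dnsmasq.conf:
  file.managed:
    - source: salt://phone-server/files/dnsmasq.conf
    - mode: 644

asterisk:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: asterisk
```

The nice part is that once this exists, "recover the phone server" becomes "apply this state to any spare box."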
Now let's recover and update our configurations. Whoops! Looks like somehow our primary backup hasn't backed up this particular box for 643 days. ?!?!? It's set to disabled, so it didn't trigger any notifications? Ok, how about the secondary backup? Last October. Yikes! I started with those backups, and really there wasn't that much that had changed since then -- this box had pretty much been set and forgotten, but still. Time to review the backups of all our boxes to make sure we're not missing any others! But we turned out to be lucky -- I popped the drive out of the failed box and put it in my workstation, and everything was there.
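That backup review is easy to script. Here's a minimal sketch of a staleness check, assuming a hypothetical layout of one subdirectory per host under a backup root -- adapt the paths and threshold to whatever your backup system actually produces:

```shell
# check_stale_backups ROOT DAYS: list per-host backup directories under
# ROOT that contain no file newer than DAYS days. The one-directory-per-
# host layout is an assumption, not a universal convention.
check_stale_backups() {
    root=$1
    max_age_days=$2
    for host_dir in "$root"/*/; do
        [ -d "$host_dir" ] || continue
        # -mtime -N matches files modified within the last N days;
        # -print -quit stops at the first match (GNU find).
        if [ -z "$(find "$host_dir" -type f -mtime -"$max_age_days" -print -quit)" ]; then
            echo "STALE: $host_dir"
        fi
    done
}
```

Run it from cron and mail yourself the output, and a 643-day gap can't hide behind a disabled job.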
The outage was at 3:15. By 4:00 (after 2 more outages) I had temporary DNS up and running. By 7:00 I had automated DNS/DHCP and TFTP configurations running on my workstation. By 10:00 I had Asterisk and FreePBX installed and running, but could not get our phones to connect successfully, or get it hooked up to our main trunk. Around 1:00 am, I gave up for the night and went home to get some sleep -- the calls would fail over to my cell until this was fixed.
On Tuesday it took another 5 hours, interrupted by a couple of appointments, to track down and fix the 3 issues that were keeping our phone system from working. Luckily it was a slow day.
All in all, it took about 12 hours of work to recover, with phone service unavailable for 23 hours and flaky Internet for an hour. We're not that reliant on our phone system, since we all have cell phones, so this was a minor inconvenience. But with proper planning, it should not have taken such a big chunk out of my schedule.
Now that we have the systems mostly in configuration management, if this happened again today it would take about 1 1/2 hours to recover on another box, most of that re-configuring FreePBX, which still involves running scripts and configuring things manually. We'll get that backed up today, which should bring recovery down to 1/2 hour or so.
If we had planned ahead and gotten the configuration management in place, we probably would have spent just as much time getting it working, but we might have been able to keep our phones in service and not had my schedule thrown off so much.
Lesson learned. Time to go create a recovery plan for that last really vulnerable box...