Fate doesn't like to be tempted

By John Locke on August 9, 2013

Not 4 hours after I posted my most recent blog post stressing the importance of setting up systems with disaster recovery in mind, fate stepped up and thwacked me. "Oh yeah, think you're so resilient? How about I take down that critical LAN server you haven't upgraded?"
Yes, it's true. While we have really good recovery plans for all our production web servers and sites, we still have two legacy systems that we are not set up to quickly replace.
Well, that's one now. One just failed.
They've been on our list to upgrade and automate, and one of them is nearly done. But Monday afternoon, all of a sudden I started hearing "The Internet is down!" from a few developers. Sure enough, our internal DNS/DHCP and telephone server had locked up completely. This was a system that had not been updated in some 6 years, and aside from a hard drive replacement a couple years ago, had mostly just run without issue for what seems like forever.
With a power reset, it came back up, but was showing some hardware errors on the console, and would not accept keyboard input. And it froze up again 15 minutes later. The next reboot had the same result.
One of the tasks on my weekly list has been to plan for this system's replacement. And I've moved that to "next week" for about 6 months now. Oh, yeah, and our other systems guy was home sick that day, and unavailable. Crap. There went the rest of my task list for the day.
You know what? I actually do like being a firefighter, going in to save the day. I quickly did an inventory of what systems were running on that box that needed to be spun up elsewhere:

  • DHCP
  • DNS
  • TFTP (for our phones to boot)
  • Asterisk

For the first two, a temporary replacement was easy: log into the router, turn on those services, and get people to refresh their network connection. Boom, okay, at least people can work now.
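After a switch like that, a quick script can confirm that names resolve again from a workstation. This is only a sketch, not what we ran that afternoon: the hostnames are made up, and the `lookup` parameter exists purely so the check can be exercised without a live network.

```python
import socket


def check_names(hostnames, lookup=socket.gethostbyname):
    """Map each hostname to the address it resolves to, or None on failure."""
    results = {}
    for name in hostnames:
        try:
            results[name] = lookup(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical internal names; substitute ones your LAN actually serves.
    names = ["pbx.example.lan", "files.example.lan"]
    for name, address in check_names(names).items():
        print(f"{name}: {address or 'FAILED'}")
```

Run it once before and once after people refresh their connection, and any name still failing stands out.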
Time to grab a coffee, it's going to be a long night.
Next, figure out where to run Asterisk (and the other services) longer term. We've got a bunch of relatively new workstations that are far more powerful than our old phone server, certainly capable of doing both work and running the phones. I decided mine would work until we got replacement hardware. Ok! Let's create a profile in Salt to start automating the necessary configurations, and build it up. That went smoothly.
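For a rough idea of what such a Salt state can look like, here is a minimal sketch using Salt's pure-Python (`#!py`) renderer, where a `run()` function returns the state data. The package and service names other than Asterisk are assumptions for illustration, not our actual configuration, and a real setup would also manage the config files themselves.

```python
#!py
# Sketch of a Salt state file written for the "py" renderer.
# ASSUMPTION: package/service names below are illustrative only;
# the real DNS/DHCP and TFTP packages are not named in this post.

SERVICES = {
    "dnsmasq": "dnsmasq",      # assumed: one daemon providing DNS and DHCP
    "tftpd-hpa": "tftpd-hpa",  # assumed: TFTP server the phones boot from
    "asterisk": "asterisk",    # the phone system itself
}


def run():
    """Return state data: install each package and keep its service running."""
    states = {}
    for pkg, service in SERVICES.items():
        install_id = pkg + "-installed"
        states[install_id] = {"pkg": ["installed", {"name": pkg}]}
        states[service + "-running"] = {
            "service": [
                "running",
                {"name": service},
                {"enable": True},
                # Do not try to start a service before its package exists.
                {"require": [{"pkg": install_id}]},
            ]
        }
    return states
```

The point of keeping the list of services in one place is that moving them to replacement hardware later means targeting a different minion, not rebuilding anything by hand.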
Now let's recover and update our configurations. Whoops! Looks like somehow our primary backup hasn't backed up this particular box for 643 days. ?!?!? It's set to disabled, so it didn't trigger any notifications? Ok, how about the secondary backup? Last October. Yikes! I started with those backups, and really there wasn't that much that had changed since then -- this box had pretty much been set and forgotten, but still. Time to review the backups of all our boxes to make sure we're not missing any others! But we turned out to be lucky -- I popped the drive out of the failed box and put it in my workstation, and everything was there.
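The trap here was that a disabled job stays silent. A backup review can treat "disabled" as a problem in its own right, next to "too old". The sketch below assumes you can export each job's host, enabled flag, and last successful run from your backup tool; the record layout, hostnames, and two-day threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def describe_age(last_success, now):
    """Human-readable age of the most recent successful backup."""
    if last_success is None:
        return "no successful backup on record"
    return f"last success {(now - last_success).days} days ago"


def stale_backups(jobs, now, max_age=timedelta(days=2)):
    """Return (host, reason) pairs for jobs that are disabled or too old."""
    problems = []
    for job in jobs:
        last = job["last_success"]
        if not job["enabled"]:
            # A disabled job never alerts on its own, so flag it explicitly.
            problems.append((job["host"], "job disabled, " + describe_age(last, now)))
        elif last is None or now - last > max_age:
            problems.append((job["host"], describe_age(last, now)))
    return problems


# Hypothetical job records, as if exported from the backup system.
now = datetime(2013, 8, 5, 16, 0)
jobs = [
    {"host": "web1", "enabled": True, "last_success": now - timedelta(hours=20)},
    {"host": "lan-server", "enabled": False, "last_success": now - timedelta(days=643)},
    {"host": "new-box", "enabled": True, "last_success": None},
]

for host, reason in stale_backups(jobs, now):
    print(f"{host}: {reason}")
```

Scheduled daily and wired to email or chat, a report like this would have flagged the box long before day 643.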
The outage was at 3:15. By 4:00 (after 2 more outages) I had temporary DNS up and running. By 7:00 I had automated configurations for DNS/DHCP and TFTP, running on my workstation. By 10:00 I had Asterisk and FreePBX installed and running, but could not get our phones to connect successfully, or get it hooked up to our main trunk. Around 1:00 am, I gave up for the night and went home to get some sleep -- the calls would fail over to my cell until this was fixed.
On Tuesday it took another 5 hours to track down and figure out the 3 issues that were keeping our phone system from working, interrupted by a couple of appointments. Luckily it was a slow day.
All in all, that took about 12 hours to recover, with phone service unavailable for 23 hours and flaky Internet for an hour. We're not that reliant on our phone system since we all have cell phones, so this was a minor inconvenience. But with proper planning, this should not have taken such a big chunk out of my schedule.
Now that we have the systems mostly in configuration management, if this happened again today it would take about 1 1/2 hours to recover on another box. Most of that would go to re-configuring FreePBX, which took quite a bit of running scripts and configuring things manually. We'll get that backed up today, which should bring recovery down to 1/2 hour or so.
If we had planned ahead and gotten the configuration management in place, we probably would have spent as much time getting it working, but we might have been able to keep our phones in service, and my schedule would not have been thrown off so much.
Lesson learned. Time to go create a recovery plan for that last really vulnerable box...
