This is why you want to be on our maintenance plans. Our number one priority is recoverability, from just about any risk. And today, we had a client that needed this, in a very bad way!
It all started with an alert on my phone saying a production site was down. Frequently these are temporary Internet blips that are gone when I actually try to pull up the site. Not this time! No site, and I couldn't even reach the server.
I pulled up the client's AWS credentials and logged in to see if there was some sort of problem with the actual server. And yes, it was showing as "Terminated" -- Amazon's term for permanently deleted. What?!?!? Had the client discontinued service without letting us know?
There were several new servers in the account, but none looked like a replacement site. We looked in the users section and found a user we weren't aware of, who had created these servers... had this person deleted the web server? Time to get on the phone.
When we reached our client contact, they had just learned their site was down, and were gratified that we already knew and had called them. We identified the user who had been making changes in their account, and our contact reached him and confirmed that yes, indeed, he may have accidentally deleted their server. Bingo! Now we knew what had happened. Next step: what to do about it?
This is where our recovery plans kicked in. First up, assess what we have to work with. The server was irrevocably deleted, along with its root disk. However, the data disk with the actual site and assets was still available to be re-attached, and we had a disk snapshot of the root volume from 9 hours before the deletion.
I spun up a brand new instance, re-attached the IP address, attached the data volume, and attached a new volume created from that root snapshot.
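For the curious, here is a minimal sketch of the "which snapshot do we restore from?" decision. It is not our actual tooling. It assumes snapshot records shaped like the entries boto3's EC2 `describe_snapshots` call returns (`SnapshotId`, `VolumeId`, `State`, `StartTime`), and the function name, IDs, and timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def pick_recovery_snapshot(snapshots, volume_id, deleted_at):
    """Return the newest completed snapshot of volume_id taken before deleted_at.

    Returns None when no snapshot qualifies.
    """
    candidates = [
        s for s in snapshots
        if s["VolumeId"] == volume_id
        and s["State"] == "completed"      # skip pending or failed snapshots
        and s["StartTime"] < deleted_at    # only snapshots from before the deletion
    ]
    return max(candidates, key=lambda s: s["StartTime"], default=None)


# Hypothetical data: the deletion time and every ID below are invented.
deleted_at = datetime(2020, 1, 15, 18, 0, tzinfo=timezone.utc)
snapshots = [
    {"SnapshotId": "snap-old", "VolumeId": "vol-root", "State": "completed",
     "StartTime": deleted_at - timedelta(hours=33)},
    {"SnapshotId": "snap-recent", "VolumeId": "vol-root", "State": "completed",
     "StartTime": deleted_at - timedelta(hours=9)},
    {"SnapshotId": "snap-pending", "VolumeId": "vol-root", "State": "pending",
     "StartTime": deleted_at - timedelta(hours=1)},
    {"SnapshotId": "snap-data", "VolumeId": "vol-data", "State": "completed",
     "StartTime": deleted_at - timedelta(hours=2)},
]

chosen = pick_recovery_snapshot(snapshots, "vol-root", deleted_at)
print(chosen["SnapshotId"])  # snap-recent
```

The unfinished snapshot and the data-volume snapshot are both skipped, so the 9-hour-old root snapshot is the one that gets restored.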
I installed our configuration management client (the "salt-minion"), copied over the database and SSL certificate, and applied the configuration. After approving a couple of credentials in our system, the site was back up and running! Total time of the outage: 50 minutes, of which the first 15 were spent trying to reach our client to determine why the server had been deleted...
For incidents like these, we do charge for recovery. However, with our maintenance plan in place, this was a pretty straightforward recovery -- largely because of the planning work we do ahead of time, and the solid, reliable backups we set up to recover from a variety of risks. If we had not set up scripts that automatically backed up their AWS servers, and had other redundant backups available, the story might have had a much worse ending.
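Automatic backups only help if they are actually running, so it is worth checking their freshness too. The sketch below is illustrative only, not our production script: it flags volumes whose newest snapshot is missing or older than a threshold. The 24-hour threshold, the function name, and the volume IDs are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def stale_volumes(latest_snapshot_times, now, max_age=timedelta(hours=24)):
    """Return sorted volume IDs whose newest snapshot is missing or older than max_age."""
    return sorted(
        volume_id
        for volume_id, taken_at in latest_snapshot_times.items()
        if taken_at is None or now - taken_at > max_age
    )


# Hypothetical inventory: volume ID -> time of its newest snapshot (None = never).
now = datetime(2020, 1, 15, 18, 0, tzinfo=timezone.utc)
latest = {
    "vol-root": now - timedelta(hours=9),
    "vol-data": now - timedelta(hours=30),
    "vol-new": None,
}

print(stale_volumes(latest, now))  # ['vol-data', 'vol-new']
```

Anything this reports is a volume you would not want to lose today.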