Yesterday Amazon Web Services (AWS) had a major outage in their US-East datacenter in Virginia. It made all sorts of national news, largely because it affected some major online services.
At Freelock, we were largely unaffected. Part of this was that while we are pretty heavy users of AWS, we use their Oregon datacenter. And we generally don't rely on outside "software as a service" systems -- unlike most other shops, we're pretty obstinately open source, and would rather deal with the inconvenience of running our own systems than the inconvenience of having our entire business at the mercy of somebody else's infrastructure.
We were working happily away without interruption all day yesterday, and across everything we run, only one system was affected at all -- our "Continuous Integration" system, Concourse CI. And even that was working fine -- it's just that one container image in our pipeline comes from the Docker Hub, which was down. And that container image is one we publish on the Docker Hub for others to use, so it's still within our control -- if we had had a big deployment to do yesterday, we could easily have changed our pipeline to get the image from our own private Docker registry, where all the other images come from.
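Swapping the image source is a small pipeline edit. Here's a minimal sketch of what that resource definition could look like, assuming Concourse's `docker-image` resource type; the resource name, repository, and registry host are all hypothetical stand-ins:

```yaml
resources:
  # Hypothetical resource; repository names are illustrative, not our actual ones.
  - name: builder-image
    type: docker-image
    source:
      # Before: pulled from the Docker Hub
      # repository: freelock/builder
      # After: pulled from a private registry instead
      repository: registry.example.com:5000/builder
```

Re-running `fly set-pipeline` with that one `repository` change would point the pipeline at the private registry instead of the Hub.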
But what if the AWS Oregon data center had gone down instead?
That would have been a much different story for us, and we likely would not have gotten much actual client work done yesterday. About 1/3 of our production websites would have been offline, along with most of our larger clients' servers. No email, no access to our LAN backups or secondary backups from other cloud providers. Our central git server and Docker registry would have been offline. Our secondary DNS would have gone down -- but our primary would still be up.
If the outage was prolonged, here's roughly what we would have needed to do to restore service:
- Spin up a new secondary DNS server, and update our DNS registry to add it
- Deploy a replacement Docker registry, rebuild our private images, and publish them in the registry
- Deploy replacement web servers, perhaps at Google Cloud Engine (GCE), or in this case perhaps just a different AWS datacenter
- Recover all production websites from our secondary backup system
- Deploy a replacement mail server
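The registry step in that list, for example, doesn't have to be a big job: the official registry image can stand up a replacement in a single container. Here's a minimal sketch, with illustrative ports and paths (a real deployment would also want TLS and authentication in front of it):

```yaml
# Hypothetical docker-compose.yml for a replacement private Docker registry.
version: "2"
services:
  registry:
    image: registry:2           # the official registry image
    restart: always
    ports:
      - "5000:5000"             # illustrative port
    volumes:
      - ./registry-data:/var/lib/registry   # persist pushed images
```

Once it's up, rebuilt images can be pushed to it with `docker tag` and `docker push`, and the other servers pointed at the new host.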
Most of that would be pretty straightforward, though looking back at the nature of the Virginia datacenter outage, it highlights the need to complete some of our infrastructure backlog tasks. In particular, our secondary backup system is currently at AWS, and might have been affected by this outage. I was actually looking at spinning up another secondary backup system at GCE, specifically to back up the AWS servers, for exactly this type of risk.
Otherwise, the mail server would be the hardest thing to recover from, for two main reasons:
- It takes time to establish a new mail server IP address and not have all your email rejected as spam
- We last deployed that system in 2014, about a year before we switched over most of our infrastructure to Docker containers -- much of the configuration is centrally managed by our configuration management system (Salt) but there is some custom configuration that may take some time to recover.
Everything else we could recover pretty quickly and easily, largely because nearly all of our infrastructure is centrally managed from a server sitting on our LAN (and that server has all of its configuration backed up to AWS).
What if our main LAN server went down?
If that happened, we would largely lose access to our development sites and our project management systems. Our developers could still work by pulling sites down locally and working on their own computers. Our chat system (Matrix) would still be up, along with our "newer" project management tool (Taiga) because we host those both out at Digital Ocean.
What if an outage lasted longer than a half-day?
We could move everything we have at any one cloud provider to another cloud provider in a matter of a few hours, and aside from needing to get a redundant secondary backup in place, we could do it without any access whatsoever to the old provider.
That's because the level of service we outsource is commodity infrastructure. We can spin up a replacement virtual server at any of a thousand competing cloud providers and be fully back in business in less than a day, with the complete failure of any single location. And that's because we've thought about the risks we face, and developed a "Disaster Recovery plan" for our internal systems. And... that's because as a rule we don't rely on "Software as a Service" applications that are out of our control.
Own your applications, don't rent
If there's a single big takeaway from a massive outage like this, it's the risk of renting the applications you use, and not owning them. For just about every SaaS out there you may want to use, there's a free, open source alternative you can install and run yourself (or hire someone like us to run it for you).
| Instead of | We use |
| --- | --- |
| Gmail | Our own self-hosted Postfix, Dovecot, Roundcube server |
| Github | Our own Gitolite server |
| Docker Hub | Our own private Docker registry |
| TravisCI | Our own Concourse.ci |
| Jira, hosted PM tools | Our own Atrium (Drupal) PM system |
| Trello | Our own Taiga install |
| Slack, HipChat | Our own self-hosted Matrix server |
| Wix, Weebly, Squarespace, BigCommerce, other SaaS content management systems | Drupal, Drupal Commerce/Ubercart, WordPress |
| SalesForce, Zoho | Our own Drupal-based CRM |
| Office 365, Google Docs | LibreOffice, with Collabora shared editing on NextCloud |
| Dropbox, Box.com | Our own NextCloud install |
| Freshbooks | Our own LedgerSMB install |
There are some good reasons to use SaaS applications:
- Much lower cost than running the software yourself
- Professionals who presumably run their infrastructure better than you can
- "Best of breed" user experiences, from companies trying to make systems that are easy to use
But in my opinion, these are all entirely negated by the risks associated with using them:
- Vendor Lock-in -- you cannot take a proprietary SaaS system and run it yourself if you don't like the way you're treated by the vendor
- Switching costs -- once you start using a system, you cannot easily switch to a different one without losing data, paying the time and cost of making the change, and climbing a new learning curve for the new application
- Dependence on the vendor -- if it's a critical business system, by using a SaaS application you now are directly dependent on the vendor. If they choose to go a different direction and change the way the app works, you're stuck. If they sell to a larger company, they might get shut down. If they go out of business, you might have to scramble to switch. If they don't do their backups right, you may lose data -- and you may not have any opportunity to check that they are doing things right in the first place.
The only place it can make sense to use SaaS services is when you have bigger risks and a short time frame to make it work -- you can likely get there faster and cheaper with SaaS, but you should pick vendors that will let you easily export your data, and preferably have open source alternatives you can jump to when you're ready to start stabilizing your company and mitigating risks.
Open Source outages
AWS isn't the only place to suffer severe outages recently. Two open source-based services had similar outages, one with some data loss.
We use Matrix as our main chat system, primarily for communications across our team but also to interact with the broader community, clients, and soon, potential clients. The main Matrix.org server had a major 13-hour outage that brought down much of the service due to some disk space issues. Matrix.org has gone down a bunch of times.
But the distributed nature of Matrix means that even though we use it heavily, we didn't notice any outage at all. Our own server was entirely unaffected, and other than less communication from people who use matrix.org itself, we didn't notice a thing.
Gitlab data recovery issue
A month ago, Gitlab, an open source competitor to Github, had a major meltdown. They were apparently hit with a spam attack, possibly exacerbated by some other data deletion operations, and the service locked up. The whole gory story is worth a read, but in short, there was some human error in restoring the wrong backup, and all 5 of their backup/redundancy systems turned out not to be operating correctly, so they ended up losing 6 hours of data.
The point is, stuff like this happens all the time in technology. Things fail. But as an open source company, you can run your own Gitlab server, and everybody who did was completely unaffected by Gitlab.com's outage. (And git being a "distributed system", the core code that people store in Gitlab would always have multiple copies -- the data lost was really metadata: issues, comments, pull requests, etc.)
Freelock data recovery issue
Just last week, we had a minor incident with a similar "uh-oh" pit in the stomach. I was cleaning out some old email mailboxes and deleted our main sales mailbox entirely.
That same day, I noticed that our main backup system had not successfully backed up our mail server for a year -- it was getting stuck on a particular directory and failing to complete. I excluded that directory and let it run... but too late for the mailbox I had already deleted!
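The fix for a backup that hangs on one directory is usually a one-line exclusion. Here's a minimal, self-contained sketch of the idea using rsync; the paths and the "stuck-dir" name are hypothetical examples, not our actual mail layout:

```shell
# Sketch: exclude a directory that hangs an rsync-based backup run.
# All paths and the "stuck-dir" name are hypothetical examples.
mkdir -p /tmp/mailsrc/inbox /tmp/mailsrc/stuck-dir
echo "a message"    > /tmp/mailsrc/inbox/1.eml
echo "problem data" > /tmp/mailsrc/stuck-dir/blob

# Back up everything except the directory that was getting stuck:
mkdir -p /tmp/mailbackup
rsync -a --exclude='stuck-dir/' /tmp/mailsrc/ /tmp/mailbackup/

ls /tmp/mailbackup   # inbox is copied; stuck-dir is skipped
```

The trailing slash on the exclude pattern matches the directory itself, so rsync never descends into it -- which is exactly what lets a run that used to hang complete.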
So there I was, thinking "crap! I just deleted it, and lost a year's worth of data!" I restored the year-old backup... and then went to our next backup system, nightly snapshots. That snapshot was fine. I mounted it and recovered the data from the previous night. Mail started delivering again, and the two legitimate mails that had been unable to deliver came through and all was well.
Outages are a basic fact of life
Freelock is an open source company through and through, and yesterday's massive outage illustrates a big reason why: instead of taking an "Internet Snow Day" we could continue working entirely as usual. And while there are always improvements to be made, any business that cares about staying in business needs to consider what risks they face with their IT systems, websites, communications systems, and project management tools, and have a plan for what to do if something major fails.