Yesterday Amazon Web Services (AWS) had a major outage in their US-East datacenter in Virginia. It made all sorts of national news, largely because it affected some major online services.
At Freelock, we were largely unaffected. Part of this was that while we are pretty heavy users of AWS, we use their Oregon datacenter. And we generally don't rely on outside "software as a service" systems -- unlike most other shops, we're pretty obstinately open source, and would rather deal with the inconvenience of running our own systems than the inconvenience of having our entire business at the mercy of somebody else's infrastructure.
We were working happily away without interruption all day yesterday, and across everything we run, only one system was affected at all -- our "Continuous Integration" system, Concourse CI. And even that was working fine -- it's just that one container image in our pipeline comes from the Docker Hub, which was down. And that container image is one we publish on the Docker Hub for others to use, so it's still within our control -- if we had had a big deployment to do yesterday, we could easily have changed our pipeline to get the image from our own private Docker registry, where all the other images come from.
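Swapping the image source is a small pipeline edit. Here's a minimal sketch of what that resource definition could look like, assuming Concourse's `docker-image` resource type; the resource name, repository, and registry host are all hypothetical stand-ins:

```yaml
resources:
  # Hypothetical resource; repository names are illustrative, not our actual ones.
  - name: builder-image
    type: docker-image
    source:
      # Before: pulled from the Docker Hub
      # repository: freelock/builder
      # After: pulled from a private registry instead
      repository: registry.example.com:5000/builder
```

Re-running `fly set-pipeline` with that one `repository` change would point the pipeline at the private registry instead of the Hub.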
But what if the AWS Oregon data center had gone down instead?
That would have been a much different story for us, and we likely would not have gotten much actual client work done yesterday. About 1/3 of our production websites would have been offline, along with most of our larger clients' servers. No email, no access to our LAN backups or secondary backups from other cloud providers. Our central git server and Docker registry would have been offline. Our secondary DNS would have gone down -- but our primary would still be up.
If the outage was prolonged, here's roughly what we would have needed to do to restore service:
- Spin up a new secondary DNS server, and update our DNS registry to add it
- Deploy a replacement Docker registry, rebuild our private images, and publish them in the registry
- Deploy replacement web servers, perhaps at Google Cloud Engine (GCE), or in this case perhaps just a different AWS datacenter
- Recover all production websites from our secondary backup system
- Deploy a replacement mail server
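The registry step in that list, for example, doesn't have to be a big job: the official registry image can stand up a replacement in a single container. Here's a minimal sketch, with illustrative ports and paths (a real deployment would also want TLS and authentication in front of it):

```yaml
# Hypothetical docker-compose.yml for a replacement private Docker registry.
version: "2"
services:
  registry:
    image: registry:2           # the official registry image
    restart: always
    ports:
      - "5000:5000"             # illustrative port
    volumes:
      - ./registry-data:/var/lib/registry   # persist pushed images
```

Once it's up, rebuilt images can be pushed to it with `docker tag` and `docker push`, and the other servers pointed at the new host.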
Most of that would be pretty straightforward, though looking back at the nature of the Virginia datacenter outage, it highlights the need to complete some of our infrastructure backlog tasks. In particular, our secondary backup system is currently at AWS, and might have been affected by this outage. I was actually looking at spinning up another secondary backup system at GCE, specifically to back up the AWS servers, for exactly this type of risk.
Otherwise, the mail server would be the hardest thing to recover from, for two main reasons:
- It takes time to establish a new mail server IP address and not have all your email rejected as spam
- We last deployed that system in 2014, about a year before we switched over most of our infrastructure to Docker containers -- much of the configuration is centrally managed by our configuration management system (Salt) but there is some custom configuration that may take some time to recover.
Everything else we could recover pretty quickly and easily, largely because nearly all of our infrastructure is centrally managed from a server sitting on our LAN (and that server has all of its configuration backed up to AWS).
What if our main LAN server went down?
If that happened, we would largely lose access to our development sites and our project management systems. Our developers could still work by pulling sites down locally and working on their own computers. Our chat system (Matrix) would still be up, along with our "newer" project management tool (Taiga) because we host those both out at Digital Ocean.
What if an outage lasted longer than a half-day?
We could move everything we have at any one cloud provider to another cloud provider in a matter of a few hours, and aside from needing to get a redundant secondary backup in place, we could do it without any access whatsoever to the old provider.
That's because the level of service we outsource is commodity infrastructure. We can spin up a replacement virtual server at any of a thousand competing cloud providers and be fully back in business in less than a day, with the complete failure of any single location. And that's because we've thought about the risks we face, and developed a "Disaster Recovery plan" for our internal systems. And... that's because as a rule we don't rely on "Software as a Service" applications that are out of our control.
Own your applications, don't rent
If there's a single big takeaway from a massive outage like this, it's the risk of renting the applications you use, and not owning them. For just about every SaaS out there you may want to use, there's a free, open source alternative you can install and run yourself (or hire someone like us to run it for you).
| Instead of | We use |
| --- | --- |
| Gmail | Our own self-hosted Postfix, Dovecot, Roundcube server |
| Github | Our own Gitolite server |
| Docker Hub | Our own private Docker registry |
| TravisCI | Our own Concourse.ci |
| Jira, hosted PM tools | Our own Atrium (Drupal) PM system |
| Trello | Our own Taiga install |
| Slack, HipChat | Our own self-hosted Matrix server |
| Wix, Weebly, Squarespace, BigCommerce, other SaaS content management systems | Drupal, Drupal Commerce/Ubercart, WordPress |
| SalesForce, Zoho | Our own Drupal-based CRM |
| Office 365, Google Docs | LibreOffice, with Collabora shared editing on NextCloud |
| Dropbox, Box.com | Our own NextCloud install |
| Freshbooks | Our own LedgerSMB install |
There are some good reasons to use SaaS applications:
- Much lower cost than running the software yourself
- Professionals who presumably run their infrastructure better than you can
- "Best of breed" user experiences, from companies trying to make systems that are easy to use
But in my opinion, these are all entirely negated by the risks associated with using them:
- Vendor Lock-in -- you cannot take a proprietary SaaS system and run it yourself if you don't like the way you're treated by the vendor
- Switching costs -- once you start using a system, you cannot easily switch to a different one without losing data, paying the time and cost of making the change, and climbing a new learning curve for the new application
- Dependence on the vendor -- if it's a critical business system, by using a SaaS application you now are directly dependent on the vendor. If they choose to go a different direction and change the way the app works, you're stuck. If they sell to a larger company, they might get shut down. If they go out of business, you might have to scramble to switch. If they don't do their backups right, you may lose data -- and you may not have any opportunity to check that they are doing things right in the first place.
The only place it can make sense to use SaaS services is when you have bigger risks and a short time frame to make it work -- you can likely get there faster and cheaper with SaaS, but you should pick vendors that will let you easily export your data, and preferably have open source alternatives you can jump to when you're ready to start stabilizing your company and mitigating risks.
Open Source outages
AWS isn't the only place to suffer severe outages recently. Two open source-based services had similar outages, one with some data loss.
We use Matrix as our main chat system, primarily for communications across our team but also to interact with the broader community, clients, and soon, potential clients. The main Matrix.org server had a major 13-hour outage that brought down much of the service due to some disk space issues. Matrix.org has gone down a bunch of times.
But the distributed nature of Matrix means that even though we use it heavily, we didn't notice any outage at all. Our own server was entirely unaffected, and other than less communication from people who use matrix.org itself, we didn't notice a thing.
Gitlab data recovery issue
A month ago, Gitlab, an open source competitor to Github, had a major meltdown. They were apparently hit with a spam attack, possibly exacerbated by some other data deletion operations, and the service locked up. The whole gory story is worth a read, but in short, there was some human error in restoring the wrong backup, and all 5 of their backup/redundancy systems turned out not to be operating correctly, so they ended up losing 6 hours of data.
The point is, stuff like this happens all the time in technology. Things fail. But as an open source company, you can run your own Gitlab server, and everybody who did was completely unaffected by Gitlab.com's outage. (And git being a "distributed system", the core code that people store in Gitlab would always have multiple copies -- the data lost was really metadata: issues, comments, pull requests, etc.)
Freelock data recovery issue
Just last week, we had a minor incident with a similar "uh-oh" pit in the stomach. I was cleaning out some old email mailboxes and deleted our main sales mailbox entirely.
That same day, I noticed that our main backup system had not successfully backed up our mail server for a year -- it was getting stuck on a particular directory and failing to complete. I excluded that directory and let it run... but too late for the mailbox I had already deleted!
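The fix for a backup that hangs on one directory is usually a one-line exclusion. Here's a minimal, self-contained sketch of the idea using rsync; the paths and the "stuck-dir" name are hypothetical examples, not our actual mail layout:

```shell
# Sketch: exclude a directory that hangs an rsync-based backup run.
# All paths and the "stuck-dir" name are hypothetical examples.
mkdir -p /tmp/mailsrc/inbox /tmp/mailsrc/stuck-dir
echo "a message"    > /tmp/mailsrc/inbox/1.eml
echo "problem data" > /tmp/mailsrc/stuck-dir/blob

# Back up everything except the directory that was getting stuck:
mkdir -p /tmp/mailbackup
rsync -a --exclude='stuck-dir/' /tmp/mailsrc/ /tmp/mailbackup/

ls /tmp/mailbackup   # inbox is copied; stuck-dir is skipped
```

The trailing slash on the exclude pattern matches the directory itself, so rsync never descends into it -- which is exactly what lets a run that used to hang complete.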
So there I was, thinking "crap! I just deleted it, and lost a year's worth of data!" I restored the year-old backup... and then went to our next backup system, nightly snapshots. That snapshot was fine. I mounted it and recovered the data from the previous night. Mail started delivering again, and the two legitimate mails that had been unable to deliver came through and all was well.
Outages are a basic fact of life
Freelock is an open source company through and through, and yesterday's massive outage illustrates a big reason why: instead of taking an "Internet Snow Day" we could continue working entirely as usual. And while there are always improvements to be made, any business that cares about staying in business needs to consider what risks they face with their IT systems, websites, communications systems, and project management tools, and have a plan for what to do if something major fails.