The Meltdown vulnerability leaked into the public news a full week before patches were available for many distributions. And when patches did become available, they sometimes caused further trouble.
Our vulnerable systems
Before patches were available, we downloaded the proof-of-concept exploit code, compiled it, and tested it across the environments we work in or run in production.
Here's a quick run-down of what we found affected, and what was not:
| Environment | Vulnerable? | Notes |
|---|---|---|
| Local 16.04 workstation | Yes | Exploit ran quickly and reliably. |
| Dev server (14.04 virtual machine, 10-year-old hardware) | Yes -- but slow | Exploit ran and revealed information, but unlike on our workstations, the information dripped out a character at a time, with some errors. |
| Amazon AWS servers | Yes -- but slow | Similar to our own virtual servers: the exploit ran and revealed secrets, but slowly and with errors. Amazon had already patched the underlying hosts. |
| Google Compute Engine | No | Google's Project Zero team was one of the groups that discovered the flaw, and Google has deployed something on their infrastructure that appears to completely foil this attack: it printed only garbage characters, no actual clear text. |
| Digital Ocean | Yes | The exploit ran perfectly, and very quickly, within our Digital Ocean guests. |
We did not attempt to exploit other guests on the same hardware -- all our testing was exploiting Meltdown within a single virtual (or dedicated) host.
What happened when we patched
Most of our infrastructure uses Ubuntu LTS (Long-Term Support) releases. Ubuntu published patches for Meltdown on Tuesday January 9, the original coordinated disclosure date. We updated our older 14.04 servers to use the 16.04 HWE kernel, and deployed Ubuntu's 4.4.0-108-generic pretty much across the board, aside from some hosts that used the AWS-specific kernel. We installed these updates on Tuesday afternoon, and rebooted all our hosts that evening.
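To audit a fleet like this for stragglers, one option is to compare each host's running kernel against the first build that behaved for us. This is a hypothetical helper, not something from our actual rollout -- the version strings are ours, the function name is made up, and it assumes only coreutils:

```shell
# Hypothetical audit helper: does this kernel release predate
# 4.4.0-109, the build that finally worked for us?
# Uses only coreutils (cut, sort -V), so it runs on any of our hosts.
needs_meltdown_patch() {
    # "4.4.0-108-generic" -> "4.4.0-108"
    base="$(printf '%s\n' "$1" | cut -d- -f1-2)"
    fixed="4.4.0-109"
    # Version-sort the two strings; if ours sorts first and differs,
    # this host is still on an older (vulnerable or broken) kernel.
    oldest="$(printf '%s\n%s\n' "$base" "$fixed" | sort -V | head -n 1)"
    [ "$base" != "$fixed" ] && [ "$oldest" = "$base" ]
}

needs_meltdown_patch "$(uname -r)" \
    && echo "this host still needs the patched kernel" || true
```

Something like this could run from the configuration management system itself, flagging any host that failed to pick up the new kernel on reboot.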
For the most part, everything went very smoothly. However, we had two incidents:
- One of our continuous integration workers failed to boot into the new kernel. It was dedicated hardware in our office with no remote console available, so it stayed down and all of our overnight scheduled maintenance jobs failed. The following day's kernel release, 4.4.0-109-generic, fixed the boot problem.
- Our configuration management server (Salt) started hitting extremely high load whenever a git commit was pushed.
Meltdown is an attack that abuses the CPU's speculative, out-of-order execution, and the patches for it (kernel page-table isolation, on Linux) essentially add overhead to every transition between user space and the kernel. Most sources suggest a 5% - 30% degradation in CPU performance after patching for Meltdown -- highly dependent on workload.
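That workload dependence is easiest to see with a syscall-heavy probe, since every syscall now pays the page-table-switch toll. A crude sketch -- the byte count is arbitrary, and the absolute number only means anything compared before and after patching on the same host:

```shell
# dd with bs=1 issues a read(2) and a write(2) per byte, so this run
# is dominated by syscall entry/exit cost -- exactly what the Meltdown
# patches make more expensive. Timing uses GNU date's nanosecond format.
start=$(date +%s%N)
dd if=/dev/zero of=/dev/null bs=1 count=100000 2>/dev/null
elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))
echo "~200000 syscalls in ${elapsed_ms} ms"
```

A compute-bound loop run the same way before and after patching would barely move, which is why most of our infrastructure showed no visible slowdown.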
For the most part, we're not noticing big slowdowns, with the one exception of our Salt events.
Salt event.fire_master performance devastation
We've spent a lot of time automating our systems, and have a variety of triggers hooked up. Once we identify a useful trigger point, we often publish events into several systems, so that if we later decide to act on them, the events are already flowing. One of those is a git post-update hook -- whenever anyone pushes any git commits to our central git server, we publish an event in several different systems that any other system can subscribe to and act on.
In our SaltStack configuration management system, our bot uses "salt-call event.fire_master" to publish a system-wide Salt event. At the moment, we have a "Salt Reactor" listening for these events on a few of our repositories, but for the most part they end up entirely ignored. Yet our Salt master was ending up with a load north of 20 - 30, with a bunch of these event triggers stacked up.
When you run the event command in a shell, it normally fires and returns within a second or so. With the kernel patched for Meltdown, however, the exact same command took 2 - 3 minutes before the shell prompt reappeared -- even for repositories that had nothing subscribed to the event! Worse, our bot uses node.js to trigger these events, and in that environment it was taking more like 15 - 20 minutes before it timed out and cleaned up the process. With commits landing every minute or two, the CPU load quickly started climbing and triggering all sorts of monitoring alerts.
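One way to harden a hook like this is to stop waiting on salt-call at all: fire the event in the background with a hard cap on its runtime, so a slow master can never hold up a push. A sketch, assuming a post-update hook running in the bare repo -- the event tag scheme and the 30-second cap are invented for illustration; salt-call's event.fire_master genuinely takes a data payload and a tag:

```shell
# Sketch of a hardened git post-update hook: publish the Salt event
# asynchronously, with a hard timeout, so the push returns immediately
# even when the master is overloaded. Repo name and event tag are
# illustrative, not our real scheme.
repo="$(basename "$PWD" .git)"
if command -v salt-call >/dev/null 2>&1; then
    timeout 30 salt-call event.fire_master \
        "{\"repo\": \"$repo\"}" "git/push/$repo" \
        >/dev/null 2>&1 &
fi
```

This doesn't make the underlying slowdown go away, but it decouples push latency from master load, which is the part that was actually hurting us.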