By Josh Snyder
The Netflix Cloud Data Engineering team is delighted to open source s3-flash-bootloader, our tool for performing in-place OS image upgrades on stateful cloud instances, substituting a new AMI for an old one without replacing the instance. In this post, I’ll cover some of the context that led us to develop this tooling, and discuss how it has sped up upgrades of Cassandra and Elasticsearch by an order of magnitude.
In late 2017, I was iterating on AMIs for a benchmark of Cassandra, which runs on stateful instances. Stateless instances enjoy fast iteration times because we can deploy new cloud instances to replace the old ones. By contrast, the stateful Cassandra instances were slow to re-deploy, because they needed to copy all of their data from the old instance to the new one. I wondered whether I could achieve a stateless level of agility with cloud instances running on stateful ephemeral storage. I began looking for ways to bake AMIs as usual, but to deploy them onto EC2 instances that had already launched. I asked: is there a way to overwrite an older AMI with a newer one?
In EC2, an AMI is essentially just an EBS snapshot packaged together with some additional metadata about how to launch the instance¹. When an instance launches, EC2 creates a new EBS volume from the snapshot, and this EBS volume becomes the VM’s root filesystem. For my first attempt, I tried to find ways to swap the EBS root device from a newly launched instance, replacing the root device of an older instance. This approach didn’t work: EC2 (sensibly) doesn’t allow the root EBS device to be detached from a running instance. I also couldn’t stop the instance, as stopping the instance would deprovision its ephemeral storage, requiring a lengthy restreaming of data from other members of the cluster. Speed was the goal here, so any solution which required moving terabytes of data around was unacceptable.
I then looked for ways to modify the contents of the root EBS volume, making them exactly match a newly baked AMI. Other approaches might use configuration management (e.g. Puppet, Chef, Ansible) to perform whatever software changes were needed. The problem with these tools is that they introduce drift and cannot guarantee that the machine state they produce exactly matches the AMI that was run through our build and validation pipeline.
I ended up with a simpler approach: a script which reboots the instance into an in-memory operating system (similar to a Linux LiveCD). When it starts up, the bootloader OS calls out to S3, downloads an image that is byte-for-byte identical to the EBS snapshot contained within a baked AMI, and “flashes” it over the root device. After this process, the filesystem is indistinguishable from a newly launched AMI’s root device. Reboot once more, and the machine boots into the new OS image instead of the old one.
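At its core, the flashing step is just a streaming, byte-for-byte copy of the image onto the root block device. The sketch below illustrates the idea under stated assumptions: the image is already staged as a local file (the real bootloader OS streams it from S3), the device path and block size are hypothetical, and the actual tool adds far more verification and error handling.

```python
import hashlib

def flash_image(image_path: str, device_path: str, block_size: int = 1 << 20) -> str:
    """Copy a raw OS image byte-for-byte onto a block device.

    Returns the SHA-256 of the bytes written, so the result can be
    compared against a checksum of the baked AMI's snapshot to confirm
    the root device is indistinguishable from a fresh launch.
    """
    digest = hashlib.sha256()
    with open(image_path, "rb") as src, open(device_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()
```

In production the destination would be the instance’s root device (e.g. `/dev/nvme0n1` — an assumed path), which is only safe to overwrite because the running system has first rebooted into the in-memory bootloader OS.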
When I presented the results of my benchmark to the team, I mentioned how I had been flashing AMIs to increase iteration speed, and that this may be a useful technique for other teammates to use while integration-testing their own AMIs. My colleague Joey Lynch responded “forget about using this for dev, what if we used this in prod!”
Separate from my work, Joey and our Cassandra team were evaluating ways to perform an OS upgrade across our fleet of tens of thousands of instances. Netflix was moving from Ubuntu Trusty to Ubuntu Xenial (see also Ed Hunter’s talk at LISA18). Additionally, the team wanted to roll out a new version of the datastore itself. Given our scale, we wanted to minimize the human involvement required to carry out this transition, and (of course) we wanted to achieve it with the least risk practicable.
Before we went with in-place upgrades, we stopped to consider two other classes of approach for solving the OS upgrade problem:
- using Ubuntu’s built-in release upgrade capabilities (do-release-upgrade)
- moving the data to new instances
Having used Ubuntu’s built-in upgrade capabilities on a much smaller scale previously, the idea of using it across a fleet of our size was unappealing. We would have to pull the nodes out of production traffic while they each performed the upgrade, and each node would take longer to mutate itself than simply flashing the 10GB OS image. Furthermore, there would be no guarantee that the upgraded servers would have the same configuration as a freshly booted one: each upgraded server would be its own unique snowflake.
Moving the data to new instances was attractive, because such a solution would be capable of effecting both software and hardware changes in a single step. Unfortunately, such a solution is simply more risky and resource intensive than leaving the data in place and mutating the operating system. The risk is heightened because moving data around introduces the chance of corruption, whereas in-place flashing leaves the data volume untouched. Evaluating resource intensity, an EC2 i3.8xlarge instance can store up to 7600GB and has a 10Gbit/s NIC. Even if we were to saturate the instance’s NIC, it would still take 1 hour and 41 minutes to transfer its dataset. In the end, we judged that solutions involving moving the data were desirable, but that we did not want to block the software upgrades on a separate project to enable hardware swaps.
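That transfer-time figure is simple to check with back-of-the-envelope arithmetic, assuming the full 7,600 GB dataset and a perfectly saturated 10 Gbit/s NIC:

```python
# Time to stream an i3.8xlarge's full dataset over its NIC,
# assuming the link is fully saturated the entire time.
data_gb = 7600         # dataset size, gigabytes (10^9 bytes)
nic_gbit_per_s = 10    # NIC line rate, gigabits per second

seconds = data_gb * 8 / nic_gbit_per_s   # 8 bits per byte
hours, rem = divmod(int(seconds), 3600)
minutes = rem // 60
print(f"{hours}h {minutes}m")  # → 1h 41m
```

And that is a best case: real-world replication streams rarely saturate the NIC, so in practice the move would take even longer.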
Sidenote: this isn’t unique
Upgrading a computer by flashing a new disk image atop an old one is slightly unusual in the cloud, but it’s quite typical in other fields of computing. For many types of embedded devices and network equipment, this is how all upgrades are done. Two popular operating systems that I know of, ChromeOS and CoreOS, make this their primary method of performing upgrades. (And CoreOS runs in the cloud!)
Most systems which perform upgrades by flashing have two partitions, one primary and another standby. This allows them to flash an image to the standby partition without first rebooting into an in-memory OS, but requires additional functionality from a bootloader to swap between the two partitions. The rest of Netflix had no use for such functionality, so building it would have put our team “off the paved road” of support. Such modifications would also violate our desire to maintain byte-for-byte fidelity between the upgraded instance’s root device and that of a newly-launched AMI. In fact, the only hint that indicates an in-place upgrade occurred is an EC2 instance tag we create for our infrastructure auditing systems to consume.
When a machine reboots, it unfortunately loses the OS page cache which it was using to accelerate disk reads. Our instances rely on data being hot in page cache to achieve their baseline latencies, so starting back up with an empty page cache would result in anywhere from minutes to hours of SLO-breaking latencies. To avoid this problem, we used a tool I wrote called happycache to dump the locations of cached pages before rebooting. Once the system comes back up, happycache reloads the pages which were previously cached. This substantially reduces the observable latency caused by reboots.
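The reload half of that idea can be sketched in a few lines. This is a minimal illustration, not happycache’s actual implementation or file format: it assumes a list of `(path, offset, length)` ranges recorded before the reboot, and uses `posix_fadvise` to ask the kernel to pull those pages back into the page cache.

```python
import os

def reload_cached_ranges(ranges):
    """Hint the kernel to prefetch previously-cached file ranges.

    `ranges` is an iterable of (path, offset, length) tuples saved
    before the reboot. POSIX_FADV_WILLNEED asks the kernel to read
    the given byte range back into the page cache asynchronously,
    so the datastore's reads are warm before it takes traffic.
    """
    warmed = 0
    for path, offset, length in ranges:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            warmed += 1
        finally:
            os.close(fd)
    return warmed
```

The dump half would use something like `mincore(2)` to record which pages of each data file were resident before the reboot; the details above (function name, tuple format) are assumptions for illustration.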
Conclusion: How did it go?
Once we refined our tooling, we could have nodes upgraded and serving traffic in production at 10 minute intervals. Only about 5 minutes of that time was devoted to the upgrade itself; the rest was spent gracefully removing the node from production traffic, booting the new OS, and reloading the cached pages from disk with happycache.
We completed our first upgrade over the course of a few weeks, successfully moving from Trusty to Xenial and rolling out a new distribution of Cassandra. We now use this technique for routine upgrades of both Cassandra and Elasticsearch. Having a tool like this available allowed us to build an integrated pipeline which tests new versions of our AMIs and deploys them when they pass, bringing true CI/CD to stateful services at Netflix. But to me, the greatest achievement has been that we got to tell the security team: “whatever the patch is, we can have it fully deployed in 24 hours”.
Thanks
To Joey Lynch for his work to productionize s3-flash-bootloader, and for initially pointing out the usefulness of this technique in production.
To Harshad Phadke for adapting s3-flash-bootloader for use with Elasticsearch.
¹ There are also ‘instance-store’ AMIs, which are backed by objects in S3, but they are less common.