Garbage collecting unhealthy JVMs, a proactive approach

By Josh Snyder and Joseph Lynch

Netflix’s Cloud Data Engineering team runs a variety of JVM applications, including popular datastores like Cassandra and Elasticsearch. While the majority of our clusters run stably with the memory allocated to them, on occasion a “query of death” or a bug in the datastore itself will cause runaway memory usage, which can trigger garbage collection (GC) loops or even run the JVM out of memory.

We’ve built an appreciation for jvmkill to remediate such situations: jvmkill is an agent which runs in the JVM process using the JVMTI API. When the JVM runs out of memory or is unable to spawn a thread, jvmkill steps in and kills the whole process. We combine jvmkill with the -XX:HeapDumpOnOutOfMemoryError Hotspot flag so we can come back to analyze the heap after the fact, with the goal of gaining insight on why we ran out of resources. For our applications, this situation is ideal: a JVM that is out of memory is unable to make forward progress, and once jvmkill steps in, systemd will re-start the failed process from a clean slate.

Even with jvmkill protecting us, we continue to have problems with JVMs that are almost — but not quite — out of memory. These Java processes get stuck performing GC over and over again, with almost no useful work done in between pauses. Since the JVM is not 100% out of resources, jvmkill doesn’t notice a problem. Our clients, on the other hand, quickly notice that the throughput of their datastore nodes has, typically, decreased by four orders of magnitude.

To illustrate this behavior, we can demonstrate a “query of death” against a Cassandra JVM¹ by asking Cassandra to load the entire dataset into memory a few times:

We then use jstat and GC logs to observe that the machine is indeed in a GC death spiral:

From the GC log data, we clearly see repeated 20+ second pauses, and we can use the GCViewer tool to graphically interpret the data present in the logs:

In this situation, the JVM is certainly incapable of meeting our performance goals and has little hope of recovering. This death spiral continues until our oncall engineer acts by killing the affected JVMs. After being paged one too many times, we decided that this problem:

  1. is easily identified
  2. has a straightforward solution
  3. benefits from speedy intervention

In other words, we decided that we needed to automate the coup de grâce that humans previously performed manually.

Solution: Identify and Kill Bad JVMs Proactively

Our solution, jvmquake, started with some evening musing of “how hard could this really be?” We started with the idea that, for any given workload, a JVM should spend a significant fraction of its time running program code, not pausing for GC. If the fraction of program time falls below some level for too long, the JVM is clearly unhealthy and should be killed.

We implemented this idea by modeling a JVM’s time spent paused for GC as “debt”. If a JVM spends 200ms GCing, it adds 200ms to its debt counter. Time spent running program code “pays down” any accumulated debt, stopping at zero, so if the same program then runs for ≥200ms, its debt counter falls back to zero. If the JVM spends more than a 1:1 ratio of its time running versus GCing (i.e. >50% throughput), its debt will tend towards zero. On the other hand, if it achieves less than 50% throughput, its debt will tend towards infinity. This “debt counter” approach is akin to a leaky bucket algorithm, applied to tracking program throughput. In this case we add water at a rate proportional to GC time and remove it proportional to application runtime:

As the debt counter for a JVM rises, we gain more and more confidence that it is unhealthy, and eventually we gain enough confidence to take some action. For example, a real world GC spiral with jvmquake applied might look like:

If jvmquake was attached on this JVM it would have been halted at the dashed line.

We decided on a tunable threshold with a rather lenient 30 second default: if a JVM ever finishes a GC with its debt counter above 30 seconds, jvmquake kills the process. We measured these values by hooking into the GarbageCollectionStart and GarbageCollectionFinish JVMTI callbacks.

In addition to the debt threshold, we added two more tunables:

  • runtime_weight: applies a multiplier to time spent running program code, so that we can have a throughput goal other than 1:1 (50% throughput). A runtime_weight of 2, for instance, is a goal of 1:2 (33% throughput). More generally, a runtime_weight of x implies a 1:x ratio (100%/(x+1) throughput). Server JVMs typically run at over 95% throughput, so even a 50% throughput minimum is quite conservative.
  • action: jvmkill will only ever send a SIGKILL to the process, but in jvmquake we added the ability to intentionally OOM the JVM as well as to send itself arbitrary signals prior to the SIGKILL. We’ll explain why these other actions may be desirable in the following section.

After applying jvmquake, if we run the same query of death from before against a Cassandra node, now we see:

Just like before, the JVM starts going into a death spiral of GC, but this time jvmquake notices after the JVM accumulates 30s of GC debt (at 4:1 runtime weight) and halts the JVM. Instead of the JVM limping on forever like it did before it is swiftly killed.

Don’t throw away the evidence!

When we used jvmkill or manually killed the JVM, we always had the opportunity to gather a heap dump, using -XX:HeapDumpOnOutOfMemoryError or jmap, respectively. These heap dumps are crucial for debugging the root cause of the memory leak after the fact. Unfortunately, neither of those methods work when jvmquake sends a SIGKILL to a JVM which hasn’t yet experienced an OutOfMemoryError. Our solution to this is simple: when jvmquake triggers, it activates a thread which allocates large arrays on the heap to OOM the JVM intentionally. This triggers -XX:HeapDumpOnOutOfMemoryError functionality, and finally kills the process.

This had a serious problem though: Java heap dumps are written and stored on disk, which if we had repeated automatic kills may fill up our disks. We therefore started to look at ways to get OS-native core dumps, rather than JVM specific heap dumps. We realized that if we could get an unhealthy JVM to send itself SIGABRT instead of SIGKILL, the Linux kernel will automatically write a core dump for us. We like this approach because it is standard across all language runtimes (including especially node.js and Python), and most importantly because it allows us to collect even very large core/heap dumps and write them to a pipe without provisioning extra disk space to store them.

When Linux takes a core dump, the default behavior is to write a file called “core” in the working directory of crashed process. To prevent situations where writing a core file would run a disk out of space, Linux provides resource limiting (ulimit -c) on the size of written core files. The default resource limit is zero, so the kernel writes no core file at all. But, using the kernel.core_pattern sysctl, it’s possible to specify a program to which the core dump should be piped (see “Piping core dumps to a program” in the core man page). Following this interface, we wrote a script to compress the core file and perform a streaming upload to S3, where it is stored along with metadata about the crashed program.

Once the streaming upload finishes, systemd will restart the OOMed JVM. This is a tradeoff: we synchronously upload the core file to S3 instead of having to reason about the need to store the core file locally. In practice we are able to reliably upload 16GB core dumps in less than two minutes.

Tell me what went wrong

Now that the core dump file has been captured, we can inspect it to show the culprit — was it a bad query, a hardware issue, or a configuration issue? In the majority of cases, the cause is identifiable from the classes in use and their size.

Our team has rolled out jvmquake to all of our Java datastores. So far, it has mitigated dozens of incidents — each time in mere minutes — and improved the availability of some of our most important production database clusters. Furthermore, streaming core dumps and offline conversion tooling has allowed us to debug and fix complex bugs in our Cassandra and Elasticsearch datastore offerings so that our applications get the “always available” datastores they demand. We have already contributed many of these patches back to the community, and we look forward to finding and fixing even more issues going forward.


¹Specifically Cassandra 2.1.19, with about 20GiB of data and 12GiB of heap. For this experiment we turned off the DynamicEndpointSnitch so that we could guarantee the query would route to the local replica and turned off paging to ensure that the node would hold the entire dataset in memory.