Suture - Supervisor Trees for Go

posted Apr 22, 2014
in Programming, Golang

Supervisor trees are one of the core ingredients in Erlang's reliability and let it crash philosophy. A well-structured Erlang program is broken into multiple independent pieces that communicate via messages, and when a piece crashes, the supervisor of that piece automatically restarts it.

This may not sound very impressive if you've never used it. But I have witnessed systems that I have written experience dozens of crashes per minute, yet function correctly for 99% of the users. Even as I have been writing suture, I have on occasion flipped my screen over to the console of a Go program I've written with suture and been astonished to discover that it's actually been merrily crashing away during my manual testing, but soldiering on so well I didn't even know.

(This is, of course, immediately followed by improving my logging so I do know when it happens in the future. Being crash-resistant is good, but one should not "spend" this valuable resource frivolously!)

I've been porting a system out of Erlang into Go for various other reasons, and I've missed having supervisor trees around. I decided to create them in Go. But this is one of those cases where we do not need a transliteration of the Erlang code into Go. For one thing, that's simply impossible, as the two are mutually incompatible in some fundamental ways. We want an idiomatic translation of the functionality, which retains as much as possible of the original while perhaps introducing whatever new local capabilities make sense.

To correctly do that, step one is to deeply examine not only the what of Erlang supervision trees, but the why, and then figure out how to translate.

What Exactly are Erlang Supervisor Trees?

Erlang Processes

While I want to get to the why, we must still cover the what. Let's start at the bottom and work our way up the abstraction stack.

Erlang, like Go, can support many execution contexts running "simultaneously" (for some appropriate definition). In Erlang, they are called "processes". Recall that Erlang was developed in the late 1980s-1990s, when the major competing concurrency construct was "threads". Threads are separate execution contexts that run in the same memory space, able to freely read and write a shared heap. Erlang called its execution context a "process" to draw an analogy to an OS process. The idea is that like an OS process, an Erlang process cannot freely read and write the RAM of any other process. A normal Erlang process is completely described by its local memory contents, which can feasibly be just a couple hundred bytes. I do not believe this idea originates with Erlang, but it was one of the earlier practical manifestations that went beyond LaTeX papers. (Not that there's anything wrong with that!)

(Sidebar: By this definition, goroutines are more like threads. It is best practice to isolate your goroutines from each other as much as possible, but nothing in the language enforces this, or even particularly helps. But a convention is still better than nothing, and can create a relatively well-isolated core library.)

An Erlang system is built on these processes. Like Go, Erlang can fire up millions of these on a single system. Each process has an ID, the "PID", which is a first-class value in Erlang. Contrast with Go, which has no first-class value representing a goroutine.

Next, we must understand how processes relate to each other. Two processes may:

Supervision Trees

From these pieces, it's easy to see how to build the basic structure of a supervisor process. It is a process that tells the runtime it is interested in whether a target process dies, and when the runtime tells it that it has, it takes the desired action, usually restarting it (though there are some other exotic options). It should do as little else as possible, because we don't really want the supervisor itself to crash, but we still must plan for the possibility that it will, if for no other reason than memory corruption. (Again, Erlang's focus on reliability means that this possibility is not ignored. At scale, memory corruption is a real thing.)

Since Erlang doesn't have mutation, "restarting" a process takes the form of spawning a new "process" using a given function with given arguments. For instance, I have a service that runs on multiple ports, each of which provides the same service. I have a supervisor that monitors all the listening processes, and conceptually, it knows that if the port 80 process crashes, it needs to spawn a new process with provide_service_on(80), whereas if the port 81 service crashes, it needs to spawn a new process with provide_service_on(81). Erlang's OTP library wraps this all up with some nice declarative functionality, various default behaviors, and a motley handful of default bits of functionality like a basic "server" or a basic "finite state machine".
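To make the "function plus arguments" idea concrete, here is a rough sketch of it in Go rather than Erlang. Everything in it is made up for illustration: childSpec, restart, and provideServiceOn are hypothetical names, and provideServiceOn only reports what it would do instead of opening a port. The point is only that a restart builds a fresh child from a stored recipe and carries nothing over from the one that crashed.

```go
package main

import "fmt"

// childSpec is a hypothetical record of how to (re)create one child:
// a function plus the argument to call it with.
type childSpec struct {
	start func(port int) string
	port  int
}

// provideServiceOn stands in for provide_service_on from the text.
// In this sketch it only reports what it would do.
func provideServiceOn(port int) string {
	return fmt.Sprintf("serving on port %d", port)
}

// restart "spawns" a fresh child from its stored spec. Nothing from the
// crashed child is reused; only the recipe is.
func restart(specs map[string]childSpec, id string) (string, bool) {
	spec, ok := specs[id]
	if !ok {
		return "", false
	}
	return spec.start(spec.port), true
}

func main() {
	specs := map[string]childSpec{
		"port80": {start: provideServiceOn, port: 80},
		"port81": {start: provideServiceOn, port: 81},
	}
	// Pretend the port 81 child just crashed.
	msg, _ := restart(specs, "port81")
	fmt.Println(msg)
}
```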

If we think of the supervising process as being "over" the process it is supervising, we can create a "tree" of processes by supervising the supervisors. In practice, there isn't necessarily a lot of value in having a really deep tree, so I imagine most Erlang supervisor trees are quite bushy, but trees they indeed will be. In Erlang, you are expected to define an "application", which is some concrete bit of functionality wrapped up in a top-level supervisor that will then fire off some other supervisors which will actually implement the functionality. You can then start and stop these independently. Applications have access to additional functionality like application-specific configuration, special commands to start and stop them, and dependency graphs. So in even the simplest application you're at least two levels deep. As the top level of the tree, they are also treated specially if they die, bringing the entire OS process down by default. (Presumably so something can restart it.)

I remember the moment of shock I had when I realized that I had a node running one Erlang "application", and all I had to do to make the same process run another "application" was to run application:start(new_app). That's it.

Erlang also has developed safer restart methods; if something just sits and crashes endlessly on startups, the supervisor will stop restarting it. (This is done by setting a maximum number of crashes permitted within a certain number of seconds, and crashing the supervisor if this is exceeded.) There's some logging and crash integration. And so on. It's a very nice and tuned bit of drop-in functionality; if you're writing in Erlang and you're not writing supervisor trees, you're doing it wrong.

Supervisor Trees ported out of Erlang

To port supervisor trees into Go with the maximum value, we should carefully examine exactly what they are made of, carefully examine the pieces we have with Go, and figure out how to translate them as idiomatically as possible, seeing if there's anything useful we can pick up from Go that Erlang doesn't have along the way.

Erlang is arguably structured from top to bottom to support supervisor trees safely. Supervisors are made of:

Many of those are not strictly speaking required for supervisor trees to work, but they improve how they work, and in some cases affect the features. For instance, the set of features around "linking", and the way this enables one crashing service to take out its supervisor, in turn taking out and restarting all children, will not necessarily apply to other languages that lack these primitives (and in which for whatever reason they can't be added).

Comparing Go to Erlang

Let's compare Go to Erlang's points above:

But let us not forget the things that Go has that Erlang does not, that we can use:

The Suture Library

So let's look at what we can do with Go, trying to pull as many advantages in as possible while staying idiomatic and using its strengths.

We have at least enough primitives to obtain the basic functionality we are looking for. We have lightweight "processes", albeit shared-state ones. And we can cobble together something enough like "linking" that we can get what we need for supervisor trees. (The rest shall have to wait for another library.)

Some of the things that Erlang implements in its supervisor trees are themselves a reaction to Erlang's design, and we do not need to carry those along. For instance, the requirement to describe services by specifying a module, function, and initial set of arguments exists because Erlang must create an entirely new instance of a supervised process from scratch, and the way it breaks up the functions is meant to deal with the difference between "initialization" and "execution", something that is often initially confusing to new users. In Go, instead of requiring that we wrap our Service in some formalized creation method, we can simply let the user create a new instance of a value that implements the Service interface. Initialization is handled by the user just like any other initialization of a value. Thus, we can simplify.

We don't need "behaviors". We have interfaces, which are certainly simpler and probably more useful. The compiler will statically verify that anything you try to use as a "Service" actually implements the required methods.
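As a rough illustration of what that buys us, here is a sketch of a service interface and one implementation. The method names Start and Stop, and the echoService type, are assumptions made for this example; check the library's documentation for its real interface. Note that initialization is just an ordinary constructor, and that the blank-identifier assignment makes the compiler reject the program if echoService ever stops satisfying the interface.

```go
package main

import "fmt"

// Service is a hypothetical interface for this sketch; the method names
// are assumed, not taken from the library.
type Service interface {
	Start() // runs the service until it is done
	Stop()  // asks the service to shut down
}

// echoService is a made-up service used only for illustration.
type echoService struct {
	port    int
	stopped bool
}

// newEchoService is ordinary Go initialization: no special framework
// hook is needed to build a value that will later be supervised.
func newEchoService(port int) *echoService {
	return &echoService{port: port}
}

func (e *echoService) Start() { fmt.Printf("echo service running on port %d\n", e.port) }
func (e *echoService) Stop()  { e.stopped = true }

// Compile-time check: this line fails to build if *echoService does not
// implement Service.
var _ Service = (*echoService)(nil)

// run accepts anything that satisfies Service.
func run(s Service) {
	s.Start()
	s.Stop()
}

func main() {
	run(newEchoService(8080))
}
```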

The fact that memory is not isolated is not something we can "fix" in a supervisor tree library. If your service crashes, you should clear out as much state as possible from your service, to try to avoid the case where corrupted state causes infinite crashing. It is possible to try to specify a heavy-handed framework for initializing the new service, and indeed I initially wrote it that way, but then I noticed it was trivial for a programmer to bypass that anyhow, by leaving Init() blank and just writing everything into Start(). In fact, it was the very first thing I did as a consumer of my own library, which is the sort of Clue a library author should not ignore. So I choose simplicity instead of bondage, and merely advise you now that you should clean up your state as much as possible on service restart.

We don't have "links" or "pids", but it is possible to factor out the idea of catching crashes by the Service, and restarting it, with logging. It is possible to implement smarter restart logic, once, in a centralized location. You simply wrap the call to the "Start" function for the service in something that catches panics, logs them, and restarts the service. (The restart logic could probably use some tuning, but the current logic is at least a start.)
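The core of that wrapping can be sketched in a few lines. This is a simplified illustration of the technique, not the library's code: runOnce and supervise are hypothetical names, the service is reduced to a bare func(), and the "smarter restart logic" is reduced to a plain crash limit.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// runOnce calls start and turns a panic into a logged error instead of
// letting it take down the whole program.
func runOnce(start func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("service panicked: %v\n%s", r, debug.Stack())
			err = fmt.Errorf("service panicked: %v", r)
		}
	}()
	start()
	return nil
}

// supervise reruns start every time it panics, until it returns normally
// or has crashed more than maxRestarts times. It returns the crash count.
func supervise(start func(), maxRestarts int) (crashes int) {
	for {
		if err := runOnce(start); err == nil {
			return crashes
		}
		crashes++
		if crashes > maxRestarts {
			return crashes
		}
	}
}

func main() {
	attempts := 0
	// A made-up service that fails twice before it works.
	flaky := func() {
		attempts++
		if attempts < 3 {
			panic(fmt.Sprintf("boom on attempt %d", attempts))
		}
		fmt.Println("service finished cleanly on attempt", attempts)
	}
	crashes := supervise(flaky, 5)
	fmt.Println("crashes survived:", crashes)
}
```

A real supervisor would run each service in its own goroutine and add delays between restarts; the recover-in-a-defer pattern stays the same.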

This allows us to create:

While I do not create any special support for "applications", I have found it advantageous to pack up my services into top level Supervisors, just as I do in Erlang. Even in my relatively small Go team here, we've already had great fun composing services together into various executables.

On the topic of composing, it also turns out to be very powerful for services to compose in a Supervisor instance of their own, if they are in fact some sort of composite themselves. It gives potentially complicated services a simple startup API. More examples of that to come.

The Supervisor also bundles up logging of how the Supervisor is doing, logs failures from the services including stack traces, and is easily adapted to call your local logging code if you provide it a callback.


In the end, is it quite as slick as Erlang is? Frankly, no. Erlang was in some sense built around supervisor trees, or at least the set of features that provides the ability to build them, and it's hard to compete with that.

However, even in my limited experience, adapting this style into Go still carries enough benefits to be worthwhile. I feel I've had a net benefit from this library just writing and using it myself. Everything I do that can possibly be a Suture service is, and I've already witnessed it taking some 99% functional code, and making it something I can deploy for a while without it completely failing. This is good stuff.

Plus, I will be proceeding to build on this. Stay tuned.

Addendum: Why Go Will Never Have Asynchronous Exceptions

As mentioned above, Erlang allows you to remotely kill a target process. This is accomplished with the exit function, which throws an asynchronous exception into the target PID, including the possibility of throwing an uncatchable exception that forces termination, much like kill -9 in UNIX.

To understand why these are called "asynchronous", you have to look at the exception from the point of view of the process receiving the exception; it is not the usual sense of the term that programmers are used to. Most exceptions are synchronous, in that they either occur or don't occur at a particular point in the program; they are "synchronous" (in its original meaning of at the same time as) with the code that produced them. For instance, if your language is going to throw an exception for "file not found", it will occur when you try to open it, not at a random time. By contrast, from a thread's point of view, an "asynchronous exception" can occur at any time, and from the thread's accounting of time, it is completely unrelated to anything it is currently doing.

This is a subtle thing in the Erlang paradigm; since a process shares no state with any other process, and even most resources are held at arm's length via ports, it is feasible to asynchronously kill Erlang processes reasonably safely. The killed process will automatically drop all its values. Any resource it has open will be via a "port", which as an Erlang process, will be linked to the using process, and thus, when the using process dies, that process or port will also "die", so the killed process has a well-defined way of cleaning up its resources even when asynchronously killed. It's still not perfectly safe; some resources may leak depending on how it interacts with other threads, etc, but it is reasonably safe. In Erlang, arguably writing code that isn't safe to kill would be a bug.

Erlang gets away with this by having rigidly partitioned processes. That is, I don't think immutability enters into it; it is the rigid partitioning that accomplishes this. Languages with immutable values do have an easier time providing asynchronous exceptions, though I would observe it took Haskell several iterations to get it correct. In this case, arguably it was the laziness making it harder, but it still is not clear that a strict immutable language with shared values would have a trivial time either. However, it is a flamingly bad idea to have asynchronous exceptions in a shared-state mutable language, and despite the fact that Go uses convention to try to avoid sharing state, it is a shared-state mutable language.

It is not possible to program correctly in a mutation-based language when an "asynchronous exception" can happen at any time. In particular, you do not know what operation the thread was in the middle of that was never supposed to be observed; for instance, if the goroutine was in the middle of a critical section protected by a mutex, it is possible to clean up the mutex while the goroutine is dying, but there's no way to roll back anything the goroutine half did. There are a lot of other more subtle issues that arise, too. For instance, trying to protect a goroutine with a top-level defer doesn't prevent asynchronous exceptions from ruining your day... what if you get an asynchronous exception in the middle of the deferred function itself? Code that is safe in a world without asynchronous exceptions can end up bubbling a panic up past the top of a goroutine stack due to something out of the control of the running goroutine... in the current semantics, that's morally indistinguishable from a segfault, and your program terminates. Any attempt to get around that brings its own further problems. I went on for a couple of paragraphs here and deleted them due to being redundant. Thar be fractal fail here! If you feel like exploring the space yourself, remember to treat this as a hostile environment, like any other threading case. It's helpful to imagine a hostile adversary trying to find the worst possible time for your thread to experience an asynchronous exception, and remember: You can receive an arbitrary number of them, and the exception is itself important, it may be implementing some other guarantee... just ignoring it for any reason is itself a failure.
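The mutex case can be demonstrated with a small simulation. Go has no asynchronous exceptions, so this sketch fakes one with an ordinary panic injected between two writes; the ledger type and the interrupt hook are invented for the demonstration. The deferred Unlock does release the mutex, but the half-finished update stays visible to everyone else.

```go
package main

import (
	"fmt"
	"sync"
)

// ledger is a made-up shared structure with the invariant that a+b
// never changes.
type ledger struct {
	mu   sync.Mutex
	a, b int
}

// transfer moves amount from a to b. The interrupt hook, if non-nil, runs
// between the two writes and stands in for an asynchronous exception.
func (l *ledger) transfer(amount int, interrupt func()) {
	l.mu.Lock()
	defer l.mu.Unlock() // the mutex is released even if we panic
	l.a -= amount
	if interrupt != nil {
		interrupt()
	}
	l.b += amount
}

// total reads both fields under the lock.
func (l *ledger) total() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.a + l.b
}

func main() {
	l := &ledger{a: 100, b: 0}
	func() {
		defer func() { recover() }() // swallow the simulated exception
		l.transfer(30, func() { panic("simulated asynchronous exception") })
	}()
	// The lock was cleaned up, so this call does not deadlock, but the
	// invariant is gone: 30 units have vanished.
	fmt.Println("total after interrupted transfer:", l.total()) // prints 70
}
```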

If easy answers are immediately leaping to your mind, bear in mind they've all been tried and they didn't work. This is a venerable problem faced by all the CLispScript languages, and it's fairly well established there's no practical solution. Even Java eventually had to pull them out, and I mention Java not necessarily as the paragon of software engineering, but as a project that demonstrably has had massive efforts poured into it, and every motivation in the world to make that functionality work for backward compatibility. If they couldn't do it, and given the fundamental nature of the problems in a mutable state language, probably nobody else can either.

Therefore, there's no point in waiting for this functionality to exist before writing a supervision tree library.

