Suture - Supervisor Trees for Go

2014-04-22 (Last Modified: 2014-04-29)

Supervisor trees are one of the core ingredients in Erlang's reliability and its "let it crash" philosophy. A well-structured Erlang program is broken into multiple independent pieces that communicate via messages, and when a piece crashes, the supervisor of that piece automatically restarts it.

This may not sound very impressive if you've never used it. But I have witnessed systems that I have written experience dozens of crashes per minute, yet function correctly for 99% of the users. Even as I have been writing suture, I have on occasion flipped my screen over to the console of a Go program I've written with suture and been surprised to discover that it's actually been merrily crashing away during my manual testing, but soldiering on so well I didn't even know.

(This is, of course, immediately followed by improving my logging so I do know when it happens in the future. Being crash-resistant is good, but one should not "spend" this valuable resource frivolously!)

I've been porting a system out of Erlang into Go for various other reasons, and I've missed having supervisor trees around. I decided to create them in Go. But this is one of those cases where we do not need a transliteration of the Erlang code into Go. For one thing, that's simply impossible as the two are mutually incompatible in some fundamental ways. We want an idiomatic translation of the functionality, which retains as much as possible of the original while perhaps introducing whatever new local capabilities into it make sense.

To correctly do that, step one is to deeply examine not only the what of Erlang supervision trees, but the why, and then figure out how to translate.

What Exactly are Erlang Supervisor Trees?

Erlang Processes

While I want to get to the why, we must still cover the what. Let's start at the bottom and work our way up the abstraction stack.

Erlang, like Go, can support many execution contexts running "simultaneously" (for some appropriate definition). In Erlang, they are called "processes". Recall that Erlang was developed in the late 1980s and 1990s, when the major competing concurrency construct was "threads". Threads are separate execution contexts that run in the same memory space, able to freely read and write a shared heap. Erlang called its execution context a "process" to draw an analogy to an OS process. The idea is that, like an OS process, an Erlang process cannot freely read and write the RAM of any other process. A normal Erlang process is completely described by its local memory contents, which can feasibly be just a couple hundred bytes. I do not believe this idea originates with Erlang, but it was one of the earlier practical manifestations that went beyond LaTeX papers. (Not that there's anything wrong with that!)

(Sidebar: By this definition, goroutines are more like threads. It is best practice to isolate your goroutines from each other as much as possible, but nothing in the language enforces this, or even particularly helps. But a convention is still better than nothing, and can create a relatively well-isolated core library.)

An Erlang system is built on these processes. Like Go, Erlang can fire up millions of these on a single system. Each process has an ID, the "PID", which is a first-class value in Erlang. Contrast with Go, which has no first-class value representing a goroutine.

Next, we must understand how processes relate to each other. Two processes may:

Send an arbitrary Erlang value to another process as a "message", via the target process' PID. This sending is asynchronous on both sides; the sender of a message simply lobs it out and gets on with life, and the receiver is free to choose when to receive it, including receiving messages in an order other than the one they arrived in. The Erlang runtime holds each message until the process chooses to receive it, via Erlang's pattern matching.
This is the primary communication method in Erlang. Even in cases where it may not appear a message is sent, the API is sending a message under the hood. In Erlang, direct use of the message send operator seems to be a code smell. In general you should almost always be using a gen_* of some sort, and be using the supported mechanisms for talking to gen_* processes.
A process may monitor or link to another process; for my purposes here I'm going to skim over the differences. What you get is the ability for one process to say "Let me know if that other process over there dies", either by sending a message or by killing the listening process. This sounds violent, but has its uses; for instance, if process A is managing a resource for process B, you may just want to kill them both if either dies, say when a socket closes. This makes resource management easy. I have missed this from Go, where I find myself manually wiring this relationship between two goroutines together.
A process may kill another process. I believe this is called an "asynchronous exception" in the programming language theory world. Go does not have this capability. In a postscript to this post, I've added why Go almost certainly never will.
All of the above works between nodes, which may live on different physical systems. A PID may reference a process on another connected Erlang node, and all this functionality works, including linking and monitoring.
The true key to understanding Erlang's design is to understand its pervasive focus on reliability, rather than getting caught up in the methods it uses to accomplish this. For Erlang, working on multiple independent chunks of hardware simultaneously is more than just a parlor trick. The Erlang philosophy includes the idea that it is impossible to make software reliable if it resides on only one piece of hardware. Cross-node communication in the Erlang world is more for reliability than sharing work.
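
To make the "manually wiring this relationship between two goroutines" remark concrete, here is a minimal sketch of one way it might be done. The names `runLinked`, `linkedPairExample`, and `errSocketClosed` are made up for illustration and are not part of Suture or any other library. Note also that this is cooperative: unlike an Erlang link, the partner only stops because it chooses to watch the shared context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// runLinked is a hypothetical helper. It runs a and b in their own goroutines
// and "links" them: as soon as either returns, the shared context is cancelled
// so the other can notice and shut down too. It returns whatever the first
// function to finish returned.
func runLinked(parent context.Context, a, b func(context.Context) error) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	var (
		wg    sync.WaitGroup
		once  sync.Once
		first error
	)
	for _, f := range []func(context.Context) error{a, b} {
		wg.Add(1)
		go func(f func(context.Context) error) {
			defer wg.Done()
			err := f(ctx)
			once.Do(func() { first = err }) // remember who finished first
			cancel()                        // take the partner down with us
		}(f)
	}
	wg.Wait()
	return first
}

// errSocketClosed stands in for whatever failure ends the resource owner.
var errSocketClosed = errors.New("socket closed")

// linkedPairExample links a resource owner that fails immediately with a user
// that does nothing except wait to be told to stop.
func linkedPairExample() error {
	owner := func(ctx context.Context) error { return errSocketClosed }
	user := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	return runLinked(context.Background(), owner, user)
}

func main() {
	fmt.Println("pair exited:", linkedPairExample())
}
```

The point of the sketch is how much has to be spelled out by hand: a shared cancellation signal, a wait group, and a convention that both sides actually check the signal.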

Supervision Trees

From these pieces, it's easy to see how to build the basic structure of a supervisor process. It is a process that tells the runtime it is interested in whether a target process dies, and when the runtime tells it that it has, it takes the desired action, usually restarting it (though there are some other exotic options). It should do as little else as possible, because we don't really want the supervisor itself to crash, but we still must plan for the possibility that it will, if for no other reason than memory corruption. (Again, Erlang's focus on reliability means that this possibility is not ignored. At scale, memory corruption is a real thing.)

Since Erlang doesn't have mutation, "restarting" a process takes the form of spawning a new "process" using a given function with given arguments. For instance, I have a service that runs on multiple ports, each of which provides the same service. I have a supervisor that monitors all the listening processes, and conceptually, it knows that if the port 80 process crashes, it needs to spawn a new process with provide_service_on(80), whereas if the port 81 service crashes, it needs to spawn a new process with provide_service_on(81). Erlang's OTP library wraps this all up with some nice declarative functionality, various default behaviors, and a motley handful of default bits of functionality like a basic "server" or a basic "finite state machine".
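
The idea that a restart is "call the same function with the same arguments again" can be sketched in Go with closures. This is only an illustration of the concept, not how OTP or Suture store their children; `childSpec`, `specsForPorts`, and the Go spelling `provideServiceOn` are all hypothetical names.

```go
package main

import "fmt"

// childSpec is a hypothetical stand-in for what a supervisor has to remember
// about each child: a name for logging and a function that starts a fresh copy.
type childSpec struct {
	name  string
	start func() string
}

// provideServiceOn mirrors the provide_service_on(Port) call in the text. A
// real version would listen on the port; this one only reports what it would do.
func provideServiceOn(port int) string {
	return fmt.Sprintf("serving on port %d", port)
}

// specsForPorts builds one spec per port. Each closure captures its own port,
// so "restarting" a child is just calling its start function again.
func specsForPorts(ports ...int) []childSpec {
	specs := make([]childSpec, 0, len(ports))
	for _, port := range ports {
		port := port // per-iteration copy (needed before Go 1.22)
		specs = append(specs, childSpec{
			name:  fmt.Sprintf("listener-%d", port),
			start: func() string { return provideServiceOn(port) },
		})
	}
	return specs
}

func main() {
	specs := specsForPorts(80, 81)
	// Pretend the port 81 listener crashed: restart it from its spec.
	fmt.Println(specs[1].name, "->", specs[1].start())
}
```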

If we think of the supervising process as being "over" the process it is supervising, we can create a "tree" of processes by supervising the supervisors. In practice, there isn't necessarily a lot of value in having a really deep tree, so I imagine most Erlang supervisor trees are quite bushy, but trees they indeed will be. In Erlang, you are expected to define an "application", which is some concrete bit of functionality wrapped up in a top-level supervisor that will then fire off some other supervisors which will actually implement the functionality. You can then start and stop these independently. Applications have access to additional functionality like application-specific configuration, special commands to start and stop them, and dependency graphs. So in even the simplest application you're at least two levels deep. As the top level of the tree, they are also treated specially if they die, bringing the entire OS process down by default. (Presumably so something can restart it.)

I remember the moment of shock I had when I realized that I had a node running one Erlang "application", and all I had to do to make the same process run another "application" was to run application:start(new_app). That's it.

Erlang also has developed safer restart methods; if something just sits and crashes endlessly on startups, the supervisor will stop restarting it. (This is done by setting a maximum number of crashes permitted within a certain number of seconds, and crashing the supervisor if this is exceeded.) There's some logging and crash integration. And so on. It's a very nice and tuned bit of drop-in functionality; if you're writing in Erlang and you're not writing supervisor trees, you're doing it wrong.
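
The "maximum number of crashes within a certain number of seconds" rule is essentially a sliding-window counter. The following is a rough sketch of that rule under my own assumptions, not Erlang's actual implementation; `restartLimiter` and `simulateCrashes` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// restartLimiter is a hypothetical sketch of the "at most maxRestarts within
// window" rule. It only keeps the crash times that are still inside the window.
type restartLimiter struct {
	maxRestarts int
	window      time.Duration
	crashes     []time.Time
}

// allow records a crash at time now and reports whether the child may be
// restarted. A false result means the supervisor itself should give up.
func (r *restartLimiter) allow(now time.Time) bool {
	cutoff := now.Add(-r.window)
	kept := r.crashes[:0]
	for _, t := range r.crashes {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	r.crashes = append(kept, now)
	return len(r.crashes) <= r.maxRestarts
}

// simulateCrashes feeds crashes at the given offsets (in seconds) through a
// fresh limiter and reports, for each crash, whether a restart was allowed.
func simulateCrashes(maxRestarts, windowSeconds int, crashSeconds ...int) []bool {
	lim := &restartLimiter{
		maxRestarts: maxRestarts,
		window:      time.Duration(windowSeconds) * time.Second,
	}
	start := time.Unix(0, 0)
	allowed := make([]bool, 0, len(crashSeconds))
	for _, s := range crashSeconds {
		at := start.Add(time.Duration(s) * time.Second)
		allowed = append(allowed, lim.allow(at))
	}
	return allowed
}

func main() {
	// At most 3 restarts in any 5 second window; crashes arrive once a second.
	fmt.Println(simulateCrashes(3, 5, 0, 1, 2, 3, 4))
}
```

With those numbers the first three crashes are restarted and the fourth and fifth are refused, which is the point at which an Erlang supervisor would crash itself and let its own supervisor decide what to do.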

Supervisor Trees ported out of Erlang

To port supervisor trees into Go with the maximum value, we should carefully examine exactly what they are made of, carefully examine the pieces we have with Go, and figure out how to translate them as idiomatically as possible, seeing if there's anything useful we can pick up from Go that Erlang doesn't have along the way.

Erlang is arguably structured from top to bottom to support supervisor trees safely. Supervisors are made of:

isolated processes
immutability
Identification of processes via globally-shared (including between nodes) PIDs
multiprocess concurrency
signaling & asynchronous exceptions
Safe restarting (won't just endlessly retry, bulletproof, etc)

Many of those are not strictly speaking required for supervisor trees to work, but they improve how they work, and in some cases affect the features. For instance, the set of features around "linking", and the way this enables one crashing service to take out its supervisor, in turn taking out and restarting all children, will not necessarily apply to other languages that lack these primitives (and in which for whatever reason they can't be added).

Comparing Go to Erlang

Let's compare Go to Erlang's points above:

isolated processes: Not enforced, but the conventions and community of the language at least consider this an ideal. Go doesn't enforce separation, so of course Suture can not, but I can simply tell you that a Suture-monitored process ought to be as isolated as possible.
immutability: Yeah, total wash here. This affects suture's design quite a bit.
Identification of processes: No Go equivalent. Go does not give you any sort of goroutine ID. This is fully on purpose, I'm sure, and unlikely to change. However, it turns out we can put all the information we need in the stack state using defer and this is not a problem. Supervisors' relationships to their supervisees are so stereotyped that we don't need a generalized messaging system.
multiprocess concurrency: Mostly yes, again subject to the usual caveats that they are isolated only by convention and not language enforcement.
Signaling and asynchronous exceptions: Where Erlang gives you a prepared solution, Go gives you tools to build a solution. While you can't stop a goroutine if it happens to go into an infinite loop, there's enough async communication to be able to shut down cooperating processes if you work at it. The simple implementation that stops a goroutine by sending a message on a special "stop" channel received in a service's core "select" loop is enough about 90% of the time.
safe restarting: Just a matter of writing the desired policy. Here we do adjust from Erlang a bit... since Go is not a "crash early, crash often" language, we don't actually terminate the supervisor if its clients start acting up. Suture instead uses a backoff approach, so that in the event that a service never starts up properly, it will at least not gulp CPU trying endlessly to restart.
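
The "special stop channel received in a service's core select loop" pattern mentioned above looks roughly like this. It is a minimal sketch; the `counter` type and its fields are invented for the example and are not taken from Suture.

```go
package main

import "fmt"

// counter is a hypothetical service. Its goroutine owns the running total and
// handles one request at a time.
type counter struct {
	add   chan int
	stop  chan struct{}
	done  chan struct{} // closed when Serve has returned
	total int           // written only by Serve; read only after Stop returns
}

func newCounter() *counter {
	return &counter{
		add:  make(chan int),
		stop: make(chan struct{}),
		done: make(chan struct{}),
	}
}

// Serve is the core loop. Work and the stop request are handled by the same
// select, so the goroutine only stops between requests, never mid-request.
func (c *counter) Serve() {
	defer close(c.done)
	for {
		select {
		case n := <-c.add:
			c.total += n
		case <-c.stop:
			return
		}
	}
}

// Stop asks the loop to exit and waits until it has. It must be called at
// most once, because it closes the stop channel.
func (c *counter) Stop() {
	close(c.stop)
	<-c.done
}

func main() {
	c := newCounter()
	go c.Serve()
	c.add <- 2
	c.add <- 3
	c.Stop()
	fmt.Println("final total:", c.total)
}
```

This only works because the service cooperates. A goroutine stuck in an infinite loop never returns to the select, which is exactly the limitation described above.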

But let us not forget the things that Go has that Erlang does not, that we can use:

User defined types, with methods and interfaces - I think my coworkers may be tired of me typing "My kingdom for a static type system!" into the team chat room. I am so ready to get out of the dynamic typing of Erlang. I think I understand why the Erlang type system is the way it is (though that would be another post), but I'm still tired of it.
Mutability... this does come with its own problems, but if we're going to use it, we should harness the advantages, too.

The Suture Library

So let's look at what we can do with Go, trying to pull as many advantages in as possible while staying idiomatic and using its strengths.

We have at least enough primitives to obtain the basic functionality we are looking for. We have lightweight "processes", albeit shared-state ones. And we can cobble together something enough like "linking" that we can get what we need for supervisor trees. (The rest shall have to wait for another library.)

Some of the things that Erlang implements in its supervisor trees are themselves a reaction to Erlang's design, and we do not need to carry those along. For instance, services are set up by specifying a module, function, and initial set of arguments because Erlang must create an entirely new instance of a supervised process from scratch, and the way it breaks up the functions is to deal with the difference between "initialization" and "execution", something that is often initially confusing to new users. In Go, instead of requiring that we wrap our Service in some formalized creation method, we can simply let the user create a new instance of a value that implements the Service interface. Initialization is handled by the user just like any other initialization of a value. Thus, we can simplify.

We don't need "behaviors". We have interfaces, which are certainly simpler and probably more useful. The compiler will statically verify that anything you try to use as a "Service" implements the required methods.

The fact that memory is not isolated is not something we can "fix" in a supervisor tree library. If your service crashes, you should clear out as much state as possible from your service, to try to avoid the case where corrupted state causes infinite crashing. It is possible to try to specify a heavy-handed framework for initializing the new service, and indeed I initially wrote it that way, but then I noticed it was trivial for a programmer to bypass that anyhow, by leaving Init() blank and just writing everything into Start(). In fact, it was the very first thing I did as a consumer of my own library, which is the sort of Clue a library author should not ignore. So I choose simplicity instead of bondage, and merely advise you now that you should clean up your state as much as possible on service restart.

We don't have "links" or "pids", but it is possible to factor out the idea of catching crashes by the Service, and restarting it, with logging. It is possible to implement smarter restart logic, once, in a centralized location. You simply wrap the call to the "Start" function for the service in something that catches panics, logs them, and restarts the service. (The restart logic could probably use some tuning, but the current logic is at least a start.)
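
As a rough illustration of "wrap the call in something that catches panics, logs them, and restarts the service", here is a deliberately tiny restart loop with a doubling delay. It is a sketch under my own assumptions, not Suture's actual restart logic; `runOnce` and `superviseN` are hypothetical names, and the hard cap on restarts exists only so the example terminates.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// runOnce calls serve and converts a panic into an error, so that a crash in
// the service cannot take down the goroutine that is supervising it.
func runOnce(serve func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("service panicked: %v", r)
		}
	}()
	serve()
	return nil
}

// superviseN runs serve, restarting it after each panic and doubling the delay
// between attempts. It stops when serve returns normally or when more than
// maxRestarts crashes have been seen, and returns the number of crashes.
func superviseN(serve func(), maxRestarts int, delay time.Duration,
	logf func(string, ...interface{})) int {
	crashes := 0
	for {
		err := runOnce(serve)
		if err == nil {
			return crashes // clean exit: nothing to restart
		}
		crashes++
		if crashes > maxRestarts {
			logf("%v; giving up after %d crashes", err, crashes)
			return crashes
		}
		logf("%v; restart %d in %v", err, crashes, delay)
		time.Sleep(delay)
		delay *= 2 // back off so a broken service does not gulp CPU
	}
}

func main() {
	attempts := 0
	flaky := func() {
		attempts++
		if attempts < 3 {
			panic("not ready yet")
		}
	}
	crashes := superviseN(flaky, 5, time.Millisecond, log.Printf)
	fmt.Printf("service finished after %d crashes and %d attempts\n", crashes, attempts)
}
```

Because the recover lives in one place, every service run this way gets the same logging and the same restart policy without having to know about either.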

This allows us to create:

A Service interface, which contains the following:
```
type Service interface {
    Serve()
    Stop()
}
```
Edit, January 2023: This post predates the Go context library. In v4, this has since been modified to:
```
type Service interface {
    Serve(ctx context.Context) error
}
```
I have to admit to a small bit of pleasure that the interface is that simple. Believe it or not, I went through several iterations before I got it down to that. So far, it has seemed to be enough. As a happy surprise, the minimal Suture service implementation is much smaller than a minimal Erlang gen_*, and easier to understand as well, since there's no "linking" vs. "initialization" confusion. With Go's sort-of structural typing, it also means that a package can easily provide a Suture service without depending on Suture, and can easily provide a non-Suture-dependent start function if desirable.
A Supervisor, which is a chunk of code that accepts Services and manages them. Of course, a Supervisor is itself a Service, so creating trees is just a matter of hooking up Supervisors with each other.
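
To show why "a Supervisor is itself a Service" is enough to get trees, here is a stripped-down sketch built on the original two-method interface. The `Supervisor` below is a toy written for this example: it starts and stops its children but does no restarting or logging, and none of its names should be read as Suture's real API.

```go
package main

import (
	"fmt"
	"sync"
)

// Service is the two-method interface from the text.
type Service interface {
	Serve()
	Stop()
}

// Supervisor is a hypothetical, minimal supervisor: it runs its children and
// stops them on request. Stop must be called at most once.
type Supervisor struct {
	name     string
	children []Service
	stop     chan struct{}
}

// Compile-time check that a Supervisor can be supervised like any other Service.
var _ Service = (*Supervisor)(nil)

func NewSupervisor(name string) *Supervisor {
	return &Supervisor{name: name, stop: make(chan struct{})}
}

// Add registers a child. Call it before Serve; this sketch does not support
// adding children to a running supervisor.
func (s *Supervisor) Add(child Service) { s.children = append(s.children, child) }

// Serve runs every child in its own goroutine, blocks until Stop is called,
// then stops the children and waits for them to return.
func (s *Supervisor) Serve() {
	var wg sync.WaitGroup
	for _, c := range s.children {
		wg.Add(1)
		go func(c Service) {
			defer wg.Done()
			c.Serve()
		}(c)
	}
	<-s.stop
	for _, c := range s.children {
		c.Stop()
	}
	wg.Wait()
}

func (s *Supervisor) Stop() { close(s.stop) }

// worker is a trivial leaf service that blocks until stopped.
type worker struct {
	name   string
	stop   chan struct{}
	served bool // set by Serve; read only after the tree has shut down
}

func newWorker(name string) *worker {
	return &worker{name: name, stop: make(chan struct{})}
}

func (w *worker) Serve() { w.served = true; <-w.stop }
func (w *worker) Stop()  { close(w.stop) }

// runExampleTree builds a two-level tree, runs it, shuts it down, and returns
// the leaf workers so the caller can inspect them.
func runExampleTree() []*worker {
	web, db, cache := newWorker("web"), newWorker("db"), newWorker("cache")

	storage := NewSupervisor("storage")
	storage.Add(db)
	storage.Add(cache)

	root := NewSupervisor("root")
	root.Add(web)
	root.Add(storage) // a Supervisor is added exactly like any other Service

	done := make(chan struct{})
	go func() { root.Serve(); close(done) }()
	root.Stop()
	<-done
	return []*worker{web, db, cache}
}

func main() {
	for _, w := range runExampleTree() {
		fmt.Printf("%s served: %v\n", w.name, w.served)
	}
}
```

Nothing in the root knows or cares that one of its children is another supervisor, which is the whole trick.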

While I do not create any special support for "applications", I have found it advantageous to pack up my services into top level Supervisors, just as I do in Erlang. Even in my relatively small Go team here, we've already had great fun composing services together into various executables.

On the topic of composing, it also turns out to be very powerful for services to compose in a Supervisor instance of their own, if they are in fact some sort of composite themselves. It gives potentially complicated services a simple startup API. More examples of that to come.

The Supervisor also bundles up logging of how the Supervisor is doing, logs failures from the services including stack traces, and is easily adapted to call your local logging code if you provide it a callback.

Why?

In the end, is it quite as slick as Erlang is? Frankly, no. Erlang was in some sense built around supervisor trees, or at least the set of features that provides the ability to build them, and it's hard to compete with that.

However, even in my limited experience, adapting this style into Go still carries enough benefits to be worthwhile. I feel I've had a net benefit from this library just writing and using it myself. Everything I do that can possibly be a Suture service is, and I've already witnessed it taking some 99% functional code, and making it something I can deploy for a while without it completely failing. This is good stuff.

Edit, Jan 2023: Nearly nine years later it is still the case that every non-trivial program I build uses suture. The convenience of bundling services up into coherent blocks, and also getting a restart policy and logging policy on them, is too much to give up.

Addendum: Why Go Will Never Have Asynchronous Exceptions

As mentioned above, Erlang allows you to remotely kill a target process. This is accomplished with the exit function, which throws an asynchronous exception into the target PID, including the possibility of throwing an uncatchable exception that forces termination, much like kill -9 in UNIX.

To understand why these are called "asynchronous", you have to look at the exception from the point of view of the process receiving the exception; it is not the usual sense of the term that programmers are used to. Most exceptions are synchronous, in that they either occur or don't occur at a particular point in the program; they are "synchronous" (in its original meaning of at the same time as) with the code that produced them. For instance, if your language is going to throw an exception for "file not found", it will occur when you try to open it, not at a random time. By contrast, from a thread's point of view, an "asynchronous exception" can occur at any time, and from the thread's accounting of time, it is completely unrelated to anything it is currently doing.

This is a subtle thing in the Erlang paradigm; since a process shares no state with any other process, and even most resources are held at arm's length via ports, it is feasible to asynchronously kill Erlang processes reasonably safely. The killed process will automatically drop all its values. Any resource it has open will be via a "port", which as an Erlang process, will be linked to the using process, and thus, when the using process dies, that process or port will also "die", so the killed process has a well-defined way of cleaning up its resources even when asynchronously killed. It's still not perfectly safe; some resources may leak depending on how it interacts with other threads, etc, but it is reasonably safe. In Erlang, arguably writing code that isn't safe to kill would be a bug.

Erlang gets away with this by having rigidly partitioned processes. That is, I don't think immutability enters into it; it is the rigid partitioning that accomplishes this. Languages with immutable values do have an easier time providing asynchronous exceptions, though I would observe it took Haskell several iterations to get it correct. In this case, arguably it was the laziness making it harder, but it still is not clear that a strict immutable language with shared values would have a trivial time either. However, it is a flamingly bad idea to have asynchronous exceptions in a shared-state mutable language, and despite the fact that Go uses convention to try to avoid sharing state, it is a shared-state mutable language.

It is not possible to program correctly in a mutation-based language when an "asynchronous exception" can happen at any time. In particular, you do not know what operation the thread was in the middle of that was never supposed to be observed; for instance, if the goroutine was in the middle of a critical section protected by a mutex, it is possible to clean up the mutex while the goroutine is dying, but there's no way to roll back anything the goroutine half did. There are a lot of other more subtle issues that arise, too. For instance, trying to protect a goroutine with a top-level defer doesn't prevent asynchronous exceptions from ruining your day... what if you get an asynchronous exception in the middle of the deferred function itself? Code that is safe in a world without asynchronous exceptions can end up bubbling a panic up past the top of a goroutine stack due to something out of the control of the running goroutine... in the current semantics, that's morally indistinguishable from a segfault, and your program terminates. Any attempts to get around that bring their own further problems. I went on for a couple of paragraphs here and deleted them due to being redundant. Thar be fractal fail here! If you feel like exploring the space yourself, remember to treat this as a hostile environment, like any other threading case. It's helpful to imagine a hostile adversary trying to find the worst possible time for your thread to experience an asynchronous exception, and remember: You can receive an arbitrary number of them, and the exception is itself important, it may be implementing some other guarantee... just ignoring it for any reason is itself a failure.

If easy answers are immediately leaping to your mind, bear in mind they've all been tried and they didn't work. This is a venerable problem faced by all the CLispScript languages, and it's fairly well established there's no practical solution. Even Java eventually had to pull them out, and I mention Java not necessarily as the paragon of software engineering, but as a project that demonstrably has had massive efforts poured into it, and every motivation in the world to make that functionality work for backward compatibility. If they couldn't do it, and given the fundamental nature of the problems in a mutable state language, probably nobody else can either.

Therefore, there's no point in waiting for this functionality to exist before writing a supervision tree library.