Coroutines in Go

Posted on Jul 27, 2024

Approximately a year ago, rsc wrote about coroutines: more specifically, why we’d need them in Go, what the API should look like, and how they could be implemented. I was pretty skeptical about that post, but I think I’ve come around after seeing them used in the shiny new iter package.

In this post, I’ll cover the current implementation of coroutines in Go, and their use in the iter package.

What the heck are coroutines, really?

If you already know exactly what coroutines are, you can skip this section. If you are not sure, keep reading!

Let’s start with something we all already are very familiar with - function calls:

func main() {
    greeting := greet("world")
    fmt.Println(greeting)
}

func greet(name string) string {
	return "Hello, " + name + "!"
}

This straightforward snippet lays the groundwork for the intuition behind coroutines; let me explain why.

When main calls greet:

  1. it prepares the stack for the call, passing the name argument,
  2. greet executes, returning a value,
  3. main gets the control flow back, using the returned value.

Conceptually, main pauses its execution to call greet, and greet gives back control to main by exiting. Although greet would certainly get inlined by the compiler, we can nonetheless say that main and greet are running concurrently, but not in parallel: main still exists while greet is running, it is just paused, with its state saved on the goroutine’s stack.

Both functions are cooperating to exchange control flow and data. greet is a callee, or in essence, a subroutine of main, and main is the caller of greet. The caller and the callee are working together to achieve a common goal.

Now, let’s take this concept a step further. What if instead of exiting to give back control, greet could itself pause and give back control to main, and then main could resume greet again? They would need more than just one stack, as we would now have two concurrent states to keep track of. And now we have coroutines.

Coroutines are functions that run concurrently, but not in parallel. It’s like having two goroutines talking to each other over channels, each one waiting to be woken up by the other, but with much less overhead (they do not need synchronization primitives) and finer control over the control flow. Coroutines are inherently cooperative, and work together to exchange control flow and data. In a way, as we’ve seen earlier, they are a generalization of subroutines, where the callee’s lifespan is not limited to a single call: the caller and callee both live concurrently and continuously bounce control flow to each other, resuming where they paused earlier. Hence, each coroutine keeps its state in its own stack.

In other words, a coroutine is an execution lane that can pause and redirect control flow to another coroutine, which can later give control back to the first one, or to yet another one, and so on. Think of a state machine, where each state is a coroutine, and the transitions between states are exchanges of control flow.

But why do we need coroutines anyway?

At this point, you might think that coroutines sound complicated and unnecessary. And you’d be right, in a way. Coroutines are not necessary to write programs, and you can do without them. But they can make your life easier in some cases, making your code more readable, more modular, and even more efficient with negative cost abstraction, as explained by Raghav Roy in his wonderful GopherConAU 2023 talk on coroutines.

In another post, rsc brings up John McCarthy’s 1977 GOPHER example. I’ll let you dig further by yourself and just summarize the problem: you want to compare two binary trees that do not share the same structure. You want to compare their sequences of values, but not their structure, meaning that you’d want to traverse the trees concurrently and compare the elements in the same order. You totally can do this using goroutines and channels, which have a context switching cost of around 100ns. You’d have to consider preemption logic, scheduling, and all the other overheads that come with goroutines and channels. Or you could use coroutines, which have a context switching cost an order of magnitude lower, and are much more lightweight than goroutines exchanging over channels.

Another problem I’ll bring up is the case of the producer-consumer pattern. You have a producer that generates data, and a consumer that processes it. You might like the producer and the consumer to be coroutines that can pause and resume each other, without the need for synchronization primitives. Think of lexer/parser relationships, for example. A notable example is Conway’s paper on the separable transition-diagram compiler 1, where coroutines are used to modularize a compiler, bringing a 9-pass COBOL compiler down to a single pass. In this paper, he defines coroutines as:

[…] coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program. There is no bound placed by this definition on the number of inputs and outputs a coroutine may have.

The coroutine notion can greatly simplify the conception of a program when its modules do not communicate with each other synchronously.

Conway 1963 1

Now let’s say we have a general function to apply to a collection of elements. That function knows nothing about the collection; it only knows what to do with one element. And let’s say we have another function that knows how to iterate over the collection. That collection could be an array, a linked list, a binary tree, whatever. The point is that the function that knows how to iterate over the collection is separated from the function that knows what to do with each element.

That modularity model allows us to cleanly separate concerns, in that we can iterate over whatever collection type to apply whatever logic to each element. We could also iterate over nested collections, or something that is not really a conventional collection at all, such as a stream of data. How would you build this? You could use coroutines, and if you need parallelism, distribute the work onto multiple cores by sharding the collection and iterating over each shard concurrently. Or maybe you are doing this as part of a request, in which case other requests can already be processed in parallel, so using just one thread per request is fine.
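To make that separation concrete, here is a minimal sketch; the Walk function and its name are mine, for illustration only. One function knows how to traverse a particular collection (here, a slice), while the caller supplies what to do with each element.

package main

import "fmt"

// Walk knows how to traverse one kind of collection (here, a slice).
// It knows nothing about what the caller wants to do with the elements.
func Walk[V any](items []V, do func(V) bool) {
	for _, v := range items {
		if !do(v) { // the callback can stop the traversal early
			return
		}
	}
}

func main() {
	// The per-element logic lives entirely on the caller's side.
	Walk([]string{"a", "b", "c"}, func(s string) bool {
		fmt.Println(s)
		return true
	})
}

This callback shape is exactly what the iter.Seq type, shown later in this post, formalizes; coroutines come into play when you need to drive two such traversals in lockstep.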

To sum it up, coroutines are a way to write concurrent code without parallelism. The corollary is that you don’t need parallelism to write concurrent code, but Go’s goroutine model is built to support parallelism. Coroutines allow for a simpler and more efficient way to do that: there’s no need for synchronization primitives, as context switches act as synchronization points themselves. Again, the context switching cost of coroutines is an order of magnitude lower than that of goroutines, meaning you could switch ten times more often for the same cost.

What should coroutines look like?

Marlin’s 1980 doctoral thesis is regarded as the reference for conceptualizing coroutines. In section 1.1 of its introduction, he defines the fundamental characteristics of coroutines as:

(1) the values of data local to a coroutine persists between successive occasions on which control enters it (that is, between successive calls), and

(2) the execution of a coroutine is suspended as control leaves it, only to carry on where it left off when control re-enters the coroutine at some later stage.

Marlin 1980 2

On top of that, we need to add a more modern concept: yielding. Yielding is the action of returning a value while pausing and giving back control flow. This is a straightforward way of exchanging data between coroutines. When a subroutine exits, it can return a value; a yield is the coroutine equivalent, except that it can happen during a pause, so that the coroutine can resume later and yield again.


Side note on generators

This might ring a bell and you might want to draw a parallel with generators in Python. Coroutines are in fact a generalization of generators, which are a stricter form of coroutines. With Python generators, only the top-most function is allowed to pass control flow, because generators are stackless. In other words, there is just one call stack, thus only the top-most stack frame can yield.


Given the above, we can dissect and extend Marlin’s and Conway’s definitions into a few key points; a small behavioral sketch follows the list. A coroutine should be able to:

  • pause and run a coroutine,
  • exit and yield a value,
  • pause and yield a value.
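To make this contract concrete, here is a small behavioral sketch. It models pause, resume, and yield with a goroutine and a channel - which is exactly the heavyweight construction the runtime implementation described below avoids. The newCoro name and its shape are mine, not an official API.

package main

import "fmt"

// newCoro is a behavioral model of the contract above, built with a
// goroutine and a channel. This is NOT how the runtime implements
// coroutines; it only illustrates the pause/resume/yield semantics.
// (Unlike a real coroutine, f starts running eagerly here, but the
// sequence of values observed by the caller is the same.)
func newCoro[V any](f func(yield func(V))) (resume func() (V, bool)) {
	out := make(chan V)
	done := make(chan struct{})
	go func() {
		defer close(done)
		f(func(v V) {
			out <- v // "pause": blocks until the caller resumes us
		})
	}()
	return func() (V, bool) {
		select {
		case v := <-out: // the coroutine yielded a value
			return v, true
		case <-done: // f returned: the coroutine exited
			var zero V
			return zero, false
		}
	}
}

func main() {
	resume := newCoro(func(yield func(int)) {
		for i := 1; i <= 3; i++ {
			yield(i) // pause and hand i back to the resumer
		}
	})
	for v, ok := resume(); ok; v, ok = resume() {
		fmt.Println(v) // 1, 2, 3
	}
}

Keep this model in mind: the rest of the post is essentially about getting the same behavior without the channels, the scheduler round-trips, and the synchronization they imply.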

Refresher on Go’s scheduling actors

Before moving on, here is a quick refresher on Go’s scheduling actors - if you are already familiar with the subject, you can skip this section. Otherwise, keep in mind that this is a very minimal explanation of the matter, just enough to understand the rest of the post.

The Go runtime is responsible for managing the execution of goroutines, ensuring fair access to cpu time, handling system calls, and all kinds of stuff. The three main scheduling actors are:

  • G: G’s are goroutines - you know what they are: lightweight threads. In this context, Gs are simple structs that hold the state of a goroutine.
  • M: M’s are machines - they represent actual OS threads. These guys are responsible for executing goroutines.
  • P: P’s are processors - they represent logical CPU cores. They are responsible for keeping track of goroutines.

G’s, as you may know, are multiplexed onto a generally smaller number of M’s. This is what we call an N:M scheduling model. You might already have the intuition that this resembles a state machine: goroutines have a lifecycle, going through different states until they die. This is a good approximation, and the main states for goroutines are:

  • _Grunnable: the goroutine is waiting for a M to run on,
  • _Grunning: the goroutine is currently running on a M,
  • _Gwaiting: the goroutine is waiting for something to happen,
  • _Gdead: the goroutine is dead.

When a goroutine is in the _Gwaiting state, another field is used to store the reason why it is waiting.

That’s it, you won’t need to know more than this, and I’ll explain the rest as we go.

The runtime.coro structure

In Go 1.22, a new coro structure was added to Go’s runtime package. Along with it, a private implementation that exposes a couple of coroutine primitives is now available.

type coro struct {
	gp guintptr
	f  func(*coro)
}

The coro struct is very simple: it is composed of a pointer to a G struct, and a function that takes in a coro struct. This is the central part of the coroutine implementation, but there are also changes elsewhere in the runtime - we’ll cover them as they come into play later on.

gp and f in coro are linked together to run the coroutine

The gp field is a pointer to a goroutine in the waiting state, ready to execute f and exit.

The newcoro function

//go:linkname newcoro

// newcoro creates a new coro containing a
// goroutine blocked waiting to run f
// and returns that coro.
func newcoro(f func(*coro)) *coro {
	c := new(coro)
	c.f = f
	pc := getcallerpc()
	gp := getg()
	systemstack(func() {
		start := corostart
		startfv := *(**funcval)(unsafe.Pointer(&start))
		gp = newproc1(startfv, gp, pc)
	})
	gp.coroarg = c
	gp.waitreason = waitReasonCoroutine
	casgstatus(gp, _Grunnable, _Gwaiting)
	c.gp.set(gp)
	return c
}

Side note on systemstack

The systemstack call we see above is a way to run a function on a system stack, as opposed to the current goroutine’s stack. In Go, each goroutine has a dedicated call stack. This call temporarily switches to the system stack, which is the stack of the M currently running the goroutine. When creating a new goroutine, we need to switch to the M’s context to be able to do magical things and register the new goroutine in the P hosting the current M.


state after calling newcoro

The newcoro function is the entry point for creating a new coro. Given a function f func(*coro), it creates a new goroutine, puts it in the _Gwaiting state with a reason of waitReasonCoroutine, and returns the resulting coro struct. The waitReasonCoroutine reason is obviously a new addition to the runtime. The goroutine pointed to by gp is given the coro struct in its coroarg field, which is another addition. In the systemstack call, the goroutine is created with an entry function, corostart, which we cover below.

The corostart function

//go:linkname corostart

// corostart is the entry func for a new coroutine.
// It runs the coroutine user function f passed to corostart
// and then calls coroexit to remove the extra concurrency.
func corostart() {
	gp := getg()
	c := gp.coroarg
	gp.coroarg = nil

	c.f(c)
	coroexit(c)
}

state after calling corostart

The corostart function is the entry function of the coro.gp goroutine. This means that the goroutine state was changed to _Grunning and it was started - we’ll see that part later on. It gets the struct of the goroutine currently running, reads the coroarg field, resets it to nil, and runs the f function. When f returns, it calls coroexit, which we look at below.

The coroexit function

// coroexit is like coroswitch but closes the coro
// and exits the current goroutine
func coroexit(c *coro) {
	gp := getg()
	gp.coroarg = c
	gp.coroexit = true
	mcall(coroswitch_m)
}

state after calling coroexit

Here, the drawing shows the state right before the last instruction. coroexit actually just prepares the real exit call: it sets the coroarg field back and sets gp.coroexit to true - yes, that’s another addition to the runtime. The interesting stuff happens in the mcall.


Side note on mcall

The mcall function is another way to run a function on a system stack, like systemstack described earlier, but this time it is not a temporary switch from the caller’s point of view. With mcall, the goroutine gives up control flow to the M, after saving the current state of the goroutine’s execution. This means that the argument for mcall generally enters the scheduler and never returns. The goroutine gets rescheduled later on by the runtime.


The coroswitch function

Before actually looking at the coroswitch_m function, let’s look at coroexit’s cousin, coroswitch.

//go:linkname coroswitch

// coroswitch switches to the goroutine blocked on c
// and then blocks the current goroutine on c.
func coroswitch(c *coro) {
	gp := getg()
	gp.coroarg = c
	mcall(coroswitch_m)
}

coroswitch is a simple function that sets the coroarg field of the current goroutine to the coro struct passed as an argument, and then calls coroswitch_m through mcall. This is the same as coroexit, but without setting the gp.coroexit field.

The coroswitch_m function

Here’s the big part of the coroutine implementation, the coroswitch_m function. This is the function that actually switches the control flow from one coroutine to another. As explained earlier, it is executed on a system stack, which corresponds to the local OS thread’s stack (the M’s stack). I am going to split it into multiple parts to explain it more easily. Also, I am removing parts of the code that are not relevant to the explanation.

// coroswitch_m is the implementation of coroswitch
// that runs on the m stack.
func coroswitch_m(gp *g) {
	c := gp.coroarg
	gp.coroarg = nil
	exit := gp.coroexit
	gp.coroexit = false
	mp := gp.m

Here, we get back the coro struct for the goroutine and reset the coroarg field to nil. We also read the coroexit field and reset it as well. If exit is true, it means that we came from coroexit and want to kill the goroutine.

We also save the current M to mp - keep that in mind for later.

	if exit {
		gdestroy(gp)
		gp = nil
	}

And this is where we do it. By calling gdestroy(gp) and then letting it go out of scope, we effectively kill the goroutine, switching it to the _Gdead state. The gdestroy function is an addition to the runtime, by the way - it was created by decomposing goexit0 - but you might not care about this detail.

Now, what if exit was false? It would mean that we came from coroswitch, and hence that we just want to switch control flow to another coroutine.

    if exit {
		...
	} else {
		// If we can CAS ourselves directly from running to waiting, so do,
		// keeping the control transfer as lightweight as possible.
		gp.waitreason = waitReasonCoroutine
		if !gp.atomicstatus.CompareAndSwap(_Grunning, _Gwaiting) {
			// The CAS failed: use casgstatus, which will take care of
			// coordinating with the garbage collector about the state change.
			casgstatus(gp, _Grunning, _Gwaiting)
		}

		// Clear gp.m.
		setMNoWB(&gp.m, nil)
	}

In that case, we move the goroutine’s state from _Grunning to _Gwaiting, using a direct CAS when possible and falling back to casgstatus otherwise. Then, we temporarily “detach” the goroutine from the current machine by setting gp.m to nil.

Alright, we got rid of the current goroutine, either by killing it or by putting it to sleep. Now, we need to wake up the other coroutine we want to switch to.

	// The goroutine stored in c is the one to run next.
	// Swap it with ourselves.
	var gnext *g
	for {
		// Note: this is a racy load, but it will eventually
		// get the right value, and if it gets the wrong value,
		// the c.gp.cas will fail, so no harm done other than
		// a wasted loop iteration.
		// The cas will also sync c.gp's
		// memory enough that the next iteration of the racy load
		// should see the correct value.
		// We are avoiding the atomic load to keep this path
		// as lightweight as absolutely possible.
		// (The atomic load is free on x86 but not free elsewhere.)
		next := c.gp
		if next.ptr() == nil {
			throw("coroswitch on exited coro")
		}
		var self guintptr
		self.set(gp)
		if c.gp.cas(next, self) {
			gnext = next.ptr()
			break
		}
	}

In the loop above, we read the goroutine pointed to by c.gp, which is the next coroutine to run. This c is the coro struct that was passed to coroswitch or coroexit.

The c.gp field is swapped with the goroutine that initially called mcall (the one where coroexit/coroswitch happened), which saves the current goroutine in the coro. Then, gnext is set to the original value of c.gp: this is the goroutine we want to switch to. This part is a bit tricky because it bypasses the normal path for goroutine scheduling, so just trust me on this.

Then, it is time to actually run the next coroutine.

	// Start running next, without heavy scheduling machinery.
	// Set mp.curg and gnext.m and then update scheduling state
	// directly if possible.
	setGNoWB(&mp.curg, gnext)
	setMNoWB(&gnext.m, mp)
	if !gnext.atomicstatus.CompareAndSwap(_Gwaiting, _Grunning) {
		// The CAS failed: use casgstatus, which will take care of
		// coordinating with the garbage collector about the state change.
		casgstatus(gnext, _Gwaiting, _Grunnable)
		casgstatus(gnext, _Grunnable, _Grunning)
	}

	// Switch to gnext. Does not return.
	gogo(&gnext.sched)
}

Then, we force mp.curg to be gnext - making the M’s current goroutine gnext. We also set gnext.m to mp, to establish the link in the other direction. We force the state of gnext to _Grunning, and the last call, gogo(&gnext.sched), is the one that actually jumps into gnext’s saved execution context, resuming it where it left off (or at its entry function, if it has not run yet).

And that’s it! If you were able to follow along without prior knowledge of the runtime, I am happy. If you are still confused, don’t worry, here’s a quick recap.

runtime.coro recap

The runtime.coro structure represents a slot for executing a coroutine. It is created and returned by newcoro(f func(*coro)). The f argument passed to newcoro is the function that will get executed by the coroutine. You would typically save the returned coro struct on the caller’s side, while f receives it as an argument (and any closure inside f can capture it). Either side can then call coroswitch to hand control flow back and forth; when f returns, coroexit is called for it, tearing the coroutine down.

control flow switches from a G to another

How iter.Pull works with coroutines

The new iter.Pull function converts a push-style iterator into a pull-style iterator. In other words, instead of the iterator pushing values to a function, the caller can pull values from the iterator.

Push-style iterators are Seq and Seq2 functions:

// Seq is an iterator over sequences of individual values.
// When called as seq(yield), seq calls yield(v) for each value v in the sequence,
// stopping early if yield returns false.
type Seq[V any] func(yield func(V) bool)

// Seq2 is an iterator over sequences of pairs of values, most commonly key-value pairs.
// When called as seq(yield), seq calls yield(k, v) for each pair (k, v) in the sequence,
// stopping early if yield returns false.
type Seq2[K, V any] func(yield func(K, V) bool)

Let’s say you have a binary tree, and you want to iterate over it in order. You would implement a Seq function that would yield the values in order.
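As a sketch of what that could look like - the Tree type and its methods are mine, for illustration - an in-order traversal fits the Seq shape naturally:

package tree

import "iter"

// Tree is a toy binary tree used for illustration.
type Tree[V any] struct {
	Left, Right *Tree[V]
	Value       V
}

// All returns a push-style iterator over the tree's values, in order.
func (t *Tree[V]) All() iter.Seq[V] {
	return func(yield func(V) bool) {
		t.push(yield)
	}
}

// push visits the tree in order, stopping early if yield returns false.
func (t *Tree[V]) push(yield func(V) bool) bool {
	if t == nil {
		return true
	}
	return t.Left.push(yield) && yield(t.Value) && t.Right.push(yield)
}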

Now let’s say that you want to iterate over two binary trees in lockstep and compare their values. You could convert your push-style iterators into pull-style iterators using next, stop := iter.Pull(it), and then compare the values one by one by calling next on each tree, until next returns false, signaling that we reached the end of a tree.
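Below is a minimal sketch of that comparison, continuing the illustrative package above (sameValues is my name, not a standard library function):

// sameValues reports whether two sequences yield the same values,
// in the same order, regardless of the underlying structure.
func sameValues[V comparable](a, b iter.Seq[V]) bool {
	nextA, stopA := iter.Pull(a)
	defer stopA() // make sure both coroutines get cleaned up
	nextB, stopB := iter.Pull(b)
	defer stopB()
	for {
		v1, ok1 := nextA()
		v2, ok2 := nextB()
		if !ok1 || !ok2 {
			// equal only if both sequences ended at the same time
			return ok1 == ok2
		}
		if v1 != v2 {
			return false
		}
	}
}

Usage would look like sameValues(t1.All(), t2.All()): each call to next hands control to the corresponding traversal until it yields its next value, which is exactly the coroutine ping-pong we are about to dissect.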

Here is the simplified code for iter.Pull:

func Pull[V any](seq Seq[V]) (next func() (V, bool), stop func()) {
	var (
		v    V
		ok   bool
		done bool
	)
	c := newcoro(func(c *coro) {
		yield := func(v1 V) bool {
			if done {
				return false
			}
			v, ok = v1, true
			coroswitch(c)
			return !done
		}
		seq(yield)
		var v0 V
		v, ok = v0, false
		done = true
	})
	next = func() (v1 V, ok1 bool) {
		if done {
			return
		}
		coroswitch(c)
		return v, ok
	}
	stop = func() {
		if !done {
			done = true
			coroswitch(c)
		}
	}
	return next, stop
}

In Pull’s scope live a v value, an ok flag, and a done flag. Then there is a coro, a next function, and a stop function, the last two being returned.

The stop function sets the done flag and calls coroswitch with the coro to give it the control flow. The next function checks if the done flag is set, and returns early if it is. Otherwise, it calls coroswitch with the coro to hand over the control flow; the coroutine then sets v and ok and switches back, and next returns v and ok.

Now let’s get a closer look at the coro’s function:

c := newcoro(func(c *coro) {
    yield := func(v1 V) bool {
        ...
    }
    seq(yield)
    var v0 V
    v, ok = v0, false
    done = true
})

The coro’s function is a closure that has access to the outer scope’s variables, just like next and stop. It defines a yield function that is passed on to seq, which is going to execute it for each value in the sequence. When seq is done, the coro’s function sets v and ok to their zero values, and sets the done flag to true.

yield := func(v1 V) bool {
    if done {
        return false
    }
    v, ok = v1, true
    coroswitch(c)
    return !done
}

The yield function defined by the coro’s function first checks the done flag, so it can return false early and stop the iteration; seq is then expected to stop, the coro’s function returns, and coroexit is called, killing the coroutine. Otherwise, if the done flag is not set, the yield function sets v and ok to the current value passed by seq, and calls coroswitch to give the control flow back to the next function - or the stop function, for that matter.
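To see these mechanics from the outside, here is a small usage sketch. It relies on slices.Values, added in Go 1.23, which adapts a slice into an iter.Seq:

package main

import (
	"fmt"
	"iter"
	"slices"
)

func main() {
	next, stop := iter.Pull(slices.Values([]int{1, 2, 3, 4}))

	v, ok := next()    // switches to the coroutine until its first yield
	fmt.Println(v, ok) // 1 true

	stop() // sets done and resumes the coroutine one last time so it can exit

	_, ok = next()  // done is set: next returns immediately, no switch
	fmt.Println(ok) // false
}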

Conclusion

As shown in the code snippets, most of the functions related to runtime.coro have linknames, which make them linkable from other packages, even with the upcoming 1.23 changes to the linker. This means you can use them in your own code by doing the same as the iter package does.

type coro struct{}

//go:linkname newcoro runtime.newcoro
func newcoro(func(*coro)) *coro

//go:linkname coroswitch runtime.coroswitch
func coroswitch(*coro)
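To close the loop, here is a sketch of what driving these primitives directly could look like, repeating the declarations above so the snippet is complete. This is strictly for experimentation: the bodyless declarations compile only if the package also contains an (empty) assembly file, and nothing about this private API is guaranteed to stay stable.

package main

import (
	"fmt"
	_ "unsafe" // required for go:linkname
)

// Opaque mirror of the runtime's private coro type: we only ever pass
// the pointer around, never touch its fields.
type coro struct{}

//go:linkname newcoro runtime.newcoro
func newcoro(func(*coro)) *coro

//go:linkname coroswitch runtime.coroswitch
func coroswitch(*coro)

func main() {
	c := newcoro(func(c *coro) {
		for i := 0; i < 3; i++ {
			fmt.Println("coroutine:", i)
			coroswitch(c) // pause and hand control back to main
		}
		// when f returns, corostart calls coroexit, destroying this
		// goroutine and switching back to main one last time
	})
	for i := 0; i < 4; i++ {
		coroswitch(c) // resume the coroutine until its next pause (or its exit)
		fmt.Println("main: resumed", i)
	}
	// calling coroswitch(c) again here would throw "coroswitch on exited coro"
}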

And that’s it! I hope this whole post helped you understand how the current implementation of coroutines in Go works, and how they can be used in practice. Now go have fun!
