Let's dive: a tour of sync.Pool internals

Posted on Nov 25, 2023

In this post we will uncover what makes sync.Pool so damn good. You will get practical insights on how and when to use it.

Introduction

The Pool implementation in Go's sync package is a valuable asset to leverage. Deeply integrated with the runtime, it is an efficient way to lower allocations and relieve the garbage collector.
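
If you want to see that effect for yourself, here is a minimal micro-benchmark sketch (the payload type and benchmark names are invented for illustration); run it with go test -bench=. -benchmem and compare the allocs/op column:

package pool_test

import (
	"sync"
	"testing"
)

// payload stands in for any allocation-heavy object you would want to reuse.
type payload struct {
	buf [4096]byte
}

// package-level sink so the compiler cannot optimize the allocations away
var sink *payload

func BenchmarkAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = new(payload) // one heap allocation per iteration
	}
}

func BenchmarkPool(b *testing.B) {
	pool := sync.Pool{New: func() any { return new(payload) }}
	for i := 0; i < b.N; i++ {
		p := pool.Get().(*payload)
		sink = p
		pool.Put(p) // best case: the item is recycled immediately, so allocs/op trends to zero
	}
}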

General usage

Let’s start by looking at the general lifecycle of items. When creating a new Pool, you give it a New func() any, that is responsible for allocating items. Consider the following:

package main

import (
	"fmt"
	"sync"
)

type JobState int

const (
	JobStateFresh JobState = iota
	JobStateRunning
	JobStateRecycled
)

type Job struct {
	state JobState
}

func (j *Job) Run() {
	switch j.state {
	case JobStateRecycled:
		fmt.Println("this job came from the pool")
	case JobStateFresh:
		fmt.Println("this job just got allocated")
	}

	j.state = JobStateRunning
}

func main() {
	pool := &sync.Pool{
		New: func() any {
			return &Job{state: JobStateFresh}
		},
	}

	// get a job from the pool
	job := pool.Get().(*Job)

	// run it
	job.Run()

	// put it back in the pool
	job.state = JobStateRecycled
	pool.Put(job)
}

This is a contrived example; in the real world you'd use a sync.Pool for items used widely throughout your application. This will make more sense as we get into how it works.

Here, on the call to pool.Get, the pool is empty, so Get calls New and returns the newly allocated job. From the caller's side there is no way to tell whether the item is new or recycled, which is why we use the state attribute to differentiate newly allocated jobs from recycled ones - that's one way to monitor cache misses in a sync.Pool.
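
If you want to actually count those misses, a minimal sketch is to increment a counter inside New, since New only runs when Get finds nothing to recycle (the misses variable is a name invented here):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var misses atomic.Int64 // incremented every time the pool comes up empty

func main() {
	pool := &sync.Pool{
		New: func() any {
			misses.Add(1) // New only runs on a pool miss
			return new(int)
		},
	}

	pool.Get()             // miss: the pool starts empty
	x := pool.Get().(*int) // miss again
	pool.Put(x)
	pool.Get() // most likely a hit: the item we just put back is reused

	fmt.Println("misses:", misses.Load()) // most likely prints 2
}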

So we've got a job from the pool, we've run it, and now we want to put it back. We call pool.Put with the job, and it goes out of scope. The job is now safely splashing around in the pool, and you forget about it until you need it again. Or maybe it won't be there anymore - that's what we'll discuss in the next section.

Garbage Collection

Let’s say you have some kind of burst of activity, and you start allocating lots of jobs. Once the burst is over, you’ve offloaded a mountain of jobs to the pool, but there’s little chance you will need so many soon. That’s where sync.Pool starts to shine: it works hand in hand with the garbage collector to free unused jobs.

sync.Pool is composed of two inner pools: local and victim

This is a simplified view of a Pool; we will get into more detail later on. The first thing to note is that a Pool is actually composed of two distinct pools: local and victim. When calling Put, you add the item to local - and when calling Get, you first look into local, then, if it is empty, into victim, and if that's still not enough, you fall back to calling New. That's not exactly right, but it's a good approximation for now.

The local pool is referred to as the primary cache, and victim is called the victim cache.

both inner pools are used to Get an item

What comes out of this overly simplified explanation is that something has to move items from local to victim. That something is a static function called poolCleanup:

func poolCleanup() {
	// This function is called with the world stopped, at the beginning of a garbage collection.
	// It must not allocate and probably should not call any runtime functions.

	// Because the world is stopped, no pool user can be in a
	// pinned section (in effect, this has all Ps pinned).

	// Drop victim caches from all pools.
	for _, p := range oldPools {
		p.victim = nil
		p.victimSize = 0
	}

	// Move primary cache to victim cache.
	for _, p := range allPools {
		p.victim = p.local
		p.victimSize = p.localSize
		p.local = nil
		p.localSize = 0
	}

	// The pools with non-empty primary caches now have non-empty
	// victim caches and no pools have primary caches.
	oldPools, allPools = allPools, nil
}

var (
	allPoolsMu Mutex

	// allPools is the set of pools that have non-empty primary
	// caches. Protected by either 1) allPoolsMu and pinning or 2)
	// STW.
	allPools []*Pool

	// oldPools is the set of pools that may have non-empty victim
	// caches. Protected by STW.
	oldPools []*Pool
)

func init() {
	runtime_registerPoolCleanup(poolCleanup)
}

As you can see, the code for sync.Pool is well documented - but let's break it down a little. The first thing to understand here is that every instantiated Pool registers itself into the allPools static variable whenever its primary cache gets (re)allocated - we'll see exactly where later on. allPools references all pools for which local (the primary cache) is non-empty, and oldPools references all pools that may have non-empty victim caches.

On init, the poolCleanup static function is registered into the runtime. poolCleanup is called by the runtime prior to collecting garbage in a stop-the-world (STW) context, meaning nothing runs at the same time - the world is stopped.

poolCleanup manages the death of items

The poolCleanup function is very simple - it does one thing and does it well: victim is trashed, local becomes victim, and accordingly the pools from allPools are moved to oldPools, and allPools becomes empty.

In other words, items that sit in the pool for too long without being retrieved eventually fade out. On the first garbage collection cycle they go from local to victim, and if they're still not picked up, they're trashed on the second cycle.
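
We can observe this two-cycle lifetime from user code by forcing collections with runtime.GC. Here's a small sketch - the exact output is likely rather than guaranteed, since it also depends on which P the goroutine happens to run on:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	pool := &sync.Pool{New: func() any { return "fresh" }}

	pool.Put("recycled")
	runtime.GC()            // first cycle: the item moves from local to victim
	fmt.Println(pool.Get()) // most likely "recycled", served from the victim cache

	pool.Put("recycled")
	runtime.GC()
	runtime.GC()            // second cycle: the victim cache is dropped
	fmt.Println(pool.Get()) // "fresh": New had to be called again
}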

Proc pinning

So far, we know that a Pool is composed of two inner pools, local and victim. We also know they are of the exact same type, since one continuously rolls into the other, and then into the void.

type Pool struct {
	noCopy noCopy

	local     unsafe.Pointer // local fixed-size per-P pool, actual type is [P]poolLocal
	localSize uintptr        // size of the local array

	victim     unsafe.Pointer // local from previous cycle
	victimSize uintptr        // size of victims array

	// New optionally specifies a function to generate
	// a value when Get would otherwise return nil.
	// It may not be changed concurrently with calls to Get.
	New func() any
}

We will get into details about what these pools actually are, but first, as you may have read in the comments, let’s talk about that “per-P” thing.

Here, we'll need a very basic understanding of the scheduling actors - if you're not familiar with this, you should definitely look into it. There are three main actors: goroutines (G's), machines (M's), and processors (P's). These three rely on each other to get work done. Goroutines are executed by machines, which represent OS threads. These machines, in order to do some work, first need to acquire a processor, which represents a CPU core. The P's are therefore the limited resource, the limiting factor: machines are fighting over a limited number of processors to run work.

Goroutines' execution can be stopped at any time by the runtime in order to execute something else: this is preemption. Well, this is not exactly true: there are so-called "safe points" where goroutines can be stopped in a clean state, and these are the only places where preemption can happen. But you are not in control of these safe points, so it's fine to assume preemption can happen at any time. It's an interesting topic, but out of scope here.

The act of proc pinning disables preemption: the P becomes ours, and ours only, until we unpin. In other words, our goroutine can safely assume it won't be stopped, not even by the GC, and that the P won't be used by any other goroutine. Once pinned, the execution flow is uninterrupted on that P. This also means that thread-local data is accessible without locking, and that is exactly what sync.Pool leverages here. Let's have a look at the logic surrounding the pinning:

// pin pins the current goroutine to P, disables preemption and
// returns poolLocal pool for the P and the P's id.
// Caller must call runtime_procUnpin() when done with the pool.
func (p *Pool) pin() (*poolLocal, int) {
	pid := runtime_procPin()
	// In pinSlow we store to local and then to localSize, here we load in opposite order.
	// Since we've disabled preemption, GC cannot happen in between.
	// Thus here we must observe local at least as large localSize.
	// We can observe a newer/larger local, it is fine (we must observe its zero-initialized-ness).
	s := runtime_LoadAcquintptr(&p.localSize) // load-acquire
	l := p.local                              // load-consume
	if uintptr(pid) < s {
		return indexLocal(l, pid), pid
	}
	return p.pinSlow()
}

func indexLocal(l unsafe.Pointer, i int) *poolLocal {
	lp := unsafe.Pointer(uintptr(l) + uintptr(i)*unsafe.Sizeof(poolLocal{}))
	return (*poolLocal)(lp)
}

A few things to unpack here. Let's start by defining what local and localSize actually are. Until now, we've described them as some kind of inner pool, but we now need to go into more detail. local is a C-style array of poolLocal objects, and localSize is the size of that array. We will get into the poolLocal structure later on. What we need to understand here is that each poolLocal object in that local array belongs to a given P. As you can see, the call to runtime_procPin returns the id of the processor we're pinned to - this pid starts at 0 and grows upwards by increments of one. Put simply, there's P0, P1, P2 and so on, up to GOMAXPROCS, which defines the maximum number of P's. GOMAXPROCS defaults to the number of CPU cores and can change at runtime (do you see it coming?).
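
As a quick aside, here is how you would query or change that value from user code - a tiny sketch, nothing pool-specific about it:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it.
	n := runtime.GOMAXPROCS(0)
	fmt.Println("number of P's:", n)

	// Raising it means a goroutine can later be pinned to a P whose pid is
	// greater than or equal to localSize - exactly the case the slow path handles.
	runtime.GOMAXPROCS(n + 2)
}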

We'll look into the initialization of local below, but let's first consider what happens on the happy path. If the size of local is greater than our P's pid, it means we do have a matching poolLocal object. indexLocal reconstructs the pointer at the correct offset and returns the poolLocal object for our P. This mechanism lets us get our local pool without using any locks - that's a very fast path: we're just pinning and directly accessing our thread-local variable.

Before going further into the pinSlow fallback path, we need to address the load-acquire/load-consume comments. Here’s the code for LoadAcquintptr:

// this is the code for amd64

//go:nosplit
//go:noinline
func LoadAcquintptr(ptr *uintptr) uintptr {
	return *ptr
}

The nosplit and noinline compiler directives keep the function's execution predictable, and on i386/amd64 a plain load is enough: the memory model is strong enough that ordinary loads already give us the ordering we need. Other architectures such as arm64, mips or riscv have more relaxed memory ordering and need more work, such as explicit atomic loads.

# this is the code for riscv64

TEXT ·LoadAcquintptr(SB),NOSPLIT|NOFRAME,$0-16
	JMP	·Load64(SB)

TEXT ·Load64(SB),NOSPLIT|NOFRAME,$0-16
	MOV	ptr+0(FP), A0
	LRD	(A0), A0
	MOV	A0, ret+8(FP)
	RET

The essential thing to understand here is that we want to make sure we read localSize before we read local, because they are stored the other way around (local first, then localSize) on the slow path.
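
To make that pattern concrete, here is a rough user-space analogue built on sync/atomic (the runtime uses its own load-acquire/store-release helpers, and Go's atomics are sequentially consistent, which is stronger than strictly needed - treat this as an illustration of the ordering, not a faithful port): the writer stores the data first and the size last, the reader loads the size first and the data second, so a non-zero size always implies the data is visible.

package main

import (
	"fmt"
	"sync/atomic"
)

var (
	items atomic.Pointer[[]int] // the "local" analogue: stored first
	size  atomic.Int64          // the "localSize" analogue: published last
)

func publish(s []int) {
	items.Store(&s)           // store the data first...
	size.Store(int64(len(s))) // ...then publish the size (store-release)
}

func consume() []int {
	n := size.Load()  // load the size first (load-acquire)...
	p := items.Load() // ...then the data
	if n == 0 || p == nil {
		return nil
	}
	return (*p)[:int(n)] // a non-zero size implies the data behind it is visible
}

func main() {
	publish([]int{1, 2, 3})
	fmt.Println(consume()) // [1 2 3]
}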

Now let’s look at the slow path code. The pinSlow method is the fallback path to pin: we go there when our assumption about the size of local is wrong, meaning that our P does not have a matching poolLocal object. This means that either the pool is new, or that GOMAXPROCS was updated and we got some additional P’s.

func (p *Pool) pinSlow() (*poolLocal, int) {
	// Retry under the mutex.
	// Can not lock the mutex while pinned.
	runtime_procUnpin()
	allPoolsMu.Lock()
	defer allPoolsMu.Unlock()
	pid := runtime_procPin()
	// poolCleanup won't be called while we are pinned.
	s := p.localSize
	l := p.local
	if uintptr(pid) < s {
		return indexLocal(l, pid), pid
	}
	if p.local == nil {
		allPools = append(allPools, p)
	}
	// If GOMAXPROCS changes between GCs, we re-allocate the array and lose the old one.
	size := runtime.GOMAXPROCS(0)
	local := make([]poolLocal, size)
	atomic.StorePointer(&p.local, unsafe.Pointer(&local[0])) // store-release
	runtime_StoreReluintptr(&p.localSize, uintptr(size))     // store-release
	return &local[pid], pid
}

pinSlow cannot acquire the lock on the static allPools variable while pinned. Think about it: that would be a recipe for deadlocks. Since our pinned goroutine cannot be preempted, it could be trying to acquire a lock held by another goroutine that is waiting for us to release the P. So pinSlow starts by unpinning the goroutine, acquires the lock, and then re-enters a pinned context. During that time, we're not protected against preemption, so two things can happen:

  • the runtime could stop the world and run poolCleanup, but we don’t care, we’re only racing against other goroutines,
  • another goroutine could acquire the lock before us.

Once we're back to safety, in a pinned context, we need to recheck whether our situation has changed: if we now have a matching poolLocal object, we just return it. This covers the cases where another goroutine did the work for us, or where we ended up pinned to a different P that already has its poolLocal. If we're still in the previous state, then we correct it. By the way, did you notice that this is the place where our pool gets registered into the allPools variable?

	// If GOMAXPROCS changes between GCs, we re-allocate the array and lose the old one.
	size := runtime.GOMAXPROCS(0)
	local := make([]poolLocal, size)
	atomic.StorePointer(&p.local, unsafe.Pointer(&local[0])) // store-release
	runtime_StoreReluintptr(&p.localSize, uintptr(size))     // store-release
	return &local[pid], pid

The last five lines of pinSlow do the initialization work: they build a new slice of poolLocal objects, one per P, store the address of the first element into local, and then store the size of the slice into localSize (in that order, to stay consistent with the fast path).

sync.Pool leverages proc pinning to create a fast path for thread-local data consumption, and uses a clever slow path to fix the cases where synchronization is needed. The pin method hides all that complexity from the rest of the code - its job is to pin the goroutine to the P, but more importantly to register the pool into the allPools static variable and to do all the heavy lifting required to return the poolLocal associated with the pinned P.

Poolception

Let’s go deeper: we need to explore the poolLocal structure and understand how it is being manipulated by Put and Get. Here is the poolLocal structure:

// Local per-P Pool appendix.
type poolLocalInternal struct {
	private any       // Can be used only by the respective P.
	shared  poolChain // Local P can pushHead/popHead; any P can popTail.
}

type poolLocal struct {
	poolLocalInternal

	// Prevents false sharing on widespread platforms with
	// 128 mod (cache line size) = 0 .
	pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}

poolLocalInternal is in essence what we call the "pool": it is composed of a private attribute, which holds a single pool item, while the rest of the items are stored in shared, a doubly linked list used as a last-in, first-out (LIFO) queue by the owning P. poolLocal embeds poolLocalInternal and pads it to a multiple of 128 bytes.

To get a clearer picture of poolLocalInternal, let’s define attributes as:

  • private is a single item, ready to be used, for very fast access - this is an optimization;
  • shared is a local shard of the cache - these are the items we are storing for that CPU core.

Before moving on, let’s talk about the purpose of that padding in poolLocal. We make the general assumption that CPU cache lines are 128 bytes maximum, or a lower power of 2, and the goal is that the poolLocal structure fills one or more cache lines. As we’ve seen previously, p.local is a contiguous block of memory (an array) with poolLocal objects for each P next to each other. False sharing would be very detrimental to the fast path here, as it would mean that for every change a P makes to its poolLocal object, other P’s might need to synchronize cache lines containing their own poolLocal; instead, by making sure cache lines are not shared by multiple poolLocal objects, we prevent this false sharing. I wrote about false sharing in another post.

In other words, each P can access and modify its own poolLocal object without causing cache invalidation in other cores.
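
The same trick is easy to replicate for your own sharded data. A minimal sketch, using the same at-most-128-bytes assumption about cache lines:

package main

import (
	"fmt"
	"unsafe"
)

// counter is some per-shard state that gets written very frequently.
type counter struct {
	n uint64
}

// paddedCounter rounds the size up to a multiple of 128 bytes so that two
// adjacent array elements never share a cache line.
type paddedCounter struct {
	counter
	_ [128 - unsafe.Sizeof(counter{})%128]byte
}

func main() {
	var shards [8]paddedCounter              // one shard per P, laid out contiguously like p.local
	fmt.Println(unsafe.Sizeof(shards[0]))    // 128
}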

Put, Get

Now that we’ve made this clear, let’s have a look at how Put works:

// Put adds x to the pool.
func (p *Pool) Put(x any) {
	if x == nil {
		return
	}
	if race.Enabled {
		if fastrandn(4) == 0 {
			// Randomly drop x on floor.
			return
		}
		race.ReleaseMerge(poolRaceAddr(x))
		race.Disable()
	}
	l, _ := p.pin()
	if l.private == nil {
		l.private = x
	} else {
		l.shared.pushHead(x)
	}
	runtime_procUnpin()
	if race.Enabled {
		race.Enable()
	}
}

Let's ignore the race detector instrumentation; the code is pretty simple:

  1. pin, get our poolLocal;
  2. if private is empty, put the item there;
  3. else, put it in shared, LIFO style;
  4. unpin.

…and that's pretty much it. We could go into the details of the doubly linked list poolChain is made of, but that's out of scope here. All we need to know about this list is what the comment says:

Local P can pushHead/popHead; any P can popTail.

You might already have strong assumptions about how Get works by seeing Put and this comment. Let’s get to it:

// Get selects an arbitrary item from the Pool, removes it from the
// Pool, and returns it to the caller.
// Get may choose to ignore the pool and treat it as empty.
// Callers should not assume any relation between values passed to Put and
// the values returned by Get.
//
// If Get would otherwise return nil and p.New is non-nil, Get returns
// the result of calling p.New.
func (p *Pool) Get() any {
	if race.Enabled {
		race.Disable()
	}
	l, pid := p.pin()
	x := l.private
	l.private = nil
	if x == nil {
		// Try to pop the head of the local shard. We prefer
		// the head over the tail for temporal locality of
		// reuse.
		x, _ = l.shared.popHead()
		if x == nil {
			x = p.getSlow(pid)
		}
	}
	runtime_procUnpin()
	if race.Enabled {
		race.Enable()
		if x != nil {
			race.Acquire(poolRaceAddr(x))
		}
	}
	if x == nil && p.New != nil {
		x = p.New()
	}
	return x
}

Let’s still ignore the race stuff here. Get is simple too:

  1. pin, get our poolLocal;
  2. empty private, and if it was not empty, use that item;
  3. if private was empty, pop shared, LIFO style, and use that item;
  4. if shared was empty too, go down the slow path with getSlow;
  5. unpin;
  6. if even the slow path did not yield an item, call New.

There are a couple of interesting things here. The first one is the comment about popHead() - what does it mean, really? This is a design choice: we favor temporal locality of reuse, meaning we prefer to reuse the most recently returned items, which are likely still warm in the caches, and let the oldest ones fade out. The second one is the access pattern used to read private: since we are pinned, we can read and clear it unconditionally, which keeps the intent clear and avoids useless branching - neat.

There's just one last thing to cover here: the slow path. This happens when our thread-local pool is empty. By now, you can easily guess what's going to happen, right? As suggested by the earlier comment ("any P can popTail"), we're going to try to steal an item from the other P's. If that does not work, we'll try the victim pools.

func (p *Pool) getSlow(pid int) any {
	// See the comment in pin regarding ordering of the loads.
	size := runtime_LoadAcquintptr(&p.localSize) // load-acquire
	locals := p.local                            // load-consume
	// Try to steal one element from other procs.
	for i := 0; i < int(size); i++ {
		l := indexLocal(locals, (pid+i+1)%int(size))
		if x, _ := l.shared.popTail(); x != nil {
			return x
		}
	}

	// Try the victim cache. We do this after attempting to steal
	// from all primary caches because we want objects in the
	// victim cache to age out if at all possible.
	size = atomic.LoadUintptr(&p.victimSize)
	if uintptr(pid) >= size {
		return nil
	}
	locals = p.victim
	l := indexLocal(locals, pid)
	if x := l.private; x != nil {
		l.private = nil
		return x
	}
	for i := 0; i < int(size); i++ {
		l := indexLocal(locals, (pid+i)%int(size))
		if x, _ := l.shared.popTail(); x != nil {
			return x
		}
	}

	// Mark the victim cache as empty for future gets don't bother
	// with it.
	atomic.StoreUintptr(&p.victimSize, 0)

	return nil
}

By now, this code should read easily, but keep in mind it is called in a pinned context. We reuse logic from the pin method to popTail the shared attribute of all the other local pools. If that does not work, we look into the victim cache, but roughly only once per GC cycle: after an unsuccessful pass, victimSize is set to 0, so subsequent Gets skip the victim cache until the next GC cycle (concurrent attempts may be doing the same, which is fine).

That’s it, now you know everything there is to know about sync.Pool. Or do you? Definitely not.

Handling byte buffers efficiently

By now, you may feel like a sync.Pool expert. Here’s the harsh truth: you’re probably not, and I’m not either.

Let’s consider this code:

package main

import "sync"

var pool sync.Pool

func init() {
	pool = sync.Pool{
		New: func() any {
			return make([]byte, 4<<10)
		},
	}
}

func main() {
	b := pool.Get().([]byte)

	// ...do something with b
	_ = b

	pool.Put(b) // this is line 21
}

Let's forget that this exact code is useless; the idea is that we're reusing fixed-length byte buffers throughout our application. Neat, right?

❯ go run -gcflags="-m" a.go
# command-line-arguments
./a.go:7:6: can inline init.0
./a.go:9:8: can inline init.0.func1
./a.go:9:8: func literal escapes to heap
./a.go:10:15: make([]byte, 4 << 10) escapes to heap
./a.go:10:15: make([]byte, 4 << 10) escapes to heap
./a.go:21:11: b escapes to heap

Well, that is a lot better than not reusing buffers at all, but still not perfect. Escape analysis raises a red flag: although we're reusing the buffer, it escapes to the heap every time we put it back in the pool. Think about it: b is a []byte, so we hold a local copy of the slice header. Passing it to Put converts it to an interface value (any), and that boxing makes the header escape to the heap - a small allocation on every Put.

Now let’s consider this code:

package main

import "sync"

var pool sync.Pool

func init() {
	pool = sync.Pool{
		New: func() any {
			b := make([]byte, 4<<10)
			return &b
		},
	}
}

func main() {
	bPtr := pool.Get().(*[]byte)
	b := *bPtr

	// ...do something with b
	_ = b

	pool.Put(bPtr)
}

Now the pool contains pointers to byte buffers. We get a pointer, dereference it for use, but pass the original pointer back to the pool when releasing.

# command-line-arguments
./a.go:7:6: can inline init.0
./a.go:9:8: can inline init.0.func1
./a.go:10:4: moved to heap: b
./a.go:9:8: func literal escapes to heap
./a.go:10:13: make([]byte, 4 << 10) escapes to heap

Much better, right? That’s the spirit: test your assumptions.
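
Building on that last program, the pattern can be hidden behind a pair of helpers so callers never juggle the pointer themselves (getBuf and putBuf are names I'm making up here):

func getBuf() *[]byte {
	return pool.Get().(*[]byte)
}

func putBuf(b *[]byte) {
	*b = (*b)[:cap(*b)] // restore the full length of the fixed-size buffer
	pool.Put(b)
}

func main() {
	b := getBuf()

	// ...do something with *b

	putBuf(b)
}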

Conclusion

Go’s sync.Pool elegantly manages memory and can optimize performance in highly concurrent applications. It feels like a masterpiece of simplicity by design, and I hope this dive helped you understand it better. Now go have fun!