This is your brain on false sharing

Posted on Dec 5, 2023

In this post, I make the case for being aware of false sharing situations in Go. While diving into the sync.Pool internals, I stumbled upon some structure padding, a trick I had never really considered from a performance standpoint. Here, we’ll go through a succinct refresher on false sharing, a few examples with benchmarks, and try to draw some conclusions from them.
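
For reference, the padding in question looks roughly like this in the standard library’s sync/pool.go (paraphrased; the exact fields and comments vary across Go versions):

type poolLocal struct {
    poolLocalInternal

    // Prevents false sharing on widespread platforms with
    // 128 mod (cache line size) = 0.
    pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}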

Refresher

If you’re familiar with false sharing, skip to the next section. If you need more clarification on it, keep reading!

A CPU has multiple layers of caches: you may have heard of the L1, L2, and L3 caches. These caches sit very close to the cores and can be shared among them or dedicated per core - the L3 cache is generally shared, while the L1 caches are per core. They are organized in cache lines, typically 64 bytes each. These lines are the unit of transfer, meaning they cannot be updated partially: a line is either valid, meaning it is a correct representation of memory, or it is invalid, in which case the whole line is replaced.

You are probably familiar with distributed caches and know that cache invalidation is one of the hard parts. It is much the same between CPU cores: if one core writes to memory and that write invalidates cache lines held by another core, the cores have to synchronize and arbitrate to reconcile the state of their caches.

This is the cause of false sharing: a cache line contains two or more items, one of which is modified, invalidating the whole line in another core that only depends on the unmodified items. In other words, false sharing is an inefficient use of cache lines: multiple cores are forced to unnecessarily invalidate and reload data, leading to increased latency and reduced overall performance. This is an approximation, but it is accurate enough to understand the rest of the post.
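
By the way, if you are curious about the cache line size on your own machine, on Linux you can read it straight from sysfs. Here is a quick sketch (Linux-specific, and it assumes the os, strconv, and strings imports):

func cacheLineSize() (int, error) {
    // index0 is usually the L1 data cache; coherency_line_size is its line size in bytes.
    b, err := os.ReadFile("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size")
    if err != nil {
        return 0, err
    }
    return strconv.Atoi(strings.TrimSpace(string(b)))
}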

Obvious demonstration

For this first example, we’ll set the scene for false sharing. Consider the following structures:

type someStruct struct {
    X int
}

type unalignedStruct struct {
    someStruct
}

type alignedStruct struct {
    someStruct

    pad [128 - unsafe.Sizeof(someStruct{})%128]byte
}

The someStruct structure contains just an int, which, on my machine, is 8 bytes long. unalignedStruct simply embeds it without any padding, so many instances fit in a single cache line. alignedStruct adds a pad field that rounds its size up to exactly 128 bytes, so each instance occupies whole cache lines on its own: two 64-byte lines on most platforms, or a single line where cache lines are 128 bytes.
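
As an aside, if you would rather not compute the padding by hand, the golang.org/x/sys/cpu package exposes a CacheLinePad type sized for the target architecture. It appends a single cache line of padding rather than rounding the struct up to 128 bytes, so it is not strictly equivalent to alignedStruct, but it expresses the same intent. A sketch (paddedStruct is just an illustrative name):

type paddedStruct struct {
    someStruct

    _ cpu.CacheLinePad // one cache line of padding, from golang.org/x/sys/cpu
}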

With alignedStruct, that is a memory overhead of 120 bytes per structure! But in our specific case, it pays off in CPU time. Let’s move on:

func DirectUnaligned(X int) {
    nP := runtime.GOMAXPROCS(0)

    data := make([]unalignedStruct, nP)
    dataPtr := unsafe.Pointer(&data[0])

    wg := &sync.WaitGroup{}

    for n := 0; n < nP; n++ {
        wg.Add(1)

        go func(n int) {
            for i := 0; i < X; i++ {
                ptr := unsafe.Pointer(uintptr(dataPtr) + uintptr(n)*unsafe.Sizeof(unalignedStruct{}))
                data := (*unalignedStruct)(ptr)

                data.X++
            }

            wg.Done()
        }(n)
    }

    wg.Wait()
}

func DirectAligned(X int) {
    nP := runtime.GOMAXPROCS(0)

    data := make([]alignedStruct, nP)
    dataPtr := unsafe.Pointer(&data[0])

    wg := &sync.WaitGroup{}

    for n := 0; n < nP; n++ {
        wg.Add(1)

        go func(n int) {
            for i := 0; i < X; i++ {
                ptr := unsafe.Pointer(uintptr(dataPtr) + uintptr(n)*unsafe.Sizeof(alignedStruct{}))
                data := (*alignedStruct)(ptr)

                data.X++
            }

            wg.Done()
        }(n)
    }

    wg.Wait()
}

These two functions are identical and do very little useful work; the only thing that sets them apart is whether they use unalignedStruct or alignedStruct. Here’s a breakdown:

  1. nP is the number of CPUs available, as returned by runtime.GOMAXPROCS(0);
  2. nP structs are allocated in a slice - in other words, there’s one struct per CPU in a contiguous block of memory;
  3. Then, nP goroutines start, each one concurrently modifying its own struct.
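
The benchmark wrappers could look like this - a minimal sketch where b.N is handed to the functions as the iteration count X, runs are repeated with go test -bench . -count 10, and the samples are summarized with a tool like benchstat:

func BenchmarkDirectUnaligned(b *testing.B) {
    DirectUnaligned(b.N)
}

func BenchmarkDirectAligned(b *testing.B) {
    DirectAligned(b.N)
}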

Let’s look at the benchmarks (count=10):

DirectUnaligned-8    4.708n ± 9%
DirectAligned-8      2.875n ± 3%

Since data is a slice, its items are packed together in one contiguous block of memory. unalignedStruct is 8 bytes long, so a 64-byte cache line can hold up to 8 of them, and modifying one of them invalidates cache lines held by the other cores. alignedStruct, with its padding, is 128 bytes: each element spans its own cache lines, so modifying it does not invalidate the lines used by the other cores.
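
If you want to double-check those sizes, unsafe.Sizeof does the trick (a quick sketch):

fmt.Println(unsafe.Sizeof(unalignedStruct{})) // 8: just the embedded int, on a 64-bit platform
fmt.Println(unsafe.Sizeof(alignedStruct{}))   // 128: 8 bytes of data plus 120 bytes of padding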

Less obvious demonstration

But in real life, you wouldn’t do it that way, would you? I bet you would not use unsafe.Pointer with some needless pointer arithmetic - so maybe you would do something like this:

func IndirectionUnaligned(X int) {
    nP := runtime.GOMAXPROCS(0)

    data := make([]*unalignedStruct, nP)
    for n := 0; n < nP; n++ {
        data[n] = &unalignedStruct{}
    }

    wg := &sync.WaitGroup{}

    for n := 0; n < nP; n++ {
        wg.Add(1)

        go func(n int, data []*unalignedStruct) {
            for i := 0; i < X; i++ {
                data[n].X++
            }

            wg.Done()
        }(n, data)
    }

    wg.Wait()
}

func IndirectionAligned(X int) {
    nP := runtime.GOMAXPROCS(0)

    data := make([]*alignedStruct, nP)
    for n := 0; n < nP; n++ {
        data[n] = &alignedStruct{}
    }

    wg := &sync.WaitGroup{}

    for n := 0; n < nP; n++ {
        wg.Add(1)

        go func(n int, data []*alignedStruct) {
            for i := 0; i < X; i++ {
                data[n].X++
            }

            wg.Done()
        }(n, data)
    }

    wg.Wait()
}

Now, the data slice is filled with pointers that are never modified, so there’s no false sharing on the slice itself. And the structs are allocated individually: there’s no reason for them to sit next to each other anymore. Right? Let’s look at the benchmarks (count=10):

IndirectionUnaligned-8    4.432n ± 11%
IndirectionAligned-8      2.771n ± 1%

Yikes, looks exactly like the previous one. But why?

Go allocations

Let’s consider the following:

func SmallAllocations() {
    data := make([]*someStruct, 10)
    for n := 0; n < 10; n++ {
        data[n] = &someStruct{}
    }

    for i, s := range data {
        fmt.Printf("struct %d is at addr %p\n", i, s)
    }
}

We’re allocating ten small structs (8 bytes each) in a row, then printing the addresses for each. Here’s the output:

struct 0 is at addr 0x1400000e1e8
struct 1 is at addr 0x1400000e1f0
struct 2 is at addr 0x1400000e1f8
struct 3 is at addr 0x1400000e200
struct 4 is at addr 0x1400000e208
struct 5 is at addr 0x1400000e210
struct 6 is at addr 0x1400000e218
struct 7 is at addr 0x1400000e220
struct 8 is at addr 0x1400000e228
struct 9 is at addr 0x1400000e230

Surprisingly, the addresses are sequential, each exactly 8 bytes apart - in other words, all the structures are allocated right next to each other.

Go’s memory allocator groups small objects into size classes: one for 8-byte objects, one for 16 bytes, and so on, including some classes that are not powers of two. Each size class is served from spans, contiguous blocks of heap memory dedicated to that class. When a new small allocation comes in (small meaning up to 32 KiB), it is placed in a span for its size class, typically right next to the previous allocation of that size - which is pretty much what we did earlier with our slice of structs.

In our case, this is the cause of the false sharing, but in most cases it is very beneficial to performance, because it maximizes cache locality. Say I am ranging over some kind of collection to compute something - a sum, for instance: this will be very fast, because the following elements are already in my L1 cache and the rest will be prefetched before I ask for them. This assumes, of course, that the elements were allocated close together in time, which is a typical pattern for such collections. But in rare cases, this cache locality can lead to degraded performance in the form of false sharing - and now, you are aware of it.
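
To make that locality argument concrete, here is the kind of loop I have in mind - a minimal sketch, nothing more:

// Sequential access over a contiguous slice: the cache-friendly case.
// Consecutive elements share cache lines, so most loads hit the L1 cache,
// and the hardware prefetcher pulls in the next lines before we need them.
func sum(items []someStruct) int {
    total := 0
    for i := range items {
        total += items[i].X
    }
    return total
}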

Conclusion

Now that you know how you might inadvertently end up in a false sharing situation in Go, do not fall into paranoia. The cases in which structure padding has a tangible impact on performance are rare, but they do exist. Structure padding is a tradeoff between memory overhead and CPU time, so, as always, it boils down to the context-specific answer of it depends. The most important takeaway here is that knowing what happens under the hood will help you make well-informed decisions when the time for optimization comes. Now go have fun!
