Article Summary: Understanding Real-World Concurrency Bugs In Go

I came across an interesting research article by Tengfei Tu, Xiaoyu Liu, Linkai Song and Yiying Zhang that explores real-world concurrency bugs in Go. They study six open-source applications, all written in Go and available on GitHub: Docker, Kubernetes, etcd, gRPC, CockroachDB and BoltDB.

They first give a brief overview of goroutine synchronization techniques, namely shared memory (Mutex, RWMutex, Once, Cond, WaitGroup) and message passing (channels).

Go recommends using message passing (“Share memory by communicating, don’t communicate by sharing memory”). Interestingly, they also compare the number of goroutines created in gRPC with the number of threads created in gRPC-C, along with their lifetimes, using three simple client/server applications. The result is that more goroutines are created in gRPC than threads in gRPC-C, but the goroutines have a much shorter lifetime: the threads created by gRPC-C live until the end of the application.

They then study Go concurrency usage patterns. Most synchronization is done through shared memory (from 60% in Docker to 80% in Kubernetes), and the proportion of shared memory to message passing usage is stable over time.

To build an inventory of bugs, they go through git commit messages and look for commits that fix a concurrency bug, using keywords such as deadlock, race condition, etc. You can find these commit logs in this GitHub repository. From the 3211 distinct commits they collect, they extract 171 concurrency bugs that they study more thoroughly.

They classify bugs by behavior (blocking or non-blocking) and by cause (shared memory or message passing), which gives four categories (blocking/shared memory, blocking/message passing, non-blocking/shared memory, non-blocking/message passing). There is no noticeable difference in bug lifetime between shared memory and message passing bugs.

Their findings run counter to Go’s recommendation: in the applications they studied, message passing turns out to be more error-prone than shared memory.

Blocking Bugs

Out of the 171 bugs studied, 85 are blocking bugs.

Misuse of Shared Memory

Among the shared memory bugs, some are caused by traditional concurrency features and others by Go-specific concurrency features:

  • Misuse of locks in Mutex (28 blocking bugs): double locking, conflicting orders of acquisition of locks, missing unlock.
  • Misuse of RWMutex: an RWMutex is a reader/writer mutual exclusion lock. It can be held by any number of readers or by a single writer. One common misuse is to take the read lock twice in one goroutine while another goroutine requests the write lock in between. The Go documentation warns that read locks should not be used for recursive read locking, as it can cause exactly this kind of bug. The misuse is probably encouraged by the fact that RWMutex does not behave like its C equivalent, pthread_rwlock_t: in Go a pending write request takes priority over later read requests, whereas pthread_rwlock_t gives priority to reads by default.
  • Misuse of Wait. Among the three studied bugs that fall into this category, two are due to a Cond.Wait() for which no Cond.Signal() is ever called. The last one, in Docker, is due to a misuse of WaitGroup:
    // A blocking bug caused by WaitGroup
    var group sync.WaitGroup
    group.Add(len(pm.plugins))
    for _, p := range pm.plugins {
        go func(p *plugin) {
            defer group.Done()
        }(p)
        // Wait is called inside the loop: it blocks on the first iteration,
        // before the goroutines for the remaining plugins are started.
        group.Wait()
    }
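
For the WaitGroup case, the repair is to call Wait once, after the loop. Here is a minimal, self-contained sketch of that shape. The plugin type, the restoreAll name and the counter are placeholders of mine, not Docker's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// plugin is a hypothetical stand-in for the real plugin type.
type plugin struct{ name string }

// restoreAll starts one goroutine per plugin and returns only after all
// of them have called Done. It returns how many plugins were handled.
func restoreAll(plugins []*plugin) int {
	var (
		group   sync.WaitGroup
		mu      sync.Mutex
		handled int
	)
	group.Add(len(plugins))
	for _, p := range plugins {
		go func(p *plugin) {
			defer group.Done()
			mu.Lock()
			handled++ // placeholder for the real per-plugin work
			mu.Unlock()
		}(p)
	}
	// Wait once, after every goroutine has been started.
	group.Wait()
	return handled
}

func main() {
	plugins := []*plugin{{"a"}, {"b"}, {"c"}}
	fmt.Println(restoreAll(plugins))
}
```

With Wait outside the loop, every goroutine gets a chance to run and call Done, so the counter set by Add can reach zero.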

Misuse of Message Passing

  • Channel: misuse of channels accounts for 29 blocking bugs. Many are due to a missing send, a missing receive or a missing close, which leaves the goroutine at the other end of the channel waiting forever.
    // A blocking bug caused by channel.
    func finishReq(timeout time.Duration) ob {
        ch := make(chan ob) // unbuffered: a send needs a ready receiver
        go func() {
            result := fn()
            ch <- result // blocks forever if the timeout fired first
        }()
        select {
            case result := <-ch:
                return result
            case <-time.After(timeout):
                return nil
        }
    }
    Sometimes, the misuse of a special Go library can also lead to bugs. The following one involves the context library, and I find it interesting:
    // A blocking bug caused by context
    hctx, hcancel := context.WithCancel(ctx)
    if timeout > 0 {
        hctx, hcancel = context.WithTimeout(ctx, timeout)
    }
    According to the paper, the call to context.WithCancel() creates a context object together with a goroutine attached to it, which can only be reached through a channel inside that object. If a timeout is defined, hctx and hcancel are overwritten by the call to context.WithTimeout(). The first context then becomes unreachable, and there is no longer any way to send to its channel or close it.
  • Channel and other blocking primitives (16 bugs): one goroutine is blocked at a channel operation, and another is blocked on a lock or a wait. For instance:
    // Blocking bug caused by wrong usage of channel with lock.
    func goroutine1() {
        m.Lock()
        ch <- request // blocks: no receiver is ready, and the lock stays held
        m.Unlock()
    }
    
    func goroutine2() {
        for {
            m.Lock() // blocks: goroutine1 still holds the lock
            m.Unlock()
            request := <-ch // never reached, so goroutine1 never unblocks
        }
    }
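
For the finishReq example above, a common repair is to give the channel a buffer of one, so that the child goroutine can always deposit its result and exit, even after the parent has given up. The sketch below is a self-contained variant under my own assumptions: the result type is a string, fn is passed in as a parameter, and demo is a helper I added to exercise both branches.

```go
package main

import (
	"fmt"
	"time"
)

// finishReq runs fn in its own goroutine and returns its result,
// or "" if fn does not finish before the timeout.
func finishReq(fn func() string, timeout time.Duration) string {
	// Capacity 1 lets the goroutine send its result and exit
	// even when nobody is receiving any more.
	ch := make(chan string, 1)
	go func() {
		ch <- fn()
	}()
	select {
	case result := <-ch:
		return result
	case <-time.After(timeout):
		return ""
	}
}

// demo exercises both branches: a fast fn, and one that outlives the timeout.
func demo() (fast, slow string) {
	fast = finishReq(func() string { return "done" }, time.Second)
	slow = finishReq(func() string {
		time.Sleep(50 * time.Millisecond)
		return "late"
	}, time.Millisecond)
	return fast, slow
}

func main() {
	fast, slow := demo()
	fmt.Printf("%q %q\n", fast, slow)
}
```

In the slow case the late result simply sits in the buffer and is garbage collected with the channel, instead of pinning a goroutine forever.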

After reviewing these bugs, the authors look at how they were fixed. They distinguish four patterns:

  • add missing synchronization operations.
  • move misplaced synchronization operations.
  • change misused synchronization operations.
  • remove extra synchronization operations.
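
To make this concrete, here is one way the channel-and-lock deadlock shown earlier could be repaired: the blocking send is replaced by a select with a default branch, so the goroutine holding the lock can never get stuck on the channel. This is only a sketch. The queue type and the trySend name are mine, and dropping the request when nobody is ready to receive is a change in behavior that is acceptable only if the caller can tolerate it.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical type that groups the lock and the channel.
type queue struct {
	mu sync.Mutex
	ch chan int
}

// trySend holds the lock but never blocks on the channel: if no receiver
// or buffer slot is ready, the request is dropped and false is returned.
func (q *queue) trySend(request int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	select {
	case q.ch <- request:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := &queue{ch: make(chan int)}
	fmt.Println(unbuffered.trySend(1)) // no receiver is waiting

	buffered := &queue{ch: make(chan int, 1)}
	fmt.Println(buffered.trySend(1)) // the buffer has room
	fmt.Println(buffered.trySend(2)) // the buffer is full
}
```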

Most fixes fall into these categories, which suggests that it should be possible to develop fully automated or semi-automated tools to detect and fix these blocking bugs. A discussion follows on why Go's built-in deadlock detector is not good at spotting them. The detector is designed for minimal runtime overhead: it does not consider the system blocked as long as some goroutines are still running, and it does not take into account goroutines that wait for other system resources.
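
The first limitation is easy to observe. In the sketch below (leak and leakedGoroutines are names I made up), goroutines block forever on a channel nobody sends on, yet the program keeps running and exits normally because the main goroutine is never blocked. The runtime only reports a deadlock when every goroutine is asleep.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leak starts a goroutine that blocks forever on a channel
// that no other goroutine ever sends on or closes.
func leak() {
	ch := make(chan int)
	go func() {
		<-ch // blocks forever
	}()
}

// leakedGoroutines calls leak n times and reports how many
// additional goroutines exist afterwards.
func leakedGoroutines(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		leak()
	}
	// Give the new goroutines time to start and block.
	time.Sleep(10 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	// The blocked goroutines are still there, but no deadlock is reported.
	fmt.Println("leaked goroutines:", leakedGoroutines(3))
}
```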

Non-Blocking Bugs

The study then focuses on non-blocking bugs, of which they have studied 86. Here again, there are two root causes:

  • Failing to protect shared memory (69 bugs)
  • Errors during message passing (17 bugs)

Misuse of Shared Memory

Failing to protect shared memory is the main cause of non-blocking bugs.

  • 46 traditional bugs (bugs that can also be found in C or Java) such as atomicity violations, order violations and data races.
  • Anonymous functions: an anonymous function can access all the local variables defined before it, so a goroutine started with one shares those variables with its parent, which can lead to a data race as in the following example:
    // data race caused by anonymous function
    for i := 17; i <= 21; i++ { // write
        go func() {
            apiVersion := fmt.Sprintf("v1.%d",i) // read
        }()
    }
  • Misuse of WaitGroup (6 bugs): Add has to be invoked before Wait. In the example below, nothing guarantees that it is:
    // Non-blocking bug caused by misuse of WaitGroup.
    func (p *peer) send() {
        p.mu.Lock()
        defer p.mu.Unlock()
        switch p.status {
            case idle:
                go func() {
                    p.wg.Add(1) // no guarantee it happens before the
                    ...         // call to Wait in the stop function
                    p.wg.Done()
                }()
            case stopped:
        }
    }
    
    func (p *peer) stop() {
        p.mu.Lock()
        p.status = stopped
        p.mu.Unlock()
        p.wg.Wait()
    }
  • Special libraries: some libraries use objects that are implicitly shared by multiple goroutines. A context object, for instance, can be accessed by several goroutines, and a data race occurs if they access the same field of that object without synchronization.
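
For the anonymous function example, the usual repair is to pass the loop variable to the goroutine as an argument, so that each goroutine works on its own copy. Below is a self-contained sketch. The apiVersions function and the result slice are my additions, there only to make the outcome observable. Note that since Go 1.22 each loop iteration gets its own copy of the loop variable, so the original snippet no longer races on recent toolchains, but passing the value explicitly works on every version.

```go
package main

import (
	"fmt"
	"sync"
)

// apiVersions builds "v1.<i>" for each i in [from, to],
// using one goroutine per value.
func apiVersions(from, to int) []string {
	versions := make([]string, to-from+1)
	var wg sync.WaitGroup
	for i := from; i <= to; i++ {
		wg.Add(1)
		// i is copied into the parameter, so the goroutine
		// no longer reads the variable the loop writes to.
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes to its own slice element.
			versions[i-from] = fmt.Sprintf("v1.%d", i)
		}(i)
	}
	wg.Wait()
	return versions
}

func main() {
	fmt.Println(apiVersions(17, 21))
}
```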

The Go-specific bugs account for one-third of non-blocking bugs caused by shared memory misuse.
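
Of these Go-specific bugs, the WaitGroup one has a particularly small fix: call Add before starting the goroutine, while the lock is still held. The sketch below fills in the elided parts of the peer example with placeholders of my own (the peerStatus type, the sent counter, and stop returning that counter), so treat it as an illustration rather than the original code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type peerStatus int

const (
	idle peerStatus = iota
	stopped
)

// peer is a hypothetical reconstruction of the type used in the example.
type peer struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	status peerStatus
	sent   atomic.Int64 // placeholder for the real work (needs Go 1.19+)
}

func (p *peer) send() {
	p.mu.Lock()
	defer p.mu.Unlock()
	switch p.status {
	case idle:
		// Add runs under the lock and before the goroutine starts,
		// so it is ordered before any Wait issued by stop.
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			p.sent.Add(1)
		}()
	case stopped:
	}
}

// stop marks the peer as stopped, waits for all pending sends
// and returns how many sends were performed.
func (p *peer) stop() int64 {
	p.mu.Lock()
	p.status = stopped
	p.mu.Unlock()
	p.wg.Wait()
	return p.sent.Load()
}

func main() {
	p := &peer{}
	for i := 0; i < 5; i++ {
		p.send()
	}
	fmt.Println(p.stop())
}
```

Because both Add and the status change happen under the same mutex, a send either registers itself before stop takes the lock or sees the stopped status and does nothing.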

Misuse of Message Passing

Errors during message passing represent 20% of non-blocking bugs. They fall into two categories:

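  • Misuse of channel (16 bugs): for instance, closing a channel twice, which makes the program panic. In the example below, two goroutines can both take the default branch and both call close: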
    // Example of bug caused by closing channel twice (found in Docker)
    select {
        case <- c.closed:
        default:
            close(c.closed)
    }
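  • Special libraries: the use of channels inside a special library can also lead to non-blocking bugs. In the example below, time.NewTimer(0) fires immediately, so when dur is not greater than 0 the select returns right away instead of waiting for ctx.Done():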
    // A non-blocking bug caused by Timer.
    timer := time.NewTimer(0)
    if dur > 0 {
        timer = time.NewTimer(dur)
    }
    select {
        case <- timer.C:
        case <- ctx.Done():
            return nil
    }
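
For the double close, sync.Once is a natural fit, since it guarantees that the close runs exactly once no matter how many goroutines race to it. A minimal sketch, with a conn type and method names of my own choosing:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical type owning a "closed" notification channel.
type conn struct {
	closed    chan struct{}
	closeOnce sync.Once
}

func newConn() *conn {
	return &conn{closed: make(chan struct{})}
}

// Close may be called from any number of goroutines;
// the channel is closed exactly once.
func (c *conn) Close() {
	c.closeOnce.Do(func() { close(c.closed) })
}

// isClosed reports whether Close has been called, without blocking.
func (c *conn) isClosed() bool {
	select {
	case <-c.closed:
		return true
	default:
		return false
	}
}

func main() {
	c := newConn()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Close() // would panic on the second call without Once
		}()
	}
	wg.Wait()
	fmt.Println(c.isClosed())
}
```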
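
For the Timer case, one way to avoid the premature return is to not create a timer at all unless dur is positive, and to select on a nil channel otherwise, since receiving from a nil channel blocks forever. The sketch below wraps this in a wait function of my own. The boolean result and the demo helper are additions that make the behavior observable.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// wait blocks until dur elapses (only when dur > 0) or ctx is done.
// It returns true if the timeout fired and false if ctx ended first.
func wait(ctx context.Context, dur time.Duration) bool {
	var timeout <-chan time.Time // nil: this case can never be selected
	if dur > 0 {
		timer := time.NewTimer(dur)
		defer timer.Stop()
		timeout = timer.C
	}
	select {
	case <-timeout:
		return true
	case <-ctx.Done():
		return false
	}
}

// demo runs wait once with no timeout on a cancelled context,
// and once with a short timeout on a context that never ends.
func demo() (noTimeout, withTimeout bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	noTimeout = wait(ctx, 0)
	withTimeout = wait(context.Background(), 10*time.Millisecond)
	return noTimeout, withTimeout
}

func main() {
	noTimeout, withTimeout := demo()
	fmt.Println(noTimeout, withTimeout)
}
```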

Go provides a data race detector (enabled with the -race flag). When run against the studied bugs, it detects 7 out of 13 traditional bugs and 3 out of 4 bugs caused by anonymous functions. Four of these bugs had to be run about 100 times before the detector spotted them, which shows there is room for improvement.