Spinlock - Basics (pt. 1)

Synchronization is a pretty exciting topic. An important key in the multithreaded and concurrent process.

Today I plan to do a deliberate practice over the spinlocks. Probably it also will be a series of posts over the other synchronization primitives.

Spinlocks are synchronization primitives used to protect shared resources in multithreaded environments. They might seem simple, but mastering them requires a deep understanding of their implementation details.

At the core, a spinlock is a binary flag that indicates whether a resource is currently being accessed by a thread. When a thread wants to access the r it first checks if the spinlock is available. If it is, the thread sets the spinlock to a “locked” state and proceeds with the critical section of the code. If the spinlock is already locked, the thread “spins” in a loop, continuously checking the lock’s status until it becomes available.

Because they avoid overhead from rescheduling and context switching, spinlocks are efficient if threads are likely to be blocked for only short periods (since the thread remains active but is not performing a useful task). For this reason, kernels often use spinlock (https://en.wikipedia.org/wiki/Spinlock).

So, let’s start this study by trying to understand the basics of a spinlock, and how to implement it.

What is a spinlock?

As I’ve mentioned, a spinlock is a pattern that blocks the code until a certain condition is validated. So, in general, we can describe the spinlock based on three elements:

A state which should be evaluated;
A lock function that should block the program;
An unlock function that should release the resource.

class Spinlock {
  private:
    bool locked{false};

  public:
    void lock() {
				locked = true;
        while (locked);
    }

    void unlock() { locked=false; }
};

This is the template of this pattern. A state (locked), the lock function that blocks (i.e. while(1)), and the unlock function that releases the lock condition.

Now, let’s test our simplest spinlock implementation. Here, I’ve proposed (work) a simple function that tries to modify the value of a variable in a multithreading scenario.

void work(std::int64_t &val) {
    for (int i = 0; i < 100000; i++) {
        val++;
    }
}

And then, the test_work_on_two_threads to evaluate it in a concurrent environment.

void test_work_on_two_threads() {
	std::int64_t val = 0;

	Spinlock sl;

	std::thread t1(work, std::ref(sl), std::ref(val));
	std::thread t2(work, std::ref(sl), std::ref(val));

	t1.join();
	t2.join();

	std::cout << val << std::endl;
}

gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
119916
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
177647
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
110945
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
100848
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
127276
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
150574
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
172791
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
109485
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
106638

This is the ideal of a basic spinlock, obviously, this code does not run correctly, in special in a concurrent scenario. Each thread tries to modify the read-modify value concurrently. Let’s modify it to get a useful version.

I’ve planned to implement different versions of this spinlock. My objective here is to understand deeply the concept. Therefore, let’s do it in baby steps.

An initial approach to creating a viable implementation is to use the std::atomic (https://en.cppreference.com/w/cpp/atomic/atomic). It will provide us with the possibility of working in multithreading and also some tools to maintain the spinlock.

In special, the std::atomic provides us with a state (1) that holds atomically the current status of the spinlock. Also, the very intuitive exchange that atomically replaces the value of the atomic object and obtains the value held previously. It means, that when the locked comes false the lock condition will become false too.

class Spinlock {
  private:
    std::atomic<bool> locked{false};

  public:
    void lock() {
        while (locked.exchange(true));
    }

    void unlock() { locked.store(false); }
};

Adding the spinlock to guarantee the concurrent access we have:

void work(Spinlock &s, std::int64_t &val) {
    for (int i = 0; i < MAX_ITER; i++) {
        s.lock();
        val++;
        s.unlock();
    }
}

Then, running the test_work_on_two_threads we got:

gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
200000
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
200000
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
200000
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
200000
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
200000
gbs@gojira:~/fun/blog/cpp/spinlock/build $ ./spinlock
200000

And as expected, it works pretty well. Let’s take a look at the assembly code of our implementation (https://godbolt.org/z/Mr8G3xY94). Feel free to change the compiler and CPU architecture, I’m using an AMR64-Clang here.

.LBB0_1:
        swpalb  w9, w11, [x0]
        tbnz    w11, #0, .LBB0_1
        ldr     x11, [x1]
        add     w8, w8, #1
        cmp     w8, w10
        add     x11, x11, #1
        str     x11, [x1]
        stlrb   wzr, [x0]
        b.ne    .LBB0_1
        ret

Here is our principal loop. It’s a straightforward loop, the first thing we see is the swpalb, the aarch64 instruction for swap byte in memory with release semantics (atomic). This is our lock() which is spinning in a loop by the tbnz instruction util the previous swap returns false.

Then, we have the ldr, add, str incrementing our variable (i.e. val++). Finally, we have the stlrb, atomic store for the unlock().

Now, let’s create a better test and also add the Google Benchmark (https://github.com/google/benchmark) to evaluate our different implementations of spinlock.

static void bm_ref(benchmark::State &s) {
    auto num_threads = s.range(0);

    std::int64_t val = 0;

    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    Spinlock sl;

    for (auto _ : s) {
        for (auto i = 0u; i < num_threads; i++) {
            threads.emplace_back([&] { work(sl, val); });
        }
		
        for (auto &thread : threads)
            thread.join();
        threads.clear();
    }
}

BENCHMARK(bm_ref)
		->DenseRange(1, std::thread::hardware_concurrency(), 1)
		->UseRealTime();

BENCHMARK_MAIN();

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
bm_ref/1/real_time    1050932 ns        17797 ns          632
bm_ref/2/real_time    9894298 ns        40638 ns           69
bm_ref/3/real_time   15041799 ns        53106 ns           47
bm_ref/4/real_time   28551069 ns        80500 ns           24
bm_ref/5/real_time   67210773 ns       103727 ns           11
bm_ref/6/real_time  114636946 ns       135429 ns            7
bm_ref/7/real_time  127173375 ns       124667 ns            6
bm_ref/8/real_time  161605250 ns       137000 ns            5

This is the first part of this series. In the next parts, we’ll try to evaluate and improve this spinlock. Let’s try to analyze it deeply and take a look at how spinlock is really implemented.