Locks from scratch

Implementing a simple mutex in x86 assembly

October 30, 2024

Today, we'll continue the exploration of threads and concurrency. As we discussed in the last investigation, a thread is just a single sequence of execution within the address space of a process (the kernel tracks each process in a structure called the process control block, or PCB). A central advantage of threads is that they are cheaper and faster to create than several individual processes. Context switching is also much quicker because most of the data a thread needs is already loaded by the parent process; a thread only needs to keep track of its own registers and stack. There are downsides, however, including added complexity. If we have many threads reading and writing the same memory locations, we can end up with race conditions, whereby unexpected and inaccurate results surface.

A solution to this is to create a lock or mutex (short for mutual exclusion), which ensures that only one thread can enter a critical section of execution at a time. This will be some kind of flag in shared memory that signals to each thread whether it is allowed to continue into a specific part of the program. Before we implement a lock, let's see an example (and the consequences) of a multi-threaded program where a lock is not present.

Thread free for all

First, let's create a program that spawns a bunch of threads, each of which will increment the value of a counter defined in memory common to all threads in the process. We can adjust the previous code to spin up several threads. We'll define the maximum number of threads as an initialized variable and reserve a buffer to store the number of active threads:

section .data
	max_threads db 16	; maximum number of threads to spawn
section .bss
	num_threads resb 1	; current number of spawned threads

When we start our program we will spawn a new thread with the clone system call and increment num_threads until we reach the maximum:

_start:
	call init_threads

init_threads:
	; check the current thread count
	mov al, [num_threads]
	cmp al, [max_threads]
	; exit program if we are at the thread maximum
	jge exit
	inc byte [num_threads]
	call spawn_thread
	; if the return value is non-zero, we are in the parent, so keep spawning
	test eax, eax
	jnz init_threads
	; we are in the child: clone has already switched us onto the new
	; stack, so go do something with the thread
	call child_process

Each child (i.e. thread) will get its own stack and then move into the child_process, which we'll define below. The parent process continues to spawn threads until it reaches the maximum number, at which point it just exits silently.
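
The spawn_thread routine itself isn't shown here, but a minimal sketch might look like the following. It assumes a 4 KiB stack allocated with mmap2 and a clone call with CLONE_VM so the child shares the parent's address space; error handling is omitted. Copying the return address to the top of the new stack lets the child "return" from spawn_thread just like the parent does:

spawn_thread:
	; allocate a 4 KiB stack for the child with mmap2 (syscall 192)
	mov eax, 192
	xor ebx, ebx		; addr = NULL, let the kernel choose
	mov ecx, 4096		; length
	mov edx, 3		; PROT_READ | PROT_WRITE
	mov esi, 0x22		; MAP_PRIVATE | MAP_ANONYMOUS
	mov edi, -1		; no backing file
	xor ebp, ebp		; offset 0
	int 0x80
	; copy our return address to the top of the new stack so the
	; child can ret back to the caller as well
	lea ecx, [eax + 4096 - 4]
	mov edx, [esp]
	mov [ecx], edx
	; clone (syscall 120): ebx = flags, ecx = child stack pointer
	mov eax, 120
	mov ebx, 0x100		; CLONE_VM: share the address space
	int 0x80
	ret			; parent: child TID in eax; child: eax = 0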

In the child_process we will do the following:

  1. Increment the value of some counter by 1
  2. Print the value of the counter to stdout
  3. Exit

We'll initialize our counter as a single byte with the value of 0:

counter db 0	; in section .data, shared by all threads

If we want to display this as a numeric representation in ASCII, we can add 0x30 to the value before printing.

Note: this isn't the right way to translate a byte value to ASCII, but performing a simple add operation in our threads will demonstrate the importance of atomicity in programming... but we'll get to that soon.
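
For reference, a more faithful conversion would divide by 10 repeatedly and emit one ASCII digit per remainder. Here's a rough sketch (the digits buffer is hypothetical, and this routine isn't used in our program):

section .bss
	digits resb 3		; scratch space for up to three digits

section .text
; convert the byte in al (0-255) to ASCII digits, filling the digits
; buffer from the end; edi ends up pointing at the first digit
byte_to_ascii:
	movzx eax, al
	mov ebx, 10
	lea edi, [digits + 3]
.next_digit:
	dec edi
	xor edx, edx
	div ebx			; eax = quotient, edx = remainder
	add dl, 0x30		; remainder digit -> ASCII
	mov [edi], dl
	test eax, eax
	jnz .next_digit
	ret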

child_process:
	; increment counter
	inc byte [counter]
	; translate the value to ASCII for printing
	add byte [counter], 0x30
	; print to stdout (sys_write)
	mov eax, 4
	mov ebx, 1
	mov ecx, counter
	mov edx, 1
	int 0x80
	; revert the ASCII translation
	sub byte [counter], 0x30

	; ... free the allocated stack using munmap

	jmp exit
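
The exit label the code jumps to is presumably just a thin wrapper around the exit system call, something like this sketch:

exit:
	mov eax, 1		; sys_exit
	xor ebx, ebx		; status 0
	int 0x80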

Ok, we've got a program that spins up 16 threads. Each thread increments the counter, prints, and exits. Let's try running it!

nasm -f elf -o nolock.o nolock.s
ld -m elf_i386 -o nolock nolock.o

Now that we have assembled our program we can try running it a bunch in a loop:

while true; do
	./nolock
	echo ""
	sleep 0.1
done

This produces the following output:

123456789:;<=>?@
123456789j;<=>?@
123d56789:k<=>?@
12c456g89:;<=>?@
12cd56789:;<=>?@
123d567h9:;<=>?@
12c456789:;<=>?@
123456g89:k<=>?@
12c456789:k<=>?@
12cde6789:;<=>?@
1234e678i:;<=>?@
123456789:;<=>?@
123456789:k<=>?@
12cd56gh9:;<=>?@
123456789:k<=>?@
12c456789:;<=>?@
12c456789:;<=>?@
123d56789:k<=>?@
12cde6g89:;<=>?@

There are inconsistencies here: sometimes the correct ASCII representation of the counter prints as expected, but in many cases it's wrong. A thread can be preempted between its add and sub instructions, so the next thread increments a counter that still carries the extra 0x30. I smell a race condition! To solve this, we need to ensure the next thread increments the counter only once the active thread has completed its work. Enter: locks.
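
To make the bug concrete, here's one hypothetical interleaving of two threads that produces the 'c' (0x63) seen in the third column of the output above:

	; thread A: inc byte [counter]        ; counter = 2
	; thread A: add byte [counter], 0x30  ; counter = 0x32, prints '2'
	; --- scheduler preempts A before its sub runs ---
	; thread B: inc byte [counter]        ; counter = 0x33
	; thread B: add byte [counter], 0x30  ; counter = 0x63, prints 'c'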

Locks to the rescue

Generally speaking, a lock is just a variable that records whether or not it is currently held. If a thread holds the lock, we'll set it to 1. Other threads will look at the value and refrain from entering a critical section of code. When the holding thread is finished, it will set the lock back to 0, giving the other threads a chance to acquire it.

But we'll have to be careful when building the lock. Consider the following example:

1  section .data
2     mutex dd 0

...
3  child_process:
4     mov eax, [mutex]
5     cmp eax, 0
6     jne child_process
7     mov dword [mutex], 1
8     call critical_section

In the code above, we define a variable mutex, which holds the state of the lock. It is initialized to 0. A thread will enter child_process, load the value of the lock into a register, and check whether it is held by another thread. If it is (i.e. the value is not 0), it will loop back to the beginning of the label. If it is 0, the thread assumes the lock is free, moves 1 into the lock to attain it, and continues into the critical section. Seems reasonable, right?

What we have to remember is that the OS has control over thread scheduling. We must assume that any thread can be interrupted at any point and switched out for another one, or even for a different process altogether. Consider what would happen if the scheduler revoked control of a thread just before it executed line 7. Another thread would likely gain control and evaluate the lock as free, even though the first thread will enter the critical section as soon as it runs again. Bad news!
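
Sketched as a hypothetical schedule, the failure looks like this:

	; thread A: mov eax, [mutex]        ; reads 0, lock looks free
	; --- scheduler preempts A before line 7 ---
	; thread B: mov eax, [mutex]        ; also reads 0
	; thread B: mov dword [mutex], 1    ; B takes the lock
	; --- scheduler resumes A ---
	; thread A: mov dword [mutex], 1    ; A takes it too
	; both threads are now inside the critical section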

To solve this we need a way to test and set the lock variable that is immune to interruption. Luckily, all modern architectures provide atomic instructions for exactly this purpose. On x86, we need a single instruction that can perform all the functions of lines 4-7. There are a few possibilities, but for this investigation we'll use the xchg instruction. The usage is as follows:

xchg destination, source

We can atomically swap the value of a register with a value in memory. When one of the operands is a memory location, the processor asserts its LOCK signal implicitly, so the exchange is atomic and no other thread can sneak in between the read and the write. Once the values have been swapped, we can test the result. Let's improve our code using this new instruction:

1  section .data
2     mutex dd 0

...
3  child_process:
...
4  acquire_lock:
5     mov eax, 1
6  spin:
7     xchg eax, [mutex]
8     test eax, eax
9     jnz spin
10    ret

Instead of setting the lock conditionally, each thread always attempts to set it and then evaluates the result. On line 8, we check whether eax is 1 or 0. If the value is 1, the lock is currently held by another thread, so we jump back to the spin label and try again (yep, this is a spinlock). If the value is 0, the lock was free and we have now attained it! But wait, what happens if the scheduler interrupts the thread after line 8? Well, we're in luck, because we've already claimed the lock atomically, so no other thread can acquire it in the meantime. Once we're done with the critical section we can release the lock:

release_lock:
	mov dword [mutex], 0
	ret

A plain mov is enough here: an aligned store is atomic on x86, so any spinning thread will simply observe the 0 on its next xchg.

Ok, now let's update our broken multi-threaded program with our new spinlock:

child_process:
	call acquire_lock

	; CRITICAL SECTION
	; increment counter
	inc byte [counter]
	add byte [counter], 0x30
	; print to stdout (sys_write)
	mov eax, 4
	mov ebx, 1
	mov ecx, counter
	mov edx, 1
	int 0x80
	; revert the ASCII translation
	sub byte [counter], 0x30
	; END CRITICAL SECTION

	call release_lock

	; ... free the allocated stack using munmap

	jmp exit

We attain the lock right before performing any operations on the counter variable. Once we claim the lock, we continue through the critical section before releasing it. Nice, let's try running this program in the same loop as before:

123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@
123456789:;<=>?@

Looks pretty consistent to me! Our spinlock was a success and we can now safely ship this fancy multi-threaded counter program to production without any risk of race conditions.

Find the full code here.