Threads from scratch

"Multi-threaded" programming with assembly

October 24, 2024

This week I'm beginning to explore the daunting topic of concurrency, specifically in the context of multi-threaded programs. A thread is simply a sequence of instructions that shares the same address space and resources as the other threads in the same program. Earlier we explored the concept of a process, namely how an operating system can create, track and schedule many processes at once. Threads follow many of the same principles, but operate within a single process. This means that we can spawn and run threads much more quickly than processes, since we don't need to copy the address space and the file table (among other things).

POSIX defines an API for thread management (pthreads), which the C library implements. Today, I'd like to explore which system calls this library uses under the hood, using some simple examples.

Inspecting system calls with `strace`

In order to figure out what system calls a multithreaded program makes we can use the strace tool. From the strace man pages we learn the following:

In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process.

Ok, great. This will allow us to inspect some of the underlying system calls any program performs. Here's a really simple example of a C program that leverages the POSIX thread library:

1   #include <stdio.h>
2   #include <pthread.h>
3
4   typedef struct {
5       int a;
6       int b;
7   } myarg_t;
8
9   void *mythread(void *arg)
10  {
11      myarg_t *args = (myarg_t *) arg;
12      printf("Thread created\n");
13      printf("%d %d\n", args->a, args->b);
14      return NULL;
15  }
16  int main(int argc, char *argv[])
17  {
18      pthread_t p;
19      myarg_t args = { 10, 20 };
20      int rc = pthread_create(&p, NULL, mythread, &args);
21      return 0;
22  }

note: this example is sourced from OSTEP

Let's quickly go over this program before we inspect it with strace. On a basic level, this program makes a call to the pthread_create function (line 20). With man pthread_create we can see the function takes the following arguments:

A pointer to a pthread_t, which receives the ID of the new thread. We define this on line 18.
Some attributes of the thread we'd like to create (attr), but we leave this as NULL to go with the defaults.
The start_routine of the thread. Simply put, this is the function that the thread should call once initiated. In the program above this is mythread on line 9.
The argument to be passed to the function specified above.

Our program creates a thread which, once created, begins executing the mythread function. That function just prints out the args passed to pthread_create. Nothing fancy, right? Ok, let's run strace.

gcc -pthread my_thread.c -o my_thread
sudo apt install strace
strace ./my_thread

Here's an abridged output:

execve("./my_thread", ["./my_thread"], 0x7ffd82734c90 /* 23 vars */) = 0
brk(NULL)                               = 0x562c1fe8c000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=36241, ...}) = 0
mmap(NULL, 36241, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fab31b35000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
...
mmap(NULL, 136304, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fbd4368b000
mmap(0x7fbd43691000, 65536, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7fbd43691000
mmap(0x7fbd436a1000, 24576, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7fbd436a1000
mmap(0x7fbd436a7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b000) = 0x7fbd436a7000
mmap(0x7fbd436a9000, 13424, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fbd436a9000
...
clone(child_stack=0x7fbd434b2fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|
CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[211804], tls=0x7fbd434b3700, child_tidptr=0x7fbd434b39d0) = 211804
exit_group(0)

Wow! That's a lot of system calls for such a simple program. At this point, we shouldn't be surprised. But let's try and focus on the last few lines, specifically clone.

The `clone` system call

Turns out, most multi-threaded programs rely heavily on this call specifically. From man 2 clone:

create a child process

These system calls create a new ("child") process, in a manner similar to fork(2). By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).

The process of cloning is quite similar to forking, which we explored earlier. However, the clone system call gives us much more control over how the new process is created. The most important piece of control in the case of threads is that we want the new thread of execution to run inside the same address space as the program that invokes the call. We can control this, among many other things, with the flags argument (specified as bit-mask values):

CLONE_VM    Parent and child share virtual memory

In the man pages for clone we can see all of the other flags. Here's a few more just to get an idea:

CLONE_CHILD_CLEARTID Clear ctid when child calls exec() or _exit()
CLONE_CHILD_SETTID   Write thread ID of child into ctid (2.6 onward) 
CLONE_FILES          Parent and child share table of open file descriptors 
CLONE_FS             Parent and child share attributes related to file system 
CLONE_IO             Child shares parent’s I/O context
CLONE_NEWIPC         Child gets new System V IPC namespace 
CLONE_NEWNET         Child gets new network namespace
CLONE_NEWNS          Child gets copy of parent’s mount namespace
CLONE_NEWPID         Child gets new process-ID namespace

Descriptions from The Linux Programming Interface

Another interesting detail is that we have to provide the clone call with a new stack for the thread. Again, from The Linux Programming Interface [599]:

Since a cloned child may (like vfork()) share the parent’s memory, it can’t use the parent’s stack. Instead, the caller must allocate a suitably sized block of memory for use as the child's stack and pass a pointer to that block in the argument child_stack.

So, if we want to create a new thread within our program we need to allocate a block of memory for that specific thread. Ok, let's try making a thread from scratch with assembly.

A single thread in assembly

To create and run a thread from within the same address space we'll need to do the following:

Allocate a block of memory for the thread's stack
Make the clone call and provide the pointer to the new stack's memory in the args, along with the CLONE_VM flag.
Inside the child, do something and then exit.
Inside the parent, do something else and then exit.

To request a chunk of memory for the stack we can use the mmap system call:

mov ebx, 0 ; addr: NULL lets the OS decide the starting address
mov ecx, 4096 ; length: 4 KiB of memory
mov edx, 0x7 ; protection: PROT_EXEC | PROT_WRITE | PROT_READ
mov esi, 0x22 ; flags: MAP_ANONYMOUS | MAP_PRIVATE
mov edi, -1 ; fd: according to the man page, should be -1 for anonymous mappings
mov ebp, 0 ; offset: according to the man page, should be zero for anonymous mappings
mov eax, 192 ; syscall number for mmap2 on 32-bit x86
int 0x80

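We'll request 4 KiB of memory and let the OS decide where to place it. On success, the pointer to the new block of memory will be available in eax. Because we're making a raw system call rather than going through libc, a failure comes back as a negated errno value (somewhere between -4095 and -1) and not necessarily -1. Let's add a quick check for this:

cmp eax, -4095 ; unsigned compare: 0xfffff001 and above are error codes
jae exit_with_error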

Let's prepare for the clone call now. First, we'll need a pointer to the top of the new stack. Since the stack grows downwards, that is the base address returned by mmap plus the size of the mapping:

mov ecx, eax
add ecx, 4096

The pointer in ecx is now at the top of the stack.

In ebx we need the right flags. For now, let's just specify CLONE_VM (0x100) and CLONE_FILES (0x400), meaning that the thread will share the address space and the open file table of the parent process.

mov ebx, 0x00000100 | 0x00000400

Let's perform the call. Once we've made it, we can check whether we're in the parent or the child by looking at the return value in eax. If it is 0, we are in the child. Otherwise we are in the parent. This is pretty similar to fork, it turns out!

mov eax, 120 ; sys_clone
int 0x80
test eax, eax
jnz parent_process

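In the child process, the kernel has already pointed esp at the stack we passed in ecx, so setting it again is redundant, but it is harmless and makes the intent explicit: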

mov esp, ecx

In the child let's just print a message, free up our stack memory and exit.

; print
mov ecx, child_msg
mov edx, child_msg_len
mov ebx, 1
mov eax, 4
int 0x80
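; free the allocated stack using munmap
; esp still points at the top of the mapping, so the base is esp - 4096
mov eax, 91
lea ebx, [esp-4096]
mov ecx, 4096
int 0x80
; check for negative status code
cmp eax, 0
jl exit_with_error
; the stack is gone now, so jump (no call or push) to code that only makes the exit syscall
jmp exit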

In the parent process let's just print and exit:

mov ecx, parent_msg
mov edx, parent_msg_length
mov ebx, 1
mov eax, 4
int 0x80
jmp exit

Once we assemble and execute our program, we can see the child and parent messages printed to stdout. Great! We made a thread inside the same address space, did something in the thread, and exited. You can view the entire program here

Simplicity... for now

Our program leaves a lot to be desired. Just to name a few qualms, we don't really do anything with the allocated thread stack. Also, we don't specify if the parent should wait for the thread to finish executing. This was merely an attempt to display a minimal way to create a single thread. Of course, real multi-threaded programs are far more complicated, and definitely use more than one thread! But complexity isn't necessarily the goal here.