This week I'm beginning to explore the daunting topic of concurrency, specifically in the context of multi-threaded programs. A thread is simply a sequence of instructions that shares the same address space and resources as other threads in the same program. Earlier we explored the concept of a process, namely how an operating system can create, track and schedule many processes at once. Threads share many of the same principles, but operate within a single process. This means that we can spawn and run threads much more quickly than processes, since we don't need to copy address spaces and the file table (among other things).
The POSIX standard library exposes an API for thread management (pthread). Today, I'd like to explore what system calls this library uses under the hood using some simple examples.
strace
In order to figure out what system calls a multithreaded program makes we can use the strace tool. From the strace man pages we learn the following:
In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process.
Ok, great. This will allow us to inspect some of the underlying system calls any program performs. Here's a really simple example of a C program that leverages the POSIX thread library:
1 #include <stdio.h>
2 #include <pthread.h>
3
4 typedef struct {
5 int a;
6 int b;
7 } myarg_t;
8
9 void *mythread(void *arg)
10 {
11 myarg_t *args = (myarg_t *) arg;
12 printf("Thread created\n");
13 printf("%d %d\n", args->a, args->b);
14 return NULL;
15 }
16 int main(int argc, char *argv[])
17 {
18 pthread_t p;
19 myarg_t args = { 10, 20 };
20 int rc = pthread_create(&p, NULL, mythread, &args);
21 return 0;
22 }
note: this example is sourced from OSTEP
Let's quickly go over this program before we inspect it with strace. On a basic level, this program makes a call to the pthread_create function (line 20). With man pthread_create we can see the function takes the following arguments:
- A pointer to a pthread_t value (thread). We define this on line 18.
- Attributes for the new thread (attr), but we leave this as NULL to go with the defaults.
- The start_routine of the thread. Simply put, this is the function that the thread should call once initiated. In the program above this is mythread on line 9.
- The argument (arg) handed to start_routine. In the program above this is &args, defined on line 19.
Our program defines a thread to be created and upon creation will begin executing the mythread function, which just prints out the args specified in pthread_create. Nothing fancy, right? Ok, let's run strace.
gcc -pthread my_thread.c -o my_thread
sudo apt install strace
strace ./my_thread
Here's an abridged output:
execve("./my_thread", ["./my_thread"], 0x7ffd82734c90 /* 23 vars */) = 0
brk(NULL) = 0x562c1fe8c000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=36241, ...}) = 0
mmap(NULL, 36241, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fab31b35000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
...
mmap(NULL, 136304, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fbd4368b000
mmap(0x7fbd43691000, 65536, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7fbd43691000
mmap(0x7fbd436a1000, 24576, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7fbd436a1000
mmap(0x7fbd436a7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b000) = 0x7fbd436a7000
mmap(0x7fbd436a9000, 13424, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fbd436a9000
...
clone(child_stack=0x7fbd434b2fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|
CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[211804], tls=0x7fbd434b3700, child_tidptr=0x7fbd434b39d0) = 211804
exit_group(0)
Wow! That's a lot of system calls for such a simple program. At this point, we shouldn't be surprised. But let's try and focus on the last few lines, specifically clone.
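A quick aside on the very last line of the trace: exit_group(0) follows clone immediately because main returns without waiting for the thread, so whether mythread gets to print before the process exits is a race. The sketch below is one way to wait for the thread, assuming we want a result back from it. The names sum_arg_t, sum_thread and sum_in_thread are made up for illustration and are not part of the OSTEP example.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical argument block: two inputs and a slot for the result. */
typedef struct {
    int a;
    int b;
    int sum;
} sum_arg_t;

/* Thread body: writes a + b back into the shared argument block. */
static void *sum_thread(void *arg)
{
    sum_arg_t *args = (sum_arg_t *) arg;
    args->sum = args->a + args->b;
    return NULL;
}

/* Creates a thread, waits for it with pthread_join, and stores the
 * result in *out. Returns 0 on success or the pthread error number. */
int sum_in_thread(int a, int b, int *out)
{
    pthread_t t;
    sum_arg_t args = { a, b, 0 };

    int rc = pthread_create(&t, NULL, sum_thread, &args);
    if (rc != 0)
        return rc;

    /* Block until the thread has finished; args stays alive until then. */
    rc = pthread_join(t, NULL);
    if (rc != 0)
        return rc;

    *out = args.sum;
    return 0;
}
```

Calling sum_in_thread(10, 20, &out) should leave 30 in out. Because the join happens before the function returns, the thread never outlives the stack variable it was handed.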
clone system call
Turns out, most multi-threaded programs rely heavily on this call specifically. From man 2 clone:
create a child process
These system calls create a new ("child") process, in a manner similar to fork(2).
By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).
The process of cloning is quite similar to forking, which we explored earlier. However, the clone system call gives us much more control over how we create a new process. The most important element of control in the case of threads is that we want to create a new thread of execution within the same address space as the program that invokes the call. We can control this, among many other things, with the flags argument (specified as bit-mask values):
CLONE_VM Parent and child share virtual memory
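To see what CLONE_VM changes, it helps to look at the opposite case first. The sketch below uses plain fork, where the child gets its own copy of the address space, so a write made in the child never shows up in the parent. The names counter and counter_after_fork are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Global that the child will modify in its own copy of memory. */
static int counter = 0;

/* Forks, lets the child overwrite counter, and returns the value the
 * parent sees afterwards. Returns -1 if fork or waitpid fails. */
int counter_after_fork(void)
{
    counter = 0;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: this write lands in the child's private copy only. */
        counter = 99;
        _exit(0);
    }

    /* Parent: wait for the child, then read our own copy. */
    if (waitpid(pid, NULL, 0) == -1)
        return -1;

    return counter;
}
```

The parent still sees 0 here. With CLONE_VM set, both sides would be looking at the same memory and the parent would see the child's write instead.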
In the man pages for clone we can see all of the other flags. Here are a few more just to get an idea:
CLONE_CHILD_CLEARTID Clear ctid when child calls exec() or _exit()
CLONE_CHILD_SETTID Write thread ID of child into ctid (2.6 onward)
CLONE_FILES Parent and child share table of open file descriptors
CLONE_FS Parent and child share attributes related to file system
CLONE_IO Child shares parent’s I/O context
CLONE_NEWIPC Child gets new System V IPC namespace
CLONE_NEWNET Child gets new network namespace
CLONE_NEWNS Child gets copy of parent’s mount namespace
CLONE_NEWPID Child gets new process-ID namespace
Descriptions from The Linux Programming Interface
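Since the flags are just bits OR'd together, we can rebuild the mask from the strace output above and test individual bits. This is a small sketch, assuming a Linux system with glibc's sched.h; the helper names are made up for illustration.

```c
#define _GNU_SOURCE   /* needed for the CLONE_* constants in <sched.h> */
#include <sched.h>

/* The flag set copied from the clone() line in the strace output. */
unsigned long pthread_clone_flags(void)
{
    return CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
           CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS |
           CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
}

/* Nonzero if the mask asks for a shared virtual address space. */
int shares_address_space(unsigned long flags)
{
    return (flags & CLONE_VM) != 0;
}

/* Nonzero if the mask asks for a new PID namespace. */
int gets_new_pid_namespace(unsigned long flags)
{
    return (flags & CLONE_NEWPID) != 0;
}
```

On Linux this mask works out to 0x3d0f00. It has CLONE_VM set and none of the namespace flags, which is what we would expect for an ordinary thread.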
Another interesting detail is that we have to provide the clone call with a new stack for the thread. Again, from The Linux Programming Interface [599]:
Since a cloned child may (like vfork()) share the parent’s memory, it can’t use the parent’s stack. Instead, the caller must allocate a suitably sized block of memory for use as the child's stack and pass a pointer to that block in the argument child_stack.
So, if we want to create a new thread within our program we need to allocate a block of memory for that specific thread. Ok, let's try making a thread from scratch with assembly.
To create and run a thread from within the same address space we'll need to do the following:
- Allocate a block of memory for the thread's stack.
- Make the clone system call with the CLONE_VM flag.
To request a chunk of memory for the stack we can use the mmap system call:
mov ebx, 0 ; addr: NULL lets the OS decide the starting address
mov ecx, 4096 ; length: 4 KiB of memory
mov edx, 0x7 ; protection: PROT_EXEC | PROT_WRITE | PROT_READ
mov esi, 0x22 ; flags: MAP_ANONYMOUS | MAP_PRIVATE
mov edi, -1 ; fd: according to the man page should be set to -1
mov ebp, 0 ; offset: according to the man page should be set to zero
mov eax, 192 ; syscall number for mmap2 on 32-bit x86
int 0x80
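We'll request 4 KiB of memory and let the OS decide where to allocate it. Upon success, the pointer to the new block of memory will be available in eax. If mmap fails, the raw system call returns a negative errno value (between -4095 and -1) in eax, rather than the -1 that the C library wrapper reports. Let's add a quick check for this:
cmp eax, -4096 ; errors are the unsigned values above 0xfffff000
ja exit_with_error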
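For comparison, here is roughly the same request made through the C library's mmap wrapper. It is only a sketch: alloc_thread_stack and free_thread_stack are hypothetical helper names, and unlike the assembly it leaves out PROT_EXEC on the assumption that a thread stack only needs to be readable and writable. Note that the wrapper reports failure as MAP_FAILED and sets errno, whereas the raw system call hands back its status directly in eax.

```c
#define _DEFAULT_SOURCE   /* makes MAP_ANONYMOUS visible under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>

/* Asks the kernel for an anonymous, private block of `size` bytes.
 * Returns the base address, or NULL if the mapping failed. */
void *alloc_thread_stack(size_t size)
{
    void *base = mmap(NULL,                        /* let the OS pick the address */
                      size,
                      PROT_READ | PROT_WRITE,      /* assumption: no PROT_EXEC */
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1,                          /* no backing file */
                      0);                          /* offset unused */
    return base == MAP_FAILED ? NULL : base;
}

/* Returns the block to the kernel; 0 on success, -1 on failure. */
int free_thread_stack(void *base, size_t size)
{
    return munmap(base, size);
}
```

Anonymous mappings start out zero-filled, so the block is ready to use as soon as alloc_thread_stack returns a non-NULL pointer.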
Let's prepare for the clone call now. First, we'll need a pointer to the top of the stack. Since the stack grows downwards, we add the size of the block to its base address:
mov ecx, eax
add ecx, 4096
The pointer in ecx is now at the top of the stack.
In ebx we need the right flags. For now, let's just specify CLONE_VM and CLONE_FILES, meaning that the thread will share the address space and the open file table of the parent process.
mov ebx, 0x00000100 | 0x00000400 ; CLONE_VM | CLONE_FILES
Let's perform the call. Once we've made it we can check if we're in the parent or child process by looking at the return value in eax. If it is 0 we are in the child process. Otherwise we are in the parent process. This is pretty similar to fork, it turns out!
mov eax, 120 ; sys_clone
int 0x80
test eax, eax
jnz parent_process
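In the child process, esp must point at the top of the new stack. The kernel already starts the child with esp set to the stack pointer we passed in ecx, so copying it again is redundant but harmless: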
mov esp, ecx
In the child let's just print a message, free up our stack memory and exit.
; print
mov ecx, child_msg
mov edx, child_msg_len
mov ebx, 1
mov eax, 4
int 0x80
; free the allocated stack using munmap
mov eax, 91
lea ebx, [esp - 4096] ; base of the block: esp still points at its top
mov ecx, 4096
int 0x80
; check for negative status code
cmp eax, 0
jl exit_with_error
jmp exit
In the parent process let's just print and exit:
mov ecx, parent_msg
mov edx, parent_msg_length
mov ebx, 1
mov eax, 4
int 0x80
call exit
Once we assemble and execute our program, we can see the child and parent messages printed to stdout. Great! We made a thread inside the same address space, did something in the thread, and exited. You can view the entire program here.
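For readers more comfortable in C, the same steps can be sketched with glibc's clone wrapper. This is not the program linked above, just an approximation under a few assumptions: the stack is 64 KiB rather than 4 KiB to leave room for library code, SIGCHLD is added to the flags so the parent can waitpid for the child, and the names child_fn, shared_value and run_clone_demo are invented for the example.

```c
#define _GNU_SOURCE   /* needed for clone() and the CLONE_* flags */
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* assumption: larger than the 4 KiB used above */

/* Lives in the address space that parent and child share via CLONE_VM. */
static volatile int shared_value = 0;

/* Runs in the child on the freshly allocated stack. */
static int child_fn(void *arg)
{
    shared_value = *(int *) arg;
    return 0;
}

/* Allocates a stack, clones a child that shares our memory, waits for it,
 * and returns what the child wrote to shared_value (or -1 on error). */
int run_clone_demo(int value)
{
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* The stack grows downwards, so clone wants the highest address. */
    char *stack_top = (char *) stack + STACK_SIZE;

    pid_t pid = clone(child_fn, stack_top,
                      CLONE_VM | CLONE_FILES | SIGCHLD, &value);
    if (pid == -1) {
        munmap(stack, STACK_SIZE);
        return -1;
    }

    int result = -1;
    if (waitpid(pid, NULL, 0) == pid)
        result = shared_value;

    /* Safe to free now: the child has exited and no longer uses the stack. */
    munmap(stack, STACK_SIZE);
    return result;
}
```

Because of CLONE_VM the child's write to shared_value is visible to the parent, so run_clone_demo(42) should return 42. The parent only unmaps the stack after waitpid returns, so the child is never left running on freed memory.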
Our program leaves a lot to be desired. Just to name a few qualms, we don't really do anything with the allocated thread stack. Also, we don't specify if the parent should wait for the thread to finish executing. This was merely an attempt to display a minimal way to create a single thread. Of course, real multi-threaded programs are far more complicated, and definitely use more than one thread! But complexity isn't necessarily the goal here.