Last few lectures: how to use pthread library calls to create properly
synchronized multi-threaded programs.

There are other synchronization mechanisms. The textbook teaches semaphores
(introduced by Dijkstra):

A semaphore is initialized with a non-negative integer (sem_init).
P(s) / sem_wait()
  - if s is non-zero, decrement s by 1 and return immediately
  - if s is zero, block the thread until s becomes non-zero
V(s) / sem_post()
  - increments s by 1
  - if there are threads blocked in P(s), resume one of those threads

Semaphores can be used for mutual exclusion:
    sem_init(&s, .., 1);
    sem_wait(&s);    //lock()
    sum++;
    sem_post(&s);    //unlock()
Semaphores can be used in place of condition variables as well.

-----
We recommend using mutex + condition variables rather than semaphores
  * no conflation of mutex & condition variable --> more readable code

This lecture: how does the pthread library work?
A threading library needs both
  1. OS kernel support
  2. hardware support

* Thread creation and scheduling
  - One address space, multiple threads, each with its own hardware execution
    context (called light-weight processes in Linux)
  - Each thread can be scheduled separately: suspend execution (saving its
    registers), resume execution (restoring its registers)
  - pthread_create invokes the clone(fn, ..., void *child_stack, ...) system call

* How are locks implemented? Consider the strawman below:

    int lock = 0;
    int *m = &lock;
    //lock
    while (*m == 1);    // spin until the lock is free
    *m = 1;
    ... critical section
    *m = 0;             //unlock

  Two threads can interleave the test and the store:

    T1:                 T2:
    mov m, %eax         mov m, %eax
    cmp ...             cmp ...
    mov $1, m           mov $1, m

  Both see the lock as free, so both can enter the critical section.

  - Need hardware atomic instructions to support locking: xchg

    while (xchg(m, 1) != 0)
        ;               // spin
    ...                 //critical section
    xchg(m, 0);

  Previous interleaving example, now with xchg:

    T1:                     T2:
    L1: mov $1, %eax        L1: mov $1, %eax
        xchg m, %eax            xchg m, %eax
        cmp ...                 cmp ...
        jmp L1                  jmp L1
        mov $1, m               mov $1, m

  Even with the same interleaving, only one thread's xchg returns the old
  value 0; the other sees 1, jumps back to L1, and keeps retrying.

  What if we do a regular *m = 0 instead of xchg(m, 0)?
  The answer is unclear!!

  In our interleaving examples, we have assumed that processors execute
  operations in the order they appear in the program. Not true in reality.
  Modern processors execute instructions out of order to maximize performance:
  if one instruction takes longer to run (e.g. it needs to fetch from memory
  instead of the L1 cache), the processor executes instructions behind it
  instead of idling. Processors make different guarantees about the ordering
  of memory operations, so the following is possible.

  The code:
    /* critical section */
    ...
    mov %rbx, m2    //some memory operation
    mov $0, m       //the instruction to unlock

  The actual execution:
    mov $0, m
    mov %rbx, m2    !!!

  Intel explicitly specifies that no reordering happens across the xchg
  instruction. Hence, it is the right thing to use xchg(m, 0) instead of
  *m = 0.

  There are other hardware atomic instructions, e.g.
    void atomic_add(int *m, int val)
        atomically adds val to *m
    oldv = cmpxchg(int *m, oldv, newv)
        atomically compares *m with oldv; if equal, sets *m = newv.
        Returns the old value of *m.

* The lock we've implemented using xchg is called a spinlock. If the lock is
  not available, the thread spins (busy waits). Not very efficient: if one
  thread cannot acquire the lock, we would rather run another thread that can
  make progress.
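  To make the xchg-based spinlock concrete, here is a minimal sketch in C.
  This is not how pthread_mutex is implemented; the GCC/Clang builtins
  __atomic_exchange_n and __atomic_store_n stand in for the lecture's xchg()
  pseudocode (on x86 the exchange compiles to a locked xchg), and lockvar,
  worker, spin_lock, spin_unlock are made-up names for the demo.

    #include <pthread.h>
    #include <stdio.h>

    static int lockvar = 0;      /* 0 = free, 1 = held */
    static long sum = 0;

    static void spin_lock(int *m) {
        /* atomically swap 1 into *m; keep retrying while the old value was 1 */
        while (__atomic_exchange_n(m, 1, __ATOMIC_ACQUIRE) != 0)
            ;                    /* spin (busy wait) */
    }

    static void spin_unlock(int *m) {
        /* ordered store: critical-section writes cannot move past the unlock */
        __atomic_store_n(m, 0, __ATOMIC_RELEASE);
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            spin_lock(&lockvar);
            sum++;               /* critical section */
            spin_unlock(&lockvar);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("sum = %ld\n", sum);   /* 2000000 every run with the lock */
        return 0;
    }

  Compile with gcc -pthread. Remove the lock calls and the lost-update
  interleaving shown above makes sum come out short on most runs.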
* The kernel provides an abstraction to put a thread to sleep on a wait queue
  instead of spinning:

    futex(int *addr, FUTEX_WAIT, val, ...)
        atomically checks that *addr == val and, if so, puts the calling
        thread on the kernel's wait queue for addr
    futex(int *addr, FUTEX_WAKE, #-of-threads-to-wake (1 or INT_MAX), ...)
        wakes threads on addr's kernel wait queue

  A lock built on futex:

    lock() {
        while (xchg(m, 1) != 0)
            futex_wait(m, 1);
    }
    unlock() {
        xchg(m, 0);
        futex_wake(m, 1);
    }

  The above is not the most efficient implementation: it does a futex_wake
  even when no thread is waiting in the kernel! The pthread mutex
  implementation makes no futex syscalls when the lock is not contended.

  In general, the mutex/condition variable implementation is made complex by
  performance considerations. If you are curious, google "Futexes are tricky"
  by Ulrich Drepper.
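  To make the lock()/unlock() pseudocode above concrete, here is a sketch of
  the same naive futex lock on Linux. It assumes GCC/Clang builtins as before,
  and it issues the raw futex system call through syscall(SYS_futex, ...)
  since glibc does not ship a futex() wrapper; futex_wait, futex_wake, and
  worker are local helper names, not library functions.

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <pthread.h>
    #include <stdio.h>

    static uint32_t m = 0;       /* 0 = free, 1 = held */
    static long sum = 0;

    /* sleep in the kernel only if *addr still equals val */
    static void futex_wait(uint32_t *addr, uint32_t val) {
        syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
    }

    /* wake up to n threads sleeping on addr's wait queue */
    static void futex_wake(uint32_t *addr, int n) {
        syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
    }

    static void lock(void) {
        /* xchg(m, 1); if the lock was held, sleep until m may have changed */
        while (__atomic_exchange_n(&m, 1, __ATOMIC_ACQUIRE) != 0)
            futex_wait(&m, 1);
    }

    static void unlock(void) {
        __atomic_exchange_n(&m, 0, __ATOMIC_RELEASE);   /* xchg(m, 0) */
        futex_wake(&m, 1);       /* wasted syscall when nobody is waiting */
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            lock();
            sum++;               /* critical section */
            unlock();
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("sum = %ld\n", sum);
        return 0;
    }

  The inefficiency noted above is visible in unlock(): every release pays for
  a futex_wake syscall even when no one is waiting. Drepper's paper shows the
  extra lock state the real pthread mutex keeps to skip that syscall on the
  uncontended path.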