Threads and Concurrency¶
Threads vs Processes¶
A single-threaded process has one address space and one execution context (registers, stack, PC). To exploit multi-CPU systems, a process needs multiple execution contexts — these are threads.
Threads share the same virtual address space (code, data, files, mappings)
Each thread has its own PC, stack pointer, stack, and registers
OS represents a multithreaded process with a more complex PCB: shared state + per-thread execution contexts
Benefits of Multithreading¶
Parallelism:
Threads execute same code on different input subsets (data parallelism) or different code paths (task parallelism)
Specialization allows hotter caches — thread repeatedly executes smaller code portion, keeping more state in processor cache
Efficiency over multiprocess:
Threads share address space → lower memory requirements (one address space allocation vs N)
Inter-thread communication via shared memory is cheaper than IPC
Context switching between threads is faster (no TLB flush / new address mappings needed)
Single CPU benefits:
If
t_idle > 2 * t_context_switch, it pays to switch to another thread during I/O waitThread context switch is cheaper than process context switch (shared address space → no new virtual-to-physical mappings)
Multithreading hides I/O latency even on a single CPU
OS-level multithreading:
Kernel threads allow OS services (daemons, drivers) to run concurrently on multiple CPUs
Basic Thread Mechanisms¶
Requirements for thread support:
Thread data structure — distinguish threads, track resources
Creation/management mechanisms — fork, join
Coordination mechanisms — mutual exclusion (mutexes), waiting (condition variables)
Thread Creation (Birrell’s Model)¶
Thread type: contains thread ID, registers (PC, SP), stack, attributes
Fork(proc, args): creates new thread executing
proc(args); not Unix forkJoin(thread): blocks caller until target thread completes; returns child’s result; frees child resources
After fork, parent and child execute concurrently — execution order is not deterministic.
Mutexes¶
Problem: concurrent threads accessing shared data can cause inconsistencies (data races). E.g., two threads inserting into a linked list simultaneously may lose an element.
Mutex (mutual exclusion lock):
Only the lock holder can access the critical section; others block
Data structure: lock status (locked/free), owner, blocked-threads list
Critical section: code protected by the mutex
Birrell’s lock construct uses implicit unlock at closing brace. Most real APIs use explicit lock() / unlock() calls.
Mutex Example¶
With safe_insert: the thread that acquires the mutex first inserts its element; the other blocks until the mutex is released, then inserts. Final list order depends on scheduling.
Lock contention: when mutex is released, any pending thread (or a newly arriving thread) may acquire it — no ordering guarantees.
Condition Variables¶
Mutexes alone cannot express “wait until condition X is true.” Condition variables solve this.
Producer/Consumer motivation: a consumer should only process a list when it’s full — busy-waiting (repeatedly lock/check/unlock) wastes CPU.
Condition variable API (Birrell):
Wait(mutex, cond): atomically release mutex + enqueue thread on cond’s wait queue. On wakeup, re-acquire mutex before returning.
Signal(cond): wake one waiting thread
Broadcast(cond): wake all waiting threads
Why use ``while`` not ``if`` before wait:
Multiple threads may wait on same condition; another may consume the event first
No guarantee of mutex acquisition order after signal
Shared state can change between signal and actual mutex re-acquisition
Readers-Writer Problem¶
Policy: 0+ readers OR 0-1 writers can access the resource concurrently (never readers + writer simultaneously).
Solution using proxy variable resource_counter:
resource_counter == 0: resource freeresource_counter > 0: number of active readersresource_counter == -1: one active writer
Use one mutex (counter_mutex) to protect resource_counter, plus two condition variables: read_phase and write_phase.
Reader enter:
lock(counter_mutex)
while (resource_counter == -1):
wait(counter_mutex, read_phase)
resource_counter++
unlock(counter_mutex)
// ... read data ...
Reader exit:
lock(counter_mutex)
resource_counter--
if resource_counter == 0:
signal(write_phase)
unlock(counter_mutex)
Writer enter:
lock(counter_mutex)
while (resource_counter != 0):
wait(counter_mutex, write_phase)
resource_counter = -1
unlock(counter_mutex)
// ... write data ...
Writer exit:
lock(counter_mutex)
resource_counter = 0
broadcast(read_phase)
signal(write_phase)
unlock(counter_mutex)
Critical Section Structure¶
General pattern for enter/exit critical sections:
Lock mutex
While predicate not met → wait(mutex, cond_var)
Update shared state
Signal/broadcast appropriate condition variable
Unlock mutex
The actual resource access (read/write data) happens outside the lock — the mutex only protects the proxy variable, allowing multiple readers concurrently.
Common Pitfalls¶
Always protect a shared variable with the same mutex everywhere
Use signal vs broadcast correctly (broadcast is safe but may hurt performance; signal when wrong can cause indefinite waiting or deadlock)
Signal/broadcast order does not guarantee thread execution order
Keep mutex/condvar associations well-documented
Spurious Wakeups¶
If signal/broadcast is issued before unlocking the mutex, woken threads immediately block again on the mutex → wasted context switches.
Fix: move signal/broadcast after unlock — but only if the signal doesn’t depend on the shared state protected by the mutex.
Deadlocks¶
Definition: two or more threads each waiting on a resource held by the other → cycle in wait graph → execution stuck.
Example: T1 locks A then B; T2 locks B then A → potential deadlock.
Prevention strategies:
Lock ordering: all threads acquire locks in the same global order — prevents cycles
Fine-grained locking: release one lock before acquiring another (only works if both aren’t needed simultaneously)
Coarse-grained locking: single mega-lock (limits parallelism)
Handling approaches:
Prevention: analyze each lock request for potential cycles (expensive)
Detection + Recovery: monitor wait graph, rollback on cycle (requires checkpointing)
Ostrich Algorithm: ignore the problem, reboot if deadlocked (common in practice)
Kernel vs User-Level Threads¶
Kernel-level threads: visible to OS, scheduled by kernel scheduler onto CPUs.
User-level threads: managed by a user-space thread library; must map to kernel threads to execute.
Multithreading Models¶
One-to-One (e.g., Linux NPTL):
Each user thread maps to one kernel thread
OS sees all threads; full kernel scheduling support
Downside: every thread operation requires syscall; limited by kernel policies/thread limits
Many-to-One:
All user threads map to single kernel thread
Fully portable; no syscall overhead for thread management
Downside: one blocking call blocks entire process; no true parallelism
Many-to-Many (e.g., Solaris LWP):
Multiple user threads map to multiple (fewer) kernel threads
Threads can be bound (1:1 permanent) or unbound (multiplexed)
Blocking one user thread doesn’t block the process
Downside: requires coordination between user-level and kernel-level thread managers
Scope of Multithreading¶
Process scope: user-level library manages threads within one process; OS allocates resources per-process equally regardless of thread count
System scope: all user threads visible to kernel; resources allocated per-thread across all processes
Multithreading Patterns¶
Boss-Workers¶
Boss thread accepts work, places on shared queue
Worker threads pick from queue, perform entire task
Throughput limited by boss efficiency → keep boss logic minimal
Use thread pool (pre-created workers, dynamically resizable) to avoid creation overhead
Communication via producer-consumer queue (boss produces, workers consume)
Variant — specialized workers: boss routes tasks to specialized worker groups → exploits locality (hotter caches), enables QoS differentiation. Tradeoff: harder load balancing.
Pipeline¶
Task divided into stages; each stage assigned to thread(s)
Multiple tasks in-flight simultaneously, each at different stage
Throughput limited by slowest stage → assign more threads to bottleneck stages
Use shared buffers between stages for elasticity
Benefits: high specialization, locality
Downsides: complex balancing, more synchronization points
Layered¶
Group related subtasks into layers; threads in a layer can perform any subtask in that group
Tasks pass up/down through layers (bidirectional, unlike pipeline)
Less fine-grained than pipeline → easier thread allocation
Downsides: not always applicable; complex inter-layer synchronization