Threads and Concurrency¶

Threads vs Processes¶

A single-threaded process has one address space and one execution context (registers, stack, PC). To exploit multi-CPU systems, a process needs multiple execution contexts — these are threads.

Threads share the same virtual address space (code, data, files, mappings)
Each thread has its own PC, stack pointer, stack, and registers
OS represents a multithreaded process with a more complex PCB: shared state + per-thread execution contexts

Benefits of Multithreading¶

Parallelism:

Threads execute same code on different input subsets (data parallelism) or different code paths (task parallelism)
Specialization allows hotter caches — thread repeatedly executes smaller code portion, keeping more state in processor cache

Efficiency over multiprocess:

Threads share address space → lower memory requirements (one address space allocation vs N)
Inter-thread communication via shared memory is cheaper than IPC
Context switching between threads is faster (no TLB flush / new address mappings needed)

Single CPU benefits:

If t_idle > 2 * t_context_switch, it pays to switch to another thread during I/O wait
Thread context switch is cheaper than process context switch (shared address space → no new virtual-to-physical mappings)
Multithreading hides I/O latency even on a single CPU

OS-level multithreading:

Kernel threads allow OS services (daemons, drivers) to run concurrently on multiple CPUs

Basic Thread Mechanisms¶

Requirements for thread support:

Thread data structure — distinguish threads, track resources
Creation/management mechanisms — fork, join
Coordination mechanisms — mutual exclusion (mutexes), waiting (condition variables)

Thread Creation (Birrell’s Model)¶

Thread type: contains thread ID, registers (PC, SP), stack, attributes
Fork(proc, args): creates new thread executing proc(args); not Unix fork
Join(thread): blocks caller until target thread completes; returns child’s result; frees child resources

After fork, parent and child execute concurrently — execution order is not deterministic.

Mutexes¶

Problem: concurrent threads accessing shared data can cause inconsistencies (data races). E.g., two threads inserting into a linked list simultaneously may lose an element.

Mutex (mutual exclusion lock):

Only the lock holder can access the critical section; others block
Data structure: lock status (locked/free), owner, blocked-threads list
Critical section: code protected by the mutex

Birrell’s lock construct uses implicit unlock at closing brace. Most real APIs use explicit lock() / unlock() calls.

Mutex Example¶

With safe_insert: the thread that acquires the mutex first inserts its element; the other blocks until the mutex is released, then inserts. Final list order depends on scheduling.

Lock contention: when mutex is released, any pending thread (or a newly arriving thread) may acquire it — no ordering guarantees.

Condition Variables¶

Mutexes alone cannot express “wait until condition X is true.” Condition variables solve this.

Producer/Consumer motivation: a consumer should only process a list when it’s full — busy-waiting (repeatedly lock/check/unlock) wastes CPU.

Condition variable API (Birrell):

Wait(mutex, cond): atomically release mutex + enqueue thread on cond’s wait queue. On wakeup, re-acquire mutex before returning.
Signal(cond): wake one waiting thread
Broadcast(cond): wake all waiting threads

Why use ``while`` not ``if`` before wait:

Multiple threads may wait on same condition; another may consume the event first
No guarantee of mutex acquisition order after signal
Shared state can change between signal and actual mutex re-acquisition

Readers-Writer Problem¶

Policy: 0+ readers OR 0-1 writers can access the resource concurrently (never readers + writer simultaneously).

Solution using proxy variable resource_counter:

resource_counter == 0: resource free
resource_counter > 0: number of active readers
resource_counter == -1: one active writer

Use one mutex (counter_mutex) to protect resource_counter, plus two condition variables: read_phase and write_phase.

Reader enter:

lock(counter_mutex)
while (resource_counter == -1):
    wait(counter_mutex, read_phase)
resource_counter++
unlock(counter_mutex)
// ... read data ...

Reader exit:

lock(counter_mutex)
resource_counter--
if resource_counter == 0:
    signal(write_phase)
unlock(counter_mutex)

Writer enter:

lock(counter_mutex)
while (resource_counter != 0):
    wait(counter_mutex, write_phase)
resource_counter = -1
unlock(counter_mutex)
// ... write data ...

Writer exit:

lock(counter_mutex)
resource_counter = 0
broadcast(read_phase)
signal(write_phase)
unlock(counter_mutex)

Critical Section Structure¶

General pattern for enter/exit critical sections:

Lock mutex
While predicate not met → wait(mutex, cond_var)
Update shared state
Signal/broadcast appropriate condition variable
Unlock mutex

The actual resource access (read/write data) happens outside the lock — the mutex only protects the proxy variable, allowing multiple readers concurrently.

Common Pitfalls¶

Always protect a shared variable with the same mutex everywhere
Use signal vs broadcast correctly (broadcast is safe but may hurt performance; signal when wrong can cause indefinite waiting or deadlock)
Signal/broadcast order does not guarantee thread execution order
Keep mutex/condvar associations well-documented

Spurious Wakeups¶

If signal/broadcast is issued before unlocking the mutex, woken threads immediately block again on the mutex → wasted context switches.

Fix: move signal/broadcast after unlock — but only if the signal doesn’t depend on the shared state protected by the mutex.

Deadlocks¶

Definition: two or more threads each waiting on a resource held by the other → cycle in wait graph → execution stuck.

Example: T1 locks A then B; T2 locks B then A → potential deadlock.

Prevention strategies:

Lock ordering: all threads acquire locks in the same global order — prevents cycles
Fine-grained locking: release one lock before acquiring another (only works if both aren’t needed simultaneously)
Coarse-grained locking: single mega-lock (limits parallelism)

Handling approaches:

Prevention: analyze each lock request for potential cycles (expensive)
Detection + Recovery: monitor wait graph, rollback on cycle (requires checkpointing)
Ostrich Algorithm: ignore the problem, reboot if deadlocked (common in practice)

Kernel vs User-Level Threads¶

Kernel-level threads: visible to OS, scheduled by kernel scheduler onto CPUs.

User-level threads: managed by a user-space thread library; must map to kernel threads to execute.

Multithreading Models¶

One-to-One (e.g., Linux NPTL):

Each user thread maps to one kernel thread
OS sees all threads; full kernel scheduling support
Downside: every thread operation requires syscall; limited by kernel policies/thread limits

Many-to-One:

All user threads map to single kernel thread
Fully portable; no syscall overhead for thread management
Downside: one blocking call blocks entire process; no true parallelism

Many-to-Many (e.g., Solaris LWP):

Multiple user threads map to multiple (fewer) kernel threads
Threads can be bound (1:1 permanent) or unbound (multiplexed)
Blocking one user thread doesn’t block the process
Downside: requires coordination between user-level and kernel-level thread managers

Scope of Multithreading¶

Process scope: user-level library manages threads within one process; OS allocates resources per-process equally regardless of thread count
System scope: all user threads visible to kernel; resources allocated per-thread across all processes

Multithreading Patterns¶

Boss-Workers¶

Boss thread accepts work, places on shared queue
Worker threads pick from queue, perform entire task
Throughput limited by boss efficiency → keep boss logic minimal
Use thread pool (pre-created workers, dynamically resizable) to avoid creation overhead
Communication via producer-consumer queue (boss produces, workers consume)

Variant — specialized workers: boss routes tasks to specialized worker groups → exploits locality (hotter caches), enables QoS differentiation. Tradeoff: harder load balancing.

Pipeline¶

Task divided into stages; each stage assigned to thread(s)
Multiple tasks in-flight simultaneously, each at different stage
Throughput limited by slowest stage → assign more threads to bottleneck stages
Use shared buffers between stages for elasticity
Benefits: high specialization, locality
Downsides: complex balancing, more synchronization points

Layered¶

Group related subtasks into layers; threads in a layer can perform any subtask in that group
Tasks pass up/down through layers (bidirectional, unlike pipeline)
Less fine-grained than pipeline → easier thread allocation
Downsides: not always applicable; complex inter-layer synchronization