Threads and Concurrency

Threads vs Processes

A single-threaded process has one address space and one execution context (registers, stack, PC). To exploit multi-CPU systems, a process needs multiple execution contexts — these are threads.

  • Threads share the same virtual address space (code, data, files, mappings)

  • Each thread has its own PC, stack pointer, stack, and registers

  • OS represents a multithreaded process with a more complex PCB: shared state + per-thread execution contexts

Benefits of Multithreading

Parallelism:

  • Threads execute same code on different input subsets (data parallelism) or different code paths (task parallelism)

  • Specialization allows hotter caches — thread repeatedly executes smaller code portion, keeping more state in processor cache

Efficiency over multiprocess:

  • Threads share address space → lower memory requirements (one address space allocation vs N)

  • Inter-thread communication via shared memory is cheaper than IPC

  • Context switching between threads is faster (no TLB flush / new address mappings needed)

Single CPU benefits:

  • If t_idle > 2 * t_context_switch, it pays to switch to another thread during I/O wait

  • Thread context switch is cheaper than process context switch (shared address space → no new virtual-to-physical mappings)

  • Multithreading hides I/O latency even on a single CPU

OS-level multithreading:

  • Kernel threads allow OS services (daemons, drivers) to run concurrently on multiple CPUs

Basic Thread Mechanisms

Requirements for thread support:

  1. Thread data structure — distinguish threads, track resources

  2. Creation/management mechanisms — fork, join

  3. Coordination mechanisms — mutual exclusion (mutexes), waiting (condition variables)

Thread Creation (Birrell’s Model)

  • Thread type: contains thread ID, registers (PC, SP), stack, attributes

  • Fork(proc, args): creates new thread executing proc(args); not Unix fork

  • Join(thread): blocks caller until target thread completes; returns child’s result; frees child resources

After fork, parent and child execute concurrently — execution order is not deterministic.

Mutexes

Problem: concurrent threads accessing shared data can cause inconsistencies (data races). E.g., two threads inserting into a linked list simultaneously may lose an element.

Mutex (mutual exclusion lock):

  • Only the lock holder can access the critical section; others block

  • Data structure: lock status (locked/free), owner, blocked-threads list

  • Critical section: code protected by the mutex

Birrell’s lock construct uses implicit unlock at closing brace. Most real APIs use explicit lock() / unlock() calls.

Mutex Example

With safe_insert: the thread that acquires the mutex first inserts its element; the other blocks until the mutex is released, then inserts. Final list order depends on scheduling.

Lock contention: when mutex is released, any pending thread (or a newly arriving thread) may acquire it — no ordering guarantees.

Condition Variables

Mutexes alone cannot express “wait until condition X is true.” Condition variables solve this.

Producer/Consumer motivation: a consumer should only process a list when it’s full — busy-waiting (repeatedly lock/check/unlock) wastes CPU.

Condition variable API (Birrell):

  • Wait(mutex, cond): atomically release mutex + enqueue thread on cond’s wait queue. On wakeup, re-acquire mutex before returning.

  • Signal(cond): wake one waiting thread

  • Broadcast(cond): wake all waiting threads

Why use ``while`` not ``if`` before wait:

  • Multiple threads may wait on same condition; another may consume the event first

  • No guarantee of mutex acquisition order after signal

  • Shared state can change between signal and actual mutex re-acquisition

Readers-Writer Problem

Policy: 0+ readers OR 0-1 writers can access the resource concurrently (never readers + writer simultaneously).

Solution using proxy variable resource_counter:

  • resource_counter == 0: resource free

  • resource_counter > 0: number of active readers

  • resource_counter == -1: one active writer

Use one mutex (counter_mutex) to protect resource_counter, plus two condition variables: read_phase and write_phase.

Reader enter:

lock(counter_mutex)
while (resource_counter == -1):
    wait(counter_mutex, read_phase)
resource_counter++
unlock(counter_mutex)
// ... read data ...

Reader exit:

lock(counter_mutex)
resource_counter--
if resource_counter == 0:
    signal(write_phase)
unlock(counter_mutex)

Writer enter:

lock(counter_mutex)
while (resource_counter != 0):
    wait(counter_mutex, write_phase)
resource_counter = -1
unlock(counter_mutex)
// ... write data ...

Writer exit:

lock(counter_mutex)
resource_counter = 0
broadcast(read_phase)
signal(write_phase)
unlock(counter_mutex)

Critical Section Structure

General pattern for enter/exit critical sections:

  1. Lock mutex

  2. While predicate not met → wait(mutex, cond_var)

  3. Update shared state

  4. Signal/broadcast appropriate condition variable

  5. Unlock mutex

The actual resource access (read/write data) happens outside the lock — the mutex only protects the proxy variable, allowing multiple readers concurrently.

Common Pitfalls

  • Always protect a shared variable with the same mutex everywhere

  • Use signal vs broadcast correctly (broadcast is safe but may hurt performance; signal when wrong can cause indefinite waiting or deadlock)

  • Signal/broadcast order does not guarantee thread execution order

  • Keep mutex/condvar associations well-documented

Spurious Wakeups

If signal/broadcast is issued before unlocking the mutex, woken threads immediately block again on the mutex → wasted context switches.

Fix: move signal/broadcast after unlock — but only if the signal doesn’t depend on the shared state protected by the mutex.

Deadlocks

Definition: two or more threads each waiting on a resource held by the other → cycle in wait graph → execution stuck.

Example: T1 locks A then B; T2 locks B then A → potential deadlock.

Prevention strategies:

  • Lock ordering: all threads acquire locks in the same global order — prevents cycles

  • Fine-grained locking: release one lock before acquiring another (only works if both aren’t needed simultaneously)

  • Coarse-grained locking: single mega-lock (limits parallelism)

Handling approaches:

  • Prevention: analyze each lock request for potential cycles (expensive)

  • Detection + Recovery: monitor wait graph, rollback on cycle (requires checkpointing)

  • Ostrich Algorithm: ignore the problem, reboot if deadlocked (common in practice)

Kernel vs User-Level Threads

Kernel-level threads: visible to OS, scheduled by kernel scheduler onto CPUs.

User-level threads: managed by a user-space thread library; must map to kernel threads to execute.

Multithreading Models

One-to-One (e.g., Linux NPTL):

  • Each user thread maps to one kernel thread

  • OS sees all threads; full kernel scheduling support

  • Downside: every thread operation requires syscall; limited by kernel policies/thread limits

Many-to-One:

  • All user threads map to single kernel thread

  • Fully portable; no syscall overhead for thread management

  • Downside: one blocking call blocks entire process; no true parallelism

Many-to-Many (e.g., Solaris LWP):

  • Multiple user threads map to multiple (fewer) kernel threads

  • Threads can be bound (1:1 permanent) or unbound (multiplexed)

  • Blocking one user thread doesn’t block the process

  • Downside: requires coordination between user-level and kernel-level thread managers

Scope of Multithreading

  • Process scope: user-level library manages threads within one process; OS allocates resources per-process equally regardless of thread count

  • System scope: all user threads visible to kernel; resources allocated per-thread across all processes

Multithreading Patterns

Boss-Workers

  • Boss thread accepts work, places on shared queue

  • Worker threads pick from queue, perform entire task

  • Throughput limited by boss efficiency → keep boss logic minimal

  • Use thread pool (pre-created workers, dynamically resizable) to avoid creation overhead

  • Communication via producer-consumer queue (boss produces, workers consume)

Variant — specialized workers: boss routes tasks to specialized worker groups → exploits locality (hotter caches), enables QoS differentiation. Tradeoff: harder load balancing.

Pipeline

  • Task divided into stages; each stage assigned to thread(s)

  • Multiple tasks in-flight simultaneously, each at different stage

  • Throughput limited by slowest stage → assign more threads to bottleneck stages

  • Use shared buffers between stages for elasticity

  • Benefits: high specialization, locality

  • Downsides: complex balancing, more synchronization points

Layered

  • Group related subtasks into layers; threads in a layer can perform any subtask in that group

  • Tasks pass up/down through layers (bidirectional, unlike pipeline)

  • Less fine-grained than pipeline → easier thread allocation

  • Downsides: not always applicable; complex inter-layer synchronization