Synchronization Constructs¶
Why More Synchronization?¶
Mutexes and condition variables have known pitfalls:
Correctness: forgetting to unlock, signalling wrong condition variable
Lack of expressive power: cannot easily express count-based, priority, or reader/writer constraints without helper variables
Hardware dependency: all software synchronization requires low-level atomic instruction support
This lesson covers additional constructs and how hardware atomics enable efficient spinlock implementations (following Anderson’s paper: The Performance of Spin Lock Alternatives on Shared Memory Multiprocessors).
Spinlocks¶
Like a mutex: protects critical sections via lock/unlock
Key difference: when lock is busy, the thread spins (busy-waits) on the CPU rather than blocking
Thread burns CPU cycles until lock is free or it gets preempted
Basic primitive used to build more complex synchronization constructs
Semaphores¶
Represented by a positive integer counter, initialized to some max value
Wait (P/proberen): if counter > 0, decrement and proceed; if 0, block
Post (V/verhogen): increment counter, potentially unblock a waiting thread
Allows count-based synchronization (e.g., limit to N concurrent producers)
Binary semaphore (initialized to 1) = equivalent to a mutex
Designed by Dijkstra
POSIX Semaphore API¶
#include <semaphore.h>
sem_t sem;
sem_init(&sem, pshared, count); // pshared: 0=threads, 1=processes
sem_wait(&sem); // decrement or block
sem_post(&sem); // increment
To use as mutex: sem_init(&sem, 0, 1)
Reader/Writer Locks¶
Distinguish read (shared) access from write (exclusive) access
Multiple readers can hold the lock concurrently
Writers require exclusive access
pthread_rwlock_t rwlock;
pthread_rwlock_rdlock(&rwlock); // shared
pthread_rwlock_wrlock(&rwlock); // exclusive
pthread_rwlock_unlock(&rwlock);
Implementation differences across platforms:
Recursive read_lock semantics (single vs. per-lock unlock)
Priority: can a reader upgrade to writer? Does writer priority block new readers?
Interaction with scheduling policy and thread priorities
Monitors¶
Higher-level construct that encapsulates: the shared resource, entry procedures, and condition variables
All locking/unlocking/checking happens automatically on entry/exit
Reduces programmer error (no manual lock/unlock pairing)
Historically: MESA language (Xerox PARC); today: Java (
synchronizedmethods, internal object lock)Also refers to the programming style using mutexes + condition variables for enter/exit critical section code
Other Constructs¶
Serializers: define priorities, hide signaling details
Path expressions: specify correct synchronization as a regular expression (e.g., “many reads or single write”)
Barriers: block all threads until N arrive (reverse of semaphore)
Rendezvous points: multiple threads must meet at a specific execution point
Wait-free synchronization: optimistic, lock-free approaches (e.g., Linux RCU — Read-Copy-Update)
All require hardware atomic instruction support at the lowest level.
Hardware Atomic Instructions¶
Examples: test-and-set, read-and-increment, compare-and-swap
Guarantees provided by hardware:
Atomicity: entire operation completes or none of it does
Mutual exclusion: only one atomic instruction at a time on the same memory location
Remaining operations are queued
Different platforms support different atomics with different efficiencies → synchronization code must be ported and optimized per platform.
Cache Coherence¶
When multiple CPUs cache the same data:
Non-cache-coherent: software must maintain consistency
Cache-coherent: hardware ensures consistency
Two protocols:
Write-invalidate: on write, invalidate other cached copies; future reads cause cache miss → lower bandwidth (only send address, not data); amortized cost if data changes multiple times before others need it
Write-update: on write, update all cached copies; data immediately available on other CPUs → better for data needed immediately after update
Atomics and Cache Coherence¶
Atomic instructions bypass caches and go directly to memory controller
Creates a single serialization point for correctness
Triggers coherence traffic (invalidate/update) regardless of whether value actually changed
Consequence: atomics are more expensive on SMP systems (memory contention + cache coherence overhead)
Spinlock Performance Metrics¶
Three objectives (often conflicting):
Low latency: time to acquire a free lock (ideally one atomic instruction)
Low delay (waiting time): time from lock-freed to acquisition by a waiting thread
Low contention: minimize traffic on shared bus/interconnect
Conflicts: reducing latency/delay requires aggressive spinning → increases contention.
Test-and-Set Spinlock¶
// lock = 0 (free), 1 (busy)
spinlock_lock(lock):
while (test_and_set(lock) == busy);
spinlock_unlock(lock):
lock = 0;
Latency: optimal (single atomic)
Delay: good (continuous spinning detects free immediately)
Contention: terrible — every spin iteration goes to memory (bypasses cache even on cache-coherent systems)
Test-and-Test-and-Set Spinlock (Spin on Read)¶
spinlock_lock(lock):
while (lock == busy || test_and_set(lock) == busy);
Spins on cached value of lock; only executes atomic when cache indicates lock is free
With write-update coherence: all CPUs see lock freed simultaneously → all issue test_and_set → O(N) contention
With write-invalidate coherence: every failed test_and_set invalidates all caches → O(N²) contention on lock release
Delay-Based Spinlocks¶
Introduce delay to reduce simultaneous atomic attempts:
Delay after lock release detected:
spinlock_lock(lock):
while (lock == busy);
delay();
// recheck and attempt test_and_set
Reduces contention; worsens delay when no contention
Delay after every reference:
Works on non-cache-coherent architectures
Hurts delay even more
Picking a delay:
Static (based on CPU ID × critical section length): simple, good under high load, wasteful under low load
Dynamic (random delay scaled by perceived contention): better under low load; contention estimated by counting failed test_and_set attempts; popular approach is exponential backoff
Anderson’s Queueing Lock¶
Solves the core problem: with other locks, all threads see the lock freed simultaneously and all try to acquire at once.
Design:
Array of flags with N elements (N = number of processors): values
has-lockormust-waitqueuelastpointer tracks the tailArriving thread atomically increments
queuelast(via read-and-increment) to get a ticketThread spins on its own array element (private lock)
On unlock: set next element to
has-lock
// Initialization
flags[0] = has-lock;
flags[1..N-1] = must-wait;
queuelast = 0;
lock(flags, queuelast):
myplace = read_and_increment(queuelast) % N;
while (flags[myplace] == must-wait); // spin on own element
unlock(flags, myplace):
flags[myplace] = must-wait;
flags[(myplace + 1) % N] = has-lock;
Performance:
Latency: worse (read-and-increment + modular arithmetic more complex than test-and-set)
Delay: excellent — next thread signaled directly
Contention: excellent — atomic executed once (not part of spin); spinning on different variables than the atomic target
Requirements:
Cache-coherent architecture
Each array element must be in a separate cache line (prevents false sharing)
Array size: N × cache_line_size (e.g., 32 CPUs × 64B = 2KB per lock)
Performance Comparison (Anderson’s Paper)¶
Platform: Sequent Symmetry (20 processors, cache-coherent write-invalidate).
Under high load (many processors contending):
Queueing lock: best (most scalable)
Test-and-test-and-set: worst (O(N²) contention on write-invalidate)
Static delay > dynamic delay (static distributes attempts more evenly)
Delay after every reference > delay after release (avoids extra invalidations)
Under low load (few processors):
Test-and-test-and-set: performs well (low latency)
Dynamic delay > static delay (avoids unnecessary waiting)
Queueing lock: worst (high latency overhead)
Key takeaway: no single spinlock design is universally best — the choice depends on expected workload, contention level, number of processors, and cache coherence protocol.