Synchronization Constructs¶

Why More Synchronization?¶

Mutexes and condition variables have known pitfalls:

Correctness: forgetting to unlock, signalling wrong condition variable
Lack of expressive power: cannot easily express count-based, priority, or reader/writer constraints without helper variables
Hardware dependency: all software synchronization requires low-level atomic instruction support

This lesson covers additional constructs and how hardware atomics enable efficient spinlock implementations (following Anderson’s paper: The Performance of Spin Lock Alternatives on Shared Memory Multiprocessors).

Spinlocks¶

Like a mutex: protects critical sections via lock/unlock
Key difference: when lock is busy, the thread spins (busy-waits) on the CPU rather than blocking
Thread burns CPU cycles until lock is free or it gets preempted
Basic primitive used to build more complex synchronization constructs

Semaphores¶

Represented by a positive integer counter, initialized to some max value
Wait (P/proberen): if counter > 0, decrement and proceed; if 0, block
Post (V/verhogen): increment counter, potentially unblock a waiting thread
Allows count-based synchronization (e.g., limit to N concurrent producers)
Binary semaphore (initialized to 1) = equivalent to a mutex
Designed by Dijkstra

POSIX Semaphore API¶

#include <semaphore.h>

sem_t sem;
sem_init(&sem, pshared, count);  // pshared: 0=threads, 1=processes
sem_wait(&sem);   // decrement or block
sem_post(&sem);   // increment

To use as mutex: sem_init(&sem, 0, 1)

Reader/Writer Locks¶

Distinguish read (shared) access from write (exclusive) access
Multiple readers can hold the lock concurrently
Writers require exclusive access

pthread_rwlock_t rwlock;
pthread_rwlock_rdlock(&rwlock);   // shared
pthread_rwlock_wrlock(&rwlock);   // exclusive
pthread_rwlock_unlock(&rwlock);

Implementation differences across platforms:

Recursive read_lock semantics (single vs. per-lock unlock)
Priority: can a reader upgrade to writer? Does writer priority block new readers?
Interaction with scheduling policy and thread priorities

Monitors¶

Higher-level construct that encapsulates: the shared resource, entry procedures, and condition variables
All locking/unlocking/checking happens automatically on entry/exit
Reduces programmer error (no manual lock/unlock pairing)
Historically: MESA language (Xerox PARC); today: Java (synchronized methods, internal object lock)
Also refers to the programming style using mutexes + condition variables for enter/exit critical section code

Other Constructs¶

Serializers: define priorities, hide signaling details
Path expressions: specify correct synchronization as a regular expression (e.g., “many reads or single write”)
Barriers: block all threads until N arrive (reverse of semaphore)
Rendezvous points: multiple threads must meet at a specific execution point
Wait-free synchronization: optimistic, lock-free approaches (e.g., Linux RCU — Read-Copy-Update)

All require hardware atomic instruction support at the lowest level.

Hardware Atomic Instructions¶

Examples: test-and-set, read-and-increment, compare-and-swap

Guarantees provided by hardware:

Atomicity: entire operation completes or none of it does
Mutual exclusion: only one atomic instruction at a time on the same memory location
Remaining operations are queued

Different platforms support different atomics with different efficiencies → synchronization code must be ported and optimized per platform.

Shared Memory Multiprocessors¶

Multiple CPUs share memory via interconnect (bus-based or switch/crossbar)
Caches hide memory latency (especially important under memory contention)
Write policies: no-write (invalidate cache), write-through, write-back

Cache Coherence¶

When multiple CPUs cache the same data:

Non-cache-coherent: software must maintain consistency
Cache-coherent: hardware ensures consistency

Two protocols:

Write-invalidate: on write, invalidate other cached copies; future reads cause cache miss → lower bandwidth (only send address, not data); amortized cost if data changes multiple times before others need it
Write-update: on write, update all cached copies; data immediately available on other CPUs → better for data needed immediately after update

Atomics and Cache Coherence¶

Atomic instructions bypass caches and go directly to memory controller
Creates a single serialization point for correctness
Triggers coherence traffic (invalidate/update) regardless of whether value actually changed
Consequence: atomics are more expensive on SMP systems (memory contention + cache coherence overhead)

Spinlock Performance Metrics¶

Three objectives (often conflicting):

Low latency: time to acquire a free lock (ideally one atomic instruction)
Low delay (waiting time): time from lock-freed to acquisition by a waiting thread
Low contention: minimize traffic on shared bus/interconnect

Conflicts: reducing latency/delay requires aggressive spinning → increases contention.

Test-and-Set Spinlock¶

// lock = 0 (free), 1 (busy)
spinlock_lock(lock):
    while (test_and_set(lock) == busy);

spinlock_unlock(lock):
    lock = 0;

Latency: optimal (single atomic)
Delay: good (continuous spinning detects free immediately)
Contention: terrible — every spin iteration goes to memory (bypasses cache even on cache-coherent systems)

Test-and-Test-and-Set Spinlock (Spin on Read)¶

spinlock_lock(lock):
    while (lock == busy || test_and_set(lock) == busy);

Spins on cached value of lock; only executes atomic when cache indicates lock is free
With write-update coherence: all CPUs see lock freed simultaneously → all issue test_and_set → O(N) contention
With write-invalidate coherence: every failed test_and_set invalidates all caches → O(N²) contention on lock release

Delay-Based Spinlocks¶

Introduce delay to reduce simultaneous atomic attempts:

Delay after lock release detected:

spinlock_lock(lock):
    while (lock == busy);
    delay();
    // recheck and attempt test_and_set

Reduces contention; worsens delay when no contention

Delay after every reference:

Works on non-cache-coherent architectures
Hurts delay even more

Picking a delay:

Static (based on CPU ID × critical section length): simple, good under high load, wasteful under low load
Dynamic (random delay scaled by perceived contention): better under low load; contention estimated by counting failed test_and_set attempts; popular approach is exponential backoff

Anderson’s Queueing Lock¶

Solves the core problem: with other locks, all threads see the lock freed simultaneously and all try to acquire at once.

Design:

Array of flags with N elements (N = number of processors): values has-lock or must-wait
queuelast pointer tracks the tail
Arriving thread atomically increments queuelast (via read-and-increment) to get a ticket
Thread spins on its own array element (private lock)
On unlock: set next element to has-lock

// Initialization
flags[0] = has-lock;
flags[1..N-1] = must-wait;
queuelast = 0;

lock(flags, queuelast):
    myplace = read_and_increment(queuelast) % N;
    while (flags[myplace] == must-wait);  // spin on own element

unlock(flags, myplace):
    flags[myplace] = must-wait;
    flags[(myplace + 1) % N] = has-lock;

Performance:

Latency: worse (read-and-increment + modular arithmetic more complex than test-and-set)
Delay: excellent — next thread signaled directly
Contention: excellent — atomic executed once (not part of spin); spinning on different variables than the atomic target

Requirements:

Cache-coherent architecture
Each array element must be in a separate cache line (prevents false sharing)
Array size: N × cache_line_size (e.g., 32 CPUs × 64B = 2KB per lock)

Performance Comparison (Anderson’s Paper)¶

Platform: Sequent Symmetry (20 processors, cache-coherent write-invalidate).

Under high load (many processors contending):

Queueing lock: best (most scalable)
Test-and-test-and-set: worst (O(N²) contention on write-invalidate)
Static delay > dynamic delay (static distributes attempts more evenly)
Delay after every reference > delay after release (avoids extra invalidations)

Under low load (few processors):

Test-and-test-and-set: performs well (low latency)
Dynamic delay > static delay (avoids unnecessary waiting)
Queueing lock: worst (high latency overhead)

Key takeaway: no single spinlock design is universally best — the choice depends on expected workload, contention level, number of processors, and cache coherence protocol.