Synchronization Constructs

Why More Synchronization?

Mutexes and condition variables have known pitfalls:

  • Correctness: forgetting to unlock, signalling wrong condition variable

  • Lack of expressive power: cannot easily express count-based, priority, or reader/writer constraints without helper variables

  • Hardware dependency: all software synchronization requires low-level atomic instruction support

This lesson covers additional constructs and how hardware atomics enable efficient spinlock implementations (following Anderson’s paper: The Performance of Spin Lock Alternatives on Shared Memory Multiprocessors).

Spinlocks

  • Like a mutex: protects critical sections via lock/unlock

  • Key difference: when lock is busy, the thread spins (busy-waits) on the CPU rather than blocking

  • Thread burns CPU cycles until lock is free or it gets preempted

  • Basic primitive used to build more complex synchronization constructs

Semaphores

  • Represented by a positive integer counter, initialized to some max value

  • Wait (P/proberen): if counter > 0, decrement and proceed; if 0, block

  • Post (V/verhogen): increment counter, potentially unblock a waiting thread

  • Allows count-based synchronization (e.g., limit to N concurrent producers)

  • Binary semaphore (initialized to 1) = equivalent to a mutex

  • Designed by Dijkstra

POSIX Semaphore API

#include <semaphore.h>

sem_t sem;
sem_init(&sem, pshared, count);  // pshared: 0=threads, 1=processes
sem_wait(&sem);   // decrement or block
sem_post(&sem);   // increment

To use as mutex: sem_init(&sem, 0, 1)

Reader/Writer Locks

  • Distinguish read (shared) access from write (exclusive) access

  • Multiple readers can hold the lock concurrently

  • Writers require exclusive access

pthread_rwlock_t rwlock;
pthread_rwlock_rdlock(&rwlock);   // shared
pthread_rwlock_wrlock(&rwlock);   // exclusive
pthread_rwlock_unlock(&rwlock);

Implementation differences across platforms:

  • Recursive read_lock semantics (single vs. per-lock unlock)

  • Priority: can a reader upgrade to writer? Does writer priority block new readers?

  • Interaction with scheduling policy and thread priorities

Monitors

  • Higher-level construct that encapsulates: the shared resource, entry procedures, and condition variables

  • All locking/unlocking/checking happens automatically on entry/exit

  • Reduces programmer error (no manual lock/unlock pairing)

  • Historically: MESA language (Xerox PARC); today: Java (synchronized methods, internal object lock)

  • Also refers to the programming style using mutexes + condition variables for enter/exit critical section code

Other Constructs

  • Serializers: define priorities, hide signaling details

  • Path expressions: specify correct synchronization as a regular expression (e.g., “many reads or single write”)

  • Barriers: block all threads until N arrive (reverse of semaphore)

  • Rendezvous points: multiple threads must meet at a specific execution point

  • Wait-free synchronization: optimistic, lock-free approaches (e.g., Linux RCU — Read-Copy-Update)

All require hardware atomic instruction support at the lowest level.

Hardware Atomic Instructions

Examples: test-and-set, read-and-increment, compare-and-swap

Guarantees provided by hardware:

  • Atomicity: entire operation completes or none of it does

  • Mutual exclusion: only one atomic instruction at a time on the same memory location

  • Remaining operations are queued

Different platforms support different atomics with different efficiencies → synchronization code must be ported and optimized per platform.

Shared Memory Multiprocessors

  • Multiple CPUs share memory via interconnect (bus-based or switch/crossbar)

  • Caches hide memory latency (especially important under memory contention)

  • Write policies: no-write (invalidate cache), write-through, write-back

Cache Coherence

When multiple CPUs cache the same data:

  • Non-cache-coherent: software must maintain consistency

  • Cache-coherent: hardware ensures consistency

Two protocols:

  • Write-invalidate: on write, invalidate other cached copies; future reads cause cache miss → lower bandwidth (only send address, not data); amortized cost if data changes multiple times before others need it

  • Write-update: on write, update all cached copies; data immediately available on other CPUs → better for data needed immediately after update

Atomics and Cache Coherence

  • Atomic instructions bypass caches and go directly to memory controller

  • Creates a single serialization point for correctness

  • Triggers coherence traffic (invalidate/update) regardless of whether value actually changed

  • Consequence: atomics are more expensive on SMP systems (memory contention + cache coherence overhead)

Spinlock Performance Metrics

Three objectives (often conflicting):

  1. Low latency: time to acquire a free lock (ideally one atomic instruction)

  2. Low delay (waiting time): time from lock-freed to acquisition by a waiting thread

  3. Low contention: minimize traffic on shared bus/interconnect

Conflicts: reducing latency/delay requires aggressive spinning → increases contention.

Test-and-Set Spinlock

// lock = 0 (free), 1 (busy)
spinlock_lock(lock):
    while (test_and_set(lock) == busy);

spinlock_unlock(lock):
    lock = 0;
  • Latency: optimal (single atomic)

  • Delay: good (continuous spinning detects free immediately)

  • Contention: terrible — every spin iteration goes to memory (bypasses cache even on cache-coherent systems)

Test-and-Test-and-Set Spinlock (Spin on Read)

spinlock_lock(lock):
    while (lock == busy || test_and_set(lock) == busy);
  • Spins on cached value of lock; only executes atomic when cache indicates lock is free

  • With write-update coherence: all CPUs see lock freed simultaneously → all issue test_and_set → O(N) contention

  • With write-invalidate coherence: every failed test_and_set invalidates all caches → O(N²) contention on lock release

Delay-Based Spinlocks

Introduce delay to reduce simultaneous atomic attempts:

Delay after lock release detected:

spinlock_lock(lock):
    while (lock == busy);
    delay();
    // recheck and attempt test_and_set
  • Reduces contention; worsens delay when no contention

Delay after every reference:

  • Works on non-cache-coherent architectures

  • Hurts delay even more

Picking a delay:

  • Static (based on CPU ID × critical section length): simple, good under high load, wasteful under low load

  • Dynamic (random delay scaled by perceived contention): better under low load; contention estimated by counting failed test_and_set attempts; popular approach is exponential backoff

Anderson’s Queueing Lock

Solves the core problem: with other locks, all threads see the lock freed simultaneously and all try to acquire at once.

Design:

  • Array of flags with N elements (N = number of processors): values has-lock or must-wait

  • queuelast pointer tracks the tail

  • Arriving thread atomically increments queuelast (via read-and-increment) to get a ticket

  • Thread spins on its own array element (private lock)

  • On unlock: set next element to has-lock

// Initialization
flags[0] = has-lock;
flags[1..N-1] = must-wait;
queuelast = 0;

lock(flags, queuelast):
    myplace = read_and_increment(queuelast) % N;
    while (flags[myplace] == must-wait);  // spin on own element

unlock(flags, myplace):
    flags[myplace] = must-wait;
    flags[(myplace + 1) % N] = has-lock;

Performance:

  • Latency: worse (read-and-increment + modular arithmetic more complex than test-and-set)

  • Delay: excellent — next thread signaled directly

  • Contention: excellent — atomic executed once (not part of spin); spinning on different variables than the atomic target

Requirements:

  • Cache-coherent architecture

  • Each array element must be in a separate cache line (prevents false sharing)

  • Array size: N × cache_line_size (e.g., 32 CPUs × 64B = 2KB per lock)

Performance Comparison (Anderson’s Paper)

Platform: Sequent Symmetry (20 processors, cache-coherent write-invalidate).

Under high load (many processors contending):

  • Queueing lock: best (most scalable)

  • Test-and-test-and-set: worst (O(N²) contention on write-invalidate)

  • Static delay > dynamic delay (static distributes attempts more evenly)

  • Delay after every reference > delay after release (avoids extra invalidations)

Under low load (few processors):

  • Test-and-test-and-set: performs well (low latency)

  • Dynamic delay > static delay (avoids unnecessary waiting)

  • Queueing lock: worst (high latency overhead)

Key takeaway: no single spinlock design is universally best — the choice depends on expected workload, contention level, number of processors, and cache coherence protocol.