Distributed Shared Memory

Distributed Shared Memory Overview

Distributed shared memory (DSM) manages memory across multiple nodes so applications perceive a single shared address space (like an SMP). Each node owns some physical memory, services reads/writes from any node, and participates in consistency protocols to ensure shared accesses have meaningful semantics.

Key design decisions parallel distributed file systems:

  • Placement — where to store data (close to the process that created/uses it)

  • Migration — when and how to move data between nodes

  • Sharing rules — how operations are ordered and propagated across copies

Peer Distributed Applications

Unlike client-server DFS, many distributed applications (big data analytics, web search, social media processing) have no clear client/server distinction. Every node owns some state, processes it, and accesses state on other nodes — all nodes are peers.

  • Some designated nodes may still handle management/control-plane tasks (not fully peer-to-peer)

  • DSM is a concrete example: each node contributes memory and manages its portion while accessing remote memory

Why DSM Matters

  • Scales beyond single-machine memory limits — large-memory machines are disproportionately expensive; adding nodes is cheaper

  • Commodity RDMA interconnects (Remote Direct Memory Access) now offer low-enough latency to make remote memory access practical

  • Applications written for shared memory can run across nodes with careful data placement and consistency management

Hardware vs Software DSM

Hardware DSM — relies on specialized high-end interconnect NICs that:

  • Translate memory operations into network messages

  • Handle consistency and atomic operations in hardware

  • Allow the OS to map virtual addresses to remote physical memory

Expensive; used in supercomputing and ultra-high-end machines.

Software DSM — implemented at:

  • OS level: OS detects local vs remote accesses, generates messages, enforces consistency

  • Language/runtime level: runtime tracks local vs remote objects, generates communication; OS unchanged but less general (language-specific)

In hybrid (HW+SW) implementations, prefetching is the common software task (application-dependent, hard to do purely in hardware), while address translation and triggering invalidations are handled in hardware.

DSM Design: Sharing Granularity

In SMPs, coherence operates at cache-line granularity. For DSM, sending cache-line-sized messages over a network is too expensive. Larger granularities are used:

  • Variables — meaningful to programmers but often too small (e.g., integers → high overhead)

  • Pages — natural unit for OS-level DSM; OS tracks page modifications via MMU and triggers coherence messages; common page size = 4 KB amortizes remote access cost

  • Objects — natural for language/runtime-level DSM; runtime distinguishes local vs remote objects

With compiler support, application objects can be laid out on separate pages, enabling pure page-based OS mechanisms.

False Sharing

When two unrelated variables (x, y) reside on the same page, and two different nodes each exclusively access one variable, the DSM system sees concurrent writes to a shared page and triggers unnecessary coherence operations.

Mitigation: careful data layout by programmer or smart compiler that separates truly shared state from independently accessed state.

DSM Design: Access Algorithm

  • Single reader / single writer — DSM simply provides additional remote memory; no consistency challenge

  • Multiple readers / single writer — simpler special case

  • Multiple readers / multiple writers — most general; must correctly order writes and propagate updates so all nodes see a consistent view

DSM Design: Migration vs Replication

Migration — copy state to the requesting node. Suitable for single reader/single writer but causes thrashing with multiple concurrent accessors.

Replication (including caching) — copy state to multiple nodes:

  • Improves latency for subsequent accesses

  • Requires consistency management (overhead proportional to number of copies)

  • Can limit replica count to control overhead

Both caching and replication improve access latency for the general multi-reader/multi-writer case. Migration can increase latency in that scenario due to page thrashing.

When concurrent writes are heavy, caching/replication overhead can become excessive (cf. Sprite DFS disabling caching under concurrent writes).

DSM Design: Consistency Management

SMP write-invalidate/write-update coherence triggers on every write — too costly for distributed systems. DSM uses approaches similar to DFS:

Push (eager/pessimistic) — proactively send invalidations when data is modified:

  • Expects modified state will be needed elsewhere immediately

  • Higher bandwidth usage but fresher data

Pull (lazy/optimistic) — request information only when needed (on demand or periodically):

  • Hopes modified state won’t be needed elsewhere soon

  • Accumulates updates, reducing message count

  • When exactly coherence triggers depends on the consistency model

DSM Architecture

Structure:

  • Each node contributes part of its physical memory to the DSM pool

  • Remaining memory used for caches, replicas, and DSM metadata

  • All contributed pages form the global shared memory

  • Each page address = node ID + page frame number; node ID identifies the home node

Roles:

  • Home node — manages a page: tracks readers, writers, cached copies, locks, version; drives coherence operations

  • Owner — node currently holding exclusive write access; may differ from home node and can change over time. Home node tracks who the current owner is.

Additional features:

  • Caching improves latency; home/manager node enforces coherence

  • Dynamic enable/disable of caching (like Sprite)

  • Explicit replicas for load balancing, hot-spot avoidance, and fault tolerance (e.g., triplicate: local, same-rack, remote rack/datacenter)

Indexing Distributed State

Finding the manager for a page:

  • Simple: extract node ID from the page address → directly identifies home node

  • Flexible: use page ID (or hash of it) as index into a global mapping table replicated on every node; each entry encodes the manager node. Changing managers only requires updating the table, not the page identifier.

Per-page metadata is partitioned across home nodes, but the global map enabling manager lookup is replicated on all nodes.

Implementing DSM

The DSM layer must intercept every shared-memory access to determine if it is local or remote, and trigger messages accordingly. This overhead should not apply to non-shared, purely local pages.

Leveraging hardware MMU:

  • No valid virtual→physical mapping for remote page → page fault trap → OS passes to DSM layer → DSM sends remote request

  • Cached remote content is write-protected → write attempt traps → DSM triggers coherence

  • Additional page metadata (dirty bit, accessed bit) supports different coherence policies

Object-based DSM (language runtime): compiler generates checks on every object reference to determine local vs remote, modified vs unmodified.

Consistency Models

A consistency model is a guarantee about how state changes (memory accesses) will behave, provided upper software layers follow certain rules. It defines how reads/writes are ordered and when updates become visible.

Notation: R_m1(x) = read value x from location m1; W_m1(y) = write value y to m1. All memory initially zero.

Strict Consistency

  • Every update is instantaneously and globally visible in real-time order

  • All nodes see all writes in the exact same real-time order

  • Impossible in practice — even single-machine SMPs don’t guarantee this without synchronization; network latency and message reordering make it infeasible in distributed systems

Sequential Consistency

  • Updates from different processors may be arbitrarily interleaved

  • But all processes must observe the same interleaving

  • Updates from the same process are never reordered relative to each other

  • Equivalent to some legal sequential execution on a shared-memory machine

Example: P1 writes x to m1, P2 writes y to m2. Under sequential consistency, P3 may see m2 updated before m1 (or vice versa), but P4 must see them in the same order as P3.

Causal Consistency

  • If update B is causally related to update A (e.g., process read A’s value before writing B), then all processes must see A before B

  • Concurrent writes (no causal relationship) may appear in any order on different processors

  • Writes from the same process are never reordered

Example: P1 writes x to m1. P2 reads m1 (sees x), then writes y to m2. The write to m2 is causally dependent on the write to m1 → every process must observe m1=x before m2=y.

Weak Consistency

Introduces synchronization points beyond read/write:

  • A sync operation ensures:

    • All prior local updates become visible to other processors

    • All remote updates become visible locally (after their respective syncs)

  • Without sync, no ordering guarantees — processes may see updates in any order

  • Sync must be called by both the updating process and the observing process

Variants:

  • Per-variable or per-page sync (finer granularity)

  • Entry/exit points — separate acquire (pull others’ updates) from release (push own updates) to minimize data movement

Weak consistency reduces coherence overhead but requires additional DSM state to track synchronization semantics.

Key quiz results:

  • An execution where P3 and P4 see causally unrelated writes in different orders is not sequentially consistent but is causally consistent

  • An execution where causally related writes appear in wrong order is neither sequentially nor causally consistent

  • An execution is weakly consistent if no sync operations are used (no guarantees expected)

  • Weak consistency does not imply causal consistency — writes from a single processor can appear reordered without sync, violating causal consistency rules