Distributed Shared Memory¶
Distributed Shared Memory Overview¶
Distributed shared memory (DSM) manages memory across multiple nodes so applications perceive a single shared address space (like an SMP). Each node owns some physical memory, services reads/writes from any node, and participates in consistency protocols to ensure shared accesses have meaningful semantics.
Key design decisions parallel distributed file systems:
Placement — where to store data (close to the process that created/uses it)
Migration — when and how to move data between nodes
Sharing rules — how operations are ordered and propagated across copies
Peer Distributed Applications¶
Unlike client-server DFS, many distributed applications (big data analytics, web search, social media processing) have no clear client/server distinction. Every node owns some state, processes it, and accesses state on other nodes — all nodes are peers.
Some designated nodes may still handle management/control-plane tasks (not fully peer-to-peer)
DSM is a concrete example: each node contributes memory and manages its portion while accessing remote memory
Why DSM Matters¶
Scales beyond single-machine memory limits — large-memory machines are disproportionately expensive; adding nodes is cheaper
Commodity RDMA interconnects (Remote Direct Memory Access) now offer low-enough latency to make remote memory access practical
Applications written for shared memory can run across nodes with careful data placement and consistency management
Hardware vs Software DSM¶
Hardware DSM — relies on specialized high-end interconnect NICs that:
Translate memory operations into network messages
Handle consistency and atomic operations in hardware
Allow the OS to map virtual addresses to remote physical memory
Expensive; used in supercomputing and ultra-high-end machines.
Software DSM — implemented at:
OS level: OS detects local vs remote accesses, generates messages, enforces consistency
Language/runtime level: runtime tracks local vs remote objects, generates communication; OS unchanged but less general (language-specific)
In hybrid (HW+SW) implementations, prefetching is the common software task (application-dependent, hard to do purely in hardware), while address translation and triggering invalidations are handled in hardware.
DSM Design: Sharing Granularity¶
In SMPs, coherence operates at cache-line granularity. For DSM, sending cache-line-sized messages over a network is too expensive. Larger granularities are used:
Variables — meaningful to programmers but often too small (e.g., integers → high overhead)
Pages — natural unit for OS-level DSM; OS tracks page modifications via MMU and triggers coherence messages; common page size = 4 KB amortizes remote access cost
Objects — natural for language/runtime-level DSM; runtime distinguishes local vs remote objects
With compiler support, application objects can be laid out on separate pages, enabling pure page-based OS mechanisms.
False Sharing¶
When two unrelated variables (x, y) reside on the same page, and two different nodes each exclusively access one variable, the DSM system sees concurrent writes to a shared page and triggers unnecessary coherence operations.
Mitigation: careful data layout by programmer or smart compiler that separates truly shared state from independently accessed state.
DSM Design: Access Algorithm¶
Single reader / single writer — DSM simply provides additional remote memory; no consistency challenge
Multiple readers / single writer — simpler special case
Multiple readers / multiple writers — most general; must correctly order writes and propagate updates so all nodes see a consistent view
DSM Design: Migration vs Replication¶
Migration — copy state to the requesting node. Suitable for single reader/single writer but causes thrashing with multiple concurrent accessors.
Replication (including caching) — copy state to multiple nodes:
Improves latency for subsequent accesses
Requires consistency management (overhead proportional to number of copies)
Can limit replica count to control overhead
Both caching and replication improve access latency for the general multi-reader/multi-writer case. Migration can increase latency in that scenario due to page thrashing.
When concurrent writes are heavy, caching/replication overhead can become excessive (cf. Sprite DFS disabling caching under concurrent writes).
DSM Design: Consistency Management¶
SMP write-invalidate/write-update coherence triggers on every write — too costly for distributed systems. DSM uses approaches similar to DFS:
Push (eager/pessimistic) — proactively send invalidations when data is modified:
Expects modified state will be needed elsewhere immediately
Higher bandwidth usage but fresher data
Pull (lazy/optimistic) — request information only when needed (on demand or periodically):
Hopes modified state won’t be needed elsewhere soon
Accumulates updates, reducing message count
When exactly coherence triggers depends on the consistency model
DSM Architecture¶
Structure:
Each node contributes part of its physical memory to the DSM pool
Remaining memory used for caches, replicas, and DSM metadata
All contributed pages form the global shared memory
Each page address = node ID + page frame number; node ID identifies the home node
Roles:
Home node — manages a page: tracks readers, writers, cached copies, locks, version; drives coherence operations
Owner — node currently holding exclusive write access; may differ from home node and can change over time. Home node tracks who the current owner is.
Additional features:
Caching improves latency; home/manager node enforces coherence
Dynamic enable/disable of caching (like Sprite)
Explicit replicas for load balancing, hot-spot avoidance, and fault tolerance (e.g., triplicate: local, same-rack, remote rack/datacenter)
Indexing Distributed State¶
Finding the manager for a page:
Simple: extract node ID from the page address → directly identifies home node
Flexible: use page ID (or hash of it) as index into a global mapping table replicated on every node; each entry encodes the manager node. Changing managers only requires updating the table, not the page identifier.
Per-page metadata is partitioned across home nodes, but the global map enabling manager lookup is replicated on all nodes.
Implementing DSM¶
The DSM layer must intercept every shared-memory access to determine if it is local or remote, and trigger messages accordingly. This overhead should not apply to non-shared, purely local pages.
Leveraging hardware MMU:
No valid virtual→physical mapping for remote page → page fault trap → OS passes to DSM layer → DSM sends remote request
Cached remote content is write-protected → write attempt traps → DSM triggers coherence
Additional page metadata (dirty bit, accessed bit) supports different coherence policies
Object-based DSM (language runtime): compiler generates checks on every object reference to determine local vs remote, modified vs unmodified.
Consistency Models¶
A consistency model is a guarantee about how state changes (memory accesses) will behave, provided upper software layers follow certain rules. It defines how reads/writes are ordered and when updates become visible.
Notation: R_m1(x) = read value x from location m1; W_m1(y) = write value y to m1. All memory initially zero.
Strict Consistency¶
Every update is instantaneously and globally visible in real-time order
All nodes see all writes in the exact same real-time order
Impossible in practice — even single-machine SMPs don’t guarantee this without synchronization; network latency and message reordering make it infeasible in distributed systems
Sequential Consistency¶
Updates from different processors may be arbitrarily interleaved
But all processes must observe the same interleaving
Updates from the same process are never reordered relative to each other
Equivalent to some legal sequential execution on a shared-memory machine
Example: P1 writes x to m1, P2 writes y to m2. Under sequential consistency, P3 may see m2 updated before m1 (or vice versa), but P4 must see them in the same order as P3.
Causal Consistency¶
If update B is causally related to update A (e.g., process read A’s value before writing B), then all processes must see A before B
Concurrent writes (no causal relationship) may appear in any order on different processors
Writes from the same process are never reordered
Example: P1 writes x to m1. P2 reads m1 (sees x), then writes y to m2. The write to m2 is causally dependent on the write to m1 → every process must observe m1=x before m2=y.
Weak Consistency¶
Introduces synchronization points beyond read/write:
A sync operation ensures:
All prior local updates become visible to other processors
All remote updates become visible locally (after their respective syncs)
Without sync, no ordering guarantees — processes may see updates in any order
Sync must be called by both the updating process and the observing process
Variants:
Per-variable or per-page sync (finer granularity)
Entry/exit points — separate acquire (pull others’ updates) from release (push own updates) to minimize data movement
Weak consistency reduces coherence overhead but requires additional DSM state to track synchronization semantics.
Key quiz results:
An execution where P3 and P4 see causally unrelated writes in different orders is not sequentially consistent but is causally consistent
An execution where causally related writes appear in wrong order is neither sequentially nor causally consistent
An execution is weakly consistent if no sync operations are used (no guarantees expected)
Weak consistency does not imply causal consistency — writes from a single processor can appear reordered without sync, violating causal consistency rules