SDSM Progression: Implementing Shared Memory on Distributed Systems


Sandhya Dwarkadas, University of Rochester

- TreadMarks: shared memory for networks of workstations
- Cashmere-2L: a 2-level shared memory system
- InterWeave: 3-level versioned shared state

Software Distributed Shared Memory

Why a Distributed Shared Memory (DSM) System?
- Motivation: parallelism utilizing commodity hardware, including a relatively high-latency interconnect between nodes
- Comparable to the pthreads library in functionality
- Implicit data communication (in contrast to an MPI-style approach to parallelism)
- Presents the same or a similar environment as shared-memory multiprocessor machines

Applications (SDSM)
- Correspondence problem [CAHPP97]
- Object recognition
- Volumetric reconstruction
- Intelligent environments
- CFD in astrophysics [CS00]
- Genetic linkage analysis [HH94, CBR95]
- Protein folding
- Laser fusion
- Cone beam tomography

Detecting Shared Accesses
- Virtual memory: page-based coherence unit
- Instrumentation: overhead on every read and write

Conventional SDSM System Using Virtual Memory [Li 86]

Problems
- Sequential consistency can cause large amounts of communication
- Communication is expensive (high latency) on a workstation network
- Performance problem: false sharing (illustrated in the sketch below)
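As an illustration of page-level false sharing (my example, not from the slides; it assumes 4 KB pages and GCC-style alignment attributes), the layout below places two counters, each written by a different process, on the same virtual-memory page. A page-granularity DSM protocol must move or invalidate the whole page on every write even though no data is actually shared; padding each counter to its own page removes the problem.

    /* Sketch of page-level false sharing (illustrative layout only). */

    #define PAGE_SIZE 4096              /* assumed page size */

    /* Bad: both counters fall within one page, so writes by process A
     * invalidate process B's copy of the page and vice versa. */
    struct falsely_shared {
        long counter_a;                 /* written only by process A */
        long counter_b;                 /* written only by process B */
    } __attribute__((aligned(PAGE_SIZE)));

    /* Better: pad each counter to a page of its own so the page-based
     * protocol never sees the two writers touching the same page. */
    struct padded_counter {
        long counter;
        char pad[PAGE_SIZE - sizeof(long)];
    } __attribute__((aligned(PAGE_SIZE)));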

Goals
- Keep the shared memory model
- Reduce communication using techniques such as:
  - lazy release consistency (to reduce the frequency of metadata exchange)
  - multiple-writer protocols (to address false-sharing overheads)

TreadMarks [USENIX 94, Computer 96]
- State-of-the-art software distributed shared memory system
- Page-based
- Lazy release consistency [Keleher et al. 92], using distributed vector timestamps
- Multiple-writer protocol [Carter et al. 91]

API
    tmk_startup(int argc, char **argv)
    tmk_exit(int status)
    tmk_malloc(unsigned size)
    tmk_free(char *ptr)
    tmk_barrier(unsigned id)
    tmk_lock_acquire(unsigned id)
    tmk_lock_release(unsigned id)

Eager Release Consistency
- Changes to memory pages are propagated to all nodes at the time of lock release
- Inefficient use of the network
- Can we improve this?
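To make the API above concrete, here is a minimal sketch of a TreadMarks-style program. Several details are assumptions rather than slide content: the header name tmk.h, the rank/count variables tmk_proc_id and tmk_nprocs, and the simplification that tmk_malloc calls issued in the same order on every process yield the same shared addresses (a real program distributes pointers explicitly, which the slides do not cover).

    /* Minimal sketch of a TreadMarks-style program using the API above. */
    #include <stdio.h>
    #include "tmk.h"                      /* assumed header name */

    #define N           1024
    #define LOCK_SUM    0
    #define BARRIER_ALL 0

    extern unsigned tmk_proc_id, tmk_nprocs;   /* assumed rank/count */

    int main(int argc, char **argv)
    {
        long *data, *sum;

        tmk_startup(argc, argv);          /* join the computation */

        /* Assumption: identical allocation order gives identical shared
         * addresses on every process (pointer distribution elided). */
        data = (long *) tmk_malloc(N * sizeof(long));
        sum  = (long *) tmk_malloc(sizeof(long));
        if (tmk_proc_id == 0)
            *sum = 0;
        tmk_barrier(BARRIER_ALL);         /* shared data is now set up */

        /* Each process fills a disjoint slice of the shared array. */
        long local = 0;
        for (unsigned i = tmk_proc_id; i < N; i += tmk_nprocs)
            local += (data[i] = i);

        /* Lock-protected update: under lazy release consistency the
         * change becomes visible to the next acquirer of LOCK_SUM. */
        tmk_lock_acquire(LOCK_SUM);
        *sum += local;
        tmk_lock_release(LOCK_SUM);

        tmk_barrier(BARRIER_ALL);
        if (tmk_proc_id == 0) {
            printf("sum = %ld\n", *sum);
            tmk_free((char *) data);
            tmk_free((char *) sum);
        }
        tmk_exit(0);
        return 0;
    }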

Lazy Release Consistency

Release Consistency: Eager vs. Lazy
- Eager: changes to memory are propagated to all nodes at lock release (previous slide)
- Lazy: synchronization of memory occurs upon a successful acquire of a lock; changes to memory piggyback on lock-acquire notifications. More efficient; TreadMarks uses this.

Vector Timestamps

Multiple Writer Protocols
- TreadMarks traps write accesses to TreadMarks pages using the VM system
- A copy of the page, a "twin", is created
- Memory pages are synced by generating a binary diff of the twin and the current copy of the page
- The remote node applies the diff to its current copy of the page (see the sketch below)
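The following is a minimal sketch of the twin-and-diff idea just described (illustrative only, not TreadMarks source; 4 KB pages are assumed, the struct and function names are hypothetical, and the diff is encoded naively as word-index/value pairs).

    /* Sketch of the twin/diff mechanism: snapshot a page before it is
     * modified, later encode only the words that changed. */
    #include <stdlib.h>
    #include <string.h>
    #include <stddef.h>

    #define PAGE_SIZE      4096
    #define WORDS_PER_PAGE (PAGE_SIZE / sizeof(unsigned long))

    struct diff_entry { size_t word_index; unsigned long value; };

    /* Make a twin: a private copy of the page taken before the first write. */
    unsigned long *make_twin(const unsigned long *page)
    {
        unsigned long *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* Compare the current page against its twin; record changed words. */
    size_t make_diff(const unsigned long *page, const unsigned long *twin,
                     struct diff_entry *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < WORDS_PER_PAGE; i++)
            if (page[i] != twin[i])
                out[n++] = (struct diff_entry){ i, page[i] };
        return n;                         /* number of diff entries */
    }

    /* The remote node applies the diff to its own copy of the page. */
    void apply_diff(unsigned long *page, const struct diff_entry *diff, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            page[diff[i].word_index] = diff[i].value;
    }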

Protocol Actions for TreadMarks

TreadMarks Implementation Overview
- Implemented entirely in user space
- Provides a TreadMarks heap [malloc()/free()] to programs; memory allocated from this heap is shared
- Several synchronization primitives: barriers, locks
- Memory page accesses (reads or writes) can be trapped using mprotect(); accessing a protected page causes a SIGSEGV (segmentation fault)
- TreadMarks installs a SIGSEGV handler that distinguishes faults on TreadMarks-owned pages; messages from other nodes are handled via a SIGIO handler
- Writing to a page causes an invalidation notice rather than a data update
- Uses vector timestamps to determine which causally related modifications are needed

TreadMarks Read Fault Example
- Remember: a read fault means that the local copy needs to be updated
- A page is initially loaded whole, not built up from diffs

TreadMarks Write Fault
- The program on P1 attempts a write to a protected page
- The MMU intercepts the operation and raises a signal
- The TreadMarks signal handler intercepts the signal and determines whether it applies to a TreadMarks page
- It flags the page as modified, creates a twin along the way, unprotects the page, and resumes execution at the write (see the sketch below)
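A minimal sketch of such a write-fault handler (illustrative, not TreadMarks code; names are hypothetical, 4 KB pages are assumed, and a real system would pre-allocate twins because malloc() is not async-signal-safe):

    /* Sketch of an mprotect()/SIGSEGV write-fault path: the handler twins
     * the page, marks it dirty, unprotects it, and lets the write retry. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096
    #define MAX_PAGES 1024

    static char *shared_base;             /* set when the region is mapped */
    static void *twin[MAX_PAGES];         /* pre-write snapshots           */
    static int   dirty[MAX_PAGES];        /* pages modified this interval  */

    static int page_index(void *addr)
    {
        uintptr_t off = (uintptr_t)addr - (uintptr_t)shared_base;
        return off < (uintptr_t)MAX_PAGES * PAGE_SIZE
               ? (int)(off / PAGE_SIZE) : -1;
    }

    static void write_fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        int idx = page_index(info->si_addr);
        if (idx < 0)
            abort();                      /* not a DSM page: a real fault */

        char *page = shared_base + (size_t)idx * PAGE_SIZE;

        /* The faulting write has not executed yet, so the page contents
         * are still the pre-write snapshot; unprotect, then twin it. */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        twin[idx] = malloc(PAGE_SIZE);    /* a real system pre-allocates */
        memcpy(twin[idx], page, PAGE_SIZE);
        dirty[idx] = 1;                   /* diffed against the twin later */
        /* Returning re-executes the faulting write instruction. */
    }

    void install_write_fault_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = write_fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }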

Implementation

TreadMarks Synchronization Events
- Suppose P1 has yielded a lock and P2 is acquiring it; P1 has modified pages
- Lazy release consistency tells us that an acquiring process needs the changes from the previous holder of the lock
- P2 flags those pages as invalid and uses mprotect() to trap reads and writes to them
- P1 has diffs for its changes up to this synchronization event

More on Synchronization Events
- TreadMarks may actually defer diff creation and simply flag that it needs to create a diff at some point; many programs with high locality benefit from this
- The set of updated pages (write notices) is constructed using vector timestamps
- Each process monitors its writes within each acquire-release block, or interval
- The set of write notices sent to an acquiring process is the union of all writes belonging to intervals at the releasing node that have not yet been performed at the acquiring node (see the sketch below)

TreadMarks Summary
- Lazy release consistency minimizes the need to push updates to other processors and allows updates to piggyback on lock acquisitions
- Multiple-writer protocols minimize false sharing and reduce update size
- Requires no kernel or compiler support
- Not good for applications with a high frequency of communication and synchronization
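A sketch of the interval/vector-timestamp bookkeeping described above (illustrative, not TreadMarks code; the types and function names are hypothetical):

    /* Each process numbers its own intervals (acquire..release blocks).
     * A vector timestamp records, per process, the newest interval whose
     * modifications have been seen locally. */
    #include <stddef.h>

    #define NPROCS    8
    #define MAX_PAGES 1024

    typedef unsigned vtime_t[NPROCS];        /* vector timestamp */

    struct interval {
        unsigned creator;                    /* process that ran the block */
        unsigned seqno;                      /* its index at that process  */
        int      modified[MAX_PAGES];        /* pages written in the block */
    };

    /* Interval (creator, seqno) has been performed at a node with vector
     * timestamp vt iff vt[creator] >= seqno. */
    static int performed(const vtime_t vt, const struct interval *iv)
    {
        return vt[iv->creator] >= iv->seqno;
    }

    /* On a lock transfer, the releaser sends write notices for every
     * interval it knows about that the acquirer's timestamp does not
     * cover; the acquirer then invalidates (mprotects) those pages. */
    size_t select_write_notices(const struct interval *known, size_t nknown,
                                const vtime_t acquirer_vt,
                                const struct interval **out)
    {
        size_t n = 0;
        for (size_t i = 0; i < nknown; i++)
            if (!performed(acquirer_vt, &known[i]))
                out[n++] = &known[i];
        return n;
    }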

Cashmere-2L [SOSP 97, TOCS 05]
- Tightly coupled, cost-effective shared memory computing

Motivation
- Take advantage of low-latency system-area networks
- Leverage the available hardware coherence within SMP nodes

Core Cashmere Features
- Virtual memory-based coherence
- Page-size coherence blocks
- Data-race-free programming model
- Moderately lazy release consistency protocol
- Multiple concurrent writers to a page
- Master copy of data at the home node
- Distributed directory-based coherence (see the sketch below)

Protocol Actions for Cashmere-2L
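A sketch of the per-page state a home-based, directory-coherent protocol like this might keep (my illustration, not Cashmere code; field and function names are hypothetical): each page has a designated home node holding the master copy, and a directory entry recording which nodes currently hold a copy.

    /* Sketch of per-page directory state for a home-based protocol. */
    #include <stdint.h>

    #define MAX_NODES 32                /* e.g., 8 SMP nodes x 4 CPUs */

    enum page_state { PAGE_INVALID, PAGE_READ_ONLY, PAGE_READ_WRITE };

    struct page_dir_entry {
        uint8_t  home_node;             /* node holding the master copy  */
        uint32_t sharers;               /* bitmask of nodes with a copy  */
        enum page_state local_state;    /* this node's view of the page  */
    };

    /* On a release, a writer sends its diffs to the home node (updating
     * the master copy) and write notices to every sharer of the page. */
    static inline int is_sharer(const struct page_dir_entry *e, unsigned node)
    {
        return (e->sharers >> node) & 1u;
    }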

Protocol Actions for Cashmere-2L and TreadMarks
[Figure: side-by-side protocol actions for Cashmere and TreadMarks]

Hardware/Software Interaction
- Software coherence operations are performed for the entire node
- Coalesce fetches of data to validate pages
- Coalesce updates of modified data to the home node
- Per-page timestamps within the SMP track remote-SMP events: last write notice received, last home-node update, last data fetch (see the sketch below)

Avoiding Redundant Fetches
[Figure: timeline of acquire/read/release operations on P1 and P2, with per-page "Logical Clock", "Last Page Fetch", and "Write Notice" timestamps determining when a page fetch is required and when it can be skipped]

Incoming Diffs
[Figure: incoming up-to-date data is compared against the twin; the differences are copied to both the working copy and the twin, bringing both up to date]
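One plausible formulation of the redundant-fetch check built on the per-page timestamps listed above (my sketch, not Cashmere code; field and function names are hypothetical):

    /* A page must be re-validated from its home node only if a write
     * notice newer than the last fetch has been received; otherwise the
     * local copy is still current and the fetch can be skipped. */
    struct page_timestamps {
        unsigned last_write_notice;   /* newest write notice for the page */
        unsigned last_home_update;    /* last flush of local changes home */
        unsigned last_fetch;          /* last fetch from the home node    */
    };

    int needs_fetch(const struct page_timestamps *ts)
    {
        return ts->last_write_notice > ts->last_fetch;
    }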

Experimental Platform
- 8-node cluster of 4-way SMPs, 32 processors total
- Early work: 233 MHz EV45s, 2 GB total memory; 5 us remote latency; 30 MB/s per-link bandwidth; 60 MB/s total bandwidth
- Nov. 1998: 600 MHz EV56s, 16 GB total memory; 3 us remote latency; 70 MB/s per-link bandwidth; >500 MB/s total bandwidth

Cashmere-2L Speedups
[Figure: speedup at 32 processors (scale 0 to 35) for SOR, LU, Water, TSP, Gauss, Ilink, EM3D, Barnes, and CFD]

Cashmere-2L vs. Cashmere-1L

Comparison to TreadMarks

Shasta [ASPLOS 96, HPCA 98]
- 256-byte coherence granularity
- Inline checks to intercept shared accesses (see the sketch after this slide group)
- Intelligent scheduling and batching of checks
- Directory-based protocol
- Release consistent (can provide sequential consistency)
- Can provide variable coherence granularity

Comparison to Shasta [HPCA 99]
[Figure: speedups at 16 processors (scale 0 to 14) for SPLASH-2 applications (Barnes, LU, LU-Def, Ocean, Raytrace, Volrend, WaterNSQ, WaterSP) and page-based S-DSM applications (Em3d, Gauss, Ilink, SOR, TSP), comparing Shasta, CSM-2L, Modified/Shasta, and Modified/CSM-2L]

Summary: Cashmere-2L
- Low-latency networks make directories a viable alternative for SDSM [ISCA 97]
- Remote-write capability is mainly useful for fast messaging and polling [HPCA 00]
- The two-level design exploits hardware coherence within the SMP [SOSP 97]:
  - sharing within an SMP uses hardware coherence
  - operations due to sharing across SMPs are coalesced

InterWeave [LCR 00, ICPP 02, ICDCS 03, PPoPP 03, IPDPS 04]
- Shared state in a distributed and heterogeneous environment
- Builds on recent work to create a 3-level system of sharing using:
  - hardware coherence within nodes
  - lazy release consistent software coherence across tightly coupled nodes (Cashmere [ISCA 97, SOSP 97, HPCA 99, HPCA 00])
  - versioned coherence across loosely coupled nodes (InterAct [LCR 98, ICDM 01])
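Referring back to the Shasta slide above, here is a sketch of the inline access check conceptually inserted before each shared load and store (my illustration, not Shasta's actual instrumentation; names are hypothetical). The check consults a per-256-byte-line state table and calls a software miss handler only on the slow path.

    /* Sketch of inline state checks at 256-byte line granularity. */
    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SHIFT 8                     /* 256-byte coherence lines */

    enum line_state { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE };

    extern uint8_t   line_state[];           /* one entry per line */
    extern uintptr_t shared_base;            /* start of shared region */

    void read_miss_handler(void *addr);      /* fetch line, run protocol   */
    void write_miss_handler(void *addr);     /* obtain exclusive ownership */

    /* Check inserted before a shared load. */
    static inline long checked_load(long *addr)
    {
        size_t line = ((uintptr_t)addr - shared_base) >> LINE_SHIFT;
        if (line_state[line] == LINE_INVALID)    /* inline fast-path check */
            read_miss_handler(addr);             /* slow path */
        return *addr;
    }

    /* Check inserted before a shared store. */
    static inline void checked_store(long *addr, long value)
    {
        size_t line = ((uintptr_t)addr - shared_base) >> LINE_SHIFT;
        if (line_state[line] != LINE_EXCLUSIVE)
            write_miss_handler(addr);
        *addr = value;
    }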

Motivation
- Most applications cache state: e-commerce, CSCW, multi-user games, peer-to-peer sharing, even web pages!
- Problems:
  - a major source of complexity (design and maintenance)
  - high overhead if naively written
- Solution: automate the caching!
  - much faster than simple messaging
  - much simpler than fancy messaging

Target Applications (Wide-Area)
- Compute engines with remote satellites
- Remote visualization and steering (astrophysics)
- Client-server division of labor
- Data mining (interactive, iterative)
- Distributed coordination and sharing
- Intelligent environments

Goal
- Ease of use
- Maximize utilization of the available hardware

Distributed Environment

Desired Model

InterWeave
[Figure: an InterWeave server holding segment data, connected over the Internet (heterogeneous machines, low bandwidth) to clients that each run the IW library with a local cache: a cluster (Fortran/C), a desktop (C/C++), and a handheld device (Java)]

InterWeave API
- Data mapping and management as segments, named by URL:
    server = "iw.cs.rochester.edu/simulation/sega";
    IW_handle_t h = IW_open_segment(server);
- Synchronization using reader-writer locks: IW_wl_acquire(h), IW_rl_acquire(h)
- Data allocation: p = (part*) IW_malloc(h, part_desc);
- Coherence requirement specification: IW_use_coherence(h)
- Data access using ordinary reads and writes
(A usage sketch follows at the end of this section.)

InterWeave Design Highlights
- Heterogeneity
  - Transparently handle optimized communication across multiple machine types and operating systems
  - Leverage language reflection mechanisms and the use of an IDL
  - Two-way machine-independent wire-format diffing
  - Handle multiple languages
- Application-specific coherence information
  - Relaxed coherence models and dynamic views
  - Hash-based consistency
- A multi-level shared memory structure
  - Leverage the coherence and consistency management provided by the underlying hardware and software distributed memory systems
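Tying the API slide above together, here is a minimal usage sketch. Only the calls shown on the slide are taken from the source; the header name interweave.h, the lock-release calls IW_wl_release/IW_rl_release, and the type-descriptor type IW_type_desc_t are assumptions for illustration.

    /* Minimal sketch of InterWeave segment usage (illustrative). */
    #include "interweave.h"             /* assumed header name */

    typedef struct { double x, y, z; } part;
    extern IW_type_desc_t part_desc;    /* assumed IDL-derived descriptor */

    void update_simulation(void)
    {
        /* Segments are named by URL. */
        const char *server = "iw.cs.rochester.edu/simulation/sega";
        IW_handle_t h = IW_open_segment(server);

        /* Writer lock: obtain an up-to-date copy plus the right to modify. */
        IW_wl_acquire(h);
        part *p = (part *) IW_malloc(h, part_desc);
        p->x = p->y = p->z = 0.0;       /* ordinary reads and writes */
        IW_wl_release(h);               /* assumed counterpart of acquire */

        /* Reader lock: the coherence requirement (IW_use_coherence) decides
         * how fresh the cached copy must be before the lock is granted. */
        IW_rl_acquire(h);
        /* ... read p and other segment data ... */
        IW_rl_release(h);               /* assumed */
    }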