Algorithm Engineering with PRAM Algorithms

Bernard M.E. Moret (moret@cs.unm.edu)
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131
Rome School on Algorithm Engineering

Measuring Parallel Execution and Parallel Speedup
Algorithm Engineering for Parallel Algorithms (briefly)
Message-Passing Computations (briefly)
Shared-Memory Computations: Can We Finally Put to Use 30 Years of Research in PRAM Algorithms?

What to Measure
In a parallel execution, measure wall-clock time!
It takes into account all processors (all must finish!) as well as system overhead, but it requires a dedicated system and warm-up runs that allow threads to migrate to and stabilize on their processors.
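A minimal sketch of wall-clock timing around a multithreaded computation, assuming a POSIX environment; the worker function and thread count are placeholders, not part of the original slides:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4            /* placeholder thread count */

/* placeholder worker: each thread would run its share of the computation */
static void *worker(void *arg) { (void)arg; return NULL; }

int main(void) {
    pthread_t threads[NTHREADS];
    struct timespec start, stop;

    /* wall-clock (not per-thread CPU) time: start before the first thread is created */
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);       /* all processors must finish */

    clock_gettime(CLOCK_MONOTONIC, &stop);
    double elapsed = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_nsec - start.tv_nsec) / 1e9;
    printf("wall-clock time: %.6f s\n", elapsed);
    return 0;
}
```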

How to Compare
Compare to the best sequential algorithm (even though it will make the parallel code look bad when run on few CPUs).
Also compute speedup against (a cleaned-up version of) the parallel code run on a single processor; this gives a better idea of scaling.
Average over many instances of various types to avoid data biases that may favor specific numbers of CPUs.
In a multiprocessor environment, ensure that the single-CPU version has enough memory.
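The two comparisons written out as formulas (the notation is mine, not from the slides): absolute speedup uses the best sequential code, relative speedup uses the parallel code itself on one processor.

```latex
% T_seq(n): best sequential algorithm;  T_par(n,p): parallel code on p CPUs
S_{\mathrm{abs}}(p) = \frac{T_{\mathrm{seq}}(n)}{T_{\mathrm{par}}(n,p)},
\qquad
S_{\mathrm{rel}}(p) = \frac{T_{\mathrm{par}}(n,1)}{T_{\mathrm{par}}(n,p)},
\qquad
S_{\mathrm{rel}}(p) \ge S_{\mathrm{abs}}(p)
\ \text{ whenever }\ T_{\mathrm{par}}(n,1) \ge T_{\mathrm{seq}}(n).
```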

Parallel Algorithm Engineering
Does not exist...
We can reuse the techniques of algorithm engineering for sequential code; most gains will be found in the sequential part.
In distributed-memory (or virtual shared-memory) systems, the focus must be on the transmission of information.
In true shared-memory systems, the focus must be on synchronization and cache utilization.
The memory hierarchy is deeper and more complex.

Message-Passing Systems
Communication is the bottleneck.
Load balancing (data migration) is a problem.
PRAM algorithms have little in common with efficient message-passing code.
BSP or virtual shared memory cannot overcome the basic facts, but they do allow the programmer to use PRAM-style algorithms, although with a mandatory slowdown.

Shared-Memory Programming
Symmetric Multiprocessor (SMP) architectures
Uniform-Memory-Access (UMA) SMPs
PRAMs vs. UMA SMPs
A UMA SMP programming model
Our running example: ear decomposition
Implementation
Experimental setup
Experimental results

Symmetric Multiprocessor Architectures
SMP: several processors integrated into one machine, sharing memory with concurrent-read capability.
Small SMPs (up to 4 CPUs) are commodity items.
Larger SMPs are the choice for servers (large memory, high throughput, redundancy).
The largest SMP is the Sun Starcat, with 105 CPUs and 210 GB of memory.
Clusters of large SMPs (IBM Netfinity) form the next-generation terascale architecture.

Uniform-Memory-Access SMPs
Early SMPs (SGI Origin) had variable memory-access times, so efficiency depended on keeping items local to each processor, just as in message-passing style.
Sun and IBM Netfinity SMPs have true uniform-access shared memory.
UMA SMPs open the way to efficient parallel computing for complex, irregular problems.

Memory Access on Sun SMPs
Large SMPs have a deep memory hierarchy: 10 ns to 600 ns in the Sun E10K and 3 ns to 250 ns in the Starcat.
The 250 ns worst case is still two orders of magnitude faster than message passing.
We tested memory access from a single processor to the full addressable memory on our lab's E4500 and on the San Diego Supercomputer Center's E10K.
The results clearly show the cache sizes (L1 and L2).
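A minimal sketch of the kind of single-processor memory-latency probe described here, assuming a pointer-chasing loop over arrays of increasing size; the sizes, permutation scheme, and iteration count are illustrative choices of mine, not the ones used in the original experiments.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a random single cycle of indices so that every load depends on the
   previous one; the time per read steps up at each cache boundary. */
static double time_per_read(size_t n_elems, long reads) {
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return 0.0;                       /* allocation failed: skip */

    /* Sattolo's algorithm: random permutation consisting of one cycle
       (rand() is adequate for a sketch). */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;           /* 0 <= j < i */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    volatile size_t idx = 0;                     /* volatile: keep the chase */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < reads; r++)
        idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(next);
    return ns / reads;                           /* latency-dominated estimate */
}

int main(void) {
    /* log2(problem size) from 5 to 26, as in the plots on the next slide */
    for (int logn = 5; logn <= 26; logn++)
        printf("log2(n) = %2d : %.1f ns per read\n",
               logn, time_per_read((size_t)1 << logn, 10L * 1000 * 1000));
    return 0;
}
```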

Sun Enterprise Results
[Two plots, one for the E4500 and one for the E10000: time per memory read (ns, log scale) versus log2(problem size), from 5 to 26, for several access patterns (C,1 through C,64 and R); the curves step up at the cache boundaries.]

PRAMs and UMA SMPs
The similarities are striking:
all processors can access any memory location;
concurrent-read capability;
no need for messages.
Two major differences remain:
few processors: up to about 100, not polynomially many in the input size (n^O(1));
no lockstep synchronization: one must use software barriers.
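Since there is no lockstep execution, SMP code separates PRAM-style "steps" with explicit barriers. A minimal sketch using POSIX barriers (the phase functions and thread count are placeholders; the implementation discussed later used the primitives of the SIMPLE library rather than this exact API):

```c
#include <pthread.h>

#define NTHREADS 8                        /* placeholder thread count */
static pthread_barrier_t barrier;

/* placeholders for the work of two consecutive PRAM-style steps */
static void phase1(int id) { (void)id; }
static void phase2(int id) { (void)id; }

static void *worker(void *arg) {
    int id = (int)(long)arg;
    phase1(id);
    /* Every thread must finish phase 1 (and its writes must become visible)
       before any thread starts phase 2 -- the software stand-in for the
       PRAM's implicit lockstep synchronization. */
    pthread_barrier_wait(&barrier);
    phase2(id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```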

Validating the Model
A methodology for designing practical shared-memory algorithms on UMA shared-memory machines.
A fast and scalable shared-memory implementation of ear decomposition, demonstrating the first significant parallel speedup for this class of problems.
An example of experimental performance analysis for a nontrivial parallel implementation.

Linear Speedups on Irregular Graph Problems
[Plot: running time (ms, log scale) versus problem size (log2 n, from 8 to 17), with the graph type fixed and the graph size and number of processors variable.]

Linear Speedups on Irregular Graph Problems
[Plot: running time (ms, log scale) versus number of processors p (1 to 32), with the size fixed at 8192 and the type of sparse graph variable: random graphs with edge counts n, 2n, n^2/2, and n^2, a regular triangulation, a regular mesh, and a constrained Delaunay triangulation.]

Methodology: PRAM to SMP
How to partition the tasks (and data) among the very limited number of processors available?
How to optimize the use of caches?
How to minimize the work spent in synchronization (barrier calls)?
Good data and task partitioning, coupled with cache-sensitive coding, address the first two; barriers are trickier because of the need to flush caches.
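A minimal sketch of the usual data partitioning: each of the p threads owns one contiguous block of the n elements, which keeps per-thread accesses contiguous and cache-friendly. The names and the commented usage are mine, not from the original code.

```c
#include <stddef.h>

/* Compute the half-open range [*lo, *hi) of elements owned by thread `id`
   out of `p` threads, for `n` elements total.  Blocks differ in size by at
   most one element, so the load stays balanced. */
static void block_range(size_t n, int p, int id, size_t *lo, size_t *hi) {
    size_t base  = n / (size_t)p;        /* minimum block size           */
    size_t extra = n % (size_t)p;        /* first `extra` blocks get +1  */
    size_t i = (size_t)id;
    *lo = i * base + (i < extra ? i : extra);
    *hi = *lo + base + (i < extra ? 1u : 0u);
}

/* Hypothetical use inside a thread body: process only your own block.
   (n, NTHREADS, thread_id, and process are placeholders.)

static void *worker(void *arg) {
    size_t lo, hi;
    block_range(n, NTHREADS, thread_id(arg), &lo, &hi);
    for (size_t i = lo; i < hi; i++)
        process(i);
    return NULL;
}
*/
```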

Methodology: Complexity Model
We use a matching complexity model (Helman and JáJá, 1999) that captures contention at the processors and the difference between contiguous and noncontiguous memory accesses.
The analysis reports a triple (M_A, M_E, T_C), where
M_A is the number of noncontiguous accesses to main memory,
M_E is the amount of data exchanged with main memory, and
T_C is an upper bound on the local computational work.
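As an illustration (my own example, not from the slides), consider summing an array of n values with p processors, each scanning one contiguous block and then one processor combining the p partial sums: the block scans are contiguous, so only the combining step touches noncontiguous locations.

```latex
% Hypothetical cost triple for a block-distributed sum of n values on p processors
T(n,p) \;=\; \bigl(\, M_A,\; M_E,\; T_C \,\bigr)
       \;=\; \Bigl(\, O(p),\; O\!\left(\tfrac{n}{p}\right),\; O\!\left(\tfrac{n}{p} + p\right) \Bigr).
```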

Ear Decomposition
The ear decomposition of an undirected graph G = (V, E) is a partition of E into an ordered collection of simple paths (ears) such that each endpoint of a path is contained in an earlier path, but no interior point of a path is contained in any other path.
It is useful in its own right (structural analysis) and is a good decomposition step for complex graph algorithms (in lieu of DFS, for instance).
It has a simple linear-time sequential algorithm based on DFS.

Ear Decomposition
It is challenging to implement in parallel: it requires prefix sum, prefix product, list ranking, least common ancestors, tree raking, pipelined sorting, and spanning-tree construction.
Except for the prefix computations, none of these tasks has been shown to scale on message-passing architectures.
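To give the flavor of one of these building blocks, here is a sketch of PRAM-style list ranking by pointer jumping (Wyllie's algorithm, O(n log n) work), written as a sequential simulation in which each pass plays the role of one synchronous PRAM round; the array layout and names are mine, not taken from the original implementation.

```c
#include <stdlib.h>
#include <string.h>

/* List ranking by pointer jumping: succ_in[i] is the successor of node i in
   a linked list (succ_in[i] == i marks the tail).  On return, rank[i] holds
   the distance from i to the tail.  Each pass over the arrays simulates one
   synchronous PRAM round; a real SMP code would split each pass among the
   threads and separate rounds with barriers. */
void list_rank(size_t n, const size_t *succ_in, size_t *rank) {
    size_t *succ      = malloc(n * sizeof *succ);
    size_t *succ_next = malloc(n * sizeof *succ_next);
    size_t *rank_next = malloc(n * sizeof *rank_next);
    memcpy(succ, succ_in, n * sizeof *succ);

    for (size_t i = 0; i < n; i++)
        rank[i] = (succ[i] == i) ? 0 : 1;      /* one hop unless at the tail */

    /* ceil(log2 n) doubling rounds suffice */
    for (size_t round = 0; ((size_t)1 << round) < n || round == 0; round++) {
        /* double-buffer so the round behaves like a synchronous PRAM step */
        for (size_t i = 0; i < n; i++) {
            rank_next[i] = rank[i] + rank[succ[i]];
            succ_next[i] = succ[succ[i]];
        }
        memcpy(rank, rank_next, n * sizeof *rank);
        memcpy(succ, succ_next, n * sizeof *succ);
    }
    free(succ); free(succ_next); free(rank_next);
}
```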

Ear Decomposition: An Example
[Figure: an 11-vertex graph with edges e0 through e15, shown alongside its ear decomposition into ears Q0 through Q5.]

The PRAM Algorithm
Find a spanning tree.
Root the tree and compute the level of each node.
Each non-tree edge corresponds to a distinct ear, so find the least common ancestor (LCA) of the endpoints of each non-tree edge and label the edge by the level of its LCA.
Label each tree edge by choosing the smallest label of any non-tree edge whose cycle contains it.
The labels are then sorted.
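A compact sequential sketch of the labeling steps (spanning tree, levels, LCA labels for non-tree edges, and min-label propagation onto tree edges), meant only to make the steps concrete: it assumes a connected, simple, 2-edge-connected graph, uses naive O(n·m) label propagation instead of the tree computations the PRAM algorithm relies on, and all names are mine.

```c
#include <limits.h>
#include <stdlib.h>

typedef struct { int u, v; } Edge;

/* Input: a graph as an edge list (n vertices, m edges).  Output: label[e],
   the level of the LCA of the endpoints of the non-tree edge defining the
   ear that contains edge e.  Sorting edges by label then lists the ears
   (the full algorithm breaks ties by the non-tree edge's index so that
   distinct ears stay distinct). */
void ear_labels(int n, int m, const Edge *edge, int *label) {
    int *parent = malloc(n * sizeof *parent);   /* spanning-tree parent    */
    int *pedge  = malloc(n * sizeof *pedge);    /* edge id to the parent   */
    int *level  = malloc(n * sizeof *level);    /* depth in the tree       */
    int *head   = malloc(n * sizeof *head);     /* adjacency lists         */
    int *next   = malloc(2 * m * sizeof *next);
    int *to     = malloc(2 * m * sizeof *to);
    int *eid    = malloc(2 * m * sizeof *eid);
    int *stack  = malloc(n * sizeof *stack);

    for (int v = 0; v < n; v++) { head[v] = -1; parent[v] = -1; }
    for (int e = 0; e < m; e++) {               /* build adjacency lists   */
        int a[2] = { edge[e].u, edge[e].v };
        for (int s = 0; s < 2; s++) {
            int idx = 2 * e + s;
            to[idx] = a[1 - s]; eid[idx] = e;
            next[idx] = head[a[s]]; head[a[s]] = idx;
        }
    }

    /* Steps 1-2: spanning tree by an iterative search rooted at vertex 0,
       recording parent, parent edge, and level of every vertex. */
    int top = 0;
    stack[top++] = 0; parent[0] = 0; pedge[0] = -1; level[0] = 0;
    while (top > 0) {
        int v = stack[--top];
        for (int idx = head[v]; idx != -1; idx = next[idx]) {
            int w = to[idx];
            if (parent[w] == -1) {
                parent[w] = v; pedge[w] = eid[idx];
                level[w] = level[v] + 1;
                stack[top++] = w;
            }
        }
    }

    /* Step 3: each non-tree edge gets the level of the LCA of its endpoints. */
    for (int e = 0; e < m; e++) label[e] = INT_MAX;
    for (int e = 0; e < m; e++) {
        int u = edge[e].u, v = edge[e].v;
        if (pedge[u] == e || pedge[v] == e) continue;     /* tree edge      */
        int a = u, b = v;
        while (a != b) {                                  /* walk up to LCA */
            if (level[a] < level[b]) { int t = a; a = b; b = t; }
            a = parent[a];
        }
        label[e] = level[a];
        /* Step 4: push this label onto every tree edge of the cycle u..LCA..v. */
        for (int x = u; x != a; x = parent[x])
            if (label[pedge[x]] > label[e]) label[pedge[x]] = label[e];
        for (int x = v; x != a; x = parent[x])
            if (label[pedge[x]] > label[e]) label[pedge[x]] = label[e];
    }
    /* Step 5 (not shown): sort the edges by label to list the ears. */

    free(parent); free(pedge); free(level); free(head);
    free(next); free(to); free(eid); free(stack);
}
```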

SMP Implementation and Analysis
Find a spanning tree by successive grafting; each grafting pass takes O((m + n)/p) time, so the total cost is T(n, p) = (O(1), O(n/p), O(((m + n)/p) log n)).
Root the tree and compute the level of each node with the Euler-tour technique, in O(n/p) time and with O(n/p) noncontiguous memory accesses.
Computing the edge labels takes O(n/p) time and O(n/p) noncontiguous memory accesses.
The total algorithm runs in T(n, p) = (O(n/p), O(n/p), O(((m + n)/p) log n)).

Experimental Setup
Input: sparse graphs (regular and irregular, planar and nonplanar) with 2^8 to 2^18 vertices.
Environment: the SMP node library of SIMPLE (by D. Bader), built on top of POSIX threads.
Machines: our laboratory's E4500 (14 CPUs, 14 GB of memory) and the SDSC's E10K (64 CPUs, 64 GB of memory), both with 450 MHz SPARC II processors, each with a 16 KB on-chip direct-mapped L1 cache and an 8 MB L2 cache.

Classes of Input Graphs
Regular 4-connected mesh of n × n vertices.
Regular triangulated mesh, obtained from the above by adding an edge connecting each vertex to its down-and-right neighbor, if any.
Random planar graphs of varying density, from very sparse (few, if any, cycles) to nearly triangulated.
Constrained Delaunay triangulation of n random points in the unit square.

Experimental Results
We plot efficiency: the ratio of the measured speedup to the number of processors used (which is the ideal speedup).
A perfect implementation of a perfect parallel algorithm has a constant efficiency of 1.
We expect efficiency to increase as the problem size increases, because the relative influence of overhead shrinks; conversely, we expect it to decrease as the number of processors increases (for the same problem size).
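Written out (notation mine), with T(n, 1) the single-processor time of the parallel code and T(n, p) its time on p processors:

```latex
S(n,p) = \frac{T(n,1)}{T(n,p)}, \qquad
E(n,p) = \frac{S(n,p)}{p} = \frac{T(n,1)}{\,p\,T(n,p)\,}, \qquad
0 < E(n,p) \le 1 \ \text{ in the absence of superlinear effects.}
```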

Experimental Results
Sparse random graphs on the NPACI Sun E10K:
[Plot: efficiency (0 to 1) versus number of processors p (2 to 32), one curve per problem size, log2 n = 8 through 17.]

Experimental Results
Sparse random graphs on the NPACI Sun E10K:
[Plot: efficiency (0 to 1) versus problem size (log2 n = 8 through 17), one curve per processor count, p = 2, 4, 8, 16, 32.]

SMPs: Conclusions
The first practical linear speedup for a difficult combinatorial problem.
A good match between the model's predictions and the experimental results.
The transfer of PRAM algorithms to practical applications is possible on UMA SMPs.
This holds promise for a whole new range of applications for parallel computation.

Parallel Algorithm Engineering: Conclusions
We can and should develop parallel algorithm engineering: it will improve both the sequential and parallel parts of the code and yield new insights into the memory hierarchy, compiler development, and related topics.
Shared-memory implementations will be more challenging (more parameters), but we already have promising results.