Overpartitioning with the Rice dhpf Compiler

Overpartitioning with the Rice dhpf Compiler: Strategies for Achieving High Performance in High Performance Fortran
Ken Kennedy, Rice University
http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf

Collaborators: Bradley Broom, Daniel Chavarria-Miranda, Guohua Jin, Rob Fowler, Ken Kennedy, John Mellor-Crummey, Qing Yi

Outline
- Overview of the dhpf framework
- Introduction to overpartitioning
  - Integer set framework
  - Virtual processors
- Multipartitioning
  - Strategy for load balance in a series of dimensional sweeps
- Out-of-core distributions
  - Use of overpartitioning to generalize
- Automatic conversion to recursive blocking
  - Use of iteration-space slicing and transitive dependence

Motivation and Goals
Goal: compile-time techniques that provide the highest performance,
- competitive with hand-coded parallel programs,
- with little or no recoding to suit the compiler.
To achieve this goal, a data-parallel compiler must:
- support complex choices for partitioning computation,
- facilitate analysis and code generation for sophisticated optimizations.
An abstract integer-set approach in dhpf makes sophisticated analysis and code generation practical.

Overpartitioning Framework in dhpf
A generalization of BLOCK partitionings:
- Partition computation into tiles
  - each processor will compute one or more tiles
  - symbolic variables specify tile extents
- Treat each tile as a virtual processor when analyzing data movement (i.e., communication, I/O)
- Generate code to orchestrate tiled computations
  - iterate over a processor's multiple tiles in the right order; must preserve a computation's data dependences
  - aggregate data movement for a processor's tiles when appropriate
    - yes: shift communication for all tiles
    - no: data movement for an out-of-core tile
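To make the tiling idea concrete, here is a minimal, self-contained sketch (my illustration, not dhpf-generated code) of a one-dimensional distribution overpartitioned into tiles: each processor owns several tiles, each tile acting as a virtual processor, and the processor iterates over its tiles in order. The sizes n, p, b and the rank myp are hypothetical.

program tile_sketch
  implicit none
  integer, parameter :: n = 1000, p = 4, b = 50   ! assumed problem size, processors, tile extent
  integer :: myp, t, lo, hi, i
  real :: a(n)
  a = 0.0
  myp = 0                                  ! rank of this (hypothetical) processor
  do t = myp, (n + b - 1)/b - 1, p         ! this processor's tiles, assigned round-robin
     lo = t*b + 1                          ! first element owned by tile t
     hi = min((t + 1)*b, n)                ! last element owned by tile t
     do i = lo, hi                         ! compute on one tile (one virtual processor)
        a(i) = a(i) + 1.0
     end do
  end do
  print *, 'elements updated by processor', myp, ':', count(a > 0.0)
end program tile_sketch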

Analysis Support: Integer Set Framework
Three types of sets: data, iterations, processors.
Three types of mappings:
- Layout: relates data and processors
- Reference: relates data and iterations
- Comp Part: relates iterations and processors
Representation: the Omega Library [Pugh et al.]
- relations: sets and mappings of integer tuples
- relation constraints expressed using Presburger arithmetic
Analysis: use set equations to compute
- the iterations allocated to a tile,
- the non-local data needed by a tile,
- the tiles that contain non-local data of interest, ...
Code generation from sets:
- use iteration sets to generate SPMD loops and statement guards
- use data-movement sets to generate messages or I/O
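As an illustration (the symbols below are my own shorthand for the three relations named above, with Ref taken as mapping iterations to the data they reference; this is not notation from the talk), the iterations assigned to a tile and the non-local data it needs can be written as set equations:

\[
  \mathrm{Iter}(t) \;=\; \mathrm{CompPart}^{-1}(t), \qquad
  \mathrm{NonLocal}(t) \;=\; \mathrm{Ref}\bigl(\mathrm{Iter}(t)\bigr)
      \;\setminus\; \mathrm{Layout}^{-1}(t)
\]

Here Ref(Iter(t)) is the data referenced by the iterations assigned to tile t and Layout^{-1}(t) is the data the tile owns; the Omega Library manipulates such compositions and differences of relations directly.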

Innovations Using the Set Framework
- Generalized computation partitionings
  - non-owning processors can compute values for data elements
  - select partitionings to reduce data-movement frequency
  - partially replicate computation to reduce data movement
  - data-movement analysis support for non-local writes
- Data-movement optimization, e.g., aggregation of communication for multiple tiles
- Sophisticated computation restructuring, e.g., local/non-local index-set splitting for loops
- Scalar code optimization: logical constraint propagation to reduce guards

Symbolic Product Constraints in HPF
HPF BLOCK distribution on an unknown number of processors P, block size B = ceil(N/P): processor p, 0 <= p <= P-1, owns elements pB+1 through pB+B (bounded by N).
HPF CYCLIC distribution on unknown P: processor p owns elements p, p+P, p+2P, ..., p+kP, ...
Challenge: symbolic products such as pB and kP cannot be expressed in Presburger arithmetic.
Approach:
- analysis: use virtual processors instead of physical processors
- code generation: augment the generated code with extra loops that perform the virtual-to-physical mapping
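To make the difficulty concrete (this formulation is mine, following the slide): with both p and B symbolic, BLOCK ownership involves the product pB, which Presburger arithmetic cannot express; treating each block origin v as a virtual processor replaces the product with a fresh variable:

\[
  \{\, i \;\mid\; pB + 1 \le i \le pB + B \,\}
  \quad\longrightarrow\quad
  \{\, i \;\mid\; v \le i \le v + B - 1 \,\}
\]

The mapping v = pB + 1 is then handled by the extra loops in the generated code rather than by the set analysis.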

Virtual Processor Model for Symbolics
For a BLOCK-distributed dimension, 0 <= p <= P-1 [Figure: 1-dimensional and 2-dimensional examples]:
- any block is a virtual processor (VP): VP v owns [v, v+B-1], 1 <= v <= N (this yields fictitious VPs)
- physical-to-virtual mapping p -> v : v = pB + 1
Computation: no extra code needed.
Communication: avoid messages to fictitious processors.

Multipartitioning: Motivation
- Data and computation partitioning strategies are critical for parallel performance.
- Standard HPF and OpenMP partitionings give good performance for loosely synchronous computations.
- Algorithms such as ADI are difficult to parallelize: they require tighter synchronization, and a dimensional sweep along a partitioned dimension induces serialization.

Parallelizing Line Sweeps via Block Partitionings
Transpose-based approach: local sweeps along x and z; transpose; local sweep along y; transpose back.
- Fully parallel computation
- High communication volume: transposes ALL data
- Performance lags behind coarse-grained pipelining [SC98]

Coarse-Grain Pipelining
- Partial wavefront-type parallelism
- Processors are idle part of the time
[Figure: pipelined sweep across a block partitioning over Processors 0-3]

Multipartitioning
- A type of skewed-cyclic distribution
- Each processor owns a section in each of the distributed dimensions
[Figure: a multipartitioned domain among Processors 0-3]
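A minimal sketch of one diagonal multipartitioning (an assumed layout for illustration; the exact dhpf mapping may differ in detail): with P processors and a P x P tiling, assigning the tile in block-row i and block-column j to processor mod(j - i, P) gives every processor exactly one tile in each row and each column of tiles, so a sweep along either dimension keeps all P processors busy.

program multipart_sketch
  implicit none
  integer, parameter :: p = 4        ! assumed number of processors
  integer :: i, j
  ! print the owner of each tile in a p x p skewed-cyclic multipartitioning
  do i = 0, p - 1
     print '(100i3)', (mod(j - i + p, p), j = 0, p - 1)
  end do
end program multipart_sketch

Each printed row lists the owners of one block-row of tiles; every row and every column contains each processor number exactly once.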

Evaluating Parallelizations
[Figure: comparison of compiler-generated coarse-grain pipelining vs. hand-coded multipartitioning]

Experimental Results
[Figure: execution traces for NAS SP Class A on 16 processors (September 2000), dhpf-generated multipartitioning vs. hand-coded multipartitioning]

Speedups
Measurements taken in September 2000.
[Figure: speedup vs. number of processors (1, 4, 9, 16) for hand-coded MPI and dhpf-generated MPI]

Multipartitioning Summary
- dhpf: first compiler support for multipartitioned data distributions
  - automatic code generation for complex multipartitioned distributions
- Scalar performance is within 13% of the sequential version
- Competitive parallel performance
  - message coalescing across multiple arrays will improve it
- Scalable computation partitioning: balanced parallelism
- Coexists with advanced optimizations in dhpf: non-owner-computes computation partitionings, partial computation replication, ...

Compilation for Out-of-Core Execution
The dhpf compiler recognizes new HPF-like directives for distributing data over virtual out-of-core (OOC) processors:
CSIO$ processors pio(100)
CSIO$ template tio(10000,10000)
CSIO$ align b(i,j) with tio(i,j)
CSIO$ distribute tio(*,block) onto pio
The dhpf compiler generates code to iterate over virtual processors, copy tiles to and from disk, and communicate between tiles.
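A rough sketch of the kind of code this implies (my illustration, not actual dhpf output; the file name, array sizes, and the number of virtual OOC processors are made up): the array columns are distributed (*,BLOCK) over virtual OOC processors, and the generated loop brings one tile into memory, computes on it, and writes it back.

program ooc_sketch
  implicit none
  integer, parameter :: n = 512, nvp = 8, bw = n/nvp   ! assumed sizes
  real :: tile(n, bw)                                  ! one in-core tile of columns
  integer :: vp
  ! record length is in bytes here (compiler-dependent unit)
  open(10, file='b_ooc.dat', access='direct', recl=4*n*bw, &
       form='unformatted', status='replace')
  tile = 0.0
  do vp = 1, nvp
     write(10, rec=vp) tile          ! create the out-of-core array on disk
  end do
  do vp = 1, nvp                     ! iterate over the virtual OOC processors
     read(10, rec=vp) tile           ! copy this tile from disk into memory
     tile = tile + 1.0               ! local computation while the tile is in core
     write(10, rec=vp) tile          ! copy the updated tile back to disk
  end do
  close(10, status='delete')
end program ooc_sketch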

Out-of-Core: a Special Case of Overpartitioning
- Out-of-core partitioning is similar to multipartitioning: both must generate code to iterate over virtual processors, and both must manage communication between tiles.
- OOC was implemented in dhpf using ad hoc code before the general overpartitioning framework was available.
- OOC and in-core distributions are restricted to disjoint dimensions.
- OOC partitioning could be better implemented as a client of the general overpartitioning framework: simpler, and it would allow more flexible combinations of OOC and in-core distributions.

Specific Requirements of OOC Partitioning
- The traversal order of OOC virtual processors significantly affects performance.
- Communication must be kept within the loop iterating over virtual processors: communication involving a tile can occur only while the tile is in memory.
- Communication between OOC tiles allocated to the same processor is very expensive: it requires extra I/O, retaining data in memory, or revisiting a tile.

Divide and Conquer by Compiler
Recursive algorithms are valuable:
- Locality in a multi-level memory hierarchy
  - comparable or better performance than blocking
  - less tuning for cache configurations
- Parallelism on multiprocessor systems
  - dynamic load balancing
Automatic recursion by compiler:
- transform loops into recursion automatically
- should work across different loop nests
- should be practical for real-world applications
Techniques: iteration-space slicing and transitive dependence analysis.
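As a small illustration of the loop-to-recursion idea on one of the benchmarks evaluated below, here is a hand-written sketch (not compiler output) of recursive blocking for matrix multiply: the (i, j) iteration space is split in half recursively until a sub-block is small enough to run the original loop nest; the cutoff bs is an assumed stand-in for "fits in cache".

recursive subroutine mm_recur(li, ui, lj, uj, n, a, b, c)
  implicit none
  integer, intent(in) :: li, ui, lj, uj, n
  real, intent(in)    :: a(n, n), b(n, n)
  real, intent(inout) :: c(n, n)
  integer, parameter  :: bs = 32            ! assumed cache-fitting block size
  integer :: i, j, k
  if (ui - li < bs .and. uj - lj < bs) then
     do j = lj, uj                          ! original loop nest on a small block
        do k = 1, n
           do i = li, ui
              c(i, j) = c(i, j) + a(i, k) * b(k, j)
           end do
        end do
     end do
  else if (ui - li >= uj - lj) then         ! split the larger extent in half
     call mm_recur(li, (li + ui)/2,     lj, uj, n, a, b, c)
     call mm_recur((li + ui)/2 + 1, ui, lj, uj, n, a, b, c)
  else
     call mm_recur(li, ui, lj, (lj + uj)/2,     n, a, b, c)
     call mm_recur(li, ui, (lj + uj)/2 + 1, uj, n, a, b, c)
  end if
end subroutine mm_recur

The whole product is computed by calling mm_recur(1, n, 1, n, n, a, b, c). The compiler-generated recursion for LU shown on the following slides has the same shape, but its legal sub-block bounds come from iteration-space slicing rather than from a simple halving of independent loops.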

Recursive Partition of Computation
LU factorization with no pivoting:
K: do k = 1, N-1
      do i = k+1, N
s1:      a(i,k) = a(i,k) / a(k,k)
      enddo
J:    do j = k+1, N
I:       do i = k+1, N
s2:         a(i,j) = a(i,j) - a(i,k) * a(k,j)      (key statement)
         enddo
      enddo
   enddo
Recursive loop bounds for a tile (Lbj : Ubj, Lbi : Ubi):
- key statement s2:  K: 1, min(Ubi,Ubj)-1;  J: max(k+1,Lbj), Ubj;  I: max(k+1,Lbi), Ubi
- statement s1:      K: max(Lbi,Lbj)-1, min(Ubj,Ubi)-1;  I: max(k+1,Lbi), Ubi

Iteration Space Slicing for Recursion
[Figure: iteration spaces of s1 and s2 over 1..N, with the regions Previous(s2), Current(s2), Previous(s1), and Current(s1) marked]
Previous(s1) = Before(s1,s2)[Previous(s2)]
Current(s1)  = Before(s1,s2)[Current(s2)] - Previous(s1)
Here Before(s1,s2) maps iterations of s2 to the iterations of s1 that must execute before them, so Current(s1) is exactly the new work that s1 must perform before the current slice of s2.

Transformed non-pivoting LU
LU_recur (Lbj, Ubj, Lbi, Ubi)
   if data fit in primary cache then
      do k = 1, min(N-1, Ubi-1, Ubj-1)
         if (k >= max(Lbj-1, Lbi-1)) then
            do i = max(k+1, Lbi), min(Ubi, N)
s1:            a(i,k) = a(i,k) / a(k,k)
            enddo
         endif
J:       do j = max(k+1, Lbj), min(Ubj, N)
I:          do i = max(k+1, Lbi), min(Ubi, N)
s2:            a(i,j) = a(i,j) - a(i,k) * a(k,j)
            enddo
         enddo
      enddo
   else
      call LU_recur(Lbj, (Lbj+Ubj)/2,    Lbi, (Lbi+Ubi)/2)        (left-top)
      call LU_recur(Lbj, (Lbj+Ubj)/2,    (Lbi+Ubi)/2+1, Ubi)      (right-top)
      call LU_recur((Lbj+Ubj)/2+1, Ubj,  Lbi, (Lbi+Ubi)/2)        (left-bottom)
      call LU_recur((Lbj+Ubj)/2+1, Ubj,  (Lbi+Ubi)/2+1, Ubi)      (right-bottom)
   endif

Experimental Evaluation
Benchmarks: MM, LU (non-pivoting), Cholesky, Erlebacher
Experimental environment: 195 MHz SGI R10000 workstation
- L1 cache: 32 KB, 2-way associative
- L2 cache: unified 1 MB, 2-way associative
Benchmark performance:
- significant improvement over the unblocked original code
- comparable or better performance than blocking

Performance of Matrix Multiply
[Figure: bar charts of cycles (billions), L1 misses (millions), and L2 misses (millions) for 513x513 and 1025x1025 matrices, comparing the original, block1, block2, and recursion versions]

Performance of non-pivoting LU
[Figure: for a 1024x1024 matrix, bar charts of cycles in billions (original 11.8, block 5.2, recursion 4.2), L1 misses in millions, and L2 misses in millions for the original, blocked, and recursive versions]

Summary
The dhpf compiler is designed to achieve the highest possible performance:
- Integer set framework
- Overpartitioning
  - virtual processors
- Multipartitioning
  - strategy for load balance in a series of dimensional sweeps
- Out-of-core distributions
  - can be generalized through the use of overpartitioning
- Automatic conversion to recursive blocking
  - powerful framework for transitive dependence and iteration-space slicing
  - direct conversion to recursive blocking for loop nests