Taming High Performance Computing with Compiler Technology
1 Taming High Performance Computing with Compiler Technology. John Mellor-Crummey, Department of Computer Science, Center for High Performance Software Research, Rice University
2 High Performance Computing Applications: scientific inquiry ranging from elementary particles to cosmology; pollution modeling and remediation planning; storm forecasting and climate prediction; advanced vehicle design; computational chemistry and drug design; molecular nanotechnology; cryptology; nuclear weapons stewardship.
3 High Performance Applications: algorithms, architectures, and data structures. Effective parallelization drives scalability, and single-processor performance can differ by integer factors.
4 Status of Highly-parallel Systems: "[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications." Information Technology Research: Investing in Our Future, PITAC Report to the President, 1999.
5 Challenges for Highly Parallel Computing: effective algorithms for complex problems; programming models and compilers; application development tools; operating systems for large-scale machines; better high-performance architecture designs.
6 Current Research Themes: compiler support for data-parallel programming (implicitly and explicitly parallel global address space languages); technology for auto-tuning software (automatically tailoring code to a microprocessor architecture); performance analysis tools (understanding application behavior on current systems); performance modeling (how will applications perform at different scales and on future systems?); compiler technology for scientific scripting languages (the R language for statistical programming).
7 Outline: motivation; compiler technology for HPC: compiling data-parallel languages, semi-automatic synthesis of performance models; challenges for the future; other work.
8 Compiling Data-parallel Languages: introduction (data parallelism, compiling HPF-like languages); the Rice dhpf compiler; data partitioning research; analysis and code generation; experimental results.
9 Data Parallelism: apply the same operation to many data elements; the operations need not be synchronous, nor completely uniform. Applicable to many problems in science and engineering.
10 Data Parallel Programming Alternatives. Hand-coded parallelization using library-based models: complete applicability, but difficult to design and implement, and all responsibility for tuning falls to the developer. Application frameworks: easy to use, but limited applicability. Single-threaded data-parallel languages: much more flexible than application frameworks, much simpler to use than hand-coded parallelization, and they offload the details of tuning from the developer; however, the compiler significantly determines performance, compilers are enormously complex, and you are out of luck if the compiler doesn't deliver performance.
11 Data Parallel Compilation with High Performance Fortran: the partitioning of data drives the partitioning of computation, communication, and synchronization. An HPF program is a Fortran program plus a data partitioning; compiling it for a parallel machine requires partitioning the computation, inserting communication, and managing storage, while producing the same answers as the sequential program.
12 Example HPF Program

CHPF$ processors P(3,3)
CHPF$ distribute A(block, block) onto P
CHPF$ distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))
        ENDDO
      ENDDO

[Figure: 3x3 processor grid P(0,0) through P(2,2); data for A and B laid out with a (BLOCK,BLOCK) distribution]
13 Compiling HPF-like Languages: partition data; select a mapping of computation to processors; analyze communication requirements; partition computation by reducing loop bounds; insert communication. A sketch of these steps for the example above follows.
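To make these steps concrete, here is a minimal sketch (illustrative Python, not the dhpf implementation) of how a 1D BLOCK distribution determines reduced loop bounds and the ghost rows each processor must receive for the Jacobi example above; all function names are assumptions for illustration.

```python
# Sketch: owner-computes compilation of the Jacobi example under a 1D BLOCK
# distribution of rows. Illustrative only -- not the dhpf algorithm itself.

def block_range(dim_size, nprocs, pid):
    """Index range [lo, hi] of the BLOCK-distributed rows owned by pid."""
    chunk = (dim_size + nprocs - 1) // nprocs
    lo = pid * chunk + 1                      # 1-based, as in the Fortran code
    hi = min((pid + 1) * chunk, dim_size)
    return lo, hi

def local_loop_bounds(n, nprocs, pid):
    """Reduce the global loop 'DO i = 2, n-1' to the iterations pid owns."""
    lo, hi = block_range(n, nprocs, pid)
    return max(lo, 2), min(hi, n - 1)

def ghost_rows_needed(n, nprocs, pid):
    """B(i-1,:) and B(i+1,:) references need one ghost row per partitioned edge."""
    lo, hi = block_range(n, nprocs, pid)
    ghosts = []
    if lo > 2:       # need row lo-1 from the neighbor above
        ghosts.append(lo - 1)
    if hi < n - 1:   # need row hi+1 from the neighbor below
        ghosts.append(hi + 1)
    return ghosts

n, nprocs = 100, 4
for pid in range(nprocs):
    print(pid, local_loop_bounds(n, nprocs, pid), ghost_rows_needed(n, nprocs, pid))
```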
14 The Devil is in the Details. Good data and computation partitionings are a must: without them, parallelism suffers. Excess communication undermines scalability: both frequency and volume must be right. Single-processor efficiency is critical: the code must use caches effectively, and node code must be amenable to optimization. Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations.
15 Rice dhpf Compiler. Achievements: parallelizes sequential codes with minimal rewriting; near hand-coded performance for tightly coupled codes. Innovations: sophisticated data partitionings; an abstract set-based framework for communication analysis and code generation; sophisticated computation partitionings, including partial replication to reduce communication; comprehensive optimizations.
16 Data Partitioning. Good parallel performance requires a suitable partitioning, and tightly-coupled computations are problematic. Line-sweep computations, e.g. ADI integration:

      do j = 1, n
        do i = 2, n
          a(i,j) = a(i-1,j)
        enddo
      enddo

The recurrence on i makes parallelization difficult with BLOCK partitionings.
17 Coarse-Grain Pipelining: compute along partitioned dimensions; partial serialization induces wavefront parallelism with block partitioning. [Figure: a line sweep pipelined across Processors 0-3]
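A toy schedule makes the wavefront visible. This sketch (illustrative Python, assuming unit time per strip) computes when each processor can start each strip of a pipelined sweep: processor p at strip k waits for its own strip k-1 and for strip k of the upstream processor.

```python
# Sketch: wavefront schedule induced by coarse-grain pipelining of a line
# sweep 'a(i,j) = f(a(i-1,j))' over a 1D BLOCK partition of i across procs.
# Each processor sweeps its block of rows one j-strip at a time; strip k on
# processor p can start only after strip k on processor p-1 is done.

def pipeline_schedule(nprocs, nstrips):
    """Return the start time of (proc, strip), assuming unit time per strip."""
    start = {}
    for k in range(nstrips):
        for p in range(nprocs):
            ready_local = start[(p, k - 1)] + 1 if k > 0 else 0
            ready_upstream = start[(p - 1, k)] + 1 if p > 0 else 0
            start[(p, k)] = max(ready_local, ready_upstream)
    return start

s = pipeline_schedule(nprocs=4, nstrips=8)
makespan = max(s.values()) + 1      # (nprocs - 1) + nstrips = 11, not 4 * 8
print(makespan)
```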
19 Parallelizing Line Sweeps. [Figure: comparison of compiler-generated coarse-grain pipelining against hand-coded multipartitioning]
20 Diagonal Multipartitioning: each processor owns one tile between each pair of cuts along each distributed dimension, which enables full parallelism for a sweep along any partitioned dimension. [Figure: diagonal tile-to-processor assignment for Processors 0-3]
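A minimal sketch of the 2D case: with p processors and a p x p tile grid, assigning tile (i, j) to processor (i + j) mod p yields a Latin-square layout in which every tile-row and tile-column contains each processor exactly once (the modular formula and tile counts here are illustrative assumptions for the 2D diagonal case).

```python
# Sketch: diagonal multipartitioning in 2D. The domain is cut into a p x p
# grid of tiles and tile (i, j) is owned by processor (i + j) mod p.

def owner(i, j, p):
    return (i + j) % p

p = 4
for i in range(p):
    print([owner(i, j, p) for j in range(p)])

# Each tile-row (and, by symmetry, each tile-column) is a permutation of
# 0..p-1, so a line sweep along either partitioned dimension keeps all p
# processors busy at every step of the sweep.
assert all(sorted(owner(i, j, p) for j in range(p)) == list(range(p))
           for i in range(p))
```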
22 Generalized Multipartitioning. Given an n-dimensional data domain and p processors, select which λ dimensions to partition (2 ≤ λ ≤ n) and how many cuts to make in each. Partitioning constraints: the number of tiles in each (λ-1)-dimensional hyperplane is a multiple of p, and there are no more cuts than necessary. Mapping constraints: load balance (in a hyperplane, each processor has the same number of tiles) and the neighbor property (in any particular direction, the neighbor of a given processor is the same). Objective function: minimize communication volume by picking the configuration of cuts that minimizes the total cross-section. IPDPS 2002 Best Paper in Algorithms; JPDC 2003.
23 Choosing the Best Partitioning: enumerate all elementary partitionings (the candidates depend on the factorization of p), evaluate their communication cost, and select the minimum-cost partitioning. The worst case occurs when p is a product of unique prime factors: there are d(d-1)/2 choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor, and (1 + o(1)) log p / log log p possible unique factors of p. Very fast in practice; a brute-force version of the search is sketched below.
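The following exhaustive sketch illustrates only the constraint and the objective function; the actual algorithm enumerates candidates from the prime factorization of p rather than searching all tile counts, and the names and bounds here are assumptions for illustration.

```python
# Brute-force sketch of choosing a multipartitioning: search tile counts per
# dimension (1 = dimension not partitioned), keep configurations where every
# (lambda-1)-dim hyperplane of tiles holds a multiple of p tiles, and pick
# the one with the smallest total cut cross-section (communication volume).
from itertools import product

def prod_except(tiles, part, k):
    r = 1
    for i in part:
        if i != k:
            r *= tiles[i]
    return r

def cross_section(dims, k):
    r = 1
    for i, e in enumerate(dims):
        if i != k:
            r *= e
    return r

def best_partitioning(dims, p, max_tiles=16):
    best = None
    d = len(dims)
    for tiles in product(range(1, max_tiles + 1), repeat=d):
        part = [k for k in range(d) if tiles[k] > 1]
        if len(part) < 2:                       # 2 <= lambda <= n
            continue
        if any(prod_except(tiles, part, k) % p != 0 for k in part):
            continue                            # hyperplane not a multiple of p
        cost = sum((tiles[k] - 1) * cross_section(dims, k) for k in range(d))
        if best is None or cost < best[0]:
            best = (cost, tiles)
    return best

# For p = 4 this finds the 2x2x2 three-dimensional multipartitioning.
print(best_partitioning((64, 64, 64), p=4))
```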
24 Mapping Tiles with Modular Mappings. [Figure: a basic tile shape replicated by modular shifts, with an integral number of shapes in each dimension]
25 Formal Compilation Framework. Three types of sets: data, iterations, processors. Three types of mappings: Layout maps data to processors, Reference maps iterations to data, and CompPart maps iterations to processors. Representation: integer tuples with Presburger arithmetic for constraints. Analysis: use set equations to compute the set(s) of interest, e.g. the iterations allocated to a processor and the communication sets. Code generation: synthesize loops from the set(s), e.g. parallel (SPMD) loop nests and message packing and unpacking. [Adve & Mellor-Crummey, PLDI 1998]
26 Why Symbolic Sets?

      processors P(3,3)
      distribute A(block, block) onto P
      distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))

Local section for P(x,y), and the iterations it executes:
{ [i,j] : 20x + 2 ≤ i ≤ 20x + 19 && 30y + 2 ≤ j ≤ 30y + 29 }
[Figure: the 3x3 data/loop partitioning across P(0,0) through P(2,2), highlighting the non-local data accessed by P(x,y) and the iterations that access non-local data]
27 Integer-Set Framework: Example

      real A(100)
      distribute A(BLOCK) on P(4)
      do i = 1, N
        ... = A(i-1) + A(i-2) + ...   ! ON_HOME A(i-1)
      enddo

symbolic N
Layout       := { [pid] -> [i] : 25*pid + 1 ≤ i ≤ 25*pid + 25 }
Loop         := { [i] : 1 ≤ i ≤ N }
CPSubscript  := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }
CompPart     := (Layout o CPSubscript^-1) ∩ Loop
DataAccessed := CompPart o RefSubscript
NonLocalDataAccessed := DataAccessed - Layout
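dhpf manipulates these sets symbolically with Presburger arithmetic; the sketch below (illustrative Python, not the compiler's representation) just enumerates them explicitly for N = 100 and 4 processors to show how the equations compose.

```python
# Sketch: evaluating the slide's set equations with explicit finite sets.

N, P = 100, 4
Layout = {(pid, i) for pid in range(P)
                   for i in range(25 * pid + 1, 25 * pid + 26)}
Loop = set(range(1, N + 1))
CPSubscript = {(i, i - 1) for i in Loop}     # ON_HOME A(i-1)
RefSubscript = {(i, i - 2) for i in Loop}    # the A(i-2) reference

def compose(rel1, rel2):
    """{(a, c) : (a, b) in rel1 and (b, c) in rel2}"""
    return {(a, c) for (a, b) in rel1 for (b2, c) in rel2 if b == b2}

def inverse(rel):
    return {(b, a) for (a, b) in rel}

# CompPart = (Layout o CPSubscript^-1) restricted to the loop's iterations
CompPart = {(pid, i) for (pid, i) in compose(Layout, inverse(CPSubscript))
            if i in Loop}
DataAccessed = compose(CompPart, RefSubscript)
NonLocal = DataAccessed - Layout

print(sorted(i for (pid, i) in CompPart if pid == 1))   # iterations 27..51
print(sorted(a for (pid, a) in NonLocal if pid == 1))   # [25]: one off-proc element
```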
28 Optimizations Using Integer Sets. Partially replicate computation to reduce communication: 66% lower message volume, 38% faster (NAS, 64 procs). Coalesce communication sets for multiple references: 41% lower message volume, 35% faster (NAS, 64 procs). Split loops into local-only and off-processor loops: 10% fewer D-cache misses, 9% faster (NAS). Processor-set constraints on communication sets: 12% fewer I-cache misses, 7% faster (NAS, 64 procs). PACT 2002 Best Student Paper (with Daniel Chavarría-Miranda).
29 Experimental Evaluation. The NAS SP and BT benchmarks from NASA Ames use ADI to solve the Navier-Stokes equation in 3D: forward and backward line sweeps on each dimension in each time step. Compare four variants: MPI hand-coded multipartitioning (NASA); dhpf with multipartitioning; dhpf with a 2D partitioning and coarse-grain pipelining; PGI's pghpf with a 1D partitioning and transpose. Platform: SGI Origin 2000, with SGI compilers + SGI MPI.
30 Efficiency for NAS SP (102³, Class B). [Plot: parallel efficiency; the coarse-grain pipelined version has similar communication volume but more serialization, and the 1D transpose version has more than 2x the multipartitioning communication volume]
31 Efficiency for NAS BT (102³, Class B). [Plot: parallel efficiency; the 1D transpose version has more than 2x the multipartitioning communication volume] Platform: SGI Origin 2000.
32 NAS BT Parallelizations. [Execution traces for NAS BT Class 'A' on 16 processors of an SGI Origin: hand-coded 3D multipartitioning vs. compiler-generated 3D multipartitioning]
33 Observations. High performance requires perfection: parallelism and load balance, communication frequency, communication volume, and scalar performance. Data-parallel compiler technology can ease the programming burden and yield near hand-coded performance.
34 Data-parallel Related Work. Linear equations/set-based compilation [Pugh et al.; Ancourt et al.; Amarasinghe & Lam]. Commercial HPF compilers: xlhpf, pghpf, xhpf; HPF/JA reached 14 teraflops on a code for the Earth Simulator. Lots of research compiler efforts, e.g. Polaris, CAPTOOLS. None support partially-replicated computation, none support multipartitioning, and none achieve linear scaling on tightly-coupled codes.
35 Outline: motivation; compiler technology for HPC: data-parallel programming systems, semi-automatic synthesis of performance models; challenges for the future; other work.
36 Why Performance Modeling? Insight into applications: barriers to scalability, insight into optimizations. Mapping applications to systems: Grid resource selection and scheduling, intelligent run-time adaptation. Workload-based design of future systems.
37 Modeling Challenges. Performance depends on architecture-specific factors, application characteristics, and input data parameters. It is difficult to model execution time directly, and collecting data at scale is expensive.
38 Approach. Separate the contribution of application characteristics; measure the application-specific factors with static and dynamic analysis; construct scalable models; explore interactions with hardware. Use binary analysis and instrumentation for language and programming-model independence. [Marin & Mellor-Crummey, SIGMETRICS 2004]
39 Toolkit Design Overview. [Diagram: object code feeds a binary analyzer (static analysis: control flow graph, loop nesting structure, basic-block instruction mix) and a binary instrumenter; the instrumented code is executed (dynamic analysis: basic-block counts, communication volume and frequency, memory reuse distance); an architecture-neutral model plus a post-processing tool with a scheduler and an architecture description yield a performance prediction for the target architecture]
40 Building Scalable Models. Collect data from multiple runs: n+1 runs to compute a model of degree n. Approximation function over a set of basis functions B_i: F(X) = c_n*B_n(X) + c_(n-1)*B_(n-1)(X) + ... + c_0*B_0(X). Include constraints; the goal is to determine the coefficients, using quadratic programming.
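As a simplified stand-in for the constrained quadratic program, the sketch below fits a degree-2 monomial basis by plain least squares; the data are synthetic, shaped like the degree-2 example on the next slide, and all names are assumptions for illustration.

```python
# Sketch: fit F(X) = c2*X^2 + c1*X + c0 to per-run measurements, then
# extrapolate to an unmeasured problem size. The toolkit instead solves a
# constrained quadratic program (e.g. to keep predicted counts non-negative).
import numpy as np

sizes = np.array([5.0, 10.0, 20.0, 40.0])   # n+1 runs for a degree-n model
counts = 482 * sizes**2 + 964                # synthetic measurements

def basis_row(x):
    # monomial basis B2, B1, B0 evaluated at problem size x
    return np.array([x**2, x, 1.0])

A = np.vstack([basis_row(x) for x in sizes])
coef, *_ = np.linalg.lstsq(A, counts, rcond=None)
print(coef)                       # ~ [482, 0, 964]
print(basis_row(103.0) @ coef)    # predict a larger, unmeasured run
```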
41-44 Execution Frequency Modeling Example. [Plot: execution frequency vs. problem size; collected data fitted with models of increasing degree] Model degree 0: Y = 41416, Err = 131%. Model degree 1: Y = 16776*X - 42366, Err = 60.4%. Model degree 2: Y = 482*X*X + 964, Err = 0%.
45 Predict Schedule Latency for an Architecture. Input: basic-block and edge execution frequencies. Methodology: recover executed paths; translate SPARC instructions to generic RISC; instantiate the scheduler for the target architecture; construct a schedule for the executed paths; determine inefficiencies.
47 Memory Reuse Distance. MRD: the number of unique data blocks referenced since the target block was last accessed.

reference:     I1  I2  I3  I2  I3  I2  I3
memory block:  A   B   A   C   A   B   B

MRD per reference: I1: 1 cold miss. I2: 2 cold misses, distance 2. I3: distance 0, distance 1.
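A small sketch that reproduces the distances for the trace above (illustrative Python; production tools use a tree for O(n log n), this is a quadratic-time scan):

```python
# Sketch: computing memory reuse distance (LRU stack distance) for a trace.
from collections import OrderedDict

def reuse_distances(trace):
    stack = OrderedDict()            # ordered by last access, MRU last
    out = []
    for block in trace:
        if block in stack:
            keys = list(stack)
            # unique blocks touched since this block's last access
            out.append(len(keys) - 1 - keys.index(block))
            del stack[block]
        else:
            out.append(None)         # cold miss
        stack[block] = True          # re-insert as most recently used
    return out

trace = ["A", "B", "A", "C", "A", "B", "B"]
print(reuse_distances(trace))        # [None, None, 1, None, 1, 2, 0]
```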
48 Memory Reuse Distance. [Figure]
49 Modeling Memory Reuse Distance. More complex than execution frequency: cold misses, a histogram of reuse distances, and a number of bins that is not constant. The average reuse distance is misleading: 1 access with distance 10,000 plus 3 accesses with distance 0 average to 2,500; if the cache has 1024 blocks, the average predicts that every access misses, when in fact only one does.
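The histogram, not the average, predicts misses: for a fully associative LRU cache with B blocks, an access misses exactly when its reuse distance is at least B (or it is a cold miss). A minimal sketch for the example above:

```python
# Sketch: predicting misses from reuse distances for a fully associative
# LRU cache; None marks a cold miss.

def predicted_misses(distances, cache_blocks):
    return sum(1 for d in distances if d is None or d >= cache_blocks)

distances = [10_000, 0, 0, 0]        # the slide's example: average is 2,500
print(predicted_misses(distances, cache_blocks=1024))   # 1 miss, not 4
```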
50 Modeling Memory Reuse Distance. [Histogram: normalized frequency (50%, 30%, 20%) vs. reuse distance]
51 Modeling Memory Reuse Distance. [Figure]
52 Predict Number of Cache Misses: instantiate the model for a given problem size. [Figure]
53 Prediction: NAS BT 3.0 Memory Hierarchy Utilization. [Plot: miss count / cell / time step vs. mesh size; L1, L2 (x10), and TLB (x10) misses, measured vs. predicted]
54 Prediction: NAS BT 3.0 Time on SGI Origin. [Plot: cycles / cell / time step vs. mesh size, from SPARC measurements to SGI Origin predictions; measured time vs. predicted time decomposed into scheduler latency and L1, L2, and TLB miss penalties]
55 Open Performance Modeling Issues. Short term: better modeling of the memory subsystem (e.g., the number of outstanding loads, to accurately predict memory latency); explore modeling of irregular applications. Long term: model parallel applications; the present modeling applies between synchronization points; combine with manually constructed parallel models; semi-automatically recover parallel trends; understand dynamic parallelism.
56 Modeling Related Work. Reuse distance: cache utilization [Beyls & D'Hollander]; investigating optimizations [Ding et al.]. Program instrumentation: EEL, QPT [Ball, Larus, Schnarr]. Scalable analytic models [Vernon et al.; Hoisie et al.]. Cross-architecture models at scale [Snavely et al.; Cascaval et al.]. Simulation (trace-based and execution-driven). None yield semi-automatically derived scalable models.
57 HPC Compiler Challenges for the Future. Programming systems for large-scale machines: abstraction and greater expressiveness are needed; potential parallelism must be readily accessible (implicit parallelism or explicit element-wise parallelism); locality and latency tolerance are both critical for performance; dynamic self-scheduled parallelism will be necessary; failure will occur and must be expected and handled; support for self-tuning software for complex architectures. Compiler-based tools: debugging and performance analysis of large-scale software on dynamic systems is a major open problem. Insight into hardware design: understanding the impact of proposed designs on whole programs.
58 Past Work. Multiprocessor synchronization: locks and synchronous barriers [ASPLOS89, TOCS91]; reader-writer synchronization [PPOPP91]; fuzzy barriers [IJPP94]. Parallel debugging: execution replay [JPDC90, TOC87]; software instruction counter [ASPLOS89]; detecting data races [WPDD93, SC91, SC90]. Parallel programming environments: ParaScope [PIEEE93], Dsystem [TPDT94]. Parallel applications: molecular dynamics [JCC92].
59 Ongoing Work. Global address space parallel languages: Co-array Fortran [LCPC03]. Performance analysis [TJS02, LACSI01, ICS01, SIGMETRICS01]. Improving node performance: irregular mesh and particle codes [ICS99, IJPP00]; sparse matrices [LACSI02, IJHPCA04]; multigrid [ICS01]; dense matrices [LACSI03]. Grid computing [IJHPCA01]. Library-based domain languages [JPDC01].