Compilation for Heterogeneous Platforms

Compilation for Heterogeneous Platforms: Grid in a Box and on a Chip
Ken Kennedy, Rice University
http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf

Senior Researchers
- Ken Kennedy
- John Mellor-Crummey
- Zoran Budimlic
- Keith Cooper
- Tim Harvey
- Guohua Jin
- Charles Koelbel
Plus: Rice is recruiting for an open HPC faculty slot

Background: Scheduling for Grids
Virtual Grid Application Development Software (VGrADS) Project

Scheduling Grid Workflows
Vision: automate scheduling of DAG workflows onto Grid resources
Strategy
- Application includes performance models for all workflow steps
  - Performance models are constructed automatically
- Off-line scheduler maps applications onto Grid resources in advance of execution
  - Uses performance models as surrogates for execution time
  - Models data-movement costs
  - Uses heuristic schedulers, since the problem is NP-complete (sketch below)
Results
- Dramatic reduction in time to completion of the workflow (makespan)
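
To make the offline mapping concrete, here is a minimal sketch of a greedy min-estimated-completion-time scheduler in C. The task chain, the perf_model and xfer tables, and all numbers are made-up placeholders standing in for real performance models; this is not the VGrADS scheduler itself.

    #include <stdio.h>

    #define NTASKS 4
    #define NRES   2

    /* perf_model[t][r]: modeled execution time of task t on resource r.  */
    /* xfer[t][r]: modeled cost of moving task t's inputs to resource r.  */
    /* Both tables are illustrative stand-ins for real performance models. */
    static double perf_model[NTASKS][NRES] = {
        {10.0, 4.0}, {8.0, 6.0}, {12.0, 5.0}, {6.0, 6.5}
    };
    static double xfer[NTASKS][NRES] = {
        {1.0, 3.0}, {0.5, 2.0}, {2.0, 1.0}, {1.5, 0.5}
    };

    int main(void) {
        double ready[NRES] = {0};   /* earliest free time per resource   */
        double finish_prev = 0.0;   /* tasks form a chain, for brevity   */
        for (int t = 0; t < NTASKS; t++) {
            int best = 0;
            double best_fin = 1e30;
            for (int r = 0; r < NRES; r++) {
                double start = ready[r] > finish_prev ? ready[r] : finish_prev;
                double fin = start + xfer[t][r] + perf_model[t][r];
                if (fin < best_fin) { best_fin = fin; best = r; }
            }
            ready[best] = best_fin;
            finish_prev = best_fin;
            printf("task %d -> resource %d, finishes at %.1f\n", t, best, best_fin);
        }
        printf("makespan: %.1f\n", finish_prev);
        return 0;
    }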

Workflow Scheduling Results
- Dramatic makespan reduction of offline scheduling over online scheduling
  - Application: Montage
- Value of performance models and heuristics for offline scheduling
  - Application: EMAN
[Chart: simulated makespan, online vs. offline scheduling on a heterogeneous platform (compute-intensive case), for compute factors CF=1, CF=10, CF=100]
References
- "Resource Allocation Strategies for Workflows in Grids," CCGrid 05
- "Scheduling Strategies for Mapping Application Workflows onto the Grid," HPDC 05

Performance Model Construction
Vision: automatic performance modeling
- Built from the binary for a single resource plus execution profiles
- Generate a distinct model for each target resource
Research
- Uniprocessor modeling
  - Can be extended to parallel MPI steps
- Memory hierarchy behavior
- Models for instruction mix
- Application-specific models
- Scheduling

Performance Prediction Overview
[Diagram: object code feeds two paths. Dynamic analysis: a binary instrumenter produces instrumented code, which is executed to collect BB counts, communication volume & frequency, and memory reuse distance. Static analysis: a binary analyzer extracts the control flow graph, loop nesting structure, and BB instruction mix, yielding an architecture-neutral model. A post-processing tool combines both with an architecture description to produce a performance prediction for the target architecture, which drives the scheduler.]

Modeling Memory Reuse Distance
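
The slide's figure is not reproduced here, but the quantity it models is simple to state: the reuse distance of an access is the number of distinct locations touched since the previous access to the same location. A deliberately naive O(n^2) sketch over a recorded address trace (production tools use tree-based algorithms; the trace and names are illustrative):

    #include <stdio.h>

    /* Reuse distance of access i: number of DISTINCT addresses touched
     * since the previous access to trace[i]; -1 if it is a cold access. */
    static int reuse_distance(const unsigned long *trace, int i) {
        int prev = -1;
        for (int j = i - 1; j >= 0; j--)        /* find previous use */
            if (trace[j] == trace[i]) { prev = j; break; }
        if (prev < 0) return -1;                /* cold (first) access */
        int distinct = 0;
        for (int j = prev + 1; j < i; j++) {    /* count distinct addrs */
            int seen = 0;
            for (int k = prev + 1; k < j; k++)
                if (trace[k] == trace[j]) { seen = 1; break; }
            distinct += !seen;
        }
        return distinct;
    }

    int main(void) {
        unsigned long trace[] = {1, 2, 3, 1, 2, 2};
        int n = sizeof trace / sizeof trace[0];
        for (int i = 0; i < n; i++)
            printf("access %d (addr %lu): distance %d\n",
                   i, trace[i], reuse_distance(trace, i));
        return 0;
    }

In a fully associative LRU cache with C lines, an access hits exactly when its reuse distance is less than C, which is why one measurement can parameterize every level of the hierarchy; limited associativity is then handled by the conflict-miss correction mentioned on the next slide.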

Execution Behavior: NAS LU 2D 3.0

Observations
- Performance models focus on the uniprocessor memory hierarchy
  - Reuse distance is the key at all levels
  - Corrected for conflict misses
- Single-processor performance models for programmable scratch memories are easier
  - No conflict misses, no complex replacement policies
  - But the compiler challenges are harder
- Performance models for parallel applications are complex, but possible
  - Many successful hand constructions (e.g., Hoisie's work at LANL)
- With uniprocessor performance models, good workflow scheduling is achievable
  - Based on GrADS/VGrADS experience: performance models make Grid utilization better

Application to Heterogeneous Platforms
- Assume a processor with different computational components
  - CPU + GPU or FPGA
  - Cell: PPE + SPEs
  - Cray: commodity micro (AMD) + vector/multithreaded unit
- Consider data-movement costs (bandwidth and local memory size)
- Partition application loop nests into computations that have affinity to different units
  - Example: vectorizable loops versus scalar computations
  - Transformations to distribute loops onto the different types of unit (see the sketch below)
  - Transformations (tiling and fusion) to increase local reuse
  - May need different partitionings due to data-movement costs
- Schedule computations onto units, estimating run time including data movement
  - Scheduling will favor reuse from local memories
  - The best mapping may involve alternative partitionings
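
A minimal, hypothetical illustration of the loop-distribution step named above: one loop is split so the independent statement can be mapped to a vector unit or SPE while the recurrence stays on the scalar core.

    /* Original loop: S1 is vectorizable, but S2 carries a recurrence
     * (s[i] depends on s[i-1]), so the loop as a whole cannot vectorize. */
    void original_loop(float *a, const float *b, const float *c,
                       float *s, int n) {
        for (int i = 1; i < n; i++) {
            a[i] = b[i] + c[i];      /* S1: independent across iterations */
            s[i] = s[i-1] + a[i];    /* S2: loop-carried dependence       */
        }
    }

    /* After loop distribution: S1 becomes its own vectorizable loop that
     * can go to a vector unit; S2 remains a scalar loop for the CPU/PPE.
     * Legal because S2 only reads a[i] values S1 has already produced.   */
    void distributed(float *a, const float *b, const float *c,
                     float *s, int n) {
        for (int i = 1; i < n; i++)
            a[i] = b[i] + c[i];      /* candidate for SIMD / accelerator */
        for (int i = 1; i < n; i++)
            s[i] = s[i-1] + a[i];    /* stays scalar                     */
    }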

Multicore and Cell: Compilers and Tools for Next-Generation Chip Architectures

Bandwidth Management
- Multicore raises computational power rapidly
  - Bandwidth onto the chip is unlikely to keep up
- Multicore systems will feature shared caches
  - Replaces false sharing with an enhanced probability of conflict misses
- Challenges for effective use of bandwidth
  - Enhancing reuse when multiple processors are using the cache (fusion sketch below)
  - Reorganizing data to increase the density of cache-block use
  - Reorganizing computation to ensure reuse of data by multiple cores
    - Inter-core pipelining
  - Managing conflict misses, with and without architectural help
    - Without architectural help: data reorganization within pages, and synchronization to minimize conflict misses
    - May require special memory-allocation run-time primitives
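
A small sketch of the reuse-enhancement point: fusing two sweeps over the same array halves the off-chip traffic for that array. Function names and types are illustrative.

    /* Two sweeps over b: the second reloads every element from memory. */
    void unfused(float *a, float *c, const float *b, int n) {
        for (int i = 0; i < n; i++) a[i] = b[i] * 2.0f;
        for (int i = 0; i < n; i++) c[i] = b[i] + a[i];
    }

    /* Fused: each b[i] (and a[i]) is consumed while still in cache or
     * registers, halving the demand on off-chip bandwidth for b.      */
    void fused(float *a, float *c, const float *b, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = b[i] * 2.0f;
            c[i] = b[i] + a[i];
        }
    }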

Conflict Misses
Unfortunate fact: if a scientific calculation is sweeping across strips of more than k arrays on a machine with k-way associativity, and all the strips overlap in one associativity group, then every access to that group's location is a miss.
[Figure: three array strips (1, 2, 3) mapping to a single group of a 2-way associative cache]
- On each outer-loop iteration, strip 1 evicts 2, which evicts 3, which evicts 1
- In a 2-way associative cache, all of these accesses are misses
- This limits loop fusion, a profitable reuse strategy (padding sketch below)
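
A hedged sketch of this failure mode and its classic repair: if three array strips start at the same cache-size-aligned offset, they all contend for one group of a 2-way cache, while skewing each base address by a few blocks spreads them across groups. Sizes and the allocation scheme are illustrative, not a measured configuration.

    #include <stdlib.h>

    #define CACHE_BYTES (256 * 1024)   /* illustrative cache size */

    /* If A, B, and C are each allocated at a multiple of the cache size,
     * A[i], B[i], and C[i] all map to the SAME associativity group; with
     * 2-way associativity the three streams evict each other every pass. */
    float *alloc_conflicting(void) {
        return aligned_alloc(CACHE_BYTES, CACHE_BYTES);  /* worst case */
    }

    /* Fix: skew each array's start by a few cache blocks so the three
     * streams fall into different associativity groups. Size stays a
     * multiple of the alignment, as C11 aligned_alloc requires.        */
    float *alloc_padded(int which) {
        char *p = aligned_alloc(CACHE_BYTES, 2 * CACHE_BYTES);
        return (float *)(p + which * 128);  /* 128B = a few 32B blocks */
    }

    void sweep(float *a, float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];  /* 3 strips > 2 ways: thrashes if aligned */
    }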

Controlling Conflicts: An Example
Cache and page parameters
- 256K cache, 4-way set associative, 32-byte blocks
  - 1024 associativity groups
- 64K page
  - 2048 cache blocks
  - Each block in a page maps to a uniquely determined associativity group
  - 2 different lines in a page map to the same associativity group
In general
- Let A = number of associativity groups in the cache
- Let P = number of cache blocks in a page
- If P >= A, then each block in a page maps to a single associativity group, no matter where the page is loaded
- If P < A, then a block can map to A/P different associativity groups, depending on where the page is loaded (see the sketch below)
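
The page-mapping arithmetic above is easy to check in code. A tiny sketch using the slide's stated values (A = 1024 groups, P = 2048 blocks per page); the helper function is hypothetical:

    #include <stdio.h>

    #define A 1024   /* associativity groups in the cache (per slide)   */
    #define P 2048   /* cache blocks in one 64K page with 32-byte blocks */

    /* Group that block b of a page maps to. When P >= A the page offset
     * alone fixes the group, independent of where the page is loaded;
     * when P < A, the physical page frame contributes the missing bits. */
    static int group_of(int block_in_page, int page_frame) {
        if (P >= A)
            return block_in_page % A;
        return (page_frame * P + block_in_page) % A;
    }

    int main(void) {
        /* With P = 2*A, exactly two blocks of every page share a group: */
        printf("block 0 -> group %d\n", group_of(0, /*frame*/ 7));
        printf("block %d -> group %d (same group)\n", A, group_of(A, 7));
        return 0;
    }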

Questions
- Can we do data allocation precisely within a page so that conflict misses are minimized for a given computation?
  - Extensive work on minimizing self-conflict misses
  - Little work on inter-array conflict minimization
  - No work, to my knowledge, on interprocessor conflict minimization
- Can we synchronize computations so that multiple cores do not interfere with one another?
  - Or even arrange reuse of blocks across processors?
- Might it be possible to convince vendors to provide additional features to help control the cache, particularly conflict misses?
  - Allocation of part of the cache as a scratchpad
  - Dynamic modification of the cache mapping

Compiling for Cell
[Diagram: Cell BE block diagram - one PPE with L1/L2 cache, eight SPEs each with a local store (LS), all connected by the Element Interconnect Bus (EIB), plus the memory interface controller (MIC) and I/O interface (BEI)]
- 1 PPE (AltiVec), 8 SPEs (vector), each SPE with a 256KB local store
- Peak performance at 3.2GHz: ~200 GFlops single precision, ~20 GFlops double precision
- Each SPE has its own address space and can only access data in its LS; data transfers are explicitly controlled through DMA
- Each SPE can transfer 8 bytes/cycle in each direction
- Total memory bandwidth: 25.6 GB/s = 8 bytes per cycle total
- On-chip bandwidth: 200 GB/s

Compiler Challenges
Implications of bandwidth versus flops
- If a computation needs one input word per flop, the SPEs must wait for data
- Eight SPEs can each get 16 bytes (4 words) every 16 cycles
- Peak under those conditions = 8 flops per 16 cycles = 1.6 GFlops
- Reuse in local storage is essential to get closer to peak
Compiler challenges
- Find sufficient parallelism for the SPEs, at both coarse and fine granularity
- Effective data-movement strategy
  - DMA latency hiding
  - Data reuse in local storage

Our Work on Short Vector Units
- AltiVec on PowerPC, SSE on x86
- Short vector length (16 bytes); contiguous memory access for vectors; vector intrinsics available only in C/C++
- Data alignment constraints
[Figure: realigning vector data (1 2 3 4 vs. 2 3 4 1) with vector merge operations, AltiVec and SSE]

Measured cost of aligned (load_ps) vs. unaligned (loadu_ps) loads by problem size:

    Problem size |  800 | 1600 | 3200 | 6400
    load_ps      | 0.75 | 1.42 | 3.21 | 6.37
    loadu_ps     | 1.12 | 2.21 | 4.69 | 9.38
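
The load_ps / loadu_ps rows in the table correspond to the SSE intrinsics _mm_load_ps and _mm_loadu_ps. A minimal sketch of the alignment-sensitive loop such measurements reflect (array arguments and sizes are placeholders, and n is assumed to be a multiple of 4):

    #include <xmmintrin.h>

    /* Sums a and b into c, 4 floats at a time, assuming all three
     * pointers are 16-byte aligned.                                */
    void add_aligned(float *a, float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);   /* requires 16B alignment */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));
        }
    }

    /* Same computation with unaligned accesses: always legal, but
     * slower (cf. the loadu_ps row in the table above).           */
    void add_unaligned(float *a, float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
    }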

Compilation Strategy for F90
- Procedure outlining of array statements and vectorizable loops
  - Rewriting the computation in C/C++ to utilize the intrinsics
- Loop alignment to adjust the relative data alignment between lhs and rhs, complementing software-pipelined vector accesses (see the peeling sketch below)
- Data padding for data alignment
- Vectorized scalar replacement for data reuse in vector registers
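
One standard way to realize the alignment adjustment, shown here as a hedged sketch rather than the compiler's actual output: peel scalar iterations until the store stream is 16-byte aligned, then run an aligned vector body with unaligned loads covering any remaining lhs/rhs offset.

    #include <xmmintrin.h>
    #include <stdint.h>

    /* c[i] = a[i] + b[i] with loop peeling so the STORE stream becomes
     * 16-byte aligned; loads stay unaligned in case a and b sit at a
     * different relative offset from c.                               */
    void add_peeled(float *a, float *b, float *c, int n) {
        int i = 0;
        /* Prologue: peel scalar iterations until c is 16B-aligned. */
        while (i < n && ((uintptr_t)&c[i] & 15) != 0) {
            c[i] = a[i] + b[i];
            i++;
        }
        for (; i + 4 <= n; i += 4) {          /* aligned vector body */
            __m128 v = _mm_add_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i]));
            _mm_store_ps(&c[i], v);           /* aligned store       */
        }
        for (; i < n; i++)                    /* scalar epilogue     */
            c[i] = a[i] + b[i];
    }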

Compiling for Computational Accelerators
- Employ integrated techniques for vectorization, padding, alignment, and scalar replacement to compile for short vector machines

    DO J = 2, N-1
       A(2:N-1, J) = (A(2:N-1, J-1) + A(2:N-1, J+1) + A(1:N-2, J) + A(3:N, J))/4
    ENDDO

[Chart: speedups of 1.23 to 2.39 for clalign+vpar over corig+v on Pentium IV (Intel compiler), and for clalign+vpars over Orig+V on PowerPC G5 (VAST)]
- Extending the source-to-source vectorizer to support CELL
Reference: Y. Zhao and K. Kennedy, "Scalarization on Short Vector Machines," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, Texas, March 2005.

Related Work on Data Movement
Software cache
- Eichenberger et al., IBM compiler: cache-lookup code before each memory reference; 64KB, 4-way, 128-byte cache lines
Data management in generated code
- Eichenberger et al.: loop blocking, prefetching
Array copying
- Copy regions of a big array into a contiguous area of a small array; comparable to a DMA transfer (see the sketch below)
- Lam et al.: reduce cache conflict misses in blocked loop nests
- Temam et al.: cost-benefit models for array copying
- Yi: general algorithm with heuristics for cost-benefit analysis
Software prefetching
- Prefetch data into the cache before use; comparable to multi-buffering for overlapping computation and communication
- Callahan et al., Mowry et al.: prefetch analysis and prefetch placement
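
A generic sketch of the array-copying idea (not any cited author's code): gather a strided tile into a contiguous buffer so the working set occupies consecutive cache blocks, which is exactly the effect a DMA transfer into an SPE local store achieves.

    #include <string.h>

    #define TILE 32

    /* Copy a TILE x TILE tile of the n x n row-major matrix src,
     * starting at (r0, c0), into a small contiguous buffer. The buffer
     * occupies consecutive cache blocks, so a blocked kernel working
     * out of it cannot self-conflict, at the price of the copy itself. */
    void copy_tile(const double *src, int n, int r0, int c0,
                   double buf[TILE][TILE]) {
        for (int r = 0; r < TILE; r++)
            memcpy(buf[r], &src[(r0 + r) * n + c0], TILE * sizeof(double));
    }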

Current Work
- Code generation: extending our compilation strategy for attached short vector units (SSE and AltiVec) to the CELL processor
- Programming model (current)
  - Input: a sequential Fortran 90 program
  - Output: PPE code + SPE code
  - SPE code obtained by procedure-outlining parallelizable loops and array statements
- Automatic parallelization and vectorization
  - Dependence analysis
  - Multiple SPEs: multiple threads with data/task partitioning
  - Each SPE: vectorization using intrinsics, with data alignment applied

DMA Size and Computation Per Data
Sequential version code:

    for (i = 0; i < N-2; i++) {
        B[i+1] = 0.0;
        for (k = 0; k < ComputationPerData; k++)
            B[i+1] += (A[i] + A[i+2] + C[i+2]) * (0.35 / ComputationPerData);
    }

Parallel version
- Range [0 : N-3] split across SPEs
- Multi-buffering (see the sketch below)
Experiment setup
- 2.1GHz CELL blade at UTK
- XLC compiler 1.0-2
- N = 22000111, double buffering
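
The multi-buffering used in the experiment follows the standard Cell double-buffering pattern: start the DMA for the next chunk, then compute on the current one while the transfer is in flight. A condensed sketch using the Cell SDK's MFC intrinsics from spu_mfcio.h; CHUNK, the compute kernel, and the effective-address layout are placeholders:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 1024                         /* floats per DMA transfer */

    static float buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(float *chunk, int n);  /* placeholder kernel */

    void process(uint64_t ea, int nchunks) {
        int cur = 0;
        /* Prime the pipeline: start the first transfer on tag 0. */
        mfc_get(buf[0], ea, CHUNK * sizeof(float), 0, 0, 0);
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)               /* start next transfer early */
                mfc_get(buf[nxt],
                        ea + (uint64_t)(i + 1) * CHUNK * sizeof(float),
                        CHUNK * sizeof(float), nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);      /* wait only for current buf */
            mfc_read_tag_status_all();
            compute(buf[cur], CHUNK);          /* overlaps with next DMA */
            cur = nxt;
        }
    }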

DMA Size and Computation Per Data: Results
- DMA transfer sizes (floats): 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048
- Computation replication: 1, 2, 4, 8, 16, 32, 64, 128, 256
[Chart: speedup of 8 SPEs over 1 SPE across these parameter combinations]

Future Work: Compiling General Programs for Cell
- Data alignment for vector and DMA accesses
  - Vector: loop peeling, software-pipelined memory accesses, loop alignment, array padding
  - DMA: loop peeling, array padding
- Data movement
  - Buffer management for multi-buffering
- Data reuse
  - Loop blocking, loop fusion (blocking sketch below)
- Parallelization and enabling transformations
  - Loop distribution, loop interchange, loop skewing, privatization
- Heterogeneous parallelism
  - Partitioning programs between the PPE and SPEs
  - Use of the PPE cache as scratch memory
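
Of the listed transformations, loop blocking is the workhorse for local reuse. A minimal sketch for matrix multiply (c assumed zero-initialized); the tile size B is exactly the kind of parameter the next slide's automatic tuning would choose:

    #define B 64   /* tile size; tuned per target (see the next slide) */

    /* Blocked matrix multiply: each B x B tile of a, b, and c is reused
     * from cache (or an SPE local store) many times before eviction.   */
    void matmul_blocked(int n, const double *a, const double *b, double *c) {
        for (int ii = 0; ii < n; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < n; jj += B)
                    for (int i = ii; i < ii + B && i < n; i++)
                        for (int k = kk; k < kk + B && k < n; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jj + B && j < n; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }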

Automatic Tuning
- For many applications, picking the right optimization parameters is difficult
  - Example: tiling in a cache hierarchy
- Pretuning of applications and libraries can correct for these deficiencies
  - ATLAS and PHiPAC experience
- General automatic tuning strategy
  - Generate a search space of feasible optimization parameters
  - Prune the search space based on compiler models
  - Use heuristic search to find good solutions in reasonable time (see the sketch below)
- Search-space pruning
  - Trade-off between tiling and loop fusion
  - Need to determine the effective cache size
  - Qasem and Kennedy (ICS 06): construct a model for effective cache size based on a tolerance, and tune by varying the tolerance
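
A minimal version of the empirical-search loop described above, with a toy kernel standing in for a real tiled loop nest; the candidate list represents a search space already pruned by a model:

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double a[N][N];

    /* Toy tiled kernel: touch the array tile by tile. Stands in for a
     * real blocked loop nest whose best tile size we want to find.   */
    static void kernel(int tile) {
        for (int ii = 0; ii < N; ii += tile)
            for (int jj = 0; jj < N; jj += tile)
                for (int i = ii; i < ii + tile && i < N; i++)
                    for (int j = jj; j < jj + tile && j < N; j++)
                        a[i][j] += 1.0;
    }

    int main(void) {
        /* Candidate tiles pruned in advance (e.g., those fitting the
         * model's effective cache size); search keeps the fastest.  */
        int cand[] = {16, 32, 64, 128, 256}, best = cand[0];
        double best_t = 1e30;
        for (int i = 0; i < 5; i++) {
            clock_t t0 = clock();
            for (int rep = 0; rep < 50; rep++) kernel(cand[i]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("tile %3d: %.3fs\n", cand[i], t);
            if (t < best_t) { best_t = t; best = cand[i]; }
        }
        printf("best tile: %d\n", best);
        return 0;
    }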

Summary
Partitioning for heterogeneous systems requires:
- Compiler partitioning of computations for the different components
- Performance estimation of the partitioned code on each component
- Scheduling that takes data-movement costs into account
- Transformations to make effective use of bandwidth and local memory on each component
  - Loop fusion and tiling are critical
Rice has a substantive track record in these areas:
- Automatic vectorization/parallelization
- Restructuring for data reuse in registers and cache
  - Prefetching, scalar replacement, cache blocking, loop fusion
- Construction of performance models for uniprocessor memory hierarchies
- Scheduling onto heterogeneous Grid components with data movement
- Generating code for short vector units and Cell