Portable Parallel Programming for Multicore Computing


Portable Parallel Programming for Multicore Computing?
Vivek Sarkar, Rice University (vsarkar@rice.edu)

Acknowledgments
- Rice Habanero Multicore Software project: http://habanero.rice.edu
- COMP 635 Seminar on Heterogeneous Processors: http://www.cs.rice.edu/~vs3/comp635
- X10 open source project: http://x10.sf.net
- IBM Research study on Java on Cell

Future System Trends: a New Era of Mainstream & High-End Parallel Processing
Hardware building blocks for mainstream and high-performance systems are varied and proliferating:
- Homogeneous multi-core (multiple cores sharing an L2 cache)
- Heterogeneous accelerators (e.g., the Cell BE: a PPE with a 64-bit Power Architecture core with VMX, plus SPEs, connected by the EIB at up to 96B/cycle, with dual XDR memory and FlexIO interfaces)
- High-performance clusters (SMP nodes, each with local memory, connected by an interconnect)
Challenge: develop new programming technologies to support portable parallel abstractions for future hardware.

Outline
1. Habanero Multicore Software Project
2. Portable Parallel Programming for Heterogeneous Processors
3. Legacy Transformation for Automatic Parallelization

Habanero Project (habanero.rice.edu)
Parallel applications are written on the Habanero stack:
1) Habanero Programming Language (based on an X10 subset), with the Habanero Foreign Function Interface to sequential C, Fortran, Java, ...
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit
Components 2)-5) will be developed first for 1), and then extended to support other languages. The stack runs on top of vendor tools (platform compilers & libraries), a multicore OS, and multicore hardware.

Habanero Target Applications and Platforms
Applications:
- Parallel benchmarks: SSCAs #1, #2, #3 from the DARPA HPCS program; NAS Parallel Benchmarks; JGF, JUC, and SciMark benchmarks
- Medical imaging: back-end processing for compressive sensing (www.dsp.ece.rice.edu/cs). Contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
- Seismic data processing: Rice Inversion project (www.trip.caam.rice.edu). Contact: Bill Symes (Rice)
- Computer graphics and visualization: mathematical modeling and smoothing of meshes. Contact: Joe Warren (Rice)
- Computational chemistry: Fock matrix construction. Contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
- Habanero compiler: implement the Habanero compiler in Habanero, so as to exploit multicore parallelism within the compiler
Platforms:
- AMD Opteron Quad-Core
- ClearSpeed Advance X620
- DRC Coprocessor Module w/ Xilinx Virtex FPGA
- IBM Cyclops-64 (C-64)
- IBM Power5+, Power6
- Intel Xeon Quad-Core
- NVIDIA Tesla S870
- STI Cell
- Sun UltraSPARC T1, T2
Additional suggestions welcome!

2) Habanero Static Parallelizing & Optimizing Compiler
Pipeline: the front end parses Habanero Language code (plus sequential C, Fortran, Java, ... via the Habanero Foreign Function Interface) into an AST; interprocedural analysis and IRGen produce a Parallel IR (PIR); PIR analysis & optimization then emits either annotated classfiles for a portable managed runtime, or C / Fortran partitioned code (restricted code regions for targeting accelerators & high-end computing) handled by a platform-specific static compiler.

Evaluating Java on Cell on a Streaming Microbenchmark (Rajesh Bordawekar, IBM Research, 1Q2007)
Streaming integer vector add (b[j] = a[j] + c) for a 32M-element vector on a 2.99 GHz Pentium 4 and a 2.1 GHz Cell blade. The Pentium version uses C code; the Cell version uses Java on the PPE and C on the SPEs.
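For reference, the kernel being measured can be sketched in plain Java; the element count is reduced from the benchmark's 32M for a quick run, and the harness here is illustrative, not the one used in the study:

```java
// Sketch of the streaming integer vector-add kernel b[j] = a[j] + c.
// One load, one add, one store per element: the loop is bandwidth-bound,
// which is why Cell's DMA-fed SPEs are interesting for it.
public class StreamAdd {
    static void vectorAdd(int[] a, int[] b, int c) {
        for (int j = 0; j < a.length; j++) {
            b[j] = a[j] + c;
        }
    }

    public static void main(String[] args) {
        int n = 1 << 22;   // 4M elements here; the benchmark used 32M
        int[] a = new int[n], b = new int[n];
        java.util.Arrays.fill(a, 1);
        long t0 = System.nanoTime();
        vectorAdd(a, b, 41);
        long t1 = System.nanoTime();
        System.out.println("b[0] = " + b[0] + ", time = " + (t1 - t0) / 1e6 + " ms");
    }
}
```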

Outline
1. Habanero Multicore Software Project
2. Portable Parallel Programming for Heterogeneous Processors
3. Legacy Transformation for Automatic Parallelization

Heterogeneous Processor Spectrum
- Dimension 1: distance of the accelerator from the main processor (decreasing latency & bandwidth)
- Dimension 2: hardware customization in the accelerator (decreasing energy per operation)
(The Cell processor is one point in this two-dimensional spectrum.)

Portable Parallel Programming via X10 Places
- The X10 language defines a mapping from X10 objects & activities to X10 places
- An X10 deployment defines a mapping from virtual X10 places to physical processing elements
The same chain (X10 data structures -> X10 places -> physical PEs) applies across homogeneous multi-core (PEs sharing an L2 cache), heterogeneous accelerators (e.g., Cell's PPE and SPEs on the EIB), and clusters (SMP nodes with memory, connected by an interconnect).

Places (contd.)
Examples:

1) Inter-place parallelism:

    finish {
      final int x = ..., y = ...;
      async (a) a.foo(x);          // Execute at a's place
      async (b[j]) b[j].bar(y);    // Execute at b[j]'s place
    }

2) Implicit and explicit versions of a remote fetch-and-op:

    a) a.x = foo(a.x, b.y);

    b) async (b) {
         final double v = b.y;     // Can be any value type
         async (a) atomic a.x = foo(a.x, v);
       }
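X10's finish/async constructs are not Java, but the fork-join structure of example 1 can be approximated with java.util.concurrent. This is a hedged sketch of the pattern only: the FinishAsync class is an illustrative stand-in, and it does not model the place argument that real X10 asyncs carry.

```java
// Sketch: approximating X10's "finish { async S1; async S2; }" in plain Java.
// async spawns a child activity; finish joins all spawned activities.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FinishAsync {
    private final List<Future<?>> tasks = new ArrayList<>();
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Analog of "async S": spawn a child activity
    public void async(Runnable body) {
        tasks.add(pool.submit(body));
    }

    // Analog of the end of a "finish" block: wait for all children
    public void finish() throws Exception {
        for (Future<?> f : tasks) f.get();
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        int[] result = new int[2];
        FinishAsync fa = new FinishAsync();
        fa.async(() -> result[0] = 42);  // analog of: async (a) a.foo(x);
        fa.async(() -> result[1] = 7);   // analog of: async (b[j]) b[j].bar(y);
        fa.finish();                     // both activities are done here
        System.out.println(result[0] + " " + result[1]);
    }
}
```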

X10 Deployment on a Multicore SMP (open source: x10.sf.net)
Basic approach: partition the X10 heap into multiple place-local heaps (e.g., Place 0 .. Place 3).
- Each X10 object is allocated in a designated place
- Each X10 activity is created (and pinned) at a designated place
- An X10 activity is allowed to synchronously access data at remote places outside of atomic sections
Thus, places serve as affinity hints for intra-SMP locality.
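The "activity pinned at a designated place" idea can be sketched in Java by giving each place its own single-threaded executor, so every activity submitted to a place runs on that place's thread. The Places class and its at() method are illustrative assumptions for this sketch, not X10's actual API.

```java
// Sketch: one single-threaded executor per place, so activities submitted to
// a place are pinned to that place's thread (an affinity hint, as on an SMP).
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Places {
    private final ExecutorService[] places;

    public Places(int n) {
        places = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            places[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Run an activity at a designated place and wait for its result
    public <T> T at(int place, Callable<T> activity) throws Exception {
        return places[place].submit(activity).get();
    }

    public void shutdown() {
        for (ExecutorService p : places) p.shutdown();
    }

    public static void main(String[] args) throws Exception {
        Places p = new Places(4);   // e.g., Place 0..3 on a multicore SMP
        String who = p.at(2, () -> Thread.currentThread().getName());
        System.out.println("activity ran at place 2 on thread " + who);
        p.shutdown();
    }
}
```

Because each place has exactly one thread, two activities sent to the same place always run on the same thread, which is the locality property places are meant to express.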

Extending X10 Places for Cell Deployments (Habanero)
Basic approach: map 9 places onto the PPE (Place 0) plus the eight SPEs (Places 1-8), and use finish & asyncs as a high-level representation of DMAs.
Challenges:
- Weak PPE; SIMDization is critical
- Lack of hardware support for coherence
- Limited memory on the SPEs
- Limited performance of code with frequent conditional or indirect branches
- Different ISAs for the PPE and SPEs

Extending X10 Places for GPU Deployments (Habanero)
The host is Place 0; the device is modeled as a hierarchy of places. [Slide figure: CUDA device diagram -- kernels are launched as grids of thread blocks; each multiprocessor has per-processor registers, a shared memory, constant and texture caches, and an instruction unit, above device memory.]

Outline
1. Habanero Multicore Software Project
2. Portable Parallel Programming for Heterogeneous Processors
3. Legacy Transformation for Automatic Parallelization

Automatic Parallelization Revisited: Let's Target Shiny Decks Instead of Dusty Decks!
Path: legacy code (sequential Java) -> language extensions -> sequential Habanero + parallel constructs -> automatic parallelization -> parallel X10, running on the X10/Habanero runtime with fine-grained synchronization (phasers).
Reference: "Language Extensions in Support of Compiler Parallelization", J. Shirako, H. Kasahara, V. Sarkar, LCPC 2007.

Language Extensions to Aid Compiler Parallelization
Already in X10:
- multidimensional arrays, points, regions, dependent types
Proposed in the Habanero project:
- array views
- parameter intents
- retained (non-escaping) arrays and objects
- pure methods
- exception-free code regions
- gather/reduce computations
All declarations and annotations are checked for safety. For example, for a division j / m:
- The compiler inserts a dynamic check for m != 0
- The programmer inserts a dynamic check using a type cast operator:
      int(:nonzero) m = (int(:nonzero)) n; // Cast to nonzero
- The compiler performs static checks of dependent types:
      int(:nonzero) m = n; // Need to declare n as nonzero
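Java has no dependent types, but the dynamic-cast flavor of the nonzero check can be sketched with a small wrapper class. The NonZero class and its method names are illustrative stand-ins for this sketch, not Habanero's actual types: the point is that the check is paid once at the "cast", after which divisions by the wrapped value need no guard.

```java
// Sketch: emulating the int(:nonzero) dynamic cast with a checked wrapper.
// The invariant (value != 0) is established once in cast() and then carried
// by the type, so div() can divide without rechecking.
public final class NonZero {
    public final int value;

    private NonZero(int v) { this.value = v; }

    // Analog of: int(:nonzero) m = (int(:nonzero)) n;  // dynamic check
    public static NonZero cast(int n) {
        if (n == 0) throw new ArithmeticException("cast to nonzero failed");
        return new NonZero(n);
    }

    // Division whose divisor is known, by its type, to be nonzero
    public static int div(int j, NonZero m) {
        return j / m.value;   // safe: m.value != 0 was established at the cast
    }

    public static void main(String[] args) {
        NonZero m = NonZero.cast(4);
        System.out.println(div(42, m));  // prints 10
    }
}
```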

Case Study: Java Grande Forum Benchmarks
Annotations are checked for safety, and are consistent with best practices in software engineering.

Experimental Results
Target system: p570 16-way Power6 4.7 GHz SMP
- Main memory: 186 GB; page size: 16 GB
- L3 cache: 32 MB/chip; L2 cache: 4 MB/core; L1 cache: 128 KB
- SMT off, AIX 5.3J
JVM: IBM J9 (Build 2.4, J2RE 1.6.0), used with the following options in all runs:
    -Xjit:count=0,optLevel=veryHot,ignoreIEEE -Xms1000M -Xmx1000M
Benchmarks: Java Grande Forum Benchmarks (Section 2 and Section 3)
- Java serial: v2.0 of the JGF benchmarks, sequential Java
- Habanero serial: sequential Java with language extensions, same algorithm as JGF serial; annotations enable JVM optimization of null pointer and bounds checks
- Habanero parallel: annotations enable parallelization of the Habanero serial version (hand-simulated in this study)

Performance Results on a 16-core Power6 SMP (8 chips x 2 cores)
- Habanero serial is 1.2x faster than JGF serial on average
- Habanero parallel (hand-simulated) is 11.9x faster than Habanero serial, and 14.3x faster than JGF serial, on average

Conclusion?
Advances in parallel languages, compilers, and runtimes are necessary to address the programming challenges of multicore computing, across homogeneous multi-core, heterogeneous accelerators (such as the Cell BE), and high-performance clusters.

Habanero Team (Nov 2007)
Send email to Vivek Sarkar (vsarkar@rice.edu) if you are interested in the Habanero project, or in collaborating with us!