Portable Parallel Programming for Multicore Computing


Portable Parallel Programming for Multicore Computing?
Vivek Sarkar, Rice University (vsarkar@rice.edu)

Acknowledgments
- Rice Habanero Multicore Software project: http://habanero.rice.edu
- COMP 635 Seminar on Heterogeneous Processors: http://www.cs.rice.edu/~vs3/comp635
- X10 open source project: http://x10.sf.net
- IBM Research study on Java on Cell

Future System Trends: a New Era of Mainstream & High-End Parallel Processing
Hardware building blocks for mainstream and high-performance systems are varied and proliferating:
- Homogeneous multi-core (multiple cores sharing an L2 cache)
- Heterogeneous accelerators (e.g., the Cell BE: a PPE with a 64-bit Power Architecture core with VMX, plus SPEs, connected by the EIB at up to 96B/cycle, with dual XDR memory and FlexIO interfaces)
- High-performance clusters (SMP nodes, each with local memory, connected by an interconnect)
Challenge: develop new programming technologies to support portable parallel abstractions for future hardware.

Outline
1. Habanero Multicore Software Project
2. Portable Parallel Programming for Heterogeneous Processors
3. Legacy Transformation for Automatic Parallelization

Habanero Project (habanero.rice.edu)
Parallel applications are written on the Habanero stack:
1) Habanero Programming Language (based on an X10 subset), with the Habanero Foreign Function Interface to sequential C, Fortran, Java, ...
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit
Components 2)-5) will be developed first for 1), and then extended to support other languages. The stack runs on top of vendor tools (platform compilers & libraries), a multicore OS, and multicore hardware.

Habanero Target Applications and Platforms
Applications:
- Parallel benchmarks: SSCAs #1, #2, #3 from the DARPA HPCS program; NAS Parallel Benchmarks; JGF, JUC, and SciMark benchmarks
- Medical imaging: back-end processing for compressive sensing (www.dsp.ece.rice.edu/cs). Contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
- Seismic data processing: Rice Inversion project (www.trip.caam.rice.edu). Contact: Bill Symes (Rice)
- Computer graphics and visualization: mathematical modeling and smoothing of meshes. Contact: Joe Warren (Rice)
- Computational chemistry: Fock matrix construction. Contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
- Habanero compiler: implement the Habanero compiler in Habanero, so as to exploit multicore parallelism within the compiler
Platforms:
- AMD Opteron Quad-Core
- ClearSpeed Advance X620
- DRC Coprocessor Module w/ Xilinx Virtex FPGA
- IBM Cyclops-64 (C-64)
- IBM Power5+, Power6
- Intel Xeon Quad-Core
- NVIDIA Tesla S870
- STI Cell
- Sun UltraSPARC T1, T2
Additional suggestions welcome!

2) Habanero Static Parallelizing & Optimizing Compiler
Pipeline: the front end parses Habanero Language code (plus sequential C, Fortran, Java, ... via the Habanero Foreign Function Interface) into an AST; interprocedural analysis and IRGen produce a Parallel IR (PIR); PIR analysis & optimization then emits either annotated classfiles for a portable managed runtime, or C / Fortran partitioned code (restricted code regions for targeting accelerators & high-end computing) handled by a platform-specific static compiler.

Evaluating Java on Cell on a Streaming Microbenchmark (Rajesh Bordawekar, IBM Research, 1Q2007)
Streaming integer vector add (b[j] = a[j] + c) for a 32M-element vector on a 2.99 GHz Pentium 4 and a 2.1 GHz Cell blade. The Pentium version uses C code; the Cell version uses Java on the PPE and C on the SPEs.
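For reference, the kernel being measured can be sketched in plain Java; the element count is reduced from the benchmark's 32M for a quick run, and the harness here is illustrative, not the one used in the study:

```java
// Sketch of the streaming integer vector-add kernel b[j] = a[j] + c.
// One load, one add, one store per element: the loop is bandwidth-bound,
// which is why Cell's DMA-fed SPEs are interesting for it.
public class StreamAdd {
    static void vectorAdd(int[] a, int[] b, int c) {
        for (int j = 0; j < a.length; j++) {
            b[j] = a[j] + c;
        }
    }

    public static void main(String[] args) {
        int n = 1 << 22;   // 4M elements here; the benchmark used 32M
        int[] a = new int[n], b = new int[n];
        java.util.Arrays.fill(a, 1);
        long t0 = System.nanoTime();
        vectorAdd(a, b, 41);
        long t1 = System.nanoTime();
        System.out.println("b[0] = " + b[0] + ", time = " + (t1 - t0) / 1e6 + " ms");
    }
}
```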

Outline
1. Habanero Multicore Software Project
2. Portable Parallel Programming for Heterogeneous Processors
3. Legacy Transformation for Automatic Parallelization

Heterogeneous Processor Spectrum
- Dimension 1: distance of the accelerator from the main processor (decreasing latency & bandwidth)
- Dimension 2: hardware customization in the accelerator (decreasing energy per operation)
(The Cell processor is one point in this two-dimensional spectrum.)

Portable Parallel Programming via X10 Places
- The X10 language defines a mapping from X10 objects & activities to X10 places
- An X10 deployment defines a mapping from virtual X10 places to physical processing elements
The same chain (X10 data structures -> X10 places -> physical PEs) applies across homogeneous multi-core (PEs sharing an L2 cache), heterogeneous accelerators (e.g., Cell's PPE and SPEs on the EIB), and clusters (SMP nodes with memory, connected by an interconnect).

Places (contd.)
Examples:

1) Inter-place parallelism:

    finish {
      final int x = ..., y = ...;
      async (a) a.foo(x);          // Execute at a's place
      async (b[j]) b[j].bar(y);    // Execute at b[j]'s place
    }

2) Implicit and explicit versions of a remote fetch-and-op:

    a) a.x = foo(a.x, b.y);

    b) async (b) {
         final double v = b.y;     // Can be any value type
         async (a) atomic a.x = foo(a.x, v);
       }
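X10's finish/async constructs are not Java, but the fork-join structure of example 1 can be approximated with java.util.concurrent. This is a hedged sketch of the pattern only: the FinishAsync class is an illustrative stand-in, and it does not model the place argument that real X10 asyncs carry.

```java
// Sketch: approximating X10's "finish { async S1; async S2; }" in plain Java.
// async spawns a child activity; finish joins all spawned activities.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FinishAsync {
    private final List<Future<?>> tasks = new ArrayList<>();
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Analog of "async S": spawn a child activity
    public void async(Runnable body) {
        tasks.add(pool.submit(body));
    }

    // Analog of the end of a "finish" block: wait for all children
    public void finish() throws Exception {
        for (Future<?> f : tasks) f.get();
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        int[] result = new int[2];
        FinishAsync fa = new FinishAsync();
        fa.async(() -> result[0] = 42);  // analog of: async (a) a.foo(x);
        fa.async(() -> result[1] = 7);   // analog of: async (b[j]) b[j].bar(y);
        fa.finish();                     // both activities are done here
        System.out.println(result[0] + " " + result[1]);
    }
}
```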

X10 Deployment on a Multicore SMP (open source: x10.sf.net)
Basic approach: partition the X10 heap into multiple place-local heaps (e.g., Place 0 .. Place 3).
- Each X10 object is allocated in a designated place
- Each X10 activity is created (and pinned) at a designated place
- An X10 activity is allowed to synchronously access data at remote places outside of atomic sections
Thus, places serve as affinity hints for intra-SMP locality.
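The "activity pinned at a designated place" idea can be sketched in Java by giving each place its own single-threaded executor, so every activity submitted to a place runs on that place's thread. The Places class and its at() method are illustrative assumptions for this sketch, not X10's actual API.

```java
// Sketch: one single-threaded executor per place, so activities submitted to
// a place are pinned to that place's thread (an affinity hint, as on an SMP).
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Places {
    private final ExecutorService[] places;

    public Places(int n) {
        places = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            places[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Run an activity at a designated place and wait for its result
    public <T> T at(int place, Callable<T> activity) throws Exception {
        return places[place].submit(activity).get();
    }

    public void shutdown() {
        for (ExecutorService p : places) p.shutdown();
    }

    public static void main(String[] args) throws Exception {
        Places p = new Places(4);   // e.g., Place 0..3 on a multicore SMP
        String who = p.at(2, () -> Thread.currentThread().getName());
        System.out.println("activity ran at place 2 on thread " + who);
        p.shutdown();
    }
}
```

Because each place has exactly one thread, two activities sent to the same place always run on the same thread, which is the locality property places are meant to express.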

Extending X10 Places for Cell Deployments (Habanero)
Basic approach: map 9 places onto the PPE (Place 0) plus the eight SPEs (Places 1-8), and use finish & asyncs as a high-level representation of DMAs.
Challenges:
- Weak PPE; SIMDization is critical
- Lack of hardware support for coherence
- Limited memory on the SPEs
- Limited performance of code with frequent conditional or indirect branches
- Different ISAs for the PPE and SPEs

Extending X10 Places for GPU Deployments (Habanero)
The host is Place 0; the device is modeled as a hierarchy of places. [Slide figure: CUDA device diagram -- kernels are launched as grids of thread blocks; each multiprocessor has per-processor registers, a shared memory, constant and texture caches, and an instruction unit, above device memory.]

Outline
1. Habanero Multicore Software Project
2. Portable Parallel Programming for Heterogeneous Processors
3. Legacy Transformation for Automatic Parallelization

Automatic Parallelization Revisited: Let's Target Shiny Decks Instead of Dusty Decks!
Path: legacy code (sequential Java) -> language extensions -> sequential Habanero + parallel constructs -> automatic parallelization -> parallel X10, running on the X10/Habanero runtime with fine-grained synchronization (phasers).
Reference: "Language Extensions in Support of Compiler Parallelization", J. Shirako, H. Kasahara, V. Sarkar, LCPC 2007.

Language Extensions to Aid Compiler Parallelization
Already in X10:
- multidimensional arrays, points, regions, dependent types
Proposed in the Habanero project:
- array views
- parameter intents
- retained (non-escaping) arrays and objects
- pure methods
- exception-free code regions
- gather/reduce computations
All declarations and annotations are checked for safety. For example, for a division j / m:
- The compiler inserts a dynamic check for m != 0
- The programmer inserts a dynamic check using a type cast operator:
      int(:nonzero) m = (int(:nonzero)) n; // Cast to nonzero
- The compiler performs static checks of dependent types:
      int(:nonzero) m = n; // Need to declare n as nonzero
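Java has no dependent types, but the dynamic-cast flavor of the nonzero check can be sketched with a small wrapper class. The NonZero class and its method names are illustrative stand-ins for this sketch, not Habanero's actual types: the point is that the check is paid once at the "cast", after which divisions by the wrapped value need no guard.

```java
// Sketch: emulating the int(:nonzero) dynamic cast with a checked wrapper.
// The invariant (value != 0) is established once in cast() and then carried
// by the type, so div() can divide without rechecking.
public final class NonZero {
    public final int value;

    private NonZero(int v) { this.value = v; }

    // Analog of: int(:nonzero) m = (int(:nonzero)) n;  // dynamic check
    public static NonZero cast(int n) {
        if (n == 0) throw new ArithmeticException("cast to nonzero failed");
        return new NonZero(n);
    }

    // Division whose divisor is known, by its type, to be nonzero
    public static int div(int j, NonZero m) {
        return j / m.value;   // safe: m.value != 0 was established at the cast
    }

    public static void main(String[] args) {
        NonZero m = NonZero.cast(4);
        System.out.println(div(42, m));  // prints 10
    }
}
```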

Case Study: Java Grande Forum Benchmarks
Annotations are checked for safety, and are consistent with best practices in software engineering.

Experimental Results
Target system: p570 16-way Power6 4.7 GHz SMP
- Main memory: 186 GB; page size: 16 GB
- L3 cache: 32 MB/chip; L2 cache: 4 MB/core; L1 cache: 128 KB
- SMT off, AIX 5.3J
JVM: IBM J9 (Build 2.4, J2RE 1.6.0), used with the following options in all runs:
    -Xjit:count=0,optLevel=veryHot,ignoreIEEE -Xms1000M -Xmx1000M
Benchmarks: Java Grande Forum Benchmarks (Section 2 and Section 3)
- Java serial: v2.0 of the JGF benchmarks, sequential Java
- Habanero serial: sequential Java with language extensions, same algorithm as JGF serial; annotations enable JVM optimization of null pointer and bounds checks
- Habanero parallel: annotations enable parallelization of the Habanero serial version (hand-simulated in this study)

Performance Results on a 16-core Power6 SMP (8 chips x 2 cores)
- Habanero serial is 1.2x faster than JGF serial on average
- Habanero parallel (hand-simulated) is 11.9x faster than Habanero serial, and 14.3x faster than JGF serial, on average

Conclusion?
Advances in parallel languages, compilers, and runtimes are necessary to address the programming challenges of multicore computing, across homogeneous multi-core, heterogeneous accelerators (such as the Cell BE), and high-performance clusters.

Habanero Team (Nov 2007)
Send email to Vivek Sarkar (vsarkar@rice.edu) if you are interested in the Habanero project, or in collaborating with us!