COMP 635: Seminar on Heterogeneous Processors. Vivek Sarkar. Department of Computer Science Rice University

COMP 635: Seminar on Heterogeneous Processors
Vivek Sarkar, Department of Computer Science, Rice University
August 27, 2007

Course Goals
- Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum
- Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)
- Discuss research challenges in advancing the state of the art of software for heterogeneous processors
- Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or in understanding the strengths and weaknesses of heterogeneous processors with respect to their research areas

Course Organization
- Class dates (12 lectures): 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
  - No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007 conference week)
  - No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
- Time & Place
  - Default: Mondays, 3:30pm - 4:30pm, DH 2014
  - Exception: time & place for 9/20 (Thurs) lecture TBD
  - 30 minutes reserved after lecture for discussion (optional)
- Office Hours (DH 3131): 11am - 12noon, Fridays from 8/31/07 to 12/7/07
- OWL-Space repository: COMP 635 F07
- Grading
  - Satisfactory/unsatisfactory grade for students taking the seminar for credit
  - Others should register officially as auditors, if possible
  - For a satisfactory grade, you need to
    1. Attend at least 50% of lectures
    2. Submit a 4-page project/study report by 12/7/07 (the report can be prepared in a group; plan on 4 pages per person in that case)
  - Optional in-class presentation of project/study report on 12/3/07

Course Content
- Introduction to Heterogeneous Processors and their Programming Models (1 lecture)
- Cell Processor and Cell SDK (2 lectures)
- Nvidia GPU and CUDA programming environment (2 lectures)
- DRC FPGA Coprocessor Module and Celoxica Programming Environment (1 lecture)
- Clearspeed Accelerator and SDK (1 lecture)
- Imagine Stream Processor (1 lecture)
- Microsoft Accelerator Library (1 lecture)
- Vector and SIMD processors: a historical perspective (1 lecture)
- Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)
- Student presentations (1 lecture)

COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models

Acknowledgments
- Georgia Tech ECE 6100, Module 14: Vince Mooney, Krishna Palem, Sudhakar Yalamanchili (ex.html)
- MIT IAP 2007, Lecture 2: Introduction to the Cell Processor, Michael Perrone
- UIUC ECE 497, Lecture 16: courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt
- UIUC ECE 498 AL1, Programming Massively Parallel Processors: David Kirk, Wen-mei Hwu

Heterogeneous Processors
[Figure: block diagram showing MAIN MEMORY, a general-purpose processor (GPP), a memory transfer module (MTM), and accelerators (ACC) with LOCAL MEMORY]
- Memory transfer module schedules system-wide bulk data movement
- Accelerated activities and associated private data are localized for bandwidth, power, efficiency
- General-purpose processor orchestrates activity
- Accelerators can use scheduled, streaming communication or can operate on locally-buffered data pushed to them in advance

Motivation
1) Different parts of programs have different requirements
  - Control-intensive portions need good branch predictors, speculation, and big caches to achieve good performance
  - Data-processing portions need lots of ALUs and have simpler control flows
2) Power consumption
  - Features like branch prediction and out-of-order execution tend to have very high power/performance ratios
  - Applications often have time-varying performance requirements

Sample Application Domains for Heterogeneous Processors
- Cell Processor: Medical imaging, Drug discovery, Reservoir modeling, Seismic analysis, ...
- GPU (e.g., Nvidia): Computer-aided design (CAD), Digital content creation (DCC), emerging HPC applications, ...
- FPGA (e.g., Xilinx DRC): HPC, Petroleum, Financial, ...
- HPC accelerators (e.g., Clearspeed): HPC, Network processing, Graphics, ...
- Stream Processors (e.g., Imagine): Image processing, Signal processing, Video, Graphics, ...
- Others: TCP/IP offload, Crypto, ...

Programming Models for Heterogeneous Processors
- Data Parallelism
- Single Program Multiple Data (SPMD)
- Pipelining
- Work Queue
- Fork Join
- Message Passing
- Storage Models: Shared vs. Local vs. Partitioned Memories
- Hybrid combinations of the above
Only a limited subset of these models is in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers

Heterogeneous Processor Spectrum
[Figure: two-dimensional design space; label in the figure: Heterogeneous Multicore]
- Dimension 1: Distance of accelerator from main processor
- Dimension 2: Hardware customization in accelerator

Heterogeneous Processor Spectrum
[Figure: two-dimensional design space; labels in the figure: Heterogeneous Multicore, Focus of this course (twice)]
- Dimension 1: Distance of accelerator from main processor
- Dimension 2: Hardware customization in accelerator

Spectrum of Programmers for Heterogeneous Processors
- Application-level Users: plug & play experience by using ISV frameworks such as MATLAB, Mathematica, etc.
- Library-level Programmers: portable library interface that works across homogeneous and heterogeneous processors
- Language-level Programmers: portable programming language that works across homogeneous and heterogeneous processors
  - Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!
- SDK-level Programmers: C-based compilers and tools that are specific to a given heterogeneous processor

Spectrum of Programmers for Heterogeneous Processors
- Application-level Users: plug & play experience by using ISV frameworks such as MATLAB, Mathematica, etc.
- Library-level Programmers: portable library interface that works across homogeneous and heterogeneous processors
- Language-level Programmers: portable programming language that works across homogeneous and heterogeneous processors [Focus of this course]
  - Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!
- SDK-level Programmers: C-based compilers and tools that are specific to a given heterogeneous processor

Cell Broadband Engine (BE)

Cell Performance

Cell Temperature Distribution
- Power and heat are key constraints

Code Partitioning for Cell
[Figure: program graph split into "Compile for PPE" and "Compile for SPE" parts via Outlining and Cloning; key: Flow Graph Node, Call Graph Node, Flow Graph Edge, Call Graph Edge]
- Outlining: extract parallel loop into a separate procedure
- Cloning: make separate copies for PPE and SPE, including clones of all procedures called from the loop
- Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes
Reference: Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006

Why GPUs?
- A quiet revolution and potential build-up (GPU vs. CPU)
  - Calculation: 367 GFLOPS vs. 32 GFLOPS
  - Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
  - Until last year, programmed through graphics API
- GPU in every PC and workstation: massive volume and potential impact
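A minimal Python sketch of the outlining, cloning, and coordination steps listed for Cell above. It is an illustration only, not the output of the IBM compiler: threads stand in for SPEs, `queue.Queue` objects stand in for mailbox queues, and the names `loop_body`, `outlined_loop`, `spe_worker`, and `run_partitioned` are hypothetical.

```python
import queue
import threading


def loop_body(x):
    # Hypothetical procedure called from the parallel loop. A partitioning
    # compiler would clone it so that PPE and SPE versions both exist.
    return x * x + 1


def outlined_loop(data, lo, hi, out):
    # Outlining: the body of the original parallel loop, extracted into a
    # separate procedure that processes the index range [lo, hi).
    for i in range(lo, hi):
        out[i] = loop_body(data[i])


def spe_worker(inbox, done, data, out):
    # Stand-in for the SPE clone: block on the mailbox for a chunk
    # descriptor, run the outlined loop, then signal completion.
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown message
            break
        lo, hi = msg
        outlined_loop(data, lo, hi, out)
        done.put((lo, hi))


def run_partitioned(data, num_spes=4):
    # Stand-in for the PPE side: hand out chunks, wait for completions.
    out = [None] * len(data)
    inboxes = [queue.Queue() for _ in range(num_spes)]
    done = queue.Queue()
    workers = [
        threading.Thread(target=spe_worker, args=(q, done, data, out))
        for q in inboxes
    ]
    for w in workers:
        w.start()
    chunk = (len(data) + num_spes - 1) // num_spes
    sent = 0
    for k, q in enumerate(inboxes):
        lo, hi = k * chunk, min((k + 1) * chunk, len(data))
        if lo < hi:
            q.put((lo, hi))  # coordination: mailbox write
            sent += 1
    for _ in range(sent):
        done.get()  # coordination: wait for a completion signal
    for q in inboxes:
        q.put(None)
    for w in workers:
        w.join()
    return out
```

Calling `run_partitioned(list(range(10)))` returns the same list a sequential loop over `loop_body` would produce; only the distribution of work changes.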

Sample GPU Applications
Application | Description | Source | Kernel | % time
H.264 | SPEC 06 version, change in guess vector | 34,[...] | [...] | [...]%
LBM | SPEC 06 version, change to single precision and print fewer reports | 1,[...] | [...] | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,[...] | [...] | [...]%
PNS | Petri Net simulation of a distributed system | [...] | [...] | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | [...] | [...] | >99%
TPACF | Two Point Angular Correlation Function | [...] | [...] | [...]%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,[...] | [...] | [...]%
MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | [...] | [...] | >99%

Performance of Sample Kernels and Applications
- GeForce 8800 GTX vs. 2.2GHz Opteron
- [...] speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized
- Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel
Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu
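The "% time" column matters because only the kernel is accelerated. A short sketch using Amdahl's law (a standard result, not stated on the slide) shows how a kernel's share of runtime bounds whole-application speedup. The function name and the 100x kernel speedup used in the usage note are hypothetical.

```python
def application_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only the kernel is accelerated.

    kernel_fraction is the share of original runtime spent in the kernel
    (the "% time" column as a value in [0, 1]); kernel_speedup is the
    factor by which the kernel alone gets faster.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_part = 1.0 - kernel_fraction
    return 1.0 / (serial_part + kernel_fraction / kernel_speedup)
```

With the 99% kernel share listed for FEM and an assumed 100x kernel speedup, `application_speedup(0.99, 100)` is about 50x. An application that spent only half its time in the kernel could never exceed 2x, however fast the kernel became.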

FPGAs: Basics of FPGA Offload
Source: Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications by Doug Johnson, doug.johnson@celoxica.com, gladiator.ncsa.uiuc.edu/pdfs/rssi06/presentations/14_doug_johnson.pdf

FPGA Acceleration Examples

ClearSpeed Multi-Threaded Array Processor (MTAP)
- Hardware multi-threading for latency tolerance
- Asynchronous, overlapped I/O
- Poly execution unit contains 96 Processor Elements (PEs) or cores
- Array of PEs operates in a synchronous manner, i.e., each PE executes the same instruction on its own data
Source: Accelerating HPC Applications with ClearSpeed by Daniel Kidger, daniel.kidger@clearspeed.com, Daresbury%20MEW% pdf

ClearSpeed Linpack results
- Standard System
  - Two 3.0 GHz Intel Xeon 5160 (Woodcrest) dual core processors, 16GB memory per node
  - Single server: 34 GFLOPS
  - Four node cluster: 136 GFLOPS
  - Power consumption: 1,940 Watts
  - Benchmark runtime: 48.4 minutes
- ClearSpeed Accelerated System
  - Add two Advance accelerator boards per node (25W per board!)
  - Single server: 90.1 GFLOPS
  - Four node cluster: [...] GFLOPS
  - Power consumption: 2,140 Watts
  - Benchmark runtime: 18.4 minutes
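The Linpack figures above can be combined into an energy comparison. The sketch below is plain arithmetic on the slide's numbers, under the assumption (not stated on the slide) that each power figure is the average draw of the four-node cluster over the whole benchmark run. The 200 W difference is consistent with that reading: 4 nodes x 2 boards x 25 W. The function and variable names are illustrative.

```python
def energy_kwh(power_watts, runtime_minutes):
    # Energy = average power x time, converted from watt-minutes to kWh.
    return power_watts * runtime_minutes / 60.0 / 1000.0


# Figures taken from the slide.
standard_kwh = energy_kwh(1940, 48.4)
accelerated_kwh = energy_kwh(2140, 18.4)

runtime_ratio = 48.4 / 18.4           # how much sooner the run finishes
single_server_speedup = 90.1 / 34.0   # GFLOPS ratio for one server
energy_ratio = standard_kwh / accelerated_kwh
added_power_watts = 2140 - 1940       # equals 4 nodes * 2 boards * 25 W

print(f"standard:    {standard_kwh:.2f} kWh")
print(f"accelerated: {accelerated_kwh:.2f} kWh")
print(f"runtime ratio {runtime_ratio:.2f}, energy ratio {energy_ratio:.2f}")
```

Under that assumption, about 10% more power buys a run that finishes roughly 2.6 times sooner and uses roughly 2.4 times less energy.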

ClearSpeed's CSXL acceleration library
- The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library.
- These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.

Imagine Stream Processor
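The interception idea behind a library such as CSXL can be sketched as a dispatcher that keeps the caller-facing interface unchanged and decides per call where the work runs. This is not ClearSpeed's implementation or API: `host_dgemm`, `accelerator_dgemm`, `make_intercepted_dgemm`, and the size threshold are all hypothetical, and the "accelerator" path simply reuses the host code so the sketch runs anywhere. Matrices are assumed to be non-empty lists of rows.

```python
def host_dgemm(a, b):
    # Plain matrix multiply on the host: (n x k) times (k x m).
    n, k, m = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
        for i in range(n)
    ]


def accelerator_dgemm(a, b):
    # Hypothetical stand-in for an offloaded implementation.
    return host_dgemm(a, b)


def make_intercepted_dgemm(min_dim=64, log=None):
    # Returns a drop-in replacement for host_dgemm. The threshold is an
    # assumption of this sketch: small problems stay on the host because
    # transfer overhead would dominate.
    def dgemm(a, b):
        n, k, m = len(a), len(b), len(b[0])
        target = "accelerator" if min(n, k, m) >= min_dim else "host"
        if log is not None:
            log.append(target)
        if target == "accelerator":
            return accelerator_dgemm(a, b)
        return host_dgemm(a, b)

    return dgemm
```

Application code calls `dgemm(a, b)` exactly as before, which is what makes interception attractive for library-level programmers: no source changes are needed to benefit from the accelerator.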

Transforming Memory Accesses to Communication for Scalability
- Software challenge: deliver productivity of shared memory model, combined with scalability of communication model

Example of how Compilers can Help
- Opportunity for new languages to reduce compiler effort and broaden applicability
Source: UIUC ECE 497, courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt
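A small sketch of what transforming memory accesses into communication can mean in practice. The first function is written in shared-memory style, touching main memory element by element; the second is the form a compiler or programmer might derive for an accelerator with a small local memory, using explicit bulk transfers. The function names and the `local_capacity` parameter are assumptions of this sketch, and list slicing stands in for DMA-style get and put operations.

```python
def scale_shared(main_memory, factor):
    # Shared-memory style: every read and write goes to main memory.
    for i in range(len(main_memory)):
        main_memory[i] = main_memory[i] * factor


def scale_with_transfers(main_memory, factor, local_capacity=4):
    # Communication style: move one tile at a time into a local buffer,
    # compute on the local copy, then write the tile back.
    transfers = 0
    for start in range(0, len(main_memory), local_capacity):
        local = main_memory[start:start + local_capacity]  # bulk "get"
        transfers += 1
        for j in range(len(local)):
            local[j] = local[j] * factor
        main_memory[start:start + len(local)] = local      # bulk "put"
        transfers += 1
    return transfers
```

Both versions compute the same result. The second replaces one memory access per element with two bulk transfers per tile, which is the property that lets the communication model scale.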

Code Partitioning for Heterogeneous Processors
Factors to consider when extracting a region of code for execution on an accelerator:
- Matching operations in code region with primitives in accelerator (includes instruction selection and FPGA synthesis)
- Establishing coherence between main and local memories
- Obeying local memory size constraints
- Volume of data to be communicated
- Granularity of region relative to overhead of thread creation
- Structural constraints of task/thread being extracted
- Cloning of code that needs to be executed on multiple elements
- Coordination with rest of the program (coroutine vs. macrodataflow models)

Reading List for Next Lecture (Sep 10th)
1. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006
2. Dynamic Multigrain Parallelization on the Cell Broadband Engine, F. Blagojevic et al, PPoPP 2007 Best Paper, March 2007
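Three of the code-partitioning factors listed above (local memory size, volume of data communicated, and granularity relative to launch overhead) can be folded into a simple profitability check. This is a toy cost model under stated assumptions, not a method from the lecture: the function name, its parameters, and the numbers in the usage note are all hypothetical.

```python
def should_offload(host_seconds, accel_seconds, bytes_moved,
                   bandwidth_bytes_per_s, launch_overhead_s,
                   working_set_bytes, local_memory_bytes):
    # Local memory size constraint: reject regions that cannot fit.
    if working_set_bytes > local_memory_bytes:
        return False
    # Volume of data to be communicated, as transfer time.
    transfer_seconds = bytes_moved / bandwidth_bytes_per_s
    # Granularity: the region must be large enough to amortize launch
    # and transfer costs on top of the accelerated compute time.
    offload_seconds = launch_overhead_s + transfer_seconds + accel_seconds
    return offload_seconds < host_seconds
```

For example, a region that takes 10 ms on the host and 1 ms on the accelerator is worth extracting if moving 1 MB at 1 GB/s (1 ms) plus a 0.1 ms launch still totals less than 10 ms. The same region is rejected if its working set exceeds the local memory, regardless of the timing.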

Announcement: Kickoff Meeting for Habanero Multicore Software Research Project
- Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and it is synergistic with the expertise we have in various CS groups at Rice, including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups.
- A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH [...]
- Cookies will be served!

BACKUP SLIDES START HERE

Freescale MPC8572 PowerQUICC III Processor
- Dual Embedded e500 core
  - 36-bit physical addressing
  - Double-precision floating-point
- Integrated L1/L2 cache
  - L1 cache: 32 KB data and 32 KB instruction
  - Shared L2 cache: 1 MB with ECC
  - L2 configurable as SRAM or cache; I/O transactions can be stashed into L2 cache regions
- Integrated DDR memory controller with full ECC support
- Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
- Four on-chip triple-speed Ethernet controllers

Freescale MPC8572 PowerQUICC III Processor
Source: Freescale

AMD's use of HyperTransport (Torrenza)
- Torrenza technology
  - Allows licensing of coherent HyperTransport to 3rd party manufacturers to make socket-compatible accelerators/coprocessors
  - Allows 3rd party PPUs (Physics Processing Units), GPUs, and coprocessors to access main system memory directly and coherently
  - Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory
