Heterogeneous Processing

Heterogeneous Processing Maya Gokhale maya@lanl.gov

Outline. This talk covers:
- Architectures: chip, node, system
- Programming models and compilers (overview)
- Applications
- Systems software

Clock rates and transistor counts (comparison chart): FPGA, Cell, Intel dual core, Opteron.

How are the transistors being used? (die-photo comparison: G4e, dual core Opteron, FPGA, Clearspeed, Cell, Mathstar) http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars/3

Clearspeed CSX600 coprocessor layout (die diagram: debug, SRAM, memory interface, bus, processor core, ISU, system services, chip-to-chip bridge ports)
- Array of 96 Processor Elements at 250 MHz
- IBM 0.13µ process, 8-layer metal (copper)
- 47% logic, 53% memory: more logic than most processors!
- 15 mm x 15 mm die size, 128 million transistors
- Approximately 10 Watts
- Processor sparing allows high yields, currently ~40%

CSX600 processor core: Multi-Threaded Array Processing
- Programmed in high-level languages
- Hardware multi-threading for latency tolerance
- Asynchronous, overlapped I/O
- Run-time extensible instruction set; bi-endian
- Array of 96 Processor Elements (PEs); each is a VLIW core, not just an ALU
- 4-stage 32-bit or 64-bit multiply-add pipelines; divide/square root unit
- Built-in PE fault tolerance and resiliency
- Closely coupled 6KB local SRAM per PE; independent address pointers per PE; 128 bytes of registers per PE
- High performance, low power dissipation

Mathstar Field Programmable Object Array (FPOA): a coarse-granularity reprogrammable device
- Silicon Objects are 16-bit configurable machines, such as an Arithmetic Logic Unit (ALU), Multiply-Accumulator (MAC), or Register File (RF); both Silicon Object behavior and the interconnection among Silicon Objects are field-programmable
- 400 Silicon Objects operating at up to 1 GHz: 256 ALUs, 64 MACs, 80 RFs
- Two bi-directional 500 MHz DDR 16-bit LVDS ports (64 Gbps of bandwidth)
- 96 pins of LVCMOS GPIO, operating either synchronously or asynchronously at up to 100 MHz
- Twelve banks of 500 MHz internal SRAM (57 GBytes/sec)
- Two 266 MHz 36-bit DDR (72 bits per cycle) RLDRAM II controllers for external memory accesses (4.8 GBytes/sec)
- On-chip RAM, high-speed LVDS I/O and general purpose I/O
- Optimized for DSP applications

Cell (source: Wikipedia). The Cell Broadband Engine ASIC contains:
- A 64-bit PowerPC core with two hardware threads
- 8 single precision FP processors (SPEs) with a 4-way SIMD instruction set and 256KB of local memory each
- Element Interconnect Bus, 200 GB/s peak
- 4GB XDR DRAM (based on Rambus, lower latency than DDR), 25.6 GB/s
- Runs at 3.2 GHz

GRAPE gravity pipeline
- Series of ASICs specifically tailored to perform various sorts of force calculations
- Special purpose accelerator board on the PCI bus of a workstation
- GRAPE computes only the N^2 force calculations; the microprocessor does all the rest
- Communication cost is O(N) (N = number of particles), computation is O(N^2)
- GRAPE function also ported to reconfigurable hardware (FPGA): PROGRAPE
- http://grape.astron.s.u-tokyo.ac.jp/~makino/papers/gbp2000-full
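
To make the division of labor concrete, the sketch below shows the O(N^2) pairwise gravitational accelerations a GRAPE-style board would compute, and the O(N) integration step the host microprocessor keeps. This is a generic C illustration, not GRAPE's actual interface; the function and variable names are made up.

```c
#include <math.h>

#define G 6.674e-11  /* gravitational constant */

/* O(N^2) part: what a GRAPE-style accelerator computes.
 * For each particle i, accumulate acceleration from every other particle j. */
void compute_accelerations(int n, double pos[][3], double mass[], double acc[][3])
{
    for (int i = 0; i < n; i++) {
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx*dx + dy*dy + dz*dz;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            acc[i][0] += G * mass[j] * dx * inv_r3;
            acc[i][1] += G * mass[j] * dy * inv_r3;
            acc[i][2] += G * mass[j] * dz * inv_r3;
        }
    }
}

/* O(N) part: what stays on the host microprocessor
 * (simple leapfrog-style velocity/position update). */
void integrate(int n, double pos[][3], double vel[][3], double acc[][3], double dt)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) {
            vel[i][k] += acc[i][k] * dt;
            pos[i][k] += vel[i][k] * dt;
        }
}
```

Only the particle positions and masses (O(N) data) cross the PCI bus each step, while the accelerator performs the O(N^2) arithmetic, which is why the accelerator pays off for large N.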

System on Chip (SoC) architectures: multiple, heterogeneous resources integrated onto the IC
- Memory: SRAM for L1/L2 cache; embedded DRAM (BG/L); configurable Block RAM on FPGA (configure width, number of ports)
- Computational resources:
  - Multiple complete CPUs (cache coherence is an issue: BG/L nodes are not coherent, Opterons are)
  - RISC processor + multiple SIMD/vector units: Cell PPU + SPUs
  - Vertex and fragment pipelines: GPUs
  - Multiple ALUs: Clearspeed's 96 SIMD processing elements
  - Multiple function units: 2 FP units on each BG/L processor
  - Arrays of hard multipliers and MAC units on FPGAs
  - Reconfigurable logic: create specialized DSP pipelines, floating point units, crypto processors
  - ASICs: GRAPE gravity pipeline for n-body computations

Heterogeneous system characteristics
- Specialized to an application class: floating point, multimedia, signal/image processing, cryptography
- Provides extremely high performance on kernel operations
- Relative power requirements:
  - Dual core Opteron: 68W (rev. F chip, 90nm) or 95W
  - Dual core Intel: 135W
  - Cell (PCI-E card): 150W
  - GPU: 28W idle, 50W 2D graphics, 120-130W 3D graphics
  - Clearspeed: 25W
  - FPGA: 8W
  - GRAPE-4 chip: 5-8W

Memory hierarchy
- Register set: O(100) bytes, at processor clock rate (2-3 GHz)
- On-chip SRAM: 8KB-16KB L1, 1MB L2; 100-200 MHz (can be DDR or QDR); multiple parallel banks, possibly dual ported
- Off-chip SRAM: 16MB; 100-200 MHz; multiple parallel banks
- Off-chip DRAM: 4-8GB, 16MHz
- The local memory hierarchy is implicit in processor-based architectures: write code to be cache friendly; minimize non-local memory accesses in NUMA parallel machines
- The memory hierarchy is exposed in accelerator architectures, and memory usage must be explicitly managed: Block RAM on FPGAs, local scratchpad and DRAM in Clearspeed, local memory and DRAM in Cell
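
A minimal sketch of what "explicitly managed" means, assuming a scratchpad of LOCAL_TILE floats and hypothetical dma_get/dma_put primitives (modeled with memcpy here so the sketch runs on a plain host): the program, not the cache hardware, stages each tile of a DRAM-resident array into local memory, computes on it, and writes it back. The same pattern applies to Cell local store, ClearSpeed PE memory, or FPGA Block RAM.

```c
#include <string.h>
#include <stddef.h>

#define LOCAL_TILE 1024  /* elements assumed to fit in the local scratchpad */

/* Hypothetical DMA primitives standing in for a platform API
 * (MFC DMA on Cell, memory-transfer calls on ClearSpeed, etc.). */
static void dma_get(void *local_dst, const void *remote_src, size_t bytes) {
    memcpy(local_dst, remote_src, bytes);
}
static void dma_put(void *remote_dst, const void *local_src, size_t bytes) {
    memcpy(remote_dst, local_src, bytes);
}

static float tile[LOCAL_TILE];  /* would live in fast local/scratchpad memory */

/* Scale a large DRAM-resident vector one tile at a time.
 * The programmer, not a cache, decides what is resident in local memory. */
void scale_vector(float *dram_data, size_t n, float alpha)
{
    for (size_t base = 0; base < n; base += LOCAL_TILE) {
        size_t count = (n - base < LOCAL_TILE) ? (n - base) : LOCAL_TILE;

        dma_get(tile, dram_data + base, count * sizeof(float));  /* stage in */

        for (size_t i = 0; i < count; i++)                       /* compute locally */
            tile[i] *= alpha;

        dma_put(dram_data + base, tile, count * sizeof(float));  /* write back */
    }
}
```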

Putting heterogeneous processors into systems

Workstation accelerator: I/O card
- Collection of interconnected SoC processors; each processor has a dedicated (or shared) memory subsystem
- The accelerator system attaches to the host workstation via an I/O bus
- Data acquisition channel for direct access to real-time data streams

Clusters with accelerators
- High performance, multi-level interconnect: Infiniband, Myrinet, GigE
- High performance microprocessors: 64-bit, multi-core, multi-socket
- Co-processors: graphics boards, floating point arrays, FPGAs
- The accelerator can be: a peer to the microprocessor on the network; a peer to the microprocessor across sockets on HyperTransport; on the I/O bus; on the memory bus

SRC Computer (system diagram: Hi-Bar switch connecting microprocessor/SNAP/PCI-X nodes, MAP modules with chaining and GPIO, common memory, and disk/LAN/SAN/WAN connections)
- FPGA board (MAP) augments the microprocessor
- MAP on DIMM interface, 2.8 GB/s
- 2 large FPGAs, multiple banks of on-board SRAM, on-board DRAM
- Provides for 20 simultaneous memory accesses @ 150 MHz
- FPGAs can be interconnected independently of the microprocessor

Cray XD1
- Special ASIC to talk HyperTransport
- Only one small FPGA co-processor
- 16MB QDR SRAM, 12 GB/s bandwidth

FPGA co-processors on Opteron motherboards
- HyperTransport connection between Opteron and FPGA
- Use DIMM slots on the motherboard for FPGA memory
- Include additional off-chip SRAM
- Two companies: DRC, XtremeData

Clusters augmented with floating point arrays
- Clearspeed board: recently partnered with IBM to build a cluster of FPA-accelerated nodes; the board contains two CS processors; each processor has 96 double precision SIMD PEs, a RISC control processor, and an I/O controller; they advertise 50 GF sustained DGEMM using 25W
- GPGPU: programmable graphics processors
- Cell Blade: Mercury and IBM have partnered on the Cell Blade architecture; two CBEs per blade; PCI-E or IB to connect the blade to the microprocessor
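
The DGEMM figure matters because dense matrix multiply is normally reached through a standard BLAS call, so an accelerated node can pick up that work without source changes if the vendor supplies a drop-in BLAS. The sketch below is just an ordinary host-side CBLAS call of the kind such a library could intercept; it shows no ClearSpeed-specific API.

```c
#include <cblas.h>

/* Computes C = alpha*A*B + beta*C with a standard BLAS call.
 * An accelerator vendor's drop-in BLAS could route this to the board;
 * from the application's point of view nothing changes. */
void matmul(int m, int n, int k, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0,          /* alpha */
                A, k,         /* A is m x k, leading dimension k */
                B, n,         /* B is k x n, leading dimension n */
                0.0,          /* beta  */
                C, n);        /* C is m x n, leading dimension n */
}
```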

BG/L
- Each ASIC contains two 700 MHz PPC 440 cores (32-bit), each with a dual-pipeline double precision FP unit
- On-chip DRAM controller
- On-chip interconnect network interface
- 3 communication networks: 3D torus for peer-to-peer, tree for collective communication, fast interrupt network for barriers
- From http://www.llnl.gov/asci/platforms/bluegenel/images/bgl_slide2.gif

Software environments
- Back when floating point co-processors were introduced, targeting the co-processor from the compiler was easy: differentiate by data type. Operations on integer data types translated to integer opcodes and mapped to the integer unit; operations on FP data types translated to floating point opcodes and mapped to the floating point unit.
- It was no longer straightforward when MMX/SSE were introduced: it is harder to determine whether an integer operation should use the multimedia unit or the integer unit, and the compiler is needed to re-factor loops and vectorize (see the SSE sketch below). Alternatives: use libraries written in assembly code (the programmer has to call the library), or add a new data type, e.g. poly, to refer to vector data.
- The problem is even harder with heterogeneous accelerators.
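
A minimal illustration of the vectorization problem, written with x86 SSE intrinsics: the scalar loop and the 4-wide SIMD loop below compute the same result, but someone (compiler, library writer, or programmer) has to produce the second form and handle the leftover elements.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar version: what the programmer naturally writes. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SSE version: 4 single-precision adds per instruction.
 * This is the re-factoring a vectorizing compiler (or a hand-written
 * library) has to do on the programmer's behalf. */
void add_sse(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* scalar cleanup for the tail */
        c[i] = a[i] + b[i];
}
```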

Programming models and compilers
- With heterogeneous accelerators there are: multiple function units, multiple execution units, multiple threads of control, an exposed memory hierarchy, and asymmetrical parallelism (a control RISC processor plus an array of programmable datapath-oriented processors)
- When the accelerator is on an interconnection network or I/O bus, data reorganization and communication costs must be factored into the benefit afforded by the accelerator (see the break-even sketch below)
- The current state of practice is to leave it to the application programmer: partition the program between control and data path; re-organize and align data to match accelerator requirements; communicate and synchronize between multiple parallel processes
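
As a back-of-the-envelope way to "factor in" communication cost, the helper below compares host-only execution time against transfer-plus-accelerator time. The parameters and the simple linear model are illustrative assumptions, not measurements from any of the systems above.

```c
#include <stdbool.h>

/* Rough offload model: moving the data costs latency + bytes/bandwidth,
 * and the kernel then runs at the accelerator's rate instead of the host's.
 * Offloading only pays if transfer + accelerated compute beats host compute. */
bool offload_pays_off(double work_flops,      /* flops in the kernel           */
                      double bytes_moved,     /* data shipped over the I/O bus */
                      double host_flops_s,    /* host sustained flop rate      */
                      double accel_flops_s,   /* accelerator sustained rate    */
                      double bus_bytes_s,     /* I/O bus bandwidth             */
                      double bus_latency_s)   /* per-transfer latency          */
{
    double t_host  = work_flops / host_flops_s;
    double t_accel = bus_latency_s
                   + bytes_moved / bus_bytes_s
                   + work_flops / accel_flops_s;
    return t_accel < t_host;
}
```

With made-up numbers of a 2 GFLOP/s host, a 50 GFLOP/s accelerator, and a 1 GB/s I/O bus (latency ignored), the kernel needs a little over 2 floating point operations per byte moved before offloading wins; below that, the bus dominates and the accelerator sits idle.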

Programming models for heterogeneous computing
- Many opportunities for parallelism require multiple (possibly hierarchical) programming models (sketched below):
  - Process level: between multiple nodes or sockets
  - Thread level: among multiple, possibly heterogeneous, cores
  - Vector/SIMD/systolic/pipeline: within a core
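
A minimal sketch of how the three levels stack in one program, using MPI for the process level, OpenMP for the thread level, and an innermost loop left for SIMD (via a vectorizing compiler or intrinsics, as in the SSE example above). The decomposition shown is illustrative, not taken from the talk.

```c
#include <mpi.h>
#include <omp.h>

#define N 1048576
static float a[N], b[N], c[N];   /* initialization of a and b omitted */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                       /* process level: one rank per node/socket */

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int chunk = N / nranks;                        /* each rank owns a slice of the arrays */
    int lo = rank * chunk;
    int hi = (rank == nranks - 1) ? N : lo + chunk;

    #pragma omp parallel for                       /* thread level: cores within the node */
    for (int i = lo; i < hi; i++)
        c[i] = a[i] + b[i];                        /* innermost level: this loop is what a
                                                      vectorizing compiler turns into SIMD ops */

    /* gathering the slices back together is omitted */
    MPI_Finalize();
    return 0;
}
```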

Applications
- Accelerator architectures are driven by commercial forces: network routing (FPGA); signal and image processing (FPGA, FPOA); multimedia (GPU, Cell)
- The HPC community re-engineers applications to fit accelerators: cryptography; SAR; hyperspectral imagery and video; financial codes; seismic; scientific simulations

GPUs. Pat McCormick pat@lanl.gov

Why graphics processors?
- They're everywhere; most sit idle in desktop systems
- Performance and cost: ~240 GFLOPS peak vs. 12 for Pentium 4; 40+ GB/sec memory bandwidth vs. 6 for Pentium 4
- Designed for parallelism: lots of math units; local 4-way SIMD; dual/co-issue; transistors spent on compute rather than out-of-order execution, prediction, etc.

GPU Performance Trends

The Graphics Pipeline

Architecture (block diagram based on the GeForce 6800 architecture, courtesy of NVIDIA Corp.)
- MIMD engines and SIMD engines
- 48+ cores in the latest GPUs

Programming models: drive the GPU with a graphics-centric API, OpenGL or DirectX (see the OpenGL sketch below).
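
To show what "graphics-centric" means in practice, here is a heavily abridged C sketch of the classic GPGPU idiom of that era: load the input array as a texture, bind a fragment shader that does the per-element math, draw one full-screen quad so the shader runs once per output element, and read the pixels back. Context creation, texture and framebuffer setup, and shader compilation are omitted; input_tex, program, WIDTH, and HEIGHT are assumed to come from that omitted setup code.

```c
#include <GL/gl.h>

/* Assumed to exist from omitted setup code: an RGBA float texture holding
 * the input array, a compiled fragment shader program, and an off-screen
 * render target of WIDTH x HEIGHT pixels. */
extern GLuint input_tex, program;
enum { WIDTH = 1024, HEIGHT = 1024 };

void run_kernel_on_gpu(float *result /* WIDTH*HEIGHT*4 floats */)
{
    glViewport(0, 0, WIDTH, HEIGHT);

    glUseProgram(program);                    /* the "kernel" is a fragment shader */
    glBindTexture(GL_TEXTURE_2D, input_tex);  /* the input array is a texture */

    /* Drawing a full-screen quad makes the rasterizer invoke the fragment
     * shader once per output element. */
    glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
    glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f,  1.0f);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f,  1.0f);
    glEnd();

    /* The "output array" comes back as pixels. */
    glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_FLOAT, result);
}
```

The high-level systems on the next slide (Brook, Sh, PeakStream, Scout) exist precisely to hide this detour through textures, quads, and pixels.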

Programming models
- Low-level options: ATI's CTM (Close To the Metal), SIGGRAPH 2006 presentation; removes graphics state issues, but you code at the assembly level
- High-level options:
  - PeakStream (Matt's talk this afternoon)
  - Scout (Jeff's talk this afternoon)
  - Brook, from Stanford (http://graphics.stanford.edu/projects/brookgpu)
  - Sh, a C++-based metalanguage (http://libsh.org)

Agenda
- Introduction to heterogeneous computing (Maya Gokhale and Pat McCormick, LANL)
- Applications:
  - The Chances and Challenges of Parallelism: Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices (Robert Strzodka, Stanford University)
  - On the Acceleration of Graph Problems on FPGA (Zack Baker, LANL)
  - Speech Recognition on Cell Broadband Engine (Yang Liu, LLNL)
  - Transport Kernel on Cell Broadband Engine (Paul Henning, LANL)
- Systems software and tools:
  - Peakstream Development Environment (Matt Papakipos, Peakstream)
  - The Scout GPU compiler (Jeff Inman, LANL)
  - Array allocation in non-cached memory systems (Justin Tripp, LANL)
  - Compiler Support for Heterogeneous Computing in a CELL Processor (Yuan Zhao, Rice)
  - Program analysis tools for Heterogeneous Computing (Matt Sottile, LANL)
  - Operating Systems issues in Heterogeneous Computing (Ron Minnich, LANL)