The University of Texas at Austin
Slide 1: EE382N: Principles of Computer Architecture, Parallelism and Locality, Fall 2009
Lecture 24: Stream Processors Wrap-up + Sony/Toshiba/IBM Cell Broadband Engine
Mattan Erez, The University of Texas at Austin
Slide 2: Outline

- Stream wrap-up: SRF design
- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Stream Processors
- Software (probably next time)

All Cell-related images and figures: Sony and IBM; Cell Broadband Engine (c) Sony Corp.
Slide 3: Heat-map (Area per FPU), 64-bit

[Figure: heat-map of area per 64-bit FPU over the number of FPUs per cluster (ILP) and the number of clusters (DLP), showing the area overheads of the intra-cluster switches, the instruction sequencer, and the inter-cluster switch]

- Many reasonable hardware options for 64-bit
Slide 4: Application Performance

[Figure: relative runtime of CONV2D, DGEMM, FFT3D, FEM, MD, and CDP across machine configurations (1, 16, 4), (2, 8, 4), (16, 1, 4), (1, 32, 2), (2, 16, 2), and (4, 8, 2), broken down into all_seq_busy, some_seq_busy_mem_busy, no_seq_busy_mem_busy, and some_seq_busy_mem_idle]

- Small performance differences for good streaming applications
Slide 5: SRF Decouples Execution from Memory

- Static latencies inside the clusters; unpredictable I/O latencies outside
- Bandwidth hierarchy: cluster switches (3,840 GB/s), SRF lanes (512 GB/s), inter-cluster and memory switches to cache banks (64 GB/s), DRAM banks and I/O pins (<64 GB/s)

[Figure: clusters and SRF lanes on one side of the inter-cluster and memory switches, cache banks and DRAM banks on the other, annotated with the bandwidths above]
Slide 6: SRF Sequential Access

- Single-ported memory; efficient wide access of 4 contiguous words
- Implemented using sub-arrays: reduced access time, reduced power
- Stream buffers match bandwidth to compute needs and time-multiplex the SRF port
- Wide single port time-multiplexed by stream buffers

[Figure: SRF bank 0 (256b port, sub-arrays 0-3 with local WL drivers) feeding compute cluster 0 (64b)]
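As a rough illustration of the mechanism on this slide (a sketch, not Merrimac code; all names are invented here): one wide SRF port reads 4 contiguous words per access, and small per-stream buffers time-multiplex that port while each compute cluster drains one word per cycle.

```c
/* Sketch of a time-multiplexed wide SRF port with stream buffers.
 * One wide access fetches 4 contiguous words; a cluster consumes only
 * one word per cycle, so one port can feed several streams. */
#include <assert.h>

#define WIDE_WORDS 4          /* words per wide SRF port access */

typedef struct {
    int data[WIDE_WORDS];     /* stream buffer, refilled 4 words at a time */
    int count;                /* words currently buffered */
    int next_addr;            /* next sequential SRF address for this stream */
} stream_buf;

/* One SRF port cycle: the single wide port serves whichever stream
 * buffer has run dry, refilling it with 4 contiguous words. */
static void srf_port_cycle(stream_buf *sb, int nstreams) {
    int victim = 0;
    for (int i = 1; i < nstreams; i++)
        if (sb[i].count < sb[victim].count) victim = i;
    if (sb[victim].count == 0) {
        for (int w = 0; w < WIDE_WORDS; w++)
            sb[victim].data[w] = sb[victim].next_addr + w; /* "read" the SRF */
        sb[victim].count = WIDE_WORDS;
        sb[victim].next_addr += WIDE_WORDS;
    }
}

/* A compute cluster consumes one word per cycle from its stream buffer. */
static int consume(stream_buf *sb) {
    assert(sb->count > 0);    /* buffers are sized so this never underflows */
    return sb->data[WIDE_WORDS - sb->count--];
}
```

Because each cluster needs only one word per cycle, the single wide port has slack to refill every stream buffer in turn, which is exactly the bandwidth-matching role the slide attributes to stream buffers.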
Slide 7: In-lane Indexing Almost Free

- Same single-ported memory as before: efficient wide access of 4 contiguous words, sub-arrays for reduced access time and power, stream buffers matching bandwidth to compute needs and time-multiplexing the SRF port
- Indexed SRF access at low extra cost: an 8:1 mux in the sub-arrays and a pre-decode and row decoder per sub-array, fed by address FIFOs

[Figure: SRF bank 0 with per-sub-array pre-decode and row decoders, address FIFOs, and the 8:1 mux feeding compute cluster 0]
Slide 8: Stream Processors and GPUs

Stream Processors:
- Bulk kernel computation; kernel uses a scalar ISA (VLIW + SIMD)
- Bulk memory operations; software latency hiding
- Stream memory system; HW-optimized local memory
- Locality opportunities: minimize off-chip transfers, with a capable memory system
- So far mostly load-time parameters

GPUs:
- Bulk kernel computation; kernel uses a scalar ISA (SIMD)
- Scalar memory operations; threads to hide latency and to fill the memory pipe
- Small shared memory
- Limited locality: rely on off-chip BW (needed for graphics)
- Dynamic workloads, mostly read-only
Slide 9: Conclusions

- Stream Processors offer extreme performance and efficiency; they rely on software to enable more efficient hardware
- Empower software through new interfaces: exposed locality hierarchy, exposed communication, bulk operations and hierarchical control
- Decouple the execution pipeline from unpredictable I/O
- Help software where it makes sense: aggressive memory system with SIMD alignment, multiple parallelism mechanisms (can skip short vectors), hardware assist for bulk operation dispatch
- Stream Processors offer programmable, efficient, high performance
Slide 10: Cell Broadband Engine
Slide 11: Outline

- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Stream Processors
- Software (probably next time)
Slide 12: Cell Motivation, Part I

- Performance-demanding applications have different characteristics: parallelism, locality, real time (games, graphics, multimedia)
- They require a redesign of HW and SW to provide efficient high performance: power, memory, and frequency walls
- Cell was designed specifically for these applications; requirements set by Sony and Toshiba, main design and architecture at IBM
Slide 13: Move to IBM Slides

- Rest of the motivation and architecture slides taken directly from talks by Peter Hofstee, IBM
- Separate PDF file combined from: Slides/01_Hofstee_Cell.pdf
Slide 14: Outline

- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Stream Processors
- Software (probably next time)
Slide 15: Hardware Efficiency, Greater Software Responsibility

- Hardware matches VLSI strengths: throughput-oriented design; parallelism, locality, and partitioning; hierarchical control to simplify instruction sequencing; minimalistic HW scheduling and allocation
- Software is given more explicit control: explicit hierarchical scheduling and latency hiding (schedule), explicit parallelism (parallelize), explicit locality management (localize)
- Must reduce HW waste, but no free lunch
Slide 16: Locality
Slide 17: Storage/Bandwidth Hierarchy is Key to Efficient High Performance

[Figure: the bandwidth hierarchy from slide 5 annotated with capacities: DRAM banks (<64 GB/s over I/O pins), cache banks (64 GB/s), SRF lanes (512 GB/s), and cluster switches feeding 64-bit MADDs in 16 clusters (3,840 GB/s); capacity labels include 2 GB, 512 KB, 1 MB, and ~61 KB]

- From "Unleashing the power: A programming example of large FFTs on Cell", given by Alex Chow at power.org on 6/9/2005
Slide 18: SRF/LS Comparison

- Both serve as a staging area for memory and capture locality as part of the storage hierarchy
- Both have a single time-multiplexed wide port shared by kernel access, DMA access, and instruction access
- SP uses word granularity vs. Cell's 4-word granularity
- SP's SRF has an efficient auto-increment access mode
- Cell uses one memory for both code and data. Why?
Slide 19: Parallelism
Slide 20: Three Types of Parallelism in Applications

- Instruction-level parallelism (ILP): multiple instructions from the same basic block (loop body) that can execute together; true ILP is usually quite limited (~5 to ~20 instructions)
- Task-level parallelism (TLP): separate high-level tasks (different code) that can run at the same time; true TLP is very limited (only a few concurrent tasks)
- Data-level parallelism (DLP): multiple iterations of a loop that can execute concurrently; DLP is plentiful in scientific applications
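The loop structure distinctions above can be made concrete in plain C (illustrative examples, not from the lecture):

```c
/* DLP: iterations are independent, so any subset can run concurrently --
 * on SIMD lanes, across clusters/SPEs, or in separate threads.  TLP
 * would be running two unrelated functions like these at the same time. */
void scale(float *y, const float *x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];        /* no dependence between iterations */
}

/* Limited ILP only: the loop-carried dependence on s serializes the
 * iterations; only the few independent operations inside one iteration
 * (address computation, the load, the add) can overlap. */
float reduce_sum(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];              /* iteration i needs the s from i-1 */
    return s;
}
```

The first loop has n-way DLP; the second has essentially none, which is why DLP-rich codes map so much better onto wide machines.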
Slide 21: Taking Advantage of ILP

- Multiple FUs (VLIW or superscalar): Cell has limited superscalar (not for FP); Merrimac has 4-wide VLIW FP ops
- Latency tolerance (pipeline parallelism): Cell has 7 FP instructions in flight; Merrimac expected to have ~24 FP
- Merrimac uses VLIW to avoid interlocks and bypass networks
- Cell also emphasizes static scheduling; not clear to what extent dynamic variations are allowed
Slide 22: Taking Advantage of TLP

- Multiple FUs (MIMD): Cell can run a different task (thread) on each SPE, plus asynchronous DMA on each SPE (the DMA must be controlled by the SPE kernel); Merrimac can run a kernel and DMA concurrently, with DMAs fully independent of the kernels
- Latency tolerance: concurrent execution of different kernels and their associated stream memory operations
Slide 23: Taking Advantage of DLP

- Multiple FUs:
  - SIMD: a very (most?) efficient way of utilizing parallelism; Cell has 4-wide SIMD, Merrimac 16-wide
  - MIMD: convert DLP to TLP and use MIMD for different tasks
  - VLIW: convert DLP to ILP and use VLIW (unrolling, SWP)
- Latency tolerance: overlap memory operations and kernel execution (SWP and unrolling); take advantage of pipeline parallelism in memory
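The "convert DLP to ILP" bullet can be sketched with a classic unrolled dot product (illustrative code, not from the lecture): the four accumulators are independent, so a VLIW or superscalar core can keep four multiply-adds in flight instead of serializing on a single accumulator.

```c
/* Converting DLP to ILP by unrolling with multiple accumulators. */
float dot_unrolled4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {   /* main loop, unrolled by 4 */
        s0 += a[i]     * b[i];          /* the four accumulators are   */
        s1 += a[i + 1] * b[i + 1];      /* independent, so these four  */
        s2 += a[i + 2] * b[i + 2];      /* multiply-adds can overlap   */
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                  /* epilogue for leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);       /* tree-shaped final reduction */
}
```

Note the tradeoff the VLIW slides discuss: this transformation grows the code (prologue/epilogue, replicated bodies), which is the "code expansion" cost mentioned later.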
Slide 24: Memory System
Slide 25: High Bandwidth, Asynchronous DMA

- Very high bandwidth memory system: needed to keep the FUs busy even with a storage hierarchy; Cell has ~2 words/cycle (25.6 GB/s), Merrimac is designed for 4 words/cycle
- Sophisticated DMA: stride (with records), gather/scatter (with records)
- Differences in granularity of DMA control: Merrimac treats DMA as stream-level operations, Cell treats DMA as kernel-level operations
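The two bulk access patterns named above ("stride with records" and "gather/scatter with records") can be emulated with plain loads; in a sketch like this, memcpy stands in for what would really be an asynchronous DMA engine, and a "record" is a group of contiguous words moved as a unit. The function names are illustrative, not Cell or Merrimac APIs.

```c
#include <string.h>

/* Strided gather of fixed-size records: record i starts at
 * base + i*stride_words and is record_words long. */
void gather_strided(float *dst, const float *base,
                    int record_words, int stride_words, int nrecords) {
    for (int i = 0; i < nrecords; i++)
        memcpy(dst + i * record_words,
               base + i * stride_words,
               record_words * sizeof(float));
}

/* Indexed gather (scatter/gather): record i starts at base + idx[i],
 * so arbitrary access patterns can be packed densely into dst. */
void gather_indexed(float *dst, const float *base, const int *idx,
                    int record_words, int nrecords) {
    for (int i = 0; i < nrecords; i++)
        memcpy(dst + i * record_words,
               base + idx[i],
               record_words * sizeof(float));
}
```

The stream-level vs. kernel-level distinction in the last bullet is about who runs these loops: a separate memory system (Merrimac) or the compute kernel itself (Cell).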
Slide 26: Design Choices Discussion
Slide 27: Is VLIW a Good Idea?

- Want to take advantage of all available ILP; Merrimac has a shallower pipeline than Cell
- Efficiency: allows exposed communication at the kernel level; no need for interlocks (or at least cheaper ones); no bypass network
- Manages a larger register space: a distributed/clustered register file allows relaxing SIMD restrictions with communication
- VLIW code expansion can be a problem, especially for MIMD; requires recompilation; interlocks allow correct (possibly inefficient) execution
Slide 28: Is MIMD a Good Idea?

- Probably yes for limited parallelism: true for PS3, probably not for scientific applications
- May help irregular applications: they work well on Merrimac with relaxed-SIMD, but there are overheads (padding, extra node copies, SIMD conditionals)
- Multiple sequencers and extra instruction storage: instruction storage might be saved by pipelining kernels, but that is not always applicable and might increase bandwidth
- Cell's sequencer seems to be ~12% of an SPE
Slide 29: Is Kernel-Level DMA Control a Good Idea?

- MIMD (almost) dictates kernel-level control
- Consumes issue slots
- Works well for fixed-rate streams; requires multi-threading for irregular stream access (stream-level DMA lets hardware handle synchronization more efficiently)
- May allow pointer chasing? The only real advantage is making the way-too-long memory latency loop a little shorter by localizing control
Slide 30: Outline

- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Merrimac
- Software
Slide 31: Parallelism in Cell

- PPE has SIMD VMX instructions and is 2-way SMT
- 8 SPEs; each SPE operates exclusively on 4-vectors
- Odd/even instruction pipes: odd for branching and memory, even for compute
- Pipelining for taking advantage of ILP
- Asynchronous DMAs: SWP of communication and computation; memory-level parallelism
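The "SWP of communication and computation" bullet is, in practice, the double-buffering idiom: while the SPE computes on one LS-resident buffer, the next tile is already in flight into the other. A sketch (memcpy stands in for an asynchronous DMA get plus its completion wait; none of these names are Cell SDK calls):

```c
#include <string.h>

#define TILE 64                 /* words per tile staged into the LS */

static float local_buf[2][TILE]; /* two LS-resident buffers */

/* Double-buffered streaming reduction: overlap the "DMA" of tile t+1
 * with the computation on tile t. */
float process_stream(const float *in, int ntiles) {
    float acc = 0.0f;
    int cur = 0;
    memcpy(local_buf[cur], in, sizeof(local_buf[cur])); /* prologue: fetch tile 0 */
    for (int t = 0; t < ntiles; t++) {
        int nxt = cur ^ 1;
        if (t + 1 < ntiles)     /* start the "DMA" of the next tile now */
            memcpy(local_buf[nxt], in + (t + 1) * TILE, sizeof(local_buf[nxt]));
        for (int i = 0; i < TILE; i++)   /* compute on the current tile */
            acc += local_buf[cur][i];
        cur = nxt;              /* swap buffers */
    }
    return acc;
}
```

On real hardware the transfer and the compute loop genuinely overlap, so memory latency is hidden as long as each tile's compute time exceeds its transfer time.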
Slide 32: Locality in Cell

- PPE has L1 and L2 caches; the L2 cache is 512 KB
- SPEs have lots of registers and a local store: 128 registers per SPU, 256 KB LS per SPU
Slide 33: Communication and Synchronization in Cell

- Element Interconnect Bus (EIB): carries all data between the memory system and the PPE/SPEs, and all SPE-to-SPE, SPE-to-PPE, and PPE-to-SPE traffic
- The PPE's L2 cache is coherent with memory; SPEs are not coherent (the LS is not a cache), though SPE DMA is coherent with the L2
- SPEs can DMA to/from one another: LS space is memory-mapped into the global namespace, and each SPE has local virtual memory translation to do this
- No automatic coherency, so explicit synchronization: through memory or mailboxes
Slide 34: Cell Software Challenges

- Separate code for PPE and SPEs; explicit synchronization
- SPEs can only access memory through DMAs; DMA is asynchronous, but the prep instructions are part of SPE code; SW is responsible for consistency and coherency
- SPEs must be programmed with SIMD
- Lots of pipeline challenges left up to the programmer/compiler: deep pipeline with no branch predictor; 2-wide scalar pipeline needs static scheduling
- LS shared by DMA, instruction fetch, and SIMD LD/ST; no memory protection on the LS (the stack can eat data or code)
- The most common programming system is just a low-level API and intrinsics (the Cell SDK); luckily, other options exist
Slide 35: Separate Code for PPE and SPEs

- The PPE should be doing top-level control and I/O only; trying to compute on the PPE is problematic (in-order 2-way SMT with a deep pipeline, busy enough just coordinating the SPEs)
- The SPEs are the parallel compute engines: in the PowerPC architecture family, but really a different ISA and requirements (vector only, 2-way superscalar)
- The two processor types really mean two different software systems
- Synchronization can be tricky to get efficient: PPE-to-SPE should use a mailbox, SPE-to-PPE should use an L2 cache location. Why?
- Spawning threads on SPEs is painfully slow
Slide 36: Memory System Challenges

- No arbitrary global LD/ST from an SPE; everything must go through DMA
- Strict DMA alignment and granularity restrictions: all DMAs must be aligned to 16 bytes and request at least 16 bytes; all DMAs should be aligned to 128 bytes and request at least 128 bytes of contiguous memory; failing to maintain alignment will cause a bus error!
- Banked DRAM structure: consecutive DMA requests should span 2 KB (8 banks)
- DMA commands are issued from within the SPE: strided data access; a chain of stride commands gives arbitrary gather; the chained list of commands is prepared by the SPE and stored in the LS
- Need to explicitly synchronize with DMA completion
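The alignment rules above reduce to simple address arithmetic: to transfer an arbitrary [addr, addr+len) region, round the start down and the end up to the required boundary and transfer the padded region instead. A sketch with illustrative helper names (not Cell SDK calls):

```c
#include <stdint.h>

#define DMA_MIN_ALIGN  16u    /* hard rule: 16-byte aligned, at least 16 bytes */
#define DMA_PREF_ALIGN 128u   /* performance rule: 128-byte aligned chunks */

/* Round down / up to a power-of-two boundary b. */
static uint64_t align_down(uint64_t a, uint64_t b) { return a & ~(b - 1); }
static uint64_t align_up(uint64_t a, uint64_t b)   { return (a + b - 1) & ~(b - 1); }

/* Compute the padded transfer covering [addr, addr+len) at the
 * preferred 128-byte alignment; the SPE then picks its data out of the
 * padded region in the LS. */
static void padded_dma(uint64_t addr, uint64_t len,
                       uint64_t *dma_addr, uint64_t *dma_len) {
    *dma_addr = align_down(addr, DMA_PREF_ALIGN);
    *dma_len  = align_up(addr + len, DMA_PREF_ALIGN) - *dma_addr;
}
```

This is also why data structures intended for SPE consumption are usually declared with 16- or 128-byte alignment in the first place, so no padding is needed.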
Slide 37: SPE Programming Challenges

- Vector only: scalars must be placed in the LSW of a vector, and some instructions take this into account; all casts and operations are aligned to 16 bytes (4 words); poor integer-float casting
- The LS is the only SRAM in an SPE: be careful with a growing stack
- Sometimes need to force instruction prefetch to avoid stalls: priority is given to DMA first, then LD/ST, and only then fetch
- No branch prediction (predict not-taken?); branch hint instructions can be inserted explicitly
- Some help from the XLC compiler (and an occasional bug)
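The "scalars live in one slot of a vector" rule can be emulated in portable C (a sketch under stated assumptions: the SPU has no separate scalar registers, so a scalar occupies one slot of a 128-bit register and scalar arithmetic is done with vector instructions; slot 0 stands in for the preferred slot here, whereas on real hardware the slot depends on the data type):

```c
/* Emulation of scalar-in-vector execution with a 4-word "register". */
typedef struct { int w[4]; } qword;   /* stands in for a 128-bit SPU register */

/* Replicate a scalar into all lanes (what a splat intrinsic would do). */
static qword splat(int s) {
    qword q = {{s, s, s, s}};
    return q;
}

/* A "scalar" add performed as a full-width vector add: all four lanes
 * compute, but only the preferred slot's result is meaningful. */
static int scalar_add(int a, int b) {
    qword va = splat(a), vb = splat(b), vc;
    for (int i = 0; i < 4; i++)
        vc.w[i] = va.w[i] + vb.w[i];
    return vc.w[0];                   /* read back the preferred slot */
}
```

This is why heavily scalar code runs poorly on an SPE: every scalar operation pays for a full 4-wide vector operation plus the shuffles needed to move data in and out of the preferred slot.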
Slide 38: Overall

- Lots to deal with: explicit communication and synchronization, explicit locality, explicit parallelism, explicit pipeline scheduling
- Need some help from programming tools
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationVLIW/EPIC: Statically Scheduled ILP
6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind
More informationCUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More informationComputer Architecture Area Fall 2009 PhD Qualifier Exam October 20 th 2008
Computer Architecture Area Fall 2009 PhD Qualifier Exam October 20 th 2008 This exam has nine (9) problems. You should submit your answers to six (6) of these nine problems. You should not submit answers
More informationSupercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationEvaluating the Portability of UPC to the Cell Broadband Engine
Evaluating the Portability of UPC to the Cell Broadband Engine Dipl. Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS Outline Introduction UPC Cell UPC on Cell Mapping Compiler and
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationCOSC 6385 Computer Architecture - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity
More informationEE382 Processor Design. Concurrent Processors
EE382 Processor Design Winter 1998-99 Chapter 7 and Green Book Lectures Concurrent Processors, including SIMD and Vector Processors Slide 1 Concurrent Processors Vector processors SIMD and small clustered
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationCompilation for Heterogeneous Platforms
Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey
More informationMEMORY HIERARCHY DESIGN FOR STREAM COMPUTING
MEMORY HIERARCHY DESIGN FOR STREAM COMPUTING A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationHigh-Performance Processors Design Choices
High-Performance Processors Design Choices Ramon Canal PD Fall 2013 1 High-Performance Processors Design Choices 1 Motivation 2 Multiprocessors 3 Multithreading 4 VLIW 2 Motivation Multiprocessors Outline
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationEmbedded Systems: Hardware Components (part I) Todor Stefanov
Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More information