Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008


Michael Malone

1 Sony/Toshiba/IBM (STI) CELL Processor Scientific Computing for Engineers: Spring 2008

2 Nec Hercules Contra Plures
A chip's performance is related to its cross section (Pollack's Rule): performance grows roughly as the square root of area
Cray's two oxen
Cray's 1024 chickens

3 Three Performance-Limiting Walls
Power Wall
  Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
Frequency Wall
  Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
Memory Wall
  On multi-gigahertz symmetric multiprocessors, even those with integrated memory controllers, latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

4 Conventional Processors Don't Cut It
"... shallower pipelines with in-order execution have proven to be the most area and energy efficient. [...] we believe the efficient building blocks of future architectures are likely to be simple, modestly pipelined (5-9 stages) processors, floating point units, vector, and SIMD processing elements. Note that these constraints fly in the face of the conventional wisdom of simplifying parallel programming by using the largest processors available. [...]"
  David Patterson, [...] Kathy Yelick, The Landscape of Parallel Computing Research: A View from Berkeley
future architectures: shallow pipelines, in-order execution, SIMD units

5 Cache Memories Don't Cut It
"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."
  H. Peter Hofstee, Cell Broadband Engine Architecture from 20,000 feet
conventional processor: a bucket brigade with a hundred people and only a few buckets

6 Cache Memories Don't Cut It
"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."
  Richard B. Walsh, New Processor Options for HPC
conventional processor: latency-limited bandwidth

7 Conventional Memories Don't Cut It
"More flexible or even reconfigurable data coherency schemes will be needed to leverage the improved bandwidth and reduced latency. An example might be large, on-chip, caches that can flexibly adapt between private or shared configurations. In addition, real-time embedded applications prefer more direct control over the memory hierarchy, and so could benefit from on-chip storage configured as software-managed scratchpad memory. [...]"
  David Patterson, [...] Kathy Yelick, The Landscape of Parallel Computing Research: A View from Berkeley
future architectures: reconfigurable coherency, software-managed scratchpad memory

8 CELL
multi-core, in-order execution, shallow pipeline, SIMD, scratchpad memory
204.8 GFLOPS single precision
204.8 GB/s internal bandwidth
25.6 GB/s memory bandwidth
3.2 GHz, 90 nm SOI
234 million transistors, compared with:
  165 million: Xbox 360
  [...] million: Itanium 2 (2002)
  1,700 million: Dual-Core Itanium 2 (2006), 12.8 GFLOPS (single or double)

9 CELL
PPE: Power Processing Element
SPE: Synergistic Processing Element
SPU: Synergistic Processing Unit
LS: Local Store
MFC: Memory Flow Controller
EIB: Element Interconnection Bus
MIC: Memory Interface Controller

10 Power Processing Element (PPE)
Power 970 architecture compliant
2-way Simultaneous Multithreading (SMT)
32 KB level 1 instruction cache
32 KB level 1 data cache
512 KB level 2 cache
VMX (AltiVec) with 32 128-bit vector registers
standard FPU
  fully pipelined DP with FMA
  6.4 Gflop/s DP at 3.2 GHz
AltiVec
  no DP
  4-way fully pipelined SP with FMA
  25.6 Gflop/s SP at 3.2 GHz

11 Synergistic Processing Elements (SPEs)
128-bit SIMD
128 vector registers
256 KB instruction and data local memory
Memory Flow Controller (MFC)
16-way SIMD (8-bit integer)
8-way SIMD (16-bit integer)
4-way SIMD (32-bit integer, single prec. FP)
2-way SIMD (64-bit double prec. FP)
25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
1.8 Gflop/s DP at 3.2 GHz (7 cycle latency)
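The SIMD widths listed on this slide all follow from dividing the 128-bit register by the element size. A minimal sketch of that arithmetic, where the function name `simd_lanes` is made up for illustration and is not part of any Cell SDK:

```c
#include <assert.h>

/* Width of one SPE vector register in bits (from the slide). */
#define SPE_REGISTER_BITS 128

/* Number of elements of `element_bits` bits that fit in one register.
   Hypothetical helper for illustration only. */
static int simd_lanes(int element_bits)
{
    assert(element_bits > 0 && SPE_REGISTER_BITS % element_bits == 0);
    return SPE_REGISTER_BITS / element_bits;
}
```

Calling `simd_lanes` with 8, 16, 32, and 64 gives the 16-, 8-, 4-, and 2-way figures above.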

12 SPE
SIMD architecture
two in-order (dual issue) pipelines
large register file (128 128-bit registers)
256 KB of scratchpad memory (Local Store)
Memory Flow Controller to DMA code and data from system memory
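Because the Local Store is a scratchpad rather than a cache, a program has to stage data into it explicitly before computing on it. The sketch below only imitates that pattern on an ordinary CPU: `fake_dma_get` is a hypothetical stand-in built on `memcpy` (on a real SPE the MFC would perform the transfer), and the 16 KB chunk size is an assumption chosen to fit comfortably inside 256 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed chunk size: 16 KB, well within a 256 KB Local Store. */
#define CHUNK_BYTES (16 * 1024)
#define CHUNK_FLOATS (CHUNK_BYTES / sizeof(float))

/* Hypothetical stand-in for a DMA "get"; not a real MFC call. */
static void fake_dma_get(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof(float));
}

/* Sum n floats from "main memory" by staging them chunk by chunk
   through a fixed-size local buffer. */
static double sum_via_local_store(const float *main_mem, size_t n)
{
    static float local_buf[CHUNK_FLOATS]; /* plays the role of the Local Store */
    double total = 0.0;

    assert(main_mem != NULL || n == 0);
    for (size_t done = 0; done < n; done += CHUNK_FLOATS) {
        size_t left = n - done;
        size_t count = left < CHUNK_FLOATS ? left : CHUNK_FLOATS;
        fake_dma_get(local_buf, main_mem + done, count);
        for (size_t i = 0; i < count; i++)
            total += local_buf[i]; /* compute touches only local data */
    }
    return total;
}
```

The compute loop never reads `main_mem` directly, which is the discipline a software-managed scratchpad imposes.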

13 Element Interconnection Bus (EIB)
4 16B-wide unidirectional channels
half the system clock (1.6 GHz)
204.8 GB/s bandwidth (arbitration)

14 EIB
16 byte channels
4 unidirectional rings
token based arbitration
half system clock (1.6 GHz)

15 Main Memory System
Memory Interface Controller (MIC)
external dual XDR
3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
each: 8 banks, max 256 MB
total: 16 banks, max 512 MB
25.6 GB/s

16 CELL Performance: Double Precision
In double precision, every seven cycles each SPE can:
  process a two element vector,
  perform two operations on each element.
In one cycle the FPU on the PPE can:
  process one element,
  perform two operations on the element.
SPEs: 8 x 2 x 2 x 3.2 GHz / 7 = 14.6 Gflop/s
PPE: 2 x 3.2 GHz = 6.4 Gflop/s
Total: 14.6 + 6.4 = 21.0 Gflop/s

17 CELL Performance: Single Precision
In single precision, in one cycle each SPE can:
  process a four element vector,
  perform two operations on each element.
In one cycle the VMX on the PPE can:
  process a four element vector,
  perform two operations on each element.
SPEs: 8 x 4 x 2 x 3.2 GHz = 204.8 Gflop/s
PPE: 4 x 2 x 3.2 GHz = 25.6 Gflop/s
Total: 204.8 + 25.6 = 230.4 Gflop/s
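The double- and single-precision peak figures both come from one product: units x vector elements x operations per element x clock, divided by the number of cycles between issues. A minimal sketch of that formula, with function names invented for illustration and the operand values taken from the slides:

```c
#include <assert.h>

/* Peak Gflop/s = units x lanes x ops per element x clock (GHz)
   / cycles between issues. Hypothetical helper. */
static double peak_gflops(int units, int lanes, int ops_per_element,
                          double clock_ghz, int cycles_per_issue)
{
    assert(units > 0 && lanes > 0 && ops_per_element > 0);
    assert(cycles_per_issue > 0);
    return units * lanes * ops_per_element * clock_ghz / cycles_per_issue;
}

/* Double precision: 8 SPEs, 2-element vectors, one issue every 7 cycles,
   plus the PPE FPU handling one element per cycle. */
static double cell_peak_dp(void)
{
    return peak_gflops(8, 2, 2, 3.2, 7) + peak_gflops(1, 1, 2, 3.2, 1);
}

/* Single precision: 8 SPEs and the PPE VMX unit, 4-element vectors,
   one issue per cycle. */
static double cell_peak_sp(void)
{
    return peak_gflops(8, 4, 2, 3.2, 1) + peak_gflops(1, 4, 2, 3.2, 1);
}
```

The factor of two operations per element reflects the fused multiply-add (FMA), and the divisor of seven is the only thing separating the SPE double-precision rate from its single-precision rate per element.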

18 CELL Performance: Bandwidth
Bandwidth at 3.2 GHz clock:
  each SPU: 25.6 GB/s (compare to 25.6 Gflop/s per SPU)
  main memory: 25.6 GB/s
  EIB: 204.8 GB/s (compare to 204.8 Gflop/s for 8 SPUs)
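One way to read the comparisons on this slide is as a ratio of bandwidth to peak compute, in bytes per floating-point operation. The helper below is a hypothetical illustration of that ratio, not something taken from the slides:

```c
#include <assert.h>

/* Bytes of bandwidth available per floating-point operation at peak.
   Hypothetical helper for illustration. */
static double bytes_per_flop(double bandwidth_gb_s, double peak_gflop_s)
{
    assert(peak_gflop_s > 0.0);
    return bandwidth_gb_s / peak_gflop_s;
}
```

With the slide's numbers, `bytes_per_flop(25.6, 25.6)` is 1 byte per flop for a single SPU and `bytes_per_flop(204.8, 204.8)` is 1 byte per flop across the EIB, while main memory against all eight SPUs, `bytes_per_flop(25.6, 204.8)`, is 0.125 bytes per flop.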

19 CELL Performance: Historical Perspective
Connection Machine CM-5 (512 CPUs): 512 x 128 Mflop/s = 65 Gflop/s DP
PlayStation 3 (4 units): 4 x 17 Gflop/s = 68 Gflop/s DP

20 Performance Comparison: Double Precision
1.6 GHz Dual-Core Itanium 2: 1.6 x 4 x 2 = 12.8 Gflop/s
3.2 GHz CELL BE (SPEs only): 3.2 x 8 x 4 / 7 = 14.6 Gflop/s

21 Performance Comparison: Single Precision
1.6 GHz Dual-Core Itanium 2: 1.6 x 4 x 2 = 12.8 Gflop/s
3.2 GHz SPE: 3.2 x 8 = 25.6 Gflop/s
  One SPE = 2 Dual-Core Itanium 2s
3.2 GHz CELL BE (SPEs only): 3.2 x 8 x 8 = 204.8 Gflop/s
  One CBE = 16 Dual-Core Itanium 2s

22 SDK - Platforms
Linux
x86, x86-64, PPC64, CELL BE
RPM-based distribution, Fedora Core recommended
CELL plugins for Eclipse available

23 SDK - Compilers
PPU and SPU are different ISAs, so they need different sets of compilers
PPU GCC SPU GCC G++ G++ GFORTRAN XLC XLC XLC++ XLC++ XLF 32-bit 32-bit 64-bit OpenMP
assembler and linker are common to GNU and XL compilers
XL compilers require the GNU tool chain for cross-assembling and cross-linking for both the PPE and the SPE

24 SDK - Samples
/opt/ibm/cell-sdk/prototype
/lib, /samples, /workloads
FFT, matrix, overlays, FFT using ALF, MatMul..., game math, simple DMA, audio resample, tutorial samples, curves and surfaces..., software managed cache...

25 Compilation and Linking
SPU objects are embedded in PPU objects
SPU code and PPU code are linked into one executable
SDK provides standard makefile structure in /opt/ibm/cell-sdk/prototype
  make.env: compilation options
  make.footer: build rules (do not modify)
  make.header: definitions (do not modify)
  README_build_env.txt: makefile howto
to understand the build process, run the default makefile and see what it does

26 Compilation and Linking

27 Hello World - libspe2
spe_context_run() is a blocking call, so create one POSIX thread for each SPE thread
PPE
    #include <libspe2.h>
    #include <pthread.h>
    int main() {
        spe_context_create()
        pthread_create()
        pthread_join()
        spe_context_destroy()
    }
    void* spe_thread() {
        spe_image_open()
        spe_program_load()
        spe_context_run()
        pthread_exit()
    }
SPE
    int main() {
        //...
    }
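The thread-per-SPE pattern on this slide can be tried without Cell hardware by swapping the blocking call for a stub. In the sketch below, `fake_blocking_run` is a hypothetical stand-in for `spe_context_run()`, and `struct worker`, `run_workers`, and `MAX_WORKERS` are likewise invented for illustration; only the `pthread_create` and `pthread_join` calls are real API.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16

/* Hypothetical stand-in for the blocking spe_context_run() call:
   it "runs" to completion and returns an exit code. */
static int fake_blocking_run(int worker_id)
{
    return worker_id * 10;
}

struct worker {
    int id;
    int exit_code;
};

/* Thread body: the blocking call stalls only this thread. */
static void *worker_thread(void *arg)
{
    struct worker *w = arg;
    w->exit_code = fake_blocking_run(w->id);
    return NULL; /* same effect as pthread_exit(NULL) here */
}

/* Start one thread per worker, then wait for all of them.
   Returns 0 on success, -1 if any create or join failed. */
static int run_workers(struct worker *workers, int n)
{
    pthread_t threads[MAX_WORKERS];
    int started = 0, status = 0;

    assert(n >= 0 && n <= MAX_WORKERS);
    for (int i = 0; i < n; i++) {
        workers[i].id = i;
        workers[i].exit_code = -1;
        if (pthread_create(&threads[i], NULL, worker_thread,
                           &workers[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        if (pthread_join(threads[i], NULL) != 0)
            status = -1;
    return status;
}
```

Since each blocking call lives in its own thread, the main thread stays free to launch the remaining workers and then join them all, which is the reason the slide pairs `spe_context_run()` with one POSIX thread per SPE.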

28 SPU Context Switching
quiesce the SPE
harvest (reset) an SPE
save privileged and problem state to CSA
save low 16 KB of LS to CSA
load and start SPE context-save sequence
save GPRs and channel state to CSA
save LSCSA to CSA
save 240 KB of LS to CSA
load and start SPE context-restore sequence
copy LSCSA from CSA to LS
restore 240 KB of LS from CSA
restore GPRs and channel state from LSCSA
restore privileged state from CSA
restore remaining problem state from CSA
restore 16 KB of LS from CSA

29 CELL Programming Models / Environments
Gedae (commercial product)

30 CELL Resources
developerWorks CELL Resource Center
Barcelona Supercomputing Center, Computer Sciences, Linux on CELL
Power.org CELL Developer Corner
The CELL Project at IBM Research
CELL BE at IBM alphaWorks
GA Tech CELL Workshop
ICL CELL Summit
CellPerformance
CELL Broadband Engine Programming Handbook
CELL Broadband Engine Programming Tutorial
SPE Runtime Management Library
C/C++ Language Extensions for CELL Broadband Engine Architecture
SPU Assembly Language Specification
Synergistic Processor Unit Instruction Set Architecture
