All About the Cell
H. Peter Hofstee, Ph.D.
IBM Systems and Technology Group
SCEI/Sony Toshiba IBM Design Center, Austin, Texas
Acknowledgements
Cell is the result of a deep partnership between SCEI/Sony, Toshiba, and IBM.
Cell represents the work of more than 400 people, starting in 2001.
More detailed papers on the Cell implementation and the SPE micro-architecture can be found in the ISSCC 2005 proceedings.
Agenda
Trends in Microprocessors and Systems
Cell Overview
Aspects of the Implementation
Power Efficient Architecture
Trends in Microprocessors and Systems
Performance over Time (game processors take the lead on media performance)
[Figure: single-precision Flops on a log scale from 1 to 1,000,000, plotted from 1993 to 2004 for game processors versus PC processors.]
System Trends toward Integration
[Figure: a conventional system with memory, accelerator, Northbridge, Southbridge, and IO collapsing into a next-generation integrated part with memory and IO attached directly.]
Implied loss of system configuration flexibility.
Must compensate with generality of the acceleration function to maintain market size.
Next-Generation Processors Address Programming Complexity and the Trend Toward Programmable Offload Engines, with a Simpler System Alternative
[Figure: a spectrum of designs, from hardwired function blocks (GPU/streaming graphics, network, security, media) attached to a 64b Power CPU with memory controller and NIC, through programmable ASICs, to Cell's Synergistic Processor Elements with configurable IO.]
Performance Limiters in Conventional Microprocessors
Memory Wall: latency-induced bandwidth limitations.
Power Wall: must improve efficiency and performance equally.
Frequency Wall: diminishing returns from deeper pipelines (can be negative if power is taken into account).
Cell Overview
Cell Goals
Outstanding performance, especially on game/multimedia applications.
Real-time responsiveness to the user and the network.
Applicable to a wide range of platforms.
Support introduction in systems in 2005.
Cell Concept
Compatibility with 64b Power Architecture: builds on and leverages IBM investment and community.
Increased efficiency and performance: non-homogeneous coherent chip multiprocessor; high design frequency, low operating voltage; streaming DMA architecture attacks the Memory Wall; highly optimized implementation.
Interface between user and networked world: image-rich information, virtual reality; flexibility and security.
Multi-OS support, including RTOS/non-RTOS: combines the real-time and non-real-time worlds.
Cell Chip Block Diagram
[Figure: eight SPEs, each pairing an SXU with its Local Store (LS) and SMF DMA engine, plus the PPE (PXU with L1 and L2 caches), all attached to a coherent bus carrying up to 96 bytes/cycle, which also connects the dual-XDR memory interface controller (MIC) and the FlexIO bus interface controller (BIC).]
What is a Synergistic Processor? (and why is it efficient?)
Local Store: a large second-level register file / private instruction store, instead of a cache.
Asynchronous transfer (DMA) to shared memory: a frontal attack on the Memory Wall.
Media unit turned into a unified (large) register file: 128 entries x 128 bits.
Media- and compute-optimized: one-context SIMD architecture.
[Figure: SPU floorplan showing the even and odd fixed-point pipes, single- and double-precision floating point, the GPR file, Local Store arrays, and the DMA, channel, and control logic.]
Coherent Offload Model
DMA into and out of Local Store is equivalent to Power core loads and stores: governed by Power Architecture page and segment tables for translation and protection.
Shared memory model with Power Architecture-compatible addressing.
MMIO capabilities for SPEs: Local Store is mapped (aliased), allowing LS-to-LS DMA transfers; DMA equivalents of locking loads and stores.
OS management/virtualization of SPEs: pre-emptive context switch is supported (but not efficient).
One Approach to Programming Cell: a Single-Source Compiler
Auto-parallelization (treat the target Cell as an SMP).
Auto-SIMD-ization (SIMD vectorization).
Compiler management of the Local Store as a second-level register file / software-managed cache (instructions and data): the most Cell-unique piece.
Optimization hooks: OpenMP pragmas, SIMD intrinsics, data/code partitioning, streaming / pre-specifying code and data use.
A prototype single-source compiler has been developed in IBM Research.
Cell Implementation Aspects
SPE Block Diagram
[Figure: the SPU's floating-point, fixed-point, permute, load-store, branch, and channel units share a result-forwarding/staging network and the register file, fed by the instruction issue unit / instruction line buffer. The 256 kB single-port SRAM Local Store has 128-byte read and 128-byte write ports, and the DMA unit connects to the on-chip coherent bus; datapath widths range from 8 to 128 bytes/cycle.]
SPE Pipeline
[Figure: front end of IF1-IF5 (instruction fetch), IB1-IB2 (instruction buffer), ID1-ID3 (decode), and IS1-IS2 (issue); back end of RF1-RF2 (register file access) followed by execution stages and write-back, with depth varying by instruction class: branches complete earliest, permute and load/store pipes run about four EX stages, and floating point runs EX1-EX6 before WB.]
PPE Block Diagram
[Figure: shared front end with L2 interface, pre-decode, fetch control, branch scan, and L1 instruction cache; the two SMT threads (A and B) alternate fetch and dispatch cycles through the SMT dispatch queue and microcode engine. The back end comprises the branch execution, fixed-point, and load/store units with the L1 data cache, plus VMX/FPU issue queues feeding the VMX and FPU arithmetic/logic and load/store pipes, with per-thread completion/flush logic.]
PPE Pipeline
[Figure: front end of instruction cache and buffer stages (IC1-IC4, IB1-IB2), branch prediction (BP1-BP4), microcode expansion (MC1-MC11), and decode/issue (ID1-ID3, IS1-IS3); back end with delay stages (DLY) aligning the branch and fixed-point pipes against register file access (RF1-RF2), execution stages reaching EX8 for load/store, and write-back. Legend: IC instruction cache, IB instruction buffer, BP branch prediction, MC microcode, ID instruction decode, IS instruction issue, DLY delay stage, RF register file access, EX execution, WB write back.]
Cell
~250M transistors, ~235 mm².
Top frequency >4 GHz.
9 cores, 10 threads.
>256 GFlops (SP) @ 4 GHz; >26 GFlops (DP) @ 4 GHz.
Up to 25.6 GB/s memory bandwidth; up to 75 GB/s I/O bandwidth.
~$400M (US) design investment.
First-Pass Hardware Measurement in the Lab (nominal voltage = 1 V)
[Figure: measured Fmax at 85 °C plotted against supply voltage from 0.9 V to 1.2 V, with frequency spanning roughly 3 GHz to 4.5 GHz.]
Cell Configurable I/O
Direct-attach XDR memory.
Two I/O interfaces: configurable number of bytes; coherent (BIF) or I/O (IOIF) protection.
[Figure: example configurations: a single Cell with XDR memory and two I/O ports (IOIF0, IOIF1); two Cells connected back-to-back over the coherent BIF; and four Cells joined through a switch on their BIF ports, each retaining its own XDR memory and IOIF.]
Power Efficient Architecture
Power Efficient Architecture and the BE
Non-homogeneous coherent multiprocessor: data-plane/control-plane specialization; more efficient than a homogeneous SMP.
Three-level model of memory: bandwidth without (inefficient) speculation.
High bandwidth, low power.
Power Efficient Architecture and the SPE
Power-efficient ISA allows simple control: single-mode architecture; no cache; branch hint; large unified register file; channel interface.
Efficient microarchitecture: single-port local store; extensive clock gating.
Efficient implementation: see the Cool Chips papers by O. Takahashi et al. and T. Asano et al.
Cell BE Example Application Areas
Cell excels at processing rich media content in the context of broad connectivity:
Digital content creation (games and movies)
Game playing and game serving
Distribution of dynamic, media-rich content
Imaging and image processing
Image analysis (e.g. video surveillance)
Next-generation physics-based visualization
Video conferencing
Streaming applications (codecs, etc.)
Physical simulation and science
Cell is an excellent match for any application that requires parallel processing, real-time processing, graphics content creation or rendering, pattern matching, or high-performance SIMD capabilities.
Cell Summary
Cell ushers in a new era of leading-edge processors optimized for digital media and entertainment.
Responsiveness to the human user and the network is a key driver for Cell.
New levels of performance and power efficiency, beyond what is achieved by PC processors.
Cell will enable entirely new classes of applications, even beyond those we contemplate today.