All About the Cell Processor

Similar documents
Cell Broadband Engine Processor: Motivation, Architecture,Programming

Spring 2011 Prof. Hyesoon Kim

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

Introduction to the Cell multiprocessor

Cell today and tomorrow

Hardware and Software Architectures for the CELL BROADBAND ENGINE processor

Cell Broadband Engine. Spencer Dennis Nicholas Barlow

Technology Trends Presentation For Power Symposium

Parallel Computing: Parallel Architectures Jin, Hai

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008

The University of Texas at Austin

Amir Khorsandi Spring 2012

Optimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Computer Architecture

OpenMP on the IBM Cell BE

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Power 7. Dan Christiani Kyle Wieschowski

1. PowerPC 970MP Overview

CellSs Making it easier to program the Cell Broadband Engine processor

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

Cell Broadband Engine Overview

Xbox 360 Architecture. Lennard Streat Samuel Echefu

XT Node Architecture

INF5063: Programming heterogeneous multi-core processors Introduction

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

Revisiting Parallelism

IBM's POWER5 Micro Processor Design and Methodology

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

Intel Architecture for Software Developers

Introduction to Computing and Systems Architecture

Introduction to CELL B.E. and GPU Programming. Agenda

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Next Generation Technology from Intel Intel Pentium 4 Processor

Portable Parallel Programming for Multicore Computing

Xbox 360 high-level architecture

Open Innovation with Power8

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Cell BE enabling density computing for data rich environments

Power Technology For a Smarter Future

PowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

CS 152 Computer Architecture and Engineering

What SMT can do for You. John Hague, IBM Consultant Oct 06

Cell Processor and Playstation 3

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

CS 152 Computer Architecture and Engineering

OpenMP on the IBM Cell BE

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

Advanced Processor Architecture

POWER3: Next Generation 64-bit PowerPC Processor Design

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Portland State University ECE 588/688. Cray-1 and Cray T3E

Software Development Kit for Multicore Acceleration Version 3.0

How to Write Fast Code , spring th Lecture, Mar. 31 st

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

The Processor: Instruction-Level Parallelism

GPU Fundamentals Jeff Larkin November 14, 2016

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems

Programming for Performance on the Cell BE processor & Experiences at SSSU. Sri Sathya Sai University

45-year CPU Evolution: 1 Law -2 Equations

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Mercury Computer Systems & The Cell Broadband Engine

Massively Parallel Architectures

Intel released new technology call P6P

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

GPUs and GPGPUs. Greg Blanton John T. Lubia

EC 513 Computer Architecture

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

A Brief View of the Cell Broadband Engine

WHY PARALLEL PROCESSING? (CE-401)

Processor (IV) - advanced ILP. Hwansoo Han

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Vector Engine Processor of SX-Aurora TSUBASA

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

CONSOLE ARCHITECTURE

SPARC64 X: Fujitsu s New Generation 16 core Processor for UNIX Server

Towards Efficient Video Compression Using Scalable Vector Graphics on the Cell Broadband Engine

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

COMPUTER ORGANIZATION AND DESI

Computer Architecture!

Toward a Memory-centric Architecture

History. PowerPC based micro-architectures. PowerPC ISA. Introduction

Convergence of Parallel Architecture

Advanced Instruction-Level Parallelism

Advanced Parallel Programming I

Memory Systems IRAM. Principle of IRAM

Cell Broadband Engine CMOS SOI 65 nm Hardware Initialization Guide

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Transcription:

All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas

Acknowledgements Cell is the result of a deep partnership between SCEI/Sony, Toshiba, and IBM Cell represents the work of more than 400 people starting in 2001 More detailed papers on the Cell implementation and the SPE micro-architecture can be found in the ISSCC 2005 proceedings 2

Agenda Trends in s and Systems Cell Overview Aspects of the Implementation Power Efficient Architecture 3

Trends in Microprocessors and Systems 4

Performance over Time (Game processors take the lead on media performance) Flops (SP) 1000000 100000 10000 1000 Game s PC s 100 10 1993 1995 1996 1998 2000 2002 2004 5

System Trends toward Integration Memory Accel Northbridge Memory Next-Gen IO Southbridge IO Implied loss of system configuration flexibility Must compensate with generality of acceleration function to maintain market size. 6

Next Generation s address Programming Complexity and Trend Towards Programmable Offload Engines with a Simpler System Alternative GPU Streaming Graphics 64b Power Mem. Contr. CPU NIC Security Media Hardwired Function CPU Network Security Media Programmable ASIC Synergistic... Synergistic Cell Config. IO 7

Performance Limiters in Conventional Microprocessors Memory Wall Latency induced bandwidth limitations Power Wall Must improve efficiency and performance equally Frequency Wall Diminishing returns from deeper pipelines (can be negative if power is taken into account) 8

Cell Overview 9

Cell Goals Outstanding performance, especially on game/multimedia applications. Real time responsiveness to the user and the network. Applicable to a wide range of platforms. Support introduction in systems in 2005. 10

Cell Concept Compatibility with 64b Power Architecture Builds on and leverages IBM investment and community Increased efficiency and performance Non Homogenous Coherent Chip Multiprocessor High design frequency, low operating voltage Streaming DMA architecture attacks Memory Wall Highly optimized implementation Interface between user and networked world Image rich information, virtual reality Flexibility and security Multi-OS support, including RTOS/non-RTOS Combine real-time and non-real time worlds 11

Cell Chip Block Diagram SPU SPE SXU SXU SXU SXU SXU SXU SXU SXU LS LS LS LS LS LS LS LS SMF SMF SMF SMF SMF SMF SMF SMF Coherent Bus (up to 96 Bytes/cycle) L2 PPE MIC BIC Dual XDR TM FlexIO TM L1 PXU 12

What is a Synergistic? (and why is it efficient?) Local Store is large 2 nd level register file / private instruction store instead of cache Asynchronous transfer (DMA) to shared memory Frontal attack on the Memory Wall Media Unit turned into a Unified (large) Register File 128 entry x 128 bit Media & Compute optimized One context SIMD architecture SFP FXU EVN FWD FXU ODD GPR DMA SBI CONTROL DP CHANNEL LS LS LS LS SMM ATO BEB RTB SPU SM 13

Coherent Offload Model DMA into and out of Local Store equivalent to Power core loads & stores Governed by Power Architecture page and segment tables for translation and protection Shared memory model Power architecture compatible addressing MMIO capabilities for SPEs Local Store is mapped (alias) allowing LS to LS DMA transfers DMA equivalents of locking loads & stores OS management/virtualization of SPEs Pre-emptive context switch is supported (but not efficient) 14

One approach to programming Cell Single Source Compiler Auto parallelization ( treat target Cell as an SMP ) Auto SIMD-ization ( SIMD-vectorization ) Compiler management of Local Store as 2 nd level register file / SW managed cache (I&D) Most Cell unique piece Optimization OpenMP pragmas Vector.org SIMD intrinsics Data/Code partitioning Streaming / pre-specifying code/data use Prototype Single Source Compiler Developed in IBM Research 15

Cell Implementation Aspects 16

SPE BLOCK DIAGRAM Floating-Point Unit Fixed-Point Unit Permute Unit Load-Store Unit Branch Unit Channel Unit Local Store (256kB) Single Port SRAM Result Forwarding and Staging Register File Instruction Issue Unit / Instruction Line Buffer 128B Read 128B Write On-Chip Coherent Bus DMA Unit 8 Byte/Cycle 16 Byte/Cycle 64 Byte/Cycle 128 Byte/Cycle 17

SPE PIPELINE FRONT END IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2 SPE PIPELINE BACK END Branch Instruction RF1 RF2 Permute Instruction EX1 EX2 EX3 EX4 Load/Store Instruction EX1 EX2 EX3 EX4 Fixed Point Instruction EX5 EX6 WB WB IF Instruction Fetch IB Instruction Buffer ID Instruction Decode IS Instruction Issue RF Register File Access EX Execution WB Write Back EX1 EX2 WB Floating Point Instruction EX1 EX2 EX3 EX4 EX5 EX6 WB 18

PPE BLOCK DIAGRAM 8 Pre-Decode L2 Interface Fetch Control Branch Scan 4 L1 Instruction Cache Thread A Thread B 4 SMT Dispatch (Queue) 2 1 Threads alternate fetch and dispatch cycles Microcode L1 Data Cache 1 Load/Store Unit 1 Fixed-Point Unit Completion/Flush 1 Branch Execution Unit VMX Load/Store/ Permute Decode Dependency Issue 2 VMX/FPU Issue (Queue) 2 1 1 1 1 VMX Arith./Logic Unit FPU Arith/Logic Unit VMX Completion Thread A Thread B Thread A FPU Load/Store FPU Completion 19

PPE PIPELINE FRONT END Microcode MC1 MC2 MC3 MC4... MC9 MC10 MC11 IC1 IC2 IC3 IC4 IB1 IB2 ID1 ID2 ID3 IS1 IS2 Instruction Cache and Buffer Instruction Decode and Issue IS3 BP1 BP2 BP3 BP4 Branch Prediction PPE PIPELINE BACK END Branch Instruction DLY DLY DLY RF2 EX1 EX2 Fixed Point Unit Instruction DLY DLY DLY RF1 RF2 EX1 EX2 EX3 EX4 Load/Store Instruction RF1 RF1 RF2 EX1 EX2 EX3 EX4 EX5 EX3 EX6 EX4 IBZ IC0 EX7 EX5 EX8 WB WB IC Instruction Cache IB Instruction Buffer BP Branch Prediction MC Microcode ID Instruction Decode IS Instruction Issue DLY Delay Stage RF Register File Access EX Execution WB Write Back 20

Cell ~250M transistors ~235mm2 Top frequency >4GHz 9 cores, 10 threads > 256 GFlops (SP) @4GHz > 26 GFlops (DP) @4GHz Up to 25.6GB/s memory B/W Up to 75 GB/s I/O B/W ~400M$(US) design investment 21

First pass hardware measurement in the Lab - Nominal Voltage = 1V Hardware Performance Measurement (85 C) 4.5 Fmax Frequency [GHz] 4 3.5 3 0.9 1 1.1 1.2 Supply Voltage 22

Cell Configurable I/O Direct Attach XDR XDR tm XDR tm XDR tm XDR tm Two I/O interfaces Configurable number of Bytes Coherent or I/O Protection IOIF CELL BIF CELL IOIF XDR tm XDR tm XDR tm XDR tm IOIF0 XDR tm XDR tm CELL IOIF1 IOIF IOIF CELL BIF CELL XDR tm XDR tm SW CELL BIF CELL XDR tm XDR tm IOIF IOIF 23

Power Efficient Architecture 24

Power Efficient Architecture and the BE Non-Homogeneous Coherent Multi- Data-plane/Control-plane specialization More efficient than homogeneous SMP 3-level model of Memory Bandwidth without (inefficient) speculation High-bandwidth.. Low power 25

Power Efficient Architecture and the SPE Power Efficient ISA allows Simple Control Single mode architecture No cache Branch hint Large unified register file Channel Interface Efficient Microarchitecture Single port local store Extensive clock gating Efficient implementation See Cool Chips paper by O. Takahashi et al. and T. Asano et al. 26

Cell BE Example Application Areas Cell excels at processing of rich media content in the context of broad connectivity Digital content creation (games and movies) Game playing and game serving Distribution of dynamic, media rich content Imaging and image processing Image analysis (e.g. video surveillance) Next-generation physics-based visualization Video conferencing Streaming applications (codecs etc.) Physical simulation & science Cell is an excellent match for any applications that require: Parallel processing Real time processing Graphics content creation or rendering Pattern matching High-performance SIMD capabilities 27

Cell Summary Cell ushers in a new era of leading edge processors optimized for digital media and entertainment Responsiveness to the human user and the network are key drivers for Cell New levels of performance and power efficiency beyond what is achieved by PC processors Cell will enable entirely new classes of applications, even beyond those we contemplate today 28