IBM Cell Processor. Gilbert Hendry, Mark Kretschmann

Outline
- Architectural components
- Architectural security
- Programming models
- Compiler
- Applications
- Performance
- Power and cost
- Conclusion

Cell Architecture: System on a Chip
- Heterogeneous chip multiprocessor: a 64-bit Power Architecture core with cooperative offload processors, using direct memory access and communication/synchronization mechanisms
- 64-bit Power Processing Element (PPE) control core
- 8 Synergistic Processing Elements (SPEs)
- Single-instruction, multiple-data (SIMD) architecture, supported by both the vector media extensions on the PPE and the instruction set of the SPEs (suited to gaming/media and scientific applications)
- High-bandwidth on-chip Element Interconnect Bus (EIB)

PPE
- Main control unit for the entire Cell
- 32KB L1 instruction and data caches, 512KB unified L2 cache
- Dual-threaded, static dual issue
- Composed of three main units:
  - Instruction Unit (IU): fetch, decode, branch, issue, and completion
  - Fixed-Point Execution Unit: fixed-point and load/store instructions
  - Vector Scalar Unit: vector and floating-point instructions

Cell Architecture: Synergistic Processing Element
- SIMD instruction-set architecture, optimized for power and performance on compute-intensive applications
- Local store (LS) memory for instructions and data
  - An additional level of the memory hierarchy, and the largest component of the SPE
  - Single-port SRAM, readable and writable through both narrow 128-bit and wide 128-byte ports
- Data and instructions are transferred between the local store and system memory by asynchronous DMA commands, controlled by the memory flow controller on the SPE
  - Support for 16 simultaneous DMA commands
  - Programmable DMA options: SPE instructions insert DMA commands into queues; multiple DMA commands can be pooled into a single DMA-list command; other SPE processors can insert commands into the DMA queue
- 128-entry unified register file for improved memory bandwidth and optimal power efficiency and performance
- Dynamically configurable to support content protection and privacy

Cell Architecture: Synergistic Memory Flow Controller
- Provides integration between communicating SPEs within the system architecture
- Uses the Power protection model and address translation, incorporating the SPEs into the system's protection and address-translation requirements
- Exposes data transfer and synchronization capabilities to the system Element Interconnect Bus (EIB)
- Supports the communication interface between the PPE and SPEs, serving as a high-performance data transfer engine between local store memories and Cell system memory
- Parallel data transfer and processing are possible because data transfers are offloaded from the compute elements onto dedicated transfer engines
- Asynchronous high-performance data transfer, combined with parallel data processing on the PPE and SPEs, removes the need to explicitly interleave data processing and transfer at the program level

Cell Architecture: Element Interconnect Bus (EIB)
- High-bandwidth on-chip interconnection bus, enabling on-chip interactions between processor elements
- Coherent bus allowing a single address space to be shared by the PPE and SPEs for efficient communication and ease of programming
- Peak bandwidth of 204.8GB/s
- 4 x 16B-wide channels

Cell Architecture: Layout

Cell Architecture

Memory & I/O
- XDR RAMBUS serves as the bus between the EIB and main memory
  - 2 x 32-bit channels with 3.2Gbit/s data transfer per pin: 25.6GB/s peak
  - Potentially scalable to 6.4Gbit/s per pin, for a peak rate of 51.2GB/s
  - FlexPhase technology: signals can arrive at different times, reducing the need for exact wire lengths
  - Configurable to attach to varying amounts of main memory
  - The Memory Interface Controller (MIC) controls the flow of data between the EIB and the XDR RAMBUS
- RAMBUS RRAC FlexIO used for I/O
  - 7 transmit and 5 receive 8-bit ports: 35GB/s transmit peak, 25GB/s receive peak
  - Serves as a high-speed interconnect between SPEs of different Cells when multiple Cells are used
  - 4 Cells can be seamlessly connected with the current architecture, and an additional 4 with an added switch

Security
- The SPE and its local store (LS) play an important role in the Cell security architecture
- Code and data are always preloaded into the LS before an application executes
- The DMA engine transfers data between the LS and all other system resources (e.g. main memory, I/O, other SPEs' local stores)
- Secure Processing Vault
  - The application is isolated within the SPE from all other execution threads
  - The SPE runs in a special mode in which it is disengaged from the EIB, and thus from the entire system
  - The majority of the LS is loaded for the SPE's exclusive use: no external reads or writes are allowed while in the isolation state; the only external action available is a cancel-application command; all application data is erased before reintegration with the system
  - A portion of the LS is left free solely for communication with other SPEs; the application currently running on the SPE is responsible for safe data transfer between the free LS space and the other SPEs

Security
- Runtime Secure Boot
  - Secure boot: application code goes through cryptographic authentication to ensure it has not been modified; this alone does not guard against modification after the application's initial startup
  - Applications are allowed to secure-boot an arbitrary number of times during runtime, renewing their trustworthiness as often as needed
  - Every time an SPE enters isolation mode, a runtime secure boot is performed first, providing another level of application security
- Hardware Root of Secrecy
  - The root key used to decrypt all application authentication keys is embedded in the system hardware
  - The root key is also integrated with the SPEs' isolation mode: the hardware decryption facility fetches encrypted data into the isolated SPE and decrypts it using the hardware root key
  - Only authenticated applications can access the root key, and root-key decryption can happen only within an isolated SPE

Compiler
- Octopiler: a single-source parallelizing, simdizing compiler
  - Generates multiple binaries, targeting both the PPE and SPE elements, from one source file
  - Lets programmers develop applications for a parallel architecture with the illusion of a single shared memory image
- Compiler-controlled software cache, memory hierarchy optimizations, and code partitioning techniques
  - Assume all data resides in a shared system memory
  - Enable automatic transfer of data and code
  - Preserve coherence across all local SPE memories and system memory

Compiler: Memory Abstraction
- Software Caching
  - An SPE can only access its own LS; DMA commands are needed to transfer data for any read or write to shared system memory or to another SPE's LS
  - Data transferred by a single DMA command is reused across multiple accesses where spatial or temporal locality exists
  - Runtime libraries manage software caching of system-memory data within an SPE's LS; the compiler identifies system-memory accesses and replaces them with calls to the software cache library
- Static Buffering
  - Reduces the performance impact of sequential DMA transfers between system memory and SPE local stores
  - Large portions of data arrays (e.g. in loops and array computations) are prefetched into a static buffer using a single DMA command
- Code Partitioning
  - An existing facility partitions code into multiple overlays when a code section is too large to fit into an SPE's LS; overlays are automatically transferred and swapped into the SPE LS when required

Compiler: Performance
- Performance evaluation of the parallelizing compiler technology
- Figure 1 shows the speedup when code is automatically simdized for the SPEs: an average speedup of 9.9
- Figure 2 shows the speedup obtained when the compiler automatically distributes work among the PPE and SPE processors

Cell Programming Models
- To obtain optimal performance from the Cell, both the SPEs' local stores and the Cell's SIMD dataflow must be taken into account by the programmer, and ultimately by the programming model
- Function Offload Model
  - SPEs are used as accelerators for critical functions designated by the PPE: the original code is optimized and recompiled for the SPE instruction set, and designated program functions are offloaded to the SPEs for execution; when the PPE calls a designated function, it is automatically invoked on an SPE
  - The programmer statically identifies which functions execute on the PPE and which are offloaded to the SPEs, with separate source files and compilation for PPE and SPE
  - A prototype single-source approach using compiler directives as offload hints has been developed for the Cell; the challenge is to have the compiler automatically determine which functions execute on the PPE and which on the SPEs
  - This allows applications to remain compatible with the Power Architecture by generating both PPE and SPE binaries: Cell systems load the SPE-accelerated binaries, while non-Cell systems load the PPE binaries

Cell Programming Models
- Device Extension Model
  - A special type of function offload model in which SPEs act as interfaces between the PPE and external devices
  - Uses memory-mapped, SPE-accessible registers as a command/response FIFO between the PPE and SPEs
  - Device memory can be mapped by DMA, supporting transfer granularity as small as a single byte
- Computational Acceleration Model
  - An SPE-centric model, offering greater integration of the SPEs into application execution and programming than the function offload model
  - Computationally intensive code runs on the SPEs rather than the PPE; the PPE acts as a control center and service engine for the SPE software
  - Work is partitioned among multiple SPEs operating in parallel, either manually by programmers or automatically by the compiler; this must include efficient scheduling of DMA operations for code and data transfers between the PPE and SPEs, using a shared-memory programming model or a supported message-passing model
  - Generally provides extensive acceleration of math-intensive functions without requiring a significant rewrite or recompilation

Cell Programming Models
- Streaming Model
  - Designed to exploit the support for message passing between the PPE and SPEs
  - Serial or parallel pipelines are constructed from the SPEs; each SPE applies a particular application kernel to the data stream it receives
  - The PPE acts as a stream controller, with the SPEs acting as data-stream processors
  - Extremely efficient when each SPE performs an equivalent amount of work
- Shared-Memory Multiprocessor Model
  - Uses the two different instruction sets of the PPE and SPE to execute applications, which greatly benefits performance when using a single instruction set for the whole program would be very inefficient
  - Since DMA operations are cache coherent, combining DMA commands with SPE load/store commands provides the illusion of a single shared address space: a conventional shared-memory load is replaced by a DMA from system memory to an SPE's LS, followed by a load from the LS to the SPE's register file; a conventional shared-memory store is replaced by a store from the SPE's register file to its LS, followed by a DMA from the LS to system memory
  - Effective addresses common to both the PPE and SPEs are used for all pseudo shared-memory load/store commands
  - A compiler or interpreter could manage all SPE local stores as local caches for instructions and data

Cell Programming Models
- Asymmetric Thread Runtime Model
  - Threads can be scheduled on both the PPE and the SPEs, interacting as in a conventional SMP
  - Similar to the thread or lightweight-task models of conventional operating systems, extended to processing units with different instruction sets, as in the PPE and SPEs
  - Tasks are scheduled on both the PPE and SPEs to optimize performance and resource utilization; the SPE's limitation of running only a single thread is thus hidden from the programmer

Applications
- Games (PS3): audio, video, physics

Applications
- Imaging: medical, rendering, feature extraction
- Mercury Blade, Turismo
- Televisions (Toshiba): MPEG decode

Applications
- Handheld devices (fewer SPEs?)
- Scientific and engineering computing: matrices! Chemistry, physics, aerospace, defense

Performance Single precision parallelized matrix multiplication

Performance Results of SPEC OpenMP 2001

Performance Cell vs. other processors for FFTs

Power
- IBM attributes the Cell's power efficiency to its heterogeneous architecture
- Relationship of speed to heat, as tested by IBM under a heavy workload:

  Voltage (V)       0.9   0.9   1.0   1.1   1.2   1.3
  Frequency (GHz)   2.0   3.0   3.8   4.0   4.4   5.0
  Power (W)           1     2     3     4     7    11
  Die Temp (C)       25    27    31    38    47    63

Cost?
- PlayStation 3: $500+?
- BladeCenter: ~$2,500 (an HP Itanium 2 system is ~$7,500)

Conclusions
- The new heterogeneous architecture seems well suited to media applications
- Areas with large amounts of data can also benefit