Cell Broadband Engine Processor: Motivation, Architecture, Programming

H. Peter Hofstee, Ph.D.
Cell Chief Scientist and Chief Architect, Cell Synergistic Processor
IBM Systems and Technology Group
SCEI/Sony Toshiba IBM Design Center, Austin, Texas

2 Acknowledgements
- Cell Broadband Engine ("Cell") is the result of a deep partnership between SCEI/Sony, Toshiba, and IBM.
- Cell represents the work of more than 400 people starting in 2001 and a design investment of about $400M.

3 Agenda
- Motivation
- Architecture
- Implementation
- Programming
- Applications

4 Motivation

5 Motivation: Cell Goals
- Outstanding performance, especially on game/multimedia applications.
  Challenges: Power Wall, Frequency Wall, Memory Wall.
- Real-time responsiveness to the user and the network.
  Challenges: real-time in an SMP environment, security.
- Applicable to a wide range of platforms.
  Challenge: maintain programmability while increasing performance.
- Support an introduction in 2005/6.
  Challenge: structure innovation such that a 5-year schedule can be met.

6 Power Wall: Module Heat Flux Trend
[Figure: module heat flux (watts/cm²) versus year of announcement for vacuum, bipolar, and CMOS machines, with data points ranging from the IBM 360, IBM 3090, and IBM ES9000 to the Pentium 4 and Prescott; the start of water cooling is marked. Source: R. Schmidt, IBM.]

7 CMOS Devices Hitting a Scaling Wall
- Power components:
  - Active power
  - Passive power: gate leakage and sub-threshold (source-drain) leakage
- Further improvements require structure/materials changes (next slide).
[Figure: power density (W/cm²) versus gate length (microns) for active and passive power, with the air-cooling limit marked.]

8 Better Performance Without Scaling
- New materials and structures are critical for continuing the CMOS technology density and performance path while mitigating power dissipation.

9 Computing Paradigm Shift
Today:
- Single-thread performance is hitting limits: architecture and process technology are saturated, and only small percentage gains are expected to remain.
But:
- There are signs of a paradigm shift to application-specific system customization, with large multiple gains for specific applications:
  - Cell: 50x on the Terrain Rendering Engine (TRE), 120x on FFT
  - DataPower XML acceleration
  - Many examples in embedded markets
Future: greater performance demands
- Immersive interaction: 3D, real-time, gaming-inspired applications
- Rich media, data-intensive content
- Sensory computing
- New network tier: autonomous agents performing intelligent analysis on streaming data (e.g., A&D: battlefield coordination)
[Chart: single-thread performance (SPECint) over time; the historical trend of 45% compound growth rate slows dramatically.]

10 Solutions
- Memory wall: more, slower threads; asynchronous loads
- Efficiency wall: more, slower threads; specialized function
- Power wall:
  - Reduce transistor power (limited by operating voltage, oxide thickness scaling, and channel length)
  - Reduce switching per function
Conclusion: INCREASE CONCURRENCY, INCREASE SPECIALIZATION

11 Architecture & Implementation

12 Cell Concept
- Compatibility with 64-bit Power Architecture
  - Builds on and leverages IBM investment and community
- Increased efficiency and performance
  - Non-homogeneous coherent chip multiprocessor allows an attack on the Frequency Wall
  - Streaming DMA architecture attacks the Memory Wall
  - High design frequency at low operating voltage attacks the Power Wall
  - Highly optimized implementation
- Interface between the user and the networked world
  - Flexibility and security
  - Multi-OS support, including RTOS/non-RTOS
  - Architectural extensions for real-time management

13 Cell Architecture is 64-bit Power Architecture
[Diagram: two Power ISA cores, each with an MMU/BIU, on a coherent bus with memory and I/O translation. Annotation: including coherence/memory, compatible with 32/64-bit Power Architecture applications and OSs.]

14 Cell Architecture is 64-bit Power Architecture Plus Memory Flow Control (MFC)
[Diagram: the Power ISA cores and their MMU/BIUs are extended with +RMT, and the coherent bus with +RAG; MFC units with MMU/DMA (+RMT) and aliased Local Store memories (+RMT) attach to the coherent bus alongside memory and I/O translation.]

15 Cell Architecture is 64-bit Power Architecture + MFC Plus Synergistic Processors
[Diagram: as on the previous slide, with a Synergistic Processor ISA core added next to each MFC (MMU/DMA +RMT) and its aliased Local Store memory (+RMT).]

16 Coherent Offload Model
- DMA into and out of the Local Store is equivalent to Power core loads and stores
  - Governed by Power Architecture page and segment tables for translation and protection
  - Shared memory model
  - Power Architecture-compatible addressing
  - MMIO capabilities for SPEs
  - Local Store is mapped (aliased), allowing LS-to-LS DMA transfers
  - DMA equivalents of locking loads and stores
- OS management/virtualization of SPEs
  - Pre-emptive context switch is supported (but not efficient)

17 SPE Highlights
[Die photo: SPE floorplan with labeled units SFP, FXU EVN, FXU ODD, FWD, GPR, DP, CONTROL, CHANNEL, DMA, SMM, ATO, RTB, BEB, SBI, and four LS arrays; 14.5 mm² in 90 nm SOI.]
- RISC-like organization
  - 32-bit fixed-length instructions
  - Clean design: unified register file
- User-mode architecture
  - No translation/protection within the SPU
  - DMA has full Power Architecture protection/translation
- VMX-like SIMD dataflow
  - Broad set of operations (8-, 16-, and 32-bit elements)
  - Graphics SP float
  - IEEE DP float
- Unified register file
  - 128 entries x 128 bits
- 256 KB Local Store
  - Combined instructions and data

18 SPE Block Diagram
[Diagram: floating-point, fixed-point, permute, load-store, branch, and channel units; result forwarding and staging; register file; instruction issue unit / instruction line buffer; 256 kB single-port SRAM Local Store with 128 B read and 128 B write paths; DMA unit connecting to the on-chip coherent bus. Datapath widths shown: 8, 16, 64, and 128 bytes/cycle.]

19 SPE Pipeline
Front end: IF1 IF2 IF3 IF4 IF5, IB1 IB2, ID1 ID2 ID3, IS1 IS2
Back end, by instruction type, following register file access (RF1 RF2):
- Branch instruction: RF1 RF2
- Permute instruction: EX1 EX2 EX3 EX4 WB
- Load/store instruction: EX1 EX2 EX3 EX4 EX5 EX6 WB
- Fixed-point instruction: EX1 EX2 WB
- Floating-point instruction: EX1 EX2 EX3 EX4 EX5 EX6 WB
Legend: IF = instruction fetch, IB = instruction buffer, ID = instruction decode, IS = instruction issue, RF = register file access, EX = execution, WB = write back.

21 Cell Processor Statistics
- 250M transistors
- 235 mm²
- Top frequency >4 GHz (lab conditions)
- Most efficient at ~1 V
- >200 GFLOPS (single precision)
- >20 GFLOPS (double precision)
- Up to 25.6 GB/s memory bandwidth
- Up to 70+ GB/s I/O bandwidth (practical ~50 GB/s)
- 100+ simultaneous bus transactions
- 16+8-entry DMA queue per SPE
[Chart: hardware performance measurement at 85 °C, Fmax (GHz, axis up to 4.5) versus supply voltage; first-pass hardware measured in the lab; nominal voltage = 1 V.]

22 Programming

23 Cell Prototype Software Environment
[Diagram: software stack with a programmer experience and an end-user experience layered over the development and execution environments. Components shown: code dev tools, samples, workloads, demos; debug tools, performance tools, development tools stack, miscellaneous tools; SPE management lib, application libs; Linux PPC64 with Cell extensions; verification hypervisor; hardware or system-level simulator. Standards: language extensions, ABI.]

24 Operating System Runtime Strategy
- Heterogeneous multi-threading model
  - PPE threads, SPE threads
  - SPE DMA effective-address (EA) space = PPE process EA space, or an SPE private EA space
- OS supports creating/destroying SPE tasks
  - Atomic update primitives used for mutexes
- SPE context fully managed
  - Context save/restore for debug
  - Virtualization mode (indirect access)
  - Direct access mode (real-time)
- OS assignment of SPE threads to SPEs
  - Programmer directed using an affinity mask
- SPE compilers use OS runtime services
[Diagram: application source and libraries compile to PPE and SPE object files; a Cell-aware OS (Linux) runs existing PPE tasks/threads on the physical PPE (MT1, MT2) and new SPE tasks/threads through an SPE virtualization/scheduling layer (m->n SPE threads) onto eight physical SPEs.]

25 Programming Models: 1) Application-Specific Accelerators
- Acceleration provided by O/S services
- Application independent of accelerators
- Platform fixed
[Diagram: Application Specific Acceleration Model. A PowerPC application on the PPE calls an O/S service (e.g., mpeg_encode()) in a BE-aware OS (Linux), passing a parameter area in system memory. SPC-accelerated subsystems (eight SPUs, each with Local Store, MFC, and AUC) provide graphics offload (OpenGL), data encryption and decryption, real-time MPEG encoding and decoding, and compression/decompression.]

26 Programming Models: 2) Function Offload
- SPE function provided by libraries
- Predetermined functions
- Application calls standard libraries
- Single-source compilation
- SPE working set fits in Local Store
- O/S handles SPE allocation
[Diagram: two arrangements of SPUs (each with Local Store and MFC) working for the Power Processor (PPE) and system memory: a multi-stage pipeline and parallel stages.]

27 Programming Models: 3) Computational Acceleration
- User-created RPC libraries
- User acceleration routines
- User compiles SPE code
- Local data: data and parameters passed in call
- Global data: data and parameters passed in call; SPE code manages global data
[Diagram: in the local-data case, the PPE puts text, static data, and parameters into the SPU's Local Store, the SPE executes, and the PPE gets the results. In the global-data case, the PPE puts initial text, static data, and parameters; the SPE independently stages text and intermediate data transfers while executing, and puts the results back to system memory.]

28 Single-Source Approach to Programming Cell
- Single-source compiler
  - Auto-parallelization (treat the target Cell as an SMP)
  - Auto-SIMDization (SIMD vectorization)
  - Compiler management of the Local Store as a 2nd-level register file / software-managed cache (I&D): the most Cell-unique piece
- Optimization
  - OpenMP pragmas
  - Vector.org SIMD intrinsics
  - Data/code partitioning
  - Streaming / pre-specifying code/data use
- Prototype single-source compiler developed in IBM Research

29 Applications

30 Cell BE Performance Characteristics
VERY GOOD:
- Computationally intensive code (loops can be unrolled)
  - An order of magnitude more flops
- Scatter-gather type problems (e.g., FFT, ray casting, sparse matrices(?))
  - Almost two orders of magnitude more performance than a typical PC processor
NOT OPTIMIZED FOR:
- Rapid context switch
  - The LS is part of the context; a switch takes about 30 µs
  - Run to completion is preferred
  - Cooperative switching works well
  - Pre-emptive switching is possible
- Load-compare-add-branch (TPC-C, gcc-type codes)
  - The 6-cycle load hurts
  - Still 8 (10) threads on a single chip

31 Terrain Rendering Engine

32 Planned Usage of Cell
- Sony
  - PlayStation: … million units/year
  - Current Sony installed base: 190 million units (PS1 & PS2)
  - Cell Development Center reporting directly to the CEO
- Toshiba
  - Cell on a PC-type form factor card (October 2005)
- IBM
  - White box form factor
  - Reference system for Cell
  - Engineering and Technology Services: custom designs and applications utilizing Cell
- Mercury Computer
  - Cell-based systems

33 User Interaction Drives Innovation in Computing
[Chart: level of interaction versus time, rising through successive eras: mainframe batch (punch cards); mainframe multitasking (green screen/teletype); minicomputer (WYSIWYG word processing); stand-alone PC (Windows, spreadsheet); client/server (Internet, WWW); immersive interaction (online gaming). Source: J.A. Kahle.]

34 Characteristics of the Latest Transition in User Interaction
- Windows → Immersive, 3D interactivity
- Click and wait → Real-time
- Client-centric → Distributed
- User data accessible from client only → User data accessible everywhere
- Device-centric → Device-agnostic
- Connected → Collaborative
- Wired, sporadic → Wireless, always-on
- …/newsgroups → Text messaging/blogs

35 Summary & Conclusions

36 Summary
- Cell ushers in a new era of leading-edge processors optimized for digital media and entertainment.
- The desire for realism is driving a convergence between supercomputing and entertainment.
- Cell offers new levels of performance and power efficiency beyond what PC processors achieve.
- Responsiveness to the human user and to the network is a key driver for Cell.
- Cell will enable entirely new classes of applications, even beyond those we contemplate today.
TIME TO GET IN THE GAME!


More information

History. PowerPC based micro-architectures. PowerPC ISA. Introduction

History. PowerPC based micro-architectures. PowerPC ISA. Introduction PowerPC based micro-architectures Godfrey van der Linden Presentation for COMP9244 Software view of Processor Architectures 2006-05-25 History 1985 IBM started on AMERICA 1986 Development of RS/6000 1990

More information

Giorgio Buttazzo. Scuola Superiore Sant Anna, Pisa. The transition

Giorgio Buttazzo. Scuola Superiore Sant Anna, Pisa. The transition Giorgio Buttazzo Scuola Superiore Sant Anna, Pisa The transition On May 7 th, 2004, Intel, the world s largest chip maker, canceled the development of the Tejas processor, the successor of the Pentium4-style

More information

Computer Architecture

Computer Architecture Informatics 3 Computer Architecture Dr. Boris Grot and Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh General Information Instructors: Boris

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

POWER7: IBM's Next Generation Server Processor

POWER7: IBM's Next Generation Server Processor POWER7: IBM's Next Generation Server Processor Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002 Outline

More information

Bruno Pereira Evangelista

Bruno Pereira Evangelista Bruno Pereira Evangelista Introduction The multi-core era Playstation3 Architecture Cell Broadband Engine Processor Cell Architecture How games are using SPUs Cell SDK RSX Graphics Processor PSGL Cg COLLADA

More information

Introduction to the MMAGIX Multithreading Supercomputer

Introduction to the MMAGIX Multithreading Supercomputer Introduction to the MMAGIX Multithreading Supercomputer A supercomputer is defined as a computer that can run at over a billion instructions per second (BIPS) sustained while executing over a billion floating

More information

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 The Use Of Virtual Platforms In MP-SoC Design Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 1 MPSoC Is MP SoC design happening? Why? Consumer Electronics Complexity Cost of ASIC Increased SW Content

More information

Open Innovation with Power8

Open Innovation with Power8 2011 IBM Power Systems Technical University October 10-14 Fontainebleau Miami Beach Miami, FL IBM Open Innovation with Power8 Jeffrey Stuecheli Power Processor Development Copyright IBM Corporation 2013

More information

Concurrent High Performance Processor design: From Logic to PD in Parallel

Concurrent High Performance Processor design: From Logic to PD in Parallel IBM Systems Group Concurrent High Performance design: From Logic to PD in Parallel Leon Stok, VP EDA, IBM Systems Group Mainframes process 30 billion business transactions per day The mainframe is everywhere,

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

The Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE

The Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE The Pennsylvania State University The Graduate School College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE A Thesis in Electrical Engineering by Srijith Rajamohan 2009

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Power 7. Dan Christiani Kyle Wieschowski

Power 7. Dan Christiani Kyle Wieschowski Power 7 Dan Christiani Kyle Wieschowski History 1980-2000 1980 RISC Prototype 1990 POWER1 (Performance Optimization With Enhanced RISC) (1 um) 1993 IBM launches 66MHz POWER2 (.35 um) 1997 POWER2 Super

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

Industry Collaboration and Innovation

Industry Collaboration and Innovation Industry Collaboration and Innovation OpenCAPI Topics Industry Background Technology Overview Design Enablement OpenCAPI Consortium Industry Landscape Key changes occurring in our industry Historical microprocessor

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

POWER3: Next Generation 64-bit PowerPC Processor Design

POWER3: Next Generation 64-bit PowerPC Processor Design POWER3: Next Generation 64-bit PowerPC Processor Design Authors Mark Papermaster, Robert Dinkjian, Michael Mayfield, Peter Lenk, Bill Ciarfella, Frank O Connell, Raymond DuPont High End Processor Design,

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Influence of Technology Directions on System Architecture. Dr. Randy Isaac VP of Science and Technology IBM Research Division September 10, 2001

Influence of Technology Directions on System Architecture. Dr. Randy Isaac VP of Science and Technology IBM Research Division September 10, 2001 Influence of Technology Directions on System Architecture Dr. Randy Isaac VP of Science and Technology IBM Research Division September 10, 2001 Moore's Law continues beyond conventional scaling Power becomes

More information

Trends in the Infrastructure of Computing

Trends in the Infrastructure of Computing Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much

More information

Multimedia in Mobile Phones. Architectures and Trends Lund

Multimedia in Mobile Phones. Architectures and Trends Lund Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

Spring 2010 Prof. Hyesoon Kim. Xbox 360 System Architecture, Anderews, Baker

Spring 2010 Prof. Hyesoon Kim. Xbox 360 System Architecture, Anderews, Baker Spring 2010 Prof. Hyesoon Kim Xbox 360 System Architecture, Anderews, Baker 3 CPU cores 4-way SIMD vector units 8-way 1MB L2 cache (3.2 GHz) 2 way SMT 48 unified shaders 3D graphics units 512-Mbyte DRAM

More information

Cell Programming Tips & Techniques

Cell Programming Tips & Techniques Cell Programming Tips & Techniques Course Code: L3T2H1-58 Cell Ecosystem Solutions Enablement 1 Class Objectives Things you will learn Key programming techniques to exploit cell hardware organization and

More information

PS3 programming basics. Week 1. SIMD programming on PPE Materials are adapted from the textbook

PS3 programming basics. Week 1. SIMD programming on PPE Materials are adapted from the textbook PS3 programming basics Week 1. SIMD programming on PPE Materials are adapted from the textbook Overview of the Cell Architecture XIO: Rambus Extreme Data Rate (XDR) I/O (XIO) memory channels The PowerPC

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

MIT OpenCourseWare Multicore Programming Primer, January (IAP) Please use the following citation format:

MIT OpenCourseWare Multicore Programming Primer, January (IAP) Please use the following citation format: MIT OpenCourseWare http://ocw.mit.edu 6.189 Multicore Programming Primer, January (IAP) 2007 Please use the following citation format: Michael Perrone, 6.189 Multicore Programming Primer, January (IAP)

More information

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura Multicore Challenge in Vector Pascal P Cockshott, Y Gdura N-body Problem Part 1 (Performance on Intel Nehalem ) Introduction Data Structures (1D and 2D layouts) Performance of single thread code Performance

More information

Evaluating the Portability of UPC to the Cell Broadband Engine

Evaluating the Portability of UPC to the Cell Broadband Engine Evaluating the Portability of UPC to the Cell Broadband Engine Dipl. Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS Outline Introduction UPC Cell UPC on Cell Mapping Compiler and

More information

Cell BE enabling density computing for data rich environments

Cell BE enabling density computing for data rich environments Cell BE enabling density computing for data rich environments Michael Gschwind Bruce D Amora Alexandre Eichenberger Cell Broadband Engine - enabling density computing for data-rich environments Cell History

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors:!

More information