Cell Processor and Playstation 3

Similar documents
CellSs Making it easier to program the Cell Broadband Engine processor

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Evaluating the Portability of UPC to the Cell Broadband Engine

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Introduction to Computing and Systems Architecture

Crypto On the Playstation 3

QDP++/Chroma on IBM PowerXCell 8i Processor

Introduction to CELL B.E. and GPU Programming. Agenda

high performance medical reconstruction using stream programming paradigms

Cell Broadband Engine. Spencer Dennis Nicholas Barlow

Parallel Computing: Parallel Architectures Jin, Hai

Memory Architectures. Week 2, Lecture 1. Copyright 2009 by W. Feng. Based on material from Matthew Sottile.

How to Write Fast Code , spring th Lecture, Mar. 31 st

Cell Programming Maciej Cytowski (ICM) PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany

Spring 2011 Prof. Hyesoon Kim

Cell SDK and Best Practices

Experts in Application Acceleration Synective Labs AB

OpenMP on the IBM Cell BE

Optimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP

Multi - FFT Vectorization for the Cell Multicore Processor

Multicore Programming Case Studies: Cell BE and NVIDIA Tesla Meeting on Parallel Routine Optimization and Applications

Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications

Interconnection of Clusters of Various Architectures in Grid Systems

Technology Trends Presentation For Power Symposium

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura

QDP++ on Cell BE WEI WANG. June 8, 2009

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM

Iuliana Bacivarov, Wolfgang Haid, Kai Huang, Lars Schor, and Lothar Thiele

Interval arithmetic on the Cell processor

FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine. David A. Bader, Virat Agarwal

Software Development Kit for Multicore Acceleration Version 3.0

Parallel Exact Inference on the Cell Broadband Engine Processor

Acceleration of Correlation Matrix on Heterogeneous Multi-Core CELL-BE Platform

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

Amir Khorsandi Spring 2012

Compilation for Heterogeneous Platforms

All About the Cell Processor

Massively Parallel Architectures

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

PARALLEL VIDEO PROCESSING PERFORMANCE EVALUATION ON THE IBM CELL BROADBAND ENGINE PROCESSOR

Optimization of FEM solver for heterogeneous multicore processor Cell. Noriyuki Kushida 1

Neil Costigan School of Computing, Dublin City University PhD student / 2 nd year of research.

PS3 programming basics. Week 1. SIMD programming on PPE Materials are adapted from the textbook

High Performance Computing. University questions with solution

A Transport Kernel on the Cell Broadband Engine

The Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE

OpenMP on the IBM Cell BE

Cell Broadband Engine Overview

Concurrent Programming with the Cell Processor. Dietmar Kühl Bloomberg L.P.

Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU

The University of Texas at Austin

Applications on emerging paradigms in parallel computing

CELL CULTURE. Sony Computer Entertainment, Application development for the Cell processor. Programming. Developing for the Cell. Programming the Cell

A Brief View of the Cell Broadband Engine

Bruno Pereira Evangelista

Intel Performance Libraries

A parallel patch based algorithm for CT image denoising on the Cell Broadband Engine

High-Performance Modular Multiplication on the Cell Broadband Engine

Cell Broadband Engine Architecture. Version 1.0

Cell Broadband Engine Processor: Motivation, Architecture,Programming

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

Cellular Planets: Optimizing Planetary Simulations for the Cell Processor

Analysis of a Computational Biology Simulation Technique on Emerging Processing Architectures

Xbox 360 high-level architecture

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County

Revisiting Parallelism

IBM. Software Development Kit for Multicore Acceleration, Version 3.0. SPU Timer Library Programmer s Guide and API Reference

LU Decomposition On Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors

IBM Research Report. Optimizing the Use of Static Buffers for DMA on a CELL Chip

Trends in HPC (hardware complexity and software challenges)

Towards Efficient Video Compression Using Scalable Vector Graphics on the Cell Broadband Engine

Porting an MPEG-2 Decoder to the Cell Architecture

IBM Research Report. SPU Based Network Module for Software Radio System on Cell Multicore Platform

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

MODEL-BASED SOFTWARE DESIGN TOOLS FOR THE CELL PROCESSOR. Nicholas Stephen Lowell. Thesis. Submitted to the Faculty of the

Optimizing Large Scale Chemical Transport Models for Multicore Platforms

Xbox 360 Architecture. Lennard Streat Samuel Echefu

CSC573: TSHA Introduction to Accelerators

CONSOLE ARCHITECTURE

Programming for Performance on the Cell BE processor & Experiences at SSSU. Sri Sathya Sai University

High Performance Computing: Blue-Gene and Road Runner. Ravi Patel

IBM Software Development Kit for Multicore Acceleration v3.0 Basic Linear Algebra Subprograms Programmer s Guide and API Reference Version 1.

Distributed Operation Layer Integrated SW Design Flow for Mapping Streaming Applications to MPSoC

INF5063: Programming heterogeneous multi-core processors Introduction

Implementation of a backprojection algorithm on CELL

Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors

NUMERICAL PARALLEL COMPUTING

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

Cell Programming Tips & Techniques

Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands MSc THESIS

IBM Software Development Kit for Multicore Acceleration v3.0 Basic Linear Algebra Subprograms Programmer s Guide and API Reference

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

The PlayStation 3 for High Performance Scientific Computing. Kurzak, Jakub and Buttari, Alfredo and Luszczek, Piotr and Dongarra, Jack

Scheduling on Asymmetric Parallel Architectures

Halfway! Sequoia. A Point of View. Sequoia. First half of the course is over. Now start the second half. CS315B Lecture 9

Transcription:

Cell Processor and Playstation 3 Guillem Borrell i Nogueras February 24, 2009

Cell systems Bad news More bad news Good news Q&A

IBM Blades QS21 Cell BE based. 8 SPE 460 Gflops Float 20 GFLops Double QS22 PowerXCell 8i based 8 SPE* 460 GFlops Float 200 GFlops Double Some of them already installed in BSC

IBM Blades QS21 Cell BE based. 8 SPE 460 Gflops Float 20 GFLops Double QS22 PowerXCell 8i based 8 SPE* 460 GFlops Float 200 GFlops Double Some of them already installed in BSC No SPU Double precision improvements expected from IBM

Playstation 3 Cell BE based. 6 SPE 460 Gflops Float 20 GFLops Double

Playstation 3 Cell BE based. 6 SPE 460 Gflops Float 20 GFLops Double 256 MB RAM

IBM Power 7 8 PowerPC cores 4 threads per core (32 Threads!)? SPE

IBM Power 7 8 PowerPC cores 4 threads per core (32 Threads!)? SPE 1 TFlop on a chip

Cell Broadband Engine

PPE PowerPC Processor Element PPU PowerPC Processor Unit EIB Element Interconnect Bus SPE Synergistic Processor Element SPU Synergistic Processor Unit MFC Memory Flow Controller DMA Direct Memory Access SIMD Single Instruction Multiple Data

How does it compute? PPU starts a program PPU loads an SPU context on a thread SPU acquires the thread SPU executes context SPU ends the task and returns control to PPU

How does it compute? PPU starts a program PPU loads an SPU context on a thread SPU acquires the thread SPU executes context SPU ends the task and returns control to PPU I have said nothing about the data

Overview

Dumb vector unit General purpose vector unit Designed to run compiled code. A context is a program. Altivec unit on a box IBM called that VMX

Dumb vector unit General purpose vector unit Designed to run compiled code. A context is a program. Altivec unit on a box IBM called that VMX LS is regiser based No type distinction Data should be aligned by hand

Dumb vector unit General purpose vector unit Designed to run compiled code. A context is a program. Altivec unit on a box IBM called that VMX LS is regiser based No type distinction Data should be aligned by hand MFC is a DMA controller Data moved with DMA primitives. No data scheduling No data implicit copying.

SIMD Programming SPUs are programmed using SIMD primitives Like a vector unit

SIMD Programming SPUs are programmed using SIMD primitives Like a vector unit Almost coding in assembly Access assembly instructions via libspu2 Add to that DMA instructions

SIMD Programming SPUs are programmed using SIMD primitives Like a vector unit Almost coding in assembly Access assembly instructions via libspu2 Add to that DMA instructions That can take us ages

PPU and SPU code PPE have L1 and L2 cache. SPE have LS (register based)

PPU and SPU code PPE have L1 and L2 cache. SPE have LS (register based) Their ssembly has nothing to do They are compiled separately. PPU code cannot be reused.

And now the good news.

PPU code libraries BLAS (Basic Linear Algebra Subroutines) LAPACK (Linear Algebra Package) FFTW (The Fastest Fourier Transform in the West) C and Fortran interfaces Fortran interface is not complete

PPU code libraries BLAS (Basic Linear Algebra Subroutines) LAPACK (Linear Algebra Package) FFTW (The Fastest Fourier Transform in the West) C and Fortran interfaces Fortran interface is not complete Almost all we need is in Cell SDK!

Thin ice PPU code means no SPU control. Data must be aligned too using posix_memalign. If SPU control is needed PPU code cannot be used at all Tells us what we can or cannot do BSC has been using those for about 2 years.

Optimizing compilers Cell Superscalar Alpha state OpenMP-like pragmas for SIMD BSC Free Software XL compilers for Multicore Acceleration Alpha state OpenMP support for SIMD MASS (Mathematical Acceleration Subsystem) Worth the money

PPU code is possible Conclusions

Conclusions PPU code is possible SPU code is not possible for us 2 options: Cell SDK for Multicore Acceleration Optimizing Compiler (WAIT)

Conclusions PPU code is possible SPU code is not possible for us 2 options: Cell SDK for Multicore Acceleration Optimizing Compiler (WAIT) Get a good C book.

Q&A