Cell Programming Tips & Techniques


Cell Programming Tips & Techniques. Course Code: L3T2H1-58. Cell Ecosystem Solutions Enablement.

Class Objectives — things you will learn: key programming techniques to exploit the Cell hardware organization, and language features for SPU SIMD.

Class Agenda
- Review relevant SPE features
- SPU programming tips:
  - Level of programming (assembler, intrinsics, auto-vectorization)
  - Overlap DMA with computation (double, multiple buffering)
  - Dual-issue rate (instruction scheduling)
  - Design for limited local store
  - Branch hints or elimination
  - Loop unrolling and pipelining
  - Integer multiplies (avoid 32-bit integer multiplies)
  - Shuffle-byte instructions for table look-ups
  - Avoid scalar code
  - Choose the right SIMD strategy
  - Load/store only by quadword
- SIMD programming tips

Review: Cell Architecture

Cell Processor

Course Agenda: Cell Blade Products — Cell Blade family of servers; Cell Blade architecture; Cell Blade overview; critical signals, link speed and bandwidth; power consumption; hardware components; blade and blade-center assembly; example of a Cell blade with maximum interconnection capability; options (InfiniBand).

References: Dan Brokenshire, Cell BE Programming Tips.

Trademarks: Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.

Key SPE Features

SPE Single-Ported Local Memory

SPU Programming Tips


Programming Levels on Cell BE — Trade-Off: Performance vs. Effort
- Expert level: assembler; highest performance, highest effort.
- More ease of programming: C compiler with vector data types and intrinsics; the compiler schedules instructions and allocates registers.
- Auto-SIMDization of scalar loops: the user should assist with alignment directives; the compiler provides feedback about SIMDization.
- Highest degree of ease of use: user-guided parallelization is still necessary, but the Cell BE looks like a single processor.
Requirements on the compiler increase with each level.

Overlap DMA with Computation
Double- or multi-buffer code or (typically) data. Example for double buffering n+1 data blocks:
- Use multiple buffers in local store.
- Use a unique DMA tag ID for each buffer.
- Use fence commands to order DMAs within a tag group.
- Use barrier commands to order DMAs within a queue.
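The double-buffering pattern above can be sketched in host-compilable C. The buffer rotation and the early kick-off of the next transfer follow the slide; `fetch_block()` is a hypothetical stand-in for the real asynchronous `mfc_get()` (with a per-buffer tag ID and a tag-status wait before the compute step), used here so the sketch runs anywhere:

```c
#include <string.h>

#define CHUNK 16  /* elements per transfer-sized block (illustrative) */

/* Stand-in for mfc_get(): on a real SPE this would start an asynchronous
 * DMA into local store, tagged with the buffer index, and the compute step
 * would first wait on that tag (mfc_write_tag_mask / mfc_read_tag_status_all). */
void fetch_block(int *dst, const int *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof *dst);
}

/* Sum nblocks blocks of input, double-buffered: while block i is being
 * processed out of one buffer, block i+1 is already in flight into the other. */
long process_double_buffered(const int *in, int nblocks) {
    int buf[2][CHUNK];
    long sum = 0;
    int cur = 0;

    fetch_block(buf[cur], in, CHUNK);          /* prime buffer 0 */
    for (int i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nblocks)                   /* kick off the next transfer early */
            fetch_block(buf[nxt], in + (i + 1) * CHUNK, CHUNK);
        /* real code: wait here on the tag for buf[cur] before reading it */
        for (int j = 0; j < CHUNK; j++)
            sum += buf[cur][j];                /* compute on the current buffer */
        cur = nxt;                             /* swap buffers */
    }
    return sum;
}
```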

Start DMAs from the SPU
Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers, because:
- there are more SPEs than the one PPE;
- the PPE can enqueue only eight DMA requests, whereas each SPE can enqueue 16.

Instruction Scheduling

Instruction Starvation Situation
- There are two instruction buffers feeding the dual-issue logic, holding up to 64 ops along the fall-through path.
- When the first buffer is half-empty, a refill can be initiated.
- When the (single) local store port is used continuously, refills cannot proceed and starvation occurs: no ops are left in the buffers.

Instruction Starvation Prevention
- Initiate the refill of an instruction buffer after it is half-empty, before it is too late to hide the latency.
- The SPE has an explicit IFETCH op which initiates an instruction fetch.
- The scheduler monitors for starvation: when the port is continuously used, it inserts an IFETCH op within the critical window to trigger the refill.
- Compiler design: the scheduler must keep track of code layout.

Design for Limited Local Store
- The local store holds only 256 KB for the program, stack, local data structures, and DMA buffers.
- Most performance optimizations put additional pressure on local store (e.g. multiple DMA buffers).
- Use plug-ins (program kernels downloaded at runtime) to build complex function servers in the local store.

Branch Optimizations
The SPE is heavily pipelined, so branch misses carry a high penalty (18 cycles). The hardware policy is to assume all branches are not taken. Advantages: reduced hardware complexity, faster clock cycles, increased predictability.
Solution approaches:
- If-conversion: compare and select operations.
- Predication/code reorganization: compiler analysis, user directives.
- Branch hint instruction (hbr, issued at least 11 cycles before the branch).
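If-conversion can be illustrated in plain C: the compare-and-select form below computes both alternatives and blends them with a mask — the scalar analogue of the SPU's vector `spu_cmpgt()`/`spu_sel()` pattern. The function names are illustrative only:

```c
#include <stdint.h>

/* Branchy version: a data-dependent branch, costly on the SPE
 * (no dynamic predictor; a miss costs ~18 cycles). */
int32_t max_branchy(int32_t a, int32_t b) {
    if (a > b) return a;
    return b;
}

/* If-converted version: compute both sides and select with a mask,
 * the compare-and-select pattern the SPE favors. No branch executes. */
int32_t max_select(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a > b);   /* all-ones if a > b, else zero */
    return (a & mask) | (b & ~mask);    /* select a or b without branching */
}
```

The same idea scales to whole loop bodies: both arms run, and the select keeps the correct result per element.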

Branches

Hinting Branches and Instruction Starvation Prevention
- The SPE provides a HINT operation that fetches the branch target into a dedicated HINT buffer; a correctly predicted (hinted) branch then carries no penalty.
- "HINT br, target" fetches ops from the target; it needs a minimum of 15 cycles and 8 intervening ops before the branch.
- The compiler inserts hints when beneficial.
- Impact on instruction starvation: after a correctly hinted branch, the IFETCH window is smaller.

Loop Unrolling
- Unroll loops to reduce dependencies and increase dual-issue rates.
- This exploits the large (128-entry) SPU register file.
- Compiler auto-unrolling is not perfect, but pretty good.

Loop Unrolling - Examples

Removing a loop-carried dependence:

    j = n;
    for (i = 1; i < n; i++) {
        a[i] = (b[i] + b[j]) / 2;
        j = i;
    }

becomes

    a[1] = (b[1] + b[n]) / 2;
    for (i = 2; i < n; i++) {
        a[i] = (b[i] + b[i-1]) / 2;
    }

Unrolling by two:

    for (i = 1; i < 100; i++) {
        a[i] = b[i+2] * c[i-1];
    }

becomes

    for (i = 1; i < 99; i += 2) {
        a[i]   = b[i+2] * c[i-1];
        a[i+1] = b[i+3] * c[i];
    }
    /* remainder iteration i == 99 still needs handling */
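As a sanity check on the second transformation, the sketch below (with illustrative names) verifies the unroll-by-two against the rolled loop, including the leftover iteration at i = 99 that the slide's `i < 99` bound leaves implicit:

```c
#define N 100

/* Rolled loop from the slide: no loop-carried dependence, so it unrolls cleanly. */
void rolled(float *a, const float *b, const float *c) {
    for (int i = 1; i < N; i++)
        a[i] = b[i + 2] * c[i - 1];
}

/* Unrolled by two (the slide's i < 99 bound), plus the remainder
 * iteration at i == 99 that the even-step loop cannot reach. */
void unrolled(float *a, const float *b, const float *c) {
    int i;
    for (i = 1; i < N - 1; i += 2) {
        a[i]     = b[i + 2] * c[i - 1];
        a[i + 1] = b[i + 3] * c[i];
    }
    for (; i < N; i++)                /* leftover iteration (i == 99) */
        a[i] = b[i + 2] * c[i - 1];
}
```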

SPU

SPU Software Pipeline

Integer Multiplies
- Avoid integer multiplies on operands greater than 16 bits: the SPU supports only a 16-bit x 16-bit multiply, so a 32-bit multiply requires five instructions (three 16-bit multiplies and two adds).
- Keep array elements sized to a power of 2 to avoid multiplies when indexing.
- Cast operands to unsigned short prior to multiplying. Constants are of type int and also require casting.
- Use a macro to explicitly perform 16-bit multiplies; this avoids inadvertent introduction of sign extends and masks due to casting:

    #define MULTIPLY(a, b) \
        (spu_extract(spu_mulo((vector unsigned short)spu_promote(a, 0), \
                              (vector unsigned short)spu_promote(b, 0)), 0))
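The five-instruction claim follows from composing a 32-bit product out of 16-bit partial products. The sketch below is plain C (not SPU intrinsics) showing the decomposition; the high-by-high partial product only affects bits above 31 and is dropped. (On the SPU the shift is folded into the high-halfword multiply, which is how three multiplies and two adds suffice.)

```c
#include <stdint.h>

/* The SPU multiplier is 16x16 -> 32. The low 32 bits of a 32-bit
 * product are composed from three 16-bit partial products, which is
 * why a plain 'a * b' on 32-bit ints expands to ~5 SPU instructions. */
uint32_t mul32_via_16(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;
    /* a_hi * b_hi would land entirely above bit 31, so it is dropped. */
    return (a_lo * b_lo)
         + (((a_lo * b_hi) + (a_hi * b_lo)) << 16);   /* wraps mod 2^32 */
}
```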

Avoid Scalar Code

Choose an SIMD strategy appropriate for your algorithm
- Evaluate the array-of-structures (AOS) organization. For graphics vertices, this organization (also called vec-across) can have more-efficient code size and simpler DMA needs, but less-efficient computation unless the code is unrolled.
- Evaluate the structure-of-arrays (SOA) organization. For graphics vertices, this organization (also called parallel-array) can be easier to SIMDize, but the data must be maintained in separate arrays, or the SPU must shuffle AOS data into SOA form.

Choose SIMD strategy appropriate for algorithm
- vec-across: more-efficient code size; typically less-efficient computation unless the code is unrolled; typically simpler DMA needs.
- parallel-array: easy to SIMDize — program as if scalar, operating on four independent objects at a time; data must be maintained in separate arrays, or the SPU must shuffle vec-across data into parallel-array form.
Consider unrolling effects when picking a SIMD strategy.
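The two organizations can be sketched in plain C; the type and function names are illustrative. The point is that the SOA loop body reads four same-field arrays, so each iteration slice maps directly onto 4-wide SIMD registers, while the AOS body mixes fields within one element:

```c
#define NPTS 4

/* Array-of-structures ("vec-across"): one object per element. */
struct aos_pt { float x, y, z, w; };

/* Structure-of-arrays ("parallel-array"): one array per field;
 * four consecutive entries of a field fill one SIMD register. */
struct soa_pts { float x[NPTS], y[NPTS], z[NPTS], w[NPTS]; };

/* Sum of squared lengths over all points, in each layout. */
float sum_len2_aos(const struct aos_pt *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)          /* mixes fields of one element */
        s += p[i].x * p[i].x + p[i].y * p[i].y
           + p[i].z * p[i].z + p[i].w * p[i].w;
    return s;
}

float sum_len2_soa(const struct soa_pts *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)          /* four independent lanes: SIMDizable */
        s += p->x[i] * p->x[i] + p->y[i] * p->y[i]
           + p->z[i] * p->z[i] + p->w[i] * p->w[i];
    return s;
}
```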

SIMD Example

Load / Store by Quadword
- Scalar loads and stores are slow, with long latency, because the SPU supports only quadword loads and stores.
- Consider making scalars into quadword integer vectors.
- Load or store scalar arrays as quadwords, and perform your own extraction and insertion, to eliminate load and store instructions.
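The extraction idea can be sketched in host C, with a 16-byte union standing in for an SPU quadword register and `memcpy()` standing in for the single `lq` load; on real hardware `spu_extract()`/`spu_insert()` do the lane work in registers. All names here are illustrative, and real SPU code would additionally require `a` to be 16-byte aligned:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a 128-bit quadword register: four 32-bit lanes. */
typedef union { int32_t lane[4]; } quadword;

/* Sum a scalar int array by loading whole quadwords and extracting
 * lanes, instead of issuing one (slow) scalar load per element.
 * Assumes n is a multiple of 4. */
int64_t sum_by_quadword(const int32_t *a, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i += 4) {
        quadword q;
        memcpy(&q, &a[i], sizeof q);   /* one quadword load (lq), not four */
        s += (int64_t)q.lane[0] + q.lane[1] + q.lane[2] + q.lane[3];
    }
    return s;
}
```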

SIMD Programming Tips

[Slides 33-40: worked SIMD examples, shown as figures]

Use Offset Pointers
Use the PPE's load/store-with-update instructions. These allow sequential indexing through an array without additional instructions to increment the array pointer. The SPEs do not support load/store with update, so use the d-form instructions instead, specifying an immediate offset from a base array pointer.
For example, consider the following PPE code, which exploits the PowerPC store-with-update instruction (here q is assumed to be declared as a vector float pointer):

    #define FILL_VEC_FLOAT(_q, _data) (*(_q)++ = (_data))

    FILL_VEC_FLOAT(q, x);
    FILL_VEC_FLOAT(q, y);
    FILL_VEC_FLOAT(q, z);
    FILL_VEC_FLOAT(q, w);

The same code can be modified for SPU execution as follows:

    #define FILL_VEC_FLOAT(_q, _offset, _data) (*((_q) + (_offset)) = (_data))

    FILL_VEC_FLOAT(q, 0, x);
    FILL_VEC_FLOAT(q, 1, y);
    FILL_VEC_FLOAT(q, 2, z);
    FILL_VEC_FLOAT(q, 3, w);
    q += 4;

Shuffle-Byte Instructions for Table Look-Ups
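The technique can be modeled in plain C: the loop below performs the 16 parallel table look-ups that a single `spu_shuffle()` (the `shufb` instruction) does in one instruction, for the common case of a 16-entry byte table held in a register. (The real `shufb` uses 5-bit indices across two source registers plus special codes for constant bytes; this hypothetical sketch models only the masked 16-entry case.)

```c
#include <stdint.h>

/* One shuffle-byte "instruction": 16 simultaneous look-ups into a
 * 16-entry byte table, replacing 16 scalar loads. On the SPU this
 * whole loop is a single spu_shuffle(table, table, idx). */
void shufb_lookup(uint8_t out[16], const uint8_t table[16],
                  const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = table[idx[i] & 0x0F];   /* low 4 bits select the entry */
}
```

A classic use is nibble-to-hex-digit conversion: the table holds '0'..'F' and the index vector holds the nibbles to encode.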

(c) Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United States September 2005. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture. Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Microelectronics Division, 1580 Route 52, Bldg. 504, Hopewell Junction, NY 12533-6351. The IBM home page is http://www.ibm.com. The IBM Microelectronics Division home page is http://www.chips.ibm.com.