OpenMP on the IBM Cell BE

OpenMP on the IBM Cell BE
15th meeting of ScicomP
Barcelona Supercomputing Center (BSC), May 18-22, 2009
Marc Gonzalez Tallada

Index
- OpenMP programming and code transformations
  - Tiling and software cache transformations
  - Sources of overheads
- Performance
  - Loop-level parallelism
  - Double buffering
  - Combining OpenMP and SIMD parallelism

Introduction
The Cell BE architecture is a multi-core design that mixes two architectures:
- One core based on the PowerPC architecture (the PPE).
- Eight cores based on the Synergistic Processor Element (the SPEs).
- Each SPE is provided with a 256 KB local store; load and store instructions on an SPE can address only its local store.
- Data transfers to/from main memory are performed explicitly, under software control, through each SPE's MFC.
[Block diagram: PPE (PPU, L1, L2, PXU), eight SPEs (SPU, SXU, 256 KB LS, MFC), the EIB (up to 96 bytes/cycle, 16 bytes/cycle per port), MIC to dual XDR memory, BIC to FlexIO.]

Cell programmability
Transforming the original code requires:
- Allocating buffers in the local store
- Introducing DMA operations within the code
- Synchronization statements
- Translating from the original address space to the local address space
Manual solution: PERFORMANCE, but not PROGRAMMABILITY
- Very optimized codes, but at the cost of programmability
- Manual SIMD coding
- Overlap of communication with computation
Automatic solution: tiling, double buffering
- A good solution for regular applications
- Needs considerable information at compile time
Software cache: PROGRAMMABILITY, but not PERFORMANCE
- Performance is usually limited by the information available at compile time
- Very difficult to generate code that overlaps computation with communication
A sketch of the manual transformation follows.
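To make the manual path concrete, here is a minimal sketch of the kind of code a programmer writes by hand on the SPU side: a buffer in the local store, explicit DMA in and out, and address translation done as an offset on the effective address. The function name, the chunk size and the trivial scaling kernel are illustrative assumptions; the MFC calls are those provided by the Cell SDK's <spu_mfcio.h>.

    #include <spu_mfcio.h>

    #define CHUNK 1024                       /* floats per local-store buffer (4 KB) */

    static float buf[CHUNK] __attribute__((aligned(128)));

    /* Process an array living in main memory at effective address ea_a.
     * Assumes ea_a is 128-byte aligned and n is a multiple of CHUNK. */
    void scale_chunks(unsigned long long ea_a, int n, float factor)
    {
        const unsigned int tag = 0;
        int i, j;

        for (i = 0; i < n; i += CHUNK) {
            /* DMA-in: main memory -> local store */
            mfc_get(buf, ea_a + (unsigned long long)i * sizeof(float),
                    CHUNK * sizeof(float), tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();       /* synchronize with the transfer */

            for (j = 0; j < CHUNK; j++)      /* compute on the local copy */
                buf[j] *= factor;

            /* DMA-out: local store -> main memory */
            mfc_put(buf, ea_a + (unsigned long long)i * sizeof(float),
                    CHUNK * sizeof(float), tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();
        }
    }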

Can the Cell BE be programmed as a cache-based multi-core?
The OpenMP programming model:
- Parallel regions
- Variable scoping: PRIVATE, SHARED, THREADPRIVATE
- Worksharing constructs: DO, SECTIONS, SINGLE
- Synchronization constructs: CRITICAL, BARRIER, ATOMIC
- Memory consistency: FLUSH

    #pragma omp parallel private(c, i) shared(a, b, d)
    {
      for (i = 0; i < n; i++)
        c[i] = ...;

      #pragma omp for schedule(static) reduction(+:s)
      for (i = 0; i < n; i++) {
        a[i] = c[b[i]] + d[i];
        s = s + a[i];
      }

      #pragma omp barrier
      #pragma omp critical
      {
        s = s + c[0];
      }
    }

The hardware does not impose any restriction on the model: the IBM Cell BE can be programmed as a cache-based multi-core.
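For reference, here is a self-contained version of the running example that compiles and runs as an ordinary OpenMP program (e.g. gcc -fopenmp). The array sizes, the initialization values and the final print are assumptions added here, since the slide elides them; the OpenMP structure is exactly the one above.

    /* Compilable version of the running example (sizes and values are
     * illustrative).  On a cache-based multicore this is all that is
     * needed; on Cell BE the compiler must additionally insert buffering,
     * DMA and address translation, as the following slides show. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        double a[N], c[N], d[N], s = 0.0;
        int b[N], i;

        for (i = 0; i < N; i++) { b[i] = (i * 7) % N; d[i] = 1.0; }

        #pragma omp parallel private(c, i) shared(a, b, d)
        {
            for (i = 0; i < N; i++)        /* each thread fills its private c */
                c[i] = 2.0 * i;            /* illustrative value */

            #pragma omp for schedule(static) reduction(+:s)
            for (i = 0; i < N; i++) {
                a[i] = c[b[i]] + d[i];     /* c[b[i]] is the unpredictable access */
                s = s + a[i];
            }

            #pragma omp barrier
            #pragma omp critical
            { s = s + c[0]; }
        }
        printf("s = %f\n", s);
        return 0;
    }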

Main problem to solve
Transforming the original code (the example from the previous slide) requires:
- Allocating buffers in the local store
- Introducing DMA operations within the code
- Synchronization statements
- Translating from the original address space to the local address space
Compile-time predictable accesses: a[i], d[i], b[i], s.
Unpredictable access: c[b[i]].
Approach: software cache + tiling techniques.

Introduction: code transformation with poor information at compile time (software cache only)

Original loop:

    #pragma omp for schedule(static) reduction(+:s)
    for (i = 0; i < n; i++) {
      a[i] = c[b[i]] + d[i];
      s = s + a[i];
    }

Memory handler (h1..h4): contains a pointer to a buffer in the local store.
HIT: executes the cache lookup and updates the memory handler.
REF: performs the address translation and the actual memory access.

Transformed loop:

    tmp_s = 0.0;
    for (i = start; i < end; i++) {
      if (!HIT(h1, &d[i])) MAP(h1, &d[i]);
      if (!HIT(h2, &b[i])) MAP(h2, &b[i]);
      tmp01 = REF(h1, &d[i]);
      tmp02 = REF(h2, &b[i]);
      if (!HIT(h4, &c[tmp02])) MAP(h4, &c[tmp02]);
      tmp03 = REF(h4, &c[tmp02]);
      if (!HIT(h3, &a[i])) MAP(h3, &a[i]);
      REF(h3, &a[i]) = tmp03 + tmp01;
      tmp_s = tmp_s + REF(h3, &a[i]);
    }
    atomic_add(s, tmp_s, ...);
    omp_barrier();
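The HIT/MAP/REF primitives are not spelled out on the slide; the following is a minimal, host-runnable sketch of what such an interface could look like, assuming a direct-mapped cache with 128-byte lines, a handle that remembers its last translation, and a memcpy standing in for the DMA transfer. Everything here (names, sizes, policies) is an illustrative assumption, not the implementation used in the talk.

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 128u
    #define NUM_LINES 512u                    /* 64 KB worth of cache lines */

    typedef struct {
        uintptr_t tag;                        /* global base address of the line */
        int       valid;
        char      data[LINE_SIZE];
    } cache_line_t;

    typedef struct {                           /* "memory handler" (h1..h4) */
        cache_line_t *line;                   /* line holding the last access */
        uintptr_t     base;                   /* global base of that line */
    } handle_t;

    static cache_line_t cache[NUM_LINES];

    /* Stand-in for the DMA get of a line; on the SPE this would be mfc_get(). */
    static void fetch_line(cache_line_t *l, uintptr_t base)
    {
        memcpy(l->data, (const void *)base, LINE_SIZE);
    }

    /* HIT: does the handler already point at the line holding 'global'? */
    static int HIT(handle_t *h, const void *global)
    {
        uintptr_t base = (uintptr_t)global & ~(uintptr_t)(LINE_SIZE - 1);
        return h->line != 0 && h->base == base;
    }

    /* MAP: cache lookup, miss handling if needed, then handler update. */
    static void MAP(handle_t *h, const void *global)
    {
        uintptr_t base = (uintptr_t)global & ~(uintptr_t)(LINE_SIZE - 1);
        cache_line_t *l = &cache[(base / LINE_SIZE) % NUM_LINES];

        if (!l->valid || l->tag != base) {
            /* write-back of dirty evicted data omitted in this sketch */
            fetch_line(l, base);
            l->tag = base;
            l->valid = 1;
        }
        h->line = l;
        h->base = base;
    }

    /* REF: translate a global address into a reference inside the local copy
     * (shown for float data; a real implementation is type-parameterised). */
    #define REF(h, global) \
        (*(float *)((h)->line->data + ((uintptr_t)(global) & (LINE_SIZE - 1))))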

For strided memory references
- Enable compiler optimizations for memory references that expose a strided access pattern.
- Execute the control code at buffer level, not at every memory instance.
- Maximize the overlap between computation and communication.
- Try to compute the number of iterations that can be executed before a buffer change is needed (e.g., for &a[i], one buffer covers a fixed range of consecutive iterations).
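The per-buffer control idea reduces to one small computation: how many iterations still fit in the buffer that is currently mapped. A hedged sketch follows (the handle layout and names are hypothetical); the transformed loops on the next slides take the minimum of this quantity over all strided references, which is the n = min(n, i + avail(...)) pattern.

    #include <stddef.h>

    typedef struct {                 /* handle for one strided stream */
        char  *ls_buf;               /* buffer in the local store (e.g. 4 KB) */
        size_t filled;               /* bytes of the stream currently mapped */
        size_t offset;               /* byte offset of the current element i */
    } stream_handle_t;

    /* How many loop iterations can still run before this strided reference
     * steps outside the buffer that is currently mapped? */
    static size_t avail_iterations(const stream_handle_t *h,
                                   size_t elem_size, size_t stride_elems)
    {
        size_t step = elem_size * stride_elems;
        return (h->filled - h->offset) / step;
    }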

Hybrid code transformation
Organize the local store into two storages:
- Buffers for predictable (strided) accesses
- A software cache for unpredictable accesses

Original loop:

    #pragma omp for schedule(static) reduction(+:s)
    for (i = 0; i < n; i++) {
      a[i] = c[b[i]] + d[i];
      s = s + a[i];
    }

Computational burst (strided references go straight to their buffers; only c[b[i]] goes through the software cache):

    for (i = start; i < n; i++) {
      tmp01 = REF(h1, &d[i]);
      tmp02 = REF(h2, &b[i]);
      if (!HIT(h4, &c[tmp02])) MAP(h4, &c[tmp02]);
      tmp03 = REF(h4, &c[tmp02]);
      REF(h3, &a[i]) = tmp03 + tmp01;
      tmp_s = tmp_s + REF(h3, &a[i]);
    }

Control code around the burst:

    tmp_s = 0.0;
    i = start;
    while (i < end) {
      n = end;
      if (!AVAIL(h1, &d[i])) MMAP(h1, &d[i]);
      n = min(n, i + avail(h1, &d[i]));
      if (!AVAIL(h2, &b[i])) MMAP(h2, &b[i]);
      n = min(n, i + avail(h2, &b[i]));
      if (!AVAIL(h3, &a[i])) MMAP(h3, &a[i]);
      n = min(n, i + avail(h3, &a[i]));
      HCONSISTENCY(n, h3);
      HSYNC(h1, h2, h3);
      start = i;
      for (i = start; i < n; i++) {
        ... /* computational burst above */
      }
    }
    atomic_add(s, tmp_s, ...);
    omp_barrier();

Execution model
Loops execute in three different phases:
1. Control code: allocate buffers, program DMA transfers, consistency actions.
2. Synchronize with the DMA transfers.
3. Execute a burst of computation, which might itself include some control code, DMA programming and synchronization (for the unpredictable references).

    tmp_s = 0.0;
    i = 0;
    while (i < upper_bound) {
      /* control code */
      n = N;
      if (!AVAIL(h1, &d[i])) MMAP(h1, &d[i]);
      n = min(n, i + avail(h1, &d[i]));
      if (!AVAIL(h2, &b[i])) MMAP(h2, &b[i]);
      n = min(n, i + avail(h2, &b[i]));
      if (!AVAIL(h3, &a[i])) MMAP(h3, &a[i]);
      HCONSISTENCY(n, h3);
      /* synchronization */
      HSYNC(h1, h2, h3);
      /* computational burst */
      start = i;
      for (i = start; i < n; i++) {
        ...
      }
    }
    atomic_add(s, tmp_s, ...);
    omp_barrier();

Compiler limitations: memory aliasing
- What if a, b, c or d are aliases of each other? How can buffers be allocated consistently?
- What if some element in a buffer is also referenced through the software cache?
To help the compiler with memory aliasing:
- Avoid pointer usage
- Avoid function calls: use inline annotations

    #pragma omp parallel private(c, i) shared(a, b, d)
    {
      for (i = 0; i < n; i++)
        c[i] = ...;

      #pragma omp for schedule(static) reduction(+:s)
      for (i = 0; i < n; i++) {
        a[i] = c[b[i]] + d[i] + ...;
        s = s + a[i];
      }

      #pragma omp barrier
      #pragma omp critical
      {
        s = s + c[0];
      }
    }
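In standard C, one way to hand the compiler the no-alias guarantees mentioned above is restrict qualification plus inlinable helpers. This is a generic illustration of the advice, not code from the talk's compiler; the function names are hypothetical.

    /* restrict-qualified pointers promise the compiler that a, b, c and d
     * never overlap, so each array can safely get its own local-store
     * buffer; the static inline helper keeps the loop body free of opaque
     * calls. */
    static inline double combine(double cv, double dv)
    {
        return cv + dv;
    }

    void kernel(double *restrict a, const double *restrict c,
                const double *restrict d, const int *restrict b,
                double *restrict s, int n)
    {
        double local_s = 0.0;
        int i;

        for (i = 0; i < n; i++) {       /* body of the omp-for from before */
            a[i] = combine(c[b[i]], d[i]);
            local_s += a[i];
        }
        *s += local_s;
    }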

Memory Consistency
- Maintain a relaxed consistency model, according to the OpenMP memory model.
- Based on atomicity and dirty bits.
- When data in a buffer has to be evicted, the write-back process is composed of three steps:
  1. Atomic read
  2. Merge
  3. Atomic write
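The three-step write-back can be sketched in plain C. The byte-granularity dirty bits, the helper names and the use of memcpy in place of the real atomic DMA sequence (on the SPE, the MFC lock-line commands getllar/putllc) are assumptions for illustration.

    #include <string.h>
    #include <stdint.h>

    #define LINE_SIZE 128u

    typedef struct {
        char     data[LINE_SIZE];            /* local copy of the line */
        uint8_t  dirty[LINE_SIZE];           /* 1 = byte modified locally */
        char    *global;                     /* home location in main memory */
    } line_t;

    static void write_back(line_t *l)
    {
        char merged[LINE_SIZE];
        unsigned i;

        /* 1. atomic read of the current global contents */
        memcpy(merged, l->global, LINE_SIZE);

        /* 2. merge: only the bytes marked dirty overwrite the global copy */
        for (i = 0; i < LINE_SIZE; i++)
            if (l->dirty[i])
                merged[i] = l->data[i];

        /* 3. atomic write of the merged line; if the reservation were lost
         *    (another SPE wrote the line meanwhile) the sequence would be
         *    retried from step 1 */
        memcpy(l->global, merged, LINE_SIZE);

        memset(l->dirty, 0, LINE_SIZE);
    }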

Evaluation
Comparison to a traditional software cache:
- 4-way, 128-byte cache line, 64 KB of capacity
- Write-back implemented through dirty bits and atomic (synchronous) data transfers
[Bar chart: cache overhead comparison, execution time (sec) for the HYBRID, HYBRID synch and TRADITIONAL configurations on IS, CG, FT and MG; the HYBRID and TRADITIONAL times appear in the table on the next slide.]

Evaluation: comparing performance with POWER5
POWER5-based blade with two processors running at 1.5 GHz and 16 GB of memory (8 GB per processor); each processor has 2 cores with 2 SMT threads each and a shared 1.8 MB L2.

Execution time (sec), whole applications:

                         IS      CG       FT      MG
  POWER5                 8.25    10.76    5.61    3.12
  Cell BE HYBRID         9.33    12.29    10.90   3.68
  Cell BE TRADITIONAL    47.41   103.44   78.61   13.11

Execution time (sec), per loop:

  Loop        POWER5   Cell BE HYBRID
  IS loop 1   8.00     6.65
  IS loop 2   0.25     2.68
  FT loop 1   1.52     1.76
  FT loop 2   1.17     3.79
  FT loop 3   1.14     2.27
  FT loop 4   1.19     2.23
  FT loop 5   0.59     0.81
  MG loop 1   0.22     0.22
  MG loop 2   0.03     0.06
  MG loop 3   0.67     0.81
  MG loop 4   0.37     0.35
  MG loop 5   1.55     1.69
  MG loop 6   0.21     0.49
  MG loop 7   0.07     0.07

Evaluation: scalability, Cell BE versus POWER5

Scalability on Cell BE, execution time (sec):

           1 SPE    2 SPEs   4 SPEs   8 SPEs
  MG-A     23.99    12.28    6.42     3.50
  FT-A     72.48    37.88    20.46    10.96
  CG-B     73.74    37.75    20.17    12.25
  IS-B     45.59    24.21    14.11    10.24

Scalability on POWER5, execution time (sec):

           1 thread   2 threads   4 threads
  MG-A     6.86       3.79        3.12
  FT-A     11.64      6.94        5.61
  CG-B     24.86      13.20       10.76
  IS-B     10.25      9.83        8.25

Runtime activity
Number of iterations per runtime intervention; buffer size: 4 KB.

MG
  kernel  SPEs  iterations   interventions  DMA transfers  transfers/interv.  iterations/interv.
  1       2     213310272    2788302        17025381       6.11               76.50
  1       4     106655136    1393961        8511458        6.11               76.51
  1       8     53327648     696943         4255404        6.11               76.52
  2       2     95494660     1401200        8842580        6.31               68.15
  2       4     47747270     700650         4421740        6.31               68.15
  2       8     23873710     350385         2211260        6.31               68.14
  3       2     33554432     196096         196096         1.00               171.11
  3       4     16777216     98048          98048          1.00               171.11
  3       8     8388608      49024          49024          1.00               171.11
  4       2     786412       8098           32392          4.00               97.11
  4       4     393216       4032           16128          4.00               97.52
  4       8     196648       2026           8104           4.00               97.06
  5       2     795076       7741           30964          4.00               102.71
  5       4     401860       3886           15544          4.00               103.41
  5       8     205232       2005           8020           4.00               102.36

CG
  kernel  SPEs  iterations   interventions  DMA transfers  transfers/interv.  iterations/interv.
  1       2     225000       444            2664           6.00               506.76
  1       4     112500       222            1332           6.00               506.76
  1       8     56244        114            684            6.00               493.37
  2       2     5624700      11100          11100          1.00               506.73
  2       4     2812200      5550           5550           1.00               506.70
  2       8     1406100      2850           2850           1.00               493.37
  3       2     5624700      11100          22200          2.00               506.73
  3       4     2812200      5550           11100          2.00               506.70
  3       8     1406100      2850           5700           2.00               493.37
  4       2     224988       444            888            2.00               506.73
  4       4     112488       222            444            2.00               506.70
  4       8     56244        114            228            2.00               493.37
  5       2     224988       444            444            1.00               506.73
  5       4     112488       222            222            1.00               506.70
  5       8     56244        114            114            1.00               493.37
  6       2     5624700      11100          44400          4.00               506.73
  6       4     2812200      5550           22200          4.00               506.70
  6       8     1406100      2850           11400          4.00               493.37
  7       2     187490       370            740            2.00               506.73
  7       4     93740        185            370            2.00               506.70
  7       8     46870        95             190            2.00               493.37
  8       2     187490       370            740            2.00               506.73
  8       4     93740        185            370            2.00               506.70
  8       8     46870        95             190            2.00               493.37

FT
  kernel  SPEs  iterations   interventions  DMA transfers  transfers/interv.  iterations/interv.
  1       2     134217728    524288         4194304        8.00               256.00
  1       4     67108864     262144         2097152        8.00               256.00
  1       8     33554432     131072         1048576        8.00               256.00
  2       2     134217728    524288         4194304        8.00               256.00
  2       4     67108864     262144         2097152        8.00               256.00
  2       8     33554432     131072         1048576        8.00               256.00
  3       2     117440512    458752         3670016        8.00               256.00
  3       4     58720256     229376         1835008        8.00               256.00
  3       8     29360128     114688         917504         8.00               256.00

IS
  kernel  SPEs  iterations   interventions  DMA transfers  transfers/interv.  iterations/interv.
  1       2     11534336     11264          11264          1.00               1024.00
  1       4     5767168      5632           5632           1.00               1024.00
  1       8     2883584      2816           2816           1.00               1024.00
  2       2     23068672     22528          22528          1.00               1024.00
  2       4     23068672     22528          22528          1.00               1024.00
  2       8     23068672     22528          22528          1.00               1024.00

Evaluation: overhead distribution
MG class A, loop 5 - cache overhead distribution:
  WORK        51.53
  UPDATE D-B  19.05
  WRITE-BACK  9.22
  MMAP        7.29
  BARRIER     4.11
  DMA-REG     3.81
  DEC         2.19
Legend:
  WORK: time spent in actual computation.
  WRITE-BACK: time spent in the write-back process.
  UPDATE D-B: time spent updating the dirty-bits information.
  DMA-IREG: time spent synchronizing with the DMA data transfers in the TC.
  DMA-REG: time spent synchronizing with the DMA data transfers in the HLC.
  DEC: time spent in the pinning mechanism for cache lines.
  TRANSAC: time spent executing the control code of the TC.
  BARRIER: time spent in the barrier synchronization at the end of the parallel computation.
  MMAP: time spent executing look-up, placement/replacement actions and DMA programming.

Evaluation: overhead distribution
IS class B, loop 1 - cache overhead distribution:
  WORK        43.54
  TRANSAC     32.95
  UPDATE D-B  12.75
  DMA-IREG    5.38
  WRITE-BACK  2.20
  BARRIER     1.18
  MMAP        0.74
  DMA-REG     0.39
  DEC         0.19
(Legend as on the previous slide.)

Memory Consistency
- Maintain a relaxed consistency model, following the OpenMP memory model.
- Important sources of overhead:
  - Dirty bits: every store operation is monitored.
  - Atomicity in the write-back process.
- Optimizations to smooth the impact of this overhead rely on several observations about scientific parallel codes:
  - Most cache lines are modified by only one execution flow.
  - Buffers are usually modified in full, so no atomicity is required at write-back time.
  - Aliasing between data in a buffer and data in the software cache rarely occurs.
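Building on the write-back sketch from a few slides back (same includes and line_t type), these observations suggest a fast path that skips the read-and-merge steps when a buffer is fully modified by a single writer. The check and the single_writer flag are illustrative assumptions, not the talk's implementation.

    /* Fast-path write-back: when every byte of the line is dirty and only
     * one SPE writes it, the atomic read-and-merge can be skipped and the
     * line is simply pushed back (memcpy stands in for a plain DMA put). */
    static int fully_dirty(const line_t *l)
    {
        unsigned i;
        for (i = 0; i < LINE_SIZE; i++)
            if (!l->dirty[i])
                return 0;
        return 1;
    }

    static void write_back_optimized(line_t *l, int single_writer)
    {
        if (single_writer && fully_dirty(l)) {
            memcpy(l->global, l->data, LINE_SIZE);   /* plain put, no merge */
            memset(l->dirty, 0, LINE_SIZE);
        } else {
            write_back(l);                           /* general three-step path */
        }
    }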

Evaluation: memory consistency
[Charts: reduction of execution time (%) per loop for MG class A, IS class B, CG class B and FT class A, comparing the CLR, HL, MR and PERFECT eviction schemes.]
CLR: data eviction based on 128-byte hardware cache line reservation.
HL: data eviction is done at buffer level; no alias between data in a buffer and data in the software cache.
MR: data eviction is done at buffer level; no alias between data in a buffer and data in the software cache, and a single writer.
PERFECT: data eviction is freely executed, without atomicity or dirty bits.

Double buffering techniques
Double buffering does not come for free:
- It implies executing more control code.
- It requires adapting the computational bursts to data transfer times.
- It depends on the available bandwidth, which itself depends on the number of executing threads.
A minimal sketch follows.
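A minimal double-buffering sketch for the SPU using the Cell SDK MFC calls from <spu_mfcio.h>. The chunk size, the trivial compute() burst and the read-only streaming pattern are illustrative assumptions, not the runtime described in the talk; the sketch shows where the extra control code comes from: a second buffer, a second tag, and a prefetch issued before waiting on the current transfer.

    #include <spu_mfcio.h>

    #define CHUNK 1024                            /* floats per buffer (4 KB) */

    static float buf[2][CHUNK] __attribute__((aligned(128)));

    static void compute(float *data, int len)     /* illustrative burst */
    {
        int j;
        for (j = 0; j < len; j++)
            data[j] *= 2.0f;
    }

    /* Double-buffered read-only streaming: while buffer 'cur' is being
     * computed on, the MFC is already fetching the next chunk into the
     * other buffer.  Tags 0 and 1 identify the two in-flight transfers.
     * Assumes n is a multiple of CHUNK and ea is 128-byte aligned. */
    void process(unsigned long long ea, int n)
    {
        int cur = 0, i;

        mfc_get(buf[cur], ea, CHUNK * sizeof(float), cur, 0, 0);

        for (i = 0; i < n; i += CHUNK) {
            int nxt = cur ^ 1;

            if (i + CHUNK < n)                    /* prefetch the next chunk */
                mfc_get(buf[nxt],
                        ea + (unsigned long long)(i + CHUNK) * sizeof(float),
                        CHUNK * sizeof(float), nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);         /* wait only for 'cur' */
            mfc_read_tag_status_all();

            compute(buf[cur], CHUNK);             /* computational burst */

            cur = nxt;
        }
    }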

Evaluation: pre-fetching of data - speedups and execution times
[Charts: per-loop speedups of the software-modulo-scheduled loops, with pre-fetching applied only to regular memory references, for the CG, IS and FT loops and the STREAM Copy, Scale, Add and Triad kernels; whole-application execution times with and without pre-fetching. Overall application speedups: CG 1.082, IS 1.203, FT 0.996.]

Combining OpenMP with SIMD execution
The actual effect is limited by the execution model:
- It only affects the computational bursts.
- It is very dependent on runtime parameters: the number of threads and the number of iterations per runtime intervention.
[Chart: SIMD speedup per loop for CG, IS, FT and MG with 1, 2, 4 and 8 SPEs.]
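What gets SIMDized is the computational burst operating on local-store buffers. A hedged sketch of a hand-vectorized element-wise add using the SPU intrinsics from <spu_intrinsics.h>; the talk relied on the compiler's SIMDization, so this only illustrates where the vector code sits. The function name, float data type and buffer layout are assumptions.

    #include <spu_intrinsics.h>

    /* SIMDized burst: a_buf[i] = c_buf[i] + d_buf[i] over one local-store
     * buffer, four single-precision floats per instruction.  Assumes
     * 16-byte aligned buffers and a length that is a multiple of 4. */
    void burst_add(float *a_buf, const float *c_buf, const float *d_buf, int len)
    {
        const vec_float4 *vc = (const vec_float4 *)c_buf;
        const vec_float4 *vd = (const vec_float4 *)d_buf;
        vec_float4 *va = (vec_float4 *)a_buf;
        int i;

        for (i = 0; i < len / 4; i++)
            va[i] = spu_add(vc[i], vd[i]);
    }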

Combining OpenMP with SIMD execution (continued)
Same caveats as the previous slide: the effect is limited to the computational bursts and depends strongly on the number of threads and on the number of iterations per runtime intervention.
[Chart: SIMD speedup for the STREAM kernels (copy, scale, add, triad) and for FT, MG and CG with 1, 2, 4 and 8 SPEs.]

Conclusions
OpenMP transformations:
- Remember the three execution phases.
- Heavily conditioned by memory aliasing: try to avoid pointers and introduce inline annotations.
- We can reach performance similar to what we would obtain from a cache-based multi-core.
Double-buffering effectiveness:
- Depends on the number of threads, the access patterns and the bandwidth.
- Speedups range between 10% and 20%.
SIMD effectiveness:
- Only affects the computational phase.
- Limited by alignment constraints.

Questions