OpenMP on the IBM Cell BE. 15th meeting of ScicomP, Barcelona Supercomputing Center (BSC), May 18-22, 2009. Marc Gonzalez Tallada
Index
OpenMP programming and code transformations
Tiling and software-cache transformations
Sources of overhead
Performance: loop-level parallelism, double buffering, combining OpenMP and SIMD parallelism
Introduction
The Cell BE architecture is a multi-core design that mixes two architectures: one core based on the PowerPC architecture (the PPE) and eight cores based on the Synergistic Processor Element (SPE). SPEs are provided with local stores; load and store instructions on an SPE can address only its local store. Data transfers to/from main memory are explicitly performed under software control.
[Block diagram: PPE (PPU, PXU, L1, L2 caches), eight SPEs (SPU/SXU with 256 KB LS and MFC, 16 B/cycle each), all connected by the EIB (up to 96 bytes/cycle), with the MIC (dual XDR) and BIC (FlexIO) interfaces.]
Cell programmability
Transform the original code: allocate buffers in the local store, introduce DMA operations within the code, add synchronization statements, and translate from the original address space to the local address space.
Manual solution: PERFORMANCE but not PROGRAMMABILITY. Very optimized codes, but at the cost of programmability: manual SIMD coding, hand-tuned overlap of communication with computation.
Automatic solution: tiling and double buffering are a good solution for regular applications, but need considerable information at compile time. Software cache: PROGRAMMABILITY but not PERFORMANCE. Performance is usually limited by the information available at compile time, and it is very difficult to generate code that overlaps computation with communication.
Can the Cell BE be programmed as a cache-based multi-core?
OpenMP programming model: parallel regions; variable scoping (PRIVATE, SHARED, THREADPRIVATE); worksharing constructs (DO, SECTIONS, SINGLE); synchronization constructs (CRITICAL, BARRIER, ATOMIC); memory consistency (FLUSH).

#pragma omp parallel private(c,i) shared(a, b, d)
{
  for (i = 0; i < n; i++) c[i] = ...;
  #pragma omp for schedule(static) reduction(+:s)
  for (i = 0; i < n; i++) {
    a[i] = c[b[i]] + d[i];
    s = s + a[i];
  }
  #pragma omp barrier
  #pragma omp critical
  { s = s + c[0]; }
}

The hardware does not impose any restriction on the model: the IBM Cell BE can be programmed as a cache-based multi-core.
Main problem to solve
Transform the original code: allocate buffers in the local store, introduce DMA operations within the code, add synchronization statements, and translate from the original address space to the local address space.
Compile-time predictable accesses: a[i], d[i], b[i], s. Unpredictable access: c[b[i]].

#pragma omp parallel private(c,i) shared(a, b, d)
{
  for (i = 0; i < n; i++) c[i] = ...;
  #pragma omp for schedule(static) reduction(+:s)
  for (i = 0; i < n; i++) {
    a[i] = c[b[i]] + d[i];
    s = s + a[i];
  }
  #pragma omp barrier
  #pragma omp critical
  { s = s + c[0]; }
}

Solution: software cache + tiling techniques.
Introduction
Code transformation with poor information at compile time: only the software cache is used.
Memory handler (h*): contains the pointer to the buffer in the local store. HIT: executes the cache lookup and updates the memory handler. MAP: maps the referenced data into a buffer on a miss. REF: performs the address translation and the actual memory access.

#pragma omp for schedule(static) reduction(+:s)
for (i = 0; i < n; i++) {
  a[i] = c[b[i]] + d[i];
  s = s + a[i];
}

becomes:

tmp_s = 0.0;
for (i = start; i < end; i++) {
  if (!HIT(h1, &d[i])) MAP(h1, &d[i]);
  if (!HIT(h2, &b[i])) MAP(h2, &b[i]);
  tmp01 = REF(h1, &d[i]);
  tmp02 = REF(h2, &b[i]);
  if (!HIT(h4, &c[tmp02])) MAP(h4, &c[tmp02]);
  tmp03 = REF(h4, &c[tmp02]);
  if (!HIT(h3, &a[i])) MAP(h3, &a[i]);
  REF(h3, &a[i]) = tmp03 + tmp01;
  tmp_s = tmp_s + REF(h3, &a[i]);
}
atomic_add(s, tmp_s, ...);
omp_barrier();
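The HIT/MAP/REF handler operations above can be sketched in plain C as a tiny direct-mapped software cache over a simulated main memory. All names here (handler_t, sc_hit, sc_map, sc_ref, LINE_WORDS) are illustrative stand-ins, not the actual runtime API, and the DMA transfer is modeled by a memcpy.

```c
#include <assert.h>
#include <string.h>

#define LINE_WORDS 32            /* 128-byte cache line of 4-byte words */

typedef struct {
    int   tag;                   /* line-aligned base index, -1 = empty */
    float buf[LINE_WORDS];       /* copy of the line in "local store" */
} handler_t;

static float main_mem[1024];     /* stands in for global memory */

/* HIT: does the handler's buffer already hold the line for this index? */
static int sc_hit(const handler_t *h, int idx) {
    return h->tag == (idx / LINE_WORDS) * LINE_WORDS;
}

/* MAP: miss path - "DMA" the enclosing line into the local buffer. */
static void sc_map(handler_t *h, int idx) {
    h->tag = (idx / LINE_WORDS) * LINE_WORDS;
    memcpy(h->buf, &main_mem[h->tag], sizeof(h->buf));
}

/* REF: translate the global index to a local-store address. */
static float *sc_ref(handler_t *h, int idx) {
    return &h->buf[idx - h->tag];
}
```

A read of d[i] then becomes: `if (!sc_hit(&h1, i)) sc_map(&h1, i); x = *sc_ref(&h1, i);`, executing the lookup before every memory instance, which is exactly the per-access overhead the hybrid transformation later removes for strided references.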
For strided memory references
Enable compiler optimizations for memory references that expose a strided access pattern: execute control code at buffer level, not at every memory instance, and maximize the overlap between computation and communication. Try to compute the number of iterations that can be executed before the buffer needs to be changed.
[Figure: successive &a[i] accesses served from one buffer.]
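That iteration count can be computed once per buffer from the next access address, the buffer base and the stride. A minimal sketch, assuming 4-byte elements and a 4 KB buffer (the function name and signature are illustrative, not from the runtime):

```c
#include <assert.h>

enum { BUF_BYTES = 4096 };       /* one 4 KB local-store buffer */

/* How many strided 4-byte accesses starting at next_addr still fall
 * entirely inside the buffer mapped at buf_base. */
static int iters_in_buffer(long next_addr, long buf_base,
                           int stride_bytes) {
    long bytes_left = buf_base + BUF_BYTES - next_addr;
    if (bytes_left < 4) return 0;          /* next access needs a remap */
    /* last full 4-byte access that fits, plus the first one */
    return (int)((bytes_left - 4) / stride_bytes + 1);
}
```

So a unit-stride float stream gets 1024 iterations per 4 KB transfer, matching the per-buffer (rather than per-access) control code the slide describes.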
Hybrid code transformation
Organize the LS into two storages: static buffers for predictable accesses and a software cache for unpredictable accesses.

#pragma omp for schedule(static) reduction(+:s)
for (i = 0; i < n; i++) {
  a[i] = c[b[i]] + d[i];
  s = s + a[i];
}

becomes:

tmp_s = 0.0;
i = start;
while (i < end) {
  n = end;
  if (!AVAIL(h1, &d[i])) MMAP(h1, &d[i]);
  n = min(n, i + AVAIL(h1, &d[i]));
  if (!AVAIL(h2, &b[i])) MMAP(h2, &b[i]);
  n = min(n, i + AVAIL(h2, &b[i]));
  if (!AVAIL(h3, &a[i])) MMAP(h3, &a[i]);
  n = min(n, i + AVAIL(h3, &a[i]));
  HCONSISTENCY(n, h3);
  HSYNC(h1, h2, h3);
  start = i;
  for (i = start; i < n; i++) {
    tmp01 = REF(h1, &d[i]);
    tmp02 = REF(h2, &b[i]);
    if (!HIT(h4, &c[tmp02])) MAP(h4, &c[tmp02]);
    tmp03 = REF(h4, &c[tmp02]);
    REF(h3, &a[i]) = tmp03 + tmp01;
    tmp_s = tmp_s + REF(h3, &a[i]);
  }
}
atomic_add(s, tmp_s, ...);
omp_barrier();
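The burst-length computation in the while loop (remap exhausted streams with MMAP, then take the running minimum of i + AVAIL over all streams) can be sketched in plain C. The types and names (stream_t, s_mmap, s_avail, burst_end) are illustrative, and buffers are modeled simply as a base iteration plus a fixed capacity:

```c
#include <assert.h>

#define BUF_ELEMS 1024           /* 4 KB buffer of 4-byte elements */

typedef struct { int base; } stream_t;   /* first iteration mapped */

/* MMAP stand-in: (re)map the stream's buffer to start at iteration i. */
static void s_mmap(stream_t *s, int i) { s->base = i; }

/* AVAIL stand-in: iterations this buffer can still serve from i on. */
static int s_avail(const stream_t *s, int i) {
    int left = s->base + BUF_ELEMS - i;
    return left > 0 ? left : 0;
}

/* One scheduling step: remap exhausted streams, return the burst end
 * n = min over all streams of (i + AVAIL), capped at the loop end. */
static int burst_end(stream_t *streams, int nstreams, int i, int end) {
    int n = end;
    for (int k = 0; k < nstreams; k++) {
        if (!s_avail(&streams[k], i)) s_mmap(&streams[k], i);
        int lim = i + s_avail(&streams[k], i);
        if (lim < n) n = lim;
    }
    return n;
}
```

The inner for loop then runs from i to n with no control code on the strided references; only the irregular c[b[i]] access keeps its per-access HIT/MAP test.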
Execution model
Loops execute in three different phases:
1. Control code: allocate buffers, program DMA transfers, maintain consistency.
2. Synchronize with the DMA transfers.
3. Execute a burst of computation, which might itself include some control code, DMA programming and synchronization (for the unpredictable references).
[Code as on the previous slide, over the iteration space (0, upper_bound), with the phases marked Control Code / Synch. / Comput.]
Compiler limitations: memory aliasing
What if a, b, c or d are memory aliases? How can buffers then be allocated consistently? What if some element in a buffer is also referenced through the software cache?
To limit memory aliasing: avoid pointer usage and avoid function calls (use inline annotations).

#pragma omp parallel private(c,i) shared(a, b, d)
{
  for (i = 0; i < n; i++) c[i] = ...;
  #pragma omp for schedule(static) reduction(+:s)
  for (i = 0; i < n; i++) {
    a[i] = c[b[i]] + d[i] + ...;
    s = s + a[i];
  }
  #pragma omp barrier
  #pragma omp critical
  { s = s + c[0]; }
}
Memory Consistency
Maintain a relaxed consistency model according to the OpenMP memory model, based on atomicity and dirty bits. When data in a buffer has to be evicted, the write-back process is composed of three steps:
1. Atomic read
2. Merge
3. Atomic write
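The three steps can be sketched in plain C with per-byte dirty bits: the merge overwrites only the bytes this SPE modified, so concurrent writers to other bytes of the same line are preserved. The atomic DMA primitives are elided here (plain loads and stores stand in for them), and all names (cacheline_t, cl_store, cl_writeback) are illustrative:

```c
#include <assert.h>

#define LINE 16                  /* shortened line for the sketch */

typedef struct {
    unsigned char data[LINE];    /* local-store copy of the line */
    unsigned char dirty[LINE];   /* 1 = byte modified by this SPE */
} cacheline_t;

/* Store through the cache: write the byte and mark it dirty. */
static void cl_store(cacheline_t *cl, int off, unsigned char v) {
    cl->data[off] = v;
    cl->dirty[off] = 1;
}

/* Write-back: 1) atomic read of the current memory line, 2) merge
 * only the dirty bytes over it, 3) atomic write of the merged line. */
static void cl_writeback(cacheline_t *cl, unsigned char *mem) {
    unsigned char merged[LINE];
    for (int i = 0; i < LINE; i++)        /* 1) read */
        merged[i] = mem[i];
    for (int i = 0; i < LINE; i++)        /* 2) merge */
        if (cl->dirty[i]) merged[i] = cl->data[i];
    for (int i = 0; i < LINE; i++) {      /* 3) write, clear dirty bits */
        mem[i] = merged[i];
        cl->dirty[i] = 0;
    }
}
```

This is also where the overhead discussed later comes from: every store updates a dirty bit, and the read-merge-write sequence must be atomic with respect to other SPEs evicting the same line.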
Evaluation
Comparison to a traditional software cache: 4-way set associative, 128-byte cache line, 64 KB of capacity, write-back implemented through dirty bits and atomic (synchronous) data transfers.

Cache overhead comparison, execution time (sec):
Application | HYBRID | HYBRID synch | TRADITIONAL
IS          |  9.33  |  25.93       |  47.41
CG          | 12.29  |  62.13       | 103.44
FT          | 10.90  |  21.63       |  78.61
MG          |  3.68  |   3.76       |  13.11
Evaluation: comparing performance with POWER5
POWER5-based blade with two processors running at 1.5 GHz; 16 GB of memory (8 GB per processor); each processor has 2 cores with 2 SMT threads each and a shared 1.8 MB L2.

Execution time (sec):
Application / Loop | POWER5 | Cell BE HYBRID | Cell BE TRADITIONAL
IS        |  8.25 |  9.33 |  47.41
CG        | 10.76 | 12.29 | 103.44
FT        |  5.61 | 10.90 |  78.61
MG        |  3.12 |  3.68 |  13.11
IS loop 1 |  8.00 |  6.65 |
IS loop 2 |  0.25 |  2.68 |
FT loop 1 |  1.52 |  1.76 |
FT loop 2 |  1.17 |  3.79 |
FT loop 3 |  1.14 |  2.27 |
FT loop 4 |  1.19 |  2.23 |
FT loop 5 |  0.59 |  0.81 |
MG loop 1 |  0.22 |  0.22 |
MG loop 2 |  0.03 |  0.06 |
MG loop 3 |  0.67 |  0.81 |
MG loop 4 |  0.37 |  0.35 |
MG loop 5 |  1.55 |  1.69 |
MG loop 6 |  0.21 |  0.49 |
MG loop 7 |  0.07 |  0.07 |
Evaluation: scalability, Cell BE versus POWER5
Execution time (sec):

Scalability on Cell BE:
      | 1 SPE | 2 SPEs | 4 SPEs | 8 SPEs
MG-A  | 23.99 | 12.28  |  6.42  |  3.50
FT-A  | 72.48 | 37.88  | 20.46  | 10.96
CG-B  | 73.74 | 37.75  | 20.17  | 12.25
IS-B  | 45.59 | 24.21  | 14.11  | 10.24

Scalability on POWER5 (number of threads):
      |   1   |   2    |   4
MG-A  |  6.86 |  3.79  |  3.12
FT-A  | 11.64 |  6.94  |  5.61
CG-B  | 24.86 | 13.20  | 10.76
IS-B  | 10.25 |  9.83  |  8.25
Runtime activity
Number of iterations per runtime intervention, buffer size 4 KB. For each kernel and SPE count: total iterations / number of runtime interventions (cnt) / number of 4 KB buffer transfers, followed by (transfers per intervention, iterations per intervention).

MG:
kernel 1: 2 SPEs: 213310272 / 2788302 / 17025381 (6.11, 76.50); 4 SPEs: 106655136 / 1393961 / 8511458 (6.11, 76.51); 8 SPEs: 53327648 / 696943 / 4255404 (6.11, 76.52)
kernel 2: 2 SPEs: 95494660 / 1401200 / 8842580 (6.31, 68.15); 4 SPEs: 47747270 / 700650 / 4421740 (6.31, 68.15); 8 SPEs: 23873710 / 350385 / 2211260 (6.31, 68.14)
kernel 3: 2 SPEs: 33554432 / 196096 / 196096 (1.00, 171.11); 4 SPEs: 16777216 / 98048 / 98048 (1.00, 171.11); 8 SPEs: 8388608 / 49024 / 49024 (1.00, 171.11)
kernel 4: 2 SPEs: 786412 / 8098 / 32392 (4.00, 97.11); 4 SPEs: 393216 / 4032 / 16128 (4.00, 97.52); 8 SPEs: 196648 / 2026 / 8104 (4.00, 97.06)
kernel 5: 2 SPEs: 795076 / 7741 / 30964 (4.00, 102.71); 4 SPEs: 401860 / 3886 / 15544 (4.00, 103.41); 8 SPEs: 205232 / 2005 / 8020 (4.00, 102.36)

CG:
kernel 1: 2 SPEs: 225000 / 444 / 2664 (6.00, 506.76); 4 SPEs: 112500 / 222 / 1332 (6.00, 506.76); 8 SPEs: 56244 / 114 / 684 (6.00, 493.37)
kernel 2: 2 SPEs: 5624700 / 11100 / 11100 (1.00, 506.73); 4 SPEs: 2812200 / 5550 / 5550 (1.00, 506.70); 8 SPEs: 1406100 / 2850 / 2850 (1.00, 493.37)
kernel 3: 2 SPEs: 5624700 / 11100 / 22200 (2.00, 506.73); 4 SPEs: 2812200 / 5550 / 11100 (2.00, 506.70); 8 SPEs: 1406100 / 2850 / 5700 (2.00, 493.37)
kernel 4: 2 SPEs: 224988 / 444 / 888 (2.00, 506.73); 4 SPEs: 112488 / 222 / 444 (2.00, 506.70); 8 SPEs: 56244 / 114 / 228 (2.00, 493.37)
kernel 5: 2 SPEs: 224988 / 444 / 444 (1.00, 506.73); 4 SPEs: 112488 / 222 / 222 (1.00, 506.70); 8 SPEs: 56244 / 114 / 114 (1.00, 493.37)
kernel 6: 2 SPEs: 5624700 / 11100 / 44400 (4.00, 506.73); 4 SPEs: 2812200 / 5550 / 22200 (4.00, 506.70); 8 SPEs: 1406100 / 2850 / 11400 (4.00, 493.37)
kernel 7: 2 SPEs: 187490 / 370 / 740 (2.00, 506.73); 4 SPEs: 93740 / 185 / 370 (2.00, 506.70); 8 SPEs: 46870 / 95 / 190 (2.00, 493.37)
kernel 8: 2 SPEs: 187490 / 370 / 740 (2.00, 506.73); 4 SPEs: 93740 / 185 / 370 (2.00, 506.70); 8 SPEs: 46870 / 95 / 190 (2.00, 493.37)

FT:
kernel 1: 2 SPEs: 134217728 / 524288 / 4194304 (8.00, 256.00); 4 SPEs: 67108864 / 262144 / 2097152 (8.00, 256.00); 8 SPEs: 33554432 / 131072 / 1048576 (8.00, 256.00)
kernel 2: 2 SPEs: 134217728 / 524288 / 4194304 (8.00, 256.00); 4 SPEs: 67108864 / 262144 / 2097152 (8.00, 256.00); 8 SPEs: 33554432 / 131072 / 1048576 (8.00, 256.00)
kernel 3: 2 SPEs: 117440512 / 458752 / 3670016 (8.00, 256.00); 4 SPEs: 58720256 / 229376 / 1835008 (8.00, 256.00); 8 SPEs: 29360128 / 114688 / 917504 (8.00, 256.00)

IS:
kernel 1: 2 SPEs: 11534336 / 11264 / 11264 (1.00, 1024.00); 4 SPEs: 5767168 / 5632 / 5632 (1.00, 1024.00); 8 SPEs: 2883584 / 2816 / 2816 (1.00, 1024.00)
kernel 2: 2 SPEs: 23068672 / 22528 / 22528 (1.00, 1024.00); 4 SPEs: 23068672 / 22528 / 22528 (1.00, 1024.00); 8 SPEs: 23068672 / 22528 / 22528 (1.00, 1024.00)
Evaluation: overhead distribution
MG A, loop 5, cache overhead distribution (%): WORK 51.53; UPDATE D-B 19.05; WRITE-BACK 9.22; MMAP 7.29; BARRIER 4.11; DMA-REG 3.81; DEC 2.19.
WORK: time spent in actual computation. WRITE-BACK: time spent in the write-back process. UPDATE D-B: time spent updating the dirty-bits information. DMA-IREG: time spent synchronizing with the DMA data transfers in the TC. DMA-REG: time spent synchronizing with the DMA data transfers in the HLC. DEC: time spent in the pinning mechanism for cache lines. TRANSAC: time spent executing control code of the TC. BARRIER: time spent in the barrier synchronization at the end of the parallel computation. MMAP: time spent executing lookup, placement/replacement actions and DMA programming.
Evaluation: overhead distribution
IS B, loop 1, cache overhead distribution (%): WORK 43.54; TRANSAC 32.95; UPDATE D-B 12.75; DMA-IREG 5.38; WRITE-BACK 2.20; BARRIER 1.18; MMAP 0.74; DMA-REG 0.39; DEC 0.19. (Categories as defined above.)
Memory Consistency
Maintain a relaxed consistency model, following the OpenMP memory model.
Important sources of overhead: dirty bits (every store operation is monitored) and atomicity in the write-back process.
Optimizations to soften the impact of this overhead build on several observations about scientific parallel codes: most cache lines are modified by a single execution flow; buffers are usually modified in their entirety, so no atomicity is required at write-back; and aliasing between data in a buffer and data in the software cache rarely occurs.
Evaluation: memory consistency
[Charts: reduction of execution time (%), per loop, for MG class A, IS class B, CG class B and FT class A, comparing the CLR, HL, MR and PERFECT configurations.]
CLR: data eviction based on 128-byte hardware cache line reservation. HL: data eviction is done at buffer level; no alias between data in a buffer and data in the software cache. MR: data eviction is done at buffer level; no alias between data in a buffer and data in the software cache, and a single writer. PERFECT: data eviction is freely executed, without atomicity or dirty bits.
Double-buffering techniques
Double buffering does not come for free: it implies executing more control code, requires adapting the computational bursts to data transfer times, and depends on the available bandwidth, which itself depends on the number of executing threads.
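The basic ping-pong scheme can be sketched in plain C: while buffer cur is being computed on, the transfer of the next tile into buffer 1-cur is already in flight. Here the DMA engine is simulated with a plain memcpy (so the transfer completes instantly); on the real SPE these would be mfc_get-style operations with tag-based completion waits, and all names below are illustrative:

```c
#include <assert.h>
#include <string.h>

#define TILE 256

static float src[4 * TILE];              /* simulated main memory */
static float ls[2][TILE];                /* two local-store buffers */

/* Stand-in for starting (and here, completing) a DMA transfer. */
static void dma_get(float *dst, const float *s) {
    memcpy(dst, s, TILE * sizeof(float));
}

static float sum_all(int ntiles) {
    int cur = 0;
    float total = 0.0f;
    dma_get(ls[cur], &src[0]);                     /* prime first tile */
    for (int t = 0; t < ntiles; t++) {
        if (t + 1 < ntiles)                        /* prefetch next tile */
            dma_get(ls[1 - cur], &src[(t + 1) * TILE]);
        for (int i = 0; i < TILE; i++)             /* computational burst */
            total += ls[cur][i];
        cur = 1 - cur;                             /* swap buffers */
    }
    return total;
}
```

The extra control code is visible even in this sketch (the prefetch branch and the buffer swap per tile), which is part of why the technique's payoff depends on how long each burst runs relative to the transfer.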
Evaluation: pre-fetch of data
Modulo-scheduled loops; pre-fetching only for regular memory references.
[Chart: per-loop speedups for CG loops 1-14, IS loops 1-4, FT loops 1-5 and the STREAM Copy/Scale/Add/Triad kernels, mostly in the 0.95-1.45 range.]
Overall speedups: CG 1.082, IS 1.203, FT 0.996. [Chart bars: execution times with and without pre-fetching, 11.76 / 12.72 sec for CG, 6.21 / 7.47 sec for IS, 10.03 / 12.07 sec shown for FT.]
Combining OpenMP with SIMD execution
The actual effect is limited by the execution model: SIMD only affects the computational bursts, and is very dependent on runtime parameters (number of threads, number of iterations per runtime intervention).
[Chart: SIMD speedup (up to about 3.5x) per loop for CG (L-0, L-3, L-4, L-7, L-8, L-11, L-12, L-13), IS (L-0, L-1), FT (L-0 to L-3) and MG (L-1 to L-5), with 1, 2, 4 and 8 SPEs.]
Combining OpenMP with SIMD execution
[Chart: SIMD speedup (up to about 1.6x) for the STREAM kernels (copy, scale, add, triad) and for FT, MG and CG, with 1, 2, 4 and 8 SPEs.]
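Since SIMD only touches the computational burst, SIMDization amounts to blocking the burst loop into vector-width operations. A portable stand-in sketch, with a 4-float struct playing the role of one 128-bit SPU register and v4_add mirroring what an spu_add intrinsic would do (names here are illustrative, not SPU intrinsics):

```c
#include <assert.h>

typedef struct { float v[4]; } vec4;     /* one "128-bit register" */

static vec4 v4_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* SIMDized burst: a[i] += d[i] over n elements, n a multiple of 4.
 * Alignment of a and d to the vector width is assumed, which is the
 * constraint the conclusions slide refers to. */
static void burst_add(float *a, const float *d, int n) {
    for (int i = 0; i < n; i += 4) {
        vec4 va, vd;
        for (int k = 0; k < 4; k++) { va.v[k] = a[i+k]; vd.v[k] = d[i+k]; }
        vec4 vr = v4_add(va, vd);
        for (int k = 0; k < 4; k++) a[i+k] = vr.v[k];
    }
}
```

Only this inner loop speeds up; the control code, DMA synchronization and barriers around it do not, which is why the measured benefit shrinks as the number of SPEs grows and the bursts get shorter.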
Conclusions
OpenMP transformations: remember the three execution phases. The transformations are heavily conditioned by memory aliasing, so try to avoid pointers and introduce inline annotations. We can reach performance similar to what we would obtain from a cache-based multi-core.
Double-buffering effectiveness: depends on the number of threads, the access patterns and the bandwidth; speedups range between 10% and 20%.
SIMD effectiveness: only affects the computational phase; limited by alignment constraints.
Questions