
1 ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC'S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS
Juan M. Cebrián, Dept. of Computer and Information Science, juanmc@idi.ntnu.no
Juan M. Cebrian, PP4EE, Oct. 3, 2013 (38 slides)

2-3 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

4-6 WHY ENERGY EFFICIENT? GREEN COMPUTING
Good for the economy (though maybe not for everyone)
Good for the environment
However, the real world is slightly different. REALITY:
Companies want to make money
People don't buy because of energy efficiency
Energy savings -> new markets or features
Design given a power budget (e.g. 5W for smartphones)

7-11 WHY ENERGY EFFICIENT? IN HPC
Not so different. TOP 500: a race to see which country has the largest... computer.
THE GOAL: Exascale under a reasonable power budget (Horizon 2020)
WAIT... THAT MEANS:
33 PF at 17 MW -> 515 pJ/Op
1 EF at 20 MW -> 20 pJ/Op
That is roughly a 30x improvement in all system components

12-15 WHAT'S THE PLAN? HOW COMPUTER ARCHITECTS WORK
THE SCIENTIFIC METHOD: find/analyze a problem, propose a solution, validate the solution.
VALIDATION relies on benchmarking: a set of relevant applications.
BENCHMARK SUITES
SPEC-CPU
SPLASH-2 (1995, cited by 2912)
PARSEC (2008, cited by 1039)
Rodinia (2009, cited by 304)
RELEVANT APPLICATIONS (DWARFS)
Dense Linear Algebra, Sparse Linear Algebra
Spectral Methods, N-Body Methods
Structured Grids, Unstructured Grids
MapReduce, Combinational Logic
Graph Traversal, Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models, Finite State Machines
However... people's needs change, and hardware evolves! And what about the implementation? We simply assume it is correct/optimal.

16 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

17-18 IMPROVING ENERGY EFFICIENCY
Examples: Intel Haswell, NVIDIA Tegra 4, AMD Kabini
Techniques: Power Saving Mechanisms, Parallelization, Specialization, Vectorization, Heterogeneity

19-20 WHAT IS VECTORIZATION / SIMD? (Source: Intel)
SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)
Nothing new (vector supercomputers of the early 1970s)
First widely-deployed SIMD: Intel's MMX extensions (64-bit, 1996)
Streaming SIMD Extensions (SSE, 128-bit, 1999)
Advanced Vector Extensions (AVX, 256-bit, 2011)
ARM NEON (2009): Cortex-A8/A9 "fake" 128-bit, Cortex-A15 real 128-bit

21 SIMD ADVANTAGES AND DISADVANTAGES
ADVANTAGES
Potential speedup proportional to register size
Reduced cache and pipeline pressure
Increased energy efficiency!
DISADVANTAGES
Increased bandwidth requirements
Usually requires low-level programming (intrinsics)
Large register files increase energy consumption and chip area
Not all applications can be vectorized without major code changes

22-23 WHY BOTHER WITH SIMD?
SIMD WIDTH
Xeon E5-2692, Opteron 6274 and BlueGene/Q: 256 bits (8 floats)
Xeon Phi 31S1P: 512 bits (16 floats)
Tesla K20X: 2048 bits (64 floats)
ACADEMIA TO THE RESCUE
SPEC-CPU
SPLASH-2 (1995, cited by 2912)
PARSEC (2008, cited by 1039)
Rodinia (2009, cited by 304)

24 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

25-29 METHODOLOGY
Intel Core i7-3770K (Ivy Bridge) memory hierarchy (Source: Intel):

Cache                 Size         Sharing   Ways of associativity   Line size   Latency (cycles)
Level 1 Instruction   32KB         Private   8                       64B         4
Level 1 Data          32KB         Private   8                       64B         4
Level 2               256KB        Private   8                       64B         12
Level 3               8MB (20MB)   Shared    16                      64B         -

ENVIRONMENT
PARSEC 3.0b
GCC 4.7 with the -O2 flag
Ubuntu, Linux kernel
Discard cold runs
System running in console mode (runlevel 3)

AT-THE-WALL (SYSTEM) POWER
Yokogawa WT210 power meter

ISOLATED CPU ENERGY
Core i5 and i7 include energy MSRs (PP0, PP1, PKG)

30 NEW PLATFORMS: ODROID-XU-E (Source: Hardkernel)

31 WRAPPER/MATH LIBRARY -> PROFILE -> VECTORIZE -> TEST
Vectorized PARSEC programs:

Program        Domain        SSE   AVX   NEON   Hotspots   Changes   Group
blackscholes   Financial     x     x     x      yes        DT        S
canneal        Engineering   x     x     x      yes        MCC       RL
fluidanimate   Animation     x     x     x      yes        MCC       CI
raytrace       Rendering     x     -     -      yes        -         -
streamcluster  Data Mining   x     x     x      yes        DT        RL
swaptions      Financial     x     x     x      no         MCC       CI
vips           Media Proc.   x     x     x      no         MCC       RL/CI
x264           Media Proc.   x     x     x      no         -         RL/CI

DT: Direct Translation, MCC: Major Code Changes
S: Scalable, RL: Resource Limited, CI: Code/Input Limited

32 WRAPPER/MATH LIBRARY -> PROFILE -> VECTORIZE -> TEST
WRAPPER LIBRARY
Share code between implementations
Keep control of the vectorization process

#define _MM_ALIGNMENT 16
#define SIMD_WIDTH    4
#define _MM_ABS       _mm_abs_ps
#define _MM_CMPLT     _mm_cmplt_ps
#define _MM_TYPE      __m128

__attribute__((aligned(16))) static const int absmask[] =
    {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff};
#define _mm_abs_ps(x) _mm_and_ps((x), *(const __m128*)absmask)

33 EXAMPLE: STREAMCLUSTER KERNEL — PROFILING (hotspot profiling chart)

34-35 EXAMPLE: STREAMCLUSTER KERNEL — WRAPPER LIBRARY
Original scalar kernel:

/* compute Euclidean distance squared between two points */
float dist(Point p1, Point p2, int dim) {
    int i;
    float result = 0.0;
    for (i = 0; i < dim; i++)
        result += (p1.coord[i] - p2.coord[i]) * (p1.coord[i] - p2.coord[i]);
    return result;
}

Vectorized with the wrapper library:

float dist(Point p1, Point p2, int dim) {
    int i;
    _MM_TYPE result, _aux, _diff, _coord1, _coord2;
    result = _MM_SETZERO();

    for (i = 0; i < dim; i = i + SIMD_WIDTH) {
        _coord1 = _MM_LOADU(&(p1.coord[i]));
        _coord2 = _MM_LOADU(&(p2.coord[i]));

        _diff  = _MM_SUB(_coord1, _coord2);
        _aux   = _MM_MUL(_diff, _diff);
        result = _MM_ADD(result, _aux);
    }
    // Add all items of the vector
    return _MM_CVT_F(_MM_FULL_HADD(result, result));
}

How the wrappers expand per ISA:
_MM_TYPE  -> SSE: __m128                           NEON: float32x4_t
_MM_LOADU -> SSE: _mm_loadu_ps(&(p1.coord[i]))     NEON: vld1q_f32(&(p1.coord[i]))
_MM_SUB   -> AVX: _mm256_sub_ps(_coord1, _coord2)  NEON: vsubq_f32(_coord1, _coord2)

36 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

37 RESULTS — RUNTIME, IVY BRIDGE
(Charts: normalized time and total speedup for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads; log-scale speedup axis)
Scalable benchmarks: up to 50x speedup on 8 threads (Hyper-Threading)
Resource limited: around 2x per thread
Code/Input limited: around 10-15% speedup per thread

38 RESULTS — ENERGY, IVY BRIDGE
(Charts: normalized PKG/PP1/PP0 energy and average PP0 power for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
Average PP0 power barely changes when using SSE/AVX
The threading equivalent increases power by approx. 5-8 W per thread
This "performance for free" translates into huge energy savings

39-40 RESULTS — PERFORMANCE COUNTERS
EXECUTION CYCLE BREAKDOWN
We require instructions and data. Critical points: instruction dispatch and the L1 data cache.
INSTRUCTION STALLS
- Reorder Buffer (ROB)
- Renaming Logic (RS)

41 RESULTS — PERFORMANCE COUNTERS: CACHE MISS RATE
(Charts: L1D and LLC miss rates for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
L1D AND LLC
The total number of accesses is reduced at all levels, but the miss rate changes
The L1D miss rate increases linearly with SIMD register width under heavy usage
Fluidanimate barely generates AVX instructions due to its input

42 STALL CYCLE BREAKDOWN — IVY BRIDGE, SCALABLE (BLACKSCHOLES)
(Chart: execution cycle breakdown into ROB, RS, L1D, dispatch and other stalls; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
Renaming logic pressure increases
L1D stalls increase slightly
Behavior is consistent across thread counts

43 STALL CYCLE BREAKDOWN — IVY BRIDGE, RESOURCE LIMITED (CANNEAL)
(Chart: execution cycle breakdown; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
Renaming logic and ROB pressure increases
L1D stalls force dispatch stalls
Behavior is consistent across thread counts

44 STALL CYCLE BREAKDOWN — IVY BRIDGE, CODE/INPUT LIMITED (FLUIDANIMATE)
(Chart: execution cycle breakdown; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
ROB pressure increases, RS barely changes (not many SIMD instructions)
L1D stalls force dispatch stalls
Behavior is consistent across thread counts

45 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

46 CONCLUSIONS
Great energy savings from vectorization (prioritize it over parallelization)
SIMD implementations change the architectural trade-offs of the processor
SIMD is widely available across market segments and can no longer be ignored
We aim to distribute our code to reinforce the validation process of new proposals
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? YES.
Benchmarks should cover the most common architectural features, or architects may end up under/overestimating the impact of their contributions.

47 THANK YOU

48 BACKUP: STALL CYCLE BREAKDOWN — IVY BRIDGE, STREAMCLUSTER
(Chart: normalized cycle count broken into ROB, RS, L1D, dispatch and other stalls; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)

49 BACKUP: SIMD LIMITING FACTORS — DATA STRUCTURES
OO programming encourages Arrays of Structures (AoS) over Structures of Arrays (SoA) (Source: spuify.co.uk)

50 BACKUP: SIMD LIMITING FACTORS — SOLUTIONS
Software: hide the SoA internal representation from the user (e.g., Intel Array Building Blocks, or Apple's EVE)
Hardware: NEON strided loads (Source: ARM)

51 BACKUP: SIMD LIMITING FACTORS — DIVERGENT BRANCHES
Conditional branches pose a threat to SIMD performance: a scalar if (input < 1) becomes a per-lane condition in SIMD code.

52 BACKUP: SIMD LIMITING FACTORS — HORIZONTAL OPERATIONS AND ROUNDING
Horizontal operations are usually slower and may cause rounding differences.
Vertical:   [a3 a2 a1 a0] + [b3 b2 b1 b0] = [a3+b3, a2+b2, a1+b1, a0+b0]
Horizontal: [a3 a2 a1 a0] -> [a3+a2, a1+a0]
In floating point, a0 + a1 + a2 + a3 != (a0 + a1) + (a2 + a3) in general.

53 BACKUP: SIMD LIMITING FACTORS — INPUT SIZE
The input size may not be divisible by the SIMD width: e.g. six elements [a5 a4 a3 a2 a1 a0] with 4-wide SIMD leave a two-element tail. The loop then runs partially SIMD and partially scalar.


More information

Spectre and Meltdown. Clifford Wolf q/talk

Spectre and Meltdown. Clifford Wolf q/talk Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Memory access patterns. 5KK73 Cedric Nugteren

Memory access patterns. 5KK73 Cedric Nugteren Memory access patterns 5KK73 Cedric Nugteren Today s topics 1. The importance of memory access patterns a. Vectorisation and access patterns b. Strided accesses on GPUs c. Data re-use on GPUs and FPGA

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Detecting Memory-Boundedness with Hardware Performance Counters

Detecting Memory-Boundedness with Hardware Performance Counters Center for Information Services and High Performance Computing (ZIH) Detecting ory-boundedness with Hardware Performance Counters ICPE, Apr 24th 2017 (daniel.molka@tu-dresden.de) Robert Schöne (robert.schoene@tu-dresden.de)

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 2

ECE 571 Advanced Microprocessor-Based Design Lecture 2 ECE 571 Advanced Microprocessor-Based Design Lecture 2 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 January 2016 Announcements HW#1 will be posted tomorrow I am handing out

More information

Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Anastasia Ailamaki. Performance and energy analysis using transactional workloads Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun Online Transaction Processing $2B+ industry Characteristics:

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Chapter 1: Fundamentals of Quantitative Design and Analysis

Chapter 1: Fundamentals of Quantitative Design and Analysis 1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Hybrid Architectures Why Should I Bother?

Hybrid Architectures Why Should I Bother? Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Performance and Energy Usage of Workloads on KNL and Haswell Architectures Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research

More information

Progress Report on QDP-JIT

Progress Report on QDP-JIT Progress Report on QDP-JIT F. T. Winter Thomas Jefferson National Accelerator Facility USQCD Software Meeting 14 April 16-17, 14 at Jefferson Lab F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 /

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Multicore Scaling: The ECM Model

Multicore Scaling: The ECM Model Multicore Scaling: The ECM Model Single-core performance prediction The saturation point Stencil code examples: 2D Jacobi in L1 and L2 cache 3D Jacobi in memory 3D long-range stencil G. Hager, J. Treibig,

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Growth in Cores - A well rehearsed story

Growth in Cores - A well rehearsed story Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Low-power Architecture. By: Jonathan Herbst Scott Duntley Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Native Offload of Haskell Repa Programs to Integrated GPUs

Native Offload of Haskell Repa Programs to Integrated GPUs Native Offload of Haskell Repa Programs to Integrated GPUs Hai (Paul) Liu with Laurence Day, Neal Glew, Todd Anderson, Rajkishore Barik Intel Labs. September 28, 2016 General purpose computing on integrated

More information

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA 1. INTRODUCTION HiPERiSM Consulting, LLC, has a mission to develop (or enhance) software and

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Getting Started with

Getting Started with /************************************************************************* * LaCASA Laboratory * * Authors: Aleksandar Milenkovic with help of Mounika Ponugoti * * Email: milenkovic@computer.org * * Date:

More information