
1 ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC'S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS
Juan M. Cebrián, Dept. of Computer and Information Science, juanmc@idi.ntnu.no
Juan M. Cebrian, PP4EE, Oct. 3, 2013 (38 slides)

2-3 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

4-6 WHY ENERGY EFFICIENT? GREEN COMPUTING
Good for the economy (though maybe not for everyone)
Good for the environment
However, the real world is slightly different. REALITY:
Companies want to make money
People don't buy because of energy efficiency
Energy savings -> new markets or features
Design given a power budget (e.g. 5W for smartphones)

7-11 WHY ENERGY EFFICIENT? IN HPC
Not so different. TOP 500: a race to see which country has the largest... computer.
THE GOAL: Exascale under a reasonable power budget (Horizon 2020)
WAIT... THAT MEANS:
33 PF at 17 MW -> 515 pJ/Op
1 EF at 20 MW -> 20 pJ/Op
That is roughly a 30x improvement in all system components

12-15 WHAT'S THE PLAN? HOW COMPUTER ARCHITECTS WORK
THE SCIENTIFIC METHOD: find/analyze a problem, propose a solution, validate the solution.
VALIDATION relies on benchmarking: a set of relevant applications.
BENCHMARK SUITES
SPEC-CPU
SPLASH-2 (1995, cited by 2912)
PARSEC (2008, cited by 1039)
Rodinia (2009, cited by 304)
RELEVANT APPLICATIONS (DWARFS)
Dense Linear Algebra, Sparse Linear Algebra
Spectral Methods, N-Body Methods
Structured Grids, Unstructured Grids
MapReduce, Combinational Logic
Graph Traversal, Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models, Finite State Machines
However... people's needs change, and hardware evolves! And what about the implementation? We simply assume it is correct/optimal.

16 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

17-18 IMPROVING ENERGY EFFICIENCY
Examples: Intel Haswell, NVIDIA Tegra 4, AMD Kabini
Techniques: Power Saving Mechanisms, Parallelization, Specialization, Vectorization, Heterogeneity

19-20 WHAT IS VECTORIZATION / SIMD? (Source: Intel)
SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)
Nothing new (vector supercomputers of the early 1970s)
First widely-deployed SIMD: Intel's MMX extensions (64-bit, 1996)
Streaming SIMD Extensions (SSE, 128-bit, 1999)
Advanced Vector Extensions (AVX, 256-bit, 2011)
ARM NEON (2009): Cortex-A8/A9 "fake" 128-bit, Cortex-A15 real 128-bit

21 SIMD ADVANTAGES AND DISADVANTAGES
ADVANTAGES
Potential speedup proportional to register size
Reduced cache and pipeline pressure
Increased energy efficiency!
DISADVANTAGES
Increased bandwidth requirements
Usually requires low-level programming (intrinsics)
Large register files increase energy consumption and chip area
Not all applications can be vectorized without major code changes

22-23 WHY BOTHER WITH SIMD?
SIMD WIDTH
Xeon E5-2692, Opteron 6274 and BlueGene/Q: 256 bits (8 floats)
Xeon Phi 31S1P: 512 bits (16 floats)
Tesla K20X: 2048 bits (64 floats)
ACADEMIA TO THE RESCUE
SPEC-CPU
SPLASH-2 (1995, cited by 2912)
PARSEC (2008, cited by 1039)
Rodinia (2009, cited by 304)

24 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

25-29 METHODOLOGY
Intel Core i7-3770K (Ivy Bridge) memory hierarchy (Source: Intel):

Cache                 Size         Sharing   Ways of associativity   Line size   Latency (cycles)
Level 1 Instruction   32KB         Private   8                       64B         4
Level 1 Data          32KB         Private   8                       64B         4
Level 2               256KB        Private   8                       64B         12
Level 3               8MB (20MB)   Shared    16                      64B         -

ENVIRONMENT
PARSEC 3.0b
GCC 4.7 with the -O2 flag
Ubuntu, Linux kernel
Discard cold runs
System running in console mode (runlevel 3)

AT-THE-WALL (SYSTEM) POWER
Yokogawa WT210 power meter

ISOLATED CPU ENERGY
Core i5 and i7 include energy MSRs (PP0, PP1, PKG)

30 NEW PLATFORMS: ODROID-XU-E (Source: Hardkernel)

31 WRAPPER/MATH LIBRARY -> PROFILE -> VECTORIZE -> TEST
Vectorized PARSEC programs:

Program        Domain        SSE   AVX   NEON   Hotspots   Changes   Group
blackscholes   Financial     x     x     x      yes        DT        S
canneal        Engineering   x     x     x      yes        MCC       RL
fluidanimate   Animation     x     x     x      yes        MCC       CI
raytrace       Rendering     x     -     -      yes        -         -
streamcluster  Data Mining   x     x     x      yes        DT        RL
swaptions      Financial     x     x     x      no         MCC       CI
vips           Media Proc.   x     x     x      no         MCC       RL/CI
x264           Media Proc.   x     x     x      no         -         RL/CI

DT: Direct Translation, MCC: Major Code Changes
S: Scalable, RL: Resource Limited, CI: Code/Input Limited

32 WRAPPER/MATH LIBRARY -> PROFILE -> VECTORIZE -> TEST
WRAPPER LIBRARY
Share code between implementations
Keep control of the vectorization process

#define _MM_ALIGNMENT 16
#define SIMD_WIDTH    4
#define _MM_ABS       _mm_abs_ps
#define _MM_CMPLT     _mm_cmplt_ps
#define _MM_TYPE      __m128

__attribute__((aligned(16))) static const int absmask[] =
    {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff};
#define _mm_abs_ps(x) _mm_and_ps((x), *(const __m128*)absmask)

33 EXAMPLE: STREAMCLUSTER KERNEL — PROFILING (hotspot profiling chart)

34-35 EXAMPLE: STREAMCLUSTER KERNEL — WRAPPER LIBRARY
Original scalar kernel:

/* compute Euclidean distance squared between two points */
float dist(Point p1, Point p2, int dim) {
    int i;
    float result = 0.0;
    for (i = 0; i < dim; i++)
        result += (p1.coord[i] - p2.coord[i]) * (p1.coord[i] - p2.coord[i]);
    return result;
}

Vectorized with the wrapper library:

float dist(Point p1, Point p2, int dim) {
    int i;
    _MM_TYPE result, _aux, _diff, _coord1, _coord2;
    result = _MM_SETZERO();

    for (i = 0; i < dim; i = i + SIMD_WIDTH) {
        _coord1 = _MM_LOADU(&(p1.coord[i]));
        _coord2 = _MM_LOADU(&(p2.coord[i]));

        _diff  = _MM_SUB(_coord1, _coord2);
        _aux   = _MM_MUL(_diff, _diff);
        result = _MM_ADD(result, _aux);
    }
    // Add all items of the vector
    return _MM_CVT_F(_MM_FULL_HADD(result, result));
}

How the wrappers expand per ISA:
_MM_TYPE  -> SSE: __m128                           NEON: float32x4_t
_MM_LOADU -> SSE: _mm_loadu_ps(&(p1.coord[i]))     NEON: vld1q_f32(&(p1.coord[i]))
_MM_SUB   -> AVX: _mm256_sub_ps(_coord1, _coord2)  NEON: vsubq_f32(_coord1, _coord2)

36 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

37 RESULTS — RUNTIME, IVY BRIDGE
(Charts: normalized time and total speedup for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads; log-scale speedup axis)
Scalable benchmarks: up to 50x speedup on 8 threads (Hyper-Threading)
Resource limited: around 2x per thread
Code/Input limited: around 10-15% speedup per thread

38 RESULTS — ENERGY, IVY BRIDGE
(Charts: normalized PKG/PP1/PP0 energy and average PP0 power for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
Average PP0 power barely changes when using SSE/AVX
The threading equivalent increases power by approx. 5-8 W per thread
This "performance for free" translates into huge energy savings

39-40 RESULTS — PERFORMANCE COUNTERS
EXECUTION CYCLE BREAKDOWN
We require instructions and data. Critical points: instruction dispatch and the L1 data cache.
INSTRUCTION STALLS
- Reorder Buffer (ROB)
- Renaming Logic (RS)

41 RESULTS — PERFORMANCE COUNTERS: CACHE MISS RATE
(Charts: L1D and LLC miss rates for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
L1D AND LLC
The total number of accesses is reduced at all levels, but the miss rate changes
The L1D miss rate increases linearly with SIMD register width under heavy usage
Fluidanimate barely generates AVX instructions due to its input

42 STALL CYCLE BREAKDOWN — IVY BRIDGE, SCALABLE (BLACKSCHOLES)
(Chart: execution cycle breakdown into ROB, RS, L1D, dispatch and other stalls; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
Renaming logic pressure increases
L1D stalls increase slightly
Behavior is consistent across thread counts

43 STALL CYCLE BREAKDOWN — IVY BRIDGE, RESOURCE LIMITED (CANNEAL)
(Chart: execution cycle breakdown; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
Renaming logic and ROB pressure increases
L1D stalls force dispatch stalls
Behavior is consistent across thread counts

44 STALL CYCLE BREAKDOWN — IVY BRIDGE, CODE/INPUT LIMITED (FLUIDANIMATE)
(Chart: execution cycle breakdown; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)
ROB pressure increases, RS barely changes (not many SIMD instructions)
L1D stalls force dispatch stalls
Behavior is consistent across thread counts

45 OUTLINE: 1 MOTIVATION, 2 BACKGROUND, 3 METHODOLOGY, 4 RESULTS, 5 CONCLUSIONS

46 CONCLUSIONS
Great energy savings from vectorization (prioritize it over parallelization)
SIMD implementations change the architectural trade-offs of the processor
SIMD is widely available across market segments and can no longer be ignored
We aim to distribute our code to reinforce the validation process of new proposals
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? YES.
Benchmarks should cover the most common architectural features, or architects may end up under/overestimating the impact of their contributions.

47 THANK YOU

48 BACKUP: STALL CYCLE BREAKDOWN — IVY BRIDGE, STREAMCLUSTER
(Chart: normalized cycle count broken into ROB, RS, L1D, dispatch and other stalls; Scalar/SSE/AVX at 1, 2, 4 and 8 threads)

49 BACKUP: SIMD LIMITING FACTORS — DATA STRUCTURES
OO programming encourages Arrays of Structures (AoS) over Structures of Arrays (SoA) (Source: spuify.co.uk)

50 BACKUP: SIMD LIMITING FACTORS — SOLUTIONS
Software: hide the SoA internal representation from the user (e.g., Intel Array Building Blocks, or Apple's EVE)
Hardware: NEON strided loads (Source: ARM)

51 BACKUP: SIMD LIMITING FACTORS — DIVERGENT BRANCHES
Conditional branches pose a threat to SIMD performance: a scalar if (input < 1) becomes a per-lane condition in SIMD code.

52 BACKUP: SIMD LIMITING FACTORS — HORIZONTAL OPERATIONS AND ROUNDING
Horizontal operations are usually slower and may cause rounding differences.
Vertical:   [a3 a2 a1 a0] + [b3 b2 b1 b0] = [a3+b3, a2+b2, a1+b1, a0+b0]
Horizontal: [a3 a2 a1 a0] -> [a3+a2, a1+a0]
In floating point, a0 + a1 + a2 + a3 != (a0 + a1) + (a2 + a3) in general.

53 BACKUP: SIMD LIMITING FACTORS — INPUT SIZE
The input size may not be divisible by the SIMD width: e.g. six elements [a5 a4 a3 a2 a1 a0] with 4-wide SIMD leave a two-element tail. The loop then runs partially SIMD and partially scalar.


More information

Spectre and Meltdown. Clifford Wolf q/talk

Spectre and Meltdown. Clifford Wolf q/talk Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Memory access patterns. 5KK73 Cedric Nugteren

Memory access patterns. 5KK73 Cedric Nugteren Memory access patterns 5KK73 Cedric Nugteren Today s topics 1. The importance of memory access patterns a. Vectorisation and access patterns b. Strided accesses on GPUs c. Data re-use on GPUs and FPGA

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Detecting Memory-Boundedness with Hardware Performance Counters

Detecting Memory-Boundedness with Hardware Performance Counters Center for Information Services and High Performance Computing (ZIH) Detecting ory-boundedness with Hardware Performance Counters ICPE, Apr 24th 2017 (daniel.molka@tu-dresden.de) Robert Schöne (robert.schoene@tu-dresden.de)

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 2

ECE 571 Advanced Microprocessor-Based Design Lecture 2 ECE 571 Advanced Microprocessor-Based Design Lecture 2 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 January 2016 Announcements HW#1 will be posted tomorrow I am handing out

More information

Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Anastasia Ailamaki. Performance and energy analysis using transactional workloads Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun Online Transaction Processing $2B+ industry Characteristics:

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Chapter 1: Fundamentals of Quantitative Design and Analysis

Chapter 1: Fundamentals of Quantitative Design and Analysis 1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Hybrid Architectures Why Should I Bother?

Hybrid Architectures Why Should I Bother? Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Performance and Energy Usage of Workloads on KNL and Haswell Architectures Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research

More information

Progress Report on QDP-JIT

Progress Report on QDP-JIT Progress Report on QDP-JIT F. T. Winter Thomas Jefferson National Accelerator Facility USQCD Software Meeting 14 April 16-17, 14 at Jefferson Lab F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 /

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Multicore Scaling: The ECM Model

Multicore Scaling: The ECM Model Multicore Scaling: The ECM Model Single-core performance prediction The saturation point Stencil code examples: 2D Jacobi in L1 and L2 cache 3D Jacobi in memory 3D long-range stencil G. Hager, J. Treibig,

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Growth in Cores - A well rehearsed story

Growth in Cores - A well rehearsed story Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Low-power Architecture. By: Jonathan Herbst Scott Duntley Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Native Offload of Haskell Repa Programs to Integrated GPUs

Native Offload of Haskell Repa Programs to Integrated GPUs Native Offload of Haskell Repa Programs to Integrated GPUs Hai (Paul) Liu with Laurence Day, Neal Glew, Todd Anderson, Rajkishore Barik Intel Labs. September 28, 2016 General purpose computing on integrated

More information

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA 1. INTRODUCTION HiPERiSM Consulting, LLC, has a mission to develop (or enhance) software and

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Getting Started with

Getting Started with /************************************************************************* * LaCASA Laboratory * * Authors: Aleksandar Milenkovic with help of Mounika Ponugoti * * Email: milenkovic@computer.org * * Date:

More information