1 ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC'S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS
Juan M. Cebrián, Dept. of Computer and Information Science, NTNU (juanmc@idi.ntnu.no)
PP4EE, Oct. 3, 2013
2 OUTLINE
1 MOTIVATION
2 BACKGROUND
3 METHODOLOGY
4 RESULTS
5 CONCLUSIONS
4 WHY ENERGY EFFICIENT?
GREEN COMPUTING
Good for the economy (though maybe not for everyone)
Good for the environment
However, the real world is slightly different
REALITY
Companies want to make money
People don't buy because of energy efficiency
Energy savings -> new markets or features
Design within a power budget (e.g. 5W for smartphones)
7 WHY ENERGY EFFICIENT? IN HPC
Not so different: the TOP 500 is a race to see which country has the largest... computer
THE GOAL
Exascale under a reasonable power budget (Horizon 2020)
WAIT... THAT MEANS
33 PF at 17 MW -> 515 pJ/op
1 EF at 20 MW -> 20 pJ/op
~30x improvements needed in all system components
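The pJ/op figures above are simply sustained power divided by throughput. A quick check of the arithmetic (the helper name is ours, introduced only for illustration):

```c
/* Energy per operation in picojoules, given sustained power (W)
   and throughput (FLOP/s): E/op = P / rate, scaled to pJ. */
static double pj_per_op(double watts, double flops_per_s) {
    return watts / flops_per_s * 1e12;
}
/* pj_per_op(17e6, 33e15) -> ~515 pJ/op (today's 33 PF at 17 MW)
   pj_per_op(20e6, 1e18)  ->   20 pJ/op (exascale at 20 MW)      */
```

The ratio between the two is roughly 26x on the per-op figure alone; the slide's "30x" accounts for all system components.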
12 WHAT'S THE PLAN? HOW COMPUTER ARCHITECTS WORK
THE SCIENTIFIC METHOD
Find/analyze a problem
Propose a solution
Validate the solution
VALIDATION
Relies on benchmarking: a set of relevant applications
However...
People's needs change! Hardware evolves!
What about the implementation? It is assumed to be correct/optimal
BENCHMARK SUITES
SPEC-CPU
SPLASH-2 (1995, cited by 2912)
PARSEC (2008, cited by 1039)
Rodinia (2009, cited by 304)
RELEVANT APPLICATIONS (DWARFS)
Dense Linear Algebra, Sparse Linear Algebra
Spectral Methods, N-Body Methods
Structured Grids, Unstructured Grids
MapReduce, Combinational Logic
Graph Traversal, Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models, Finite State Machines
17 IMPROVING ENERGY EFFICIENCY
Techniques: Parallelization, Vectorization, Specialization, Heterogeneity, Power saving mechanisms
Examples: Intel Haswell, NVIDIA Tegra 4, AMD Kabini
19 WHAT IS VECTORIZATION / SIMD? (Source: Intel)
SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)
Nothing new (vector supercomputers of the early 1970s)
First widely deployed SIMD: Intel's MMX extensions (64-bit, 1996)
Streaming SIMD Extensions (SSE, 128-bit, 1999)
Advanced Vector Extensions (AVX, 256-bit, 2011)
ARM NEON (2009): Cortex-A8/A9 "fake" 128-bit (narrower datapath), Cortex-A15 real 128-bit
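As a concrete illustration of the SIMD model, a minimal SSE kernel (the function name is ours) that processes four floats per instruction instead of one:

```c
#include <xmmintrin.h>  /* SSE: 128-bit registers, 4 floats per operation */

/* c[i] = a[i] + b[i] for n floats, four lanes at a time.
   Assumes n is a multiple of 4; uses unaligned loads/stores for simplicity. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* one add, 4 results */
    }
}
```

The same loop in scalar code issues four times as many add instructions, which is where the cache and pipeline pressure reduction on the next slide comes from.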
21 SIMD ADVANTAGES AND DISADVANTAGES
ADVANTAGES
Potential speedup proportional to register size
Reduced cache and pipeline pressure
Increased energy efficiency!
DISADVANTAGES
Increased bandwidth requirements
Usually requires low-level programming (intrinsics)
Large register files increase energy consumption and chip area
Not all applications can be vectorized without major code changes
22 WHY BOTHER WITH SIMD?
SIMD WIDTH
Xeon E5-2692, Opteron 6274 and BlueGene/Q: 256 bits (8 floats)
Xeon Phi 31S1P: 512 bits (16 floats)
K20x: 2048 bits (64 floats)
ACADEMIA TO THE RESCUE
SPEC-CPU
SPLASH-2 (1995, cited by 2912)
PARSEC (2008, cited by 1039)
Rodinia (2009, cited by 304)
25 METHODOLOGY
Ivy Bridge 3770K cache hierarchy (Source: Intel):
Cache         Size        Sharing  Ways of assoc.  Line size  Latency (cycles)
L1 Instr.     32KB        Private  8               64B        4
L1 Data       32KB        Private  8               64B        4
L2            256KB       Private  8               64B        12
L3            8MB (20MB)  Shared   16              64B        -
ENVIRONMENT
PARSEC 3.0b
GCC 4.7 and the -O2 flag
Ubuntu Linux
Discard cold runs
System running in console mode (runlevel 3)
AT-THE-WALL (SYSTEM) POWER
Yokogawa WT210 power meter
ISOLATED CPU ENERGY
Core i5 and i7 include energy MSRs (PP0, PP1, PKG)
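Reading the energy MSRs requires privileged access (e.g. via /dev/cpu/*/msr), but the conversion from raw counter values to joules is a small calculation. A sketch, assuming the RAPL encoding documented in Intel's SDM (MSR_RAPL_POWER_UNIT at 0x606; PKG/PP0/PP1 energy status at 0x611/0x639/0x641 as 32-bit wrapping counters):

```c
#include <stdint.h>

/* RAPL energy counters count in units of 1/2^ESU joules, where ESU is
   bits 12:8 of MSR_RAPL_POWER_UNIT. On Sandy/Ivy Bridge ESU is typically
   16, i.e. one count = 2^-16 J (about 15.3 microjoules). */
static double rapl_joules(uint32_t before, uint32_t after, uint64_t unit_msr) {
    uint32_t esu = (uint32_t)(unit_msr >> 8) & 0x1f;
    uint32_t ticks = after - before;  /* unsigned arithmetic handles wrap */
    return (double)ticks / (double)(1u << esu);
}
```

Sampling a counter before and after the region of interest and passing the two raw values (plus the unit register) gives the energy of that region in joules.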
30 NEW PLATFORMS
ODROID-XU-E (Source: Hardkernel)
31 VECTORIZATION WORKFLOW: Wrapper/Math Library -> Profile -> Vectorize -> Test
Program        Domain       SSE  AVX  NEON  Hotspots  Changes  Group
blackscholes   Financial    x    x    x     yes       DT       S
canneal        Engineering  x    x    x     yes       MCC      RL
fluidanimate   Animation    x    x    x     yes       MCC      CI
raytrace       Rendering    x    -    -     yes       -        -
streamcluster  Data Mining  x    x    x     yes       DT       RL
swaptions      Financial    x    x    x     no        MCC      CI
vips           Media Proc.  x    x    x     no        MCC      RL/CI
x264           Media Proc.  x    x    x     no        -        RL/CI
DT: Direct Translation / MCC: Major Code Changes
S: Scalable / RL: Resource Limited / CI: Code/Input Limited
32 WRAPPER LIBRARY
Share code between implementations
Keep control of the vectorization process

```c
#define _MM_ALIGNMENT 16
#define SIMD_WIDTH 4
#define _MM_ABS _mm_abs_ps
#define _MM_CMPLT _mm_cmplt_ps
#define _MM_TYPE __m128

__attribute__((aligned(16))) static const int absmask[] =
    {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff};
#define _mm_abs_ps(x) _mm_and_ps((x), *(const __m128*)absmask)
```
33 EXAMPLE: STREAMCLUSTER KERNEL
[Figure: kernel profiling]
34 EXAMPLE: STREAMCLUSTER KERNEL — WRAPPER LIBRARY

```c
/* compute Euclidean distance squared between two points */
float dist(Point p1, Point p2, int dim) {
    int i;
    float result = 0.0;
    for (i = 0; i < dim; i++)
        result += (p1.coord[i] - p2.coord[i]) * (p1.coord[i] - p2.coord[i]);
    return result;
}
```

```c
float dist(Point p1, Point p2, int dim) {
    int i;
    _MM_TYPE result, _aux, _diff, _coord1, _coord2;
    result = _MM_SETZERO();

    for (i = 0; i < dim; i += SIMD_WIDTH) {
        _coord1 = _MM_LOADU(&(p1.coord[i]));
        _coord2 = _MM_LOADU(&(p2.coord[i]));

        _diff = _MM_SUB(_coord1, _coord2);
        _aux  = _MM_MUL(_diff, _diff);
        result = _MM_ADD(result, _aux);
    }
    // Add all items of the vector
    return _MM_CVT_F(_MM_FULL_HADD(result, result));
}
```
35 EXAMPLE: STREAMCLUSTER KERNEL — WRAPPER LIBRARY
The wrapper macros expand to the native intrinsics of each target ISA:

```c
float dist(Point p1, Point p2, int dim) {
    int i;
    _MM_TYPE result, _aux, _diff, _coord1, _coord2;
    // _MM_TYPE    -> SSE:  __m128        NEON: float32x4_t
    result = _MM_SETZERO();

    for (i = 0; i < dim; i += SIMD_WIDTH) {
        _coord1 = _MM_LOADU(&(p1.coord[i]));
        // _MM_LOADU -> SSE:  _mm_loadu_ps(&(p1.coord[i]))
        //              NEON: vld1q_f32(&(p1.coord[i]))
        _coord2 = _MM_LOADU(&(p2.coord[i]));

        _diff = _MM_SUB(_coord1, _coord2);
        // _MM_SUB   -> AVX:  _mm256_sub_ps(_coord1, _coord2)
        //              NEON: vsubq_f32(_coord1, _coord2)
        _aux = _MM_MUL(_diff, _diff);
        // ... (accumulate and horizontal add as on the previous slide)
    }
}
```
37 RESULTS - RUNTIME, IVY BRIDGE
[Figure: normalized runtime and total speedup (log scale) for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
Scalable benchmarks: up to ~50x speedup on 8 threads (Hyper-Threading)
Resource limited: around 2x per thread
Code/input limited: around 10-15% speedup per thread
38 RESULTS - ENERGY, IVY BRIDGE
[Figure: normalized energy (PKG, PP0, PP1) and average PP0 power for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
Average PP0 power barely changes when using SSE/AVX
The threading equivalent increases power by approx. 5-8W per thread
This "performance for free" translates into huge energy savings
39 RESULTS - PERFORMANCE COUNTERS
EXECUTION CYCLE BREAKDOWN
We require instructions and data
Critical points: instruction dispatch, L1 data
INSTRUCTION STALLS
Reorder Buffer (ROB)
Renaming logic (RS)
41 RESULTS - PERFORMANCE COUNTERS: CACHE MISS RATES
[Figure: L1D and LLC miss rates for Blackscholes, Canneal and Fluidanimate; Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
L1D AND LLC
The total number of accesses is reduced at all levels, but the miss rates change
The L1D miss rate increases linearly with SIMD register width under heavy usage
Fluidanimate barely generates AVX instructions due to its input
42 STALL CYCLE BREAKDOWN: IVY BRIDGE - SCALABLE - BLACKSCHOLES
[Figure: execution cycle breakdown (ROB stalls, RS stalls, L1D stalls, dispatch stalls, other); Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
Renaming logic pressure increases
L1D stalls increase slightly
Behavior is consistent across thread counts
43 STALL CYCLE BREAKDOWN: IVY BRIDGE - RESOURCE LIMITED - CANNEAL
[Figure: execution cycle breakdown (ROB stalls, RS stalls, L1D stalls, dispatch stalls, other); Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
Renaming logic and ROB pressure increase
L1D stalls force dispatch stalls
Behavior is consistent across thread counts
44 STALL CYCLE BREAKDOWN: IVY BRIDGE - CODE/INPUT LIMITED - FLUIDANIMATE
[Figure: execution cycle breakdown (ROB stalls, RS stalls, L1D stalls, dispatch stalls, other); Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
ROB pressure increases; RS barely changes (not many SIMD instructions)
L1D stalls force dispatch stalls
Behavior is consistent across thread counts
46 CONCLUSIONS
Great energy savings from vectorization (prioritize it over parallelization)
SIMD implementations change the architectural trade-offs of the processor
SIMD is widely available across market segments and can no longer be ignored
We aim to distribute our code to reinforce the validation process of new proposals
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? YES
Benchmarks should cover the most common architectural features, or architects may end up under- or over-estimating the impact of their contributions.
47 THANK YOU
48 STALL CYCLE BREAKDOWN: IVY BRIDGE - STREAMCLUSTER
[Figure: normalized cycle count (ROB stalls, RS stalls, L1D stalls, dispatch stalls, other); Scalar/SSE/AVX at 1, 2, 4 and 8 threads]
49 SIMD LIMITING FACTORS
DATA STRUCTURES
OO programming encourages Arrays of Structures (AoS) over Structures of Arrays (SoA)
(Source: spuify.co.uk)
50 SIMD LIMITING FACTORS
SOLUTIONS
Software: hide the SoA internal representation from the user (e.g., Intel Array Building Blocks, or Apple's EVE)
Hardware: NEON strided loads (Source: ARM)
51 SIMD LIMITING FACTORS
DIVERGENT BRANCHES
Conditional branches pose a threat to SIMD performance
[Figure: scalar vs. SIMD execution of an "if (input < 1)" branch]
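On CPUs the usual workaround for a divergent branch is to execute both sides for all lanes and select per lane with a mask. A minimal SSE sketch of the "if (input < 1)" example (the function name is ours; only SSE1 intrinsics are used, so no SSE4 blend is required):

```c
#include <xmmintrin.h>

/* Branch-free SIMD form of: out[i] = (in[i] < 1.0f) ? a[i] : b[i].
   _mm_cmplt_ps produces an all-ones/all-zeros mask per lane; both sides
   are computed and the mask selects between them with AND/ANDNOT/OR. */
void select4(const float *in, const float *a, const float *b, float *out) {
    __m128 one  = _mm_set1_ps(1.0f);
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(in), one);  /* lane < 1 ? */
    __m128 va   = _mm_and_ps(mask, _mm_loadu_ps(a));    /* taken side  */
    __m128 vb   = _mm_andnot_ps(mask, _mm_loadu_ps(b)); /* other side  */
    _mm_storeu_ps(out, _mm_or_ps(va, vb));
}
```

The cost is that both branch bodies are always evaluated, which is exactly the divergence penalty the slide refers to.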
52 SIMD LIMITING FACTORS
HORIZONTAL OPERATIONS AND ROUNDING
Horizontal operations are usually slower and may cause rounding errors
[Figure: vertical add (a3 a2 a1 a0 + b3 b2 b1 b0 -> c3 c2 c1 c0) vs. horizontal add (a3 a2 a1 a0 -> a3+a2, a1+a0)]
In floating point, a0 + a1 + a2 + a3 != (a0 + a1) + (a2 + a3)
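The inequality can be demonstrated in a few lines; the pairwise order below is the one a SIMD horizontal reduction uses (helper names are ours). Because float has a 24-bit significand, values near 1e8 are spaced 8 apart, so small addends are absorbed:

```c
/* Sequential (scalar) association: ((a0 + a1) + a2) + a3.
   Each assignment rounds the running sum to float precision. */
static float sum_sequential(const float *a) {
    float s = a[0];
    s += a[1]; s += a[2]; s += a[3];
    return s;
}

/* Pairwise association, as used by SIMD horizontal adds:
   (a0 + a1) + (a2 + a3). */
static float sum_pairwise(const float *a) {
    float lo = a[0] + a[1];
    float hi = a[2] + a[3];
    return lo + hi;
}
```

For a = {1, 1, 1e8, -1e8} the sequential sum absorbs the two 1.0 terms into 1e8 and returns 0, while the pairwise sum cancels the large terms first and returns 2 — same inputs, different answers.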
53 SIMD LIMITING FACTORS
INPUT SIZE
The input size may not be divisible by the SIMD width
[Figure: a six-element input (a5..a0) processed partially with SIMD (a3..a0) and partially with scalar code (a5, a4)]
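The common remedy is a vector-width main loop plus a scalar tail for the leftover elements. A portable C sketch of the pattern, using the streamcluster dist() kernel as the example (plain C stands in for the wrapper macros; the function name is ours):

```c
/* Squared Euclidean distance with a 4-wide vectorizable main loop and a
   scalar epilogue for the dim % 4 leftover elements, mirroring how a
   SIMD_WIDTH-based kernel handles inputs not divisible by the width. */
float dist_tail(const float *p1, const float *p2, int dim) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i, main_end = dim & ~3;        /* largest multiple of 4 <= dim */
    for (i = 0; i < main_end; i += 4)  /* SIMD-friendly body: 4 lanes  */
        for (int j = 0; j < 4; j++) {
            float d = p1[i + j] - p2[i + j];
            acc[j] += d * d;
        }
    float result = acc[0] + acc[1] + acc[2] + acc[3]; /* horizontal add */
    for (; i < dim; i++) {             /* scalar tail: dim % 4 elements */
        float d = p1[i] - p2[i];
        result += d * d;
    }
    return result;
}
```

Padding the input to a multiple of the SIMD width (with zeros, so the padding contributes nothing to the sum) is the main alternative to the epilogue.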
More informationChapter 1: Fundamentals of Quantitative Design and Analysis
1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationHybrid Architectures Why Should I Bother?
Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform
More informationHPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,
HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationHPC Algorithms and Applications
HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationPerformance and Energy Usage of Workloads on KNL and Haswell Architectures
Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research
More informationProgress Report on QDP-JIT
Progress Report on QDP-JIT F. T. Winter Thomas Jefferson National Accelerator Facility USQCD Software Meeting 14 April 16-17, 14 at Jefferson Lab F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 /
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationJackson Marusarz Intel Corporation
Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationMulticore Scaling: The ECM Model
Multicore Scaling: The ECM Model Single-core performance prediction The saturation point Stencil code examples: 2D Jacobi in L1 and L2 cache 3D Jacobi in memory 3D long-range stencil G. Hager, J. Treibig,
More informationPresenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationGrowth in Cores - A well rehearsed story
Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationLow-power Architecture. By: Jonathan Herbst Scott Duntley
Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media
More informationSupercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?
Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationNative Offload of Haskell Repa Programs to Integrated GPUs
Native Offload of Haskell Repa Programs to Integrated GPUs Hai (Paul) Liu with Laurence Day, Neal Glew, Todd Anderson, Rajkishore Barik Intel Labs. September 28, 2016 General purpose computing on integrated
More informationIntel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationEXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière
EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationIssues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
More informationDouble-precision General Matrix Multiply (DGEMM)
Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationEXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA
EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA 1. INTRODUCTION HiPERiSM Consulting, LLC, has a mission to develop (or enhance) software and
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationGetting Started with
/************************************************************************* * LaCASA Laboratory * * Authors: Aleksandar Milenkovic with help of Mounika Ponugoti * * Email: milenkovic@computer.org * * Date:
More information