Static and Dynamic Frequency Scaling on Multicore CPUs

1 Static and Dynamic Frequency Scaling on Multicore CPUs
Wenlei Bao (1), Changwan Hong (1), Sudheer Chunduri (2), Sriram Krishnamoorthy (3), Louis-Noël Pouchet (4), Fabrice Rastello (5), P. Sadayappan (1)
(1) The Ohio State University; (2) Argonne National Laboratory; (3) Pacific Northwest National Laboratory; (4) Colorado State University; (5) INRIA
High Performance and Embedded Architecture and Compilation

2 Outline
1. Motivation & Overview
2. Hybrid Static/Dynamic Frequency Scaling
3. Experimental Evaluation

4 Motivation & Overview: Energy Efficiency
Energy efficiency is of great importance in settings ranging from battery-operated devices to data centers.
Dynamic Voltage and Frequency Scaling (DVFS):
- Well studied and widely adopted in both personal computing devices and large-scale clusters.
- A fundamental control mechanism that enables a trade-off between performance and energy to improve energy efficiency.
- Energy savings are achieved through voltage and frequency scaling.

5 Motivation & Overview: Typical DVFS approaches
Limitations observed among these DVFS approaches:
- The frequency that optimizes CPU energy varies significantly across processors. Different processors have different optimal frequencies even for the same compute-bound application.
- Optimizing energy for a parallel application is highly dependent on its parallel scaling, which in turn depends on the operating frequency. This is mostly ignored in previous related work.
- Dynamic schemes remain constrained to runtime inspection at specific time intervals (e.g. the Linux on-demand governor). Short-running program phases benefit less than they would from using the best frequency from the start.

6 Motivation & Overview: Our Proposed Approach
A compile-time approach to select the best frequency for affine program regions:
1. Static analysis to approximate the operational intensity and parallel scaling.
2. Categorize each program region into categories such as compute-bound and bandwidth-bound.
3. Select a frequency / core pair for each program category from one-time energy (E) / energy-delay product (EDP) profiling through microbenchmarking.
A lightweight runtime approach that throttles frequency based on application energy efficiency:
- Used as a baseline to compare and validate against our compile-time approach.

7 Motivation & Overview: Overview of the flow
Figure: Input program (source code) -> Analysis (extract features) -> Categorization (sequential, poor scalability, compute-bound, bandwidth-bound) -> Decision tree -> Frequency / core.

9 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
CPU-only energy on 4 Intel CPUs for DGEMM/MKL
Figure: Energy consumption of DGEMM/MKL on 4 CPUs (one panel per CPU; x-axis: Frequency (GHz); y-axis: Energy (J); one curve per core count).
4 Intel x86 CPUs: SandyBridge, IvyBridge, Haswell.
Representative benchmarks:
- DGEMM/MKL: compute-bound out-of-core computation.
- Jacobi-2D: bandwidth-bound computation.
Table: Processor characteristics
Intel CPU      Microarch.   Node  Cores  L1   L2    L3      Freq. Range  Voltage Range
Core i7-2600K  SandyBridge  32nm  4      32K  256K  8192K   GHz          V
Xeon E         IvyBridge    22nm  4      32K  256K  8192K   GHz          V
Xeon E5-2650   IvyBridge    22nm  8      32K  256K  20480K  GHz          V
Core i7-4770K  Haswell      22nm  4      32K  256K  8192K   GHz          V

10 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
CPU-only energy on 4 Intel CPUs for Jacobi-2D
Figure: Energy consumption of Jacobi-2D on 4 CPUs (one panel per CPU; x-axis: Frequency (GHz); y-axis: Energy (J); one curve per core count).
1. Effect of Execution Time
- The optimal frequency shifts towards lower frequencies compared to compute-bound cases.
- Execution time decreases more slowly than frequency increases: the expected acceleration is not achieved.
2. Effect of Weak Scaling
- Weak scaling: adding cores does not decrease execution time linearly (bandwidth saturation).
- This leads to a higher energy cost for 4/8 cores than for a single core when increasing frequency.

11 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
Summary of Motivations
- Optimal CPU energy is at neither the minimum nor the maximum frequency in most cases.
- Optimal CPU energy is achieved at different frequencies depending on the number of cores used.
- Effect of Execution Time: lowers the best frequency compared to compute-bound cases.
- Effect of Weak Scaling: higher energy cost for multiple cores than for a single core.
Objectives
- Goal #1: Characterize programs into different categories.
- Goal #2: Analysis fast enough for large programs, e.g. 1s for 1 lines. Trade off result accuracy for analysis time.
- Goal #3: Select the best frequency/core pair to optimize CPU energy for affine kernels.

13 Hybrid Static/Dynamic Frequency Scaling: Extracting Features
Overview of the process:
1. Extract the polyhedral representation of the program (including OpenMP doall).
2. Inline parameter values (many of our analyses are not parametric).
3. Count the number of operations in each loop / region of interest.
4. Compute the data space (in cache lines) accessed by each loop / region.
5. Run various algorithms to estimate cache misses, thread workload, etc.
Core features currently extracted:
- EXACT: number of FLOPs
- EXACT: data footprint (read/written)
- APPROX: data cache misses (at each level, incl. shared/private)
- APPROX: OpenMP thread workload
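Steps 3 and 4 can be illustrated with a small sketch. It is not the PolyFeat implementation, which works on the polyhedral representation; this version simply enumerates a tiny iteration domain. The helper names (`count_flops`, `footprint_in_lines`), the row-major line-aligned array layout, and the matrix-vector kernel are all assumptions made for illustration.

```python
from itertools import product

def count_flops(ops_per_iteration, domain):
    """Operations per statement instance times the number of iteration points."""
    return ops_per_iteration * len(domain)

def footprint_in_lines(accesses, domain, elems_per_line):
    """Count distinct cache lines touched by a set of array accesses.

    accesses: list of (array_name, row_length, index_fn); index_fn maps an
              iteration point to a (row, col) pair. Arrays are assumed to be
              row-major and aligned on a cache-line boundary (sketch assumption).
    """
    lines = set()
    for point in domain:
        for name, row_length, index_fn in accesses:
            row, col = index_fn(point)
            linear = row * row_length + col
            lines.add((name, linear // elems_per_line))
    return len(lines)

# Hypothetical kernel: y[i] += A[i][j] * x[j] for 0 <= i, j < N
N = 16
LS = 8  # array elements per cache line (hypothetical value)
domain = list(product(range(N), range(N)))
accesses = [
    ("A", N, lambda p: (p[0], p[1])),
    ("x", N, lambda p: (0, p[1])),
    ("y", N, lambda p: (0, p[0])),
]
flops = count_flops(2, domain)                    # one multiply and one add per point
lines = footprint_in_lines(accesses, domain, LS)  # A: 32 lines, x: 2, y: 2
print(flops, lines)
```

Enumeration only scales to small domains; it is used here to make the two "EXACT" features concrete, not to suggest how they are computed at compile time.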

14 Hybrid Static/Dynamic Frequency Scaling: Approximating Cache Misses
Metric used: estimated Operational Intensity (OI)
- OI: ratio of operations executed per byte of data transferred to/from memory.
- OI is sufficient to categorize a program region as compute-bound or bandwidth-bound.
Counting cache misses:
1. Slice the region of interest (e.g. one OpenMP thread), inline parameters.
2. Specify cache properties: shared/private, size in kB, cache line size in B.
3. Compute the data space accessed by each loop, in terms of distinct cache lines (DL).
4. Ad-hoc algorithm to estimate misses at each loop level, based on DL.
Some observations:
- Massive problem simplification: associativity is not considered; only capacity/compulsory misses are modeled.
- Data prefetching (next-line), inter-iteration data reuse, and SIMD are all taken into account.

15 Hybrid Static/Dynamic Frequency Scaling: Approximating Cache Misses
Algorithm 1: EstimateCacheMisses
Input: number of array elements per cache line: ls; cache size, in lines: C; polyhedral program: P
Output: estimate of misses
  for all loops l do
    Misses[l] = DLbase[l] = DLnext[l] = Processed[l] = 0
  end for
  for all arrays A do                          <= Compute DL for each loop
    for all loops l do
      DLbase[l] += #(DL_A) PS_l,0
      DLnext[l] += #(DL_A \ PS_l,0 DL_A) PS_l,1
    end for
  end for
  for all loops l in postfix AST order and Processed[l] == 0 do    <= Recursively traverse the program, bottom-up
    if DLbase[l] < C then
      Misses[l] = DLbase[l]
    else
      if l is inner-most loop then
        Misses[l] = DLbase[l]
      else
        Miss = 0
        for all loops ll immediately surrounded by l do
          Miss += Misses[ll]
          Processed[ll] = 1
        end for
        Misses[l] = Miss * TripCount(l)
        if DLnext[l] < C then                  <= Handle inter-iteration reuse (experimental)
          Misses[l] -= DLnext[l]
        end if
      end if
    end if
    Processed[l] = 1
  end for
  return Misses[fk]
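The traversal in Algorithm 1 can be sketched in Python as a recursion over a loop tree. This is a minimal sketch, not the authors' code: the `Loop` class is hypothetical, and the per-loop `dl_base` and `dl_next` counts (DLbase and DLnext on the slide) are supplied by hand here rather than computed from a polyhedral representation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Loop:
    """Hypothetical loop-tree node carrying the precomputed DL counts."""
    name: str
    trip_count: int
    dl_base: int  # DLbase[l] from the slide, in cache lines
    dl_next: int  # DLnext[l] from the slide, in cache lines
    children: List["Loop"] = field(default_factory=list)

def estimate_misses(loop: Loop, cache_lines: int) -> int:
    """Bottom-up miss estimate following the control flow of Algorithm 1."""
    if loop.dl_base < cache_lines:
        # Footprint is below the cache capacity: count DLbase[l] misses.
        return loop.dl_base
    if not loop.children:
        # Inner-most loop whose footprint exceeds the cache.
        return loop.dl_base
    inner = sum(estimate_misses(child, cache_lines) for child in loop.children)
    misses = inner * loop.trip_count
    if loop.dl_next < cache_lines:
        # Inter-iteration reuse (marked experimental on the slide).
        misses -= loop.dl_next
    return misses

# Hand-made two-level nest; every number below is made up for illustration.
inner_j = Loop("j", trip_count=1000, dl_base=250, dl_next=0)
outer_i = Loop("i", trip_count=1000, dl_base=125_125, dl_next=125, children=[inner_j])

print(estimate_misses(outer_i, cache_lines=512))      # nest does not fit
print(estimate_misses(outer_i, cache_lines=200_000))  # nest fits entirely
```

The recursion replaces the `Processed` flags of the postfix traversal: each loop is visited once from its parent, so the value returned for the root matches what the iterative version would store for it.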

16 Hybrid Static/Dynamic Frequency Scaling: Approximating FLOPs and Computing OI
Counting FLOPs
- Collect the number of operations that are not part of an array subscript, and multiply by the number of points in the iteration domain of the statement.
Algorithm: estimate of OI
Require: cache line size in number of elements: ls; cache size, in lines: C; polyhedral program: P; parameter values: n
Ensure: estimate of OI
  P' = attachContextInformation(n, P)
  Misses = EstimateCacheMisses(ls, C, P')
  Flops = countFlops(P')
  return Flops / (Misses * ls)
Parallelism features
- Sequential/Parallel: check whether any OpenMP doall pragma exists.
- Poor scaling: performance improvement does not scale with the number of cores. Detected by counting the number of active/inactive threads.
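The OI estimate reduces to a short computation once the FLOP and miss counts are available. The sketch below assumes those counts are given; the function names and the example numbers are illustrative and do not come from PolyFeat.

```python
def count_flops(statements):
    """Sum, over statements, of (operations per instance) * (iteration points).

    statements: list of (ops_per_instance, iteration_points) pairs, where
    ops_per_instance excludes operations inside array subscripts.
    """
    return sum(ops * points for ops, points in statements)

def operational_intensity(flops, misses, elems_per_line):
    """OI as written on the slide: Flops / (Misses * ls)."""
    if misses <= 0 or elems_per_line <= 0:
        raise ValueError("misses and elems_per_line must be positive")
    return flops / (misses * elems_per_line)

# Hypothetical region: two statements and a made-up miss estimate.
flops = count_flops([(2, 100), (1, 50)])  # 2*100 + 1*50 operations
oi = operational_intensity(flops, misses=25, elems_per_line=8)
print(flops, oi)
```

Note that `Misses * ls` counts array elements rather than bytes, so this value is OI per element moved; dividing further by the element size would give the per-byte ratio used in the definition of OI.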

17 Hybrid Static/Dynamic Frequency Scaling: Decision Tree to Select Frequency / Core Pair
The frequency / core pair assigned to each category is obtained through program profiling.
Decision tree:
1. If the code is sequential, choose config_sequential;
2. else if the code is bandwidth-bound (OI < threshold), choose config_bwbound;
3. else if the code has poor scaling, choose config_poorscale;
4. else choose config_computebound.
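The decision tree translates directly into a small function. In this sketch the `Config` type, the OI threshold, and every frequency / core value are placeholders; on a real machine they would come from the one-time profiling step.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    freq_ghz: float
    cores: int

def select_config(is_parallel, oi, poor_scaling, oi_threshold, configs):
    """Apply the four-way decision tree in the order given on the slide."""
    if not is_parallel:
        return configs["config_sequential"]
    if oi < oi_threshold:
        return configs["config_bwbound"]
    if poor_scaling:
        return configs["config_poorscale"]
    return configs["config_computebound"]

# Placeholder per-machine table (values are invented, not measured).
CONFIGS = {
    "config_sequential": Config(freq_ghz=2.4, cores=1),
    "config_bwbound": Config(freq_ghz=1.6, cores=2),
    "config_poorscale": Config(freq_ghz=2.0, cores=2),
    "config_computebound": Config(freq_ghz=2.8, cores=4),
}

choice = select_config(is_parallel=True, oi=0.5, poor_scaling=False,
                       oi_threshold=1.0, configs=CONFIGS)
print(choice)
```

The order of the tests matters: a parallel region that is both bandwidth-bound and poorly scaling is classified as bandwidth-bound, as in the slide.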

19 Experimental Evaluation: Experimental Evaluation
Experimental Protocol
Evaluation of the compile-time approach
- PolyBench/C benchmarks optimized by PoCC (60 in total):
  - PolyBench-Parallel: optimized with auto-parallelization.
  - PolyBench-Poly: optimized for data locality (tiling, coarse/fine-grain parallelism).
- Compile-time approach implemented as PolyFeat in PoCC.
Collecting energy/time
- PCM for Intel CPUs, IBM AMESTER for POWER8.
Table: Number of benchmarks in each category
Benchmarks          seq/par  bw-bound  poor scale  comp-bound
polybench-parallel  12/
polybench-poly      5/

20 Experimental Evaluation: Experimental Results
Summary of CPU-only energy and EDP improvements
Figure: Savings in % relative to Powersave for On-Demand, CPUmiser, OurStatic and Best; energy (E) and EDP on SandyBridge, IvyBridge4, IvyBridge8, Haswell and POWER8.
Results
- On-demand: energy efficiency improved for IvyBridge and POWER8, where the maximum frequency leads to minimum energy for compute-bound codes.
- CPUMiser: a runtime DVFS approach based on the CPI of the workload. It optimizes energy while keeping the performance loss under a certain threshold.
- Our static approach consistently outperforms the other schemes and is close to the best achievable.

21 Experimental Evaluation: Experimental Results
Comparison with runtime DVFS
Figure: Energy savings in % on SandyBridge, IvyBridge4, IvyBridge8 and Haswell for Our Static, Our Dynamic, CPUmiser (1%, 5%, 10%) and Ondemand (95%, 85%, 75%, 65%).
Our dynamic approach:
- Inspect changes in energy efficiency (energy normalized by instructions).
- Decide to adapt the frequency when:
  - Energy efficiency decreased compared to the previous interval (processor-specific).
  - OI changed, indicating that the computation performed has varied (application-specific).
- For intervals with similar OI, apply the energy convexity rule (gradient descent).
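One way to read the energy convexity rule is as a descent over the discrete frequency levels: keep moving in the current direction while energy per instruction improves, and reverse when it gets worse. The sketch below is an interpretation under that assumption, not the authors' runtime. The function names and the energy-per-instruction table are invented, and the reaction to OI changes is not modeled.

```python
def next_step(idx, direction, prev_epi, cur_epi, n_levels):
    """One descent step over frequency levels; returns (new_idx, new_direction).

    idx:       index of the current frequency level
    direction: +1 (raise frequency) or -1 (lower frequency)
    prev_epi:  energy per instruction in the previous interval, or None
    cur_epi:   energy per instruction in the interval that just ended
    """
    if prev_epi is not None and cur_epi > prev_epi:
        direction = -direction  # efficiency got worse: turn around
    new_idx = min(max(idx + direction, 0), n_levels - 1)
    return new_idx, direction

def run(epi_by_level, steps, start=0):
    """Replay the controller against a fixed, convex energy table."""
    idx, direction, prev = start, +1, None
    visited = [idx]
    for _ in range(steps):
        cur = epi_by_level[idx]
        idx, direction = next_step(idx, direction, prev, cur, len(epi_by_level))
        prev = cur
        visited.append(idx)
    return visited

# Invented convex energy-per-instruction values for five frequency levels.
EPI = [1.64, 1.16, 1.00, 1.16, 1.64]
visited = run(EPI, steps=8)
print(visited)
```

On a convex table the controller climbs to the minimum and then keeps oscillating around it, which is the kind of behavior a fixed-interval scheme would be expected to show once it has converged.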

22 Experimental Evaluation: Phase Analysis
Power traces for single-phase programs
Figure: Power (Watt) and frequency (GHz) traces for jacobi-2d-imper (par) and mvt (poly).
- Power consumption is stable; the frequency oscillates between two points, indicating that the optimal frequency lies between them.
- 46 out of 60 benchmarks have a single phase.

23 Experimental Evaluation: Phase Analysis
Power traces for multiple-phase programs
Figure: Power (Watt) and frequency (GHz) traces for covariance (par) and covariance (poly).
- Phases are visible as stable changes of the frequency for part of the execution.
- Phases are not identical, because of data locality optimization.
- 14 out of 60 benchmarks show 2 or more (up to 5) phases.

24 Experimental Evaluation: Conclusion
Take-home message:
- Compile-time frequency and core count selection for affine programs
  - New static analysis for program region categorization
  - Fully implemented in PolyFeat / PoCC
  - One-time machine profiling with representative benchmarks
  - Significant energy savings over the powersave Linux governor
- Key aspect: analysis speed versus accuracy
  - Trade off precision when approximating operational intensity
  - Analysis completes in < 1 second for regions of 1+ lines
- Future work: automatic region detection
  - Regions can be automatically detected with our algorithm
  - Allows more precise frequency selection
  - Granularity driven by the cost of a frequency change

26 Thank you

Static and Dynamic Frequency Scaling on Multicore CPUs

Static and Dynamic Frequency Scaling on Multicore CPUs Static and Dynamic Frequency Scaling on Multicore CPUs WENLEI BAO and CHANGWAN HONG, The Ohio State University SUDHEER CHUNDURI, IBM Research India SRIRAM KRISHNAMOORTHY, Pacific Northwest National Laboratory

More information

Predic've Modeling in a Polyhedral Op'miza'on Space

Predic've Modeling in a Polyhedral Op'miza'on Space Predic've Modeling in a Polyhedral Op'miza'on Space Eunjung EJ Park 1, Louis- Noël Pouchet 2, John Cavazos 1, Albert Cohen 3, and P. Sadayappan 2 1 University of Delaware 2 The Ohio State University 3

More information

Parametric Multi-Level Tiling of Imperfectly Nested Loops*

Parametric Multi-Level Tiling of Imperfectly Nested Loops* Parametric Multi-Level Tiling of Imperfectly Nested Loops* Albert Hartono 1, Cedric Bastoul 2,3 Sriram Krishnamoorthy 4 J. Ramanujam 6 Muthu Baskaran 1 Albert Cohen 2 Boyana Norris 5 P. Sadayappan 1 1

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Affine Loop Optimization using Modulo Unrolling in CHAPEL Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers

More information

A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications

A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications Li Tan 1, Zizhong Chen 1, Ziliang Zong 2, Rong Ge 3, and Dong Li 4 1 University of California, Riverside 2 Texas

More information

A polyhedral loop transformation framework for parallelization and tuning

A polyhedral loop transformation framework for parallelization and tuning A polyhedral loop transformation framework for parallelization and tuning Ohio State University Uday Bondhugula, Muthu Baskaran, Albert Hartono, Sriram Krishnamoorthy, P. Sadayappan Argonne National Laboratory

More information

Energy-centric DVFS Controlling Method for Multi-core Platforms

Energy-centric DVFS Controlling Method for Multi-core Platforms Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To

More information

Potentials and Limitations for Energy Efficiency Auto-Tuning

Potentials and Limitations for Energy Efficiency Auto-Tuning Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Multilevel Acyclic Partitioning of Directed Acyclic Graphs for Enhancing Data Locality

Multilevel Acyclic Partitioning of Directed Acyclic Graphs for Enhancing Data Locality Multilevel Acyclic Partitioning of Directed Acyclic Graphs for Enhancing Data Locality Julien Herrmann 1, Bora Uçar 2, Kamer Kaya 3, Aravind Sukumaran Rajam 4, Fabrice Rastello 5, P. Sadayappan 4, Ümit

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information

PolyOpt/C. A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C March 12th Louis-Noël Pouchet

PolyOpt/C. A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C March 12th Louis-Noël Pouchet PolyOpt/C A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C 0.2.1 March 12th 2012 Louis-Noël Pouchet This manual is dedicated to PolyOpt/C version 0.2.1, a framework for Polyhedral

More information

AUTOMATIC SMT THREADING

AUTOMATIC SMT THREADING AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY

More information

More Data Locality for Static Control Programs on NUMA Architectures

More Data Locality for Static Control Programs on NUMA Architectures More Data Locality for Static Control Programs on NUMA Architectures Adilla Susungi 1, Albert Cohen 2, Claude Tadonki 1 1 MINES ParisTech, PSL Research University 2 Inria and DI, Ecole Normale Supérieure

More information

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February

More information

Managing Hardware Power Saving Modes for High Performance Computing

Managing Hardware Power Saving Modes for High Performance Computing Managing Hardware Power Saving Modes for High Performance Computing Second International Green Computing Conference 2011, Orlando Timo Minartz, Michael Knobloch, Thomas Ludwig, Bernd Mohr timo.minartz@informatik.uni-hamburg.de

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

A Framework for Automatic OpenMP Code Generation

A Framework for Automatic OpenMP Code Generation 1/31 A Framework for Automatic OpenMP Code Generation Raghesh A (CS09M032) Guide: Dr. Shankar Balachandran May 2nd, 2011 Outline 2/31 The Framework An Example Necessary Background Polyhedral Model SCoP

More information

Topology and affinity aware hierarchical and distributed load-balancing in Charm++

Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National

More information

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

The Polyhedral Model Is More Widely Applicable Than You Think

The Polyhedral Model Is More Widely Applicable Than You Think The Polyhedral Model Is More Widely Applicable Than You Think Mohamed-Walid Benabderrahmane 1 Louis-Noël Pouchet 1,2 Albert Cohen 1 Cédric Bastoul 1 1 ALCHEMY group, INRIA Saclay / University of Paris-Sud

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Neural Network Assisted Tile Size Selection

Neural Network Assisted Tile Size Selection Neural Network Assisted Tile Size Selection Mohammed Rahman, Louis-Noël Pouchet and P. Sadayappan Dept. of Computer Science and Engineering Ohio State University June 22, 2010 iwapt 2010 Workshop Berkeley,

More information

Compiler-Directed Memory Hierarchy Design for Low-Energy Embedded Systems

Compiler-Directed Memory Hierarchy Design for Low-Energy Embedded Systems Compiler-Directed Memory Hierarchy Design for Low-Energy Embedded Systems Florin Balasa American University in Cairo Noha Abuaesh American University in Cairo Ilie I. Luican Microsoft Inc., USA Cristian

More information

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen and Nicolas Vasilache ALCHEMY, INRIA Futurs / University of Paris-Sud XI March

More information

COL862 Programming Assignment-1

COL862 Programming Assignment-1 Submitted By: Rajesh Kedia (214CSZ8383) COL862 Programming Assignment-1 Objective: Understand the power and energy behavior of various benchmarks on different types of x86 based systems. We explore a laptop,

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

Solving the Live-out Iterator Problem, Part I

Solving the Live-out Iterator Problem, Part I Louis-Noël Pouchet pouchet@cse.ohio-state.edu Dept. of Computer Science and Engineering, the Ohio State University September 2010 888.11 Outline: Reminder: step-by-step methodology 1 Problem definition:

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory

More information

1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core

1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core 1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload

More information

Exploration of Cache Coherent CPU- FPGA Heterogeneous System

Exploration of Cache Coherent CPU- FPGA Heterogeneous System Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Performance Modeling and Analysis of Flash based Storage Devices

Performance Modeling and Analysis of Flash based Storage Devices Performance Modeling and Analysis of Flash based Storage Devices H. Howie Huang, Shan Li George Washington University Alex Szalay, Andreas Terzis Johns Hopkins University MSST 11 May 26, 2011 NAND Flash

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

OASIS: Self-tuning Storage for Applications

OASIS: Self-tuning Storage for Applications OASIS: Self-tuning Storage for Applications Kostas Magoutis, Prasenjit Sarkar, Gauri Shah 14 th NASA Goddard- 23 rd IEEE Mass Storage Systems Technologies, College Park, MD, May 17, 2006 Outline Motivation

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information
