Static and Dynamic Frequency Scaling on Multicore CPUs
Transcription
1 Static and Dynamic Frequency Scaling on Multicore CPUs
Wenlei Bao (1), Changwan Hong (1), Sudheer Chunduri (2), Sriram Krishnamoorthy (3), Louis-Noël Pouchet (4), Fabrice Rastello (5), P. Sadayappan (1)
(1) The Ohio State University, (2) Argonne National Laboratory, (3) Pacific Northwest National Laboratory, (4) Colorado State University, (5) INRIA
High Performance and Embedded Architecture and Compilation
2 Outline
1. Motivation & Overview
2. Hybrid Static/Dynamic Frequency Scaling
3. Experimental Evaluation
4 Motivation & Overview: Energy Efficiency
Energy efficiency is of great importance in settings ranging from battery-operated devices to data centers.
Dynamic Voltage and Frequency Scaling (DVFS):
- Well studied and widely adopted in both personal computing devices and large-scale clusters.
- A fundamental control mechanism that trades performance against energy to improve energy efficiency.
- Energy savings are achieved by scaling voltage and frequency.
5 Motivation & Overview: Typical DVFS approaches
Limitations observed among typical DVFS approaches:
- The frequency that optimizes CPU energy varies significantly across processors. Different processors have different optimal frequencies, even for the same compute-bound application.
- Optimizing energy for a parallel application depends heavily on its parallel scaling, which in turn depends on the operating frequency. This is mostly ignored in previous related work.
- Dynamic schemes (e.g. the Linux on-demand governor) remain constrained to runtime inspection at specific time intervals. Short-running program phases see no benefit compared with using the best frequency from the start.
6 Motivation & Overview: Our Proposed Approach
A compile-time approach to select the best frequency for affine program regions:
1. Static analysis approximates the operational intensity and the parallel scaling.
2. Each program region is placed in a category such as compute-bound or bandwidth-bound.
3. A frequency/core pair is selected for each program category from one-time energy (E) and energy-delay product (EDP) profiling through microbenchmarking.
A lightweight runtime approach throttles frequency based on application energy efficiency. It is used as a baseline to compare and validate against our compile-time approach.
7 Motivation & Overview: Overview of the flow
[Figure: flow diagram with the stages Input program (source code), Analysis (extract features), Categorization (sequential, poor scalability, compute-bound, bandwidth-bound), and Frequency/Core Decision Tree.]
9 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
CPU-only energy on 4 Intel CPUs for DGEMM/MKL
[Figure: Energy consumption of DGEMM/MKL on 4 CPUs. Four panels, one per CPU; x-axis: Frequency (GHz); y-axis: Energy (J); one curve per core count.]
4 Intel x86 CPUs: SandyBridge, Ivybridge, Haswell.
Representative benchmarks:
- DGEMM/MKL: compute-bound out-of-core computation.
- Jacobi-2D: bandwidth-bound computation.
Table: Processor characteristics
Intel CPU      Microarch.   Node  Cores  L1   L2    L3      Freq. Range  Voltage Range
Core i7-2600K  SandyBridge  32nm  4      32K  256K  8192K   … GHz        … V
Xeon E         Ivybridge    22nm  4      32K  256K  8192K   … GHz        … V
Xeon E5-2650   Ivybridge    22nm  8      32K  256K  20480K  … GHz        … V
Core i7-4770K  Haswell      22nm  4      32K  256K  8192K   … GHz        … V
10 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
CPU-only energy on 4 Intel CPUs for Jacobi-2D
[Figure: Energy consumption of Jacobi-2D on 4 CPUs. Four panels, one per CPU; x-axis: Frequency (GHz); y-axis: Energy (J); one curve per core count.]
1. Effect of Execution Time
- The optimal frequency shifts towards lower frequencies compared to the compute-bound case.
- Execution time decreases more slowly than frequency increases: the expected acceleration is not achieved.
2. Effect of Weak Scaling
- Weak scaling: adding cores does not decrease execution time linearly (bandwidth saturation).
- This leads to a higher energy cost for 4/8 cores than for a single core when increasing frequency.
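The shift of the optimal frequency for bandwidth-bound code can be illustrated with a toy energy model. This is only a sketch under textbook-style assumptions that are not taken from the slides: power is modeled as a static term plus a cubic term in frequency, and only the compute part of the execution time scales with frequency. All constants and the names `energy` and `best_frequency` are made up for illustration.

```python
def energy(f_ghz, compute_s_at_1ghz, memory_s, p_static_w=40.0, k=1.0):
    """Toy CPU energy (J) at frequency f_ghz.

    Assumptions (illustrative only): compute time scales as 1/f, memory
    time does not scale with core frequency, power = static + k * f^3.
    """
    time_s = compute_s_at_1ghz / f_ghz + memory_s
    power_w = p_static_w + k * f_ghz ** 3
    return power_w * time_s


def best_frequency(freqs, compute_s_at_1ghz, memory_s):
    """Return the frequency in freqs that minimizes the toy energy."""
    return min(freqs, key=lambda f: energy(f, compute_s_at_1ghz, memory_s))


FREQS = [1.2, 1.6, 2.0, 2.4, 2.8, 3.2, 3.6]

# Compute-bound region: all time scales with frequency.
f_compute = best_frequency(FREQS, compute_s_at_1ghz=10.0, memory_s=0.0)

# Region with a large frequency-insensitive memory component.
f_memory = best_frequency(FREQS, compute_s_at_1ghz=6.0, memory_s=4.0)

print(f_compute, f_memory)
```

With these made-up constants the minimum is interior in both cases and moves to a lower frequency once part of the execution time stops scaling with frequency, which is the qualitative behavior the slide describes.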
11 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
Summary of motivations:
- Optimal CPU energy is at neither the minimum nor the maximum frequency in most cases.
- Optimal CPU energy is achieved at different frequencies depending on the number of cores used.
- Effect of execution time: lowers the best frequency compared to the compute-bound case.
- Effect of weak scaling: higher energy cost for multiple cores than for a single core.
Objectives:
- Goal #1: Characterize programs into different categories.
- Goal #2: Make the analysis fast enough for large programs (e.g. 1 s for 1… lines), trading result accuracy for analysis time.
- Goal #3: Select the best frequency/core pair to optimize CPU energy for affine kernels.
13 Hybrid Static/Dynamic Frequency Scaling: Extracting Features
Overview of the process:
1. Extract the polyhedral representation of the program (including OpenMP doall).
2. Inline parameter values (many of our analyses are not parametric).
3. Count the number of operations in each loop / region of interest.
4. Compute the data space (in cache lines) accessed by each loop / region.
5. Run various algorithms to estimate cache misses, thread workload, etc.
Core features currently extracted:
- EXACT: number of FLOPs
- EXACT: data footprint (read/written)
- APPROX: data cache misses (at each level, incl. shared/private)
- APPROX: OpenMP thread workload
14 Hybrid Static/Dynamic Frequency Scaling: Approximating Cache Misses
Metric used: Operational Intensity (OI)
- OI: ratio of operations executed per byte of data transferred to/from memory.
- OI is sufficient to categorize a program region as compute-bound or bandwidth-bound.
Counting cache misses:
1. Slice the region of interest (e.g. one OpenMP thread) and inline parameters.
2. Specify cache properties: shared/private, size in kB, cache line size in B.
3. Compute the data space accessed by each loop, in terms of distinct cache lines (DL).
4. Apply an ad-hoc algorithm to estimate misses at each loop level, based on DL.
Some observations:
- Massive problem simplification: associativity is not considered, only capacity/compulsory misses.
- Data prefetching (next-line), inter-iteration data reuse, and SIMD are all taken into account.
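Step 3 above (data space in distinct cache lines) can be illustrated by brute-force enumeration. The slides compute this on the polyhedral representation; the sketch below instead walks every iteration point explicitly, which only works for small, fixed sizes. It assumes a single row-major 2D array that starts on a cache-line boundary, and the helper name `distinct_cache_lines` is hypothetical.

```python
from itertools import product


def distinct_cache_lines(domain, accesses, row_len, elems_per_line):
    """Count distinct cache lines touched by a loop nest (brute force).

    domain:   one range per loop, outermost first
    accesses: functions mapping an iteration point to an (i, j) array index
    Assumes one row-major array aligned to a cache-line boundary.
    """
    lines = set()
    for point in product(*domain):
        for access in accesses:
            i, j = access(*point)
            lines.add((i * row_len + j) // elems_per_line)
    return len(lines)


# Reads of a 5-point Jacobi-2D stencil over the interior of an N x N grid.
N = 8
interior = [range(1, N - 1), range(1, N - 1)]
jacobi_reads = [
    lambda i, j: (i, j),
    lambda i, j: (i - 1, j),
    lambda i, j: (i + 1, j),
    lambda i, j: (i, j - 1),
    lambda i, j: (i, j + 1),
]

# 8 doubles per 64-byte line, so each row of the 8 x 8 grid is one line.
footprint = distinct_cache_lines(interior, jacobi_reads, row_len=N, elems_per_line=8)
print(footprint)
```

Here every row of the grid is touched by some read, so the footprint is one line per row.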
15 Hybrid Static/Dynamic Frequency Scaling: Approximating Cache Misses
Algorithm 1 EstimateCacheMisses
Input: number of array elements per cache line: ls; cache size, in lines: C; polyhedral program: P
Output: estimate of misses

for all loops l do
    Misses[l] = DLbase[l] = DLnext[l] = Processed[l] = 0
end for
for all arrays A do                                  <= Compute DL for each loop
    for all loops l do
        DLbase[l] += #(DL_A^{PS_{l,0}})
        DLnext[l] += #(DL_A^{PS_{l,0}} ∩ DL_A^{PS_{l,1}})
    end for
end for
for all loops l in postfix AST order and Processed[l] == 0 do    <= Recursively traverse the program, bottom-up
    if DLbase[l] < C then
        Misses[l] = DLbase[l]
    else
        if l is inner-most loop then
            Misses[l] = DLbase[l]
        else
            Miss = 0
            for all loops ll immediately surrounded by l do
                Miss += Misses[ll]
                Processed[ll] = 1
            end for
            Misses[l] = Miss * TripCount(l)
            if DLnext[l] < C then                    <= Handle inter-iteration reuse (experimental)
                Misses[l] -= DLnext[l]
            end if
        end if
    end if
    Processed[l] = 1
end for
return Misses[fk]
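The bottom-up traversal of Algorithm 1 can be sketched in Python as a recursion over a loop tree. This is an illustration, not the PolyFeat implementation: the `Loop` class is hypothetical, and `dl_base` and `dl_next` are assumed to be precomputed per loop (in the slides they come from polyhedral counting, which is not reproduced here). The recursion replaces the explicit postfix order and `Processed` flags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loop:
    """Hypothetical loop-tree node with precomputed cache-line counts."""
    trip_count: int
    dl_base: int  # distinct cache lines attributed to the loop (assumed given)
    dl_next: int  # lines reused across consecutive iterations (assumed given)
    children: List["Loop"] = field(default_factory=list)


def estimate_misses(loop: Loop, cache_lines: int) -> int:
    """Estimate misses for one loop, following the structure of Algorithm 1."""
    # Data fits in cache, or innermost loop: count each distinct line once.
    if loop.dl_base < cache_lines or not loop.children:
        return loop.dl_base
    # Otherwise sum the inner loops' misses and repeat per iteration.
    inner = sum(estimate_misses(child, cache_lines) for child in loop.children)
    misses = inner * loop.trip_count
    # Inter-iteration reuse: subtract reused lines if they fit in cache.
    if loop.dl_next < cache_lines:
        misses -= loop.dl_next
    return misses


inner = Loop(trip_count=100, dl_base=50, dl_next=0)
outer = Loop(trip_count=10, dl_base=500, dl_next=20, children=[inner])

small_cache = estimate_misses(outer, cache_lines=64)
large_cache = estimate_misses(outer, cache_lines=1024)
print(small_cache, large_cache)
```

With a 64-line cache the outer loop's data does not fit, so the inner estimate (50) is multiplied by the outer trip count and the reused lines are subtracted. With a 1024-line cache the whole footprint fits and only the distinct lines are counted.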
16 Hybrid Static/Dynamic Frequency Scaling: Approximating FLOPs and Computing OI
Counting FLOPs: collect the number of operations that are not part of an array subscript, and multiply by the number of points in the iteration domain of the statement.
Require: cache line size, in number of elements: ls; cache size, in lines: C; polyhedral program: P; parameter values: n
Ensure: estimate of OI
P = attachContextInformation(n, P)
Misses = EstimateCacheMisses(ls, C, P)
Flops = countFlops(P)
return Flops / (Misses * ls)
Parallelism features:
- Sequential/parallel: check whether any OpenMP doall pragma exists.
- Poor scaling: performance improvement does not scale with the number of cores. Detected by counting the number of active/inactive threads.
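The final ratio can be written out as a small helper. The sketch follows the formula on the slide, which divides by Misses * ls (array elements transferred); converting to bytes would additionally need the element size, which the slide does not give. The function names and the numbers in the example are illustrative assumptions, not measured values.

```python
def count_flops(ops_per_instance: int, domain_points: int) -> int:
    """FLOPs of one statement: operations per instance times iteration points."""
    return ops_per_instance * domain_points


def operational_intensity(flops: int, misses: int, elems_per_line: int) -> float:
    """OI as on the slide: Flops / (Misses * ls)."""
    if misses <= 0 or elems_per_line <= 0:
        raise ValueError("misses and elems_per_line must be positive")
    return flops / (misses * elems_per_line)


# Matrix-multiply-like statement: one multiply and one add per point of an
# N^3 iteration domain. The miss count below is a made-up input.
N = 100
flops = count_flops(ops_per_instance=2, domain_points=N ** 3)
oi = operational_intensity(flops, misses=125_000, elems_per_line=8)
print(flops, oi)
```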
17 Hybrid Static/Dynamic Frequency Scaling: Decision Tree to Select Frequency/Core Pair
The frequency/core pair assigned to each category is obtained through program profiling.
Decision tree:
1. If the code is sequential, choose config_sequential;
2. else if the code is bandwidth-bound (OI < threshold), choose config_bwbound;
3. else if the code has poor scaling, choose config_poorscale;
4. else choose config_computebound.
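The decision tree translates directly into a chain of conditionals. In the sketch below the four configuration names come from the slide, while the `Config` type, the frequency/core values, and the OI threshold are placeholders: on a real machine they would come from the one-time profiling step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    freq_ghz: float
    cores: int


# Placeholder values only; real pairs come from per-machine profiling.
CONFIGS = {
    "config_sequential": Config(freq_ghz=3.0, cores=1),
    "config_bwbound": Config(freq_ghz=1.6, cores=2),
    "config_poorscale": Config(freq_ghz=2.4, cores=2),
    "config_computebound": Config(freq_ghz=2.8, cores=4),
}


def select_config(is_parallel: bool, oi: float, poor_scaling: bool,
                  oi_threshold: float) -> str:
    """Return the configuration name chosen by the decision tree."""
    if not is_parallel:
        return "config_sequential"
    if oi < oi_threshold:
        return "config_bwbound"
    if poor_scaling:
        return "config_poorscale"
    return "config_computebound"


choice = select_config(is_parallel=True, oi=0.5, poor_scaling=False, oi_threshold=1.0)
print(choice, CONFIGS[choice])
```

The order of the tests matters: a sequential region is never classified by OI, and a bandwidth-bound region is never checked for poor scaling.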
19 Experimental Evaluation: Experimental Protocol
Evaluation of the compile-time approach:
- PolyBench/C benchmarks optimized by PoCC (60 in total).
- PolyBench-Parallel: optimized with auto-parallelization.
- PolyBench-Poly: optimized for data locality (tiling, coarse/fine-grain parallelism).
- The compile-time approach is implemented as PolyFeat in PoCC.
Collecting energy/time: PCM for Intel CPUs, IBM AMESTER for POWER8.
Table: Number of benchmarks in each category
Benchmarks          seq/par  bw-bound  poor scale  comp-bound
polybench-parallel  12/…     …         …           …
polybench-poly      5/…      …         …           …
20 Experimental Evaluation: Experimental Results
Summary of CPU-only energy and EDP improvements
[Figure: Saving in % relative to Powersave for On-Demand, CPUmiser, OurStatic, and Best, on SandyBridge, Ivybridge4, Ivybridge8, Haswell, and POWER8, each for energy (E) and EDP.]
Results:
- On-demand: energy efficiency improves for Ivybridge and POWER8, where the maximum frequency leads to minimum energy for compute-bound codes.
- CPUMiser: a runtime DVFS approach based on the CPI of the workload. It optimizes energy while keeping the performance loss under a given threshold.
- Our static approach consistently outperforms the other schemes and is close to the best achievable.
21 Experimental Evaluation: Experimental Results
Comparison with Runtime DVFS
[Figure: Saving in % (energy) on SandyBridge, Ivybridge4, Ivybridge8, and Haswell for Our Static, Our Dynamic, CPUmiser (several settings), and Ondemand (95%, 85%, 75%, 65%).]
Our dynamic approach:
- Inspects changes in energy efficiency (energy normalized by instructions).
- Decides to adapt the frequency when:
  - energy efficiency decreased compared to the previous interval (processor-specific);
  - OI changed, indicating that the computation performed has varied (application-specific).
- For intervals with similar OI, applies the energy convexity rule (gradient descent).
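One way to read the convexity rule is as a one-dimensional descent over the available frequency levels: keep stepping in the same direction while energy per instruction improves, and reverse when it gets worse. The sketch below is a guess at that behavior for illustration only; it is not the authors' controller. The function name `step_frequency`, the OI tolerance, and the choice to hold the frequency for one interval after a phase change are all assumptions.

```python
def step_frequency(idx, direction, epi_prev, epi_now, oi_prev, oi_now,
                   n_levels, oi_tol=0.2):
    """Pick the next frequency level index for one sampling interval.

    idx:        current index into the sorted list of frequency levels
    direction:  +1 (raise frequency) or -1 (lower it)
    epi_*:      energy per instruction in the previous / current interval
    oi_*:       operational intensity in the previous / current interval
    Returns (new_idx, new_direction).
    """
    # A large OI change suggests a new program phase: hold the current level
    # for one interval so the next comparison uses a fresh baseline.
    if abs(oi_now - oi_prev) > oi_tol * max(abs(oi_prev), 1e-12):
        return idx, direction
    # Similar OI: energy is assumed convex in frequency, so reverse
    # direction as soon as energy per instruction gets worse.
    if epi_now > epi_prev:
        direction = -direction
    new_idx = min(max(idx + direction, 0), n_levels - 1)
    return new_idx, direction


# Energy per instruction improved while raising frequency: keep going up.
print(step_frequency(2, +1, 4.0, 3.0, 1.0, 1.0, n_levels=5))
# It got worse: turn around.
print(step_frequency(3, +1, 3.0, 3.5, 1.0, 1.0, n_levels=5))
```

A controller of this shape never settles exactly on the minimum; it keeps moving back and forth across it, which is consistent with the oscillating frequency traces reported for single-phase programs.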
22 Experimental Evaluation: Phase Analysis
Power traces for single-phase programs
[Figure: power (P, Watt) and frequency (F, GHz) traces for jacobi-2d-imper (par) and mvt (poly).]
- Power consumption is stable; the frequency oscillates between two points, indicating that the optimal frequency lies between them.
- 46 out of 60 benchmarks have a single phase.
23 Experimental Evaluation: Phase Analysis
Power traces for multiple-phase programs
[Figure: power (P, Watt) and frequency (F, GHz) traces for covariance (par) and covariance (poly).]
- Phases appear as stable changes in frequency for part of the execution.
- Phases are not identical across the two versions, because of data locality optimization.
- 14 out of 60 benchmarks show 2 or more (up to 5) phases.
24 Experimental Evaluation: Conclusion
Take-home message: compile-time frequency and core count selection for affine programs
- New static analysis for program region categorization
- Fully implemented in PolyFeat / PoCC
- One-time machine profiling with representative benchmarks
- Significant energy savings over the powersave Linux governor
Key aspect: analysis speed versus accuracy
- Trade off precision when approximating operational intensity
- Analysis completes in < 1 second for regions of 1…+ lines
Future work: automatic region detection
- Regions can be automatically detected with our algorithm
- Allows more precise frequency selection
- Granularity driven by the cost of a frequency change
26 Thank you
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationHybrid SPM-Cache Architectures to Achieve High Time Predictability and Performance
Hybrid SPM-Cache Architectures to Achieve High Time Predictability and Performance Wei Zhang and Yiqiang Ding Department of Electrical and Computer Engineering Virginia Commonwealth University {wzhang4,ding4}@vcu.edu
More informationPipeline Parallelism and the OpenMP Doacross Construct. COMP515 - guest lecture October 27th, 2015 Jun Shirako
Pipeline Parallelism and the OpenMP Doacross Construct COMP515 - guest lecture October 27th, 2015 Jun Shirako Doall Parallelization (Recap) No loop-carried dependence among iterations of doall loop Parallel
More informationRuntime Support for Scalable Task-parallel Programs
Runtime Support for Scalable Task-parallel Programs Pacific Northwest National Lab xsig workshop May 2018 http://hpc.pnl.gov/people/sriram/ Single Program Multiple Data int main () {... } 2 Task Parallelism
More informationI. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS
Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com
More informationMotivation Goal Idea Proposition for users Study
Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan Computer and Information Science University of Oregon 23 November 2015 Overview Motivation:
More informationA Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs
A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs Muthu Manikandan Baskaran 1 Uday Bondhugula 1 Sriram Krishnamoorthy 1 J. Ramanujam 2 Atanas Rountev 1
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationProgram design and analysis
Program design and analysis Optimizing for execution time. Optimizing for energy/power. Optimizing for program size. Motivation Embedded systems must often meet deadlines. Faster may not be fast enough.
More informationEnergy Models for DVFS Processors
Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationPOWER MANAGEMENT AND ENERGY EFFICIENCY
POWER MANAGEMENT AND ENERGY EFFICIENCY * Adopted Power Management for Embedded Systems, Minsoo Ryu 2017 Operating Systems Design Euiseong Seo (euiseong@skku.edu) Need for Power Management Power consumption
More informationLarge Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System
Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from
More informationOptimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationFilesystem Performance on FreeBSD
Filesystem Performance on FreeBSD Kris Kennaway kris@freebsd.org BSDCan 2006, Ottawa, May 12 Introduction Filesystem performance has many aspects No single metric for quantifying it I will focus on aspects
More informationLRC: Dependency-Aware Cache Management for Data Analytics Clusters. Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017
LRC: Dependency-Aware Cache Management for Data Analytics Clusters Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017 Outline Cache Management for Data Analytics Clusters Inefficiency
More informationAbhishek Pandey Aman Chadha Aditya Prakash
Abhishek Pandey Aman Chadha Aditya Prakash System: Building Blocks Motivation: Problem: Determining when to scale down the frequency at runtime is an intricate task. Proposed Solution: Use Machine learning
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationS4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems
S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems Shuibing He, Xian-He Sun, Bo Feng Department of Computer Science Illinois Institute of Technology Speed Gap Between CPU and Hard Drive http://www.velobit.com/storage-performance-blog/bid/114532/living-with-the-2012-hdd-shortage
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture
More informationDVFS Space Exploration in Power-Constrained Processing-in-Memory Systems
DVFS Space Exploration in Power-Constrained Processing-in-Memory Systems Marko Scrbak and Krishna M. Kavi Computer Systems Research Laboratory Department of Computer Science & Engineering University of
More informationNo Tradeoff Low Latency + High Efficiency
No Tradeoff Low Latency + High Efficiency Christos Kozyrakis http://mast.stanford.edu Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS),
More informationAAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors
AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors Kaixi Hou, Hao Wang, Wu-chun Feng {kaixihou,hwang121,wfeng}@vt.edu Pairwise Sequence Alignment Algorithms
More informationComputing and energy performance
Equipe I M S Equipe Projet INRIA AlGorille Computing and energy performance optimization i i of a multi algorithms li l i PDE solver on CPU and GPU clusters Stéphane Vialle, Sylvain Contassot Vivier, Thomas
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationPerformance and Energy Usage of Workloads on KNL and Haswell Architectures
Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research
More informationDense Matrix Multiplication
Dense Matrix Multiplication Abhishek Somani, Debdeep Mukhopadhyay Mentor Graphics, IIT Kharagpur October 7, 2015 Abhishek, Debdeep (IIT Kgp) Matrix Mult. October 7, 2015 1 / 56 Overview 1 The Problem 2
More information