System Software Solutions for Exploiting Power Limited HPC Systems
1 System Software Solutions for Exploiting Power Limited HPC Systems. Martin Schulz, LLNL/CASC. 45th SPEEDUP Workshop on High-Performance Computing, September 2016, Basel, Switzerland. LLNL-PRES. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA. Lawrence Livermore National Security, LLC
2 Power Limits Will Be a Reality
- Facility issues: procuring large amounts of power is difficult; large investment costs; operating costs
- Agencies are setting limits (e.g., DOE: 20MW for exascale)
- Expensive energy prices in some regions
- Heating and cooling issues: need to get power out of the system again; a problem for very dense systems
- Consequences: facility-, system-, or rack-level power bounds; limits the hardware that can be purchased
- Need to use power more efficiently!
3 Power Consumption on Vulcan, a BG/Q System (chart: measured power draw vs. provisioned power)
4 Goal: 20MW (chart: provisioned power vs. wasted power)
5 Goal: 20MW (chart: provisioned power)
6 Goal: 20MW. Need to Enforce Power Caps
- Overprovisioned systems: more hardware than can be powered at once
- Need mechanisms to control and cap power
- The amount of capping can depend on multiple issues: physical limits of a facility; budget and end-of-year constraints; renewable energy sources
- Power as a constraint, not an optimization goal
7 Power-Capping Is Not New
- Chip vendors are starting to build this into their CPUs; most prominent: Intel's RAPL for CPUs and memory
- Vendors are looking into other system components as well
- State of the art in GPUs: limited by power draw over the PCI bus; GPUs can clock themselves down, but this is typically not exposed
- Open questions: how to apply it to large-scale parallel applications? What is the impact on performance, and how can we mitigate it?
- Problem definition: given a job-level power constraint, how do we optimize application performance?
8 Naïve Scheme: Uniform, Static Allocation
- Equally distribute: enforce the power constraint uniformly over all nodes of a job; can use Running Average Power Limit (RAPL)
- Statically select a configuration under the power constraint; a configuration is a number of cores and their core positioning
- Commonly used: packed configuration with the maximum cores possible under the power constraint
- This leads to several severe problems
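The naïve scheme above can be sketched in a few lines. This is an illustrative toy model, not the talk's implementation; the idle and per-core power coefficients are made-up values standing in for a real machine characterization.

```python
# Toy sketch of the naive scheme: divide the job-level power bound equally
# across nodes, then statically pick the largest "packed" core count whose
# estimated draw fits under the per-node cap.

def uniform_static_allocation(job_power_w, num_nodes,
                              idle_w=20.0, per_core_w=8.0, max_cores=16):
    """Return (per-node cap in W, packed core count) under a job power bound.
    idle_w and per_core_w are hypothetical illustrative coefficients."""
    node_cap = job_power_w / num_nodes              # equal split, fixed for the run
    cores = int((node_cap - idle_w) // per_core_w)  # cores that fit under the cap
    return node_cap, max(1, min(max_cores, cores))

cap, cores = uniform_static_allocation(4500, 32)    # the 4500W bound from slide 9
print(cap, cores)  # 140.625 W per node, 15 cores packed
```

The cap never moves during the run, which is exactly why the three issues on the following slides arise.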
9 1st Issue: Node/Core/Cap Configurations Matter
- Experiment on 32 nodes of LLNL's TLCC system with a global power bound (sp-mz, 4500W power bound)
- Naïve configuration (in red): running all cores at the maximum power per processor, which limits the number of nodes
- Best configuration (in blue): moderate per-node power bound, reduced number of cores
- Difference: > 2x
- Consequences: determining the right configuration is critical; intuition is insufficient, so we need models
- (Chart: runtime in seconds over nodes (8-32), cores (4-16), and processor power bounds of 51, 65, 80, 95, and 115 watts)
10 Application and Bound Dependent
11 2nd Issue: Inhomogeneous Power Usage
- Experiment: dual-socket, 8-core nodes; mg.C.8 on 64 processors; 34 power bounds
- Power caps lead to unintended noise: power variations turn into load imbalance
- Question: how can we compensate for this?
12 Large Scale Power Capping: 95W
- Census across 2386 Sandy Bridge processors; one point per processor in the system
- Performance normalized to the fastest unbounded run (X-axis: normalized slowdown; Y-axis: CPU clock frequency); slowest processors circled
- mg (multigrid) runs at 105W; ep (embarrassingly parallel) runs at 90W
13 Large Scale Power Capping: 80/65W (chart: CPU clock frequency vs. normalized slowdown)
14 Large Scale Power Capping: < 51W (chart: CPU clock frequency vs. normalized slowdown)
15 Large Scale Power Capping: Summary
- Power capping makes systems heterogeneous: we need more flexible task scheduling and the ability to absorb slack; this must be taken into account during load balancing
- Slowdown under power caps is application specific: we can't use a single knob such as processor/silicon efficiency; we need to compensate per application and per CPU
16 Variation-Aware Power-Budgeting Strategy
- Compensate the power budget on each processor:
  1. Pre-characterize the power variation of each processor
  2. Profile the power-performance trend of the target application on a single node
  3. Create an application-dependent power-performance model that takes variation into account
  4. Calculate the power budget for each node that maximizes performance
- Workflow: source code analysis inserts Power Measurement & Management Directives (PMMDs) into the HPC application; test runs on a single module with the application input data produce an application power profile and an application-dependent power model table (all modules); the variation-aware power-budgeting algorithm combines these inputs with the application-level power constraint, the module allocation from the scheduler, and the application-independent system power variation table, and outputs module-level power allocations for the final application runs
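The core idea of step 4 can be illustrated with a deliberately simplified model (this is a sketch under assumed linear power-frequency behavior, not the paper's actual model): since processors differ in how much power they need per unit of frequency, the budget should equalize frequency rather than power.

```python
# Toy variation-aware budgeting: each processor has a measured power cost
# per GHz (from a pre-characterized variation table). Under a job-level
# bound, give every processor the budget that lets all of them run at the
# same frequency, so no processor becomes a straggler.

def variation_aware_budget(job_power_w, watts_per_ghz):
    """watts_per_ghz: hypothetical per-processor power cost per GHz.
    Returns per-processor budgets summing to job_power_w."""
    common_freq = job_power_w / sum(watts_per_ghz)  # GHz every processor can reach
    return [w * common_freq for w in watts_per_ghz]

# Less efficient processors (higher W/GHz) receive larger budgets.
budgets = variation_aware_budget(300.0, [30.0, 35.0, 40.0])
```

A uniform split would instead hand each processor 100W, letting the efficient one race ahead while the inefficient one drags the bulk-synchronous application down.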
17 Variation-Aware Power Budgeting: Results
18 3rd Issue: Application Imbalance Can Lead to Inefficient Power Usage
- ParaDiS: average power usage per node shows significant differences across nodes
- Unused power on under-consuming nodes is wasted
- Difference in power usage of up to 36%!
19 The Need for a Power-Aware Runtime
- Requirements for selecting a performance-optimizing configuration: cannot violate the power constraint; constrain the overhead of reallocating power; do not slow down the critical path; low-overhead, on-line profiling of the configuration space
- Need for runtime configuration and optimization: behavior is application and input dependent, and hard to predict and model a priori; system variability depends on the actual hardware resources
- Conductor: a run-time system for efficient power allocation
20 Conductor Step I: Configuration Selection
- A configuration = thread level / power setting
- Profile the configuration space on-line: run each computation operation on individual nodes at distinct configurations; record the power/performance characteristics of each configuration
- Intended for iterative applications with predictable computation behavior
21 Conductor Step II: Power Reallocation
- How can we allocate power to the critical operations in an application and improve performance?
- (Diagram: before reallocation, processes P0 and P1 run computations C0 and C1 at medium power and one process sits in slack before the barrier; after reallocation, the critical computation runs at high power and the non-critical one at low power, shrinking the time to the barrier)
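The reallocation idea in the diagram can be sketched as follows. This is a hedged illustration, not Conductor's algorithm: the 10% donation fraction and the function names are assumptions made for the example.

```python
# Sketch: shift power from processes that arrive early at a barrier
# (i.e., have slack) toward the critical, slowest processes, keeping the
# total job power constant.

def reallocate_power(caps, times):
    """caps: per-process power caps (W); times: times-to-barrier (s).
    Non-critical processes donate 10% of their cap; the critical
    (slowest) processes split the donated power evenly."""
    t_crit = max(times)
    donors = [i for i, t in enumerate(times) if t < t_crit]
    recipients = [i for i, t in enumerate(times) if t == t_crit]
    donated = sum(0.10 * caps[i] for i in donors)
    new = list(caps)
    for i in donors:
        new[i] -= 0.10 * caps[i]        # slows a task that had slack anyway
    for i in recipients:
        new[i] += donated / len(recipients)  # speeds up the critical path
    return new

new_caps = reallocate_power([80.0, 80.0, 80.0], [4.0, 6.0, 4.0])
print(new_caps)  # [72.0, 96.0, 72.0]: the slow middle process gains power
```

The invariant worth noting is that the sum of caps is preserved, so the job-level constraint is never violated.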
22 Step 1: Explore Configurations
- Each MPI process explores a distinct configuration k_1, k_2, ..., k_n
- An Allreduce gathers {power, execution time} for every configuration
- Perform this step for each OpenMP region in a time step
23 Step 2: Select Configuration
- Construct the Pareto frontier over the explored {power, execution time} points
- Select the configuration k_OPT; example: a 40W node-level power constraint
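Step 2 above can be made concrete with a short sketch (illustrative only, with made-up measurement points; not Conductor's code): keep the non-dominated (power, time) points, then pick the fastest one that fits under the node-level constraint.

```python
# Build the Pareto frontier of (power_W, time_s) configuration measurements
# and select the fastest configuration under a power constraint.

def pareto_frontier(points):
    """Keep points not dominated by any configuration that is both
    cheaper in power and faster in time."""
    frontier = []
    for p, t in sorted(points):          # ascending power
        if not frontier or t < frontier[-1][1]:
            frontier.append((p, t))      # strictly faster than all cheaper points
    return frontier

def select_config(points, power_constraint):
    """Return the fastest Pareto-optimal (power, time) under the bound,
    or None if nothing fits."""
    feasible = [(p, t) for p, t in pareto_frontier(points)
                if p <= power_constraint]
    return min(feasible, key=lambda pt: pt[1]) if feasible else None

configs = [(25, 9.0), (30, 7.5), (35, 7.6), (38, 6.1), (45, 5.0)]
best = select_config(configs, 40)  # -> (38, 6.1)
```

Note that (35, 7.6) is discarded: it costs more power than (30, 7.5) yet runs slower, so it can never be the right choice at any constraint.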
24 Step 3: Save Power in Non-Critical Tasks
- If a computation is non-critical, slow it down: running it below the power constraint (P_constraint) absorbs slack and creates power headroom
- (Diagram: per-process power over time; slowing non-critical computations C lowers the average power P_average and opens headroom under P_constraint)
25 Step 4: Monitor Power Usage
- (Flowchart: explore configurations; construct the Pareto frontier; select configuration k_OPT; if a computation is non-critical, slow it down; start and end power monitoring around each computation)
26 Step 5: Reallocate Power
- Calculate the new power headroom and distribute the new power allocation
- (Flowchart as in the previous steps, extended with the headroom calculation and distribution; charts: ParaDiS before and after power reallocation)
27 Experimental Setup
- Test system: the cab cluster at LLNL, an Intel Sandy Bridge cluster with 1296 nodes yielding close to 1/2 PFLOP/s; dual 8-core Intel Xeon E per node; InfiniBand QDR interconnect; socket-level power management interface (e.g., RAPL)
- Benchmarks (MPI+OpenMP):
  Benchmark       | Input     | Iterations
  BT-MZ and SP-MZ | Class D   | 500
  LULESH          | 32 x 32 x |
  ParaDiS         | Copper    | 500
  CoMD            | 40 x 40 x |
- Up to 512 cores: 64 MPI processes, 8 OpenMP threads/process
28 Evaluation: Configuration Selection
- Up to 30% speedup over the static scheme; benefits configuration-sensitive applications
29 Evaluation: Power Reallocation
- Up to 13% speedup over the static scheme; benefits from process-level imbalance in power usage
30 System-Wide Power Optimization
- Power is a global resource: the system power cap must be divided among jobs; static division results in power fragmentation; dynamic management can utilize open resources
- Power-aware resource management: power as a controlled resource that is allocated, combined with user input
- Combined with runtime adaptation: detection and reallocation of unused power
- Part of a global operating system: transparent to the application; need to maintain fairness
- (Diagram: system, enclave, boards, nodes, libmsr)
31 Application-Agnostic Tuning in a Global OS (HPC system model; picture courtesy of Daniel Ellsworth)
32 POWsched: The Basic Idea
- Monitor the power usage of all applications in the system: OS-level power monitors; scalable aggregation in a hierarchy representing the applications and a hierarchy representing the system
- Detect wasted power: applications that don't use their allocation
- Detect needed power: applications that run against their power limit
- Shift wasted power to the applications that need it
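One scheduling step of this idea can be sketched as follows. This is a minimal illustration with a hypothetical slack margin, not POWsched's actual policy.

```python
# Sketch of one POWsched-style step: jobs drawing well under their limit
# donate their unused watts; jobs pinned against their limit split the
# reclaimed pool. The system-wide sum of limits stays constant.

def powsched_step(limits, usage, slack_margin=5.0):
    """limits/usage: per-job power limit and measured draw (W).
    slack_margin is an assumed threshold separating 'wasting' jobs
    from 'capped' jobs."""
    wasted = {i: limits[i] - usage[i] - slack_margin
              for i in range(len(limits))
              if limits[i] - usage[i] > slack_margin}
    capped = [i for i in range(len(limits))
              if limits[i] - usage[i] <= slack_margin]
    if not capped:                      # nobody needs power; leave limits alone
        return list(limits)
    pool = sum(wasted.values())
    new_limits = list(limits)
    for i, w in wasted.items():
        new_limits[i] -= w              # reclaim unused allocation
    for i in capped:
        new_limits[i] += pool / len(capped)  # grant it to power-hungry jobs
    return new_limits

# Job 0 draws 60W of its 100W; job 1 is pinned at 99W of 100W.
print(powsched_step([100.0, 100.0], [60.0, 99.0]))  # [65.0, 135.0]
```

Run periodically, such a loop converges limits toward actual demand while the facility-level cap (the sum of limits) is never exceeded.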
33 POWsched Results: 50W Static and Dynamic
34 POWsched Results: Static vs Dynamic
35 Conclusions
- We need dynamic solutions to deal with power limits: power is a dynamic property, so on-line adaptation and tuning must be part of the solution
- Discussed projects: sources and effects of variability; critical-path power scaling; global power redistribution
- Common thread: steer power to where it is needed (the slower processor, the critical path, the limited process); monitor, determine headroom, redistribute headroom, and repeat
- Where to go next? Finer-grained hardware control (beyond sockets); more application configuration options (different code paths and algorithms depending on capping levels, which increases the search space); integration of and cooperation between the job and system levels
36 PIPER Project: The Scalability Team
- Postdocs and staff: Abhinav Bhatele, Kathryn Mohror, Todd Gamblin, Ignacio Laguna, David Beckingsale, David Boehme, Murali Emani, Tanzima Islam, Barry Rountree, Martin Schulz, Aniruddha Marathe, Tapasya Patki, Kento Sato, Jae-Seung Yeom
- Focus areas: performance analysis tools and optimization; correctness and debugging (incl. STAT, AutomaDeD, MUST); power-aware and power-limited computing (incl. Adagio, Conductor); resilience and checkpoint/restart (incl. SCR)
Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne
More informationMILC Performance Benchmark and Profiling. April 2013
MILC Performance Benchmark and Profiling April 2013 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting
More informationPowerExecutive. Tom Brey IBM Agenda. Why PowerExecutive. Fundamentals of PowerExecutive. - The Data Center Power/Cooling Crisis.
PowerExecutive IBM Agenda Why PowerExecutive - The Data Center Power/Cooling Crisis Fundamentals of PowerExecutive 1 The Data Center Power/Cooling Crisis Customers want more IT processing cycles to run
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationOperational Robustness of Accelerator Aware MPI
Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationUnified Runtime for PGAS and MPI over OFED
Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction
More informationThe IBM Blue Gene/Q: Application performance, scalability and optimisation
The IBM Blue Gene/Q: Application performance, scalability and optimisation Mike Ashworth, Andrew Porter Scientific Computing Department & STFC Hartree Centre Manish Modani IBM STFC Daresbury Laboratory,
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationBest Practices for Setting BIOS Parameters for Performance
White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page
More informationParallel Applications on Distributed Memory Systems. Le Yan HPC User LSU
Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationA POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL
A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH
More informationJob Startup at Exascale: Challenges and Solutions
Job Startup at Exascale: Challenges and Solutions Sourav Chakraborty Advisor: Dhabaleswar K (DK) Panda The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Tremendous increase
More informationOutline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2018 Outline Performance scalability Analytical performance measures Amdahl
More informationLBANN: Livermore Big Ar.ficial Neural Network HPC Toolkit
LBANN: Livermore Big Ar.ficial Neural Network HPC Toolkit MLHPC 2015 Nov. 15, 2015 Brian Van Essen, Hyojin Kim, Roger Pearce, Kofi Boakye, Barry Chen Center for Applied ScienDfic CompuDng (CASC) + ComputaDonal
More informationAccelerating Implicit LS-DYNA with GPU
Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,
More informationAggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments
Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Swen Böhm 1,2, Christian Engelmann 2, and Stephen L. Scott 2 1 Department of Computer
More informationSlurm BOF SC13 Bull s Slurm roadmap
Slurm BOF SC13 Bull s Slurm roadmap SC13 Eric Monchalin Head of Extreme Computing R&D 1 Bullx BM values Bullx BM bullx MPI integration ( runtime) Automatic Placement coherency Scalable launching through
More informationHPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh
HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight
More information