Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer
Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra
University of Edinburgh / Intel Labs Braunschweig
Introduction: Power Management
- Power and energy are now first-order constraints
- Recent CPUs allow voltage/frequency scaling: minimize energy or power within a performance constraint
Application-Driven DVFS
- Performance during memory-bound periods is largely unaffected by frequency, so reducing frequency and voltage saves energy and power
- Such periods are often recurrent, so they can be learned and predicted
Intl. Conf. on Parallel Architectures and Compilation Techniques, October
Introduction: DVFS on Many-Cores
- Voltage and frequency cannot be controlled independently for each core; control happens at the granularity of domains
- Application-level monitoring and control of the settings is possible
The challenge: power management on many-cores
- Exploit application behavior to minimize energy consumption within a performance window
- Target platform: the SCC (an experimental concept vehicle from Intel Labs) running MPI applications
Proposed Scheme Overview
A modular, hierarchical, transparent, dynamic software power management scheme for a many-core system
- Front end: phase predictor (phase search, macro phase, partition into subphases) captures application behavior without user intervention
- Back end: power manager (f/t table setup, DVFS control, table update) tracks performance vs. energy behavior and adapts
- Phase mispredictions and application performance feed back from the power manager to the phase predictor
- The power manager is split into local managers, which issue requests, and domain managers, which control the system
Contributions
- A novel power management scheme for many-core systems: hierarchical, capable of operating on domain-based systems
- A novel phase prediction scheme (SMRP, SuperMaximal Repeat phase Predictor), based on supermaximal repeat strings; better accuracy than previous approaches
- Transparent instrumentation of MPI applications
- Schemes implemented and evaluated on a real experimental many-core system
- Significant energy savings with little performance degradation: 15% energy savings on average with only 7% performance degradation, well within the 3:1 ratio of power savings to performance degradation
Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions
Power Manager: Local Manager
- Input: phases of repeatable behavior
- For each repetitive phase, a frequency/time table is built
- Iterative approach:
  1. Start at the highest frequency
  2. Measure the execution time of the current phase instance and record it in the table
  3. Reduce the frequency by one step
  4. Stop if the performance impact exceeds a threshold δ (e.g., 10%)
  5. Otherwise, repeat until the lowest frequency is reached
- Output: frequency requested per core
(Figure: cores within a voltage domain each hold per-subphase f/t tables; local frequency analysis turns them into per-core frequency requests fr0...fr7)
Local Management Example: Learning
- Sub-phase i detected: set f to the highest value (e.g., f_i1 = 800 MHz); measure execution time t_i1
- Sub-phase i detected again: set f to the next value (e.g., f_i2 = 533 MHz); measure t_i2; t_i2 < (1+δ)·t_i1, so exploration continues
- Sub-phase i detected again: set f to the next value (e.g., f_i3 = 400 MHz); measure t_i3; t_i3 > (1+δ)·t_i2, so exploration stops
- Sub-phase i detected thereafter: use f_i2 = 533 MHz, with no further exploration
- The steady-state frequency changes only if there is a phase misprediction
- In reality, the testing of new frequencies also depends on domain management decisions
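The learning loop above can be sketched in Python (a minimal sketch, not the authors' implementation; the frequency ladder, subphase identifiers, and threshold value are illustrative):

```python
# Descending frequency ladder (MHz) and degradation threshold delta.
# Both are illustrative values, not the SCC's actual tables.
FREQS = [800, 533, 400, 320, 266]
DELTA = 0.10

class LocalManager:
    def __init__(self):
        self.table = {}   # subphase id -> list of (freq, measured time)
        self.done = {}    # subphase id -> chosen steady-state frequency

    def next_freq(self, phase):
        """Frequency to request for the next instance of `phase`."""
        if phase in self.done:
            return self.done[phase]
        tried = self.table.get(phase, [])
        return FREQS[len(tried)]          # explore one step lower each time

    def report(self, phase, freq, t):
        """Record the measured execution time of one phase instance."""
        hist = self.table.setdefault(phase, [])
        hist.append((freq, t))
        prev = hist[-2][1] if len(hist) > 1 else None
        if prev is not None and t > (1 + DELTA) * prev:
            self.done[phase] = hist[-2][0]   # threshold tripped: back off one step
        elif len(hist) == len(FREQS):
            self.done[phase] = freq          # reached the lowest frequency
```

A memory-bound subphase barely slows as the ladder descends, so exploration bottoms out at a low frequency; a compute-bound subphase trips the threshold after the first step and settles near the top.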
Power Manager: Domain Manager
- One per voltage domain; decides and controls voltage and frequency
- Input: frequency requests from the local managers of each core
- Output: frequencies for the entire domain. Policies investigated:
  - Simple: service requests in order
  - Mean: select the mean of the requests
  - All_low / All_high: assign the lowest/highest frequency requested
- The voltage for the entire domain is set according to the highest frequency across its tiles, via a voltage lookup table
(Figure: requests fr0...fr7 enter domain frequency control; max(f0...f3) indexes a voltage LUT that yields Vdom)
Domain Management Example
- Local managers 0 through 7 send frequency requests fr0...fr7 (e.g., 800 MHz) to the domain manager
- The domain manager applies its policy f() to the requests, yielding the tile frequencies f_doms (e.g., 800 MHz)
- The SCC voltage/frequency table then sets the domain voltage: max(f_doms) = 800 MHz gives V_dom = 1.1 V
- The process repeats as phases recur
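A sketch of the domain-level decision, using the policy names from the slides (the "simple" interpretation as taking the latest request, and the V/F lookup function, are assumptions; a real manager would also snap the mean to the nearest available frequency step):

```python
def domain_frequency(requests, policy):
    """Combine per-core frequency requests into one domain frequency."""
    if policy == "simple":     # service requests in order: follow the latest
        return requests[-1]
    if policy == "mean":       # arithmetic mean of all requests
        return sum(requests) / len(requests)
    if policy == "all_low":    # most aggressive power saving
        return min(requests)
    if policy == "all_high":   # most conservative: never slow anyone down
        return max(requests)
    raise ValueError(policy)

def domain_voltage(tile_freqs, freq_to_volt):
    # The voltage must support the fastest tile in the voltage domain.
    return freq_to_volt(max(tile_freqs))
```

The slides report that the arithmetic-mean policy performed best among these.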
Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions
Phase Predictor
- MPI applications exhibit recurring patterns: communication and execution patterns at MPI-event granularity that are highly repeatable
- Pattern detection uses a supermaximal repeat string algorithm
- The predictor predicts the next call and program phase, with a projected execution time, and places DVFS scheduling points around repeatable regions within the pattern
- Serves as the front end to the local power controllers; implemented as a wrapper library for MPI calls
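To make the pattern-detection step concrete: a supermaximal repeat is a maximal repeat (a repeated substring whose occurrences cannot all be extended identically to the left or right) that is not contained in any other maximal repeat. A brute-force sketch over a trace encoded one symbol per MPI event (illustrative only; practical detectors use suffix trees or suffix arrays for linear time):

```python
def maximal_repeats(s):
    """All maximal repeats of s (quadratic brute force, for illustration)."""
    n, reps = len(s), set()
    for length in range(1, n):
        for i in range(n - length + 1):
            sub = s[i:i + length]
            occ = [k for k in range(n - length + 1) if s[k:k + length] == sub]
            if len(occ) < 2:
                continue
            # Left/right context of every occurrence (None at string ends).
            lefts = {s[k - 1] if k > 0 else None for k in occ}
            rights = {s[k + length] if k + length < n else None for k in occ}
            if len(lefts) > 1 and len(rights) > 1:   # cannot extend either way
                reps.add(sub)
    return reps

def supermaximal_repeats(s):
    """Maximal repeats not contained in any other maximal repeat."""
    reps = maximal_repeats(s)
    return {r for r in reps if not any(r != o and r in o for o in reps)}
```

In an MPI trace each symbol would stand for an event class (call plus arguments); the detected repeat delimits the repeatable region around which the DVFS scheduling points are placed.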
Phase Predictor Example
(Figure, animated across several slides: MPI event traces of processes P0 through P7, illustrating how the recurring pattern is detected and predicted)
Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions
Experimental Setup
- Platform: Single-chip Cloud Computer, 48 cores running a Linux kernel
- Voltage and frequency levels obtained empirically (frequency in MHz, voltage in V; the table of levels is lost in this transcription)
- Compiler: GCC
- Benchmarks: NAS Parallel Benchmarks (MPI), SPEC MPI 2007
- Evaluation methodology: lab setup for accurate power measurement
- Schemes evaluated:
  Scheme | Predictor | Domain policy
  SMRP + Amean | SMRP | Mean
  SMRP + Simple | SMRP | Simple
  Chipwide + SMRP + Amean | SMRP | Mean, but chip-wide
  GHTP + Amean | GHTP | Mean
  GHTP + Simple | GHTP | Simple
EDP Results: Bottom Line
(Figure: normalized EDP for GHTP+Simple, GHTP+Amean, Chipwide+SMRP+Amean, SMRP+Simple, and SMRP+Amean across is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average)
- 11% average EDP reduction with a 7% increase in execution time
Results: Domain Management Policies
(Figure: EDP and execution time for the Simple, Amean, Alllow, Allhigh, and ChipW policies across is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average)
- The arithmetic mean policy performs best
Results: Phase Predictor Performance
(Figure: accuracy and coverage of SMRP vs. GHTP across is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average)
- The predictor is 17% more accurate on average than the state-of-the-art GHTP
Conclusions
- Many-cores offer new challenges and opportunities for DVFS: possible division into domains, possible application-level control
- Presented a novel power management scheme applicable to many-cores:
  - Modular: separates the concerns of phase detection/prediction and power control
  - Hierarchical: can accommodate control at the domain level
  - Transparent: requires no user or OS intervention
- Demonstrated significant energy improvements, 15% on average, on a real system
- The benefits come from both better prediction and better management
Experimental Setup: Power Measurement
- Lab environment to accurately and directly measure system power
- Input voltage and current measured directly with digital multimeters; current measured through a shunt resistor
(Figure: voltmeters and ammeters on the SCC board, attached to a management console PC and a Windows lab measurement PC)
SCC MPI Frequency Scaling
- The SCC is less communication-bound than traditional clusters
(Figure: computation and communication speedup ratios for different frequencies on the SCC, normalized to 533 MHz)
Performance Threshold Sensitivity
- 10% appears to be the sweet spot
(Figure: EDP and execution time for performance thresholds δ of 5%, 10%, 15%, 20%, 25%, and 30%)
Related Work
- Hardware schemes (Isci et al., MICRO 2006; Huang et al., ISCA 2003): require additional hardware for monitoring and control; results obtained through simulation
- DVFS management of MPI applications (Freeh et al., PPoPP 2005; Lim et al., SC 2006; Rountree et al., ICS 2009): assume per-core power management and require profile data
- PowerNap (Meisner et al., ASPLOS 2009): an idle-time policy; our scheme is complementary to such idle-time schemes
Bibliography
- C. Isci, A. Buyuktosunoglu, C.-Y. Cher, and M. Martonosi. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. MICRO 2006.
- M. C. Huang, J. Renau, and J. Torrellas. Positional Adaptation of Processors: Application to Energy Reduction. ISCA 2003.
- V. W. Freeh and D. K. Lowenthal. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster. PPoPP 2005.
- B. Rountree et al. Adagio: Making DVS Practical for Complex HPC Applications. ICS 2009.
Bibliography (cont.)
- M. Y. Lim, V. W. Freeh, and D. K. Lowenthal. Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs. SC 2006.
- D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating Server Idle Power. ASPLOS 2009.
Background: DVFS
- Dynamic power consumption: P_dyn ∝ V_dd² · f
- The achievable frequency is also a function of V_dd (lower V_dd means lower f), so lowering both V_dd and f can bring significant power savings
- If the power savings come with little impact on performance, then energy savings are achieved as well
- DVFS is usually applied to cores but not to memories: a higher core frequency means a higher memory latency in core cycles
- A commonly accepted rule of thumb is a 3:1 ratio of power savings to performance degradation
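As a worked illustration of why the trade-off pays off (800 MHz at 1.1 V is the SCC operating point mentioned earlier; the 0.9 V value paired with 533 MHz is assumed for illustration):

```python
def rel_dynamic_power(v, f, v0, f0):
    """Dynamic power relative to the (v0, f0) operating point: P ~ V^2 * f."""
    return (v / v0) ** 2 * (f / f0)

# Hypothetical step: 800 MHz at 1.1 V down to 533 MHz at an assumed 0.9 V.
p = rel_dynamic_power(0.9, 533, 1.1, 800)   # roughly 0.45x dynamic power
# Energy = power * time: if a memory-bound phase slows by only 7%,
e = p * 1.07                                 # roughly 0.48x dynamic energy
```

A ~55% dynamic-power saving against a 7% slowdown comfortably clears the 3:1 rule of thumb.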
Background: MPI Applications
- A library supporting the message-passing programming model
- User API for exchanging messages across abstract processes; common messages are send, receive, and collective types
- System interface to hardware communication mechanisms (e.g., TCP/IP, InfiniBand, vendor-proprietary)
- In most systems the library runs on the same core as the user code, so DVFS can be applied to both user code and MPI library code
- Common programming styles lead to much regularity in the patterns of message exchanges
- The well-defined standard allows wrappers to be added easily around common MPI calls
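The wrapper idea can be illustrated generically in Python (the actual library wraps the C MPI API, where the standard PMPI profiling interface is the usual interception mechanism; the decorator below is only a language-neutral sketch with a stand-in for a real MPI call):

```python
import functools
import time

def traced(name, fn, trace):
    """Wrap a communication call so each invocation is logged for
    phase detection before forwarding to the real implementation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)          # the underlying call
        trace.append((name, time.perf_counter() - t0))
        return result
    return wrapper

# Usage sketch with a dummy stand-in for an MPI send:
trace = []
send = traced("MPI_Send", lambda buf, dest: None, trace)
send(b"payload", dest=1)
send(b"payload", dest=2)
# trace now holds two ("MPI_Send", duration) events
```

Because wrapping is transparent, the application needs no source changes: the event stream the wrappers collect is exactly the input the phase predictor consumes.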
Background: SCC Many-Core
- An experimental concept vehicle developed by Intel Labs to serve as a platform for software research
- 48 Intel Pentium IA cores in a tiled organization: 2 cores per tile, mesh interconnect
- One frequency domain per tile and one voltage domain per 4 tiles
- The current frequency and voltage levels can be read and set by user software through registers
- Voltage changes take about 1 ms; frequency changes take only a few cycles
Snatch: Opportunistically Reassigning Power Allocation between and in 3D Stacks Dimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa, Ulya Karpuzcu, Radu Teodorescu, Nam Sung Kim,
More informationAnalyzing the Energy-Time Tradeoff in High-Performance Computing Applications
Analyzing the Energy-Time Tradeoff in High-Performance Computing Applications Vincent W. Freeh Feng Pan David K. Lowenthal Nandini Kappiah Rob Springer Barry L. Rountree Mark E. Femal Department of Computer
More informationProactive Process-Level Live Migration in HPC Environments
Proactive Process-Level Live Migration in HPC Environments Chao Wang, Frank Mueller North Carolina State University Christian Engelmann, Stephen L. Scott Oak Ridge National Laboratory SC 08 Nov. 20 Austin,
More informationPrediction Models for Multi-dimensional Power-Performance Optimization on Many Cores
Prediction Models for Multi-dimensional Power-Performance Optimization on Many Cores Matthew Curtis-Maury, Ankur Shah Filip Blagojevic, Dimitrios S. Nikolopoulos ABSTRACT Department of Computer Science,
More informationMEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS
MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing
More informationContents. Acknowledgments... xi. Foreword 1... xiii Pierre FICHEUX. Foreword 2... xv Maryline CHETTO. Part 1. Introduction... 1
Contents Acknowledgments... xi Foreword 1... xiii Pierre FICHEUX Foreword 2... xv Maryline CHETTO Part 1. Introduction... 1 Chapter 1. General Introduction... 3 1.1. The outburst of digital data... 3 1.2.
More informationDVFS Space Exploration in Power-Constrained Processing-in-Memory Systems
DVFS Space Exploration in Power-Constrained Processing-in-Memory Systems Marko Scrbak and Krishna M. Kavi Computer Systems Research Laboratory Department of Computer Science & Engineering University of
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationPerformance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology
Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Shinobu Miwa Sho Aita Hiroshi Nakamura The University of Tokyo {miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp
More informationEnergy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques
Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering
More informationManaging Hardware Power Saving Modes for High Performance Computing
Managing Hardware Power Saving Modes for High Performance Computing Second International Green Computing Conference 2011, Orlando Timo Minartz, Michael Knobloch, Thomas Ludwig, Bernd Mohr timo.minartz@informatik.uni-hamburg.de
More informationConservation Cores: Reducing the Energy of Mature Computations
Conservation Cores: Reducing the Energy of Mature Computations Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, Michael Bedford
More informationDavid Cronk University of Tennessee, Knoxville, TN
Penvelope: A New Approach to Rapidly Predicting the Performance of Computationally Intensive Scientific Applications on Parallel Computer Architectures Daniel M. Pressel US Army Research Laboratory (ARL),
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationEnergy-centric DVFS Controlling Method for Multi-core Platforms
Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To
More informationResponse Time and Throughput
Response Time and Throughput Response time How long it takes to do a task Throughput Total work done per unit time e.g., tasks/transactions/ per hour How are response time and throughput affected by Replacing
More informationProjects on the Intel Single-chip Cloud Computer (SCC)
Projects on the Intel Single-chip Cloud Computer (SCC) Jan-Arne Sobania Dr. Peter Tröger Prof. Dr. Andreas Polze Operating Systems and Middleware Group Hasso Plattner Institute for Software Systems Engineering
More informationCSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI Contents 1.7 - End of Chapter 1 Power wall The multicore era
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource - HPC
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications
More informationLAMMPS-KOKKOS Performance Benchmark and Profiling. September 2015
LAMMPS-KOKKOS Performance Benchmark and Profiling September 2015 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox, NVIDIA
More informationPerformance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet. Swamy N. Kandadai and Xinghong He and
Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet Swamy N. Kandadai and Xinghong He swamy@us.ibm.com and xinghong@us.ibm.com ABSTRACT: We compare the performance of several applications
More informationECE 571 Advanced Microprocessor-Based Design Lecture 21
ECE 571 Advanced Microprocessor-Based Design Lecture 21 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 9 April 2013 Project/HW Reminder Homework #4 comments Good job finding references,
More informationSDR Forum Technical Conference 2007
THE APPLICATION OF A NOVEL ADAPTIVE DYNAMIC VOLTAGE SCALING SCHEME TO SOFTWARE DEFINED RADIO Craig Dolwin (Toshiba Research Europe Ltd, Bristol, UK, craig.dolwin@toshiba-trel.com) ABSTRACT This paper presents
More informationHigh performance, power-efficient DSPs based on the TI C64x
High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationLow-Complexity Reorder Buffer Architecture*
Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower
More informationUNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM
UNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM Sweety Sen, Sonali Samanta B.Tech, Information Technology, Dronacharya College of Engineering,
More informationLAMMPSCUDA GPU Performance. April 2011
LAMMPSCUDA GPU Performance April 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Intel, Mellanox Compute resource - HPC Advisory Council
More informationPerformance Evaluation of Fast Ethernet, Giganet and Myrinet on a Cluster
Performance Evaluation of Fast Ethernet, Giganet and Myrinet on a Cluster Marcelo Lobosco, Vítor Santos Costa, and Claudio L. de Amorim Programa de Engenharia de Sistemas e Computação, COPPE, UFRJ Centro
More informationCS3350B Computer Architecture CPU Performance and Profiling
CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada
More informationBirds of a Feather Presentation
Mellanox InfiniBand QDR 4Gb/s The Fabric of Choice for High Performance Computing Gilad Shainer, shainer@mellanox.com June 28 Birds of a Feather Presentation InfiniBand Technology Leadership Industry Standard
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationComputing and energy performance
Equipe I M S Equipe Projet INRIA AlGorille Computing and energy performance optimization i i of a multi algorithms li l i PDE solver on CPU and GPU clusters Stéphane Vialle, Sylvain Contassot Vivier, Thomas
More information3D WiNoC Architectures
Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",
More informationPart IV: 3D WiNoC Architectures
Wireless NoC as Interconnection Backbone for Multicore Chips: Promises, Challenges, and Recent Developments Part IV: 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan 1 Outline: 3D WiNoC Architectures
More informationLarge Scale Debugging of Parallel Tasks with AutomaDeD!
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) Seattle, Nov, 0 Large Scale Debugging of Parallel Tasks with AutomaDeD Ignacio Laguna, Saurabh Bagchi Todd
More informationComputer Aided Engineering with Today's Multicore, InfiniBand-Based Clusters ANSYS, Inc. All rights reserved. 1 ANSYS, Inc.
Computer Aided Engineering with Today's Multicore, InfiniBand-Based Clusters 2006 ANSYS, Inc. All rights reserved. 1 ANSYS, Inc. Proprietary Our Business Simulation Driven Product Development Deliver superior
More informationRUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS
RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering
More informationCHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER
73 CHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER 7.1 INTRODUCTION The proposed DVS algorithm is implemented on DELL INSPIRON 6000 model laptop, which has Intel Pentium Mobile Processor
More informationJob Startup at Exascale:
Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core
More informationMulticore Cache Coherence Control by a Parallelizing Compiler
Multicore Cache Coherence Control by a Parallelizing Compiler Hironori Kasahara, Boma A. Adhi, Yohei Kishimoto, Keiji Kimura, Yuhei Hosokawa Masayoshi Mase Department of Computer Science and Engineering
More informationDynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization
Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Talk
More informationSystem Software Solutions for Exploiting Power Limited HPC Systems
http://scalability.llnl.gov/ System Software Solutions for Exploiting Power Limited HPC Systems 45th Martin Schulz, LLNL/CASC SPEEDUP Workshop on High-Performance Computing September 2016, Basel, Switzerland
More informationAdaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs
Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs Min Yeol Lim Vincent W. Freeh David K. Lowenthal Abstract Although users of high-performance computing are most
More informationPower Constrained HPC
http://scalability.llnl.gov/ Power Constrained HPC Martin Schulz Center or Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory With many collaborators and Co-PIs, incl.: LLNL: Barry
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More information