Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer

1 Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig

2 Introduction: Power Management
Power and energy are now first-order constraints, and recent CPUs allow voltage/frequency scaling to minimize energy or power within some performance constraint.
Application-driven DVFS: performance during memory-bound periods is largely unaffected by frequency, so reducing frequency and voltage saves energy/power. These periods are often recurrent, and can therefore be learned and predicted.

3 Introduction: DVFS on Many-Cores
Voltage and frequency control is not independent for each core; it is exercised at the granularity of domains. Application-level monitoring and control of the settings is possible.
The challenge: power management on many-cores that exploits application behavior to minimize energy consumption within a performance window.
Target platform: the SCC (an experimental concept vehicle from Intel Labs) running MPI applications.

4-9 Proposed Scheme Overview
A modular, hierarchical, transparent, dynamic software power management scheme for a many-core system.
Front-end (Phase Predictor): captures application behavior without user intervention. A phase search identifies a macro phase, which is partitioned into subphases and phases.
Back-end (Power Manager): tracks performance vs. energy behavior and adapts. It sets up a frequency/time (f/t) table per phase, drives DVFS control, and updates the table. Local managers issue requests to a domain manager, which drives the control system.
Feedback: phase mispredictions, phase state, and application performance flow from the back-end to the front-end.
(Block diagram, built up over several slides.)

10 Contributions
Novel power management scheme for many-core systems: hierarchical, capable of operating on domain-based systems.
Novel phase prediction scheme (SMRP, SuperMaximal Repeat phase Predictor), based on supermaximal repeat strings; better accuracy than previous approaches.
Transparent instrumentation of MPI applications.
Schemes implemented and evaluated on a real experimental many-core system.
Significant energy savings with little performance degradation: 15% energy savings on average with only 7% performance degradation, well within the 3:1 ratio of power savings to performance degradation.

11 Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions

12 Power Manager: Local Manager
Input: phases of repeatable behavior. For each repetitive phase a frequency/time table is built.
Iterative approach:
1. Start at the highest frequency.
2. Measure the execution time of the current instance and record it in the table.
3. Reduce the frequency by one step.
4. Stop if the performance impact is higher than a threshold δ (e.g., 10%).
5. Otherwise, repeat until the lowest frequency is reached.
Output: frequency requested per core. (A sketch of this loop follows below.)
(Figure: tiles and cores of a domain, each core's local manager holding per-subphase f/t tables and issuing frequency requests fr0..fr7 after local frequency analysis.)
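
A minimal Python sketch of this exploration loop, under assumptions: the frequency ladder, the value of δ, and the helper structure are illustrative and do not reflect the actual implementation or the SCC interface.

```python
# Illustrative sketch of the local manager's per-subphase frequency exploration.
FREQ_STEPS_MHZ = [800, 533, 400, 320, 266]  # assumed ladder, highest to lowest
DELTA = 0.10                                # allowed performance degradation


class SubphaseTable:
    """Frequency/time (f/t) table for one repetitive sub-phase."""

    def __init__(self):
        self.times = {}         # frequency (MHz) -> measured execution time (s)
        self.chosen = None      # steady-state frequency once exploration stops

    def next_frequency(self):
        """Frequency to request for the next instance of this sub-phase."""
        if self.chosen is not None:
            return self.chosen                       # exploration finished
        tried = [f for f in FREQ_STEPS_MHZ if f in self.times]
        if not tried:
            return FREQ_STEPS_MHZ[0]                 # step 1: start at the highest frequency
        last = tried[-1]
        prev = tried[-2] if len(tried) > 1 else None
        if prev is not None and self.times[last] > (1 + DELTA) * self.times[prev]:
            self.chosen = prev                       # step 4: last step hurt too much, back off
            return self.chosen
        idx = FREQ_STEPS_MHZ.index(last)
        if idx + 1 == len(FREQ_STEPS_MHZ):
            self.chosen = last                       # step 5: lowest frequency reached
            return self.chosen
        return FREQ_STEPS_MHZ[idx + 1]               # step 3: reduce frequency by one step

    def record(self, freq_mhz, exec_time):
        """Step 2: record the measured execution time of the current instance."""
        self.times[freq_mhz] = exec_time
```

In the full scheme the frequency returned here is only a request; the domain manager described on slide 19 decides what is actually applied.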

13-18 Local Management Example (Learning)
Sub-phase i detected: set f to the highest value (e.g., f_i1 = 800 MHz); measure execution time t_i1.
Sub-phase i detected again: set f to the next value (e.g., f_i2 = 533 MHz); measure t_i2. Since t_i2 < (1+δ)t_i1, continue exploration.
Sub-phase i detected again: set f to the next value (e.g., f_i3 = 400 MHz); measure t_i3. Since t_i3 > (1+δ)t_i2, stop exploration.
Sub-phase i detected from then on: use f_i2 = 533 MHz, with no further exploration.
The steady-state frequency will change if there is a phase misprediction. In reality, the testing of new frequencies also depends on domain management decisions.
(Example built up over several animation slides; a small numerical check follows below.)
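
A quick numerical check of the stopping condition in this example. The execution times are purely assumed for illustration; the slides do not give them.

```python
# Assumed (illustrative) execution times for sub-phase i at each frequency.
delta = 0.10
t_i1 = 1.00   # seconds at 800 MHz
t_i2 = 1.06   # seconds at 533 MHz
t_i3 = 1.25   # seconds at 400 MHz

assert t_i2 < (1 + delta) * t_i1   # 1.06 < 1.10: within threshold, continue exploration
assert t_i3 > (1 + delta) * t_i2   # 1.25 > 1.166: threshold exceeded, stop and keep 533 MHz
```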

19 Power Manager: Domain Manager
One per voltage domain; it decides and controls voltage and frequency.
Input: frequency requests from the local managers of each core. Output: frequencies for the entire domain.
Policies investigated:
Simple: service requests in order.
Mean: select the mean of the requests.
All_low / All_high: assign the lowest / highest frequency requested.
The voltage is set for the entire domain based on the highest frequency in use (max), via a voltage lookup table. (A sketch of the policies and the lookup follows below.)
(Figure: domain manager collecting requests fr0..fr7, taking max(f0...f3), and looking up Vdom in the voltage LUT.)
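
A minimal Python sketch of the four policies and the max-frequency voltage lookup. The frequency ladder and all LUT values other than the 800 MHz / 1.1 V point shown later are placeholder assumptions, not the real SCC table.

```python
# Illustrative domain-manager sketch; LUT values are assumptions except 800 MHz -> 1.1 V.
FREQ_STEPS_MHZ = [800, 533, 400, 320, 266]
VOLTAGE_LUT = {800: 1.1, 533: 0.9, 400: 0.8, 320: 0.7, 266: 0.7}   # MHz -> V (assumed)

def snap(freq):
    """Snap an arbitrary frequency to the nearest available step."""
    return min(FREQ_STEPS_MHZ, key=lambda f: abs(f - freq))

# The four policies investigated, each mapping per-core requests to one frequency.
def policy_simple(requests):
    return requests[0]                           # service requests in arrival order

def policy_mean(requests):
    return snap(sum(requests) / len(requests))   # arithmetic mean (Amean)

def policy_all_low(requests):
    return min(requests)

def policy_all_high(requests):
    return max(requests)

def domain_decision(tile_requests, policy):
    """One frequency per frequency domain (tile), one voltage for the whole
    voltage domain, chosen for the highest frequency selected within it."""
    f_doms = [policy(reqs) for reqs in tile_requests]   # per-tile frequencies
    v_dom = VOLTAGE_LUT[max(f_doms)]                    # voltage follows the max
    return f_doms, v_dom
```

For example, with core requests [[800, 400], [533, 533]] the Amean policy yields tile frequencies [533, 533] and, with these placeholder LUT values, V_dom = 0.9 V.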

20-26 Domain Management Example
Each local manager 0..7 sends its frequency request fr0..fr7 to the domain manager (e.g., local manager 0 requests fr0 = 800 MHz).
The domain manager applies the policy f() to the requests to obtain the domain frequencies f_doms, then looks up the domain voltage V_dom in the SCC V/F table.
Here max(f_doms) = 800 MHz, so V_dom = 1.1 V. The process then repeats as new requests arrive.
(Example and SCC frequency/voltage table built up over several slides; most of the numeric values were shown only in the figure.)

27 Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions

28 Phase Predictor
MPI applications exhibit recurring communication and execution patterns at MPI-event granularity, and these patterns are highly repeatable.
Pattern detection: uses a supermaximal repeat string algorithm.
Predictor: predicts the next call and program phase with a projected execution time, and places DVFS scheduling points around repeatable regions within the pattern.
It acts as the front-end to the local power controllers and is implemented as a wrapper library for MPI calls. (A simplified sketch of repeat detection follows below.)
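
A simplified Python sketch of the idea, not the paper's supermaximal-repeat-string algorithm: a brute-force search for the longest pattern that repeats back-to-back in a trace of MPI call identifiers. Finding such a pattern is what allows DVFS scheduling points to be placed around the repeatable region.

```python
# Simplified stand-in for the SMRP detector: find the longest adjacently repeated
# pattern in a trace of MPI call IDs. The real scheme uses a supermaximal repeat
# string algorithm; this brute-force version only illustrates the idea.

def find_repeating_pattern(trace, min_repeats=2):
    """Return (start, pattern) for the longest adjacently repeated pattern, or None."""
    n = len(trace)
    for length in range(n // min_repeats, 0, -1):       # prefer long patterns
        for start in range(n - min_repeats * length + 1):
            pattern = trace[start:start + length]
            if pattern == trace[start + length:start + 2 * length]:
                return start, pattern                   # first (longest) hit wins
    return None

# Example trace of call identifiers as recorded by the MPI wrapper library.
trace = ["isend", "irecv", "waitall", "allreduce"] * 3 + ["barrier"]
print(find_repeating_pattern(trace))   # (0, ['isend', 'irecv', 'waitall', 'allreduce'])
```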

29-47 Phase Predictor Example
(Animated walkthrough of pattern detection across MPI processes P0-P7; figure only, no slide text.)

48 Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions

49 Experimental Setup
Platform: Single-chip Cloud Computer, 48 cores each running a Linux kernel; V and F levels obtained empirically (frequency/voltage table shown as a figure). Compiler: GCC.
Benchmarks: NAS MPI Parallel Benchmarks; 2 SPEC MPI 2007 benchmarks.
Evaluation methodology: lab setup for accurate power measurement.
Schemes evaluated:

Scheme                    Predictor  Domain policy
SMRP + Amean              SMRP       Mean
SMRP + Simple             SMRP       Simple
Chipwide + SMRP + Amean   SMRP       Mean, but chip-wide
GHTP + Amean              GHTP       Mean
GHTP + Simple             GHTP       Simple

50 EDP Results: Bottom Line
(Chart: EDP of GHTP+Simple, GHTP+Amean, Chipwide+SMRP+Amean, SMRP+Simple, and SMRP+Amean for is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average.)
11% average EDP reduction with a 7% increase in execution time.

51 Results: Domain Management Policies
(Chart: EDP and execution time for the Simple, Amean, Alllow, Allhigh, and ChipW policies across is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average.)
The arithmetic mean policy performs best.

52 Results: Phase Predictor Performance
(Chart: accuracy and coverage of SMRP vs. GHTP for is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average.)
The SMRP predictor is on average 17% more accurate than the state-of-the-art GHTP.

53 Conclusions
Many-cores offer new challenges and opportunities for DVFS: possible division into domains and possible application-level control.
We presented a novel power management scheme applicable to many-cores.
Modular: allows separation of concerns between phase detection/prediction and control.
Hierarchical: can accommodate control at the domain level.
Transparent: does not require user or OS intervention.
Demonstrated significant energy improvements of 15% on average on a real system; the benefits come from both better prediction and better management.

54 Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig

55 Experimental Setup: Power Measurement
Lab environment to accurately and directly measure system power: input voltage and current are measured directly with digital multimeters, with current measured through a shunt resistor. (A sketch of the resulting power computation follows below.)
(Figure: voltmeters and ammeters feeding a measurement PC, alongside the management/console PC and a Windows lab PC.)
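
A small Python sketch of how power and energy could be derived from such readings; the shunt resistance and the sample values are placeholders, not the actual lab setup.

```python
# Derive input power from the supply voltage and the voltage drop across the
# current-sense shunt resistor. Resistor value and readings are assumptions.
R_SHUNT_OHMS = 0.01                  # placeholder shunt resistance

def sample_power(v_supply, v_shunt):
    """Instantaneous power: P = V_supply * I, with I = V_shunt / R_shunt."""
    return v_supply * (v_shunt / R_SHUNT_OHMS)     # watts

def energy(samples, dt):
    """Integrate equally spaced power samples over a fixed time step dt (seconds)."""
    return sum(sample_power(v, vs) for v, vs in samples) * dt   # joules

readings = [(3.3, 0.15), (3.3, 0.16), (3.3, 0.14)]   # (V_supply, V_shunt) pairs, assumed
print(energy(readings, dt=0.5))
```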

56 SCC MPI Frequency Scaling
(Chart: computation and communication speedup ratios for different frequencies on the SCC, with speedup normalized to 533 MHz.)
The SCC is less communication bound than traditional clusters.

57 Performance Threshold Sensitivity
(Chart: EDP and execution time as a function of the performance threshold δ, up to 30%.)
A threshold of 10% seems to be the sweet spot.

58 Related Work
Hardware schemes (Isci et al. MICRO 2006, Huang et al. ISCA 2003): require additional hardware for monitoring and control; results obtained through simulation.
DVFS management of MPI applications (Freeh et al. PPoPP 2005, Lim et al. SC 2006, Rountree et al. ICS 2009): assume per-core power management and require profile data.
PowerNap (Meisner et al. ASPLOS 2009): an idle-time policy; our scheme is complementary to such idle-time schemes.

59 Bibliography
C. Isci, A. Buyuktosunoglu, C.-Y. Cher, and M. Martonosi. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. MICRO 2006.
M. C. Huang, J. Renau, and J. Torrellas. Positional Adaptation of Processors: Application to Energy Reduction. ISCA 2003.
V. W. Freeh and D. K. Lowenthal. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster. PPoPP 2005.
B. Rountree et al. Adagio: Making DVS Practical for Complex HPC Applications. ICS 2009.

60 Bibliography (cont.)
M. Y. Lim, V. W. Freeh, and D. K. Lowenthal. Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs. SC 2006.
D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating Server Idle Power. ASPLOS 2009.

61 Background: DVFS
Dynamic power consumption: P_dyn ∝ V_dd^2 · f.
The maximum frequency is also a function of V_dd (lower V_dd means lower f), so lowering both V_dd and f can bring significant power savings. If the power savings come with little impact on performance, energy savings are achieved as well.
DVFS is usually applied to cores but not to memories, so a higher core frequency means a higher memory latency in core cycles.
A commonly accepted rule of thumb is a 3:1 ratio of power savings to performance degradation.
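
Writing the relation out for reference; the roughly cubic power scaling assumes the common first-order approximation f ∝ V_dd, which the slide does not state explicitly.

```latex
% Generic CMOS dynamic-power relations; the linear f-vs-V_dd step is an illustrative assumption.
\[
  P_{\mathrm{dyn}} \;\propto\; C_{\mathrm{eff}}\, V_{dd}^{2}\, f
\]
\[
  \text{If } f \propto V_{dd}: \quad P_{\mathrm{dyn}} \propto V_{dd}^{3},
  \qquad
  \frac{E'}{E} \;=\; \frac{P'}{P}\cdot\frac{t'}{t}
  \;=\; \left(\frac{V'_{dd}}{V_{dd}}\right)^{2}\frac{f'}{f}\cdot\frac{t'}{t}.
\]
```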

62 Background: MPI Applications
MPI is a library supporting the message-passing programming model: a user API for exchanging messages across abstract processes (common messages are send, receive, and collective types) and a system interface to hardware communication mechanisms (e.g., TCP/IP, InfiniBand, vendor-proprietary fabrics).
In most systems the library runs on the same core as the user code, so DVFS can be applied to both user code and MPI library code.
Common programming styles lead to much regularity in the patterns of message exchanges, and the well-defined standard makes it easy to add wrappers to common MPI calls. (A minimal sketch of the wrapper idea follows below.)
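
A minimal Python sketch of the wrapper idea, purely illustrative: the actual library intercepts the C MPI calls, whereas here a decorator records a call identifier and the elapsed time of stand-in communication functions and appends them to a trace for the phase predictor. All names are invented for the example.

```python
import functools
import time

call_trace = []   # sequence of (call_id, duration in seconds) fed to the phase predictor

def mpi_wrapper(call_id):
    """Wrap a communication call so every invocation is logged for phase detection."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)                     # forward to the real call
            call_trace.append((call_id, time.perf_counter() - start))
            return result
        return wrapper
    return decorate

# Stand-in communication functions; a real wrapper would forward to the MPI library.
@mpi_wrapper("send")
def send(dest, payload):
    pass

@mpi_wrapper("allreduce")
def allreduce(value):
    return value
```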

63 Background: SCC Many-Core
An experimental concept vehicle developed by Intel Labs to serve as a platform for software research: 48 Intel Pentium IA cores in a tiled organization, with 2 cores per tile and a mesh interconnect.
There is one frequency domain per tile and one voltage domain per 4 tiles.
The current frequency and voltage levels can be read and set by user software through registers. Voltage changes take about 1 ms; frequency changes take only a few cycles.
