Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology


Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology
Shinobu Miwa, Sho Aita, Hiroshi Nakamura
The University of Tokyo
{miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp

Executive Summary
- Energy Efficient Ethernet (EEE) is a technique for lowering the power consumption of networks
  - Pros: significant power savings on network links
  - Cons: slight performance penalty caused by link on/off transitions
- To support system developers, we propose a performance estimation method for HPC systems with EEE
  - Using novel performance models with network profiles
- The experimental results show that our method is accurate in most cases
  - 2.63% error on average, 20.0% in the worst case

Agenda
- Introduction
- Energy Efficient Ethernet (EEE) technology
- Performance estimation of HPC systems with EEE
- Experimental results
- Summary & future work

Power of Interconnection Networks
- Power consumption of interconnection networks is not negligible in modern HPC systems
  - It can reach up to 33% of total system power*
- The reason is that interconnection networks have wider bandwidth and increased redundancy
  - Ex.) The Tofu network has ten links per node, each with 6.25 GB/s
- PHYs (physical layer devices) are the dominant modules in networks in terms of power consumption
  - Around 70% of network device power
  - Always active to maintain link connection

* P. M. Kogge, "Architectural Challenges at the Exascale Frontier," Simulating the Future: Using One Million Cores and Beyond (invited talk), 2008

Energy Efficient Ethernet (EEE)
- An Ethernet standard for saving PHY power
  - Standardized as IEEE 802.3az in 2010
- Switches to a low power mode during low network loads
  - Saves PHY power by up to 70%*
- The number of devices compliant with EEE is gradually increasing
  - [GEU-0820 (Level One)] [PowerConnect 5548 (Dell)] [8-port Gigabit GREENnet (Trendnet)] [Catalyst 3560CG (Cisco)]

* http://www.broadcom.com/press/release.php?id=s430231

Situation of EEE for HPC Use
- Few studies of EEE for HPC use have been done
  - There exists only one study, which evaluates the power of EEE-supported devices on a ping-pong test*
  - The power and performance of EEE-supported devices on HPC applications are still unknown
- Why? There is no hardware for HPC systems
  - EEE is quite a new technology

* P. Reviriego et al., "An Energy Consumption Model for Energy Efficient Ethernet Switches," HPCS, 2012

Requirements for the Spread of EEE in HPC
- Development of EEE-supported devices for HPC systems
  - Each task force of a high-performance network (e.g. InfiniBand) should standardize EEE technology rapidly
  - EEE-supported devices should be developed immediately
- Power/performance estimation when using EEE
  - Although no EEE-supported HPC system exists yet, we want to know the impact of EEE on existing systems
  - If the impact is small, it would motivate system developers to use EEE
- Establishment of a power management scheme
  - The optimal power management scheme may differ between the Internet and interconnection networks

Mission of This Work
- Our goal: to develop a performance estimation method for EEE-supported HPC systems
  - A power model of EEE already exists, but a performance model does not
  - We can start the discussion of power management schemes without EEE-supported hardware
- Prerequisite for estimation: we do not have any EEE-supported devices for HPC
- Our approach: using performance models with network profiles

EEE
- A technique to lower the power consumption of PHYs during low network loads
  - Starts powering a PHY off when an idle state is detected
  - Periodically wakes up to confirm link connectivity
  - Starts powering the PHY on when a packet arrives
- [Figure: PHY power over time — Active → Sleep → Low Power Mode (with periodic Refresh) → Wakeup → Active]
- Although the detailed power management of EEE is not published, most devices seem to use time-out control and on-demand wake-up
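As a rough illustration of the time-out / on-demand wake-up behavior described above, the following sketch replays a trace of packet arrival times against a link that sleeps when idle. The time-out and wake-up constants are illustrative assumptions, not values taken from any particular device.

```python
def simulate_link(arrivals_us, timeout_us=1000.0, t_wake_us=16.0):
    """Total extra wake-up latency over a trace of packet arrival times.

    The PHY goes to sleep once the link has been idle for longer than
    timeout_us; a packet that then arrives must wait t_wake_us for the
    PHY to power back on before its transmission can begin.
    """
    total_delay = 0.0
    last_activity = 0.0
    for t in sorted(arrivals_us):
        idle = t - last_activity
        woke = idle > timeout_us  # PHY was asleep when this packet arrived
        if woke:
            total_delay += t_wake_us
        # The link becomes active again once the PHY has woken up
        last_activity = t + (t_wake_us if woke else 0.0)
    return total_delay

# Three packets: only the third arrives after 2,500 usec of idleness,
# so it alone pays the 16 usec wake-up delay.
print(simulate_link([0.0, 500.0, 3000.0]))  # 16.0
```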

Performance Penalty of EEE
- Packets that arrive during a low power mode are delayed
- [Figure: PHY power over time — a packet arriving in the Low Power Mode waits for the wake-up delay Tw before communication begins]
- The wake-up delay is at least 16 microseconds in 1000BASE-T networks

  Protocol     Min Tw (usec)
  100BASE-TX   30
  1000BASE-T   16
  10GBASE-T    4.48

- We must model this penalty!

Proposed Performance Model
- Suppose that an application i runs with j threads on an EEE-supported HPC system
- The elapsed time T can be described as

  T = T_base + T_overhead

  - T_base: elapsed time when application i runs with j threads on an EEE-unsupported system
  - T_overhead: time overhead caused by EEE

Model of T_overhead
- We assume that T_overhead can be written as

  T_overhead = n × f(I)

  - n: communication count per node
  - I: average idle interval of network links
  - f: performance penalty per communication
- Ideally, the function f forms a step function: zero below the time-out interval and Tw above it
- In practice, the penalty shows a gradual increase around the time-out interval because of variation in I
- [Figure: ideal model (step function of I) vs. empirical model (gradual rise to Tw)]
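Under the time-out control assumed above, the ideal f is a step function. A minimal sketch, assuming a 1 msec time-out (the interval of the switch used later in the talk) and the 1000BASE-T minimum wake-up delay:

```python
TIMEOUT_US = 1000.0  # assumed device time-out interval (1 msec)
T_WAKE_US = 16.0     # minimum wake-up delay Tw for 1000BASE-T

def f_ideal(idle_us):
    """Ideal per-communication penalty: zero while the average idle
    interval stays below the time-out (the PHY never sleeps), and the
    full wake-up delay Tw once it exceeds the time-out."""
    return T_WAKE_US if idle_us > TIMEOUT_US else 0.0
```

The empirical f smooths this step around the time-out, since individual idle intervals scatter around their average I.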

Model of I
- We suppose that all communication occurs periodically and transmits the same amount of data
- Under this assumption, I can be written as

  I = (T_base / n) − (S / B)

  - S: average communication data size per node
  - B: network bandwidth per node
- [Figure: network load over time — each period of length T_base/n consists of a transmission of length S/B followed by an idle interval I]

List of Proposed Models
- T = T_base + T_overhead
- T_overhead = n × f(I)
- I = (T_base / n) − (S / B)

  - T_base: elapsed time on EEE-unsupported systems — obtained by measurement
  - n: communication count per node — obtained by network profiles
  - I: average idle interval of network links
  - S: average communication data size per node — obtained by network profiles
  - f: performance penalty per communication — given by the assumed power management scheme
  - B: network bandwidth per node — already known
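Putting the three models together, the estimation pipeline can be sketched as follows. The example parameter values (100,000 all-to-all rounds of 4 KB over ~1 Gb/s) are illustrative assumptions, not measurements from this work, and f is the ideal step function.

```python
TIMEOUT_US = 1000.0  # assumed switch time-out interval (1 msec)
T_WAKE_S = 16e-6     # 1000BASE-T minimum wake-up delay, in seconds

def estimate_elapsed(t_base_s, n, s_bytes, b_bytes_per_s):
    """Estimate T = T_base + n * f(I), with I = T_base/n - S/B.

    t_base_s: elapsed time on an EEE-unsupported system (measured)
    n: communication count per node (network profile)
    s_bytes: average communication data size per node (network profile)
    b_bytes_per_s: network bandwidth per node (known)
    """
    idle_s = t_base_s / n - s_bytes / b_bytes_per_s
    # Ideal step function: the full wake-up delay once I exceeds the time-out
    f = T_WAKE_S if idle_s * 1e6 > TIMEOUT_US else 0.0
    return t_base_s + n * f

# At T_base = 10 s the idle interval (~67 usec) stays below the time-out,
# so no overhead is predicted; at T_base = 110 s the link sleeps between
# rounds and every communication pays the wake-up delay.
print(estimate_elapsed(10.0, 100_000, 4096, 125e6))   # 10.0
print(estimate_elapsed(110.0, 100_000, 4096, 125e6))  # ~111.6
```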

Evaluation Methodology
- Evaluation item: accuracy of the proposed models
- Evaluation method: estimate the performance in the following situations
  - EEE-disabled system → stands in for an EEE-unsupported HPC system
  - EEE-enabled system → stands in for a future EEE-supported HPC system
- Benchmark programs
  - Synthetic application
  - HPC applications (NAS Parallel Benchmarks)

System Configuration
- Switch: Dell PowerConnect 5548
  - 48-port Gigabit Ethernet, compliant with EEE
  - Time-out interval: 1 msec
- Node: HP ProLiant DL360p Gen8 (4 nodes)
  - CPU: Xeon E5-2680, 2-socket 8C/16T, 2.7 GHz, 130 W TDP
  - Memory: 64 GB (8 GB x8)
  - NIC: HP FlexibleLOM 1Gb 4-port 331FLR Ethernet adapter, compliant with EEE
  - Turbo Boost and cpuspeed disabled
- [Figure: four nodes, each with a NIC, connected through the switch over a LAN]

Evaluation with Synthetic Application
- A synthetic application in which all processes repeat concurrent communication
  - Repeats all-to-all 100,000 times on a given array
  - Inserts a usleep call to adjust communication intervals
- Parameters used for the experiment
  - # of ranks: 4 (1 rank/node), 16 (4 ranks/node)
  - Array size: 256-131,072 bytes
  - Sleep time: 0, 100, 500, 1,000 usec
- [Pseudo code of synthetic application]
- Executed 5 times for each parameter set; the results are averaged

Evaluation with NAS Parallel Benchmarks
- Version: 3.3.1
- Compile options: -O2 -funroll-loops
- Parameters used for the experiment
  - # of ranks: 4 (1 rank/node), 16 (4 ranks/node)
  - Class: A, B, C (however, we will only show the results for 16-rank Class B)
- Executed 10 times for each parameter set; the results are averaged

Evaluation Result of T_overhead
- We can model many cases correctly (synthetic, 4-rank)
- There exist a few points that show large errors
  - This is because the firmware changes the CPU frequency unexpectedly
- [Figure: measured T_overhead (sec) vs. average idle interval (usec) for sleep times of 0, 100, 500, and 1,000 usec, with the model trend line]

Accuracy of Performance Estimation (Synthetic)
- [Figure: elapsed time (sec) of EEE-disabled, EEE-enabled, and estimated runs, for array sizes 256 B-128 KB and sleep times of 0, 100, 500, and 1,000 usec, in both 4-rank and 16-rank configurations]

Accuracy of Performance Estimation (Synthetic, 4-rank, 100 usec sleep)
- Performance degradation caused by EEE: up to 25.8% (4 KB)
- Estimation error: 2.63% on average, 20.0% in the worst case
- [Figure: elapsed time (sec) of EEE-disabled, EEE-enabled, and estimated runs for array sizes 256 B-8 KB]

Accuracy of Performance Estimation (NPB, 16-rank, Class B)
- Since most of the applications communicate little, EEE hardly degrades their performance
- Only LU, which communicates frequently, shows a large error, because of inaccuracy in the model of the average idle interval
- [Figure: elapsed time (sec) of the BT, CG, EP, FT, IS, LU, MG, and SP benchmark programs]

Summary and Future Work
- Summary
  - Summarized the requirements for the spread of EEE in HPC
  - Proposed a novel performance estimation method for EEE-enabled HPC systems
    - Most cases show good accuracy, but some do not
    - Accurate profile-based estimation is hard because of many impractical assumptions
- Future work
  - Develop trace-based estimation
  - Evaluate other situations (other applications and topologies)

Any Questions?