Data Management and Analysis for Energy Efficient HPC Centers

Data Management and Analysis for Energy Efficient HPC Centers Presented by Ghaleb Abdulla, Anna Maria Bailey and John Weaver; Copyright 2015 OSIsoft, LLC

Data management and analysis for energy efficient HPC centers. Ghaleb Abdulla, Anna Maria Bailey and John Weaver. LLNL-PRES-XXXXXX. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Acknowledgements: Philip Top (LLNL), Chuck Wells (OSIsoft), the EEHPC demand response (D/R) group, and the EEHPC PUE/TUE group.

Data collection helps us assess the health and efficiency of our facilities: perform event analysis (loss of power, voltage sag/dip, voltage swell, etc.); understand where and how power is used; perform energy efficiency studies; plan for exascale computing; understand current usage patterns; and relate component-, system-, and facility-level data.

Implement a centralized system. Manage: gather and evaluate large amounts of data from sources such as rack and equipment metering, building management, and utility interfaces, tens of thousands of real-time data streams in all. Analyze: correlate real-time data. Notify: centralized event notification. Visualize: view data and reports. We created an operational, event, and real-time data management infrastructure covering all external and internal data sources.
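
To make that flow concrete, here is a minimal gather-correlate-notify sketch in Python; the tag names, sample values, and the three-sigma rule are invented for illustration and are not taken from the LLNL deployment.

    import statistics
    from collections import defaultdict

    # Hypothetical readings: (source, tag, value) -- all names invented.
    readings = [
        ("rack_metering", "rack42.power_kw", 31.5),
        ("rack_metering", "rack42.power_kw", 31.9),
        ("rack_metering", "rack42.power_kw", 30.8),
        ("rack_metering", "rack42.power_kw", 55.0),   # sudden jump -> should raise an event
        ("building_mgmt", "chiller3.supply_temp_f", 61.8),
        ("utility",       "feeder_a.voltage_v",      11950.0),
    ]

    # Centralize: one store keyed by tag, loosely like tags in a historian.
    store = defaultdict(list)
    for source, tag, value in readings:
        store[tag].append(value)

    # Analyze + notify: flag tags whose latest value deviates strongly from
    # that tag's earlier history (a simple stand-in for real event rules).
    for tag, values in store.items():
        history, latest = values[:-1], values[-1]
        if len(history) >= 2:
            mean, sd = statistics.mean(history), statistics.pstdev(history) or 1.0
            if abs(latest - mean) > 3 * sd:
                print(f"EVENT: {tag} deviates from its history: {latest}")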

Current data sources are spread across LLNL and feed into the OSIsoft PI System: Synap Cabarnet environmental data, the ALC building management system, Nexus HPC rack metering, Sequoia & Vulcan power, MV90/EEM sitewide metering, Siemens SCADA electric utility data, Forseer environmental monitoring, PMUs (B181 & B453), and LLNL weather and environmental information.

A hierarchical data structure for efficient data discovery
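
As an illustration of why a hierarchy helps discovery, here is a toy sketch with invented element and tag names (facility, building, system, rack, sensor); the real structure lives in the PI System and is not reproduced here.

    # Hypothetical facility/building/system/rack hierarchy for data discovery.
    tree = {
        "LLNL": {
            "B453": {
                "Sequoia": {
                    "rack_R00": {"power_kw": "tag.seq.r00.power"},
                    "rack_R01": {"power_kw": "tag.seq.r01.power"},
                },
                "facility": {"chilled_water_supply_f": "tag.b453.chw.supply"},
            }
        }
    }

    def find(node, path):
        """Walk a '/'-separated path such as 'LLNL/B453/Sequoia/rack_R00'."""
        for part in path.split("/"):
            node = node[part]
        return node

    print(find(tree, "LLNL/B453/Sequoia/rack_R00"))  # {'power_kw': 'tag.seq.r00.power'}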

Data analysis: exploratory data analysis. We use PI Coresight for quick data exploration tasks, PI DataLink for reporting and some data analysis tasks, and R for more advanced or repetitive tasks, for example motif matching.
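
The motif matching itself was done in R; purely as an illustration of the idea, the sketch below matches a query shape against a made-up power trace using a z-normalized sliding-window distance (in Python, with invented data).

    import numpy as np

    def znorm(x):
        s = x.std()
        return (x - x.mean()) / (s if s > 0 else 1.0)

    def best_match(series, query):
        """Return the start index of the window most similar to `query`."""
        m = len(query)
        q = znorm(np.asarray(query, dtype=float))
        dists = [np.linalg.norm(znorm(series[i:i + m]) - q)
                 for i in range(len(series) - m + 1)]
        return int(np.argmin(dists))

    # Made-up power trace containing two ramp-down events (illustration only).
    ramp = np.linspace(5.5, 0.1, 10)
    trace = np.concatenate([np.full(50, 5.5), ramp, np.full(50, 5.5), ramp, np.full(30, 5.5)])
    print(best_match(trace, ramp))  # index of the first ramp-down (50)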

Agenda for the rest of the talk: 1. demand response and dynamic power management (DPM); 2. energy efficiency metrics; 3. power quality and outages; 4. power usage characteristics of Sequoia; 5. PI Data Server compression study.

What is dynamic power management? Adjusting power parameters on the fly while ensuring that the deadlines of the running software are met. There are several strategies, but we are looking into one fine-grained power management strategy: power capping, i.e., running the CPU under a power bound.
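
As one concrete (and hypothetical for this talk) way to impose such a bound on Linux, the intel_rapl powercap interface exposes a per-package power limit through sysfs; the sketch assumes that interface is present and writable and does not claim to be the mechanism used in the study.

    import pathlib

    RAPL = pathlib.Path("/sys/class/powercap/intel-rapl:0")  # CPU package 0

    def set_power_cap(watts):
        """Apply a long-term package power limit (needs root and the intel_rapl driver)."""
        (RAPL / "constraint_0_power_limit_uw").write_text(str(int(watts * 1e6)))

    def read_power_cap():
        return int((RAPL / "constraint_0_power_limit_uw").read_text()) / 1e6

    if __name__ == "__main__":
        print("current cap:", read_power_cap(), "W")
        # set_power_cap(80)  # e.g., an 80 W bound, as in the experiment below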

Appro system: Intel Xeon E5-2670; OS: TOSS; interconnect: IB QDR; 426 TeraFLOP/s peak; memory: 41,472 GB; 1,296 nodes with 16 cores/node; power: 564 kW in 675 ft²; #94 on the November 2013 Top500.

Power bound experiment: an embarrassingly parallel application and a memory-bound application, single-socket runs. We run the application with the same power bound across the cluster's processors and characterize processor-to-processor variation across several power bounds. The data show that dynamic power management will be challenging.
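
A small sketch of the per-socket comparison behind that conclusion, with invented runtimes: normalized slowdown is the bounded runtime divided by the unbounded runtime, and a wide spread across nominally identical processors is what makes uniform capping hard.

    # Hypothetical runtimes (seconds) for the same single-socket run,
    # unbounded vs. under an 80 W package power bound.
    unbounded = {"socket00": 100.0, "socket01": 101.0, "socket02": 99.5}
    bounded_80w = {"socket00": 112.0, "socket01": 131.0, "socket02": 118.0}

    slowdown = {s: bounded_80w[s] / unbounded[s] for s in unbounded}
    spread = max(slowdown.values()) - min(slowdown.values())

    for s, v in sorted(slowdown.items()):
        print(f"{s}: normalized slowdown {v:.2f}x")
    print(f"spread across sockets: {spread:.2f}x")  # large spread -> hard to manage uniformly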

[Figure: processor performance under 80 W and 65 W power bounds; axes show effective clock frequency multiplier and normalized slowdown.]

Energy efficiency metrics: PUE/TUE. We found a meter that was not reporting correctly (caught thanks to data redundancy).

Energy efficiency metrics
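
For reference, the standard definitions behind these metrics, following the EE HPC WG usage (the exact meter boundaries used at LLNL are not spelled out on the slide):

    \mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}, \qquad
    \mathrm{ITUE} = \frac{E_{\text{total IT}}}{E_{\text{compute components}}}, \qquad
    \mathrm{TUE} = \mathrm{ITUE} \times \mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{compute components}}}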

Sequoia parameters: IBM Blue Gene/Q* architecture; 98,304 nodes; 1,572,864 cores; 20 PF, 3rd on the June 2013 Top500; 96 racks; 91% liquid cooled (30 gpm/rack at 62 °F); 9% air cooled (1,700 cfm/rack at 70 °F); 4,800 square feet. *Copyright 2013 by International Business Machines Corporation.

PUE dashboard: PUE is calculated using the metered data (not Sequoia rack power) and is now a tag in the database. We found a meter not reporting correctly and verified the data using different sensors and interfaces. The high spikes occur when Sequoia is down for maintenance; the trace shows the regular maintenance schedule with one major outage, plus daily and weekly cycles.
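
A hedged sketch of how such a PUE tag could be derived from two metered power streams, using pandas with invented values on a 15-minute cadence (the actual calculation runs inside the PI System):

    import pandas as pd

    # Invented 15-minute metered power readings, in kW.
    idx = pd.date_range("2014-07-10", periods=8, freq="15min")
    facility_kw = pd.Series([6300, 6350, 6400, 6380, 6900, 6850, 6320, 6310], index=idx)
    it_kw = pd.Series([5000, 5040, 5080, 5060, 5400, 5380, 5010, 5000], index=idx)

    pue = (facility_kw / it_kw).rename("PUE")  # the series that would be written back as a tag
    print(pue.round(3))
    print("mean PUE over the window:", round(pue.mean(), 3))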

[Figure: PUE and TUE for Sequoia from 7/10/14 0:00 to 9/28/14 0:00; axis ticks range from 1.25 to 1.55 (PUE) and 1.6 to 1.9 (TUE).]

Power quality and outages

Event time and date: 7:41 pm on 10/27/2013.

Event time and date: 10:51 pm on 02/08/2015. Wind with gusts over 67 always comes from the SW and SSW directions; winds gusting below 67 come from other directions.

Power usage characteristics of Sequoia

Bursty energy (power) usage driven by user jobs. Bringing the machine down for maintenance results in dumping over 5 MW of load back onto the grid in a short period of time. The real workload shows bursty behavior, and its power fluctuations are more abrupt.

Characterizing the maintenance schedule: scheduled maintenance happens every Wednesday. The base load is ~5 MW and can drop to 180 kW or lower; the duration depends on what kind of maintenance takes place. It takes ~50 seconds to go from 5.5 MW down to 100 kW.
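
A back-of-the-envelope ramp rate implied by those numbers (not a figure quoted on the slide):

    \frac{5.5\ \mathrm{MW} - 0.1\ \mathrm{MW}}{50\ \mathrm{s}} \approx 0.108\ \mathrm{MW/s} \approx 108\ \mathrm{kW/s}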

A closer look at the Sequoia shutdown events

PI Data Server Compression Study

The question: how does compression change the characteristics of the signal? The data are frequency measurements from a PMU (Arbiter Systems model 1133A Power Sentinel).

Experiment: collect the raw data and store it in binary format on the file system; collect the same data into the PI Data Server with different levels of compression; use FFT analysis to compare the uncompressed signal against the signal at different compression levels, averaging 6 hours of data over 15-minute windows.
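
A minimal sketch of that comparison with synthetic data standing in for the PMU archive; the sample rate, the deadband stand-in for the historian's compression, and the divergence threshold are all assumptions made for illustration.

    import numpy as np

    fs = 60.0                              # assumed reporting rate of the frequency stream, Hz
    t = np.arange(0, 15 * 60, 1 / fs)      # one 15-minute window

    # Synthetic "grid frequency" signal in place of the raw PMU data.
    raw = 60.0 + 0.01 * np.sin(2 * np.pi * 0.2 * t) + 0.001 * np.random.randn(t.size)

    def deadband_hold(x, deadband):
        """Crude stand-in for compression: keep a new value only when it moves
        more than `deadband` from the last kept value (not the real PI algorithm)."""
        kept = [x[0]]
        for v in x[1:]:
            kept.append(v if abs(v - kept[-1]) > deadband else kept[-1])
        return np.array(kept)

    compressed = deadband_hold(raw, deadband=0.005)  # arbitrary deadband for this synthetic signal

    def spectrum(x):
        amp = np.abs(np.fft.rfft(x - x.mean())) / x.size
        freqs = np.fft.rfftfreq(x.size, d=1 / fs)
        return freqs, amp

    f, a_raw = spectrum(raw)
    _, a_cmp = spectrum(compressed)
    # Report where the compressed spectrum starts to diverge from the raw one.
    diverged = f[np.argmax(np.abs(a_raw - a_cmp) > 0.1 * a_raw.max())]
    print(f"spectra diverge above ~{diverged:.2f} Hz")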

Distortion and compression level, compression level = 0.00000. [Figure: amplitude spectrum of the frequency signal, uncompressed vs. compressed; amplitude vs. frequency (Hz), log-log axes.]

Distortion and compression level, compression level = 0.06250. [Figure: amplitude spectrum of the frequency signal, uncompressed vs. compressed; amplitude vs. frequency (Hz), log-log axes.]

PI compression level and distortion frequency: frequencies higher than 10 Hz will be distorted at compression levels above 0.007. [Figure: distortion frequency vs. cutoff frequency, log-log axes.]

Compression Ratios

Distance & signal divergence. [Figure: PMU separation vs. frequency difference; frequency of divergence (Hz) vs. distance (m), log-log axes.]

Ghaleb Abdulla, abdulla1@llnl.gov. Senior Computer Scientist; Director, Institute for Scientific Computing Research.

Questions: please wait for the microphone before asking your question, and state your name and company.
