Data Management and Analysis for Energy Efficient HPC Centers

Data Management and Analysis for Energy Efficient HPC Centers Presented by Ghaleb Abdulla, Anna Maria Bailey and John Weaver; Copyright 2015 OSIsoft, LLC

Data management and analysis for energy efficient HPC centers. Ghaleb Abdulla, Anna Maria Bailey and John Weaver. LLNL-PRES-XXXXXX. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Acknowledgements: Philip Top (LLNL), Chuck Wells (OSIsoft), the EEHPC demand response (D/R) group, and the EEHPC PUE/TUE group.

Data collection helps us assess the health and efficiency of our facilities: perform event analysis (loss of power, voltage sag/dip, voltage swell, etc.); understand where and how power is used; perform energy efficiency studies; plan for exascale computing; understand current usage patterns; and relate component-, system-, and facility-level data.

Implement a centralized system. Manage: gather and evaluate large amounts of data from sources such as rack and equipment metering, building management, and utility interfaces, tens of thousands of real-time data streams in all. Analyze: correlate real-time data. Notify: centralized event notification. Visualize: view data and reports. We created an operational, event, and real-time data management infrastructure covering all external and internal data sources.
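
To make that flow concrete, here is a minimal gather-correlate-notify sketch in Python; the tag names, sample values, and the three-sigma rule are invented for illustration and are not taken from the LLNL deployment.

    import statistics
    from collections import defaultdict

    # Hypothetical readings: (source, tag, value) -- all names invented.
    readings = [
        ("rack_metering", "rack42.power_kw", 31.5),
        ("rack_metering", "rack42.power_kw", 31.9),
        ("rack_metering", "rack42.power_kw", 30.8),
        ("rack_metering", "rack42.power_kw", 55.0),   # sudden jump -> should raise an event
        ("building_mgmt", "chiller3.supply_temp_f", 61.8),
        ("utility",       "feeder_a.voltage_v",      11950.0),
    ]

    # Centralize: one store keyed by tag, loosely like tags in a historian.
    store = defaultdict(list)
    for source, tag, value in readings:
        store[tag].append(value)

    # Analyze + notify: flag tags whose latest value deviates strongly from
    # that tag's earlier history (a simple stand-in for real event rules).
    for tag, values in store.items():
        history, latest = values[:-1], values[-1]
        if len(history) >= 2:
            mean, sd = statistics.mean(history), statistics.pstdev(history) or 1.0
            if abs(latest - mean) > 3 * sd:
                print(f"EVENT: {tag} deviates from its history: {latest}")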

Current data sources are spread across LLNL and feed into the OSIsoft PI System: Synap Cabarnet environmental data, the ALC building management system, Nexus HPC rack metering, Sequoia & Vulcan power, MV90/EEM sitewide metering, Siemens SCADA electric utility data, Forseer environmental monitoring, PMUs (B181 & B453), and LLNL weather and environmental information.

A hierarchical data structure for efficient data discovery
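
As an illustration of why a hierarchy helps discovery, here is a toy sketch with invented element and tag names (facility, building, system, rack, sensor); the real structure lives in the PI System and is not reproduced here.

    # Hypothetical facility/building/system/rack hierarchy for data discovery.
    tree = {
        "LLNL": {
            "B453": {
                "Sequoia": {
                    "rack_R00": {"power_kw": "tag.seq.r00.power"},
                    "rack_R01": {"power_kw": "tag.seq.r01.power"},
                },
                "facility": {"chilled_water_supply_f": "tag.b453.chw.supply"},
            }
        }
    }

    def find(node, path):
        """Walk a '/'-separated path such as 'LLNL/B453/Sequoia/rack_R00'."""
        for part in path.split("/"):
            node = node[part]
        return node

    print(find(tree, "LLNL/B453/Sequoia/rack_R00"))  # {'power_kw': 'tag.seq.r00.power'}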

Data analysis: exploratory data analysis. We use PI Coresight for quick data exploration tasks, PI DataLink for reporting and some data analysis tasks, and R for more advanced or repetitive tasks, for example motif matching.
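
The motif matching itself was done in R; purely as an illustration of the idea, the sketch below matches a query shape against a made-up power trace using a z-normalized sliding-window distance (in Python, with invented data).

    import numpy as np

    def znorm(x):
        s = x.std()
        return (x - x.mean()) / (s if s > 0 else 1.0)

    def best_match(series, query):
        """Return the start index of the window most similar to `query`."""
        m = len(query)
        q = znorm(np.asarray(query, dtype=float))
        dists = [np.linalg.norm(znorm(series[i:i + m]) - q)
                 for i in range(len(series) - m + 1)]
        return int(np.argmin(dists))

    # Made-up power trace containing two ramp-down events (illustration only).
    ramp = np.linspace(5.5, 0.1, 10)
    trace = np.concatenate([np.full(50, 5.5), ramp, np.full(50, 5.5), ramp, np.full(30, 5.5)])
    print(best_match(trace, ramp))  # index of the first ramp-down (50)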

Agenda for the rest of the talk: 1. demand response and dynamic power management (DPM); 2. energy efficiency metrics; 3. power quality and outages; 4. power usage characteristics of Sequoia; 5. PI Data Server compression study.

What is dynamic power management? Adjusting power parameters on the fly while ensuring that the deadlines of the running software are met. There are several strategies, but we are looking into one fine-grained power management strategy: power capping, i.e., running the CPU under a power bound.
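
As one concrete (and hypothetical for this talk) way to impose such a bound on Linux, the intel_rapl powercap interface exposes a per-package power limit through sysfs; the sketch assumes that interface is present and writable and does not claim to be the mechanism used in the study.

    import pathlib

    RAPL = pathlib.Path("/sys/class/powercap/intel-rapl:0")  # CPU package 0

    def set_power_cap(watts):
        """Apply a long-term package power limit (needs root and the intel_rapl driver)."""
        (RAPL / "constraint_0_power_limit_uw").write_text(str(int(watts * 1e6)))

    def read_power_cap():
        return int((RAPL / "constraint_0_power_limit_uw").read_text()) / 1e6

    if __name__ == "__main__":
        print("current cap:", read_power_cap(), "W")
        # set_power_cap(80)  # e.g., an 80 W bound, as in the experiment below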

Appro system: Intel Xeon E5-2670; OS: TOSS; interconnect: IB QDR; 426 TeraFLOP/s peak; memory: 41,472 GB; 1,296 nodes with 16 cores/node; power: 564 kW in 675 ft²; #94 on the November 2013 Top500.

Power bound experiment: an embarrassingly parallel application and a memory-bound application, single-socket runs. We run the application with the same power bound across the cluster's processors and characterize processor-to-processor variation across several power bounds. The data show that dynamic power management will be challenging.
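
A small sketch of the per-socket comparison behind that conclusion, with invented runtimes: normalized slowdown is the bounded runtime divided by the unbounded runtime, and a wide spread across nominally identical processors is what makes uniform capping hard.

    # Hypothetical runtimes (seconds) for the same single-socket run,
    # unbounded vs. under an 80 W package power bound.
    unbounded = {"socket00": 100.0, "socket01": 101.0, "socket02": 99.5}
    bounded_80w = {"socket00": 112.0, "socket01": 131.0, "socket02": 118.0}

    slowdown = {s: bounded_80w[s] / unbounded[s] for s in unbounded}
    spread = max(slowdown.values()) - min(slowdown.values())

    for s, v in sorted(slowdown.items()):
        print(f"{s}: normalized slowdown {v:.2f}x")
    print(f"spread across sockets: {spread:.2f}x")  # large spread -> hard to manage uniformly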

[Figure: processor performance under 80 W and 65 W power bounds; axes show effective clock frequency multiplier and normalized slowdown.]

Energy efficiency metrics: PUE/TUE. We found a meter that was not reporting correctly (caught thanks to data redundancy).

Energy efficiency metrics
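
For reference, the standard definitions behind these metrics, following the EE HPC WG usage (the exact meter boundaries used at LLNL are not spelled out on the slide):

    \mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}, \qquad
    \mathrm{ITUE} = \frac{E_{\text{total IT}}}{E_{\text{compute components}}}, \qquad
    \mathrm{TUE} = \mathrm{ITUE} \times \mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{compute components}}}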

Sequoia parameters: IBM Blue Gene/Q* architecture; 98,304 nodes; 1,572,864 cores; 20 PF, 3rd on the June 2013 Top500; 96 racks; 91% liquid cooled (30 gpm/rack at 62 °F); 9% air cooled (1,700 cfm/rack at 70 °F); 4,800 square feet. *Copyright 2013 by International Business Machines Corporation.

PUE dashboard: PUE is calculated using the metered data (not Sequoia rack power) and is now a tag in the database. We found a meter not reporting correctly and verified the data using different sensors and interfaces. The high spikes occur when Sequoia is down for maintenance; the trace shows the regular maintenance schedule with one major outage, plus daily and weekly cycles.
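
A hedged sketch of how such a PUE tag could be derived from two metered power streams, using pandas with invented values on a 15-minute cadence (the actual calculation runs inside the PI System):

    import pandas as pd

    # Invented 15-minute metered power readings, in kW.
    idx = pd.date_range("2014-07-10", periods=8, freq="15min")
    facility_kw = pd.Series([6300, 6350, 6400, 6380, 6900, 6850, 6320, 6310], index=idx)
    it_kw = pd.Series([5000, 5040, 5080, 5060, 5400, 5380, 5010, 5000], index=idx)

    pue = (facility_kw / it_kw).rename("PUE")  # the series that would be written back as a tag
    print(pue.round(3))
    print("mean PUE over the window:", round(pue.mean(), 3))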

[Figure: PUE and TUE for Sequoia from 7/10/14 0:00 to 9/28/14 0:00; axis ticks range from 1.25 to 1.55 (PUE) and 1.6 to 1.9 (TUE).]

Power quality and outages

Event time and date: 7:41 pm on 10/27/2013.

Event time and date: 10:51 pm on 02/08/2015. Wind with gusts over 67 always comes from the SW and SSW directions; winds gusting below 67 come from other directions.

Power usage characteristics of Sequoia

Bursty energy (power) usage driven by user jobs. Bringing the machine down for maintenance results in dumping over 5 MW of load back onto the grid in a short period of time. The real workload shows bursty behavior, and its power fluctuations are more abrupt.

Characterizing the maintenance schedule: scheduled maintenance happens every Wednesday. The base load is ~5 MW and can drop to 180 kW or lower; the duration depends on what kind of maintenance takes place. It takes ~50 seconds to go from 5.5 MW down to 100 kW.
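
A back-of-the-envelope ramp rate implied by those numbers (not a figure quoted on the slide):

    \frac{5.5\ \mathrm{MW} - 0.1\ \mathrm{MW}}{50\ \mathrm{s}} \approx 0.108\ \mathrm{MW/s} \approx 108\ \mathrm{kW/s}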

A closer look at the Sequoia shutdown events

PI Data Server Compression Study

The question: how does compression change the characteristics of the signal? The data are frequency measurements from a PMU (Arbiter Systems model 1133A Power Sentinel).

Experiment: collect the raw data and store it in binary format on the file system; collect the same data into the PI Data Server with different levels of compression; use FFT analysis to compare the uncompressed signal against the signal at different compression levels, averaging 6 hours of data over 15-minute windows.
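
A minimal sketch of that comparison with synthetic data standing in for the PMU archive; the sample rate, the deadband stand-in for the historian's compression, and the divergence threshold are all assumptions made for illustration.

    import numpy as np

    fs = 60.0                              # assumed reporting rate of the frequency stream, Hz
    t = np.arange(0, 15 * 60, 1 / fs)      # one 15-minute window

    # Synthetic "grid frequency" signal in place of the raw PMU data.
    raw = 60.0 + 0.01 * np.sin(2 * np.pi * 0.2 * t) + 0.001 * np.random.randn(t.size)

    def deadband_hold(x, deadband):
        """Crude stand-in for compression: keep a new value only when it moves
        more than `deadband` from the last kept value (not the real PI algorithm)."""
        kept = [x[0]]
        for v in x[1:]:
            kept.append(v if abs(v - kept[-1]) > deadband else kept[-1])
        return np.array(kept)

    compressed = deadband_hold(raw, deadband=0.005)  # arbitrary deadband for this synthetic signal

    def spectrum(x):
        amp = np.abs(np.fft.rfft(x - x.mean())) / x.size
        freqs = np.fft.rfftfreq(x.size, d=1 / fs)
        return freqs, amp

    f, a_raw = spectrum(raw)
    _, a_cmp = spectrum(compressed)
    # Report where the compressed spectrum starts to diverge from the raw one.
    diverged = f[np.argmax(np.abs(a_raw - a_cmp) > 0.1 * a_raw.max())]
    print(f"spectra diverge above ~{diverged:.2f} Hz")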

Distortion and compression level, compression level = 0.00000. [Figure: amplitude spectrum of the frequency signal, uncompressed vs. compressed; amplitude vs. frequency (Hz), log-log axes.]

Distortion and compression level, compression level = 0.06250. [Figure: amplitude spectrum of the frequency signal, uncompressed vs. compressed; amplitude vs. frequency (Hz), log-log axes.]

PI compression level and distortion frequency: frequencies higher than 10 Hz will be distorted at compression levels above 0.007. [Figure: distortion frequency vs. cutoff frequency, log-log axes.]

Compression Ratios

Distance & signal divergence. [Figure: PMU separation vs. frequency difference; frequency of divergence (Hz) vs. distance (m), log-log axes.]

Ghaleb Abdulla, abdulla1@llnl.gov. Senior Computer Scientist; Director, Institute for Scientific Computing Research.

Questions: please wait for the microphone before asking your question, and state your name and company.
