ENERGY-EFFICIENT VISUALIZATION PIPELINES: A CASE STUDY IN CLIMATE SIMULATION
Vignesh Adhinarayanan, Ph.D. (CS) Student, Synergy Lab, Virginia Tech

INTRODUCTION: Supercomputers are constrained by power. The power budget for Los Alamos County is 66 MW; the power budget for the Trinity supercomputer alone is 15 MW. Exceeding the power budget risks brownouts in Los Alamos: installing and starting ASCI White is believed to have played a part in the rolling California brownouts of 2001.

INTRODUCTION: Supercomputers are also constrained by energy. 1 MW of sustained power consumption costs roughly one million dollars per year, so the operating cost of a supercomputer is comparable to its acquisition cost, and the gap is expected to narrow further in the future.
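The "1 MW ≈ $1M per year" figure above can be sanity-checked with back-of-the-envelope arithmetic; the electricity rate used below is an assumed value, not one given in the talk:

```python
# Sanity check: cost of a sustained 1 MW draw over one year.
# rate_usd_per_kwh is an assumed industrial electricity rate.
hours_per_year = 24 * 365                 # 8760 h
rate_usd_per_kwh = 0.11                   # assumption, not from the talk
kw = 1_000                                # 1 MW = 1000 kW
annual_cost = kw * hours_per_year * rate_usd_per_kwh
print(f"${annual_cost:,.0f} per year")    # ≈ $963,600, i.e. about $1M
```

At rates in the $0.10–0.12/kWh range the annual cost lands right around one million dollars, consistent with the rule of thumb on the slide.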

THE ENERGY CHALLENGE: Off-chip data movement costs nearly a hundred times as much energy as on-chip data movement. Image source: J. Shalf et al., Exascale Computing Technology Challenges, VECPAR 2010.

TRADITIONAL POST-PROCESSING VISUALIZATION: The HPC system's nodes run the simulation; a separate rendering farm's nodes perform the visualization; data moves between the two through disks.

MODERN POST-PROCESSING VISUALIZATION: Simulation and visualization both run on the HPC nodes. Raw output is written to disk only every few iterations (i.e., a temporal sampling technique is used), but you may miss important simulation events.

POST-PROCESSING VS. IN-SITU PIPELINES
Post-processing: the simulation writes a large raw output to disk; visualization later reads it back and writes a small image to disk.
In-situ: the simulation feeds the visualization directly, which writes only a small image to disk.
Traditional post-processing: post-processing without any sampling.
Modern post-processing: post-processing with temporal sampling (write output every few iterations; here, every 24 iterations).
In-situ: produce images on the fly, and do so only every few iterations.
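The storage contrast among the three pipelines can be sketched numerically. The raw-snapshot and image sizes below are illustrative assumptions, not measured values; only the sampling interval of 24 iterations comes from the slide:

```python
# Output volume (MB) written over a run of `iterations` steps for each pipeline.
# raw_mb and image_mb are hypothetical sizes chosen for illustration.
def output_volume(iterations, raw_mb=1000.0, image_mb=2.0, interval=24):
    traditional = iterations * raw_mb              # raw dump every iteration
    modern = (iterations // interval) * raw_mb     # temporal sampling of raw data
    in_situ = (iterations // interval) * image_mb  # only rendered images, sampled
    return traditional, modern, in_situ

t, m, i = output_volume(240)
print(t, m, i)  # 240000.0 10000.0 20.0
```

Even against modern post-processing, in-situ writes orders of magnitude less data because an image is far smaller than a raw snapshot, which is the effect behind the ~97.5% storage reduction reported later.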

GOAL: Study the performance, power, and energy trade-offs among traditional post-processing, modern post-processing, and in-situ visualization pipelines. Detailed sub-component-level power measurements within a node (i.e., CPU, memory, and disk power) provide fine-grained insight; measurements at scale expose problems unique to big supercomputers.

APPLICATION: Model for Prediction Across Scales (MPAS) Ocean simulation, which solves an unstructured-mesh problem. End goal: identify eddies in the ocean (e.g., the eddies near Southern Africa shown on the slide).

EXPERIMENTS AT SCALE: HARDWARE PLATFORM
Compute nodes: 64 nodes, each with 2x Intel Xeon E5-2670 and 64 GB of RAM. Nominal power consumption: 6,000 W (idle) to 20,000 W (under a workload such as MPAS).
Storage nodes: Lustre file system; 5 nodes configured as 1 master + 2 MDS + 2 OSS, with 1 RAID storage array per MDS and OSS. Nominal power consumption: 2,500 W (idle) to 2,800 W (active).

EXPERIMENTS AT SCALE: ENERGY COMPARISON. Combining real measurements with partial measurement and estimation, in-situ consumes 19% less energy than post-processing.

SINGLE-NODE EXPERIMENTS: HARDWARE PLATFORM (hardware configuration table).

DATA COLLECTION: Power readings are logged every second. Memory and processor power comes from a validated power model; full-system power comes from a real power meter; disk power for the micro-benchmarks is estimated as the Watts Up meter reading minus the RAPL estimate.
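The residual-power estimate described above can be sketched as follows; the meter readings in the example are made-up numbers, and in reality the residual also includes other unmodeled components (fans, motherboard, etc.), not the disk alone:

```python
# Residual-power estimate: full-system wall power (Watts Up meter) minus the
# CPU package and DRAM power reported via RAPL. For an I/O micro-benchmark,
# the residual is attributed mainly to the disk.
def estimate_disk_power(wattsup_w, rapl_pkg_w, rapl_dram_w):
    return wattsup_w - (rapl_pkg_w + rapl_dram_w)

# Hypothetical one-second sample: 120 W at the wall, 85 W CPU, 12 W DRAM.
print(estimate_disk_power(120.0, 85.0, 12.0))  # 23.0 W residual
```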

DISK POWER MODEL: A constant power term from the spinning of the disk; a read/write-head term that depends on the number of I/O operations; and a term for the actual reads and writes that depends on the volume of data transferred.
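A minimal sketch of such a three-term linear model; the coefficients below are hypothetical placeholders, not values from the talk:

```python
# Three-term disk power model: constant spindle power, plus energy per I/O
# operation (head activity), plus energy per byte transferred.
# All coefficients are assumed illustrative values.
def disk_power(iops, bytes_per_s,
               p_spin=4.0,     # W, constant spindle power (assumption)
               e_op=0.002,     # J per I/O operation (assumption)
               e_byte=5e-9):   # J per byte transferred (assumption)
    return p_spin + e_op * iops + e_byte * bytes_per_s

# 500 IOPS moving 50 MB/s: 4 W spindle + 1.0 W head + 0.25 W transfer.
print(disk_power(iops=500, bytes_per_s=50e6))  # 5.25 W
```

In practice the coefficients would be fitted from micro-benchmark runs whose disk power was obtained by the meter-minus-RAPL residual described on the data-collection slide.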

SINGLE-NODE EXPERIMENTS: ENERGY COMPARISON. The processor and memory consume a lot of energy while waiting for I/O, so it is worthwhile to minimize energy consumption while idling.

SINGLE-NODE EXPERIMENTS: STORAGE REQUIREMENTS. The in-situ pipeline requires ~97.5% less storage, which implies a smaller storage cluster and, in turn, lower power consumption.

RESOURCES FOR POST-PROCESSING: COMPUTE NODES AND STORAGE NODES

RESOURCES FOR IN-SITU: COMPUTE NODES AND STORAGE NODES

REDISTRIBUTING STORAGE POWER TO COMPUTE NODES: IMPACT ON PERFORMANCE. Assuming that shrinking the storage cluster frees 10% of total power to be redirected to the compute nodes, performance improves by up to 6% for MPAS-O.

FINDINGS: Most energy savings come from reducing system idling (i.e., from reducing the I/O wait time). Further savings are possible if we can reduce the size of the storage cluster.

CONCLUSION: In-situ visualization offers the following advantages: reduced energy consumption (by reducing system idling, i.e., I/O wait time); reduced power (by using fewer storage nodes); and improved performance (by reducing I/O wait time and by making more power available to the compute nodes).

APPENDIX

EXPECTATIONS FOR A SUPERCOMPUTER (vs. the single-node setup)
Increased I/O wait time: storage is separated from compute by a network, leading to longer execution time and a corresponding increase in energy.
Additional energy consumption from data movement through the network: in the single-node case, no data is transferred over network cables.
Higher power/energy overhead for storage: a separate storage cluster needs additional CPUs, memory, cooling, etc., whereas in the single-node case the storage sub-system is shared with the compute sub-system.

FUTURE DIRECTIONS
Enhancing HPC systems: flash buffers and SSDs can reduce I/O wait time; the downside is that introducing more components can increase power consumption.
HPC system design changes: bringing storage nodes and compute nodes together, similar to the Memory-in-Processor or Processor-in-Memory concepts from the computer architecture community.
Runtime system changes: energy-proportional computing and storage, e.g., putting compute nodes into sleep states during I/O, and putting some storage nodes into a deep sleep state when bandwidth and storage requirements are lower.

SINGLE-NODE EXPERIMENTS: EXECUTION-TIME COMPARISON. In-situ takes 7% less execution time than modern post-processing, thanks to reduced I/O wait time. The difference will be more significant for an HPC system (details later).

SINGLE-NODE EXPERIMENTS: POWER COMPARISON. In-situ consumes 3% more power than modern post-processing, making for a difficult trade-off, though the result might not be the same for a supercomputer (details later).