I/O Profiling Towards the Exascale


Holger Brunst (holger.brunst@tu-dresden.de), ZIH, Technische Universität Dresden
NEXTGenIO & SAGE: Working towards Exascale I/O, Barcelona

NEXTGenIO facts
- Project: Research & Innovation Action, 36-month duration, €8.1 million budget
- Partners: EPCC, Intel, Fujitsu, BSC, TUD, Allinea, ECMWF, Arctur

Approx. 50% is committed to hardware development. Note: the final configuration may differ.

Intel DIMMs are a key feature
- Non-volatile RAM based on 3D XPoint technology
- Much larger capacity than DRAM
- Slower than DRAM, but only by a certain factor
- Significantly faster than SSDs
- 12 DIMM slots per socket, holding a combination of DDR4 and Intel DIMMs

Three usage models
- The memory usage model: an extension of the main memory; data is volatile like normal main memory
- The storage usage model: a classic persistent block device, like a very fast SSD
- The application direct usage model: maps persistent storage into the address space and accesses it with direct CPU load/store instructions (see the sketch below)
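The application direct model is what persistent-memory libraries such as Intel's PMDK (libpmem) expose. A minimal sketch under assumptions: PMDK is installed and /mnt/pmem is a DAX-capable mount; the path, file size, and payload are illustrative.

```c
/* App-direct usage sketch with libpmem (PMDK).
 * Assumes a DAX-capable filesystem mounted at /mnt/pmem;
 * path, size, and payload are illustrative only. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a 4 KiB file into the address space, creating it if needed. */
    char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Plain CPU stores; no read()/write() system calls involved. */
    strcpy(addr, "hello, persistent memory");

    /* Flush the stores to the persistence domain. */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len); /* fallback for non-pmem mappings */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```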

New members in the memory hierarchy
- New memory technology changes the memory hierarchy we have today
- What is the impact on applications, e.g. simulations?
- I/O performance is one of the critical components for scaling up HPC applications and enabling HPDA applications at scale

[Figure: memory & storage latency gaps. In today's HPC systems the hierarchy steps from CPU registers (1x) through caches (10x) to DRAM, then jumps roughly 100,000x down to SSD and spinning-disk storage. In future systems NVRAM sits between DRAM and storage at roughly 100x, with SSD, spinning disk, MAID disk, and tape backup below, so each tier is only one to a few orders of magnitude slower than the one above.]

Remote memory access on top
- Network hardware will support remote access to data in NVDIMMs, which is to be shared between nodes
- The systemware must support remote access, data partitioning, and replication

Using distributed storage
- A global file system requires no changes to applications
- Required functionality:
  - Create and tear down file systems for jobs, working across nodes
  - Preload and post-move file systems
  - Support multiple file systems across the system
- I/O performance is the sum of many layers (node memory, network, file system)

Using an object store
- Needs changes in applications
- Needs the same functionality as a global file system, but removes the need for POSIX functionality
- I/O performance: a different type of abstraction (mapping to objects) and a different kind of instrumentation

Towards workflows
- Resident data sets: sharing preloaded data across a range of jobs
- Data analytic workflows: how to control access, authorisation, security, etc.?
- Producer-consumer model: remove the file system from intermediate stages
- Data merging/integration?

Tools have three key objectives
Analysis tools need to:
1. Reveal performance interdependencies in the I/O and memory hierarchy
2. Support workflow visualization
3. Exploit NVRAM to store their own data
(Plus workload modelling.)

Vampir & Score-P

How to meet the objectives?
- File I/O and NVRAM performance monitoring (data acquisition): sampling and tracing
- Statistical analysis (profiles) and time-series analysis
- Multiple layers, captured simultaneously, with topology context
- Workflow support: merge and relate performance data from multiple data sources

Tapping the I/O layers
- I/O layers: POSIX, MPI-I/O, HDF5, NetCDF, PNetCDF, ADIOS, and the file system (Lustre)
- Data of interest: open/create/close operations (metadata) and data transfer operations
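At the POSIX layer, such taps are commonly implemented by intercepting the libc calls. A minimal sketch of the generic LD_PRELOAD technique (an illustration, not the actual Score-P implementation):

```c
/* Minimal POSIX I/O interception sketch (LD_PRELOAD technique).
 * Illustrative only; real tools such as Score-P record far richer events.
 * Build: gcc -shared -fPIC -o libiotrace.so iotrace.c -ldl
 * Run:   LD_PRELOAD=./libiotrace.so ./app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t (*real_read)(int, void *, size_t) = NULL;

ssize_t read(int fd, void *buf, size_t count)
{
    /* Look up the real libc read() on first use. */
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))
                        dlsym(RTLD_NEXT, "read");

    ssize_t ret = real_read(fd, buf, count);

    /* Record the data transfer operation (here: just log it). */
    fprintf(stderr, "[iotrace] read(fd=%d, count=%zu) = %zd\n",
            fd, count, ret);
    return ret;
}
```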

What the NVM library tells us
- Allocation and free events
- Information: memory size (requested and usable), a high-water-mark metric, size and number of elements in memory
- NVRAM health status (not measurable at high frequencies)
- Individual NVRAM loads/stores remain out of scope (e.g. through memory-mapped files)
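A sketch of how allocation/free events and a high-water mark could be gathered by wrapping the PMDK libpmemobj allocation calls; the traced_* wrappers and counters are hypothetical, only the pmemobj_* functions are real API:

```c
/* Sketch: tracking NVM allocation events and a high-water mark by
 * wrapping libpmemobj calls. traced_alloc()/traced_free() and the
 * counters are illustrative; only the pmemobj_* calls are real PMDK API. */
#include <libpmemobj.h>
#include <stdio.h>

static size_t nvm_in_use = 0;      /* currently allocated bytes (usable) */
static size_t nvm_high_water = 0;  /* high-water mark */
static size_t nvm_num_objects = 0; /* number of live objects */

int traced_alloc(PMEMobjpool *pop, PMEMoid *oidp, size_t requested)
{
    int ret = pmemobj_alloc(pop, oidp, requested, 0, NULL, NULL);
    if (ret == 0) {
        /* The usable size may exceed the requested size. */
        size_t usable = pmemobj_alloc_usable_size(*oidp);
        nvm_in_use += usable;
        nvm_num_objects++;
        if (nvm_in_use > nvm_high_water)
            nvm_high_water = nvm_in_use;
        fprintf(stderr, "[nvm] alloc requested=%zu usable=%zu hwm=%zu\n",
                requested, usable, nvm_high_water);
    }
    return ret;
}

void traced_free(PMEMoid *oidp)
{
    nvm_in_use -= pmemobj_alloc_usable_size(*oidp);
    nvm_num_objects--;
    pmemobj_free(oidp);
}
```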

Memory access statistics
- Where are the memory access hotspots when using DRAM and NVRAM? Where, when, and in which type of memory?
- Metric collection needs to be extended to distinguish:
  1. DRAM local access
  2. DRAM remote access (on a different socket)
  3. NVRAM local access
  4. NVRAM remote access (on a different socket)

Access to the PMU using perf
- Architecture-independent counters (may introduce some overhead):
  - MEM_TRANS_RETIRED.LOAD_LATENCY
  - MEM_TRANS_RETIRED.PRECISE_STORE
  - Guess: will these also work for NVRAM?
- Architecture-dependent counters:
  - For DRAM: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM and MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
  - For NVRAM: MEM_LOAD_UOPS_*.REMOTE_NVRAM? MEM_LOAD_UOPS_*.LOCAL_NVRAM? (a counting sketch follows below)
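A minimal sketch of counting local vs. remote DRAM load uops with the Linux perf_event_open interface. The raw encodings (event 0xD3 with umasks 0x01/0x04) are an assumption for a Haswell-class core and differ per microarchitecture; the hypothetical NVRAM umasks do not exist yet.

```c
/* Sketch: counting local vs. remote DRAM load uops via perf_event_open.
 * The raw encodings (event 0xD3, umask 0x01 local / 0x04 remote DRAM)
 * are microarchitecture-specific and assumed here for illustration;
 * check your CPU's event list before relying on them. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_raw_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;   /* raw PMU event encoding */
    attr.config = config;        /* (umask << 8) | event */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* Count for the calling thread on any CPU. */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_local  = open_raw_counter(0x01D3); /* LOCAL_DRAM (assumed)  */
    int fd_remote = open_raw_counter(0x04D3); /* REMOTE_DRAM (assumed) */

    ioctl(fd_local, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_remote, PERF_EVENT_IOC_ENABLE, 0);

    /* ... memory-intensive workload under test goes here ... */

    uint64_t local = 0, remote = 0;
    read(fd_local, &local, sizeof(local));
    read(fd_remote, &remote, sizeof(remote));
    printf("local DRAM load uops: %llu, remote: %llu\n",
           (unsigned long long)local, (unsigned long long)remote);
    return 0;
}
```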

I/O operations over time
[Screenshot: individual I/O operations and their I/O runtime contribution]

I/O data rate over time
[Screenshot: I/O data rate of a single thread]

I/O summaries with totals
Other metrics: IOPS, I/O time, I/O size, I/O bandwidth

I/O summaries per file

I/O operations per file
[Screenshot: focus on a specific resource, or show all resources]

Taken from my daily work...
- A single (serial) application bringing the system I/O down
- Higher I/O demand than the IOR benchmark. Why?

Coarse-grained time series reveal some clues, but...

Details make a difference
- A single NetCDF get_vara_float call triggers... 15 (!) POSIX read operations

Approaching the real cause
- Even worse: NetCDF reads 136 KB to provide just 2 KB
- A single NetCDF get_vara_float call triggers... 15 (!) POSIX read operations
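For reference, the call in question reads a hyperslab of a float variable. A minimal sketch of such a read (file name, variable name, and hyperslab extents are made up for illustration):

```c
/* Sketch: the kind of NetCDF read discussed above. The file, variable,
 * and hyperslab extents are illustrative. A single nc_get_vara_float()
 * call can fan out into many small POSIX reads, depending on the file
 * layout and library caching. */
#include <netcdf.h>
#include <stdio.h>

int main(void)
{
    int ncid, varid;
    size_t start[1] = { 0 };
    size_t count[1] = { 512 };  /* 512 floats = a 2 KB request */
    float data[512];

    if (nc_open("input.nc", NC_NOWRITE, &ncid) != NC_NOERR)
        return 1;
    if (nc_inq_varid(ncid, "temperature", &varid) != NC_NOERR)
        return 1;

    /* This single library call may trigger many POSIX read()s. */
    if (nc_get_vara_float(ncid, varid, start, count, data) != NC_NOERR)
        return 1;

    printf("first value: %f\n", data[0]);
    nc_close(ncid);
    return 0;
}
```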

Before and after

Summary
- NEXTGenIO is developing a full hardware and software solution
- Performance focus: consider the complete I/O stack, incorporate new I/O paradigms, and study the implications of NVRAM
- Reduce I/O costs
- New usage models for HPC and HPDA