Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Similar documents
Exploring Use-cases for Non-Volatile Memories in support of HPC Resilience

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.

CA485 Ray Walshe Google File System

Transparent Throughput Elasticity for IaaS Cloud Storage Using Guest-Side Block-Level Caching

MOHA: Many-Task Computing Framework on Hadoop

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

IME Infinite Memory Engine Technical Overview

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Techniques to improve the scalability of Checkpoint-Restart

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work

The Fusion Distributed File System

Comparing Performance of Solid State Devices and Mechanical Disks

Hedvig as backup target for Veeam

A New Key-Value Data Store For Heterogeneous Storage Architecture

Dongjun Shin Samsung Electronics

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Using DDN IME for Harmonie

Mining Supercomputer Jobs' I/O Behavior from System Logs. Xiaosong Ma

SFS: Random Write Considered Harmful in Solid State Drives

ChunkStash: Speeding Up Storage Deduplication using Flash Memory

Analyzing I/O Performance on a NEXTGenIO Class System

A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH. WEN XIA, HONG JIANG, DAN FENG, LEI TIAN, MIN FU, YUKUN ZHOU

Method to Establish a High Availability and High Performance Storage Array in a Green Environment

Distributed File Systems II

NPTEL Course Jan K. Gopinath Indian Institute of Science

New HPE 3PAR StoreServ 8000 and series Optimized for Flash

Speeding Up Cloud/Server Applications Using Flash Memory

Utilizing Unused Resources To Improve Checkpoint Performance

Enhancing Checkpoint Performance with Staging IO & SSD

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Optimizing Local File Accesses for FUSE-Based Distributed Storage

STAR-CCM+ Performance Benchmark and Profiling. July 2014

Revealing Applications Access Pattern in Collective I/O for Cache Management

A New Key-value Data Store For Heterogeneous Storage Architecture Intel APAC R&D Ltd.

GFS: The Google File System

Request-Oriented Durable Write Caching for Application Performance appeared in USENIX ATC '15. Jinkyu Jeong Sungkyunkwan University

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

Azor: Using Two-level Block Selection to Improve SSD-based I/O caches

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

The Leading Parallel Cluster File System

Request-Oriented Durable Write Caching for Application Performance

WearDrive: Fast and Energy Efficient Storage for Wearables

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems

vSAN 6.6 Performance Improvements

Improved Solutions for I/O Provisioning and Application Acceleration

stdchk: A Checkpoint Storage System for Desktop Grid Computing

IBM InfoSphere Streams v4.0 Performance Best Practices

DISP: Optimizations Towards Scalable MPI Startup

HPC Storage Use Cases & Future Trends

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportunities

The Google File System

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

Onyx: A Prototype Phase-Change Memory Storage Array

vcacheshare: Automated Server Flash Cache Space Management in a Virtualization Environment

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

Structuring PLFS for Extensibility

NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines

Combing Partial Redundancy and Checkpointing for HPC

Leveraging Flash in HPC Systems

Filesystem Features and Performance

NetVault Backup Client and Server Sizing Guide 2.1

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh

Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-Volatile Burst Buffers

The RAMDISK Storage Accelerator

The Google File System

SMORE: A Cold Data Object Store for SMR Drives

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

REMEM: REmote MEMory as Checkpointing Storage

Configuring Short RPO with Actifio StreamSnap and Dedup-Async Replication

Software Defined Storage at the Speed of Flash. Carlos Carrero Rajagopal Vaideeswaran Symantec

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

Buffered Co-scheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems

MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores

LEVERAGING FLASH MEMORY in ENTERPRISE STORAGE

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

Presented by: Nafiseh Mahmoudi Spring 2017

CS3600 SYSTEMS AND NETWORKS

FhGFS - Performance at the maximum

Harp-DAAL for High Performance Big Data Computing

Advanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS)

A Framework for Providing Quality of Service in Chip Multi-Processors

IBM DS8870 Release 7.0 Performance Update

Samsara: Efficient Deterministic Replay in Multiprocessor. Environments with Hardware Virtualization Extensions

Emerging NVM Memory Technologies

NEXTGenIO Performance Tools for In-Memory I/O

INTEGRATING HPFS IN A CLOUD COMPUTING ENVIRONMENT

Toward Scalable Monitoring on Large-Scale Storage for Software Defined Cyberinfrastructure

Oasis: An Active Storage Framework for Object Storage Platform

SolidFire and Ceph Architectural Comparison

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Using Transparent Compression to Improve SSD-based I/O Caches

Deduplication Storage System

Experimental Study of Virtual Machine Migration in Support of Reservation of Cluster Resources

Pocket: Elastic Ephemeral Storage for Serverless Analytics

XtremIO Business Continuity & Disaster Recovery. Aharon Blitzer & Marco Abela XtremIO Product Management

Scale-out Data Deduplication Architecture

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan

Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information

Transcription:

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Virginia Tech, Oak Ridge National Laboratory, North Carolina State University

Many-cores are driving HPC. There is a need to redesign the HPC software stack to benefit from the increasing number of cores.

Can't apps simply use more cores? [Figure: speedup of mpiBLAST and FLASH vs. number of cores.] Simply assigning more cores to applications does not scale.

Growing computation-I/O gap degrades performance. [Figure: performance growth trend, 2000-2010: server/CPU performance grows roughly 25x while disk-based storage systems grow roughly 2x, creating a "storage wall". Source: storagetopic.com] Research question: Can the underutilized cores be leveraged to bridge the compute-I/O gap?

Observation: All workflow activities (not just compute) affect overall performance. [Figure: workflow timeline with script, preprocessing, computation, checkpointing, and I/O phases, and the resulting end-to-end performance gain.]

Our Contribution: Functional Partitioning (FP) of Cores. Idea: partition the app core allocation and dedicate partitions to different app activities: compute, checkpoint, format transformation, etc. Bring app support services into the compute node; this transforms the compute node into a scalable unit for service composition. A generalized FP-based I/O runtime environment: SSD-based checkpointing, adaptive checkpoint draining, deduplication, format transformation. A thorough evaluation of FP using a 160-core testbed.

Functional Partitioning (FP). [Figure: FP overview for an FP(2,8) configuration: a launch script starts the application with a deduplication core and a checkpointing core set aside on each node; data flows into an aggregate buffer and then to the parallel file system.]

Agenda: Motivation; Functional Partitioning (FP); FP Case Studies; Evaluation; Conclusion.

Challenges in FP design: How to co-execute the support services with the app? How to assign cores for the support activities? How to share data between compute and support activities? How to make the FP runtime transparent? How to have a flexible API for different support activities? How to adapt support partitions based on progress? How to minimize the overhead of the FP runtime?

FP runtime design: Uses app-specific instances set up as part of job startup. Uses an interpositioning strategy for data management: initiates after core allocation by the scheduler and before application startup (mpirun); pins the admin software to a core; sets up a FUSE-based mount point for data sharing between compute and support services; initiates the support services and the application's main compute to use the shared mount space (see the sketch below).
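
A minimal sketch of this interposition step: pin the admin process to a core, mount a FUSE-backed shared space, then start the aux-apps and the application's mpirun. The helper names (fp_fuse, fp_auxapp), the mount point /mnt/fp_shared, and the core choice are illustrative assumptions, not the actual FP runtime interface.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    /* Pin this admin process to the node's last core. */
    int admin_core = (int)sysconf(_SC_NPROCESSORS_ONLN) - 1;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(admin_core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* Set up the FUSE-based shared mount point before the app starts. */
    if (system("fp_fuse /mnt/fp_shared") != 0) {
        fprintf(stderr, "failed to mount the FP shared space\n");
        return 1;
    }

    /* Launch the support services, then the application's main compute,
       both using the shared mount space for data exchange. */
    if (fork() == 0)
        execlp("fp_auxapp", "fp_auxapp", "--mount", "/mnt/fp_shared", (char *)NULL);
    if (fork() == 0)
        execlp("mpirun", "mpirun", "-np", "7", "./app", (char *)NULL);

    while (wait(NULL) > 0)  /* wait for the job and the aux-apps to finish */
        ;
    return 0;
}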

Aux-apps: Capturing support activities. Provide an API for writing code for support activities; describe actions to take when data is accessed. Example: a deduplication aux-app write hook:

int dedup_write(void *output_buffer, int size) {
    int result = SUCCESS;
    void *chunk;
    // process output in chunks
    while ((chunk = get_chunk(&output_buffer, size)) != NULL) {
        // compute hash on output_buffer chunks
        char *hash = sha1(chunk);
        // write the new chunk
        if (!hashtable_get(hash))
            result = data_write(chunk);
        // update de-dup hash-table
        hashtable_update(&result, chunk, hash);
    }
    return result;
}

Advantages: Decouple application design from support activity design; provide a flexible, reusable interface; support recycling of common activities across apps; reduce application development time.

Assigning cores to aux-apps. Per-activity partition: dedicate a core to each aux-app; intra-node: dedicated cores are co-located with the main app; inter-node: dedicated cores are on specialized nodes. Shared partition: multiple cores for multiple aux-apps; one service runs on multiple cores, or one core runs multiple services.
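
A sketch of how the two intra-node partitioning modes could be expressed with Linux CPU affinity, assuming an FP(x,y) allocation in which the last x of the node's y cores go to aux-apps; the function names are illustrative, not part of the FP runtime.

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Per-activity partition: dedicate core (y - 1 - idx) to aux-app number idx. */
static int pin_per_activity(pid_t aux_pid, int y, int idx)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(y - 1 - idx, &mask);
    return sched_setaffinity(aux_pid, sizeof(mask), &mask);
}

/* Shared partition: let all aux-apps float over the last x of the y cores. */
static int pin_shared(pid_t aux_pid, int x, int y)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = y - x; c < y; c++)
        CPU_SET(c, &mask);
    return sched_setaffinity(aux_pid, sizeof(mask), &mask);
}

Inter-node partitions would instead place the aux-app processes on specialized nodes rather than pinning them locally.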

Key FP runtime components for managing aux-apps. Benefactor: software that runs on each node; manages a node's contributions (SSD, memory, cores); serves as a basic management unit in the system; provides services and a communication layer between nodes; uses FUSE to provide a special transparent mount point. Manager: software that runs on a dedicated node; manages and coordinates benefactors; schedules aux-apps and orchestrates data transfers. Manager and benefactors are application-specific and utilize cores from the application's allotment.
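
As an illustration only (the fields are inferred from the bullets above, not taken from the FP implementation), a benefactor's periodic report to the manager might look like this:

#include <stdint.h>

/* Hypothetical per-node status that a benefactor reports to the manager. */
struct benefactor_status {
    uint32_t node_id;           /* which compute node / benefactor          */
    uint64_t ssd_free_bytes;    /* free space in the node-local SSD buffer  */
    uint64_t mem_free_bytes;    /* memory available for I/O speed matching  */
    uint8_t  cores_donated;     /* cores currently lent to aux-apps         */
    uint8_t  aux_apps_running;  /* aux-apps scheduled on this benefactor    */
};

The manager aggregates these reports to schedule aux-apps and orchestrate data transfers.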

Minimizing FP overhead. Minimize main memory consumption: use non-volatile memory, e.g. SSD, instead of DRAM. Minimize cache contention: schedule aux-apps based on socket boundaries. Minimize interconnect bandwidth consumption: coordinate the application and FP aux-apps; extend the ioctl call to the runtime to define blackout periods (sketched below).
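
A sketch of the blackout-period call from the application's side, assuming a FUSE mount at /mnt/fp_shared; the request code FP_IOC_BLACKOUT and the argument layout are illustrative assumptions, not the runtime's actual interface.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define FP_IOC_BLACKOUT 0x46500001  /* hypothetical request code */

/* Tell the FP runtime that the app will use the interconnect heavily for
   the next 'seconds' seconds, so aux-app transfers can be deferred. */
int fp_blackout(const char *mountpoint, unsigned int seconds)
{
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int ret = ioctl(fd, FP_IOC_BLACKOUT, &seconds);
    close(fd);
    return ret;
}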

Agenda: Motivation; Functional Partitioning (FP); FP Case Studies; Evaluation; Conclusion.

SSD-based checkpointing. FP can help compose a scalable service out of node-local checkpointing. Why SSD checkpointing: more efficient than memory checkpointing; does not compete with the app's main-memory demands; provides fault tolerance; costs less. How: aggregate the SSDs on multiple nodes into an aggregate buffer; provide faster transfer of checkpoint data to the parallel FS; utilize the dedicated core's memory for I/O speed matching.
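
From the application's point of view, checkpointing through FP is just a file write into the shared mount space; a minimal sketch, assuming the /mnt/fp_shared mount point and file naming used here for illustration.

#include <stdio.h>

/* Write one checkpoint into the FP mount; the checkpointing aux-app stages
   it on the node-local SSD and drains it to the parallel FS later. */
int write_checkpoint(const void *state, size_t bytes, int step)
{
    char path[256];
    snprintf(path, sizeof(path), "/mnt/fp_shared/ckpt_step%d.dat", step);

    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(state, 1, bytes, f);
    fclose(f);
    return written == bytes ? 0 : -1;
}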

SSD-based checkpointing aux-app. [Figure: compute nodes/benefactors run the application over a FUSE mount with a dedicated checkpointing core per node (FP(1,8) configuration set by the launch script); the node-local SSDs form an aggregate SSD store, a manager tracks metadata, and data is drained to the parallel file system.]

Deduplication of checkpoint data. FP cores can be used to perform compute-intensive deduplication, in situ, on the node. Why: reduce the data written and improve I/O throughput. How: identify similar data across checkpoints; if data is a duplicate, update only the metadata. Co-located with the SSD-checkpointing aux-app on the same core.

Deduplication aux-app. [Figure: same setup as the SSD-based checkpointing aux-app, with checkpointing and deduplication sharing the dedicated core on each node (FP(1,8) configuration); the aggregate SSD store, manager with metadata, and parallel file system are unchanged.]

Agenda: Motivation; Functional Partitioning (FP); FP Case Studies; Evaluation; Conclusion.

Evaluation objectives: How effective is functional partitioning? How efficient is the SSD-checkpointing aux-app (real-world and synthetic workloads)? How efficient is the deduplication aux-app (synthetic workload)?

Experimentation methodology. Testbed: 20 nodes, 160 cores, 8 GB memory/node, Linux 2.6.27.10. HDD model: WD3200AAJS SATA II, 320 GB, 85 MB/s. SSD model: Intel X25-E Extreme, sequential read 250 MB/s, sequential write 175 MB/s, capacity 32 GB. Workloads: FLASH, a real-world astrophysics simulation application; a synthetic benchmark, a checkpoint application that generates 250 MB/process every barrier step. Notation: FP(x,y) means x out of y cores on each node are dedicated to aux-apps.

Impact of SSD-checkpointing using a real-world workload. [Figure: run time (s) at 80 and 160 total cores for Local Disk non-FP(0,8), Local Disk non-FP(0,7), Aggregate Disk FP(1,8), and Aggregate SSD FP(1,8), with labeled improvements between 15% and 44%.] FP effectively improves application end-to-end performance.

SSD-checkpointing I/O throughput using a synthetic workload. [Figure: I/O throughput (MB/s) vs. number of compute cores for In Memory(20) FP(1,8) and SSD(20) FP(1,8).] SSD checkpointing provides sustained high throughput vs. an in-memory approach, without the memory overhead.

Varying the number of benefactors. [Figure: I/O throughput (MB/s) vs. number of benefactors (5-40) for In Memory FP(0-2,8).] A small number of benefactors can significantly improve performance.

Varying the number of compute cores. [Figure: I/O throughput (MB/s) vs. number of compute cores (10-150) for SSD FP(0-1,8) with N = 1, 2, 5, 10, and 20 benefactors.]

Efficiency of the deduplication aux-app using FP(2,8). [Figure: normalized I/O throughput vs. number of compute cores for Dedup(0.25), Dedup(0.50), Dedup(0.75), Dedup(0.90), and Non-dedup.] For compute-intensive tasks such as deduplication, assigning more cores to the aux-app improves end-to-end performance.

Impact on end-to-end performance. [Figure: execution time, checkpointing time, and total time (s) vs. number of compute cores (20-160) for In Memory and FP(1,8).] FP effectively improves application end-to-end performance.

Agenda: Motivation; Functional Partitioning (FP); Sample core services; Evaluation; Conclusion.

Conclusion. Created a novel FP runtime for many-core systems: transparent to applications; easy to use, flexible, and supports recyclable aux-apps. Implemented several sample FP support services: SSD-based checkpointing, deduplication, format transformation, adaptive checkpoint data draining. Showed that FP can improve end-to-end application performance by reducing support-activity time with minimal overhead to compute.

Future work: explore dynamic functional partitioning; implement more FP-based services; utilize FP for non-I/O-based activities. Contact information: Virginia Tech: Min Li, Ali R. Butt -- limin@cs.vt.edu, butta@cs.vt.edu; ORNL: Sudharshan S. Vazhkudai -- vazhkudaiss@ornl.gov; NCSU: Xiaosong Ma -- ma@cs.ncsu.edu

Backup slides

Adaptive checkpoint data draining. Why: data cannot be stored in the SSD buffer forever. How: lazily drain the data to the PFS every k checkpoints; periodically update the manager with free-space status; the manager uses this info to determine when to drain; dedicated cores can be used to facilitate the draining and support tasks (see the sketch below).
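
A sketch of the lazy-draining policy under stated assumptions: the staging and PFS directories, the every-k and free-space thresholds, and the shell-based copy are all illustrative, not the FP runtime's actual mechanism.

#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>

#define DRAIN_EVERY_K  4                 /* drain every k-th checkpoint          */
#define MIN_FREE_BYTES (4ULL << 30)      /* keep at least 4 GB of SSD space free */

/* Simplistic stand-in for the real draining path: copy staged checkpoints
   from the SSD buffer to the parallel file system, then free the buffer. */
static int drain_to_pfs(const char *ssd_dir, const char *pfs_dir)
{
    char cmd[512];
    snprintf(cmd, sizeof(cmd), "cp %s/ckpt_* %s/ && rm -f %s/ckpt_*",
             ssd_dir, pfs_dir, ssd_dir);
    return system(cmd);
}

/* Called after each checkpoint; benefactors also report free_bytes to the
   manager, which uses it to decide when draining should be triggered. */
int maybe_drain(int ckpt_count, const char *ssd_dir, const char *pfs_dir)
{
    struct statvfs vfs;
    if (statvfs(ssd_dir, &vfs) != 0)
        return -1;
    unsigned long long free_bytes =
        (unsigned long long)vfs.f_bavail * vfs.f_frsize;

    if (ckpt_count % DRAIN_EVERY_K == 0 || free_bytes < MIN_FREE_BYTES)
        return drain_to_pfs(ssd_dir, pfs_dir);
    return 0;
}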

Adaptive checkpoint data draining. [Figure: compute nodes/benefactors run the application over a FUSE mount with a dedicated checkpointing core and a draining core per node; the aggregate SSD store is drained to the parallel file system, coordinated by the manager via metadata.]

Deduplication aux-app. [Figure: compute nodes/benefactors run the application over a FUSE mount with a checkpointing/deduplication core and a draining core per node; the aggregate SSD store, manager with metadata, and parallel file system complete the data path.]

Efficiency of the deduplication aux-app. [Figure: I/O throughput (MB/s) vs. number of compute cores (10-150) for Non-dedup, Dedup(0.90), Dedup(0.75), Dedup(0.50), Dedup(0.25), and Dedup(0.10).] Using a core to support a deduplication aux-app improves I/O throughput and, in turn, end-to-end application performance.