Accelerating Parallel Analysis of Scientific Simulation Data via Zazen


Accelerating Parallel Analysis of Scientific Simulation Data via Zazen. Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw. D. E. Shaw Research.

Motivation. Goal: to model biological processes that occur on the millisecond time scale. Approach: a specialized, massively parallel supercomputer called Anton (2009 ACM Gordon Bell Award for Special Achievement).

Millisecond-scale MD Trajectories. A biomolecular system: 25 K atoms. Position and velocity: 24 bytes/atom. Frame size: 0.6 MB/frame. Simulation length: 1 x 10^-3 s. Output interval: 10 x 10^-12 s. Number of frames: 100 M frames.
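As a sanity check of the parameters above, the short Python calculation below (an illustrative sketch, not from the slides) reproduces the frame size and frame count, and also gives the total trajectory size quoted later.

```python
# Back-of-the-envelope check of the trajectory parameters quoted above.
atoms = 25_000                      # atoms in the biomolecular system
bytes_per_atom = 24                 # position + velocity per atom
frame_size = atoms * bytes_per_atom # bytes per frame (~0.6 MB)

sim_length = 1e-3                   # simulated time: 1 millisecond
output_interval = 10e-12            # one frame every 10 picoseconds
num_frames = round(sim_length / output_interval)

print(f"frame size : {frame_size / 1e6:.1f} MB")               # 0.6 MB
print(f"frames     : {num_frames / 1e6:.0f} M")                # 100 M
print(f"trajectory : {frame_size * num_frames / 1e12:.0f} TB") # 60 TB
```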

Part I: How We Analyze Simulation Data in Parallel

An MD Trajectory Analysis Example: Ion Permeation

A Hypothetical Trajectory. 20,000 atoms in total; two ions of interest. [Plot: positions of Ion A and Ion B over simulated time.]

Ion State Transition. [State diagram: three states, S_above (above channel), S_inside (inside channel), and S_below (below channel), with transitions "into channel from below" and "into channel from above".]

Typical Sequential Analysis. Maintain a main-memory resident data structure to record states and positions. Process frames in ascending simulated physical time order. Strong inter-frame data dependence: data analysis is tightly coupled with data acquisition.
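To make the tight coupling concrete, here is a minimal, self-contained Python sketch of such a sequential pass (the toy frames and the z-threshold state rule are illustrative assumptions, not the slides' code); the per-ion state carried between iterations is what forces frames to be visited strictly in time order.

```python
# Sequential analysis: a single pass in ascending simulated-time order.
# The toy frames and the z-threshold classification are illustrative only.
frames = [                                  # (time, {ion_id: z-coordinate})
    (0, {"A": 2.0, "B": -2.0}),
    (1, {"A": 0.5, "B": -1.8}),
    (2, {"A": -1.6, "B": -1.9}),            # ion A leaves the channel downward
]

def classify(z):
    return "inside" if abs(z) < 1.0 else ("above" if z > 0 else "below")

ion_state = {}                              # main-memory structure: ion_id -> state
permeation_events = []

for t, positions in sorted(frames):         # strictly increasing physical time
    for ion_id, z in positions.items():     # data acquisition...
        new_state = classify(z)
        old_state = ion_state.get(ion_id, new_state)
        if (old_state, new_state) == ("inside", "below"):
            permeation_events.append((ion_id, t))   # ...tightly coupled with analysis
        ion_state[ion_id] = new_state

print(permeation_events)                    # [('A', 2)]
```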

Problems with Sequential Analysis. Millisecond-scale trajectory size: 60 TB. Local disk read bandwidth: 100 MB/s. Time to fetch data to memory: 1 week. Analysis time: varied. Time to perform data analysis: weeks. Sequential analysis lacks the computational, memory, and I/O capabilities required!
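The one-week figure follows directly from the numbers above; a quick check:

```python
# Time to stream a 60 TB trajectory through one 100 MB/s local disk.
trajectory_bytes = 60e12
disk_bandwidth = 100e6                      # bytes per second
seconds = trajectory_bytes / disk_bandwidth
print(f"{seconds:,.0f} s ~ {seconds / 86400:.1f} days")   # 600,000 s, about a week
```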

A Parallel Data Analysis Model. Specify which frames are to be accessed (the trajectory definition) and decouple data acquisition from data analysis: Stage 1 is per-frame data acquisition; Stage 2 is cross-frame data analysis.

Trajectory Definition. Every other frame in the trajectory. [Plot: the Ion A and Ion B traces with every other frame selected.]

Per-frame Data Acquisition (stage 1). [Plot: the selected frames are divided between processes P0 and P1 for acquisition.]

Cross-frame Data Analysis (stage 2). Analyze Ion A on P0 and Ion B on P1 in parallel. [Plot: the same traces, now partitioned by ion rather than by frame.]

Inspiration: Google's MapReduce. [Diagram: input files stored in the Google File System are processed by map(...) tasks, each emitting key-value pairs such as K1: {v1_i} and K2: {v2_i}; values are grouped by key, e.g. K1: {v1_i, v1_j, v1_k}, and passed to reduce(K1, ...) and reduce(K2, ...), which write the output files.]

Trajectory Analysis Cast into MapReduce. Per-frame data acquisition (stage 1): map(). Cross-frame data analysis (stage 2): reduce(). Key-value pairs connect stage 1 and stage 2: keys are categorical identifiers or names; values include timestamps. Example: key = ion_id_j, value = (t_k, x_jk, y_jk, z_jk).
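A minimal sketch, in plain Python rather than any particular framework, of how the ion-permeation example fits this key-value scheme; the frame structure, the emit format, and the z-threshold rule are illustrative assumptions rather than the slides' code.

```python
# Stage 1 (map): per-frame acquisition, emitted as (key, value) pairs.
def map_frame(frame):
    for ion_id, (x, y, z) in frame["ion_positions"].items():
        yield ion_id, (frame["time"], x, y, z)   # key = ion id, value carries the timestamp

# Stage 2 (reduce): cross-frame analysis for one ion, independent of all other ions.
def reduce_ion(ion_id, values):
    events, prev = 0, None
    for t, x, y, z in sorted(values):            # restore time order inside the reducer
        state = "inside" if abs(z) < 1.0 else ("above" if z > 0 else "below")  # toy rule
        if prev == "inside" and state == "below":
            events += 1                          # one permeation event
        prev = state
    return ion_id, events
```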

The HiMach Library. A MapReduce-style API that allows users to write Python programs to analyze MD trajectories, and a parallel runtime that executes HiMach user programs in parallel on a Linux cluster automatically. Performance results on a Linux cluster: two orders of magnitude faster on 512 cores than on a single core.

Typical Simulation Analysis Storage Infrastructure. [Diagram: the parallel supercomputer writes trajectory data through an I/O node to file servers; an analysis cluster of analysis nodes, each with local disks, runs the parallel analysis programs against those file servers.]

Part II: How We Overcome the I/O Bottleneck in Parallel Analysis

Trajectory Characteristics. A large number of small frames; write once, read many; distinguishable by unique integer sequence numbers; amenable to out-of-order parallel access in the map phase.

Our Main Idea. At simulation time, actively cache frames in the local disks of the analysis nodes as the frames become available. At analysis time, fetch data from the local disk caches in parallel.
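A minimal sketch of the caching side of this idea (the cache directory name loosely mirrors the /bodhi/sim0 layout in the example that follows, but the code and names are illustrative assumptions): each frame arriving from the simulation is written to the node's local disk and its sequence number is marked in a per-node bitmap, which is what the protocol described later exchanges.

```python
import os

# Cache simulation frames on an analysis node's local disk as they arrive.
CACHE_DIR = "/bodhi/sim0"                  # illustrative local cache directory
TOTAL_FRAMES = 8
local_bitmap = [0] * TOTAL_FRAMES          # bit i set => frame i is cached on this node

def cache_frame(seq: int, data: bytes) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(os.path.join(CACHE_DIR, f"f{seq}"), "wb") as f:
        f.write(data)
    local_bitmap[seq] = 1                  # record that this node now holds frame seq
```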

Limitations. Requires large aggregate disk capacity on the analysis cluster. Assumes a relatively low average simulation data output rate.

An Example. [Diagram: analysis node 0 caches frames f0 and f2 (sequence numbers 0 and 2) under /bodhi/sim0 and /bodhi/sim1, while analysis node 1 caches frames f1 and f3 (sequence numbers 1 and 3); each node records what it holds in a local bitmap, obtains the other node's holdings as a remote bitmap, and combines them into a merged bitmap; an NFS server holds the complete set f0-f3 under /sim0 and /sim1.]

How do we guarantee that each frame is read by one and only one node in the face of node failure and recovery? The Zazen protocol.

The Zazen Protocol. Execute a distributed consensus protocol before performing actual disk I/O. Assign data retrieval tasks in a location-aware manner: read data from local disks if the data are already cached, and fetch missing data from the file servers. There are no metadata servers keeping a record of who has what.

The Zazen Protocol (cont'd). Bitmaps: a compact structure for recording the presence or absence of a cached copy. All-to-all reduction algorithms: an efficient mechanism for inter-processor collective communication (an MPI library is used in practice).
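A minimal sketch of this exchange using mpi4py; the slides only say that an MPI library is used, so the library choice and the assignment rule below (the lowest-ranked node holding a cached copy reads the frame; uncached frames are divided round-robin and fetched from the file servers) are illustrative assumptions, not necessarily Zazen's actual policy.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

TOTAL_FRAMES = 8
# Toy local cache: pretend frames were striped across nodes at simulation time.
local_bitmap = [1 if seq % size == rank else 0 for seq in range(TOTAL_FRAMES)]

# All-to-all exchange: afterwards, every node knows who caches which frame.
all_bitmaps = comm.allgather(local_bitmap)

local_reads, server_fetches = [], []
for seq in range(TOTAL_FRAMES):
    holders = [r for r in range(size) if all_bitmaps[r][seq]]
    if holders:
        if holders[0] == rank:             # illustrative rule: lowest-ranked holder reads it
            local_reads.append(seq)
    elif seq % size == rank:               # frames cached nowhere: round-robin to file servers
        server_fetches.append(seq)

print(f"rank {rank}: local reads {local_reads}, file-server fetches {server_fetches}")
```

Run with, for example, mpirun -n 4 python zazen_bitmaps.py (an example filename); no metadata server is consulted, because every node derives the same assignment from the exchanged bitmaps.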

Implementation. [Diagram: on the Zazen cluster, each analysis node runs parallel analysis programs (HiMach jobs) on top of the Bodhi library, a Bodhi server, and the Zazen protocol; the I/O node of the parallel supercomputer, also running the Bodhi library, delivers simulation output to the file servers and to the analysis-node caches.]

Performance Evaluation

Experiment Setup. A Linux cluster with 100 nodes: two Intel Xeon 2.33 GHz quad-core processors per node; four 500 GB 7200-RPM SATA disks organized in RAID 0 per node; 16 GB physical memory per node; CentOS 4.6 with Linux kernel 2.6.26; nodes connected to a Gigabit Ethernet core switch; common access to NFS directories exported by a number of enterprise storage servers.

Fixed-Problem-Size Scalability. [Plot: execution time (s) of the Zazen protocol to assign the I/O tasks of reading 1 billion frames, for 1 to 128 nodes.]

Fixed-Cluster-Size Scalability. [Plot: execution time (s, log scale) of the Zazen protocol on 100 nodes when assigning between 10^3 and 10^9 frames.]

Efficiency I: Achieving Better I/O Bandwidth. [Two plots: aggregate read bandwidth (GB/s) versus application read processes per node (1 to 8), for 2 MB, 64 MB, 256 MB, and 1 GB files, comparing one Bodhi daemon per user process with one Bodhi daemon per analysis node.]

Efficiency II: Comparison with NFS/PFS. NFS (v3) on separate enterprise storage servers: dual quad-core 2.8 GHz Opteron processors, 16 GB memory, 48 SATA disks organized in RAID 6, and four 1 GigE connections to the core switch of the 100-node cluster. PVFS2 (2.8.1) on the same 100 analysis nodes: I/O (data) server and metadata server on all nodes; file I/O performed via the PVFS2 Linux kernel interface. Hadoop/HDFS (0.19.1) on the same 100 nodes: data stored via HDFS's C library interface, block sizes set equal to file sizes, three replicas per file; data accessed via a read-only Hadoop MapReduce Java program (with a number of best-effort optimizations).

Efficiency II: Outperforming NFS/PFS. [Plot: I/O bandwidth (GB/s) when reading files of 2 MB, 64 MB, 256 MB, and 1 GB, comparing NFS, PVFS2, Hadoop/HDFS, and Zazen.]

Efficiency II: Outperforming NFS/PFS. [Plot: time (s, log scale) to read one terabyte of data, for file sizes of 2 MB, 64 MB, 256 MB, and 1 GB, comparing NFS, PVFS2, Hadoop/HDFS, and Zazen.]

Read Performance under Writes (1 GB/s). [Plot: normalized read performance (%) versus file size for reads (2 MB, 64 MB, 256 MB, 1 GB), for concurrent write workloads of no writes and of 1 GB, 256 MB, 64 MB, and 2 MB files.]

End-to-End Performance. A HiMach analysis program called water residence on 100 nodes; 2.5 million small frame files (430 KB each). [Plot: time (s, log scale) versus application processes per node (1 to 8), comparing NFS, Zazen, and Memory.]

Robustness. The worst-case execution time is T(1 + δ(B/b)). The water-residence program was re-executed with varying numbers of nodes powered off. [Plot: running time (s) versus node failure rate (0% to 50%), comparing the theoretical worst case with the actual running time.]

Summary. Zazen accelerates order-independent, parallel data access by (1) actively caching simulation output and (2) executing an efficient distributed consensus protocol. It is simple and robust, scales to a large number of nodes, and delivers much higher performance than NFS/PFS. It is applicable to a certain class of time-dependent simulation datasets.