High Speed Asynchronous Data Transfers on the Cray XT3

High Speed Asynchronous Data Transfers on the Cray XT3
Ciprian Docan, Manish Parashar, and Scott Klasky
The Applied Software Systems Laboratory, Rutgers, The State University of New Jersey
CUG 2007, Seattle, WA, May 2007

Outline
- Data Streaming Requirements in the Fusion Simulation Project
- Support for Code Coupling & Asynchronous Data Movement
- DART Overview
- DART Architecture and Description
- DART Evaluation
- Conclusions and Future Work

Data Streaming Requirements in the Fusion Simulation Project (FSP)
[Figure: end-to-end FSP data path. GTC runs on Teraflop/Petaflop supercomputers and streams data over a 40 Gbps link to an end-to-end system with monitoring routines, feeding post-processing, large data analysis, data replication, data archiving, user monitoring, and visualization.]
Data streaming requirements:
- Enable high-throughput, low-latency data transfer to support near-real-time access to the data
- Minimize the related overhead on the executing simulation
- Adapt to network conditions to maintain the desired QoS
- Handle network failures while eliminating data loss

Support for Code Coupling & Asynchronous Data Movement
- Seine: dynamic, semantically specialized shared spaces for code coupling
  - High-level (shared-space) programming abstractions with an efficient and scalable runtime
- DART: lightweight substrate for asynchronous, low-overhead/high-throughput data IO on petascale systems
  - Based on Portals and RDMA
- ADAPT: middleware for autonomic wide-area data streaming and in-transit data processing
  - Enable high-throughput, low-latency data transfer to support near-real-time access to the data
  - Effectively outsource processing to in-transit processing nodes
  - Minimize the related overhead on the simulation
  - Adapt to network conditions to maintain the desired QoS

DART: A Substrate for Asynchronous IO
Objectives:
- Alleviate the impact of IO on scientific simulations
- Minimize the total IO time on compute nodes
- Maximize data throughput from compute nodes
- Minimize the overhead on data transfers (packet header size)
- Minimize the IO computational overhead on compute nodes (data transfer logic, data buffering)
Key idea: overlap data transfers with computations.
DART Client (runs on compute nodes):
- Lightweight and simple
- Maintains buffers for data transfers
- Notifies the server when data is available
DART Server (runs on service/IO nodes):
- Contains the transfer logic and buffers
- Pulls data from compute nodes on notification
- Performs the requested IO (e.g., file system, socket)
Data transfers are asynchronous and decoupled.
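
The decoupled notify/pull interaction can be modeled in a few lines. The sketch below is only an illustration of the pattern: a pthread stands in for the service node, a memcpy stands in for the Portals RDMA get, and names such as dart_put are hypothetical, not the actual DART API.

    /* Minimal runnable model of DART's decoupled notify/pull pattern.
     * A pthread "server" replaces the real service node and memcpy
     * replaces the Portals RDMA get; all names are illustrative. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE 1024

    static char   client_buf[BUF_SIZE];   /* buffer exposed by the compute node */
    static size_t ready_len = 0;          /* >0 means "data available" notification */
    static int    done = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* Server thread: waits for a notification, "pulls" the data, does the IO. */
    static void *server_loop(void *arg)
    {
        (void)arg;
        char pulled[BUF_SIZE];
        for (;;) {
            pthread_mutex_lock(&lock);
            while (ready_len == 0 && !done)
                pthread_cond_wait(&cond, &lock);
            if (done && ready_len == 0) { pthread_mutex_unlock(&lock); break; }
            size_t len = ready_len;
            memcpy(pulled, client_buf, len);   /* stands in for the RDMA get */
            ready_len = 0;                     /* client buffer is free again */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
            printf("server: wrote %zu bytes to storage\n", len); /* file/TCP IO */
        }
        return NULL;
    }

    /* Client-side send: returns as soon as the notification is posted,
     * blocking only if the previous buffer is still in flight. */
    static void dart_put(const char *data, size_t len)
    {
        pthread_mutex_lock(&lock);
        while (ready_len != 0)
            pthread_cond_wait(&cond, &lock);   /* buffer busy: wait for the pull */
        memcpy(client_buf, data, len);
        ready_len = len;
        pthread_cond_signal(&cond);            /* "data is available" */
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        pthread_t srv;
        pthread_create(&srv, NULL, server_loop, NULL);
        for (int step = 0; step < 3; step++) {
            /* ... compute phase of the simulation time step ... */
            char restart[64];
            snprintf(restart, sizeof restart, "restart data for step %d", step);
            dart_put(restart, strlen(restart) + 1);  /* returns quickly; IO overlaps */
        }
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        pthread_join(srv, NULL);
        return 0;
    }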

DART: Architecture
- Compute nodes and service nodes communicate using the Portals library
- Service nodes and receivers on a remote cluster (i.e., Ewok) communicate using TCP sockets

The Portals Library
- RDMA implementation with OS and application bypass
- Put operation: the initiator pushes the content of its memory descriptor (MD) into the target's address space; events are recorded
- Get operation: the initiator fetches the content of the target's MD into its own address space; events are recorded

DART: Server Operation
[Figure: DART service node internals. Buffers (header + data) cycle through three queues: empty, pending, and full. Client notifications arrive on a Portals event queue; a buffer is taken from the empty queue and a Portals get pulls the data into it (pending); once the transfer completes the buffer moves to the full queue, from which it is drained over TCP to the remote cluster (Ewok) and then returned to the empty queue.]
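
The three-queue buffer cycle can be sketched as below. portals_get and tcp_send_to_remote are illustrative stand-ins for the real Portals read and TCP transfer; in the actual server these stages run asynchronously and concurrently, which is exactly what the pending queue decouples.

    /* Illustrative model of the server's buffer pipeline: empty -> pending -> full. */
    #include <stddef.h>
    #include <stdio.h>

    #define NBUF 4

    typedef enum { EMPTY, PENDING, FULL } buf_state;

    typedef struct {
        int       id;
        buf_state state;
        size_t    len;
    } buffer;

    static void portals_get(buffer *b, int client, size_t len)
    {
        b->len = len;   /* stand-in for the asynchronous RDMA pull from the client */
        printf("pull %zu bytes from client %d into buffer %d\n", len, client, b->id);
    }

    static void tcp_send_to_remote(buffer *b)
    {
        printf("send buffer %d (%zu bytes) over TCP to the remote cluster\n",
               b->id, b->len);
    }

    int main(void)
    {
        buffer bufs[NBUF];
        for (int i = 0; i < NBUF; i++)
            bufs[i] = (buffer){ .id = i, .state = EMPTY };

        /* One notification per client: walk a buffer through the pipeline. */
        for (int client = 0; client < 3; client++) {
            buffer *b = NULL;
            for (int i = 0; i < NBUF; i++)        /* take a buffer off the empty queue */
                if (bufs[i].state == EMPTY) { b = &bufs[i]; break; }
            if (!b) continue;                     /* no free buffer: notification waits */

            b->state = PENDING;                   /* RDMA get in progress */
            portals_get(b, client, 1 << 20);

            b->state = FULL;                      /* transfer completed */
            tcp_send_to_remote(b);

            b->state = EMPTY;                     /* drained: back on the empty queue */
        }
        return 0;
    }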

DART: Time Sequence at the Client
- DART asynchronous IO calls at the client
- Send calls return before the data transfer completes
- Data transfers overlap with computations
- IO calls can block if the buffer is full or busy
[Figure: timeline of a simulation time step showing the compute phase, the send call, the background data transfer, and the resulting IO time.]
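
One way to quantify the IO time the client actually sees is to time only the blocking portion of each send call relative to the whole time step. The sketch below illustrates the measurement; dart_put and do_compute are placeholders, not the actual DART or GTC code.

    /* Sketch of measuring per-step IO overhead on a compute node. */
    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static void do_compute(void) { /* simulation work for one time step */ }
    static void dart_put(void)    { /* returns once the notification is posted */ }

    int main(void)
    {
        double io_time = 0.0, total_time = 0.0;
        for (int step = 0; step < 100; step++) {
            double t0 = now();
            do_compute();
            double t1 = now();
            dart_put();               /* blocks only if the transfer buffer is busy */
            double t2 = now();
            io_time    += t2 - t1;
            total_time += t2 - t0;
        }
        printf("IO overhead: %.1f%%\n", 100.0 * io_time / total_time);
        return 0;
    }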

DART: Time Sequence at the Server
Asynchronous data transfers:
- t_eb: time spent in the empty-buffer queue
- t_pb: time spent in the pending-buffer queue
- t_fb: time spent in the full-buffer queue
- t: total time at the server, i.e., t_eb, t_pb, and t_fb plus the scheduling delays between the queues

DART Evaluation: Achieved Transfer Rates
- Maximum transfer rate achieved between compute nodes (CN) and a service node (SN) on Jaguar
- One DART server and two DART clients
- Message sizes varied from 1 MB to 100 MB in 4 MB increments
- Data is dropped at the service node

DART Evaluation: Influence of Compute Time on IO Overhead
- Stream time on the service node: data sends are sequential, ~3.4 s cumulative over 100 iterations
- Compute time of 1 microsecond: no overlap, IO operations are effectively sequential
- Optimum compute time: #CN x stream time, i.e., each client should compute for roughly as long as the server needs to service the sequential sends from all compute nodes for one iteration

DART Evaluation: Influence of Compute Time on IO Overhead
- Stream time on the service node: data sends are sequential, ~3.4 s cumulative over 100 iterations
- Compute time of 4.3 s: maximum overlap, IO operations proceed in parallel with computation
- Compute time and IO time must be balanced to achieve maximum overlap

DART Evaluation: Tests Using the GTC Application
- Experiment using the GTC fusion application
- 128 compute nodes served by 1 service node
- A ~8.9 MB restart file per compute node, written every ~4.5 s
- 100 simulation steps
- DART write time: ~0.12 s
- Effective transfer rate: ~593 Mbps per compute node
- IO overhead on the compute node: 2.7%
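
As a consistency check on these figures: 8.9 MB x 8 bits/byte is about 71.2 Mb, which transferred in ~0.12 s gives roughly 593 Mbps per node; and 0.12 s of write time out of a ~4.5 s simulation step is about 2.7% overhead, matching the reported numbers.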

DART Evaluation: Tests Using the GTC Application

Conclusions and Future Work
- High-throughput, low-latency data streaming is critical for emerging scientific applications
- DART supports low-overhead asynchronous IO on the Cray XT3
- DART performance:
  - Maximum achievable transfer rates using Portals
  - Two clients and one server
  - Overhead of IO operations on the GTC application (restart files of ~8.9 MB/node)
- Future work:
  - Extend DART to support large message sizes
  - Integrate with GTC to enable ~160 MB/node restart files
  - Enable N x M x P coupling on top of DART

Questions? Thank you!