Enabling high-speed asynchronous data extraction and transfer using DART


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2010). Published online in Wiley InterScience.

Enabling high-speed asynchronous data extraction and transfer using DART

Ciprian Docan 1, Manish Parashar 1 and Scott Klasky 2

1 Center for Autonomic Computing/TASSL Laboratory, Rutgers University, Piscataway, NJ 08854, U.S.A. 2 Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831, U.S.A.

SUMMARY

As the complexity and scale of applications grow, managing and transporting the large amounts of data they generate are quickly becoming a significant challenge. Moreover, the interactive and real-time nature of emerging applications, as well as their increasing runtime, make online data extraction and analysis a key requirement in addition to traditional data I/O and archiving. To be effective, online data extraction and transfer should impose minimal additional synchronization requirements, should have minimal impact on the computational performance and communication latencies, maintain overall quality of service, and ensure that no data is lost. In this paper we present Decoupled and Asynchronous Remote Transfers (DART), an efficient data transfer substrate that effectively addresses these requirements. DART is a thin software layer built on RDMA technology to enable fast, low-overhead, and asynchronous access to data from a running simulation, and supports high-throughput, low-latency data transfers. DART has been integrated with applications simulating fusion plasma in a Tokamak, being developed at the Center for Plasma Edge Simulation (CPES), a DoE Office of Fusion Energy Science (OFES) Fusion Simulation Project (FSP). A performance evaluation using the Gyrokinetic Toroidal Code and XGC-1 particle-in-cell-based FSP simulations running on the Cray XT3/XT4 system at Oak Ridge National Laboratory demonstrates how DART can effectively and efficiently offload simulation data to local service and remote analysis nodes, with minimal overheads on the simulation itself. Copyright 2010 John Wiley & Sons, Ltd.

Received 18 January 2009; Revised 23 December 2009; Accepted 25 December 2009

KEY WORDS: asynchronous transfers; RDMA; low-overhead

Correspondence to: Ciprian Docan, Center for Autonomic Computing/TASSL Laboratory, Rutgers University, Piscataway, NJ 08854, U.S.A. E-mail: docan@cac.rutgers.edu

Contract/grant sponsor: National Science Foundation; contract/grant numbers: IIP, CCF-83339, DMS, CNS, IIS 43826, CNS. Contract/grant sponsor: Department of Energy; contract/grant number: DE-FG2-6ER54857. Contract/grant sponsor: The Extreme Scale Systems Center. Contract/grant sponsor: Department of Defense. Contract/grant sponsor: IBM Faculty Award.

Copyright 2010 John Wiley & Sons, Ltd.

1. INTRODUCTION

Computing systems, both academic and enterprise, are rapidly growing in scale and complexity, and are enabling larger and more complex applications. For example, high-performance computing (HPC) is playing an important role in science, engineering, and finance, and is enabling highly accurate simulations of complex phenomena. As these systems and applications grow, efficiently managing and transporting the large amounts of associated data, while still maintaining the desired computational efficiencies, is becoming an important and immediate challenge. Furthermore, emerging applications are based on seamless interactions and couplings across multiple and potentially distributed computational, data, and information services. For example, current fusion simulation efforts are exploring coupled models and codes that simultaneously simulate separate application processes and run on different HPC resources at supercomputing centers. These components need to interact at runtime with each other, and with services for online data monitoring, analysis, or archiving. Similar data management and transport requirements exist in applications that require, for example, migration of virtual machines or replication of databases for reliability or quality of service. These applications thus require a scalable and robust substrate to manage the large amounts of data they generate and to asynchronously extract and transport the data between interacting components.

For fusion simulations, large volumes and heterogeneous types of data have to be continuously streamed from the compute partition of a petascale machine to its service partition, and from there to compute systems that run coupled simulation components, and to auxiliary data analysis and storage systems. A key challenge that must be addressed by such a substrate in this case is getting the large amounts of data being generated by these applications off the compute nodes at runtime, and over to service nodes or another system for code coupling, online monitoring, analysis, or archiving. To be effective, such an online data extraction and transfer service must (1) have minimal impact on the execution of the application in terms of performance overhead, increased latencies, or synchronization requirements, (2) satisfy stringent application/user space, time, and quality of service constraints, and (3) ensure that no data is lost. On many HPC resources, the large number of compute nodes is typically serviced by a smaller number of service nodes to which expensive I/O operations can be offloaded. As a result, the I/O substrate should be able to asynchronously transfer data from compute nodes to a service node with minimal delay and overhead on the simulation. Technologies such as RDMA allow fast memory access into the address space of an application without interrupting the computational process, and provide a mechanism that can support these requirements.

In this paper, we present Decoupled and Asynchronous Remote Transfers (DART), an efficient data transfer substrate that effectively addresses the requirements described above. DART is a thin software layer built on RDMA technology to enable fast, low-overhead, and asynchronous access to data from a running simulation, and to support high-throughput, low-latency data transfers. The design and prototype implementation of DART using the Portals RDMA library [1] on the Cray XT3/XT4 at Oak Ridge National Laboratory is also presented.
The key contribution of DART is in providing application-level mechanisms for asynchronous, high-throughput data transport for large-scale applications that can effectively move data from a running application to a (smaller) set of service nodes with minimal overhead on the application.

Once the data is at the service nodes, expensive I/O operations can be off-loaded to these nodes without impacting the application performance. Furthermore, the data can also be served (possibly after manipulations) to a variety of services such as coupling, storage and archival, streaming, monitoring, visualization, indexing, etc. DART has been integrated with applications simulating fusion plasma in a Tokamak, being developed at the Center for Plasma Edge Simulation (CPES), a DoE Office of Fusion Energy Science (OFES) Fusion Simulation Project (FSP), and is a key component of a high-throughput data streaming substrate that uses metadata-rich outputs to support in-transit data processing and data redistribution for coupled simulations. A performance evaluation using the Gyrokinetic Toroidal Code (GTC) [2,3] and XGC-1 [4] particle-in-cell-based FSP simulations is presented. The evaluation demonstrates that DART can effectively use RDMA technologies to off-load expensive I/O operations to service nodes with very small overheads on the simulation itself, allowing a more efficient utilization of the compute elements, and enabling efficient online data monitoring and analysis on remote clusters.

The remainder of this paper is organized as follows. Section 2 describes the architecture of DART and its key components. Section 3 describes the implementation on the Cray system and its operation. Section 4 presents the evaluation of DART. Section 5 presents the related work, and Section 6 concludes the paper and outlines future research directions.

2. DART ARCHITECTURE AND OPERATION

The primary goal of DART is to efficiently manage and transfer large amounts of data from applications running on the compute nodes of an HPC system to local storage or remote locations, in order to enable remote application monitoring, data analysis, code coupling, and data archiving. The key requirements that DART tries to satisfy include minimizing data transfer overheads on the running application, achieving high-throughput, low-latency data transfers, and preventing data losses. In order to meet these requirements, we designed DART to enable dedicated nodes, i.e. nodes separate from the application compute nodes, to schedule and extract data directly from the memory of the compute nodes using asynchronous transfers built on RDMA communication primitives. Thus, we off-load expensive data I/O and streaming operations from applications running on compute nodes to dedicated nodes, and allow applications to progress while data is being transferred. DART is able to provide asynchronous communication abstractions for environments that support only a single thread of execution, i.e. operating systems tuned for scientific computing that do not support threads. The asynchronous abstraction provided is transparent to applications, and DART ensures transport coherency and resource release.

The DART architecture contains three key components, as shown in Figure 1: (1) a thin client layer (DARTClient), which integrates with applications and runs on the compute nodes of an HPC system, (2) a streaming server (DARTSServer), which runs on dedicated nodes independently of the application executed on the compute nodes, and is responsible for data scheduling, extraction, and transport, and (3) a receiver (DARTReceiver), running on remote nodes, which receives and processes data streamed by the DARTSServer.

Figure 1. Architectural overview of DART.

Note that DART clients and servers have layered architectures with a base communication layer. This communication layer can be implemented using different RDMA systems such as, for example, Portals, InfiniBand, DCMF, iWARP, etc., allowing DART to be portable across systems.

2.1. Asynchronous communication using RDMA

The RDMA communication paradigm provides a window into the memory address space of a process. It supports process-to-process communication models with zero-copy, and OS and application bypass. The zero-copy transfer model is efficient because it can transfer data directly from the application memory buffers to a remote destination, thus avoiding an extra copy of the data into the kernel buffers. The OS-bypass communication model can execute data transfers without involving the CPU (e.g. DMA transfers). The application-bypass communication model can perform a data transfer while an application is running, and without interrupting the application. Thus, RDMA communication models are well suited to implement an asynchronous transfer protocol that has a very small impact on a running application.

To provide the asynchronous communication abstractions to applications, we used the Portals [1] library, which implements the RDMA communication paradigm, and provides data transfer primitives for direct memory-to-memory communication. The Portals library supports the RDMA pull communication model (the get operation), which can extract messages from a remote node's memory, and the RDMA push model (the put operation), which can inject messages into a remote node's memory. To support memory-to-memory data transfers between processes, the Portals library maps blocks of an application's memory (memory buffers) to the network card, and thus makes them accessible to processes running on remote nodes. To support this mapping, it uses memory descriptors, which are Portals data types containing internal bookkeeping information such as the start address and the length of the memory buffers, current offsets, etc.

The asynchronous communication abstraction allows an application to start a data transfer and continue its execution without waiting for the transfer completion. With DART, this process is transparent to an application, as DART uses the application-bypass features for data transfers, and a higher-level data signaling and transfer completion notification mechanism that leverages the Portals library software events. The Portals library can generate events for each RDMA transfer operation to indicate the start or completion of a transfer. Each RDMA event is associated with a specific memory descriptor, and Portals uses an internal event queue for each memory descriptor to log the event.

At a higher layer, DART enables and uses these events to ensure the coherency of the asynchronous protocol and transfer completion.

2.2. DART client layer

DARTClient is a lightweight software layer that integrates with an application and runs on the compute nodes of an HPC resource. It works in conjunction with a streaming server to export the asynchronous communication abstraction at the application level, and to provide a transparent and coherent transfer protocol. It is responsible for two key functions: (1) control and coordination and (2) data transfers. The control and coordination function registers a compute node with a streaming server at application start-up, and posts request-for-transfer notifications to the streaming server at runtime. During the registration phase, DARTClient and the streaming server exchange identification and communication parameters, such as unique numerical instance identifiers, communication offsets, and memory descriptor identifiers. The data transfer function establishes the communication parameters, polls for the completion of previous transfers, and releases resources. To post a transfer request, DARTClient sets up the parameters and priorities for the transfer, and notifies the streaming server that it has data ready to be sent. It is then the responsibility of the streaming server, based on the transfer parameters, to schedule and execute the transfer of the data, and to notify DARTClient on the completion of the operation.

Following the RDMA communication paradigm, DARTClient pre-allocates the communication buffers and exposes two classes of memory descriptors to communicate with the streaming server. The first class contains application-level memory descriptors that DARTClient creates during the registration phase, and preserves throughout the lifetime of the application. DARTClient defines two types of application-level memory descriptors: the first is used to post request-for-transfer notifications to the streaming server, and the other is used to retrieve transfer completion acknowledgments from the streaming server. The second class contains stream-level memory descriptors that DARTClient creates for each data stream the application generates. A data stream is a continuous flux of bytes that have the same destination, e.g. a file. Stream-level memory descriptors are also of two types: the first is used to set up stream-specific communication parameters and signals, and the second is used to map and expose the actual memory buffers for the streaming server to fetch and transfer from the application. DARTClient creates these memory descriptors on-the-fly when the application generates a new data stream, and publishes them to the streaming server as part of the notification message.

DARTClient provides asynchronous abstractions that enable the application layer to make non-blocking send calls and continue its computation without waiting for the transfers to complete. The underlying mechanisms use the OS/application-bypass capabilities provided by RDMA transfers to overlap data transfers with application computations, and hide the latency of data transfers. DARTClient uses multiple memory buffers to allow multiple data streams of different sizes to proceed in parallel. DARTClient supports a cooperative data transfer mode, in which multiple compute nodes write to the same data stream, e.g. a file.
In the cooperative mode, DARTClient is flexible and allows the application to define and group together the nodes that write to the same data stream.

Based on the groups defined at the application level, DARTClient coordinates the nodes of a group at runtime, and for each node it computes an associated offset in the common stream where the node should write. This offset is included in the notification message that is posted to the streaming server. This allows the server to be stateless with respect to a data stream and reduces the number of active connections that the server needs to keep open.

2.3. DART streaming server and receiver

DARTSServer is the streaming server component that runs as a stand-alone application on dedicated nodes of an HPC system, and is responsible for asynchronously extracting data from applications running on the compute nodes, and streaming it to remote clusters or to local storage. Multiple instances of the DARTSServer can run on different nodes, and cooperate to service data transfer requests from the larger number of compute nodes.

DARTSServer provides multiple key functions. It registers the compute nodes running the DARTClient layer, or other instances of streaming servers, at application start-up, and assigns and exchanges communication parameters, e.g. identification numbers, communication offsets, and identifiers for communication memory descriptors. It receives notification posts that indicate and describe the data available from the compute nodes, schedules and manages the data extraction from an application running on the compute nodes, and transfers the data out to local storage or to remote clusters as requested in the notification posts. Data transfers have priorities based on the size of the data blocks and the frequency at which the application generates them. DARTSServer maintains two priority queues: a high-priority queue for notification requests for smaller or more frequent data blocks, and a low-priority queue for less frequent or larger data blocks. This priority mechanism prevents the transfers of smaller blocks from being blocked by the transfers of larger ones and enables them to proceed in parallel with the larger block transfers.

The primary performance goal of the DARTSServer is to minimize the data transfer latencies and maximize the transfer throughput. To achieve these goals, the streaming server pre-allocates a large cache of transfer memory buffers and associates it with a memory descriptor at the start-up of the server application. The transfer cache improves the performance of the transfers and keeps a constant memory footprint throughout the runtime of the server. The size of the transfer cache and the number of memory buffers that the server can allocate are influenced by multiple factors, such as (1) the maximum limit on the number of file descriptors that can be used by a process (a memory descriptor counts as an open file), (2) the memory available on the host node, and (3) the size of the working set (to avoid thrashing behaviors). For example, the implementation on the Cray XT3/XT4 at Oak Ridge National Laboratory (described in more detail in Section 3) allocates 512 memory buffers of 4 MB each, and exposes the cache using a single memory descriptor. In addition to the memory descriptor used for the transfer cache, the streaming server also pre-allocates and exposes memory buffers for communication and transfer signaling with the compute nodes and the other instances of streaming servers.
For the transfer signaling function, the streaming server uses two descriptors. The first descriptor accepts notification posts from compute nodes, which indicate that data is available, whereas the second descriptor is used to send acknowledgment notifications to compute nodes when an asynchronous transfer completes.
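To make the two signaling descriptors concrete, the sketch below shows, in C, one possible layout for a request-for-send (rfs) notification and for the rule that places it into the high- or low-priority queue described above. The field names, types, and the 4 MB threshold are illustrative assumptions; the paper does not specify DART's actual message format.

/* Illustrative sketch only: a hypothetical request-for-send (rfs)
   notification and a priority rule, not the actual DART wire format. */
#include <stdint.h>

struct rfs_msg {                  /* posted by DARTClient to the server */
    uint32_t node_id;             /* compute node identifier (nid/pid pair) */
    uint32_t stream_id;           /* data stream, e.g. the target file */
    uint64_t size;                /* number of bytes ready in the client buffer */
    uint64_t stream_offset;       /* offset in the common stream (cooperative mode) */
    uint32_t mem_desc_id;         /* stream-level memory descriptor to fetch from */
    uint32_t priority_hint;       /* e.g. how often the application emits this stream */
};

/* Hypothetical threshold: small or frequent blocks go to the high-priority
   queue so they are not starved by large block transfers. */
#define SMALL_BLOCK_LIMIT (4u * 1024 * 1024)

static int is_high_priority(const struct rfs_msg *m)
{
    return m->size <= SMALL_BLOCK_LIMIT || m->priority_hint > 0;
}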

Figure 2. Data flow at the DARTSServer streaming server.

DARTSServer is multi-threaded to allow asynchronous and parallel data streaming, e.g. fetching new data from compute nodes while streaming previous data to a remote cluster, as well as allowing the data transfer scheduling and management to progress simultaneously, as illustrated in Figure 2. The server uses multiple queues (1) to manage the incoming compute node transfer requests, (2) to schedule and prioritize the data streaming operations, (3) to track and manage the transfer cache memory buffers in the different states of a transfer, and (4) to reduce synchronization contention between server threads. The server reuses the buffers of the transfer cache and links them in different queues according to the status of a data transfer, e.g. the eqb queue contains empty memory buffers, the pqb queue contains memory buffers associated with in-progress data transfers, and the fqb queue contains full buffers that are ready to be streamed to remote nodes or saved on local storage.

The overall data flow associated with the operation of DARTSServer is illustrated in Figure 2. A data transfer begins with a request-for-send message generated by an application on the compute nodes and posted to the notification descriptor (1) of the server. The server extracts each incoming notification message as it arrives (2), then encodes and inserts it in a priority queue accordingly (3). Next, the server processes the priority queues (4), schedules the data transfers (5 and 6), and starts the asynchronous data extraction from the compute nodes (6 and 7). A different thread in the server monitors the asynchronous data completion (8), and schedules the data streaming or saving (9 and 10). Then, the server finishes the transfers (11, 12, and 13) and sends an acknowledgment notification message to the notification memory descriptor of the corresponding compute node.

2.4. DART API and usage

The DART client layer (DARTClient) provides I/O primitives that are very similar to standard Fortran file operations, i.e. dart_open(), dart_write(), and dart_close(). We chose the names and signatures of these operators to make the library easy to use by scientists and to facilitate incorporation into existing application codes. Additionally, DART provides functions to initialize and finalize the DART library, i.e. dart_init() and dart_finalize(). DART also provides an API for compute node coordination to enable cooperative operations and to interface with additional middleware.

DART operators are asynchronous. A typical usage of the DART API consists of a stream open dart_open() operation that returns a handle, one or more non-blocking dart_write() send operations on this handle, and a stream end dart_close() operation on the handle. A DART operation may be in one of three states: not started, in progress, or finished. In the case of the first two states, DART maintains stream state information, and when the application calls dart_finalize(), it ensures that all in-progress operations finish. As a result, a call to dart_finalize() can block and should be called at the end of the application so that it does not impact the simulation. Sample application code that uses DARTClient is presented below.

// Sample application code using DARTClient
dart_init();
// Application computations
dart_handle = dart_open(file_name, {read_mode|write_mode});
dart_write(dart_handle, variable_1, size_1);
dart_write(dart_handle, variable_2, size_2);
dart_close(dart_handle);
// Application computations
dart_finalize();
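Because dart_write() is non-blocking and only dart_finalize() may block, a typical pattern is to interleave the calls above with the simulation's compute loop so that each checkpoint transfer overlaps the following compute stage. The sketch below is a hypothetical C fragment built only from the dart_* calls shown above; the handle type, loop bounds, variable names, checkpoint interval, and file name are illustrative assumptions and are not taken from GTC or XGC-1.

/* Hypothetical usage sketch: periodic checkpointing with the DART API.
   Assumes a DART header (name not specified in the paper) declares the
   dart_* functions and the write_mode flag used in the sample above. */
#include <stddef.h>

#define NUM_STEPS            200   /* illustrative values */
#define CHECKPOINT_INTERVAL  20

void run_simulation_with_dart(double *particles, size_t particles_size,
                              double *fields, size_t fields_size)
{
    int dart_handle;               /* handle type assumed for illustration */
    int step;

    dart_init();
    for (step = 0; step < NUM_STEPS; step++) {
        /* ... application compute stage for this iteration ... */

        if (step % CHECKPOINT_INTERVAL == 0) {
            dart_handle = dart_open("restart.bin", write_mode);
            /* dart_write() is non-blocking: it posts a request-for-send
               notification and returns, so the next compute stage can
               overlap with the actual data extraction by the server. */
            dart_write(dart_handle, particles, particles_size);
            dart_write(dart_handle, fields, fields_size);
            dart_close(dart_handle);
        }
    }
    /* dart_finalize() may block until all in-progress transfers finish,
       so it is called once, at the very end of the run. */
    dart_finalize();
}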

3. IMPLEMENTATION OF DART ON THE CRAY XT3/XT4

3.1. System description

We implemented DART on Jaguar, a Cray XT3/XT4 at Oak Ridge National Laboratory (ORNL). Jaguar [5] has 11k nodes logically divided into compute and service nodes.

XT3 is the second most recent version of the machine, and is built from compute nodes with dual-core 2.6 GHz AMD Opteron CPUs and 4 GB of memory per node (2 GB per core), and runs Cray's UNICOS/lc operating system with the Catamount micro-kernel [6]. The operating system on the compute nodes is specially tuned for scientific computing. To meet the scientific computation performance requirements, the OS removed some features such as the sockets interface, the TCP and UDP transport protocols, shared memory support, and multi-threading support. These tunings impose additional constraints and challenges on the implementation of a fast and asynchronous I/O substrate. A service node has a 2.4 GHz single-core AMD Opteron CPU with 4 GB of memory and runs a full-fledged Linux kernel.

XT4 is the most recent version of the machine, and is built from compute nodes with quad-core 2.1 GHz AMD Opteron CPUs and 8 GB of memory per node, i.e. 2 GB per core, and runs the Compute Node Linux (CNL) operating system. CNL is a version of the Linux kernel specifically customized for scientific computing environments. In contrast with Catamount, CNL scales better to quad-cores and adds support for multi-threading. Nevertheless, CNL does not provide a sockets interface or higher-level transport protocols such as TCP or UDP. The service nodes have 2.6 GHz dual-core AMD Opterons with 8 GB of memory.

Jaguar's compute and service nodes are interconnected in a 3D torus topology by Cray SeaStar routers through the proprietary HyperTransport interconnection bus, which is able to deliver a transfer bandwidth of 6.4 GB/s. The Jaguar service nodes are connected to a remote cluster, i.e. Ewok, by a 5 GB/s aggregated link. The Ewok cluster has 128 nodes, each node with 2 Intel Xeon 3.4 GHz CPUs and 4 GB of memory. The nodes on the cluster are interconnected by an InfiniBand switch through a 1 GB/s link. In our implementation, DARTClient runs on Jaguar's compute nodes, DARTSServer runs on Jaguar's service nodes, and DARTReceiver runs on Ewok nodes.

3.2. Compute node bootstrap mechanism

On the Cray XT3/XT4 machine, the compute nodes communicate with the service nodes using Portals RDMA calls. Portals identifies each process in the system by a Portals address, which is a tuple composed of a unique numerical identifier for the node (compute or service) and for the process, i.e. nid and pid. Two processes in the system can communicate if they know each other's Portals addresses. To run an application on the system, a user submits a job and requests a number of compute nodes from the job allocation system. However, the specific nodes assigned to the application, and consequently their Portals addresses, are determined only at runtime. The user application and the streaming server are two independent applications that need to communicate at runtime over Portals using RDMA calls.

We defined a customized bootstrap mechanism, independent of the underlying execution mechanism (e.g. MPI or the batch queuing system), that allows the compute nodes to find the Portals addresses of the streaming servers and to publish their own addresses to the servers. It requires that the streaming servers start up and complete their initialization phase before the applications start, and relies on the availability of a shared file system among the nodes that run the streaming server instances. The applications that run on the compute nodes and the streaming servers are launched using the same job script file. The job script first starts the streaming servers. The first instance of the server writes its Portals address (i.e. the tuple {nid, pid}) to a persistent configuration file (Figure 3), which is then read by the job script and exported to its environment variables. The job script then starts the user applications on the compute nodes, which inherit the environment variables from the script. When an application initializes, DARTClient parses the corresponding environment variables and reads the Portals address of the streaming server. As part of the bootstrap process, each DART client has to register its Portals address with an instance of the DART servers. Each DART client obtains its Portals address after it initializes the Portals network and includes its identifiers, i.e. {nid, pid}, in the registration message that it exchanges with the servers.

This bootstrap mechanism is both flexible and efficient because (1) it allows the streaming servers to start and configure themselves before the application starts, and (2) the compute nodes do not access the file system, which may be a concern especially for jobs with a large number of nodes. To bootstrap multiple streaming servers running on different nodes, the first server instance to enter the initialization phase creates and locks the configuration file into which it writes its Portals address, thus becoming the master instance.
The other instances check the existence of the configuration file, read the Portals address of the master instance, then send and register their own Portals addresses with the master server.
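A minimal sketch of how this master election could be realized on the shared file system is shown below, assuming an exclusive-create (O_CREAT | O_EXCL) of the configuration file; the file path, the record format, and the use of O_EXCL instead of an explicit lock are assumptions for illustration, since the paper does not give the exact call DART uses.

/* Illustrative sketch of master election via the shared configuration file.
   The path, O_EXCL-based creation, and record layout are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if this server instance became the master, 0 otherwise. */
int try_become_master(const char *conf_path, int nid, int pid)
{
    /* O_CREAT | O_EXCL fails if the file already exists, so exactly one
       server instance on the shared file system wins the race. */
    int fd = open(conf_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return 0;                  /* another instance is already the master */

    char rec[64];
    int len = snprintf(rec, sizeof(rec), "%d %d\n", nid, pid);
    write(fd, rec, (size_t)len);   /* publish the master's Portals address */
    close(fd);
    return 1;
}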

Figure 3. DART compute node bootstrap mechanisms.

During the application initialization phase, a DART client contacts the master streaming server, which forwards the registration request to an instance of a streaming server via the bootstrap process (see the example in Figure 3). The designated streaming server instance registers the compute node, which will send all its requests to this instance of the server. In this process, the master server uniformly distributes the compute nodes across the streaming servers, so that each server serves approximately the same number of compute nodes. In the implementation of DART described in this paper, the initial assignment of compute nodes to streaming servers is preserved throughout the runtime of the applications. This is because the load on the servers is naturally balanced due to the single program multiple data (SPMD) nature of the applications. Cooperative load balancing between streaming servers is used in cases where this is not true (e.g. if local adaptivity is used).

3.3. Data transfer protocol

As described in Section 2, DARTClient is lightweight and implements the bare minimum functionality so as to minimize its impact on the application. Consequently, the DARTSServer implements all the logic associated with scheduling, data extraction, and streaming. To achieve this, it uses a two-step protocol. In the first step, a compute node posts a small rfs (request for send) notification message into the communication buffer of the associated streaming server. In the second step, the streaming server schedules the request, extracts the data directly from the memory of the compute node, and transfers the data to the destination specified in the rfs notification (i.e. a remote node or local storage). The streaming server processes multiple data notification requests from the same or different compute nodes in a round-robin fashion according to the priorities of the requests.

We illustrate the operation of DART on the Cray XT3/XT4 system using the timing diagrams presented in Figure 4. A usual run of a scientific application consists of sequences of computations (identified by c_t in Figure 4(a)), followed by data transfer requests (identified by w_t and rfs_t in the same figure). The data transfer protocol is asynchronous, and an application running on a compute node continues its computations after it posts an rfs notification (i.e. rfs_t), and does not wait for the actual transfer to complete.

Figure 4. Timing diagrams illustrating the operation of DARTClient and DARTSServer: (a) DARTClient layer (w_t = wait time, rfs_t = request-for-send time, c_t = compute time) and (b) DARTSServer streaming server.

Although the streaming server does not begin the data extraction immediately, the transfer can proceed in parallel with the computation. A compute node may block before sending a new rfs notification message (i.e. w_t) if the streaming server is busy and has not yet processed a previous request from the same compute node. The effective data transfer time at the application level is the sum of the w_t and rfs_t times.

In Figure 4(b), we present the timing diagram corresponding to the transfer of a data block from a compute node to a remote node on the streaming server side. In the figure, t_eb is the time the server spends to process a notification message from a compute node and to schedule the corresponding data extraction, t_pb is the time the server spends to extract the data from the compute node using the Portals interface, and t_fb is the time the server spends to transfer the data to a remote node using the TCP interface or to save it to local storage. The values of these parameters correspond to the processing times of the threads t_1 + t_2, t_3, and t_4 as described in Section 2. The total I/O time equals the sum of these three values plus some OS thread scheduling overheads.

4. EXPERIMENTAL EVALUATION

We evaluated the performance of DART using three sets of experiments. The first set evaluated the raw performance of DART using synthetic simulations. The synthetic simulations followed the same computational pattern as the real applications, e.g. a set of one or more computational stages followed by one I/O stage. These experiments evaluated the core parameters that influence the effective data transfer rates from compute to service nodes. These include (1) the size of the data unit fetched, i.e. the block size, (2) the number of memory descriptors allocated by the streaming server, (3) the I/O rate, i.e. the frequency at which the simulations produce data, and (4) the time required to stream the data further, i.e. the time to transfer data to its destination, such as a remote cluster or local storage.

The second set of experiments evaluated the performance of DART when we integrated it with real application codes, e.g. the GTC and XGC-1 fusion simulations, and analyzed the trade-offs that we can make to improve the performance of an I/O system. The key metrics used in these experiments were (1) the effective data transfer rate and (2) the overhead of using DART on the performance of an application.

The third set of experiments evaluated the impact of using compute nodes as an explicit staging area for data I/O. In these experiments, we ran multiple instances of the DARTSServer on a separate partition of the compute system, i.e. the staging nodes, which wrote application data to a local storage system.

Figure 5. Data transfer rate between compute and service nodes.

4.1. Evaluating raw performance

4.1.1. Data Transfer Rate

In this experiment, we evaluated the maximum application-level transfer rate that we can achieve between compute and service nodes using RDMA data transfers. We set up this experiment using two compute nodes and one service node. The streaming server running on the service node extracted data from the simulation running on the two compute nodes, and we measured the effective transfer rates. For the synthetic simulation used, we fixed the compute stage duration to 1 ms. However, in this case the compute stage duration, which determines the I/O rate, does not influence the transfer rates, because we measured only the transfer time and size at the streaming server, and not the I/O overhead. We varied the size of the data unit (i.e. the block size) from 1 to 1 MB in increments of 4 MB, and measured the achieved transfer rate for each value. We ran the synthetic simulation for 1 iterations, and performed 1 data transfers for each value of the block size. We plot the average transfer rate (over the 1 iterations) for each block size in Figure 5.

The plot shows that we can saturate the communication link between compute and service nodes using a minimum block size of 4 MB. (Note that the Y axis in the plot does not start at 0 and instead zooms in on the region between 1134 MB/s and 1152 MB/s.) The saturation value is achieved at the application level and represents the maximum remote memory-to-memory data transfer rate that the system can effectively sustain in the best case.

We used the gettimeofday() system call to measure the transfer times, and this includes the additional time for system operations, such as OS process scheduling, etc., which explains the variability in the measurements.

4.1.2. Overlapping Computations with Data Transfers

In this experiment, we evaluated the influence of the compute stage duration, or the I/O rate, on the efficiency of overlapping computations with transfers to hide the latency of the data transfers. We used 128 compute nodes to run the synthetic simulation and one service node to run the instance of the streaming server. Further, the streaming server transferred the data to a node on a remote cluster, i.e. Ewok. We varied the duration of the compute stage for the synthetic simulation to control the frequency of the I/O operations. The I/O rate for the synthetic simulation in these tests was selected to demonstrate the effect on the application overhead, and on the ability to overlap I/O with computations.

At the beginning of an I/O stage, all the nodes of the synthetic simulation post their rfs notification messages at the same time, but the streaming server can only process these notifications sequentially and in a round-robin fashion. To transfer the data from consecutive I/O stages efficiently, the streaming server requires a minimum amount of time in which it can fetch and stream the data produced by the simulation. It can process the (i+1)-th notification request from a compute node only after it finishes processing all the i-th requests from the other compute nodes. We derived the formula for the required time as t_fb * N + t_pb, where t_fb is the time spent to transfer the content of a memory buffer to a remote cluster, i.e. Ewok, or to local storage, N is the number of compute nodes used by the simulation, and t_pb is the additional time spent to process and fetch the data from a compute node's memory. The value of t_pb is typically negligible compared to t_fb * N, because streaming data to a remote location is the most expensive operation. Our goal was to find a good balance between the duration of a compute stage and the size of the data streamed, to better overlap computations with data transfers. In each of the tests, we measured the I/O, the wait, and the rfs times at the simulation level on the compute nodes, and the total I/O time for an operation (from the rfs notification to streaming completion) at the streaming server.

In the first test we fixed the duration of the compute stage to 2 s. We chose this value based on previous experiments to demonstrate the effects of poor overlapping. The results from this test are plotted in Figure 6. In these plots, the x axis is the numerical identifier of each compute node and the y axis is the cumulative I/O time for 1 simulation iterations. The I/O time on the compute nodes equals the sum of the rfs and wait times. The values of the wait and I/O times in Figure 6(a) are almost identical, and this indicates that the I/O time is dominated by the wait time. A large wait time value indicates that a notification operation blocks, and waits for a previous operation to finish. In this case, the compute stage duration was small, and the frequency of I/O stages was higher than the streaming server's ability to transfer the fetched data block to the remote location (i.e. the Ewok cluster).
As a result, the streaming server delayed the processing of the (i+1)-th rfs notifications from the compute nodes, and this causes the large I/O time in Figure 6(b). The average I/O time difference for a block transfer, as measured at the streaming server and at the compute nodes, was 34 ms, and represents the block streaming time to the remote cluster.

Figure 6. Overhead of DART on the application for a compute phase of 2 s. Cumulative I/O time: (a) at the compute nodes and (b) at the streaming server.

Figure 7. Overhead of DART on the application for a compute phase of 4.3 s. Cumulative I/O time: (a) at the compute nodes and (b) at the streaming server.

The results of this test show that the overlap between computations and the overhead of data extraction is not optimal, as the wait time is very high on the compute nodes. In the second test, we fixed the compute stage duration to 4.3 s, i.e. t_fb = 34 ms and N = 128, based on the minimum time formula t_fb * N + t_pb established above and the results from the previous test, and neglected the t_pb term. The rest of the setup and parameters remained the same. The results of the test are plotted in Figure 7. The wait time value in this case is very small (5 ms), indicating that the (i+1)-th fetch operation from a compute node could proceed almost immediately, without waiting for the i-th transfer to finish first. These results demonstrate an efficient overlap of simulation computations with data transfers, resulting in a very low I/O overhead on the simulation.
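As a quick check, substituting the values measured above (t_fb = 34 ms, N = 128) into the minimum time formula and neglecting t_pb gives

t_fb * N + t_pb ≈ 0.034 s × 128 ≈ 4.35 s,

which is consistent with the roughly 4.3 s compute stage used in this second test, and explains why the 2 s stage in the first test could not hide the streaming cost.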

4.2. Evaluating application performance

In the second set of experiments, we measured the additional overhead that DART operations impose on a real application running on the compute nodes, and expressed the overhead as a percentage of the time the simulation spends in its compute stages. In these experiments, we integrated DART with real scientific application codes and analyzed its performance using multiple criteria. First, we analyzed the throughput of the streaming server on both the Portals (compute nodes to streaming server) and TCP (streaming server to remote node) interfaces; second, we analyzed the I/O overhead on the simulation; and third, we analyzed the scalability of DART.

4.2.1. Integration with GTC

In this set of experiments, we used the GTC simulation. GTC is a highly accurate fusion modeling code that performs a highly complex analysis of the nonlinear dynamics of plasma turbulence using numerical simulations, and requires very large-scale computing capabilities. We combined DART with the GTC code and varied simulation parameters in the runs described below. The key GTC parameters that influence the size of the data generated and the I/O rate are micell and mecell, which determine the size of the global particle domain, npartdom, which determines the number of particle sub-domains, and msnap, which determines the I/O rate, i.e. the number of restart files produced and the interval between them.

To test the scalability of DART, we ran the GTC simulation on 1024 and 2048 compute nodes. For the test on 1024 compute nodes, we set the micell and mecell parameters to 16, npartdom to 16, and msnap to 13. With these parameter values, each compute node in the GTC application generated a checkpoint file of 4 MB every 13 simulation iterations, each simulation iteration took 8 s, and the simulation ran for 2 iterations. Proportionately, for the test on 2048 compute nodes, we set the micell and mecell parameters to 16, npartdom to 32, and msnap to 26, and ran the GTC code for 4 iterations. Using these parameters, each compute node of the GTC application generated a checkpoint file of 4 MB every 26 simulation iterations, and each iteration took 8 s. For these two GTC experiments, we ran the streaming server on a dedicated service node (i.e. no other users or I/O services were running on the same node). To analyze the performance of DART, we measured the wait and I/O times at the application level for each of the compute nodes, and the throughput obtained on both the Portals and TCP interfaces at the streaming server. During these experiments, the remote receiver did not perform any processing on the data, but received and discarded it.

We expressed the DART I/O overhead on the GTC application due to the data extraction and streaming as a percentage of the time spent in the compute stages, and plotted the results in Figure 8. The x axis in these plots is a numerical identifier for each compute node, and the y axis is the overhead measured at each node of the simulation. In Figure 8(a), the average wait time value across the 1024 nodes is <0.15%, and this indicates that DART is able to efficiently overlap computations with data extraction and transport. Further, the average value of the I/O time is <0.65%, i.e. the overhead on the simulation is very small. In Figure 8(b), the average I/O time value across the 2048 nodes is still small at <0.4%.
However, in this case the difference between the I/O and wait times is much smaller than in the 1024-node case, indicating that the streaming server is reaching the limit on the maximum number of compute nodes it can service simultaneously.

Figure 8. Percentage of I/O and wait time overhead on the GTC simulation measured at the compute nodes. I/O overhead on the GTC code: (a) on 1024 nodes and (b) on 2048 nodes.

Figure 9. Cumulative I/O times for the GTC simulation measured at the DART streaming server. Cumulative I/O time at the streaming server servicing: (a) 1024 compute nodes and (b) 2048 compute nodes.

The average overhead value for the test on 2048 nodes is smaller than the value for the test on 1024 nodes because the GTC application generated the same amount of checkpoint data per compute node, but the length of the computing stage was doubled.

Figure 9 shows the cumulative I/O time at the streaming server (1) to extract the checkpoint data from the application using the Portals interface and (2) to transport the data to a remote node using the TCP interface, for each compute node over the runtime of the application. In these plots, the x axis is a numerical identifier for each compute node and the y axis is the cumulative I/O time. As we expected, the transfer time for the Portals interface is smaller than for the TCP interface. The corresponding transfer rate values (Figure 10(a), (b), both Portals and TCP) are similar for the two cases (1024 and 2048 compute nodes), because the buffers that the streaming server allocated for communication are filled to capacity throughout the tests.

Figure 10. Throughput at the DART server using Portals (compute nodes to service node) and TCP (service node to remote node). DART streaming server servicing: (a) 1024 compute nodes and (b) 2048 compute nodes.

We benchmarked the link throughput between the streaming server and the remote receiver at the application level using the TCP transport protocol, and measured a 5.1 Gbps peak value. In Figure 10, we present the throughput achieved by DART at the streaming server on both the Portals and TCP interfaces. In this plot, the x axis is the runtime of the simulation, and the y axis is the throughput as a percentage of the peak value. In both tests, i.e. on 1024 and 2048 nodes, we achieved an average sustained throughput on the TCP interface of 60% of the peak TCP value. This is because the streaming server spends some time on Portals RDMA transfers and on prioritizing and scheduling the extract operations. The average throughput on the Portals interface is comparable to the value on the TCP interface, i.e. 60%, which we expected since the communication buffers at the streaming server are fully utilized and become a bottleneck. The TCP transport is slower than Portals, but as the two transports share the same communication buffers at the streaming server, the Portals throughput adapts to the slower link.

To summarize, in the experiment on 1024 compute nodes, the simulation ran for 15 s and each node generated 55 MB of data, resulting in a total of 555 GB across the system. The overhead on the simulation was less than 0.7%. In the experiment on 2048 compute nodes, the simulation ran for 3 s, with each node generating 55 MB of data for a total of 1.2 TB across the system, and the overhead on the simulation was 0.4%.

4.2.2. Integration with XGC-1

We have also integrated DART with the XGC-1 [4] simulation. XGC-1 is an edge turbulence particle-in-cell application that uses numerical models to simulate the 2D density, temperature, bootstrap current, and viscosity profiles in accordance with neoclassical turbulent transport. In our experiments with XGC-1, we used 1024 compute nodes and one instance of the streaming server. We set up the XGC-1 simulation to run for 1 iterations, and to produce a checkpoint file every 2 iterations.

Figure 11. Application execution flow using staging nodes.

One application iteration took 3 s to complete, and the size of a checkpoint file is 76 MB per compute node for a simulation with 1 million particles. The behavior and performance of DART observed in these experiments were similar to those described above for GTC. The XGC-1 simulation ran for 32 s, and each node generated 76 MB of data, resulting in a total of 38 GB of data across the system. The overhead on the simulation due to DART operations was <0.3% of the simulation compute time. An average sustained throughput of 60% of the peak TCP throughput was achieved between the streaming server and the remote node.

4.3. Evaluating the impact of using staging nodes

In the experiments presented so far, we used a single instance of the DARTSServer, which ran on either a dedicated or a shared service node, and streamed the data to a remote DARTReceiver. The scalability and the efficiency of this approach were limited by the capabilities of the service node. A more flexible approach is to run multiple streaming server instances in cooperative mode. However, the number of dedicated service nodes that are publicly available is small, and they are typically shared between various user jobs, e.g. application compilation and data archiving, which can produce heavy loads on the system and thus can negatively impact the performance of the streaming server. To alleviate these problems, we migrated the instances of the DARTSServer to staging nodes, which were essentially a partition of the compute nodes exclusively running the streaming server. These DARTSServer instances running on staging nodes can essentially be viewed as a smaller job in the system, which exists in addition to the application. The application data in this case was written to a local storage system.

The new setup is illustrated in Figure 11. In this example, compute nodes 1-4 run the application, and staging nodes 1 and 2 run the streaming server. DART allows compute nodes associated with different server instances to write their data to the same file, and it ensures transport protocol coherency. To preserve data coherency, DARTClient on the compute nodes coordinates among the nodes writing to the same file and determines a distinct offset for each node to write at, based on the data size at each node. In the example, the numbers at the top of the figure represent a possible sequence of notification posts to the streaming server, and the numbers at the bottom represent a possible sequence for


More information

Reduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection

Reduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Philip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory

Philip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory Philip C. Roth Computer Science and Mathematics Division Oak Ridge National Laboratory A Tree-Based Overlay Network (TBON) like MRNet provides scalable infrastructure for tools and applications MRNet's

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Ref: Chap 12. Secondary Storage and I/O Systems. Applied Operating System Concepts 12.1

Ref: Chap 12. Secondary Storage and I/O Systems. Applied Operating System Concepts 12.1 Ref: Chap 12 Secondary Storage and I/O Systems Applied Operating System Concepts 12.1 Part 1 - Secondary Storage Secondary storage typically: is anything that is outside of primary memory does not permit

More information

Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Swen Böhm 1,2, Christian Engelmann 2, and Stephen L. Scott 2 1 Department of Computer

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Extreme I/O Scaling with HDF5

Extreme I/O Scaling with HDF5 Extreme I/O Scaling with HDF5 Quincey Koziol Director of Core Software Development and HPC The HDF Group koziol@hdfgroup.org July 15, 2012 XSEDE 12 - Extreme Scaling Workshop 1 Outline Brief overview of

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems DM510-14 Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance 13.2 Objectives

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

RDMA-like VirtIO Network Device for Palacios Virtual Machines

RDMA-like VirtIO Network Device for Palacios Virtual Machines RDMA-like VirtIO Network Device for Palacios Virtual Machines Kevin Pedretti UNM ID: 101511969 CS-591 Special Topics in Virtualization May 10, 2012 Abstract This project developed an RDMA-like VirtIO network

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL Fourth Workshop on Ultrascale Visualization 10/28/2009 Scott A. Klasky klasky@ornl.gov Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL H. Abbasi, J. Cummings, C. Docan,, S. Ethier, A Kahn,

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Computer Systems Assignment 4: Scheduling and I/O

Computer Systems Assignment 4: Scheduling and I/O Autumn Term 018 Distributed Computing Computer Systems Assignment : Scheduling and I/O Assigned on: October 19, 018 1 Scheduling The following table describes tasks to be scheduled. The table contains

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition Chapter 13: I/O Systems Silberschatz, Galvin and Gagne 2013 Chapter 13: I/O Systems Overview I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations

More information

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State

More information

Noise Injection Techniques to Expose Subtle and Unintended Message Races

Noise Injection Techniques to Expose Subtle and Unintended Message Races Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau

More information

CS370 Operating Systems Midterm Review

CS370 Operating Systems Midterm Review CS370 Operating Systems Midterm Review Yashwant K Malaiya Fall 2015 Slides based on Text by Silberschatz, Galvin, Gagne 1 1 What is an Operating System? An OS is a program that acts an intermediary between

More information

PreDatA - Preparatory Data Analytics on Peta-Scale Machines

PreDatA - Preparatory Data Analytics on Peta-Scale Machines PreDatA - Preparatory Data Analytics on Peta-Scale Machines Fang Zheng 1, Hasan Abbasi 1, Ciprian Docan 2, Jay Lofstead 1, Scott Klasky 3, Qing Liu 3, Manish Parashar 2, Norbert Podhorszki 3, Karsten Schwan

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Streams Performance 13.2 Silberschatz, Galvin

More information

Multifunction Networking Adapters

Multifunction Networking Adapters Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained

More information

Device-Functionality Progression

Device-Functionality Progression Chapter 12: I/O Systems I/O Hardware I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Incredible variety of I/O devices Common concepts Port

More information

Chapter 12: I/O Systems. I/O Hardware

Chapter 12: I/O Systems. I/O Hardware Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations I/O Hardware Incredible variety of I/O devices Common concepts Port

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access File File System Implementation Operating Systems Hebrew University Spring 2009 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Performance Enhancement for IPsec Processing on Multi-Core Systems

Performance Enhancement for IPsec Processing on Multi-Core Systems Performance Enhancement for IPsec Processing on Multi-Core Systems Sandeep Malik Freescale Semiconductor India Pvt. Ltd IDC Noida, India Ravi Malhotra Freescale Semiconductor India Pvt. Ltd IDC Noida,

More information

Operating System Review Part

Operating System Review Part Operating System Review Part CMSC 602 Operating Systems Ju Wang, 2003 Fall Virginia Commonwealth University Review Outline Definition Memory Management Objective Paging Scheme Virtual Memory System and

More information

REMEM: REmote MEMory as Checkpointing Storage

REMEM: REmote MEMory as Checkpointing Storage REMEM: REmote MEMory as Checkpointing Storage Hui Jin Illinois Institute of Technology Xian-He Sun Illinois Institute of Technology Yong Chen Oak Ridge National Laboratory Tao Ke Illinois Institute of

More information

Chapter 12: I/O Systems

Chapter 12: I/O Systems Chapter 12: I/O Systems Chapter 12: I/O Systems I/O Hardware! Application I/O Interface! Kernel I/O Subsystem! Transforming I/O Requests to Hardware Operations! STREAMS! Performance! Silberschatz, Galvin

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance Silberschatz, Galvin and

More information

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition Chapter 12: I/O Systems Silberschatz, Galvin and Gagne 2011 Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

The Red Storm System: Architecture, System Update and Performance Analysis

The Red Storm System: Architecture, System Update and Performance Analysis The Red Storm System: Architecture, System Update and Performance Analysis Douglas Doerfler, Jim Tomkins Sandia National Laboratories Center for Computation, Computers, Information and Mathematics LACSI

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World

More information

Cluster Network Products

Cluster Network Products Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Experiments with Wide Area Data Coupling Using the Seine Coupling Framework

Experiments with Wide Area Data Coupling Using the Seine Coupling Framework Experiments with Wide Area Data Coupling Using the Seine Coupling Framework Li Zhang 1, Manish Parashar 1, and Scott Klasky 2 1 TASSL, Rutgers University, 94 Brett Rd. Piscataway, NJ 08854, USA 2 Oak Ridge

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

IBM Spectrum Scale IO performance

IBM Spectrum Scale IO performance IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

High-performance message striping over reliable transport protocols

High-performance message striping over reliable transport protocols J Supercomput (2006) 38:261 278 DOI 10.1007/s11227-006-8443-6 High-performance message striping over reliable transport protocols Nader Mohamed Jameela Al-Jaroodi Hong Jiang David Swanson C Science + Business

More information

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD. OceanStor 9000 Issue V1.01 Date 2014-03-29 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in

More information

Oak Ridge National Laboratory Computing and Computational Sciences

Oak Ridge National Laboratory Computing and Computational Sciences Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Parallel Computer Architecture II

Parallel Computer Architecture II Parallel Computer Architecture II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-692 Heidelberg phone: 622/54-8264 email: Stefan.Lang@iwr.uni-heidelberg.de

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to A Sleep-based Our Akram Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) May 1, 2011 Our 1 2 Our 3 4 5 6 Our Efficiency in Back-end Processing Efficiency in back-end

More information

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran

More information

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb.

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb. Messaging App IB Verbs NTRDMA dmaengine.h ntb.h DMA DMA DMA NTRDMA v0.1 An Open Source Driver for PCIe and DMA Allen Hubbe at Linux Piter 2015 1 INTRODUCTION Allen Hubbe Senior Software Engineer EMC Corporation

More information

The Tofu Interconnect 2

The Tofu Interconnect 2 The Tofu Interconnect 2 Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited Introduction Tofu interconnect

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

The way toward peta-flops

The way toward peta-flops The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe Where things started from DESIGN CONCEPTS 2 New challenges and requirements! Optimal sustained flops

More information

The Role of InfiniBand Technologies in High Performance Computing. 1 Managed by UT-Battelle for the Department of Energy

The Role of InfiniBand Technologies in High Performance Computing. 1 Managed by UT-Battelle for the Department of Energy The Role of InfiniBand Technologies in High Performance Computing 1 Managed by UT-Battelle Contributors Gil Bloch Noam Bloch Hillel Chapman Manjunath Gorentla- Venkata Richard Graham Michael Kagan Vasily

More information

Module 12: I/O Systems

Module 12: I/O Systems Module 12: I/O Systems I/O hardwared Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Performance 12.1 I/O Hardware Incredible variety of I/O devices Common

More information

The Common Communication Interface (CCI)

The Common Communication Interface (CCI) The Common Communication Interface (CCI) Presented by: Galen Shipman Technology Integration Lead Oak Ridge National Laboratory Collaborators: Scott Atchley, George Bosilca, Peter Braam, David Dillow, Patrick

More information

Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT

Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT Paul Hargrove Dan Bonachea, Michael Welcome, Katherine Yelick UPC Review. July 22, 2009. What is GASNet?

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory

Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory Troy Baer HPC System Administrator National Institute for Computational Sciences University

More information