Enabling high-speed asynchronous data extraction and transfer using DART


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2010). Published online in Wiley InterScience.

Enabling high-speed asynchronous data extraction and transfer using DART

Ciprian Docan 1, Manish Parashar 1 and Scott Klasky 2

1 Center for Autonomic Computing/TASSL Laboratory, Rutgers University, Piscataway, NJ 08854, U.S.A. 2 Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831, U.S.A.

SUMMARY

As the complexity and scale of applications grow, managing and transporting the large amounts of data they generate are quickly becoming a significant challenge. Moreover, the interactive and real-time nature of emerging applications, as well as their increasing runtime, make online data extraction and analysis a key requirement in addition to traditional data I/O and archiving. To be effective, online data extraction and transfer should impose minimal additional synchronization requirements, should have minimal impact on the computational performance and communication latencies, maintain overall quality of service, and ensure that no data is lost. In this paper we present Decoupled and Asynchronous Remote Transfers (DART), an efficient data transfer substrate that effectively addresses these requirements. DART is a thin software layer built on RDMA technology to enable fast, low-overhead, and asynchronous access to data from a running simulation, and supports high-throughput, low-latency data transfers. DART has been integrated with applications simulating fusion plasma in a Tokamak, being developed at the Center for Plasma Edge Simulation (CPES), a DoE Office of Fusion Energy Science (OFES) Fusion Simulation Project (FSP). A performance evaluation using the Gyrokinetic Toroidal Code and XGC-1 particle-in-cell-based FSP simulations running on the Cray XT3/XT4 system at Oak Ridge National Laboratory demonstrates how DART can effectively and efficiently offload simulation data to local service and remote analysis nodes, with minimal overheads on the simulation itself. Copyright 2010 John Wiley & Sons, Ltd.

Received 18 January 2009; Revised 23 December 2009; Accepted 25 December 2009

KEY WORDS: asynchronous transfers; RDMA; low-overhead

Correspondence to: Ciprian Docan, Center for Autonomic Computing/TASSL Laboratory, Rutgers University, Piscataway, NJ 08854, U.S.A. E-mail: docan@cac.rutgers.edu

Contract/grant sponsor: National Science Foundation; contract/grant numbers: IIP, CCF-83339, DMS, CNS, IIS 43826, CNS. Contract/grant sponsor: Department of Energy; contract/grant number: DE-FG2-6ER54857. Contract/grant sponsor: The Extreme Scale Systems Center. Contract/grant sponsor: Department of Defense. Contract/grant sponsor: IBM Faculty Award.

Copyright 2010 John Wiley & Sons, Ltd.

1. INTRODUCTION

Computing systems, both academic and enterprise, are rapidly growing in scale and complexity, and are enabling larger and more complex applications. For example, high-performance computing (HPC) is playing an important role in science, engineering, and finance, and is enabling highly accurate simulations of complex phenomena. As these systems and applications grow, efficiently managing and transporting the large amounts of associated data, while still maintaining the desired computational efficiencies, is becoming an important and immediate challenge. Furthermore, emerging applications are based on seamless interactions and couplings across multiple and potentially distributed computational, data, and information services. For example, current fusion simulation efforts are exploring coupled models and codes that simultaneously simulate separate application processes and run on different HPC resources at supercomputing centers. These components need to interact at runtime with each other, and with services for online data monitoring, analysis, or archiving. Similar data management and transport requirements exist in applications that require, for example, migration of virtual machines or replication of databases for reliability or quality of service. These applications thus require a scalable and robust substrate to manage the large amounts of data they generate and to asynchronously extract and transport the data between interacting components.

For fusion simulations, large volumes and heterogeneous types of data have to be continuously streamed from the compute partition of a petascale machine to its service partition, and from there to compute systems that run coupled simulation components, and to auxiliary data analysis and storage systems. A key challenge that must be addressed by such a substrate in this case is getting the large amounts of data being generated by these applications off the compute nodes at runtime, and over to service nodes or another system for code coupling, online monitoring, analysis, or archiving. To be effective, such an online data extraction and transfer service must (1) have minimal impact on the execution of the application in terms of performance overhead, increased latencies, or synchronization requirements, (2) satisfy stringent application/user space, time, and quality of service constraints, and (3) ensure that no data is lost. On many HPC resources, the large number of compute nodes is typically serviced by a smaller number of service nodes to which expensive I/O operations can be offloaded. As a result, the I/O substrate should be able to asynchronously transfer data from compute nodes to a service node with minimal delay and overhead on the simulation. Technologies such as RDMA allow fast memory access into the address space of an application without interrupting the computational process, and provide a mechanism that can support these requirements.

In this paper, we present Decoupled and Asynchronous Remote Transfers (DART), an efficient data transfer substrate that effectively addresses the requirements described above. DART is a thin software layer built on RDMA technology to enable fast, low-overhead, and asynchronous access to data from a running simulation, and to support high-throughput, low-latency data transfers. The design and prototype implementation of DART using the Portals RDMA library [1] on the Cray XT3/XT4 at Oak Ridge National Laboratory is also presented.
The key contribution of DART is in providing application-level mechanisms for asynchronous, high-throughput data transport for large-scale applications that can effectively move data from a running application to a (smaller) set of service nodes with minimal overhead on the application.

Once the data is at the service nodes, expensive I/O operations can be off-loaded to these nodes without impacting the application performance. Furthermore, the data can also be served (possibly after manipulations) to a variety of services such as coupling, storage and archival, streaming, monitoring, visualization, indexing, etc. DART has been integrated with applications simulating fusion plasma in a Tokamak, being developed at the Center for Plasma Edge Simulation (CPES), a DoE Office of Fusion Energy Science (OFES) Fusion Simulation Project (FSP), and is a key component of a high-throughput data streaming substrate that uses metadata-rich outputs to support in-transit data processing and data redistribution for coupled simulations. A performance evaluation using the Gyrokinetic Toroidal Code (GTC) [2,3] and XGC-1 [4] particle-in-cell-based FSP simulations is presented. The evaluation demonstrates that DART can effectively use RDMA technologies to off-load expensive I/O operations to service nodes with very small overheads on the simulation itself, allowing a more efficient utilization of the compute elements, and enabling efficient online data monitoring and analysis on remote clusters.

The remainder of this paper is organized as follows. Section 2 describes the architecture of DART and its key components. Section 3 describes the implementation on the Cray system and its operation. Section 4 presents the evaluation of DART. Section 5 presents the related work, and Section 6 concludes the paper and outlines future research directions.

2. DART ARCHITECTURE AND OPERATION

The primary goal of DART is to efficiently manage and transfer large amounts of data from applications running on the compute nodes of an HPC system to local storage or remote locations, in order to enable remote application monitoring, data analysis, code coupling, and data archiving. The key requirements that DART tries to satisfy include minimizing data transfer overheads on the running application, achieving high-throughput, low-latency data transfers, and preventing data losses. In order to meet these requirements, we designed DART to enable dedicated nodes, i.e. nodes separate from the application compute nodes, to schedule and extract data directly from the memory of the compute nodes using asynchronous transfers built on RDMA communication primitives. Thus, we off-load expensive data I/O and streaming operations from applications running on compute nodes to dedicated nodes, and allow applications to progress while data is being transferred. DART is able to provide asynchronous communication abstractions for environments that support only a single thread of execution, i.e. operating systems tuned for scientific computing that do not support threads. The asynchronous abstraction provided is transparent to applications, and DART ensures transport coherency and resource release.

The DART architecture contains three key components, as shown in Figure 1: (1) a thin client layer (DARTClient), which integrates with applications and runs on the compute nodes of an HPC system, (2) a streaming server (DARTSServer), which runs on dedicated nodes independently of the application executed on the compute nodes, and is responsible for data scheduling, extraction, and transport, and (3) a receiver (DARTReceiver), running on remote nodes, which receives and processes data streamed by the DARTSServer.

Figure 1. Architectural overview of DART.

Note that DART clients and servers have layered architectures with a base communication layer. This communication layer can be implemented using different RDMA systems such as, for example, Portals, InfiniBand, DCMF, iWARP, etc., allowing DART to be portable across systems.

2.1. Asynchronous communication using RDMA

The RDMA communication paradigm provides a window into the memory address space of a process. It supports process-to-process communication models with zero-copy, and OS and application bypass. The zero-copy transfer model is efficient because it can transfer data directly from the application memory buffers to a remote destination, thus avoiding an extra copy of the data into the kernel buffers. The OS-bypass communication model can execute data transfers without involving the CPU (e.g. DMA transfers). The application-bypass communication model can perform a data transfer while an application is running, and without interrupting the application. Thus, RDMA communication models are well suited to implement an asynchronous transfer protocol that has a very small impact on a running application.

To provide the asynchronous communication abstractions to applications, we used the Portals [1] library, which implements the RDMA communication paradigm, and provides data transfer primitives for direct memory-to-memory communication. The Portals library supports the RDMA pull communication model (the get operation), which can extract messages from a remote node's memory, and the RDMA push model (the put operation), which can inject messages into a remote node's memory. To support memory-to-memory data transfers between processes, the Portals library maps blocks of an application's memory (memory buffers) to the network card, and thus makes them accessible to processes running on remote nodes. To support this mapping, it uses memory descriptors, which are Portals data types containing internal bookkeeping information such as the start address and the length of the memory buffers, current offsets, etc.

The asynchronous communication abstraction allows an application to start a data transfer and continue its execution without waiting for the transfer completion. With DART, this process is transparent to an application, as DART uses the application-bypass features for data transfers, and a higher-level data signaling and transfer completion notification mechanism that leverages the Portals library software events. The Portals library can generate events for each RDMA transfer operation to indicate the start or completion of a transfer. Each RDMA event is associated with a specific memory descriptor, and Portals uses an internal event queue for each memory descriptor to log the event.

At a higher layer, DART enables and uses these events to ensure the coherency of the asynchronous protocol and transfer completion.

2.2. DART client layer

DARTClient is a lightweight software layer that integrates with an application and runs on the compute nodes of an HPC resource. It works in conjunction with a streaming server to export the asynchronous communication abstraction at the application level, and to provide a transparent and coherent transfer protocol. It is responsible for two key functions: (1) control and coordination and (2) data transfers. The control and coordination function registers a compute node with a streaming server at application start-up, and posts request-for-transfer notifications to the streaming server at runtime. During the registration phase, DARTClient and the streaming server exchange identification and communication parameters, such as unique numerical instance identifiers, communication offsets, and memory descriptor identifiers. The data transfer function establishes the communication parameters, polls for the completion of previous transfers, and releases resources. To post a transfer request, DARTClient sets up the parameters and priorities for the transfer, and notifies the streaming server that it has data ready to be sent. It is then the responsibility of the streaming server, based on the transfer parameters, to schedule and execute the transfer of the data, and to notify DARTClient on the completion of the operation.

Following the RDMA communication paradigm, DARTClient pre-allocates the communication buffers and exposes two classes of memory descriptors to communicate with the streaming server. The first class contains application-level memory descriptors that DARTClient creates during the registration phase, and preserves throughout the lifetime of the application. DARTClient defines two types of application-level memory descriptors: the first is used to post request-for-transfer notifications to the streaming server, and the other is used to retrieve transfer completion acknowledgments from the streaming server. The second class contains stream-level memory descriptors that DARTClient creates for each data stream the application generates. A data stream is a continuous flux of bytes that have the same destination, e.g. a file. Stream-level memory descriptors are also of two types: the first is used to set up stream-specific communication parameters and signals, and the second is used to map and expose the actual memory buffers for the streaming server to fetch and transfer from the application. DARTClient creates these memory descriptors on-the-fly when the application generates a new data stream, and publishes them to the streaming server as part of the notification message.

DARTClient provides asynchronous abstractions that enable the application layer to make non-blocking send calls and continue its computation without waiting for the transfers to complete. The underlying mechanisms use the OS/application-bypass capabilities provided by RDMA transfers to overlap data transfers with application computations, and hide the latency of data transfers. DARTClient uses multiple memory buffers to allow multiple data streams of different sizes to proceed in parallel. DARTClient supports a cooperative data transfer mode, in which multiple compute nodes write to the same data stream, e.g. a file.
In the cooperative mode, DARTClient is flexible and allows the application to define and group together the nodes that write to the same data stream.

Based on the groups defined at the application level, DARTClient coordinates the nodes of a group at runtime, and for each node it computes an associated offset in the common stream where the node should write. This offset is included in the notification message that is posted to the streaming server. This allows the server to be stateless with respect to a data stream and reduces the number of active connections that the server needs to keep open.

2.3. DART streaming server and receiver

DARTSServer is the streaming server component that runs as a stand-alone application on dedicated nodes of an HPC system, and is responsible for asynchronously extracting data from applications running on the compute nodes, and streaming it to remote clusters or to local storage. Multiple instances of the DARTSServer can run on different nodes, and cooperate to service data transfer requests from the larger number of compute nodes.

DARTSServer provides multiple key functions. It registers the compute nodes running the DARTClient layer, or other instances of streaming servers, at application start-up, and assigns and exchanges communication parameters, e.g. identification numbers, communication offsets, and identifiers for communication memory descriptors. It receives notification posts that indicate and describe the data available from the compute nodes, schedules and manages the data extraction from an application running on the compute nodes, and transfers the data out to local storage or to remote clusters as requested in the notification posts. Data transfers have priorities based on the size of the data blocks and the frequency at which the application generates them. DARTSServer maintains two priority queues: a high-priority queue for notification requests for smaller or more frequent data blocks, and a low-priority queue for less frequent or larger data blocks. This priority mechanism prevents the transfers of smaller blocks from being blocked by the transfers of larger ones and enables them to proceed in parallel with the larger block transfers.

The primary performance goal of the DARTSServer is to minimize the data transfer latencies and maximize the transfer throughput. To achieve these goals, the streaming server pre-allocates a large cache of transfer memory buffers and associates it with a memory descriptor at the start-up of the server application. The transfer cache improves the performance of the transfers and keeps a constant memory footprint throughout the runtime of the server. The size of the transfer cache and the number of memory buffers that the server can allocate are influenced by multiple factors, such as (1) the maximum limit on the number of file descriptors that can be used by a process (a memory descriptor counts as an open file), (2) the memory available on the host node, and (3) the size of the working set (to avoid thrashing behaviors). For example, the implementation on the Cray XT3/XT4 at Oak Ridge National Laboratory (described in more detail in Section 3) allocates 512 memory buffers of 4 MB each, and exposes the cache using a single memory descriptor. In addition to the memory descriptor used for the transfer cache, the streaming server also pre-allocates and exposes memory buffers for communication and transfer signaling with the compute nodes and the other instances of streaming servers.
For the transfer signaling function, the streaming server uses two descriptors. The first descriptor accepts notification posts from compute nodes, which indicate that data is available, whereas the second descriptor is used to send acknowledgment notifications to compute nodes when an asynchronous transfer completes.
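To make the two signaling descriptors concrete, the sketch below shows, in C, one possible layout for a request-for-send (rfs) notification and for the rule that places it into the high- or low-priority queue described above. The field names, types, and the 4 MB threshold are illustrative assumptions; the paper does not specify DART's actual message format.

/* Illustrative sketch only: a hypothetical request-for-send (rfs)
   notification and a priority rule, not the actual DART wire format. */
#include <stdint.h>

struct rfs_msg {                  /* posted by DARTClient to the server */
    uint32_t node_id;             /* compute node identifier (nid/pid pair) */
    uint32_t stream_id;           /* data stream, e.g. the target file */
    uint64_t size;                /* number of bytes ready in the client buffer */
    uint64_t stream_offset;       /* offset in the common stream (cooperative mode) */
    uint32_t mem_desc_id;         /* stream-level memory descriptor to fetch from */
    uint32_t priority_hint;       /* e.g. how often the application emits this stream */
};

/* Hypothetical threshold: small or frequent blocks go to the high-priority
   queue so they are not starved by large block transfers. */
#define SMALL_BLOCK_LIMIT (4u * 1024 * 1024)

static int is_high_priority(const struct rfs_msg *m)
{
    return m->size <= SMALL_BLOCK_LIMIT || m->priority_hint > 0;
}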

Figure 2. Data flow at the DARTSServer streaming server.

DARTSServer is multi-threaded to allow asynchronous and parallel data streaming, e.g. fetching new data from compute nodes while streaming previous data to a remote cluster, as well as allowing the data transfer scheduling and management to progress simultaneously, as illustrated in Figure 2. The server uses multiple queues (1) to manage the incoming compute node transfer requests, (2) to schedule and prioritize the data streaming operations, (3) to track and manage the transfer cache memory buffers in the different states of a transfer, and (4) to reduce synchronization contention between server threads. The server reuses the buffers of the transfer cache and links them in different queues according to the status of a data transfer, e.g. the eqb queue contains empty memory buffers, the pqb queue contains memory buffers associated with in-progress data transfers, and the fqb queue contains full buffers that are ready to be streamed to remote nodes or saved on local storage.

The overall data flow associated with the operation of DARTSServer is illustrated in Figure 2. A data transfer begins with a request-for-send message generated by an application on the compute nodes and posted to the notification descriptor (1) of the server. The server extracts each incoming notification message as it arrives (2), then encodes and inserts it in a priority queue accordingly (3). Next, the server processes the priority queues (4), schedules the data transfers (5 and 6), and starts the asynchronous data extraction from the compute nodes (6 and 7). A different thread in the server monitors the asynchronous data completion (8), and schedules the data streaming or saving (9 and 10). Then, the server finishes the transfers (11, 12, and 13) and sends an acknowledgment notification message to the notification memory descriptor of the corresponding compute node.

2.4. DART API and usage

The DART client layer (DARTClient) provides I/O primitives that are very similar to standard Fortran file operations, i.e. dart_open(), dart_write(), and dart_close(). We chose the names and signatures of these operators to make the library easy to use by scientists and to facilitate incorporation into existing application codes. Additionally, DART provides functions to initialize and finalize the DART library, i.e. dart_init() and dart_finalize(). DART also provides an API for compute node coordination to enable cooperative operations and to interface with additional middleware.

DART operators are asynchronous. A typical usage of the DART API consists of a stream open dart_open() operation that returns a handle, one or more non-blocking dart_write() send operations on this handle, and a stream end dart_close() operation on the handle. A DART operation may be in one of three states: not started, in progress, or finished. In the case of the first two states, DART maintains stream state information, and when the application calls dart_finalize(), it ensures that all in-progress operations finish. As a result, a call to dart_finalize() can block and should be called at the end of the application so that it does not impact the simulation. Sample application code that uses DARTClient is presented below.

// Sample application code using DARTClient
dart_init();
// Application computations
dart_handle = dart_open(file_name, {read_mode|write_mode});
dart_write(dart_handle, variable_1, size_1);
dart_write(dart_handle, variable_2, size_2);
dart_close(dart_handle);
// Application computations
dart_finalize();
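Because dart_write() is non-blocking and only dart_finalize() may block, a typical pattern is to interleave the calls above with the simulation's compute loop so that each checkpoint transfer overlaps the following compute stage. The sketch below is a hypothetical C fragment built only from the dart_* calls shown above; the handle type, loop bounds, variable names, checkpoint interval, and file name are illustrative assumptions and are not taken from GTC or XGC-1.

/* Hypothetical usage sketch: periodic checkpointing with the DART API.
   Assumes a DART header (name not specified in the paper) declares the
   dart_* functions and the write_mode flag used in the sample above. */
#include <stddef.h>

#define NUM_STEPS            200   /* illustrative values */
#define CHECKPOINT_INTERVAL  20

void run_simulation_with_dart(double *particles, size_t particles_size,
                              double *fields, size_t fields_size)
{
    int dart_handle;               /* handle type assumed for illustration */
    int step;

    dart_init();
    for (step = 0; step < NUM_STEPS; step++) {
        /* ... application compute stage for this iteration ... */

        if (step % CHECKPOINT_INTERVAL == 0) {
            dart_handle = dart_open("restart.bin", write_mode);
            /* dart_write() is non-blocking: it posts a request-for-send
               notification and returns, so the next compute stage can
               overlap with the actual data extraction by the server. */
            dart_write(dart_handle, particles, particles_size);
            dart_write(dart_handle, fields, fields_size);
            dart_close(dart_handle);
        }
    }
    /* dart_finalize() may block until all in-progress transfers finish,
       so it is called once, at the very end of the run. */
    dart_finalize();
}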

3. IMPLEMENTATION OF DART ON THE CRAY XT3/XT4

3.1. System description

We implemented DART on Jaguar, a Cray XT3/XT4 at Oak Ridge National Laboratory (ORNL). Jaguar [5] has 11k nodes logically divided into compute and service nodes.

XT3 is the second most recent version of the machine, and is built from compute nodes with dual-core 2.6 GHz AMD Opteron CPUs and 4 GB of memory per node (2 GB per core), and runs Cray's UNICOS/lc operating system with the Catamount micro-kernel [6]. The operating system on the compute nodes is specially tuned for scientific computing. To meet the scientific computation performance requirements, the OS removed some features such as the sockets interface, the TCP and UDP transport protocols, shared memory support, and multi-threading support. These tunings impose additional constraints and challenges on the implementation of a fast and asynchronous I/O substrate. A service node has a 2.4 GHz single-core AMD Opteron CPU with 4 GB of memory and runs a full-fledged Linux kernel.

XT4 is the most recent version of the machine, and is built from compute nodes with quad-core 2.1 GHz AMD Opteron CPUs and 8 GB of memory per node, i.e. 2 GB per core, and runs the Compute Node Linux (CNL) operating system. CNL is a version of the Linux kernel specifically customized for scientific computing environments. In contrast with Catamount, CNL scales better to quad-cores and adds support for multi-threading. Nevertheless, CNL does not provide a sockets interface or higher-level transport protocols such as TCP or UDP. The service nodes have 2.6 GHz dual-core AMD Opterons with 8 GB of memory.

Jaguar's compute and service nodes are interconnected in a 3D torus topology by Cray SeaStar routers through the proprietary HyperTransport interconnection bus, which is able to deliver a transfer bandwidth of 6.4 GB/s. The Jaguar service nodes are connected to a remote cluster, i.e. Ewok, by a 5 GB/s aggregated link. The Ewok cluster has 128 nodes, each node with 2 Intel Xeon 3.4 GHz CPUs and 4 GB of memory. The nodes on the cluster are interconnected by an InfiniBand switch through a 1 GB/s link. In our implementation, DARTClient runs on Jaguar's compute nodes, DARTSServer runs on Jaguar's service nodes, and DARTReceiver runs on Ewok nodes.

3.2. Compute node bootstrap mechanism

On the Cray XT3/XT4 machine, the compute nodes communicate with the service nodes using Portals RDMA calls. Portals identifies each process in the system by a Portals address, which is a tuple composed of a unique numerical identifier for the node (compute or service) and for the process, i.e. nid and pid. Two processes in the system can communicate if they know each other's Portals addresses. To run an application on the system, a user submits a job and requests a number of compute nodes from the job allocation system. However, the specific nodes assigned to the application, and consequently their Portals addresses, are determined only at runtime. The user application and the streaming server are two independent applications that need to communicate at runtime over Portals using RDMA calls.

We defined a customized bootstrap mechanism, independent of the underlying execution mechanism (e.g. MPI or the batch queuing system), that allows the compute nodes to find the Portals addresses of the streaming servers and to publish their own addresses to the servers. It requires that the streaming servers start up and complete their initialization phase before the applications start, and relies on the availability of a shared file system among the nodes that run the streaming server instances. The applications that run on the compute nodes and the streaming servers are launched using the same job script file. The job script first starts the streaming servers. The first instance of the server writes its Portals address (i.e. the tuple {nid, pid}) to a persistent configuration file (Figure 3), which is then read by the job script and exported to its environment variables. The job script then starts the user applications on the compute nodes, which inherit the environment variables from the script. When an application initializes, DARTClient parses the corresponding environment variables and reads the Portals address of the streaming server. As part of the bootstrap process, each DART client has to register its Portals address with an instance of the DART servers. Each DART client obtains its Portals address after it initializes the Portals network and includes its identifiers, i.e. {nid, pid}, in the registration message that it exchanges with the servers.

This bootstrap mechanism is both flexible and efficient because (1) it allows the streaming servers to start and configure themselves before the application starts, and (2) the compute nodes do not access the file system, which may be a concern especially for jobs with a large number of nodes. To bootstrap multiple streaming servers running on different nodes, the first server instance to enter the initialization phase creates and locks the configuration file into which it writes its Portals address, thus becoming the master instance.
The other instances check the existence of the configuration file, read the Portals address of the master instance, then send and register their own Portals addresses with the master server.
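A minimal sketch of how this master election could be realized on the shared file system is shown below, assuming an exclusive-create (O_CREAT | O_EXCL) of the configuration file; the file path, the record format, and the use of O_EXCL instead of an explicit lock are assumptions for illustration, since the paper does not give the exact call DART uses.

/* Illustrative sketch of master election via the shared configuration file.
   The path, O_EXCL-based creation, and record layout are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if this server instance became the master, 0 otherwise. */
int try_become_master(const char *conf_path, int nid, int pid)
{
    /* O_CREAT | O_EXCL fails if the file already exists, so exactly one
       server instance on the shared file system wins the race. */
    int fd = open(conf_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return 0;                  /* another instance is already the master */

    char rec[64];
    int len = snprintf(rec, sizeof(rec), "%d %d\n", nid, pid);
    write(fd, rec, (size_t)len);   /* publish the master's Portals address */
    close(fd);
    return 1;
}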

Figure 3. DART compute node bootstrap mechanisms.

During the application initialization phase, a DART client contacts the master streaming server, which forwards the registration request to an instance of a streaming server via the bootstrap process (see the example in Figure 3). The designated streaming server instance registers the compute node, which will send all its requests to this instance of the server. In this process, the master server uniformly distributes the compute nodes across the streaming servers, so that each server serves approximately the same number of compute nodes. In the implementation of DART described in this paper, the initial assignment of compute nodes to streaming servers is preserved throughout the runtime of the applications. This is because the load on the servers is naturally balanced due to the single program multiple data (SPMD) nature of the applications. Cooperative load balancing between streaming servers is used in cases where this is not true (e.g. if local adaptivity is used).

3.3. Data transfer protocol

As described in Section 2, DARTClient is lightweight and implements the bare minimum functionality so as to minimize its impact on the application. Consequently, the DARTSServer implements all the logic associated with scheduling, data extraction, and streaming. To achieve this, it uses a two-step protocol. In the first step, a compute node posts a small rfs (request for send) notification message into the communication buffer of the associated streaming server. In the second step, the streaming server schedules the request, extracts the data directly from the memory of the compute node, and transfers the data to the destination specified in the rfs notification (i.e. a remote node or local storage). The streaming server processes multiple data notification requests from the same or different compute nodes in a round-robin fashion according to the priorities of the requests.

We illustrate the operation of DART on the Cray XT3/XT4 system using the timing diagrams presented in Figure 4. A usual run of a scientific application consists of sequences of computations (identified by c_t in Figure 4(a)), followed by data transfer requests (identified by w_t and rfs_t in the same figure). The data transfer protocol is asynchronous, and an application running on a compute node continues its computations after it posts an rfs notification (i.e. rfs_t), and does not wait for the actual transfer to complete.

Figure 4. Timing diagrams illustrating the operation of DARTClient and DARTSServer: (a) DARTClient layer (w_t = wait time, rfs_t = request-for-send time, c_t = compute time) and (b) DARTSServer streaming server.

Although the streaming server does not begin the data extraction immediately, the transfer can proceed in parallel with the computation. A compute node may block before sending a new rfs notification message (i.e. w_t) if the streaming server is busy and has not yet processed a previous request from the same compute node. The effective data transfer time at the application level is the sum of the w_t and rfs_t times.

In Figure 4(b), we present the timing diagram corresponding to the transfer of a data block from a compute node to a remote node on the streaming server side. In the figure, t_eb is the time the server spends to process a notification message from a compute node and to schedule the corresponding data extraction, t_pb is the time the server spends to extract the data from the compute node using the Portals interface, and t_fb is the time the server spends to transfer the data to a remote node using the TCP interface or to save it to local storage. The values of these parameters correspond to the processing times of the threads t_1 + t_2, t_3, and t_4 as described in Section 2. The total I/O time equals the sum of these three values plus some OS thread scheduling overheads.

4. EXPERIMENTAL EVALUATION

We evaluated the performance of DART using three sets of experiments. The first set evaluated the raw performance of DART using synthetic simulations. The synthetic simulations followed the same computational pattern as the real applications, e.g. a set of one or more computational stages followed by one I/O stage. These experiments evaluated the core parameters that influence the effective data transfer rates from compute to service nodes. These include (1) the size of the data unit fetched, i.e. the block size, (2) the number of memory descriptors allocated by the streaming server, (3) the I/O rate, i.e. the frequency at which the simulations produce data, and (4) the time required to stream the data further, i.e. the time to transfer data to its destination, such as a remote cluster or local storage.

The second set of experiments evaluated the performance of DART when we integrated it with real application codes, e.g. the GTC and XGC-1 fusion simulations, and analyzed the trade-offs that we can make to improve the performance of an I/O system. The key metrics used in these experiments were (1) the effective data transfer rate and (2) the overhead of using DART on the performance of an application.

The third set of experiments evaluated the impact of using compute nodes as an explicit staging area for data I/O. In these experiments, we ran multiple instances of the DARTSServer on a separate partition of the compute system, i.e. the staging nodes, which wrote application data to a local storage system.

Figure 5. Data transfer rate between compute and service nodes.

4.1. Evaluating raw performance

4.1.1. Data Transfer Rate

In this experiment, we evaluated the maximum application-level transfer rate that we can achieve between compute and service nodes using RDMA data transfers. We set up this experiment using two compute nodes and one service node. The streaming server running on the service node extracted data from the simulation running on the two compute nodes, and we measured the effective transfer rates. For the synthetic simulation used, we fixed the compute stage duration to 1 ms. However, in this case the compute stage duration, which determines the I/O rate, does not influence the transfer rates, because we measured only the transfer time and size at the streaming server, and not the I/O overhead. We varied the size of the data unit (i.e. the block size) from 1 to 1 MB in increments of 4 MB, and measured the achieved transfer rate for each value. We ran the synthetic simulation for 1 iterations, and performed 1 data transfers for each value of the block size. We plot the average transfer rate (over the 1 iterations) for each block size in Figure 5.

The plot shows that we can saturate the communication link between compute and service nodes using a minimum block size of 4 MB. (Note that the Y axis in the plot does not start at 0 and instead zooms in on the region between 1134 MB/s and 1152 MB/s.) The saturation value is achieved at the application level and represents the maximum remote memory-to-memory data transfer rate that the system can effectively sustain in the best case.

We used the gettimeofday() system call to measure the transfer times, and this includes the additional time for system operations, such as OS process scheduling, etc., which explains the variability in the measurements.

4.1.2. Overlapping Computations with Data Transfers

In this experiment, we evaluated the influence of the compute stage duration, or the I/O rate, on the efficiency of overlapping computations with transfers to hide the latency of the data transfers. We used 128 compute nodes to run the synthetic simulation and one service node to run the instance of the streaming server. Further, the streaming server transferred the data to a node on a remote cluster, i.e. Ewok. We varied the duration of the compute stage for the synthetic simulation to control the frequency of the I/O operations. The I/O rate for the synthetic simulation in these tests was selected to demonstrate the effect on the application overhead, and on the ability to overlap I/O with computations.

At the beginning of an I/O stage, all the nodes of the synthetic simulation post their rfs notification messages at the same time, but the streaming server can only process these notifications sequentially and in a round-robin fashion. To transfer the data from consecutive I/O stages efficiently, the streaming server requires a minimum amount of time in which it can fetch and stream the data produced by the simulation. It can process the (i+1)-th notification request from a compute node only after it finishes processing all the i-th requests from the other compute nodes. We derived the formula for the required time as t_fb * N + t_pb, where t_fb is the time spent to transfer the content of a memory buffer to a remote cluster, i.e. Ewok, or to local storage, N is the number of compute nodes used by the simulation, and t_pb is the additional time spent to process and fetch the data from a compute node's memory. The value of t_pb is typically negligible compared to t_fb * N, because streaming data to a remote location is the most expensive operation. Our goal was to find a good balance between the duration of a compute stage and the size of the data streamed, to better overlap computations with data transfers. In each of the tests, we measured the I/O, the wait, and the rfs times at the simulation level on the compute nodes, and the total I/O time for an operation (from the rfs notification to streaming completion) at the streaming server.

In the first test we fixed the duration of the compute stage to 2 s. We chose this value based on previous experiments to demonstrate the effects of poor overlapping. The results from this test are plotted in Figure 6. In these plots, the x axis is the numerical identifier of each compute node and the y axis is the cumulative I/O time for 1 simulation iterations. The I/O time on the compute nodes equals the sum of the rfs and wait times. The values of the wait and I/O times in Figure 6(a) are almost identical, and this indicates that the I/O time is dominated by the wait time. A large wait time value indicates that a notification operation blocks, and waits for a previous operation to finish. In this case, the compute stage duration was small, and the frequency of I/O stages was higher than the streaming server's ability to transfer the fetched data block to the remote location (i.e. the Ewok cluster).
As a result, the streaming server delayed the processing of the (i+1)-th rfs notifications from the compute nodes, and this causes the large I/O time in Figure 6(b). The average I/O time difference for a block transfer, as measured at the streaming server and at the compute nodes, was 34 ms, and represents the block streaming time to the remote cluster.

Figure 6. Overhead of DART on the application for a compute phase of 2 s. Cumulative I/O time: (a) at the compute nodes and (b) at the streaming server.

Figure 7. Overhead of DART on the application for a compute phase of 4.3 s. Cumulative I/O time: (a) at the compute nodes and (b) at the streaming server.

The results of this test show that the overlap between computations and the overhead of data extraction is not optimal, as the wait time is very high on the compute nodes. In the second test, we fixed the compute stage duration to 4.3 s, i.e. t_fb = 34 ms and N = 128, based on the minimum time formula t_fb * N + t_pb established above and the results from the previous test, and neglected the t_pb term. The rest of the setup and parameters remained the same. The results of the test are plotted in Figure 7. The wait time value in this case is very small (5 ms), indicating that the (i+1)-th fetch operation from a compute node could proceed almost immediately, without waiting for the i-th transfer to finish first. These results demonstrate an efficient overlap of simulation computations with data transfers, resulting in a very low I/O overhead on the simulation.
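As a quick check, substituting the values measured above (t_fb = 34 ms, N = 128) into the minimum time formula and neglecting t_pb gives

t_fb * N + t_pb ≈ 0.034 s × 128 ≈ 4.35 s,

which is consistent with the roughly 4.3 s compute stage used in this second test, and explains why the 2 s stage in the first test could not hide the streaming cost.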

4.2. Evaluating application performance

In the second set of experiments, we measured the additional overhead that DART operations impose on a real application running on the compute nodes, and expressed the overhead as a percentage of the time the simulation spends in its compute stages. In these experiments, we integrated DART with real scientific application codes and analyzed its performance using multiple criteria. First, we analyzed the throughput of the streaming server on both the Portals (compute nodes to streaming server) and TCP (streaming server to remote node) interfaces; second, we analyzed the I/O overhead on the simulation; and third, we analyzed the scalability of DART.

4.2.1. Integration with GTC

In this set of experiments, we used the GTC simulation. GTC is a highly accurate fusion modeling code that performs a highly complex analysis of the nonlinear dynamics of plasma turbulence using numerical simulations, and requires very large-scale computing capabilities. We combined DART with the GTC code and varied simulation parameters in the runs described below. The key GTC parameters that influence the size of the data generated and the I/O rate are micell and mecell, which determine the size of the global particle domain, npartdom, which determines the number of particle sub-domains, and msnap, which determines the I/O rate, i.e. the number of restart files produced and the interval between them.

To test the scalability of DART, we ran the GTC simulation on 1024 and 2048 compute nodes. For the test on 1024 compute nodes, we set the micell and mecell parameters to 16, npartdom to 16, and msnap to 13. With these parameter values, each compute node in the GTC application generated a checkpoint file of 4 MB every 13 simulation iterations, each simulation iteration took 8 s, and the simulation ran for 2 iterations. Proportionately, for the test on 2048 compute nodes, we set the micell and mecell parameters to 16, npartdom to 32, and msnap to 26, and ran the GTC code for 4 iterations. Using these parameters, each compute node of the GTC application generated a checkpoint file of 4 MB every 26 simulation iterations, and each iteration took 8 s. For these two GTC experiments, we ran the streaming server on a dedicated service node (i.e. no other users or I/O services were running on the same node). To analyze the performance of DART, we measured the wait and I/O times at the application level for each of the compute nodes, and the throughput obtained on both the Portals and TCP interfaces at the streaming server. During these experiments, the remote receiver did not perform any processing on the data, but received and discarded it.

We expressed the DART I/O overhead on the GTC application due to the data extraction and streaming as a percentage of the time spent in the compute stages, and plotted the results in Figure 8. The x axis in these plots is a numerical identifier for each compute node, and the y axis is the overhead measured at each node of the simulation. In Figure 8(a), the average wait time value across the 1024 nodes is <0.15%, and this indicates that DART is able to efficiently overlap computations with data extraction and transport. Further, the average value of the I/O time is <0.65%, i.e. the overhead on the simulation is very small. In Figure 8(b), the average I/O time value across the 2048 nodes is still small at <0.4%.
However, in this case the difference between the I/O and wait times is much smaller than in the 1024-node case, indicating that the streaming server is reaching the limit on the maximum number of compute nodes it can service simultaneously.

Figure 8. Percentage of I/O and wait time overhead on the GTC simulation measured at the compute nodes. I/O overhead on the GTC code: (a) on 1024 nodes and (b) on 2048 nodes.

Figure 9. Cumulative I/O times for the GTC simulation measured at the DART streaming server. Cumulative I/O time at the streaming server servicing: (a) 1024 compute nodes and (b) 2048 compute nodes.

The average overhead value for the test on 2048 nodes is smaller than the value for the test on 1024 nodes because the GTC application generated the same amount of checkpoint data per compute node, but the length of the computing stage was doubled.

Figure 9 shows the cumulative I/O time at the streaming server (1) to extract the checkpoint data from the application using the Portals interface and (2) to transport the data to a remote node using the TCP interface, for each compute node over the runtime of the application. In these plots, the x axis is a numerical identifier for each compute node and the y axis is the cumulative I/O time. As we expected, the transfer time for the Portals interface is smaller than for the TCP interface. The corresponding transfer rate values (Figure 10(a), (b), both Portals and TCP) are similar for the two cases (1024 and 2048 compute nodes), because the buffers that the streaming server allocated for communication are filled to capacity throughout the tests.

Figure 10. Throughput at the DART server using Portals (compute nodes to service node) and TCP (service node to remote node). DART streaming server servicing: (a) 1024 compute nodes and (b) 2048 compute nodes.

We benchmarked the link throughput between the streaming server and the remote receiver at the application level using the TCP transport protocol, and measured a 5.1 Gbps peak value. In Figure 10, we present the throughput achieved by DART at the streaming server on both the Portals and TCP interfaces. In this plot, the x axis is the runtime of the simulation, and the y axis is the throughput as a percentage of the peak value. In both tests, i.e. on 1024 and 2048 nodes, we achieved an average sustained throughput on the TCP interface of 60% of the peak TCP value. This is because the streaming server spends some time on Portals RDMA transfers and on prioritizing and scheduling the extract operations. The average throughput on the Portals interface is comparable to the value on the TCP interface, i.e. 60%, which we expected since the communication buffers at the streaming server are fully utilized and become a bottleneck. The TCP transport is slower than Portals, but as the two transports share the same communication buffers at the streaming server, the Portals throughput adapts to the slower link.

To summarize, in the experiment on 1024 compute nodes, the simulation ran for 15 s and each node generated 55 MB of data, resulting in a total of 555 GB across the system. The overhead on the simulation was less than 0.7%. In the experiment on 2048 compute nodes, the simulation ran for 3 s, with each node generating 55 MB of data for a total of 1.2 TB across the system, and the overhead on the simulation was 0.4%.

4.2.2. Integration with XGC-1

We have also integrated DART with the XGC-1 [4] simulation. XGC-1 is an edge turbulence particle-in-cell application that uses numerical models to simulate the 2D density, temperature, bootstrap current, and viscosity profiles in accordance with neoclassical turbulent transport. In our experiments with XGC-1, we used 1024 compute nodes and one instance of the streaming server. We set up the XGC-1 simulation to run for 1 iterations, and to produce a checkpoint file every 2 iterations.

Figure 11. Application execution flow using staging nodes.

One application iteration took 3 s to complete, and the size of a checkpoint file is 76 MB per compute node for a simulation with 1 million particles. The behavior and performance of DART observed in these experiments were similar to those described above for GTC. The XGC-1 simulation ran for 32 s, and each node generated 76 MB of data, resulting in a total of 38 GB of data across the system. The overhead on the simulation due to DART operations was <0.3% of the simulation compute time. An average sustained throughput of 60% of the peak TCP throughput was achieved between the streaming server and the remote node.

4.3. Evaluating the impact of using staging nodes

In the experiments presented so far, we used a single instance of the DARTSServer, which ran on either a dedicated or a shared service node, and streamed the data to a remote DARTReceiver. The scalability and the efficiency of this approach were limited by the capabilities of the service node. A more flexible approach is to run multiple streaming server instances in cooperative mode. However, the number of dedicated service nodes that are publicly available is small, and they are typically shared between various user jobs, e.g. application compilation and data archiving, which can produce heavy loads on the system and thus can negatively impact the performance of the streaming server. To alleviate these problems, we migrated the instances of the DARTSServer to staging nodes, which were essentially a partition of the compute nodes exclusively running the streaming server. These DARTSServer instances running on staging nodes can essentially be viewed as a smaller job in the system, which exists in addition to the application. The application data in this case was written to a local storage system.

The new setup is illustrated in Figure 11. In this example, compute nodes 1-4 run the application, and staging nodes 1 and 2 run the streaming server. DART allows compute nodes associated with different server instances to write their data to the same file, and it ensures transport protocol coherency. To preserve data coherency, DARTClient on the compute nodes coordinates among the nodes writing to the same file and determines a distinct offset for each node to write at, based on the data size at each node. In the example, the numbers at the top of the figure represent a possible sequence of notification posts to the streaming server, and the numbers at the bottom represent a possible sequence for


More information

Reduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection

Reduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Philip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory

Philip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory Philip C. Roth Computer Science and Mathematics Division Oak Ridge National Laboratory A Tree-Based Overlay Network (TBON) like MRNet provides scalable infrastructure for tools and applications MRNet's

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Ref: Chap 12. Secondary Storage and I/O Systems. Applied Operating System Concepts 12.1

Ref: Chap 12. Secondary Storage and I/O Systems. Applied Operating System Concepts 12.1 Ref: Chap 12 Secondary Storage and I/O Systems Applied Operating System Concepts 12.1 Part 1 - Secondary Storage Secondary storage typically: is anything that is outside of primary memory does not permit

More information

Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Swen Böhm 1,2, Christian Engelmann 2, and Stephen L. Scott 2 1 Department of Computer

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Extreme I/O Scaling with HDF5

Extreme I/O Scaling with HDF5 Extreme I/O Scaling with HDF5 Quincey Koziol Director of Core Software Development and HPC The HDF Group koziol@hdfgroup.org July 15, 2012 XSEDE 12 - Extreme Scaling Workshop 1 Outline Brief overview of

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems DM510-14 Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance 13.2 Objectives

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

RDMA-like VirtIO Network Device for Palacios Virtual Machines

RDMA-like VirtIO Network Device for Palacios Virtual Machines RDMA-like VirtIO Network Device for Palacios Virtual Machines Kevin Pedretti UNM ID: 101511969 CS-591 Special Topics in Virtualization May 10, 2012 Abstract This project developed an RDMA-like VirtIO network

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL Fourth Workshop on Ultrascale Visualization 10/28/2009 Scott A. Klasky klasky@ornl.gov Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL H. Abbasi, J. Cummings, C. Docan,, S. Ethier, A Kahn,

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Computer Systems Assignment 4: Scheduling and I/O

Computer Systems Assignment 4: Scheduling and I/O Autumn Term 018 Distributed Computing Computer Systems Assignment : Scheduling and I/O Assigned on: October 19, 018 1 Scheduling The following table describes tasks to be scheduled. The table contains

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition Chapter 13: I/O Systems Silberschatz, Galvin and Gagne 2013 Chapter 13: I/O Systems Overview I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations

More information

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State

More information

Noise Injection Techniques to Expose Subtle and Unintended Message Races

Noise Injection Techniques to Expose Subtle and Unintended Message Races Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau

More information

CS370 Operating Systems Midterm Review

CS370 Operating Systems Midterm Review CS370 Operating Systems Midterm Review Yashwant K Malaiya Fall 2015 Slides based on Text by Silberschatz, Galvin, Gagne 1 1 What is an Operating System? An OS is a program that acts an intermediary between

More information

PreDatA - Preparatory Data Analytics on Peta-Scale Machines

PreDatA - Preparatory Data Analytics on Peta-Scale Machines PreDatA - Preparatory Data Analytics on Peta-Scale Machines Fang Zheng 1, Hasan Abbasi 1, Ciprian Docan 2, Jay Lofstead 1, Scott Klasky 3, Qing Liu 3, Manish Parashar 2, Norbert Podhorszki 3, Karsten Schwan

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Streams Performance 13.2 Silberschatz, Galvin

More information

Multifunction Networking Adapters

Multifunction Networking Adapters Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained

More information

Device-Functionality Progression

Device-Functionality Progression Chapter 12: I/O Systems I/O Hardware I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Incredible variety of I/O devices Common concepts Port

More information

Chapter 12: I/O Systems. I/O Hardware

Chapter 12: I/O Systems. I/O Hardware Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations I/O Hardware Incredible variety of I/O devices Common concepts Port

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access File File System Implementation Operating Systems Hebrew University Spring 2009 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Performance Enhancement for IPsec Processing on Multi-Core Systems

Performance Enhancement for IPsec Processing on Multi-Core Systems Performance Enhancement for IPsec Processing on Multi-Core Systems Sandeep Malik Freescale Semiconductor India Pvt. Ltd IDC Noida, India Ravi Malhotra Freescale Semiconductor India Pvt. Ltd IDC Noida,

More information

Operating System Review Part

Operating System Review Part Operating System Review Part CMSC 602 Operating Systems Ju Wang, 2003 Fall Virginia Commonwealth University Review Outline Definition Memory Management Objective Paging Scheme Virtual Memory System and

More information

REMEM: REmote MEMory as Checkpointing Storage

REMEM: REmote MEMory as Checkpointing Storage REMEM: REmote MEMory as Checkpointing Storage Hui Jin Illinois Institute of Technology Xian-He Sun Illinois Institute of Technology Yong Chen Oak Ridge National Laboratory Tao Ke Illinois Institute of

More information

Chapter 12: I/O Systems

Chapter 12: I/O Systems Chapter 12: I/O Systems Chapter 12: I/O Systems I/O Hardware! Application I/O Interface! Kernel I/O Subsystem! Transforming I/O Requests to Hardware Operations! STREAMS! Performance! Silberschatz, Galvin

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance Silberschatz, Galvin and

More information

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition Chapter 12: I/O Systems Silberschatz, Galvin and Gagne 2011 Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

The Red Storm System: Architecture, System Update and Performance Analysis

The Red Storm System: Architecture, System Update and Performance Analysis The Red Storm System: Architecture, System Update and Performance Analysis Douglas Doerfler, Jim Tomkins Sandia National Laboratories Center for Computation, Computers, Information and Mathematics LACSI

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World

More information

Cluster Network Products

Cluster Network Products Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Experiments with Wide Area Data Coupling Using the Seine Coupling Framework

Experiments with Wide Area Data Coupling Using the Seine Coupling Framework Experiments with Wide Area Data Coupling Using the Seine Coupling Framework Li Zhang 1, Manish Parashar 1, and Scott Klasky 2 1 TASSL, Rutgers University, 94 Brett Rd. Piscataway, NJ 08854, USA 2 Oak Ridge

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

IBM Spectrum Scale IO performance

IBM Spectrum Scale IO performance IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

High-performance message striping over reliable transport protocols

High-performance message striping over reliable transport protocols J Supercomput (2006) 38:261 278 DOI 10.1007/s11227-006-8443-6 High-performance message striping over reliable transport protocols Nader Mohamed Jameela Al-Jaroodi Hong Jiang David Swanson C Science + Business

More information

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD. OceanStor 9000 Issue V1.01 Date 2014-03-29 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in

More information

Oak Ridge National Laboratory Computing and Computational Sciences

Oak Ridge National Laboratory Computing and Computational Sciences Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Parallel Computer Architecture II

Parallel Computer Architecture II Parallel Computer Architecture II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-692 Heidelberg phone: 622/54-8264 email: Stefan.Lang@iwr.uni-heidelberg.de

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to A Sleep-based Our Akram Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) May 1, 2011 Our 1 2 Our 3 4 5 6 Our Efficiency in Back-end Processing Efficiency in back-end

More information

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran

More information

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb.

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb. Messaging App IB Verbs NTRDMA dmaengine.h ntb.h DMA DMA DMA NTRDMA v0.1 An Open Source Driver for PCIe and DMA Allen Hubbe at Linux Piter 2015 1 INTRODUCTION Allen Hubbe Senior Software Engineer EMC Corporation

More information

The Tofu Interconnect 2

The Tofu Interconnect 2 The Tofu Interconnect 2 Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited Introduction Tofu interconnect

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

The way toward peta-flops

The way toward peta-flops The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe Where things started from DESIGN CONCEPTS 2 New challenges and requirements! Optimal sustained flops

More information

The Role of InfiniBand Technologies in High Performance Computing. 1 Managed by UT-Battelle for the Department of Energy

The Role of InfiniBand Technologies in High Performance Computing. 1 Managed by UT-Battelle for the Department of Energy The Role of InfiniBand Technologies in High Performance Computing 1 Managed by UT-Battelle Contributors Gil Bloch Noam Bloch Hillel Chapman Manjunath Gorentla- Venkata Richard Graham Michael Kagan Vasily

More information

Module 12: I/O Systems

Module 12: I/O Systems Module 12: I/O Systems I/O hardwared Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Performance 12.1 I/O Hardware Incredible variety of I/O devices Common

More information

The Common Communication Interface (CCI)

The Common Communication Interface (CCI) The Common Communication Interface (CCI) Presented by: Galen Shipman Technology Integration Lead Oak Ridge National Laboratory Collaborators: Scott Atchley, George Bosilca, Peter Braam, David Dillow, Patrick

More information

Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT

Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT Paul Hargrove Dan Bonachea, Michael Welcome, Katherine Yelick UPC Review. July 22, 2009. What is GASNet?

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory

Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory Comparison of Scheduling Policies and Workloads on the NCCS and NICS XT4 Systems at Oak Ridge National Laboratory Troy Baer HPC System Administrator National Institute for Computational Sciences University

More information