Network Interface Active Messages for Low Overhead Communication on SMP PC Clusters

Motohiko Matsuda, Yoshio Tanaka, Kazuto Kubota and Mitsuhisa Sato
Real World Computing Partnership
Tsukuba Mitsui Building 16F, Takezono, Tsukuba, Ibaraki, Japan

Abstract. NICAM is a communication layer for SMP PC clusters connected via Myrinet, designed to reduce overhead and latency by directly utilizing the micro-processor on the network interface. It adopts remote memory operations to eliminate much of the overhead found in message passing. NICAM employs an Active Messages framework for flexibility in programming the network interface, and this flexibility compensates for the large latency resulting from the relatively slow micro-processor. Running message handlers directly on the network interface reduces overhead by freeing the main processors from polling for incoming messages. The handlers also make synchronization faster by avoiding costly interactions between the main processors and the network interface. In addition, this implementation can completely hide the latency of barriers in data-parallel programs, because handlers running in the background of the main processors allow barriers to be repositioned to places where their latency is not critical.

1 Introduction

Symmetric multiprocessor PCs (SMP PCs) have recently attracted widespread attention, and clusters of SMP PCs with fast networks have emerged as important platforms for high performance computing. The bus bandwidth, however, often limits computation performance: SMP PCs quickly reveal a bottleneck when multiple processors access the bus simultaneously. Even worse, the network interfaces used for clustering further burden the bus. We therefore designed a communication layer, NICAM, which reduces communication overhead by utilizing the micro-processor on the network interface. Overhead reduction is important because overhead, that is, the involvement of the main processors in communication, wastes bus bandwidth as well as the processing power of the processors. In addition, the common technique of overlapping computation and communication tends to make the communication grain-size finer, which results in a larger total cost of overhead. As researchers have shown [1], while latency reduction in data transfer is not very relevant to performance, overhead reduction directly affects the utilization of processing power. Thus, overhead reduction by the network interface is fruitful even though it incurs larger latency due to the relatively slow micro-processor on the network interface.

NICAM also reduces latency in synchronization primitives. While latency is not the first issue in data transfer, latency is the only issue in synchronization. Direct handling of messages by the network interface not only frees the processors from polling overhead, but also eliminates the costly interaction between the processors and the network interface.

NICAM employs an Active Messages framework [2] for flexibility in programming the network interface. Active Messages provide extensibility through simple additions of new handlers, because handlers run almost independently of one another. This flexibility allows data transfer primitives to be combined with synchronization, which compensates for the large latency.

This paper reports on the design of NICAM and its basic performance. We present our PC cluster platform in Section 2 and the NICAM primitives in Section 3. Then, we present a technique of latency hiding and two sets of experimental results in Section 4. We briefly discuss related work in Section 5 and conclude in Section 6.

2 Background

2.1 SMP PC Cluster Platform

Our research platform, COMPaS [3], is a PC-based SMP cluster consisting of eight server-type PCs. Each node contains four Pentium Pros (200 MHz, 512 KB L2 cache, 450GX chip-set, 256 MB main memory) and a Myrinet network interface card. The nodes are connected by a single Myrinet switch. The operating system is Solaris 2.5. This section presents the performance characteristics of a node which guided the design of the communication layer.

Table 1. Aggregate bandwidth of the memory bus on an SMP PC: read, write, and copy bandwidth (MB/s) for varying numbers of threads.

Table 1 shows the memory bus bandwidth for read, write, and copy when multiple threads execute the operations simultaneously. Each figure shows the aggregate bandwidth, that is, the sum of the bandwidth measured on each processor. Notice that the aggregate bandwidth is almost independent of the number of threads. This means that a single processor can consume all the bandwidth available on the memory bus for these simple operations.

Table 2. Barrier synchronization time (μsec) for threads in a node, on the SMP PC and on a Sun SMP.

Table 2 shows barrier performance among the threads in a node, to compare cache coherency performance. The column Sun SMP shows the results on a Sun Enterprise 4000 for comparison. The algorithm of this barrier uses write operations to a location dedicated to each processor, and detects the condition that all locations have been written. A single processor checks the condition and notifies the others. This algorithm does not need any atomic operations and is regarded as reflecting the cache coherency performance. The result shows that the SMP PCs have a good cache coherency implementation, comparable to a more sophisticated SMP machine.
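For concreteness, the following C sketch illustrates the flag-based intra-node barrier described above. The thread count, variable names, and spin loops are our own illustration rather than the code actually measured, and compiler and memory fences are omitted for brevity.

```c
#define NPROC 4

static volatile int arrived[NPROC];   /* one dedicated location per processor        */
static volatile int generation;       /* advanced by processor 0 to release the rest */

void node_barrier(int my_id)
{
    int gen = generation;
    arrived[my_id] = 1;                    /* announce arrival with a plain write      */
    if (my_id == 0) {
        for (int i = 1; i < NPROC; i++)    /* detect that every location is written    */
            while (!arrived[i])
                ;
        for (int i = 0; i < NPROC; i++)
            arrived[i] = 0;                /* reset for the next barrier               */
        generation = gen + 1;              /* notify the other processors              */
    } else {
        while (generation == gen)          /* wait for the notification                */
            ;
    }
}
```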

2.2 Myrinet

Myrinet is a Gigabit LAN system from Myricom Inc. [4]. It consists of communication links, switches, and host interfaces. The communication link is a bi-directional 8-bit data path whose speed is 160 MB/s in each direction. The switch is an 8-by-8 crossbar and uses cut-through routing. Switches can be cascaded in an arbitrary topology, and the route is statically specified in the header of a packet. The host interface is implemented as an I/O adaptor and performs data transfers from/to the host memory system using DMA (Direct Memory Access). Each Myrinet board contains three DMA engines, dedicated to the host memory, the transmitter, and the receiver, respectively, and a micro-processor to control these DMA engines. The on-board SRAM is used both to buffer messages and to store the program of the micro-processor. The following features guided the design:

- On-board micro-processor: the micro-processor is a 32-bit custom RISC CPU core with a general purpose instruction set.
- Reliable link hardware: links are reliable enough to be assumed error-free.
- Unrestricted DMA capability: the DMA engine is capable of accessing the whole physical memory of the host PC.
- Large SRAM size: the SRAM size is up to 1 MB, depending on the board type.

3 NICAM Primitives

3.1 Communication Layer Design

NICAM provides primitives for remote memory operations and synchronization. We based the communication primitives on remote memory operations because they are preferable to message passing with respect to overhead. Message passing suffers from flow-control and buffer-management tasks for handling incoming messages, and sometimes requires copying messages, which sacrifices bus bandwidth. In addition, message passing may need mutual exclusion to coordinate the processors in an SMP node.

In NICAM, all events to the main processors, such as the completion of a data transfer or a barrier, are notified via a flag in main memory, because this greatly reduces the overhead on the main processor. Some primitives take a flag argument which points to a memory location. The flag is set by the network interface when a condition is satisfied.
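As a minimal illustration of this flag convention (the flag type, the busy-wait loop, and the helper name below are our own assumptions, not part of the NICAM interface beyond what Figure 1 lists):

```c
/* Hypothetical sketch: wait until the network interface sets a completion
 * flag in main memory. The main processor polls only its own memory and
 * never interacts with the interface while waiting. */
typedef volatile int nicam_flag_t;

static void wait_flag(nicam_flag_t *flag)
{
    while (*flag == 0)
        ;            /* spin until the interface writes the flag */
    *flag = 0;       /* reset so the flag can be reused */
}
```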

Initialization
    nicam_init()
    nicam_lock_memory(addr, range)
Simple Data Transfer
    nicam_bcopy(src_node, src_addr, dst_node, dst_addr, size)
    nicam_sync(flag_addr)
    nicam_write1(dst_node, flag_addr, val)
Data Transfer with Synchronization
    nicam_bcopy_notify(src_node, src_addr, dst_node, dst_addr, size, flag_addr, onoff)
    nicam_set_counter(flag_addr, count)
    nicam_bcopy_countup(src_node, src_addr, dst_node, dst_addr, size)
Broadcast
    nicam_bcast(src_node, src_addr, dst_addr, size, flag_addr, onoff)
    nicam_bcast_discard(onoff)
Barrier Synchronization
    nicam_barrier(flag_addr)
Message Passing Support
    nicam_bcopy_src(key, dst_node, src_addr, size, flag_addr, onoff)
    nicam_bcopy_dst(key, src_node, dst_addr, size, flag_addr, onoff)

Fig. 1. NICAM primitives.

NICAM is designed for a single job and makes exclusive use of the resources for communication. This is not a problem in practice, since we assume a dedicated environment for a parallel job in order to investigate the utilization of the resources in a cluster. In addition, NICAM requires that the remotely accessed regions of memory be pinned down in advance of remote memory operations. The pin-down operation protects the region of memory from the paging system in a virtual memory environment. Some other systems take alternative approaches and use only limited pinned-down areas [5].

3.2 Remote Memory Operations

Figure 1 lists the set of primitive operations supported by NICAM. Remote memory operations have an interface similar to the local copy (bcopy) operation, but take additional arguments to specify the source and destination nodes. If the source is specified as a node other than the local one, the operation acts as a remote read; an Active Messages mechanism forwards the request to the specified source node. nicam_sync is used by the invoking node to learn of the completion of the currently issued copies.

A variant of the copy operation, nicam_bcopy_notify, provides a data transfer primitive combined with point-to-point synchronization. It notifies the destination node of the completion of a copy by setting a flag. This is useful for programs written in a message passing style. While NICAM does not directly support message passing, many message passing programs can be rewritten using this primitive. It is also useful as an optimization to reduce the number of synchronizations in data-parallel programs, where it is assumed that completions of data transfers are notified to the destination [6].
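As an illustration of how a message-passing style exchange can be rewritten with these primitives, the sketch below lets node 0 push a buffer into node 1's memory and signal arrival through a flag. It is only a sketch: the header name nicam.h, the argument types, the SPMD assumption that both nodes lay out their buffers at the same virtual addresses, and the meaning of the final onoff argument (taken here to enable the notification) are our assumptions, not part of the published interface beyond Figure 1.

```c
#include "nicam.h"             /* hypothetical header declaring the Fig. 1 primitives */

#define SRC_NODE 0
#define DST_NODE 1

static double send_buf[1024];          /* filled on node 0                      */
static double recv_buf[1024];          /* written remotely, read on node 1      */
static volatile int arrived;           /* set by node 1's network interface     */

void exchange(int my_node)
{
    nicam_init();
    /* Remotely accessed regions must be pinned down in advance. */
    nicam_lock_memory(send_buf, sizeof(send_buf));
    nicam_lock_memory(recv_buf, sizeof(recv_buf));
    nicam_lock_memory((void *)&arrived, sizeof(arrived));

    if (my_node == SRC_NODE) {
        /* "Send": copy into node 1's recv_buf; node 1's interface sets the
         * flag at &arrived (same virtual address on node 1) on completion. */
        nicam_bcopy_notify(SRC_NODE, send_buf, DST_NODE, recv_buf,
                           sizeof(send_buf), (void *)&arrived, 1);
    } else if (my_node == DST_NODE) {
        /* "Receive": the main processor only polls a flag in its own memory. */
        while (!arrived)
            ;
        arrived = 0;
    }
}
```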

Another variant, nicam_bcopy_countup, provides a counted completion. Sometimes synchronization points are determined by the number of data items exchanged. For example, in all-to-all communication, the synchronization point is reached when the count reaches the number of nodes involved. This completely avoids explicit synchronizations. nicam_set_counter specifies the count and the flag address used to signal completion.

3.3 Barrier Synchronization

The barrier primitive also signals completion by setting a flag in memory. This not only lowers the notification overhead, but also makes it a fuzzy barrier [7]. In addition, its implementation uses a relaxed algorithm which allows concurrent progress of multiple instances of barriers. That is, barriers can be invoked multiple times before the completion of a previously issued one. This feature tolerates latency and is useful when the environment has a large variance of loads, or when the scale of the cluster is large.

Fig. 2. Steps in a multi-stage barrier performed between the network interfaces (NI0-NI3). The main processor issues a barrier request; each stage performs a 2-way join between interfaces, and the final stage updates a flag in main memory.

Figure 2 shows the steps in a barrier execution. The barrier is implemented with a multi-stage, log(P)-step algorithm; each stage performs a 2-way join and forwards a message to the next stage. At the i-th stage, node n performs a join with its partner at distance 2^i in the node numbering. Small queues are placed at the joins to accommodate multiple instances of barriers. Avoiding the involvement of the main processors is beneficial here, because the algorithm needs multiple stages. Relaxing barriers would be worthless if their execution required polling by the processors.
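The fragment below sketches how the counted-completion and barrier primitives might be used for an all-to-all exchange followed by a split-phase barrier. The data layout, the assumption that the counter is armed on the receiving node before any copy arrives, the flag handling, and the header name nicam.h are ours; only the primitive names come from Figure 1.

```c
#include "nicam.h"                     /* hypothetical header for the Fig. 1 primitives */

#define NNODES 8
#define CHUNK  256

static double out[NNODES][CHUNK];      /* out[n] is destined for node n              */
static double in[NNODES][CHUNK];       /* in[p] receives the chunk sent by node p    */
static volatile int all_arrived;       /* set when NNODES incoming copies have landed */
static volatile int barrier_done;      /* set when the barrier completes              */

void all_to_all_then_barrier(int me)
{
    /* Arm the counter: the local flag is set once NNODES copies
     * (one per peer, including the self-copy) have been counted. */
    nicam_set_counter((void *)&all_arrived, NNODES);

    /* Scatter our chunks; no explicit synchronization is issued. */
    for (int n = 0; n < NNODES; n++)
        nicam_bcopy_countup(me, out[n], n, in[me], sizeof(out[n]));

    while (!all_arrived)               /* completion is known from the count alone */
        ;
    all_arrived = 0;

    /* Split-phase (fuzzy) barrier: issue it, keep computing, check later.
     * The joins proceed on the network interfaces in the background. */
    nicam_barrier((void *)&barrier_done);
    /* ... latency-tolerant local work can be placed here ... */
    while (!barrier_done)
        ;
    barrier_done = 0;
}
```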

3.4 Implementation

Incoming requests of Active Messages are handled by simple polling on the network interface. There are three sources of requests: the main processors and the two DMA engines of the receiver and the host memory, respectively. Active Messages are also used for local requests from the main processors, which simplifies the implementation by handling local and remote requests in the same way.

Hand-shaking is used between the main processors and the network interface to manage requests. A main processor makes a request by writing a flag in the SRAM on the network interface, and the network interface acknowledges it by writing a flag in main memory. In order to avoid mutual exclusion among the main processors, there are multiple pairs of flags, one for each processor.

NICAM passes only virtual addresses between nodes; they are translated to physical addresses by the network interface, which maintains its own copy of the address translation table in the SRAM. The SRAM on the network interface is large enough to hold the whole table for the entire physical memory.

3.5 Basic Performance

Fig. 3. Basic performance of NICAM: bandwidth (throughput, MB/s) and one-way latency (μsec) of remote memory data transfer versus message size.

Figure 3 shows the bandwidth (throughput) and the latency of nicam_bcopy, for sizes up to 16 KB for bandwidth and 4 KB for latency. The maximum bandwidth is about 105 MB/s, observed at 64 KB, and the start-up parameter N1/2 (the message size at which half the peak bandwidth is reached) is a little below 2 KB. The minimum latency for small messages is about 20 μsec (one-way). Since NICAM uses pinned-down areas and does not need any buffering or flow-control, these figures show the actual performance observable from applications.

Figure 4 compares the barrier synchronization time of NICAM and the message passing library PM [5]. We implemented a barrier on top of PM, performed by the main processor using message passing; it uses essentially the same algorithm as the one used in NICAM.

Fig. 4. Barrier synchronization time between nodes using NICAM and the message passing library PM.

The figure shows that, while NICAM is slower than PM for two nodes, it becomes faster as the number of nodes increases. This is because the cost of interaction between the main processor and the network interface is larger in NICAM, but the interaction occurs only once in NICAM.

The call overhead of most NICAM operations is about 5.7 μsec. The breakdown of this overhead is: (1) argument checks, (2) copying a handler address and its arguments into the SRAM on the network interface, and (3) a one-word write to a flag in the SRAM to start processing. Serializing instructions (the CPUID instruction) are inserted after steps (2) and (3) to flush the write buffer. Although the serializing instruction has a very large cost, it is necessary because the write buffer of the Pentium Pro reorders write requests.

4 Hiding Latency and Experimental Results

4.1 Hiding Latency of Barriers in an Array Class

The main target application of NICAM is a scientific computing library in C. The array class library may be considered a variant of the BSP (Bulk Synchronous Parallel) model. However, while synchronization points are explicitly specified in programs under the BSP model, they are implicit in the array class, and it is necessary to synchronize at each expression or statement. Some researchers have succeeded in avoiding explicit barriers at the end of super-steps in a message passing environment [8]. However, synchronization may still be required at the beginning of a super-step when the communication is based on remote writes, because of the possibility of overwriting memory contents on which local processing is still in progress. Since the library has no knowledge of the use of arrays, it would have to lock-step strictly in each super-step, and this would make the library sensitive to latency.

The barriers at the beginning of super-steps can be eliminated as described in [6], where a new area is always allocated to store the results. The remote writes are then directed to the newly allocated locations and never overwrite the existing ones. However, incorporating this technique requires modification of the storage reclaimer, because the storage cannot be freed locally. That is, the storage could be overwritten if it were reclaimed and then reallocated independently while other nodes still held a reference to it.

The storage should be reclaimed only when all nodes are ready to release it. This condition can be checked by the use of barriers. The reclaimer first signals a barrier on a location specific to the freed storage, and then checks the completion of the barrier later. Completion indicates that all nodes are ready to release the storage. The implementation of the barrier has the preferable properties of background execution and concurrent progress of multiple instances. Background execution is important because the barriers are triggered by storage reclamation and it is hard to find insertion points for polling; performance would otherwise degrade if polling by the main processors were scattered through every operation. Concurrent progress is also important because the variance in the issuing of barriers is large and the barriers may overlap.

4.2 Overlapping Communication and Computation

Overlapping of computation and communication is a common technique for exploiting hybrid distributed/shared-memory programming on SMP clusters [3]. The communication overhead must be small enough for the overlapping to be effective. The effect of overlapping is presented for an explicit Laplace equation solver using the Jacobi method. In each iteration, a new array is computed from the old array by averaging the four neighbors in a two-dimensional space. The array is partitioned into equally sized strips which are distributed among the nodes. Although the values of the boundary elements need to be exchanged between nodes, computation can start without waiting for the completion of the exchanges, because the internal elements have no dependency on the exchanged ones.

Fig. 5. Difference in speed-up between Laplace solvers with and without overlapping.

Fig. 6. The effect of latency hiding in the cshift operation, for a two-dimensional array of size N x N.

Figure 5 shows the results of the experiments when the number of threads per node is varied over 1, 2, and 4. The number of nodes is fixed to eight. The result shows the speed-up relative to a single processor for a fixed array size. The speed-up is not good because of the large bus traffic in this application. However, the gain from overlapping becomes larger, even though the number of communications increases in proportion to the number of threads.
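The core of the overlapped solver can be sketched as follows. The strip layout, neighbor numbering, reuse of a single flag per neighbor, and the header name nicam.h are our own assumptions (a production code would need per-iteration flags and handling of the edge strips); the point is only that the interior update runs while the boundary rows are in flight.

```c
#include "nicam.h"    /* hypothetical header for the Fig. 1 primitives */
#include <string.h>

#define N    1024                   /* columns per row                        */
#define ROWS 128                    /* rows owned by this strip               */
static double u[ROWS + 2][N];       /* rows 0 and ROWS+1 are halo rows        */
static double u_new[ROWS + 2][N];
static volatile int upper_ready, lower_ready;   /* flags set by the local NI  */

void jacobi_step(int me, int up, int down)      /* up/down: neighbor strips   */
{
    /* Push our boundary rows into the neighbors' halo rows; the remote
     * interface sets the corresponding flag when the row has arrived. */
    nicam_bcopy_notify(me, u[1],    up,   u[ROWS + 1], sizeof(u[1]),
                       (void *)&lower_ready, 1);
    nicam_bcopy_notify(me, u[ROWS], down, u[0],        sizeof(u[ROWS]),
                       (void *)&upper_ready, 1);

    /* Interior rows do not depend on the exchanged data: compute them now. */
    for (int i = 2; i < ROWS; i++)
        for (int j = 1; j < N - 1; j++)
            u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    /* Wait for the halo rows, then update the two boundary rows. */
    while (!upper_ready || !lower_ready)
        ;
    upper_ready = lower_ready = 0;
    for (int j = 1; j < N - 1; j++) {
        u_new[1][j]    = 0.25 * (u[0][j] + u[2][j] + u[1][j-1] + u[1][j+1]);
        u_new[ROWS][j] = 0.25 * (u[ROWS-1][j] + u[ROWS+1][j]
                                 + u[ROWS][j-1] + u[ROWS][j+1]);
    }
    memcpy(u, u_new, sizeof(u));    /* adopt the new iterate (simplified) */
}
```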

4.3 Effect of Hiding Latency

The effect of hiding the latency of barrier synchronization is shown using two versions of a cshift (cyclic shift) operation. One is a bulk-synchronous version which uses a barrier at the beginning of the operation; the other is a latency-hiding version which uses a barrier in the storage reclaimer. Figure 6 shows the effect of the latency hiding. All eight nodes are used, but only a single processor per node. The shift size is one. The figure shows that the whole barrier cost is hidden. Slightly more gain is achieved through the relaxation of the synchronization condition at each step, because no explicit barriers appear in the code.

5 Related Work

Schauser et al. reported on experiments running Active Messages on a network interface [9]. Krishnamurthy et al. reported on running the Active Messages handlers for Split-C primitives on various network interfaces [10]. They reported that low latency is achieved on platforms such as the Paragon and the Berkeley NOW. NICAM, however, further exploits the benefits of utilizing the network interface.

There are a number of fast message passing layers for the Myrinet network, such as BIP [11], FM [12], and PM [5]. These are concerned mainly with bandwidth and latency. In contrast, NICAM aims at reducing overhead, and it shows some disadvantage in bandwidth and latency because it runs much more of the work on the relatively slow micro-processor.

6 Conclusion

NICAM makes use of the micro-processor on the network interface to reduce the overhead of data transfer and the latency of synchronization. In addition, background execution of barriers can completely hide their latency. NICAM employs an Active Messages framework for flexibility in programming the network interface, which allows easy integration of new primitives, such as those that help rewrite message passing programs with remote memory operations.

While NICAM is based on remote memory operations, this is not considered restrictive for data-parallel programming. One-sided communication is often more suitable than message passing for implementing data-parallel operations. Many data-parallel operations, including shift, scan, and exchange, are implemented straightforwardly with remote memory operations. In contrast, it is sometimes necessary in message passing to swap the order of code to make a sender-receiver pair, or to use asynchronous operations to avoid pairing and to tolerate latency.

While many distributed memory MPP machines provide remote memory operations and fast barrier synchronization in hardware, NICAM attempts to provide these operations using commodity network hardware. In NICAM, the data transfer primitives can optionally be combined with synchronization actions, and the synchronization primitives take latency hiding into account, which compensates for the relatively large latency of a software implementation.

References

1. R. P. Martin, A. M. Vahdat, D. E. Culler, T. E. Anderson: Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. Int'l Symp. on Computer Architecture (ISCA'97) (1997).
2. T. von Eicken, D. E. Culler, S. C. Goldstein, K. E. Schauser: Active Messages: a Mechanism for Integrated Communication and Computation. Int'l Symp. on Computer Architecture (ISCA'92), pp. 256-266 (1992).
3. Y. Tanaka, M. Matsuda, M. Ando, K. Kubota, M. Sato: COMPaS: A Pentium Pro PC-based SMP Cluster and its Experience. IPPS Workshop on Personal Computer based Networks of Workstations (PC-NOW'98) (1998).
4. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, W.-K. Su: Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, Vol. 15, No. 1, pp. 29-36 (1995).
5. H. Tezuka, A. Hori, Y. Ishikawa, M. Sato: PM: An Operating System Coordinated High Performance Communication Library. High-Performance Computing and Networking, LNCS 1225, pp. 708-717, Springer-Verlag (1997).
6. M. Gupta, E. Schonberg: Static Analysis to Reduce Synchronization Costs in Data-Parallel Programs. Symp. on Principles of Programming Languages (POPL'96), pp. 322-332 (1996).
7. R. Gupta: The Fuzzy Barrier: A Mechanism for High Speed Synchronization of Processors. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), pp. 54-63 (1989).
8. A. Fahmy, A. Heddaya: Communicable Memory and Lazy Barriers for Bulk Synchronous Parallelism in BSPk. Boston University Technical Report BU-CS (1996).
9. K. E. Schauser, C. J. Scheiman, J. M. Ferguson, P. Z. Kolano: Exploiting the Capability of Communications Co-processors. Int'l Parallel Processing Symposium (IPPS'96) (1996).
10. A. Krishnamurthy, K. E. Schauser, C. J. Scheiman, R. Y. Wang, D. E. Culler, K. Yelick: Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (1996).
11. L. Prylli, B. Tourancheau: BIP: a New Protocol Designed for High Performance Networking on Myrinet. IPPS Workshop on Personal Computer based Networks of Workstations (PC-NOW'98) (1998).
12. S. Pakin, M. Lauria, A. Chien: High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. Supercomputing'95 (1995).

This article was processed using the LaTeX macro package with LLNCS style.
