Data Distribution, Migration and Replication on a cc-NUMA Architecture


J. Mark Bull and Chris Johnson
EPCC, The King's Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K.
[m.bull,c.johnson]@epcc.ed.ac.uk

1 Introduction

It is well known that, although cc-NUMA architectures allow the construction of large-scale shared memory systems, they are more difficult to program effectively because data locality is an important consideration. Support for specifying data distribution in OpenMP has been the subject of much debate [1], [4], and of several proposed implementations. These take the form of data distribution directives, giving the programmer control over where data is placed in the memory system. In the absence of additional directives, data distribution can be controlled by exploiting the system's allocation policy: in most cc-NUMA systems, data is placed on the node which first accesses it, the so-called first-touch policy. An alternative strategy is to give this control not to the programmer but to the operating system, by allowing the location of data in memory to change as the program executes. This can be done either by data migration, where pages can move between nodes but only one copy ever exists, or by replication, where multiple copies of a page can exist.

In this study, we examine the interactions between data distribution (implemented via the first-touch policy), migration, and replication on a prototype cc-NUMA architecture. On this system it appears that replication is almost always more effective than migration, despite the additional cost in memory usage. Data distribution can be effective in applications where there is an obviously correct distribution. In many applications the correct distribution is not so obvious, and although it pays to distribute data in the absence of migration and replication, some combination of replication and migration can often achieve comparable performance.

2 The Sun Wildfire prototype

The Sun Wildfire system [3], [5] is a prototype cc-NUMA architecture built from standard SMP nodes. In each SMP, one processor board is replaced by a Wildfire interconnect board, which can have up to three high-speed links to other SMP nodes, allowing the construction of machines with up to four nodes. Unlike other cc-NUMA machines (such as the SGI Origin series), the Wildfire system has a small number of (potentially) large nodes, rather than a large number of small nodes. The Wildfire system at the University of Edinburgh consists of three nodes: one E6000 server with 18 processors and two E4000 servers with 8 processors each. The processors are 250 MHz UltraSPARC IIs, and the system runs a modified version of Solaris 6. The latency of a memory access to a remote node is around 6-7 times that of an access to main memory on the local node, which is quite a large factor for a cc-NUMA architecture.

Page migration and replication are managed by daemons running on each node, and can be switched on and off on a system-wide basis via a command-line interface. The algorithms for determining when a page should be migrated or replicated are described in [5]. The system monitors remote accesses to pages, and when these exceed a certain threshold, the page is marked as a candidate for replication or migration. If both are enabled, migration is in general tried first; a page is replicated instead if it has already been replicated, or has recently been migrated. When a page is replicated, a shadow page is set up on the local node, which can satisfy misses for the data on that page. Cache coherence is still maintained at the cache-line level on a system-wide basis.
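The threshold-based rule just described can be summarised as a small decision function. The sketch below is schematic only: the structure, field names, threshold parameter and enable flags are assumptions made for illustration, not the interface of the Wildfire daemons (the actual algorithms are described in [5]).

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical per-page bookkeeping; not the real Solaris/Wildfire data structure. */
struct page_info {
    unsigned remote_misses;       /* remote accesses observed for this page */
    bool     already_replicated;  /* page has been replicated before */
    bool     recently_migrated;   /* page was migrated in the recent past */
};

enum page_action { NO_ACTION, MIGRATE, REPLICATE };

static enum page_action choose_action(const struct page_info *p,
                                       unsigned threshold,
                                       bool migration_on, bool replication_on)
{
    if (p->remote_misses < threshold)
        return NO_ACTION;              /* not yet a candidate page */

    /* Migration is in general tried first; a page that has already been
       replicated, or was migrated recently, is replicated instead, which
       stops pages shared by several nodes from bouncing between them. */
    if (replication_on &&
        (p->already_replicated || p->recently_migrated || !migration_on))
        return REPLICATE;

    return migration_on ? MIGRATE : NO_ACTION;
}

int main(void)
{
    struct page_info hot = { .remote_misses = 1000,
                             .already_replicated = false,
                             .recently_migrated = true };
    printf("action = %d\n", choose_action(&hot, 64, true, true)); /* REPLICATE */
    return 0;
}
```

Written this way, the key design choice is visible directly: replication acts as the fallback for pages that migration cannot settle, rather than as the default action.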

3 Experiments and Results

To evaluate the interactions between distribution, migration and replication, we have taken a simple two-dimensional CFD simulation and a number of codes from the OpenMP version of the NAS Parallel Benchmark suite [2]. Two versions of each code were produced, one with sequential and one with parallel data initialisation, to control data distribution. In the case of the NAS benchmarks, the supplied code usually contains parallelised data initialisation, so to obtain a sequential initialisation we removed the relevant OpenMP directives. Each version was run with migration and replication independently enabled and disabled, giving a total of eight runs for each code. The codes were run on 18 threads, utilising six processors on each of the three nodes. OpenMP threads were bound to processors to prevent threads being migrated between nodes. Experiments with different numbers of processors suggest that similar conclusions would be drawn, so for the sake of clarity we do not report them here.
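To make the two initialisation variants concrete, the following sketch shows the pattern that parallel data initialisation relies on; the array a, the problem size N and the trivial update are hypothetical stand-ins, not code from SHALLOW or the NAS benchmarks. With the parallel loop, each thread first touches the pages it later updates, so the first-touch policy places them on that thread's node; replacing it with a plain sequential loop places every page on the node running the master thread.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)   /* hypothetical problem size */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* Parallel initialisation: with a static schedule, thread t first
       touches the same block of pages that it updates in the compute
       loop below, so first-touch places those pages on thread t's node.
       Making this loop sequential gives the "sequential initialisation"
       variant, in which all pages end up on a single node. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute phase: the same static schedule, so most accesses are to
       locally placed pages (assuming threads stay on the same node). */
    for (int step = 0; step < 100; step++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 1.0;   /* stand-in for the real stencil update */
    }

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}
```

This only yields the intended distribution if threads remain on the same node for the whole run, which is why the threads were bound to processors in these experiments.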

Table 1: Execution time (in seconds) of SHALLOW on 18 processors (rows: Replication OFF / Replication ON)

Figure 1: Execution time (in seconds) per timestep for SHALLOW on 18 processors (series: Distributed; Seq, Rep off, Mig off; Seq, Rep off, Mig on; Seq, Rep on, Mig off; Seq, Rep on, Mig on; axes: time (s) against timestep)

Table 2: Execution time (in seconds) of the final 10 timesteps of SHALLOW on 18 processors (rows: Replication OFF / Replication ON)

Table 1 shows the execution time for SHALLOW, a simple 2-D shallow water simulation, on the Wildfire system. Without data distribution, all memory is allocated on one node. By distributing the data, a more than four-fold performance increase is obtained, and replication and migration give no additional benefit. Without distribution, migration, replication and the combination of the two all reduce the run time significantly, to within a factor of 1.4 of that achieved by distribution. As the run time is quite short, transient effects are still significant: these are observable in Figure 1, which displays the execution time for each of the 100 timesteps. Transient behaviour is seen in the first 20 to 25 timesteps. Table 2 shows the execution time of the last 10 of the 100 timesteps executed in this run of the code. Migration alone achieves slightly better performance than distribution; replication in addition to migration does not help here, and replication alone is around 20% slower than distribution.

As an aside, it is interesting to note that if we run the same simulation on the 18 processors of the E6000 alone, the execution time is approximately 10% longer than when the run is distributed across the three nodes. In the latter case, the additional memory bandwidth available outweighs any penalty from the longer latencies across the Wildfire interconnect.
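The per-timestep times plotted in Figure 1 can be gathered with very simple instrumentation; a minimal sketch is shown below, using omp_get_wtime, with a hypothetical do_timestep routine standing in for the solver.

```c
#include <stdio.h>
#include <omp.h>

#define NSTEPS 100

/* Hypothetical stand-in for one timestep of the solver. */
static void do_timestep(int step)
{
    (void)step;   /* the real code would advance the simulation here */
}

int main(void)
{
    double t[NSTEPS];

    for (int step = 0; step < NSTEPS; step++) {
        double t0 = omp_get_wtime();
        do_timestep(step);
        t[step] = omp_get_wtime() - t0;   /* wall-clock time of this timestep */
    }

    /* One line per timestep: easy to plot as in Figure 1, and the last
       few entries give steady-state times of the kind reported in Table 2. */
    for (int step = 0; step < NSTEPS; step++)
        printf("%d %g\n", step + 1, t[step]);

    return 0;
}
```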

Table 3: Execution time (in seconds) of BT, Class B, on 18 processors (rows: Replication OFF / Replication ON)

Table 3 shows the execution time for BT from the NAS suite. In this case the data access patterns are more complex than in SHALLOW, and the best data distribution strategy is less obvious. We observe that without migration or replication, data distribution has a significant effect on performance. However, both migration and replication, and in particular the combination of the two, achieve better performance than distribution alone. Indeed, if at least one of them is enabled, data distribution has very little additional benefit.

Table 4: Execution time (in seconds) of CG, Class B, on 18 processors (rows: Replication OFF / Replication ON)

Table 4 shows the execution time for CG from the NAS suite. In this case the distribution of pages by the first-touch policy is not very effective, reducing the execution time only from 1065 seconds to 941 seconds. Migration is able to reduce the time to 699 seconds, but replication is by far the most effective strategy, regardless of whether migration or distribution is also used.

Table 5: Execution time (in seconds) of FT, Class B, on 18 processors (rows: Replication OFF / Replication ON)

Table 5 shows the execution time for FT from the NAS suite. All three strategies are beneficial, but the best performance is obtained by combining all three. This is the only case in our experiments where it appears that the benefits of distribution cannot be reproduced by using migration or replication. However, we have not discounted the possibility that this is a transient effect and that a longer run would not show such a significant advantage for distribution.

Table 6: Execution time (in seconds) of MG, Class C, on 18 processors (rows: Replication OFF / Replication ON)

Table 6 shows the execution time for MG from the NAS suite. In this case neither distribution, replication nor migration has a significant effect on the execution time.

Table 7: Execution time (in seconds) of SP, Class B, on 18 processors (rows: Replication OFF / Replication ON)

Table 7 shows the execution time for SP from the NAS suite. Here we observe a similar situation to that for BT: both distribution and migration show a significant benefit, but replication is the best strategy. With replication enabled, it makes little difference whether the other two strategies are used or not.

4 Discussion

The results presented above indicate that, at least for this type of code, the dynamic techniques of replication and migration are able to obtain performance as good as, and sometimes significantly better than, static data distribution. We acknowledge the limitations of these experiments: there is little pressure on memory, whereas heavy memory pressure might put replication (and to some extent migration) at a disadvantage in cases where a large fraction of the physical memory is required by the application. Furthermore, the system we have used is on the small side, though we might expect to obtain similar results on a system with a small number of large SMP nodes. This type of system has so far not found favour commercially: previous distributed shared memory systems have typically been constructed from a large number of small nodes, each containing between one and four processors. An advantage of the large-node design is that the scalability of the dynamic techniques is not severely tested.

Another important observation is that replication is, if anything, more beneficial than migration, which has tended to receive more attention in the literature. It has been noted, for example in [4], that migration alone can result in pages which are referenced by multiple nodes bouncing around the system unless some action is taken to prevent this. In the Wildfire system, such pages are no longer migrated, but replicated instead.

Nevertheless, migration is still a useful strategy, so it would be preferable for the user to retain some control over the use of replication and migration. In the Wildfire system such control operates on a system-wide basis, which is undesirable if the system is to be used to run multiple applications at the same time. Our findings therefore tend to support those of [4], who conclude that data distribution is not a necessary extension to OpenMP, because runtime techniques are sufficiently powerful.

In the past year, the economic climate has forced a number of vendors to cancel or postpone plans to build large cc-NUMA machines. Instead, there has been a trend towards building larger SMP systems (for example the Sun Fire 15000 and Fujitsu PrimePower) with better scaling of memory bandwidth, although bandwidth limitations in these systems are still a significant obstacle to the scaling of some applications. However, interest in large cc-NUMA machines may be renewed in the future. If so, it will be critical to co-design the hardware and operating system to implement effective dynamic page placement strategies for large-scale scientific applications, and to give the user adequate control over their use. Given the technical difficulties involved, further research in this area could be a sound investment for the future of large shared memory architectures.

References

[1] John Bircsak, Peter Craig, RaeLyn Crowell, Zarka Cvetanovic, Jonathan Harris, C. Alexander Nelson and Carl D. Offner, Extending OpenMP for NUMA Machines, in Proc. of IEEE/ACM Supercomputing 2000: High Performance Computing and Networking Conference, Dallas, TX, November 2000.

[2] D. H. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan and S. Weeratunga, The NAS Parallel Benchmarks, Technical Report RNR, NASA Ames Research Center, March.

[3] E. Hagersten and M. Koster, WildFire: A Scalable Path for SMPs, in Proceedings of the Fifth IEEE Symposium on High-Performance Computer Architecture, February 1999.

[4] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta and E. Ayguade, Is Data Distribution Necessary in OpenMP?, in Proc. of IEEE/ACM Supercomputing 2000: High Performance Computing and Networking Conference, Dallas, TX, November 2000.

[5] L. Noordergraaf and R. van der Pas, Performance Experiences on Sun's Wildfire Prototype, in Proc. of Supercomputing 99, Portland, OR, November 1999.
