Data Distribution, Migration and Replication on a cc-numa Architecture
J. Mark Bull and Chris Johnson
EPCC, The King's Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K.
[m.bull,c.johnson]@epcc.ed.ac.uk

1 Introduction

It is well known that, although cc-numa architectures allow construction of large scale shared memory systems, they are more difficult to program effectively because data locality is an important consideration. Support for specifying data distribution in OpenMP has been the subject of much debate [1], [4], and several proposed implementations. These take the form of data distribution directives, giving the programmer control of where data is placed in the memory system. In the absence of additional directives, data distribution can be controlled by exploiting the system's allocation policy: in most cc-numa systems, data is placed on the node which first accesses it, the so-called first touch policy.

An alternative strategy is to give this control not to the programmer but to the operating system, by allowing the location of data in memory to change as a program executes. This can be done either by data migration, where pages can move between nodes but there is only ever one copy, or by replication, where multiple copies of pages can exist.

In this study, we examine the interactions between data distribution (implemented via the first touch policy), migration, and replication on a prototype cc-numa architecture. On this system it appears that replication is almost always more effective than migration, despite the additional cost in memory usage. Data distribution can be effective in applications where there is an obviously correct distribution. In many applications the correct distribution is not so obvious, and although it pays to distribute data in the absence of migration and replication, some combination of replication and migration can often achieve comparable performance.
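The first touch policy described above can be exploited directly from OpenMP, with no extra directives. The following is a minimal sketch of the idea (our own illustration; `init_and_sum` is an invented name, not code from any of the benchmark codes): initialising an array inside a parallel loop faults each page in on the node of the thread that will later compute on it, provided the compute loops use the same static schedule.

```c
#include <stdlib.h>

/* Illustration of first-touch control of data distribution. Under a
   first-touch policy, the parallel initialisation loop places each page
   on the node of the thread that first writes it. */
double init_and_sum(int n) {
    double *a = malloc(n * sizeof(double));
    if (a == NULL) return -1.0;

    /* Parallel first touch: each thread writes, and so places locally,
       the same static chunk it will use below. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 1.0;

    /* Compute loop with the identical static schedule, so each thread
       mostly accesses pages resident on its own node. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];

    free(a);
    return sum;
}
```

Removing the pragma from the initialisation loop gives the "sequential initialisation" variant, which places the whole array on the master thread's node; this is exactly the transformation applied to the benchmark codes in the experiments below.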
2 The Sun Wildfire prototype

The Sun Wildfire system [3], [5] is a prototype cc-numa architecture, built from standard SMP nodes. In each SMP, one processor board is replaced by a Wildfire interconnect board, which can have up to three high-speed links to other SMP nodes, allowing construction of machines with up to four nodes. Unlike other cc-numa machines (such as the SGI Origin series) the Wildfire system has a small number of (potentially) large nodes, rather than a large number of small nodes. The Wildfire system at the University of Edinburgh consists of three nodes: one E6000 server with 18 processors and two E4000 servers with 8 processors each. The processors are 250MHz Ultrasparc IIs, and the system runs a modified version of Solaris 6. The latency of a memory access to a remote node is around 6-7 times that of an access to main memory on the local node, which is quite a large factor for a cc-numa architecture. Page migration and replication are managed by daemons running on each node, and can be switched on and off on a system-wide basis via a command line interface. The algorithms for determining when a page should be migrated or replicated are described in [5]. The system monitors remote accesses to pages, and when these exceed a certain threshold, the page is marked as a candidate for replication or
migration. If both are enabled, migration is in general tried first; a page is replicated instead if it has already been replicated, or has been recently migrated. When a page is replicated, a shadow page is set up on the local node, which can satisfy misses for the data on that page. Cache coherency is still maintained at the cache-line level on a system-wide basis.

3 Experiments and Results

To evaluate the interactions between distribution, migration and replication, we have taken a simple two-dimensional CFD simulation, and a number of codes from the OpenMP version of the NAS Parallel Benchmark suite [2]. Two versions of each code were produced, with sequential and with parallel data initialisation, to control data distribution. In the case of the NAS benchmarks, the supplied code usually contains parallelised data initialisation, so to obtain a sequential initialisation we removed the relevant OpenMP directives. Each version was run with migration and replication independently enabled and disabled, giving a total of eight runs for each code. The codes were run on 18 threads, utilising six processors on each of the three nodes. OpenMP threads were bound to processors to prevent threads being migrated between nodes. Experiments with different numbers of processors suggest that similar conclusions would be drawn, and so for the sake of clarity we do not report them here.

Table 1: Execution time (in seconds) of SHALLOW on 18 processors (rows: replication off/on)

Figure 1: Execution time (in seconds) per timestep for SHALLOW on 18 processors (series: Distributed; Seq, Rep off, Mig off; Seq, Rep off, Mig on; Seq, Rep on, Mig off; Seq, Rep on, Mig on)
Table 2: Execution time (in seconds) of final 10 timesteps of SHALLOW on 18 processors (rows: replication off/on)

Table 1 shows the execution time for SHALLOW, a simple 2-D shallow water simulation, on the Wildfire system. Without data distribution, all memory is allocated on one node. By distributing the data, a more than four-fold performance increase is obtained, and replication and migration bring no additional benefit. Without distribution, migration, replication and the combination of the two all reduce the run time significantly, to within a factor of 1.4 of that achieved by distribution. As the run time is quite short, transient effects are still significant: these are observable in Figure 1, which displays the execution time for each of the 100 timesteps. Transient behaviour is seen in the first 20 to 25 timesteps. Table 2 shows the execution time of the last 10 of the 100 timesteps executed in this run of the code. Migration alone achieves slightly better performance than distribution; replication in addition to migration does not help here, and replication alone is around 20% slower than distribution.

As an aside, it is interesting to note that if we run the same simulation on the 18 processors of the E6000 alone, the execution time is approximately 10% longer than when the run is distributed across the three nodes. In the latter case, the additional memory bandwidth available outweighs any penalty of longer latencies across the Wildfire interconnect.

Table 3: Execution time (in seconds) of BT, Class B, on 18 processors (rows: replication off/on)

Table 3 shows the execution time for BT from the NAS suite. In this case data access patterns are more complex than in SHALLOW, and the best data distribution strategy is less obvious. We observe that without migration or replication, data distribution has a significant effect on performance.
However, both migration and replication, and in particular the combination of the two, achieve better performance than distribution alone. Indeed, if at least one of them is enabled, data distribution has very little additional benefit.

Table 4: Execution time (in seconds) of CG, Class B, on 18 processors (rows: replication off/on)

Table 4 shows the execution time for CG from the NAS suite. In this case the distribution of pages by the first touch policy is not very effective, reducing the execution time only from 1065 seconds to 941 seconds. Migration is able to reduce the time to 699 seconds, but replication is by far the most effective strategy, regardless of whether migration or distribution is used.

Table 5 shows the execution time for FT from the NAS suite. All three strategies are beneficial, but the best performance is obtained by combining all three. This is the only case in our experiments where
it appears that the benefits of distribution cannot be matched by using migration or replication. However, we have not discounted the possibility that this is a transient effect, and that a longer run would not show such a significant advantage for distribution.

Table 5: Execution time (in seconds) of FT, Class B, on 18 processors (rows: replication off/on)

Table 6: Execution time (in seconds) of MG, Class C, on 18 processors (rows: replication off/on)

Table 6 shows the execution time for MG from the NAS suite. In this case neither distribution, replication nor migration has a significant effect on the execution time.

Table 7: Execution time (in seconds) of SP, Class B, on 18 processors (rows: replication off/on)

Table 7 shows the execution time for SP from the NAS suite. Here we observe a similar situation to that for BT: both distribution and migration show a significant benefit, but replication is the best strategy. With replication enabled, it makes little difference whether the other two strategies are used or not.

4 Discussion

The results presented above indicate that, at least for this type of code, the dynamic techniques of replication and migration are able to obtain performance as good as, and sometimes significantly better than, static data distribution. We acknowledge the limitations of these experiments: there is little pressure on memory, which might put replication (and to some extent migration) at a disadvantage in cases where a large fraction of the physical memory is required by an application. Furthermore, the system we have used is on the small side, though we might expect to obtain similar results on a system with a small number of large SMP nodes. This type of system has so far not found favour commercially: previous distributed shared memory systems have typically been constructed with a large number of small nodes, each containing between one and four processors.
An advantage of the large-node design is that the scalability of the dynamic techniques is not severely tested. Another important observation is that replication is, if anything, more beneficial than migration, which has tended to receive more attention in the literature. It has been noted, for example in [4], that with migration alone, pages which are referenced by multiple nodes can bounce around the system unless some action is taken to prevent this. In the Wildfire system, such pages are no longer migrated, but replicated instead.
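This behaviour follows from the per-page decision described in Section 2 and in [5], which can be sketched as follows. This is our own simplified reconstruction, not Solaris code; the names (`page_stats_t`, `page_policy`, the threshold parameter) are invented for illustration.

```c
/* Per-page counters of the kind the Wildfire daemons consult
   (hypothetical layout; the real kernel structures differ). */
typedef struct {
    int remote_accesses;    /* remote accesses observed for this page */
    int replicated;         /* nonzero if a shadow copy already exists */
    int recently_migrated;  /* nonzero if the page was migrated recently */
} page_stats_t;

enum page_action { PAGE_LEAVE, PAGE_MIGRATE, PAGE_REPLICATE };

/* Below the threshold the page is left alone. Above it, migration is
   tried first; but a page that has already been replicated, or that
   attracts remote accesses again soon after a migration (a "bouncing"
   page), is replicated instead of being migrated again. */
enum page_action page_policy(const page_stats_t *p, int threshold) {
    if (p->remote_accesses <= threshold)
        return PAGE_LEAVE;
    if (p->replicated || p->recently_migrated)
        return PAGE_REPLICATE;
    return PAGE_MIGRATE;
}
```

Under this scheme a heavily remotely-accessed page is migrated once; if remote traffic resumes while the page is still marked as recently migrated, it is replicated rather than moved again, which is the mechanism that stops pages bouncing between nodes.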
Nevertheless, migration is still a useful strategy, so it would be preferable for the user to retain some control over the use of replication and migration. In the Wildfire system such control operates on a system-wide basis, which is undesirable if the system is to be used to run multiple applications at the same time. Our findings therefore tend to support the authors of [4], who conclude that data distribution is not a necessary extension for OpenMP, because runtime techniques are sufficiently powerful.

In the past year, the economic climate has forced a number of vendors to cancel or postpone plans to build large cc-NUMA machines. Instead, there has been a trend towards building larger SMP systems (for example the Sun Fire 15000 and Fujitsu PrimePower) with better scalability of memory bandwidth, although bandwidth limitations in these systems are still a significant obstacle to the scaling of some applications. However, interest in large cc-numa machines may be renewed in the future. If so, it will be critical to co-design the hardware and operating system to implement effective dynamic page placement strategies for large scale scientific applications, and to give the user adequate control over their use. Given the technical difficulties involved, further research in this area could be a sound investment for the future of large shared memory architectures.

References

[1] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson and C. D. Offner, Extending OpenMP for NUMA Machines, in Proc. of IEEE/ACM Supercomputing 2000: High Performance Computing and Networking Conference, Dallas, TX, November 2000.

[2] D. H. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan and S. Weeratunga, The NAS Parallel Benchmarks, Technical Report RNR, NASA Ames Research Center, March.

[3] E. Hagersten and M. Koster, WildFire: A Scalable Path for SMPs, in Proceedings of the Fifth IEEE Symposium on High-Performance Computer Architecture, February.

[4] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta and E. Ayguadé, Is Data Distribution Necessary in OpenMP?, in Proc. of IEEE/ACM Supercomputing 2000: High Performance Computing and Networking Conference, Dallas, TX, November 2000.

[5] L. Noordergraaf and R. van der Pas, Performance Experiences on Sun's Wildfire Prototype, in Proc. of Supercomputing 99, Portland, OR, November 1999.
Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms
More informationHardware Profile-guided Automatic Page Placement for ccnuma Systems
Hardware Profile-guided Automatic Page Placement for ccnuma Systems Jaydeep Marathe Frank Mueller Department of Computer Science, North Carolina State University, Raleigh, NC 27695-7534 e-mail: mueller@cs.ncsu.edu
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationIntroduction to OpenMP
Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation
More informationBinding Nested OpenMP Programs on Hierarchical Memory Architectures
Binding Nested OpenMP Programs on Hierarchical Memory Architectures Dirk Schmidl, Christian Terboven, Dieter an Mey, and Martin Bücker {schmidl, terboven, anmey}@rz.rwth-aachen.de buecker@sc.rwth-aachen.de
More informationLiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster
: Support for High-Performance MPI Intra-Node Communication on Linux Cluster Hyun-Wook Jin Sayantan Sur Lei Chai Dhabaleswar K. Panda Department of Computer Science and Engineering The Ohio State University
More informationAdaptive Scheduling under Memory Pressure on Multiprogrammed SMPs
Adaptive Scheduling under Memory Pressure on Multiprogrammed SMPs Dimitrios S. Nikolopoulos and Constantine D. Polychronopoulos Coordinated Science Laboratory University of Illinois at Urbana-Champaign
More informationJob Re-Packing for Enhancing the Performance of Gang Scheduling
Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT
More informationUsing Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems
Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems Michael Marchetti, Leonidas Kontothanassis, Ricardo Bianchini, and Michael L. Scott Department of
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationEvaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures
Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures Konstantin Berlin 1,JunHuan 2, Mary Jacob 3, Garima Kochhar 3,JanPrins 2, Bill
More informationHigh Performance Algorithms on. Clusters of Symmetric Multiprocessors (SMPs)
SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs) David A. Bader æ Department of Electrical and Computer Engineering University of New Mexico,
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationA Test Suite for High-Performance Parallel Java
page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium
More informationParallel and High Performance Computing CSE 745
Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel
More informationHow to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture
How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010
More informationParallel Computer Architecture
Parallel Computer Architecture What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some broad issues: Resource Allocation:»
More informationUsing Timestamps to Track Causal Dependencies
Using Timestamps to Track Causal Dependencies J. A. David McWha Dept. of Computer Science, University of Waikato, Private Bag 315, Hamilton jadm@cs.waikato.ac.nz ABSTRACT As computer architectures speculate
More informationCACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES
CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES Angad Kataria, Simran Khurana Student,Department Of Information Technology Dronacharya College Of Engineering,Gurgaon Abstract- Hardware trends
More informationIngo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved.
Intelligent Storage Results from real life testing Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA SAS Intelligent Storage components! OLAP Server! Scalable Performance Data Server!
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)
More informationMixed Mode MPI / OpenMP Programming
Mixed Mode MPI / OpenMP Programming L.A. Smith Edinburgh Parallel Computing Centre, Edinburgh, EH9 3JZ 1 Introduction Shared memory architectures are gradually becoming more prominent in the HPC market,
More informationImplementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand
Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Jiuxing Liu and Dhabaleswar K. Panda Computer Science and Engineering The Ohio State University Presentation Outline Introduction
More informationOPENMP TIPS, TRICKS AND GOTCHAS
OPENMP TIPS, TRICKS AND GOTCHAS Mark Bull EPCC, University of Edinburgh (and OpenMP ARB) markb@epcc.ed.ac.uk OpenMPCon 2015 OpenMPCon 2015 2 A bit of background I ve been teaching OpenMP for over 15 years
More informationA MIXED-LANGUAGE PROGRAMMING METHODOLOGY FOR HIGH PERFORMANCE JAVA COMPUTING*
A MIXED-LANGUAGE PROGRAMMING METHODOLOGY FOR HIGH PERFORMANCE JAVA COMPUTING* Vladimir S. Getov University of Westminster Northwick Park, Harrow, UK and Los Alamos National Laboratory Los Alamos, NM, USA
More informationAn evaluation of the Performance and Scalability of a Yellowstone Test-System in 5 Benchmarks
An evaluation of the Performance and Scalability of a Yellowstone Test-System in 5 Benchmarks WRF Model NASA Parallel Benchmark Intel MPI Bench My own personal benchmark HPC Challenge Benchmark Abstract
More informationUsing Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems
Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems Xingfu Wu and Valerie Taylor Department of Computer Science
More information