The Case of the Missing Supercomputer Performance

1 The Case of the Missing Supercomputer Performance Achieving Optimal Performance on the 8192 Processors of ASCI Q Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab) Presented by Jiahua He

2 Skeleton of the Story Machine: ASCI Q (second on the Top500 list) 2048 Alpha SMP nodes with 4 processors per node, interconnected with the Quadrics QsNet network Application: SAGE, a compressible Eulerian hydrodynamics program of 150,000 lines of Fortran MPI code Beginning: a serious but previously undetected problem Techniques: Measurement to determine real performance Analytical model to predict expected performance Microbenchmarks to identify the problem source Simulator to examine what-if scenarios Result: a factor of 2 improvement in application performance

3 Steps 1. Performance expectation: use an analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

4 Step 1 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

5 Performance Expectation Model (Darren Kerbyson et al., SC01) Validated on many large-scale systems, including all ASCI systems Typical prediction error of less than 10% Terms QA: first 4096-processor segment QB: second 4096-processor segment Weak scaling: fix the per-node problem size and scale the # of processors

6 Performance Expectation Model (Darren Kerbyson et al., SC01) Validated on many large-scale systems, including all ASCI systems Typical prediction error of less than 10% Terms QA: first 4096-processor segment QB: second 4096-processor segment Weak scaling: fix the per-node problem size and scale the # of processors MYSTERY #1: SAGE performs significantly worse on ASCI Q than was predicted by our performance model.

7 Different # of Processors per Node Is the model accurate? n-proc: using n processors per node The only significant difference occurs with 4-proc, giving confidence to the model and localizing the problem to the 4-proc configuration 3-proc outperforms 4-proc when using more than 256 nodes 2-proc outperforms 4-proc when using more than 512 nodes

8 Performance Variability A constant amount of work in each cycle should take a constant amount of time, yet cycle time varies from 0.7s to 3.0s: a factor of 4 in variability

9 Breakdown of Cycle Time cycle = computation + local boundary exchange + collective communication Local boundary exchanges (get, put): plateau above 500 processors, matching the model prediction Collective communications (allreduce, reduction, broadcast): increase with the # of processors, even though the number and payload size of allreduce operations stay constant The difference between allreduce and reduction/broadcast is simply their frequency of occurrence (a sketch of the cycle-time structure follows below)
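The slides do not reproduce the model's equations; the following is a minimal sketch of the cycle-time structure they describe, with symbol names of my own choosing (this is not the paper's notation):

$$T_{\text{cycle}}(P) \approx T_{\text{comp}} + T_{\text{boundary}}(P) + T_{\text{coll}}(P)$$

Under weak scaling, $T_{\text{comp}}$ is constant and $T_{\text{boundary}}(P)$ plateaus once the boundary surface stops growing (above roughly 500 processors here), so any further growth of measured cycle time with $P$ must come from $T_{\text{coll}}(P)$.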

10 Observations Summary Significant difference between expected and observed performance, but only with 4-proc High variability Source of the performance deficit: collective operations, especially allreduce Deduction: improve the performance of allreduce, especially when using four processors per node

11 Step 2 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

12 Investigating allreduce allreduce latency 4-proc: 3ms Others: less than 0.3ms Synthetic parallel benchmark: alternately compute for 0, 1 or 5 ms, then perform either an allreduce or a barrier (see the sketch below) An ideally scalable system would show logarithmic growth with the # of nodes and insensitivity to computational granularity Result: not scalable
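A minimal sketch of such a synthetic benchmark, assuming an MPI_Allreduce on a single double and a spin loop standing in for computation; the iteration count and timing methodology are my assumptions, not the authors' code:

/* Each rank computes for a fixed granularity, then enters an
 * MPI_Allreduce; the loop is timed to expose collective latency. */
#include <mpi.h>
#include <stdio.h>

static void compute_for(double seconds) {
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds)
        ;  /* spin to emulate computation of the given granularity */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const double granularities[] = {0.0, 0.001, 0.005}; /* 0, 1, 5 ms */
    const int iterations = 1000;

    for (int g = 0; g < 3; g++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iterations; i++) {
            compute_for(granularities[g]);
            double in = (double)rank, out;
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }
        double elapsed = MPI_Wtime() - t0;
        if (rank == 0)
            printf("granularity %.0f ms: %.3f us/iteration\n",
                   granularities[g] * 1e3, 1e6 * elapsed / iterations);
    }
    MPI_Finalize();
    return 0;
}

On an ideally scalable machine the reported per-iteration cost would grow only logarithmically with node count and would not depend on which granularity is chosen.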

13 Optimization Optimizing allreduce Waiting strategies: always polling, or blocking after a limited spin time (100us, determined empirically; see the sketch below) Result: allreduce latency improved by a factor of 7 Expectation: at 4096 processors SAGE spends 51% of its time in allreduce, so a 78% performance gain Measurement result: only a marginal improvement in application performance
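A hedged sketch of the spin-then-block waiting strategy the slide mentions: poll for up to a fixed spin budget, then fall back to a blocking wait so the processor is released. The two extern functions are hypothetical placeholders, not Quadrics or MPI APIs:

#include <stdbool.h>
#include <time.h>

#define SPIN_US 100  /* spin budget; 100 us was determined empirically */

extern bool event_arrived(void);        /* hypothetical: poll completion flag */
extern void wait_event_blocking(void);  /* hypothetical: sleep until event */

static long elapsed_us(const struct timespec *t0) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - t0->tv_sec) * 1000000L +
           (now.tv_nsec - t0->tv_nsec) / 1000L;
}

void wait_for_event(void) {
    struct timespec t0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (!event_arrived()) {
        if (elapsed_us(&t0) > SPIN_US) {  /* spin budget exhausted */
            wait_event_blocking();        /* yield the processor */
            return;
        }
    }
}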

14 Optimization Optimizing allreduce Waiting strategies: always polling, or blocking after a limited spin time (100us, determined empirically) Result: allreduce latency improved by a factor of 7 Expectation: at 4096 processors SAGE spends 51% of its time in allreduce, so a 78% performance gain Measurement result: only a marginal improvement in application performance MYSTERY #2: Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.

15 Analyzing Noise The culprit is neither MPI nor the network, but the node itself: periodic system activities (noise) With 4-proc there is no spare processor to absorb the noise (Fig. 3, 6), and processes block in allreduce while waiting for delayed peers Benchmark: synthetic 1000s of computation per processor, with no communication; max slowdown from noise: only 2.5% Refined benchmark: 1 million 1ms iterations per processor, matching the granularity pattern of LANL codes (see the sketch below); similar result
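A minimal sketch of the refined noise microbenchmark the slide describes: each processor runs one million nominally 1 ms work quanta and compares total elapsed time against the ideal, so any slowdown is attributable to interference. The calibration step is only indicated, and the loop count is a placeholder:

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000
#define QUANTUM_S  0.001   /* 1 ms of nominal work per iteration */

static volatile double sink;  /* defeats optimization of the work loop */

static void do_work(long loop_count) {
    for (long i = 0; i < loop_count; i++)
        sink += (double)i * 1e-9;  /* fixed amount of arithmetic */
}

int main(void) {
    /* loop_count must first be calibrated so that do_work() takes
     * ~1 ms on an idle processor; this fixed value is a stand-in. */
    long loop_count = 500000;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERATIONS; i++)
        do_work(loop_count);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double ideal = ITERATIONS * QUANTUM_S;  /* 1000 s if calibrated */
    printf("elapsed %.1f s, ideal %.1f s, slowdown %.2f%%\n",
           elapsed, ideal, 100.0 * (elapsed / ideal - 1.0));
    return 0;
}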

16 Analyzing Noise The culprit is neither MPI nor the network, but the node itself: periodic system activities (noise) With 4-proc there is no spare processor to absorb the noise (Fig. 3, 6), and processes block in allreduce while waiting for delayed peers Benchmark: synthetic 1000s of computation per processor; max slowdown only 2.5% Refined benchmark: 1 million 1ms iterations per processor, matching the granularity pattern of LANL codes; similar result MYSTERY #3: Although the noise hypothesis could explain SAGE's suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.

17 Node Aggregation Aggregating measurements per node exposes structure in what appears to be uncorrelated noise on a per-processor basis Important observation: a regular pattern across nodes; each cluster (32 nodes) contains a few noisier nodes Zooming into a cluster: Node 0 is the cluster manager, Node 1 is the quorum node, Node 31 runs the RMS cluster monitor

18 Node Aggregation Aggregating measurements per node exposes structure in what appears to be uncorrelated noise on a per-processor basis Important observation: a regular pattern across nodes; each cluster (32 nodes) contains a few noisier nodes Zooming into a cluster: Node 0 is the cluster manager, Node 1 is the quorum node, Node 31 runs the RMS cluster monitor FINDING #1: Analyzing noise on a per-node basis instead of a per-processor basis reveals a regular structure across nodes.

19 Noise Events (figure slide)

20 Kernel Sources of Noise Distributed heartbeat, generated at kernel level: lightweight (hundreds of microseconds) but high frequency (one every 125ms) RMS daemons (Quadrics Resource Management System): one every 30s TruCluster daemons (HP cluster management software): about one every 100s (rough arithmetic below)
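Back-of-the-envelope arithmetic, assuming a 200 us heartbeat (a value picked from the slide's "hundreds of microseconds" range), shows why per-processor measurements look harmless:

$$\frac{200\,\mu\text{s}}{125\,\text{ms}} \approx 0.16\%\ \text{of CPU time per heartbeat source}$$

This is consistent with the at-most-2.5% per-processor slowdown measured above, yet, as the next slides show, still enough to stall thousands of tightly coupled processors when the hits land at uncorrelated times.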

21 Step 3 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

22 Coscheduling The application is fine-grained and bulk-synchronous, so a delay in any one process slows down the whole app With a large # of processors, almost every iteration contains at least one slowed process (see the arithmetic below) Coscheduling the noise would make the app pay the penalty only once Developed a prototype, but no details or results
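A one-line version of the "at least one slow process" argument, with an illustrative per-process delay probability p that is not taken from the paper: if each of N processes is independently delayed with probability p in a given iteration, then

$$\Pr[\text{iteration delayed}] = 1 - (1-p)^{N}, \qquad 1 - (1 - 0.001)^{4096} \approx 98\%$$

so a delay that afflicts each process only 0.1% of the time afflicts nearly every iteration at 4096 processes.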

23 Discrete-event Simulator Why a simulator? Time on ASCI Q is scarce, and configuration changes are not always practical Event = <F, L, E, P> F: frequency of the event; L: average duration; E: distribution; P: placement Workload: barriers + 1ms computations Validated against measured events (top two curves) Used to predict the performance gain of removing noise: removing noise on node 0, 1 or 31 gives a marginal improvement (15%); removing kernel noise on all nodes improves performance dramatically (simplified sketch below)
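A simplified Monte Carlo sketch of the simulator's core idea, not the paper's full <F, L, E, P> event model: N processes each compute for 1 ms and then synchronize at a barrier, so every iteration costs the maximum per-process time. The hit probability is loosely derived from the heartbeat figures on slide 20 (1 ms granule / 125 ms period = 0.008); the noise duration is illustrative:

#include <stdio.h>
#include <stdlib.h>

#define NPROCS   4096
#define NITERS   10000
#define GRAN_MS  1.0     /* computation granularity per iteration */
#define NOISE_MS 0.3     /* illustrative noise duration (L) */
#define HIT_PROB 0.008   /* per-iteration hit probability: 1ms / 125ms */

int main(void) {
    srand(1);
    double total = 0.0, ideal = NITERS * GRAN_MS;
    for (int it = 0; it < NITERS; it++) {
        double iter_time = GRAN_MS;  /* no process hit: ideal time */
        for (int p = 0; p < NPROCS; p++) {
            if ((double)rand() / RAND_MAX < HIT_PROB) {
                double t = GRAN_MS + NOISE_MS;  /* this process delayed */
                if (t > iter_time) iter_time = t;
            }
        }
        total += iter_time;  /* barrier: slowest process sets the pace */
    }
    printf("slowdown: %.1f%%\n", 100.0 * (total / ideal - 1.0));
    return 0;
}

With these parameters virtually every iteration contains at least one hit, so the whole machine pays the noise duration nearly every iteration, which is the mechanism behind the finding on the next slide.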

24 Discrete-event Simulator Why a simulator? Time on ASCI Q is scarce, and configuration changes are not always practical Event = <F, L, E, P> F: frequency of the event; L: average duration; E: distribution; P: placement Workload: barriers + 1ms computations Validated against measured events (top two curves) Used to predict the performance gain of removing noise: removing noise on node 0, 1 or 31 gives a marginal improvement (15%); removing kernel noise on all nodes improves performance dramatically FINDING #2: On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.

25 Eliminating Noise It is infeasible to remove all the noise: two TruCluster heartbeats run at kernel level and would require substantial kernel modifications Optimizations: removed ten daemons from all nodes; increased the RMS interval from 30s to 60s; moved several TruCluster daemons from nodes 1 and 2 to node 0 Microbenchmarks (barriers + computations of 0, 1 or 5ms) became 2.2 to 13 times faster

26 Step 4 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

27 Optimized SAGE Performance Old curves (top two) vs. new curves: 4-proc, but without nodes 0 & 31 Jan-27-03: 1024-node segment (only up to 3716 processors) May-01-03: full-sized ASCI Q (up to 7680 processors) May-01-03 (min): minimum time over 50 cycles Results: Jan-27-03 and May-01-03 are much improved; May-01-03 (min) closely matches the expected performance, suggesting further optimizations can close the remaining gap

28 Summary Different configurations tested prior to and after noise removal Total processing rate = (# usable proc) * (cells per proc) / (cycle time), with a fixed 13,500 cells per processor and a varied # of usable processors (worked example below) Best observed (???) processing rate is only 15% below the model expectation
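As a worked example of the rate formula, take the 7680-processor figure of 120.6 quoted on slide 33 and assume its units are millions of cell updates per second (the slides leave the units unstated):

$$R = \frac{P \cdot c}{t_{\text{cycle}}} \;\Rightarrow\; t_{\text{cycle}} = \frac{7680 \times 13{,}500}{120.6 \times 10^{6}} \approx 0.86\ \text{s per cycle}$$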

29 Summary Different configurations tested prior to and after noise removal Total processing rate = (# usable proc) * (cells per proc) / (cycle time), with a fixed 13,500 cells per processor and a varied # of usable processors Best observed (???) processing rate is only 15% below the model expectation FINDING #3: We were able to double SAGE's performance by removing noise caused by several types of dæmons, confining dæmons to the cluster manager, and removing the cluster manager and the RMS cluster monitor from each cluster's compute pool.

30 Discussion The computational granularity of an app determines which type of noise hurts it Load-balanced, coarse-grained apps (e.g. LINPACK): long noise dominates; short noise is effectively coscheduled Medium-grained apps (e.g. SAGE): medium noise dominates Fine-grained apps (e.g. deterministic Sn-transport): short noise dominates, since the frequency of long noise is low

31 Discussion The computational granularity of an app determines which type of noise hurts it Load-balanced, coarse-grained apps (e.g. LINPACK): long noise dominates; short noise is effectively coscheduled Medium-grained apps (e.g. SAGE): medium noise dominates Fine-grained apps (e.g. deterministic Sn-transport): short noise dominates FINDING #4: Substantial performance loss occurs when an application resonates with system noise: high-frequency, fine-grained noise affects only fine-grained applications; low-frequency, coarse-grained noise affects only coarse-grained applications.

32 Conclusion Described a figurative journey to improve the performance of a sizable hydrodynamics app, SAGE, on the world's second-fastest supercomputer, ASCI Q Methodologies: first determined how fast the app could potentially run, then developed a methodology to analyze artifacts that degrade app performance yet are not part of the app Doubled the performance of SAGE without modifying a single line of its code The notions of noise and resonance are applicable to other systems and other apps

33 More Discussion What do they mean by "best observed" in Table 3? The processing rate of regular 4-proc using 7680 processors (120.6) is still lower than that of 3-proc with only 6144 processors. The analytical model was constructed manually (Darren Kerbyson et al., SC01), which is enormously labor-intensive.

34 Thanks! Any questions? The Case of the Missing Supercomputer Performance (SC 2003)
