MPI On-node and Large Processor Count Scaling Performance. October 10, 2001 Terry Jones Linda Stanberry Lawrence Livermore National Laboratory



2 Outline
- Scope: presentation aimed at scientific/technical app writers
- Why This Matters
- Halo: results as CPU-intensive threads approach available procs
- MPI_Allreduce: results as CPU-intensive threads approach available procs
- Operational Issues
- Conclusion

3 Why is There So Much Interest in This Anyway?
- Presentations from Tuel and Worley in addition to this talk
- As processor count climbs, scalability of infrastructure becomes more important
- On-node performance is increasingly important
- There are some issues to be aware of

4 Findings from Halo code

5 Halo code
- Simulates the nearest-neighbor exchange of a 1-2 row/column halo from a 2-D array
- Due to Alan Wallcraft
- Common operation for a finite difference ocean model
- Results shown as multiple exchanges for a given tile edge length (2, 4, 8, ..., 1024)
- Up to 896 tasks on NH-2 (56x16 or 64x14)
- Up to 768 tasks on Silver (192x4 or 256x3)
- Uses MPI (timed function is MPI_Sendrecv)
- Code compiled as: mpxlf -v -w -O3 -qarch=auto -qcache=auto -qfloat=hsflt
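The exchange pattern described on this slide can be sketched without MPI. The following is a serial stand-in, not Wallcraft's code: each "task" is a padded tile in a list, and the copy loops play the role that MPI_Sendrecv plays in the timed benchmark. The function names, the periodic wrap-around at the grid edges, and the requirement that the edge length be at least the halo width are all assumptions made for illustration.

```python
# Serial stand-in for a nearest-neighbor halo exchange (illustrative only).
# Each tile has an n x n interior surrounded by a halo of width w.

def make_tile(n, w, fill):
    """Return an (n + 2w) x (n + 2w) tile whose interior is set to `fill`."""
    size = n + 2 * w
    tile = [[0.0] * size for _ in range(size)]
    for i in range(w, w + n):
        for j in range(w, w + n):
            tile[i][j] = fill
    return tile

def exchange_halos(tiles, rows, cols, n, w):
    """Fill each tile's halo from its four neighbors (assumes periodic grid, n >= w)."""
    for r in range(rows):
        for c in range(cols):
            me = tiles[r][c]
            north = tiles[(r - 1) % rows][c]
            south = tiles[(r + 1) % rows][c]
            west = tiles[r][(c - 1) % cols]
            east = tiles[r][(c + 1) % cols]
            for k in range(w):
                for j in range(w, w + n):
                    me[k][j] = north[n + k][j]          # neighbor's last w interior rows
                    me[w + n + k][j] = south[w + k][j]  # neighbor's first w interior rows
                for i in range(w, w + n):
                    me[i][k] = west[i][n + k]           # neighbor's last w interior columns
                    me[i][w + n + k] = east[i][w + k]   # neighbor's first w interior columns
```

Only halo cells are written and only interior cells are read, so the tiles can be updated in place in any order. In the real benchmark each of these copies is a message, which is why the timing is sensitive to how tasks are placed on nodes.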

6 Sample Results

7 Halo: Three of Four Nodes
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: blue.llnl.gov; Dedicated: yes; Date: Aug, 2001; PSSP: PTF 9; AIX: ML8; CPU: 332 MHz 604e; Node: Silver (4-way)

8 Halo: Four of Four Nodes
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: blue.llnl.gov; Dedicated: yes; Date: Aug, 2001; PSSP: PTF 9; AIX: ML8; CPU: 332 MHz 604e; Node: Silver (4-way)

9 Halo: Fourteen of Sixteen Nodes
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: frost.llnl.gov; Dedicated: yes; Date: Aug, 2001; PSSP: 3.3 base; AIX: ML8; CPU: 375 MHz P3; Node: NH-2 (16-way)

10 Halo: Sixteen of Sixteen Nodes
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: frost.llnl.gov; Dedicated: yes; Date: Aug, 2001; PSSP: 3.3 base; AIX: ML8; CPU: 375 MHz P3; Node: NH-2 (16-way)

11 Findings From Halo
Best Observed Timings
- The NHII has only a modest increase in "bandwidth" (halo length = 1024) over the WHII.
- The latency of the two machines is very similar, with both showing a significantly higher latency above a certain node count.
Variability
- Table columns: Node, Min, Avg, Max, Std-dev. Rows: Silver 3TPN, Silver 4TPN, NH-2 14TPN, NH-2 16TPN. (The numeric values did not survive transcription.)
- Fully populated nodes (up to 64 NH2 & 256 Silver) perform significantly slower on average.
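The four variability columns (Min, Avg, Max, Std-dev) can be computed from repeated timings as in the sketch below. The function name and the choice of the population standard deviation are assumptions; the slide does not say which form of standard deviation was used, and no measured values are reproduced here.

```python
import statistics

def summarize(samples):
    """Return min, average, max and population std-dev of a list of timings."""
    return {
        "min": min(samples),
        "avg": statistics.mean(samples),
        "max": max(samples),
        # Population form is an assumption; use statistics.stdev for the sample form.
        "std_dev": statistics.pstdev(samples),
    }
```

A large gap between "min" and "avg", or a large "std_dev", is the signature the slide associates with fully populated nodes.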

12 Findings from LLNL MPI Test code

13 LLNL Collective MPI Benchmark
- Test code developed to investigate MPI on-node and large node count scalability
- All results are the median of three runs
- C program which times a number of MPI ops within a loop; timings by MPI_Wtime()
- Due to Linda Stanberry
- So far, three effects have been evaluated:
  - NH-2 firmware
  - 15TPN vs. 16TPN
  - Priority adjustments
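The measurement scheme on this slide (time an operation inside a loop, report the median of three runs) can be sketched as follows. This is not the LLNL benchmark, which is a C program using MPI_Wtime(); here `time.perf_counter` stands in for the clock, `op` stands in for the MPI call, and both function names are hypothetical.

```python
import statistics
import time

def time_op(op, iterations, clock=time.perf_counter):
    """Return the mean time per call of `op` over `iterations` calls."""
    start = clock()
    for _ in range(iterations):
        op()
    return (clock() - start) / iterations

def median_of_runs(op, iterations, runs=3, clock=time.perf_counter):
    """Repeat the timing loop `runs` times and return the median per-call time."""
    return statistics.median(time_op(op, iterations, clock) for _ in range(runs))
```

Taking the median rather than the mean keeps a single disturbed run (for example, one that collided with a system daemon) from dominating the reported number.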

14 Effect of Firmware Fix
[Figure: Before and After - 15 TPN. Series: default - before; best - before; default - after; best - after; Log. (best - after); Poly. (best - after). X axis: # tasks.]
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: frost.llnl.gov; Dedicated: yes; Date: Oct, 2001; PSSP: 3.3 base; AIX: ML8; CPU: 375 MHz P3; Node: NH-2 (16-way)

15 Effect of 15TPN -vs- 16TPN
[Figure: Allreduce - 15 vs 16 TPN. Series: 15 TPN; 16 TPN. X axis: # tasks.]
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: white.llnl.gov; Dedicated: yes; Date: Oct, 2001; PSSP: 3.3 base; AIX: ML8; CPU: 375 MHz P3; Node: NH-2 (16-way)

16 Effect of Priority Adjustments
[Figure: Allreduce - 15 TPN. Series: priority 30; default priority; Log. (priority 30). Annotation: 40% 4096 tasks. X axis: # of tasks.]
Settings: MP_SHARED_MEMORY: YES; TASK LAYOUT: CONSECUTIVE; TASKS BOUND?: NO; DEDICATED NODE?: YES; CLOCK USED: MPI_Wtime
System: Machine: white.llnl.gov; Dedicated: yes; Date: Oct, 2001; PSSP: 3.3 base; AIX: ML8; CPU: 375 MHz P3; Node: NH-2 (16-way)

17 Fitting Performance to a Log curve under optimal conditions
[Figure, two panels: "Fitting to Log Curve (960 tasks)" with series actual data, logarithmic trendline, polynomial trendline; "Fitting to Log Curve (4096 Tasks)" with series priority 30, Log. (priority 30). X axis: # of tasks.]
Conditions: 15 TPN, Priority=30, Dedicated Use
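A logarithmic trendline of the kind named on this slide can be obtained by ordinary least squares on log2 of the task count. The sketch below assumes the model t = a + b * log2(tasks); the slides do not state the exact form or the tool used for the fit, and the function names are hypothetical. A log model is a natural candidate because tree-based reductions typically take a number of steps proportional to log2 of the task count.

```python
import math

def fit_log(tasks, times):
    """Least-squares fit of times ~ a + b * log2(tasks); returns (a, b)."""
    xs = [math.log2(p) for p in tasks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, times))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, tasks):
    """Evaluate the fitted curve at a given task count."""
    return a + b * math.log2(tasks)
```

Fitting on the low task counts and comparing `predict` against measurements at high task counts is one way to quantify how far the data departs from log scaling.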

18 Findings from LLNL MPI_Allreduce testing
- Significant performance improvement with priority adjustments
- Choice of 15TPN or 16TPN depends on number of nodes on a 16-way SMP -- larger node counts favor 16TPN.
- Data appears to deviate from the log curve fit for high numbers of tasks. The log curve fit does work well for low task counts.
- Obtaining good TPN and large node-count performance requires careful setup
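The 15TPN versus 16TPN trade-off has a simple arithmetic side: running one task fewer per node leaves a CPU free on each node but needs more nodes for the same job. The sketch below only does that bookkeeping; the function names are hypothetical, and the idea that the spare CPU absorbs system activity is an interpretation of the slides, not a stated result.

```python
def nodes_needed(tasks, tpn):
    """Number of nodes required to place `tasks` at `tpn` tasks per node."""
    return -(-tasks // tpn)  # ceiling division

def layout(tasks, tpn, cpus_per_node=16):
    """Summarize node usage for a job on `cpus_per_node`-way SMP nodes."""
    nodes = nodes_needed(tasks, tpn)
    return {
        "nodes": nodes,
        "spare_cpus_per_node": cpus_per_node - tpn,
        "total_cpus": nodes * cpus_per_node,
    }
```

For example, 960 tasks fit on 64 nodes at 15TPN and on 60 nodes at 16TPN, while 4096 tasks need 274 nodes at 15TPN against 256 at 16TPN.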

19 Operational Issues

20 Operational Issues
- Have the firmware fix applied for NH2 nodes
  - Discovered by Bill Tuel
  - Not obvious
- IBM has suggested several set-up techniques
  - Increase priority of application
  - Avoid MPI_Allreduce() and MPI_Barrier()
  - Use processor binding

21 Set-up Technique #1: Priority Adjustment
Description: Give the application a favorable priority (priority=59 preempts cron jobs, priority=30 preempts all system tasks)
Implemented by:
- Manual adjustment by sysadmin intervention
- Manual adjustment by setuid script
- Automatic adjustment based on user account (/etc/poe.priority)
Pros:
- Significant performance gains demonstrated
Cons:
- Manual adjustment is too burdensome
- Automatic adjustment is unacceptable as implemented
  - No way to keep users from running with priority=30 and 16TPN. (Causes nodes to go comatose; can't run shutdown, must crash by power cycle.)
  - Current /etc/poe.priority doesn't permit wildcards
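The hazard listed under Cons (priority=30 combined with 16TPN on a 16-way node) is the kind of condition a wrapper script could refuse before launching a job. The check below is a hypothetical sketch, not part of POE or /etc/poe.priority. It assumes, as the slide implies, that a lower priority value is more favored, and it treats 30 as the threshold only because that is the value the slide names.

```python
def is_risky(priority, tpn, cpus_per_node=16, system_priority=30):
    """True if the job could starve system tasks on every CPU of a node.

    Assumption: lower numeric priority is more favored, and a priority at or
    below `system_priority` preempts all system tasks.
    """
    return priority <= system_priority and tpn >= cpus_per_node
```

Under these assumptions `is_risky(30, 16)` is true, while `is_risky(30, 15)` and `is_risky(59, 16)` are false, matching the combinations the slides describe.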

22 Set-up Technique #2: Avoid MPI_Allreduce()
Description: Avoid MPI_Allreduce() and MPI_Barrier()
Implemented by:
- Rewriting the application to be data-flow parallel, removing all barrier synchronizations
Pros:
- Performance problems corrected by avoiding problematic functionality
Cons:
- Difficult to do for step-wise type simulations
- MPI_Allreduce() is a valid part of the MPI specification

23 Set-up Technique #3: Use Processor Binding
Description: Bind separate tasks of the parallel application to separate processors within an SMP
Implemented by:
- Calling a system function: bindprocessor(BINDTHREAD, thread_id, cpu_id)
Pros:
- Believed to provide some benefits through cache reuse
Cons:
- Bill Tuel reports minimal impact
- Often, threads are implicitly handled by OpenMP directives. The app writer doesn't always know when the system initiates CPU-intensive threads, which makes coordination difficult.
- Any benefits could easily be superseded by
- Discouraged by IBM's own documentation (seen as a tool for kernel programmers and not recommended for ordinary use):
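With the consecutive task layout used throughout these measurements, the CPU a task would be bound to follows directly from its rank. The sketch below computes only that mapping and does not call bindprocessor; the function name is hypothetical, and the assumption that task k on a node is bound to CPU k is one simple policy among several.

```python
def placement(rank, tpn):
    """Return (node_index, cpu_id) for `rank` under a consecutive layout.

    Assumes ranks fill node 0 first, then node 1, and that the k-th task on
    a node is bound to CPU k.
    """
    return divmod(rank, tpn)
```

The resulting `cpu_id` is the value an application might pass as the third argument of the binding call, one task per processor.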

24 What's Next?
We want...
- Understandable Performance: log-scaling, efficient use of the available hardware resources
- Understandable Environment: app writers should be able to easily tell why their codes do not run faster
Done:
- NH-2 firmware fix is available
- Hats_nim appears to be overly active with default settings
Ongoing:
- LLNL is actively working with IBM to improve performance and understanding
- Investigate system daemon contention, hats_nim setting, MP_Pulse setting, ...
- MPI_Allreduce() and MPI_Barrier() measurements

25 and in conclusion
Further Info
- Halo code due to Alan Wallcraft
- Patrick Worley's results:
- Message Passing Interface Forum, MPI-2: A Message Passing Interface Standard, Standards Document 2.0, University of Tennessee, Knoxville, July
Acknowledgements
- Alan Wallcraft, Naval Research Lab
- Chris Chambreau, Robin Goldstone, Lawrence Livermore National Lab
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng


More information

Sami Saarinen Peter Towers. 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1

Sami Saarinen Peter Towers. 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1 Acknowledgements: Petra Kogel Sami Saarinen Peter Towers 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1 Motivation Opteron and P690+ clusters MPI communications IFS Forecast Model IFS 4D-Var

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

A Global Operating System for HPC Clusters

A Global Operating System for HPC Clusters A Global Operating System Emiliano Betti 1 Marco Cesati 1 Roberto Gioiosa 2 Francesco Piermaria 1 1 System Programming Research Group, University of Rome Tor Vergata 2 BlueGene Software Division, IBM TJ

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Improving Per Processor Memory Use of ns-3 to Enable Large Scale Simulations

Improving Per Processor Memory Use of ns-3 to Enable Large Scale Simulations Improving Per Processor Memory Use of ns-3 to Enable Large Scale Simulations WNS3 2015, Castelldefels (Barcelona), Spain May 13, 2015 Steven Smith, David R. Jefferson Peter D. Barnes, Jr, Sergei Nikolaev

More information

Programming for Fujitsu Supercomputers

Programming for Fujitsu Supercomputers Programming for Fujitsu Supercomputers Koh Hotta The Next Generation Technical Computing Fujitsu Limited To Programmers who are busy on their own research, Fujitsu provides environments for Parallel Programming

More information

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied

More information

Early Experiences with the Naval Research Laboratory XD1. Wendell Anderson Dr. Robert Rosenberg Dr. Marco Lanzagorta Dr.

Early Experiences with the Naval Research Laboratory XD1. Wendell Anderson Dr. Robert Rosenberg Dr. Marco Lanzagorta Dr. Your corporate logo here Early Experiences with the Naval Research Laboratory XD1 Wendell Anderson Dr. Robert Rosenberg Dr. Marco Lanzagorta Dr. Jeanie Osburn Who we are Naval Research Laboratory Navy

More information

<Insert Picture Here>

<Insert Picture Here> The Other HPC: Profiling Enterprise-scale Applications Marty Itzkowitz Senior Principal SW Engineer, Oracle marty.itzkowitz@oracle.com Agenda HPC Applications

More information

What SMT can do for You. John Hague, IBM Consultant Oct 06

What SMT can do for You. John Hague, IBM Consultant Oct 06 What SMT can do for ou John Hague, IBM Consultant Oct 06 100.000 European Centre for Medium Range Weather Forecasting (ECMWF): Growth in HPC performance 10.000 teraflops sustained 1.000 0.100 0.010 VPP700

More information

NUMA-Aware Shared-Memory Collective Communication for MPI

NUMA-Aware Shared-Memory Collective Communication for MPI NUMA-Aware Shared-Memory Collective Communication for MPI Shigang Li Torsten Hoefler Marc Snir Presented By: Shafayat Rahman Motivation Number of cores per node keeps increasing So it becomes important

More information

Real-time scheduling for virtual machines in SK Telecom

Real-time scheduling for virtual machines in SK Telecom Real-time scheduling for virtual machines in SK Telecom Eunkyu Byun Cloud Computing Lab., SK Telecom Sponsored by: & & Cloud by Virtualization in SKT Provide virtualized ICT infra to customers like Amazon

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model

Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model Lai Wei, Ignacio Laguna, Dong H. Ahn Matthew P. LeGendre, Gregory L. Lee This work was performed under the auspices of the

More information

Optimize HPC - Application Efficiency on Many Core Systems

Optimize HPC - Application Efficiency on Many Core Systems Meet the experts Optimize HPC - Application Efficiency on Many Core Systems 2018 Arm Limited Florent Lebeau 27 March 2018 2 2018 Arm Limited Speedup Multithreading and scalability I wrote my program to

More information

Real-Time Operating Systems Issues. Realtime Scheduling in SunOS 5.0

Real-Time Operating Systems Issues. Realtime Scheduling in SunOS 5.0 Real-Time Operating Systems Issues Example of a real-time capable OS: Solaris. S. Khanna, M. Sebree, J.Zolnowsky. Realtime Scheduling in SunOS 5.0. USENIX - Winter 92. Problems with the design of general-purpose

More information

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Kent Milfeld, Avijit Purkayastha, Chona Guiang Texas Advanced Computing Center The University of Texas Austin, Texas USA Abstract

More information

Extending scalability of the community atmosphere model

Extending scalability of the community atmosphere model Journal of Physics: Conference Series Extending scalability of the community atmosphere model To cite this article: A Mirin and P Worley 2007 J. Phys.: Conf. Ser. 78 012082 Recent citations - Evaluation

More information

Platform Choices for LS-DYNA

Platform Choices for LS-DYNA Platform Choices for LS-DYNA Manfred Willem and Lee Fisher High Performance Computing Division, HP lee.fisher@hp.com October, 2004 Public Benchmarks for LS-DYNA www.topcrunch.org administered by University

More information

Exercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing

Exercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous

More information

Optimization of MPI Applications Rolf Rabenseifner

Optimization of MPI Applications Rolf Rabenseifner Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

June IBM Power Academy. IBM PowerVM memory virtualization. Luca Comparini STG Lab Services Europe IBM FR. June,13 th Dubai

June IBM Power Academy. IBM PowerVM memory virtualization. Luca Comparini STG Lab Services Europe IBM FR. June,13 th Dubai June 2012 @Dubai IBM Power Academy IBM PowerVM memory virtualization Luca Comparini STG Lab Services Europe IBM FR June,13 th 2012 @IBM Dubai Agenda How paging works Active Memory Sharing Active Memory

More information

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming

More information

1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor

1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor CS6801-MULTICORE ARCHECTURES AND PROGRAMMING UN I 1. Difference between Symmetric Memory Architecture and Distributed Memory Architecture. 2. What is Vector Instruction? 3. What are the factor to increasing

More information