MPI On-node and Large Processor Count Scaling Performance. October 10, 2001 Terry Jones Linda Stanberry Lawrence Livermore National Laboratory


1 MPI On-node and Large Processor Count Scaling Performance October 10, 2001 Terry Jones Linda Stanberry Lawrence Livermore National Laboratory

2 Outline
- Scope: presentation aimed at scientific/technical app writers
- Why This Matters
- Halo results as CPU-intensive threads approach available procs
- MPI_Allreduce results as CPU-intensive threads approach available procs
- Operational Issues
- Conclusion

3 Why is There So Much Interest in This Anyway?
- Presentations from Tuel and Worley in addition to this talk
- As processor count climbs, scalability of infrastructure becomes more important
- On-node performance is increasingly important
- There are some issues to be aware of

4 Findings from Halo code

5 Halo code
- Simulates the nearest-neighbor exchange of a 1-2 row/column halo from a 2-D array. Due to Alan Wallcraft.
- Common operation for a finite-difference ocean model
- Results shown as multiple exchanges for a given tile edge length (2, 4, 8, ..., 1024)
- Up to 896 tasks on NH-2 (56x16 or 64x14)
- Up to 768 tasks on Silver (192x4 or 256x3)
- Uses MPI (timed function is MPI_Sendrecv)
- Code compiled as: mpxlf -v -w -O3 -qarch=auto -qcache=auto -qfloat=hsflt
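The exchange pattern described on this slide can be illustrated with a small serial sketch. This is not the Halo code and it does not call MPI: it is a simplified 1-D, periodic stand-in in which each tile is a list laid out as ghost cells, interior cells, ghost cells. The function name `exchange_halos` and the tile layout are assumptions made for illustration only.

```python
def exchange_halos(tiles, halo=1):
    """Fill each tile's ghost cells from its periodic left/right neighbors.

    Each tile is a list laid out as [ghost | interior | ghost], with `halo`
    ghost cells at each end and at least `halo` interior cells. This is a
    serial stand-in for a paired send/receive between neighboring tasks.
    """
    n = len(tiles)
    # Snapshot the interior edges first so every "send" uses pre-exchange data.
    left_edges = [t[halo:2 * halo] for t in tiles]
    right_edges = [t[len(t) - 2 * halo:len(t) - halo] for t in tiles]
    for i, t in enumerate(tiles):
        t[:halo] = right_edges[(i - 1) % n]           # data from the left neighbor
        t[len(t) - halo:] = left_edges[(i + 1) % n]   # data from the right neighbor
    return tiles


# Three tiles, two interior cells each, one ghost cell per side (synthetic data).
tiles = [[0, 1, 2, 0], [0, 3, 4, 0], [0, 5, 6, 0]]
exchange_halos(tiles)
print(tiles)  # [[6, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 1]]
```

In the real benchmark each tile would live in a separate MPI task and the two copies per tile would be carried by MPI_Sendrecv calls, which is the operation being timed.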

6 Sample Results

7 Halo: Three of Four Nodes
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: blue.llnl.gov
- Dedicated: yes
- Date: Aug, 2001
- PSSP: PTF 9
- AIX: ML8
- CPU: 332 MHz 604e
- Node: Silver (4-way)

8 Halo: Four of Four Nodes
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: blue.llnl.gov
- Dedicated: yes
- Date: Aug, 2001
- PSSP: PTF 9
- AIX: ML8
- CPU: 332 MHz 604e
- Node: Silver (4-way)

9 Halo: Fourteen of Sixteen Nodes
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: frost.llnl.gov
- Dedicated: yes
- Date: Aug, 2001
- PSSP: 3.3 base
- AIX: ML8
- CPU: 375 MHz P3
- Node: NH-2 (16-way)

10 Halo: Sixteen of Sixteen Nodes
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: frost.llnl.gov
- Dedicated: yes
- Date: Aug, 2001
- PSSP: 3.3 base
- AIX: ML8
- CPU: 375 MHz P3
- Node: NH-2 (16-way)

11 Findings From Halo
Best Observed Timings:
- The NHII has only a modest increase in "bandwidth" (halo length = 1024) over the WHII.
- The latency of the two machines is very similar, with both showing a significantly higher latency above a certain node count.
Variability:
- Table columns: Node, Min, Avg, Max, Std-dev
- Table rows: Silver 3TPN, Silver 4TPN, NH-2 14TPN, NH-2 16TPN (numeric values missing in source)
- Fully populated nodes (up to 64 NH2 & 256 Silver) perform significantly slower on average
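The variability columns on this slide (Min, Avg, Max, Std-dev) are ordinary summary statistics over repeated timings. A minimal sketch of how such a row could be computed follows. The function name `summarize` and the sample timings are hypothetical, and the choice of the sample standard deviation is an assumption, since the slide does not say which form was used.

```python
import statistics


def summarize(timings):
    """Return the four variability columns for one list of timings."""
    return {
        "min": min(timings),
        "avg": statistics.mean(timings),
        "max": max(timings),
        # Sample standard deviation; the slide does not say which form was used.
        "std_dev": statistics.stdev(timings),
    }


# Synthetic timings in seconds, for illustration only (not measured data).
row = summarize([1.0, 2.0, 3.0])
print(row)
```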

12 Findings from LLNL MPI Test code

13 LLNL Collective MPI Benchmark
- Test code developed to investigate MPI on-node and large node count scalability
- All results are the median of three runs
- C program which times a number of MPI ops within a loop. Timings by MPI_Wtime()
- Due to Linda Stanberry
- So far, three effects have been evaluated:
  - NH-2 firmware
  - 15TPN -vs- 16TPN
  - Priority adjustments
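The measurement method on this slide (time an operation inside a loop, repeat the run three times, report the median) can be sketched as follows. This is not the LLNL benchmark, which is a C program using MPI_Wtime(); here `time.perf_counter` stands in for the clock, and the names `time_loop` and `median_of_runs` are hypothetical.

```python
import statistics
import time


def time_loop(op, iterations):
    """Average wall-clock time per call of `op` over a timed loop."""
    start = time.perf_counter()   # stand-in for MPI_Wtime()
    for _ in range(iterations):
        op()
    return (time.perf_counter() - start) / iterations


def median_of_runs(op, iterations=1000, runs=3):
    """Repeat the timed loop `runs` times and report the median result."""
    return statistics.median(time_loop(op, iterations) for _ in range(runs))


# Example: time a trivial stand-in operation instead of an MPI collective.
print(median_of_runs(lambda: sum(range(100))))
```

Reporting the median rather than the mean keeps a single disturbed run from skewing the result.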

14 Effect of Firmware Fix
Chart: Before and After - 15 TPN
Legend: default - before; best - before; default - after; best - after; Log. (best - after); Poly. (best - after)
X-axis: # tasks
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: frost.llnl.gov
- Dedicated: yes
- Date: Oct, 2001
- PSSP: 3.3 base
- AIX: ML8
- CPU: 375 MHz P3
- Node: NH-2 (16-way)

15 Effect of 15TPN -vs- 16TPN
Chart: Allreduce - 15 vs 16 TPN
Legend: 15 TPN; 16 TPN
X-axis: # tasks
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: white.llnl.gov
- Dedicated: yes
- Date: Oct, 2001
- PSSP: 3.3 base
- AIX: ML8
- CPU: 375 MHz P3
- Node: NH-2 (16-way)

16 Effect of Priority Adjustments
Chart: Allreduce - 15 TPN
Legend: priority 30; default priority; Log. (priority 30)
Chart annotation: "40% 4096 tasks"
X-axis: # of tasks
- MP_SHARED_MEMORY: YES
- TASK LAYOUT: CONSECUTIVE
- TASKS BOUND?: NO
- DEDICATED NODE?: YES
- CLOCK USED: MPI_Wtime
- Machine: white.llnl.gov
- Dedicated: yes
- Date: Oct, 2001
- PSSP: 3.3 base
- AIX: ML8
- CPU: 375 MHz P3
- Node: NH-2 (16-way)

17 Fitting Performance to a Log curve under optimal conditions
Chart panels: Fitting to Log Curve (960 tasks); Fitting to Log Curve (4096 Tasks)
Legend entries: actual data; logarithmic trendline; polynomial trendline; priority 30; Log. (priority 30)
X-axis: # of tasks
Conditions: 15 TPN, Priority=30, Dedicated Use

18 Findings from LLNL MPI_Allreduce testing
- Significant performance improvement with priority adjustments
- Choice of 15TPN or 16TPN depends on number of nodes on a 16-way SMP -- larger node counts favor 16TPN.
- Data appears to deviate from the log curve fit for high numbers of tasks. The log curve fit does work well for low task counts.
- Obtaining good TPN and large node-count performance requires careful setup
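The log curve fit referred to here can be reproduced with an ordinary least-squares fit of time against the logarithm of the task count. The sketch below assumes the model time = a + b * log2(tasks); the slides do not state the exact form of the trendline, the function names are hypothetical, and the data are synthetic values chosen to make the arithmetic easy to follow.

```python
import math


def fit_log_curve(task_counts, times):
    """Least-squares fit of times ~ a + b * log2(tasks); returns (a, b)."""
    xs = [math.log2(p) for p in task_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, times))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def relative_deviation(task_count, measured, a, b):
    """Fractional gap between a measurement and the fitted log curve."""
    predicted = a + b * math.log2(task_count)
    return (measured - predicted) / predicted


# Synthetic data (not measurements): fit on low task counts only.
a, b = fit_log_curve([2, 4, 8, 16], [3.0, 5.0, 7.0, 9.0])
# Then check how far a hypothetical high-task-count timing sits above the fit.
print(a, b, relative_deviation(32, 14.0, a, b))
```

Fitting on the low task counts and then measuring the deviation at higher counts mirrors the observation on the slide that the fit works well at low task counts and breaks down as the count grows.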

19 Operational Issues

20 Operational Issues
- Have the firmware fix applied for NH2 nodes
  - Discovered by Bill Tuel
  - Not obvious
- IBM has suggested several set-up techniques
  - Increase priority of application
  - Avoid MPI_Allreduce() and MPI_Barrier()
  - Use processor binding

21 Set-up Technique #1: Priority Adjustment
Description: Give the application a favorable priority (priority=59 preempts cron jobs, priority=30 preempts all system tasks)
Implemented by:
- Manual adjustment by sysadmin intervention
- Manual adjustment by setuid script
- Automatic adjustment based on user account (/etc/poe.priority)
Pros:
- Significant performance gains demonstrated
Cons:
- Manual adjustment is too burdensome
- Automatic adjustment is unacceptable as implemented
  - No way to keep users from running with priority=30 and 16TPN. (Causes nodes to go comatose; can't run shutdown, must crash by power cycle.)
  - Current /etc/poe.priority doesn't permit wildcards
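The hazardous combination named above (priority=30 on a fully populated 16-way node) is simple to state as a check. The sketch below is purely illustrative: it is not part of POE or /etc/poe.priority, the function name and parameters are hypothetical, and it assumes that numerically lower priority values are more favored, which is consistent with the slide's description of 59 and 30.

```python
def is_allowed(priority, tasks_per_node, cpus_per_node=16,
               system_preempt_priority=30):
    """Return False for the combination the slide warns about.

    Assumption: numerically lower priority values are more favored, so any
    value at or below `system_preempt_priority` preempts all system tasks.
    """
    preempts_system_tasks = priority <= system_preempt_priority
    node_fully_populated = tasks_per_node >= cpus_per_node
    # Refuse jobs that leave no CPU for system tasks and also preempt them.
    return not (preempts_system_tasks and node_fully_populated)


print(is_allowed(30, 16))  # False: the case reported to make nodes comatose
print(is_allowed(30, 15))  # True: one CPU per node is left for system tasks
print(is_allowed(59, 16))  # True: system tasks are not preempted
```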

22 Set-up Technique #2: Avoid MPI_Allreduce()
Description: Avoid MPI_Allreduce() and MPI_Barrier()
Implemented by:
- Re-writing the application to be data-flow parallel, removing all barrier synchronizations
Pros:
- Performance problems corrected by avoiding problematic functionality
Cons:
- Difficult to do for step-wise type simulations
- MPI_Allreduce() is a valid part of the MPI specification

23 Set-up Technique #3: Use Processor Binding
Description: Bind separate tasks of the parallel application to separate processors within an SMP
Implemented by:
- Calling a system function: bindprocessor(bindthread, thread_id, cpu_id)
Pros:
- Believed to provide some benefits through cache reuse
Cons:
- Bill Tuel reports minimal impact
- Often, threads are implicitly handled by OpenMP directives. The app writer doesn't always know when the system initiates CPU-intensive threads, thereby making coordination difficult.
- Any benefits could easily be superseded by ...
- Discouraged by IBM's own documentation (seen as tool for kernel programmers and not recommended for ordinary use):

24 What's Next?
We want...
- Understandable Performance: log-scaling, efficient use of the available hardware resources
- Understandable Environment: app writers should be able to easily tell why their codes do not run faster
Done:
- NH-2 firmware fix is available
- Hats_nim appears to be overly active with default settings
Ongoing:
- LLNL is actively working with IBM to improve performance and understanding.
- Investigate system daemon contention, hats_nim setting, MP_Pulse setting, ...
- MPI_Allreduce() and MPI_Barrier() measurements

25 and in conclusion
Further Info:
- Halo code due to Alan Wallcraft
- Patrick Worley's results:
- Message Passing Interface Forum, MPI-2: A Message Passing Interface Standard, Standards Document 2.0, University of Tennessee, Knoxville, July
Acknowledgements:
- Alan Wallcraft, Naval Research Lab
- Chris Chambreau, Robin Goldstone, Lawrence Livermore National Lab
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng
