The Case of the Missing Supercomputer Performance

1 The Case of the Missing Supercomputer Performance Achieving Optimal Performance on the 8192 Processors of ASCI Q Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab) Presented by Jiahua He

2 Skeleton of the Story Machine: ASCI Q (second on the Top500 list) 2048 Alpha SMP nodes with 4 processors per node, interconnected with the Quadrics QsNet network Application: SAGE, a compressible Eulerian hydrodynamics program of 150,000 lines of Fortran MPI code Beginning: a serious but previously undetected problem Techniques: Measurement to determine real performance Analytical model to predict expected performance Microbenchmarks to identify the problem source Simulator to examine what-if scenarios Result: a factor of 2 improvement in application performance

3 Steps 1. Performance expectation: use an analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

4 Step 1 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

5 Performance Expectation Model (Darren Kerbyson et al., SC01) Validated on many large-scale systems, including all ASCI systems Typical prediction error of less than 10% Terms QA: first 4096-processor segment QB: second 4096-processor segment Weak scaling: fix the per-node problem size and scale the # of processors

6 Performance Expectation Model (Darren Kerbyson et al., SC01) Validated on many large-scale systems, including all ASCI systems Typical prediction error of less than 10% Terms QA: first 4096-processor segment QB: second 4096-processor segment Weak scaling: fix the per-node problem size and scale the # of processors MYSTERY #1: SAGE performs significantly worse on ASCI Q than was predicted by our performance model.

7 Different # of Processors per Node Is the model accurate? n-proc: using n processors per node The only significant difference occurs with 4-proc, giving confidence to the model and localizing the problem to the 4-proc configuration 3-proc outperforms 4-proc when using more than 256 nodes 2-proc outperforms 4-proc when using more than 512 nodes

8 Performance Variability A constant amount of work in each cycle should take a constant amount of time, yet cycle time varies from 0.7s to 3.0s: a factor of 4 in variability

9 Breakdown of Cycle Time cycle = computation + local boundary exchange + collective communication Local boundary exchanges (get, put): plateau above 500 processors, matching the model prediction Collective communications (allreduce, reduction, broadcast): increase with the # of processors, even though the number and payload size of allreduce operations stay constant The difference between allreduce and reduction/broadcast is simply their frequency of occurrence (a sketch of the cycle-time structure follows below)
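The slides do not reproduce the model's equations; the following is a minimal sketch of the cycle-time structure they describe, with symbol names of my own choosing (this is not the paper's notation):

$$T_{\text{cycle}}(P) \approx T_{\text{comp}} + T_{\text{boundary}}(P) + T_{\text{coll}}(P)$$

Under weak scaling, $T_{\text{comp}}$ is constant and $T_{\text{boundary}}(P)$ plateaus once the boundary surface stops growing (above roughly 500 processors here), so any further growth of measured cycle time with $P$ must come from $T_{\text{coll}}(P)$.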

10 Observations Summary Significant difference between expected and observed performance, but only with 4-proc High variability Source of the performance deficit: collective operations, especially allreduce Deduction: improve the performance of allreduce, especially when using four processors per node

11 Step 2 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

12 Investigating allreduce allreduce latency 4-proc: 3ms Others: less than 0.3ms Synthetic parallel benchmark: alternately compute for 0, 1 or 5 ms, then perform either an allreduce or a barrier (see the sketch below) An ideally scalable system would show logarithmic growth with the # of nodes and insensitivity to computational granularity Result: not scalable
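A minimal sketch of such a synthetic benchmark, assuming an MPI_Allreduce on a single double and a spin loop standing in for computation; the iteration count and timing methodology are my assumptions, not the authors' code:

/* Each rank computes for a fixed granularity, then enters an
 * MPI_Allreduce; the loop is timed to expose collective latency. */
#include <mpi.h>
#include <stdio.h>

static void compute_for(double seconds) {
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds)
        ;  /* spin to emulate computation of the given granularity */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const double granularities[] = {0.0, 0.001, 0.005}; /* 0, 1, 5 ms */
    const int iterations = 1000;

    for (int g = 0; g < 3; g++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iterations; i++) {
            compute_for(granularities[g]);
            double in = (double)rank, out;
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }
        double elapsed = MPI_Wtime() - t0;
        if (rank == 0)
            printf("granularity %.0f ms: %.3f us/iteration\n",
                   granularities[g] * 1e3, 1e6 * elapsed / iterations);
    }
    MPI_Finalize();
    return 0;
}

On an ideally scalable machine the reported per-iteration cost would grow only logarithmically with node count and would not depend on which granularity is chosen.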

13 Optimization Optimizing allreduce Waiting strategies: always polling, or blocking after a limited spin time (100us, determined empirically; see the sketch below) Result: allreduce latency improved by a factor of 7 Expectation: at 4096 processors SAGE spends 51% of its time in allreduce, so a 78% performance gain Measurement result: only a marginal improvement in application performance
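A hedged sketch of the spin-then-block waiting strategy the slide mentions: poll for up to a fixed spin budget, then fall back to a blocking wait so the processor is released. The two extern functions are hypothetical placeholders, not Quadrics or MPI APIs:

#include <stdbool.h>
#include <time.h>

#define SPIN_US 100  /* spin budget; 100 us was determined empirically */

extern bool event_arrived(void);        /* hypothetical: poll completion flag */
extern void wait_event_blocking(void);  /* hypothetical: sleep until event */

static long elapsed_us(const struct timespec *t0) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - t0->tv_sec) * 1000000L +
           (now.tv_nsec - t0->tv_nsec) / 1000L;
}

void wait_for_event(void) {
    struct timespec t0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (!event_arrived()) {
        if (elapsed_us(&t0) > SPIN_US) {  /* spin budget exhausted */
            wait_event_blocking();        /* yield the processor */
            return;
        }
    }
}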

14 Optimization Optimizing allreduce Waiting strategies: always polling, or blocking after a limited spin time (100us, determined empirically) Result: allreduce latency improved by a factor of 7 Expectation: at 4096 processors SAGE spends 51% of its time in allreduce, so a 78% performance gain Measurement result: only a marginal improvement in application performance MYSTERY #2: Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.

15 Analyzing Noise The culprit is neither MPI nor the network, but the node itself: periodic system activities (noise) With 4-proc there is no spare processor to absorb the noise (Fig. 3, 6), and processes block in allreduce while waiting for delayed peers Benchmark: synthetic 1000s of computation per processor, with no communication; max slowdown from noise: only 2.5% Refined benchmark: 1 million 1ms iterations per processor, matching the granularity pattern of LANL codes (see the sketch below); similar result
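A minimal sketch of the refined noise microbenchmark the slide describes: each processor runs one million nominally 1 ms work quanta and compares total elapsed time against the ideal, so any slowdown is attributable to interference. The calibration step is only indicated, and the loop count is a placeholder:

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000
#define QUANTUM_S  0.001   /* 1 ms of nominal work per iteration */

static volatile double sink;  /* defeats optimization of the work loop */

static void do_work(long loop_count) {
    for (long i = 0; i < loop_count; i++)
        sink += (double)i * 1e-9;  /* fixed amount of arithmetic */
}

int main(void) {
    /* loop_count must first be calibrated so that do_work() takes
     * ~1 ms on an idle processor; this fixed value is a stand-in. */
    long loop_count = 500000;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERATIONS; i++)
        do_work(loop_count);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double ideal = ITERATIONS * QUANTUM_S;  /* 1000 s if calibrated */
    printf("elapsed %.1f s, ideal %.1f s, slowdown %.2f%%\n",
           elapsed, ideal, 100.0 * (elapsed / ideal - 1.0));
    return 0;
}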

16 Analyzing Noise The culprit is neither MPI nor the network, but the node itself: periodic system activities (noise) With 4-proc there is no spare processor to absorb the noise (Fig. 3, 6), and processes block in allreduce while waiting for delayed peers Benchmark: synthetic 1000s of computation per processor; max slowdown only 2.5% Refined benchmark: 1 million 1ms iterations per processor, matching the granularity pattern of LANL codes; similar result MYSTERY #3: Although the noise hypothesis could explain SAGE's suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.

17 Node Aggregation Aggregating measurements per node exposes structure in what appears to be uncorrelated noise on a per-processor basis Important observation: a regular pattern across nodes; each cluster (32 nodes) contains a few noisier nodes Zooming into a cluster: Node 0 is the cluster manager, Node 1 is the quorum node, Node 31 runs the RMS cluster monitor

18 Node Aggregation Aggregating measurements per node exposes structure in what appears to be uncorrelated noise on a per-processor basis Important observation: a regular pattern across nodes; each cluster (32 nodes) contains a few noisier nodes Zooming into a cluster: Node 0 is the cluster manager, Node 1 is the quorum node, Node 31 runs the RMS cluster monitor FINDING #1: Analyzing noise on a per-node basis instead of a per-processor basis reveals a regular structure across nodes.

19 Noise Events (figure slide)

20 Kernel Sources of Noise Distributed heartbeat, generated at kernel level: lightweight (hundreds of microseconds) but high frequency (one every 125ms) RMS daemons (Quadrics Resource Management System): one every 30s TruCluster daemons (HP cluster management software): about one every 100s (rough arithmetic below)
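Back-of-the-envelope arithmetic, assuming a 200 us heartbeat (a value picked from the slide's "hundreds of microseconds" range), shows why per-processor measurements look harmless:

$$\frac{200\,\mu\text{s}}{125\,\text{ms}} \approx 0.16\%\ \text{of CPU time per heartbeat source}$$

This is consistent with the at-most-2.5% per-processor slowdown measured above, yet, as the next slides show, still enough to stall thousands of tightly coupled processors when the hits land at uncorrelated times.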

21 Step 3 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

22 Coscheduling The application is fine-grained and bulk-synchronous, so a delay in any one process slows down the whole app With a large # of processors, almost every iteration contains at least one slowed process (see the arithmetic below) Coscheduling the noise would make the app pay the penalty only once Developed a prototype, but no details or results
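A one-line version of the "at least one slow process" argument, with an illustrative per-process delay probability p that is not taken from the paper: if each of N processes is independently delayed with probability p in a given iteration, then

$$\Pr[\text{iteration delayed}] = 1 - (1-p)^{N}, \qquad 1 - (1 - 0.001)^{4096} \approx 98\%$$

so a delay that afflicts each process only 0.1% of the time afflicts nearly every iteration at 4096 processes.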

23 Discrete-event Simulator Why a simulator? Time on ASCI Q is scarce, and configuration changes are not always practical Event = <F, L, E, P> F: frequency of the event; L: average duration; E: distribution; P: placement Workload: barriers + 1ms computations Validated against measured events (top two curves) Used to predict the performance gain of removing noise: removing noise on node 0, 1 or 31 gives a marginal improvement (15%); removing kernel noise on all nodes improves performance dramatically (simplified sketch below)
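A simplified Monte Carlo sketch of the simulator's core idea, not the paper's full <F, L, E, P> event model: N processes each compute for 1 ms and then synchronize at a barrier, so every iteration costs the maximum per-process time. The hit probability is loosely derived from the heartbeat figures on slide 20 (1 ms granule / 125 ms period = 0.008); the noise duration is illustrative:

#include <stdio.h>
#include <stdlib.h>

#define NPROCS   4096
#define NITERS   10000
#define GRAN_MS  1.0     /* computation granularity per iteration */
#define NOISE_MS 0.3     /* illustrative noise duration (L) */
#define HIT_PROB 0.008   /* per-iteration hit probability: 1ms / 125ms */

int main(void) {
    srand(1);
    double total = 0.0, ideal = NITERS * GRAN_MS;
    for (int it = 0; it < NITERS; it++) {
        double iter_time = GRAN_MS;  /* no process hit: ideal time */
        for (int p = 0; p < NPROCS; p++) {
            if ((double)rand() / RAND_MAX < HIT_PROB) {
                double t = GRAN_MS + NOISE_MS;  /* this process delayed */
                if (t > iter_time) iter_time = t;
            }
        }
        total += iter_time;  /* barrier: slowest process sets the pace */
    }
    printf("slowdown: %.1f%%\n", 100.0 * (total / ideal - 1.0));
    return 0;
}

With these parameters virtually every iteration contains at least one hit, so the whole machine pays the noise duration nearly every iteration, which is the mechanism behind the finding on the next slide.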

24 Discrete-event Simulator Why a simulator? Time on ASCI Q is scarce, and configuration changes are not always practical Event = <F, L, E, P> F: frequency of the event; L: average duration; E: distribution; P: placement Workload: barriers + 1ms computations Validated against measured events (top two curves) Used to predict the performance gain of removing noise: removing noise on node 0, 1 or 31 gives a marginal improvement (15%); removing kernel noise on all nodes improves performance dramatically FINDING #2: On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.

25 Eliminating Noise It is infeasible to remove all the noise: two TruCluster heartbeats run at kernel level and would require substantial kernel modifications Optimizations: removed ten daemons from all nodes; increased the RMS interval from 30s to 60s; moved several TruCluster daemons from nodes 1 and 2 to node 0 Microbenchmarks (barriers + computations of 0, 1 or 5ms) became 2.2 to 13 times faster

26 Step 4 1. Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q, then measure the real performance of SAGE 2. Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy 3. Problem elimination: use the simulator to try different remedies and eliminate the cause of the problem 4. Remeasurement: remeasure, and repeat from step 2 if the results still do not match

27 Optimized SAGE Performance Old curves (top two) vs. new curves: 4-proc, but without nodes 0 & 31 Jan-27-03: 1024-node segment (only up to 3716 processors) May-01-03: full-sized ASCI Q (up to 7680 processors) May-01-03 (min): minimum time over 50 cycles Results: Jan-27-03 and May-01-03 are much improved; May-01-03 (min) closely matches the expected performance, suggesting further optimizations can close the remaining gap

28 Summary Different configurations tested prior to and after noise removal Total processing rate = (# usable proc) * (cells per proc) / (cycle time), with a fixed 13,500 cells per processor and a varied # of usable processors (worked example below) Best observed (???) processing rate is only 15% below the model expectation
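As a worked example of the rate formula, take the 7680-processor figure of 120.6 quoted on slide 33 and assume its units are millions of cell updates per second (the slides leave the units unstated):

$$R = \frac{P \cdot c}{t_{\text{cycle}}} \;\Rightarrow\; t_{\text{cycle}} = \frac{7680 \times 13{,}500}{120.6 \times 10^{6}} \approx 0.86\ \text{s per cycle}$$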

29 Summary Different configurations tested prior to and after noise removal Total processing rate = (# usable proc) * (cells per proc) / (cycle time), with a fixed 13,500 cells per processor and a varied # of usable processors Best observed (???) processing rate is only 15% below the model expectation FINDING #3: We were able to double SAGE's performance by removing noise caused by several types of dæmons, confining dæmons to the cluster manager, and removing the cluster manager and the RMS cluster monitor from each cluster's compute pool.

30 Discussion The computational granularity of an app determines which type of noise hurts it Load-balanced, coarse-grained apps (e.g. LINPACK): long noise dominates; short noise is effectively coscheduled Medium-grained apps (e.g. SAGE): medium noise dominates Fine-grained apps (e.g. deterministic Sn-transport): short noise dominates, since the frequency of long noise is low

31 Discussion The computational granularity of an app determines which type of noise hurts it Load-balanced, coarse-grained apps (e.g. LINPACK): long noise dominates; short noise is effectively coscheduled Medium-grained apps (e.g. SAGE): medium noise dominates Fine-grained apps (e.g. deterministic Sn-transport): short noise dominates FINDING #4: Substantial performance loss occurs when an application resonates with system noise: high-frequency, fine-grained noise affects only fine-grained applications; low-frequency, coarse-grained noise affects only coarse-grained applications.

32 Conclusion Described a figurative journey to improve the performance of a sizable hydrodynamics app, SAGE, on the world's second-fastest supercomputer, ASCI Q Methodologies: first determined how fast the app could potentially run, then developed a methodology to analyze artifacts that degrade app performance yet are not part of the app Doubled the performance of SAGE without modifying a single line of its code The notions of noise and resonance are applicable to other systems and other apps

33 More Discussion What do they mean by "best observed" in Table 3? The processing rate of regular 4-proc using 7680 processors (120.6) is still lower than that of 3-proc with only 6144 processors. The analytical model was constructed manually (Darren Kerbyson et al., SC01), which is enormously labor-intensive.

34 Thanks! Any questions? The Case of the Missing Supercomputer Performance (SC 2003)
