PREDICTING COMMUNICATION PERFORMANCE

Size: px

Start display at page:

Download "PREDICTING COMMUNICATION PERFORMANCE"

Silvester Walters
6 years ago
Views:

1 PREDICTING COMMUNICATION PERFORMANCE Nikhil Jain CASC Seminar, LLNL This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-ERD

2 SUPERCOMPUTERS 2 2

3 SUPERCOMPUTERS 48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec 420 GB/s, 1-2 microsec 2 2

4 SUPERCOMPUTERS 48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec Larger Bandwidth Lower Latency Fewer hops 420 GB/s, 1-2 microsec 2 2

5 WHY STUDY NETWORK PERFORMANCE? 3 3

6 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion 3 3

7 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini 3 3

8 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini Savings are proportionate to core-count 3 3

9 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini Savings are proportionate to core-count Most importantly, as a graduate student, I do what I am asked to do! 3 3

10 QUANTIFYING IMPACT Execution time for different mappings of pf3d Default Map Best Map 60% Mapping via logical operations in Rubik Time per iteration (s) What about others mappings? How far are we from the best? Number of cores A. Bhatele, et al Mapping applications with collectives over sub-communicators on torus networks. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 12. IEEE Computer Society, Nov (to appear). LLNL-CONF

11 ALTERNATIVES *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:

12 ALTERNATIVES Theoretically: NP hard Simulations: too slow 15 days to simulate one use case* *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:

13 ALTERNATIVES Theoretically: NP hard Simulations: too slow 15 days to simulate one use case* Real runs: very expensive Application/allocation specific information Intrepid 4.16M 0.73M Mira 0.17M 7.67M Total 4.33M 8.40M 13 million core hours! *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:

14 HEURISTICS - KNOWN METRICS Time per iteration (ms) e8 1.2e9 1.6e9 2e e09 3e9 6e9 9e9 Maximum Dilation Average bytes per link Maximum bytes on a link 2D-Halo: predicting performance using a linear regression model for known metrics 6 6

15 SUPERVISED LEARNING Collect/generate data and summarize Build models: train performance prediction based on independent metrics Predict 7 7

16 SUPERVISED LEARNING Collect/generate data and summarize Build models: train performance prediction based on independent metrics Predict 7 7

17 COMMUNICATION 8 8

18 COMMUNICATION 8 8

19 COMMUNICATION 8 8

20 COMMUNICATION 8 8

21 COMMUNICATION Injection FIFO Contention 8 8

22 COMMUNICATION Injection FIFO Contention Link Contention 8 8

23 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Link Contention 8 8

24 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Reception FIFO Contention Link Contention 8 8

25 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Reception FIFO Contention Memory Contention Link Contention Memory Contention 8 8

26 NETWORK COUNTERS OF BLUEGENE/Q 9 9

27 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module 9 9

28 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C 9 9

29 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C Packets received on a link 9 9

30 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C Packets received on a link Packets in buffers 9 9

31 ANALYTICAL TOOL 10 10

32 ANALYTICAL TOOL Simulate the injection mechanism Selection of memory injection FIFO Mapping of memory FIFO to network injection FIFO 10 10

33 ANALYTICAL TOOL Simulate the injection mechanism Selection of memory injection FIFO Mapping of memory FIFO to network injection FIFO Simulate routing to obtain hops/dilation 10 10

34 SUPERVISED LEARNING 11 11

35 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance

36 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links

37 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links. Create a database of derived metrics and performance; we have used 100 mappings

38 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links. Create a database of derived metrics and performance; we have used 100 mappings. Select two-third entries as training set; includes derived metrics and performance

39 SUPERVISED LEARNING 12 12

40 SUPERVISED LEARNING The training set is used to create a model for prediction 12 12

41 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics

42 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values

43 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc

44 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc

45 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc

46 SUPERVISED LEARNING 13 13

47 SUPERVISED LEARNING Decision trees X[1] <= X[0] <= X[0] <= X[1] <= X[0] <= X[0] <= leaf leaf leaf X[1] <= Rest of the tree 13 13

2857 X[1] <= 0.0077 X[0] <= 0.0176 X[0] <= 0.

48 SUPERVISED LEARNING Decision trees Randomized forest of trees X[1] <= X[0] <= X[0] <= X[1] <= X[0] <= X[0] <= leaf leaf leaf feature 2 X[1] <= Rest of the tree feature

49 HOW TO JUDGE A PREDICTION Rank Correlation Coefficient : (RCC): fraction of the number of pairs of task mappings whose ranks are in the same partial X } order X { } we define RCC as: in predicted and observed 8 performance list >< 1, if x i >= x j & y i >= y j Absolute Correlation Higher the better! concord ij = RCC = X 1, if x i <x j & y i <y j >: 0, otherwise X 0<=i<n 0<=j<i R 2 (y, ŷ) =1 concord ij /( n(n 1) ) 2 Pi (y i ŷ i ) 2 P i (y i ȳ)

50 RESULTS Three communication kernel Five-point 2D stencil 14-point 3D stencil Sub-communicator all-to-all Four message sizes to span MPI and routing protocols 15 15

51 KNOWN METRICS Entities Bytes on a link Dilation Derivation Methods Maximum Average Sum e09 3e9 6e9 9e9 Maximum bytes on a link 16 16

52 RESULTS KNOWN METRICS 17 17

53 RESULTS KNOWN METRICS max bytes is good, but incorrect in 10% cases 17 17

54 NEW METRICS 18 18

55 NEW METRICS Entities Buffer length (on intermediate nodes) FIFO length (packets in injection FIFO) Delay per link (packets in buffer/packets received) 18 18

56 NEW METRICS Entities Buffer length (on intermediate nodes) FIFO length (packets in injection FIFO) Delay per link (packets in buffer/packets received) Derivation methods Average Outliers (AO) Top Outliers (TO) 18 18

57 RESULTS NEW METRICS 19 19

58 HYBRID METRICS 20 20

59 HYBRID METRICS Combine multiple metrics to complement each other 20 20

60 HYBRID METRICS Combine multiple metrics to complement each other Some combinations avg bytes + max bytes + max FIFO avg bytes + max bytes + avg buffer + max FIFO avg bytes + avg buffer + avg delay AO + sum hops avg bytes TO + avg buffer TO + avg delay TO + sum hops 20 20

61 RESULTS HYBRID METRICS 21 21

62 RESULTS HYBRID METRICS hybrid metrics provide high accuracy 21 21

63 RESULTS - TREND Predicted performance (s) Sorted mappings Sorted mappings Sorted mappings 2D Halo 3D Halo Sub A2A 22 22

64 RESULTS 23 23

65 RESULTS 24 24

66 RESULTS Rank correlation coefficient RCC Combining all benchmarks avg bytes TO sum dilation AO max bytes avg bytes H1 avg buffer H3 H4 H5 H

67 RESULTS Rank correlation coefficient RCC Combining all benchmarks H1 avg bytes TO sum dilation AO max bytes avg bytes avg buffer H3 H4 H5 H6 Rank correlation coefficient RCC Predicting performance on 65,536 cores using max bytes avg bytes H1 avg buffer H3 H4 H5 H6 16,384 cores data for training avg bytes TO sum dilation A

68 RESULTS - PF3D 25 25

69 RESULTS - PF3D RCC Rank correlation coefficient R Absolute performance correlation avg bytes TO sum dilation AO max bytes avg bytes H1 avg buffer H3 H4 H5 H

70 RESULTS - PF3D 1.0 Rank correlation coefficient 31 Blue Gene/Q (16,384 cores) RCC Absolute performance correlation Execution Time (s) R sum dilation AO max bytes avg bytes H1 avg buffer avg bytes TO H3 H4 H5 H Mappings sorted by actual execution times pf3d Observed pf3d Predicted 25 25

71 SUMMARY Communication is not just about peak latency/ bandwidth Simultaneous analysis of various aspects of network is important Complex models are required for accurate prediction There are patterns waiting to be identified! 26 26

72 FUTURE WORK More applications! More metrics Weighted analysis Offline prediction of entities 27 27

73 FUTURE WORK More applications! More metrics Weighted analysis Offline prediction of entities Questions? 27 27

74 28 28

Automated Mapping of Regular Communication Graphs on Mesh Interconnects

Automated Mapping of Regular Communication Graphs on Mesh Interconnects Abhinav Bhatele, Gagan Gupta, Laxmikant V. Kale and I-Hsin Chung Motivation Running a parallel application on a linear array of processors: