PREDICTING COMMUNICATION PERFORMANCE
|
|
- Silvester Walters
- 6 years ago
- Views:
Transcription
1 PREDICTING COMMUNICATION PERFORMANCE Nikhil Jain CASC Seminar, LLNL This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-ERD
2 SUPERCOMPUTERS 2 2
3 SUPERCOMPUTERS 48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec 420 GB/s, 1-2 microsec 2 2
4 SUPERCOMPUTERS 48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec Larger Bandwidth Lower Latency Fewer hops 420 GB/s, 1-2 microsec 2 2
5 WHY STUDY NETWORK PERFORMANCE? 3 3
6 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion 3 3
7 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini 3 3
8 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini Savings are proportionate to core-count 3 3
9 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini Savings are proportionate to core-count Most importantly, as a graduate student, I do what I am asked to do! 3 3
10 QUANTIFYING IMPACT Execution time for different mappings of pf3d Default Map Best Map 60% Mapping via logical operations in Rubik Time per iteration (s) What about others mappings? How far are we from the best? Number of cores A. Bhatele, et al Mapping applications with collectives over sub-communicators on torus networks. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 12. IEEE Computer Society, Nov (to appear). LLNL-CONF
11 ALTERNATIVES *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:
12 ALTERNATIVES Theoretically: NP hard Simulations: too slow 15 days to simulate one use case* *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:
13 ALTERNATIVES Theoretically: NP hard Simulations: too slow 15 days to simulate one use case* Real runs: very expensive Application/allocation specific information Intrepid 4.16M 0.73M Mira 0.17M 7.67M Total 4.33M 8.40M 13 million core hours! *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:
14 HEURISTICS - KNOWN METRICS Time per iteration (ms) e8 1.2e9 1.6e9 2e e09 3e9 6e9 9e9 Maximum Dilation Average bytes per link Maximum bytes on a link 2D-Halo: predicting performance using a linear regression model for known metrics 6 6
15 SUPERVISED LEARNING Collect/generate data and summarize Build models: train performance prediction based on independent metrics Predict 7 7
16 SUPERVISED LEARNING Collect/generate data and summarize Build models: train performance prediction based on independent metrics Predict 7 7
17 COMMUNICATION 8 8
18 COMMUNICATION 8 8
19 COMMUNICATION 8 8
20 COMMUNICATION 8 8
21 COMMUNICATION Injection FIFO Contention 8 8
22 COMMUNICATION Injection FIFO Contention Link Contention 8 8
23 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Link Contention 8 8
24 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Reception FIFO Contention Link Contention 8 8
25 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Reception FIFO Contention Memory Contention Link Contention Memory Contention 8 8
26 NETWORK COUNTERS OF BLUEGENE/Q 9 9
27 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module 9 9
28 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C 9 9
29 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C Packets received on a link 9 9
30 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C Packets received on a link Packets in buffers 9 9
31 ANALYTICAL TOOL 10 10
32 ANALYTICAL TOOL Simulate the injection mechanism Selection of memory injection FIFO Mapping of memory FIFO to network injection FIFO 10 10
33 ANALYTICAL TOOL Simulate the injection mechanism Selection of memory injection FIFO Mapping of memory FIFO to network injection FIFO Simulate routing to obtain hops/dilation 10 10
34 SUPERVISED LEARNING 11 11
35 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance
36 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links
37 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links. Create a database of derived metrics and performance; we have used 100 mappings
38 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links. Create a database of derived metrics and performance; we have used 100 mappings. Select two-third entries as training set; includes derived metrics and performance
39 SUPERVISED LEARNING 12 12
40 SUPERVISED LEARNING The training set is used to create a model for prediction 12 12
41 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics
42 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values
43 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc
44 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc
45 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc
46 SUPERVISED LEARNING 13 13
47 SUPERVISED LEARNING Decision trees X[1] <= X[0] <= X[0] <= X[1] <= X[0] <= X[0] <= leaf leaf leaf X[1] <= Rest of the tree 13 13
48 SUPERVISED LEARNING Decision trees Randomized forest of trees X[1] <= X[0] <= X[0] <= X[1] <= X[0] <= X[0] <= leaf leaf leaf feature 2 X[1] <= Rest of the tree feature
49 HOW TO JUDGE A PREDICTION Rank Correlation Coefficient : (RCC): fraction of the number of pairs of task mappings whose ranks are in the same partial X } order X { } we define RCC as: in predicted and observed 8 performance list >< 1, if x i >= x j & y i >= y j Absolute Correlation Higher the better! concord ij = RCC = X 1, if x i <x j & y i <y j >: 0, otherwise X 0<=i<n 0<=j<i R 2 (y, ŷ) =1 concord ij /( n(n 1) ) 2 Pi (y i ŷ i ) 2 P i (y i ȳ)
50 RESULTS Three communication kernel Five-point 2D stencil 14-point 3D stencil Sub-communicator all-to-all Four message sizes to span MPI and routing protocols 15 15
51 KNOWN METRICS Entities Bytes on a link Dilation Derivation Methods Maximum Average Sum e09 3e9 6e9 9e9 Maximum bytes on a link 16 16
52 RESULTS KNOWN METRICS 17 17
53 RESULTS KNOWN METRICS max bytes is good, but incorrect in 10% cases 17 17
54 NEW METRICS 18 18
55 NEW METRICS Entities Buffer length (on intermediate nodes) FIFO length (packets in injection FIFO) Delay per link (packets in buffer/packets received) 18 18
56 NEW METRICS Entities Buffer length (on intermediate nodes) FIFO length (packets in injection FIFO) Delay per link (packets in buffer/packets received) Derivation methods Average Outliers (AO) Top Outliers (TO) 18 18
57 RESULTS NEW METRICS 19 19
58 HYBRID METRICS 20 20
59 HYBRID METRICS Combine multiple metrics to complement each other 20 20
60 HYBRID METRICS Combine multiple metrics to complement each other Some combinations avg bytes + max bytes + max FIFO avg bytes + max bytes + avg buffer + max FIFO avg bytes + avg buffer + avg delay AO + sum hops avg bytes TO + avg buffer TO + avg delay TO + sum hops 20 20
61 RESULTS HYBRID METRICS 21 21
62 RESULTS HYBRID METRICS hybrid metrics provide high accuracy 21 21
63 RESULTS - TREND Predicted performance (s) Sorted mappings Sorted mappings Sorted mappings 2D Halo 3D Halo Sub A2A 22 22
64 RESULTS 23 23
65 RESULTS 24 24
66 RESULTS Rank correlation coefficient RCC Combining all benchmarks avg bytes TO sum dilation AO max bytes avg bytes H1 avg buffer H3 H4 H5 H
67 RESULTS Rank correlation coefficient RCC Combining all benchmarks H1 avg bytes TO sum dilation AO max bytes avg bytes avg buffer H3 H4 H5 H6 Rank correlation coefficient RCC Predicting performance on 65,536 cores using max bytes avg bytes H1 avg buffer H3 H4 H5 H6 16,384 cores data for training avg bytes TO sum dilation A
68 RESULTS - PF3D 25 25
69 RESULTS - PF3D RCC Rank correlation coefficient R Absolute performance correlation avg bytes TO sum dilation AO max bytes avg bytes H1 avg buffer H3 H4 H5 H
70 RESULTS - PF3D 1.0 Rank correlation coefficient 31 Blue Gene/Q (16,384 cores) RCC Absolute performance correlation Execution Time (s) R sum dilation AO max bytes avg bytes H1 avg buffer avg bytes TO H3 H4 H5 H Mappings sorted by actual execution times pf3d Observed pf3d Predicted 25 25
71 SUMMARY Communication is not just about peak latency/ bandwidth Simultaneous analysis of various aspects of network is important Complex models are required for accurate prediction There are patterns waiting to be identified! 26 26
72 FUTURE WORK More applications! More metrics Weighted analysis Offline prediction of entities 27 27
73 FUTURE WORK More applications! More metrics Weighted analysis Offline prediction of entities Questions? 27 27
74 28 28
Automated Mapping of Regular Communication Graphs on Mesh Interconnects
Automated Mapping of Regular Communication Graphs on Mesh Interconnects Abhinav Bhatele, Gagan Gupta, Laxmikant V. Kale and I-Hsin Chung Motivation Running a parallel application on a linear array of processors:
More informationIS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers
IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC 08 Outline Why should we consider topology
More informationParallel Computing Platforms
Parallel Computing Platforms Routing, Network Embedding John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 14-15 4,11 October 2018 Topics for Today
More informationReducing Communication Costs Associated with Parallel Algebraic Multigrid
Reducing Communication Costs Associated with Parallel Algebraic Multigrid Amanda Bienz, Luke Olson (Advisor) University of Illinois at Urbana-Champaign Urbana, IL 11 I. PROBLEM AND MOTIVATION Algebraic
More informationToward Runtime Power Management of Exascale Networks by On/Off Control of Links
Toward Runtime Power Management of Exascale Networks by On/Off Control of Links, Nikhil Jain, Laxmikant Kale University of Illinois at Urbana-Champaign HPPAC May 20, 2013 1 Power challenge Power is a major
More informationDr. Gengbin Zheng and Ehsan Totoni. Parallel Programming Laboratory University of Illinois at Urbana-Champaign
Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011 A function level simulator for parallel applications on peta scale machine An
More informationA Machine-Learning Approach for Communication Prediction of Large-Scale Applications
1 A Machine-Learning Approach for Communication Prediction of Large-Scale Applications Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris National Technical University of Athens School of Electrical
More informationPartitioning Low-diameter Networks to Eliminate Inter-job Interference
Partitioning Low-diameter Networks to Eliminate Inter-job Interference Nikhil Jain, Abhinav Bhatele, Xiang Ni, Todd Gamblin, Laxmikant V. Kale Center for Applied Scientific Computing, Lawrence Livermore
More informationHeuristic-based techniques for mapping irregular communication graphs to mesh topologies
Heuristic-based techniques for mapping irregular communication graphs to mesh topologies Abhinav Bhatele and Laxmikant V. Kale University of Illinois at Urbana-Champaign LLNL-PRES-495871, P. O. Box 808,
More informationOptimizing the performance of parallel applications on a 5D torus via task mapping
Optimizing the performance of parallel applications on a D torus via task mapping Abhinav Bhatele, Nikhil Jain,, Katherine E. Isaacs,, Ronak Buch, Todd Gamblin, Steven H. Langer, Laxmikant V. Kale Lawrence
More informationA CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS
A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale Parallel Programming Laboratory Euro-Par 2009 University of Illinois at Urbana-Champaign
More informationMaximizing Throughput on a Dragonfly Network
Maximizing Throughput on a Dragonfly Network Nikhil Jain, Abhinav Bhatele, Xiang Ni, Nicholas J. Wright, Laxmikant V. Kale Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana,
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationPredicting connection quality in peer-to-peer real-time video streaming systems
Predicting connection quality in peer-to-peer real-time video streaming systems Alex Giladi Jeonghun Noh Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford,
More informationAchieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College
More informationCharm++ for Productivity and Performance
Charm++ for Productivity and Performance A Submission to the 2011 HPC Class II Challenge Laxmikant V. Kale Anshu Arya Abhinav Bhatele Abhishek Gupta Nikhil Jain Pritish Jetley Jonathan Lifflander Phil
More informationScalable Topology Aware Object Mapping for Large Supercomputers
Scalable Topology Aware Object Mapping for Large Supercomputers Abhinav Bhatele (bhatele@illinois.edu) Department of Computer Science, University of Illinois at Urbana-Champaign Advisor: Laxmikant V. Kale
More informationTraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks
TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks Bilge Acun, PhD Candidate Department of Computer Science University of Illinois at Urbana- Champaign *Contributors: Nikhil Jain,
More informationPreliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations
Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations Bilge Acun 1, Nikhil Jain 1, Abhinav Bhatele 2, Misbah Mubarak 3, Christopher D. Carothers 4, and Laxmikant V. Kale 1
More informationNoise Injection Techniques to Expose Subtle and Unintended Message Races
Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau
More informationFinal Report: Kaggle Soil Property Prediction Challenge
Final Report: Kaggle Soil Property Prediction Challenge Saurabh Verma (verma076@umn.edu, (612)598-1893) 1 Project Goal Low cost and rapid analysis of soil samples using infrared spectroscopy provide new
More informationOak Ridge National Laboratory Computing and Computational Sciences
Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman
More informationIntroduction to ns3. October 27, Peter D. Barnes, Jr. Staff Scientist
Introduction to ns3 October 27, 2011 Peter D. Barnes, Jr. Staff Scientist, P. O. Box 808, Livermore, CA 94551! This work performed under the auspices of the U.S. Department of Energy by under Contract
More informationCRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar
CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More informationAbhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana Champaign Sameer Kumar IBM T. J. Watson Research Center
Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana Champaign Sameer Kumar IBM T. J. Watson Research Center Motivation: Contention Experiments Bhatele, A., Kale, L. V. 2008 An Evaluation
More informationLecture 20: Distributed Memory Parallelism. William Gropp
Lecture 20: Distributed Parallelism William Gropp www.cs.illinois.edu/~wgropp A Very Short, Very Introductory Introduction We start with a short introduction to parallel computing from scratch in order
More informationAcceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?
Acceleration of HPC applications on hybrid CPU- systems: When can Multi-Process Service (MPS) help? GTC 2018 March 28, 2018 Olga Pearce (Lawrence Livermore National Laboratory) http://people.llnl.gov/olga
More informationThroughput Models of Interconnection Networks: the Good, the Bad, and the Ugly
Throughput Models of Interconnection Networks: the Good, the Bad, and the Ugly Peyman Faizian, Md Atiqul Mollah, Md Shafayat Rahman, Xin Yuan Department of Computer Science Florida State University Tallahassee,
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationHighly Scalable Parallel Sorting. Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010
Highly Scalable Parallel Sorting Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010 1 Outline Parallel sorting background Histogram Sort overview Histogram Sort
More informationSimulating Laser-Plasma Interaction in Experiments at the National Ignition Facility on a Cray XE6
Simulating Laser-Plasma Interaction in Experiments at the National Ignition Facility on a Cray XE6 Steven Langer, Abhinav Bhatele, Todd Gamblin, Bert Still, Denise Hinkel, Mike Kumbera, Bruce Langdon,
More informationResource allocation in networks. Resource Allocation in Networks. Resource allocation
Resource allocation in networks Resource Allocation in Networks Very much like a resource allocation problem in operating systems How is it different? Resources and jobs are different Resources are buffers
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationSnowpack: Efficient Parameter Choice for GPU Kernels via Static Analysis and Statistical Prediction ScalA 17, Denver, CO, USA, November 13, 2017
Snowpack: Efficient Parameter Choice for GPU Kernels via Static Analysis and Statistical Prediction ScalA 17, Denver, CO, USA, November 13, 2017 Ignacio Laguna Ranvijay Singh, Paul Wood, Ravi Gupta, Saurabh
More informationSlim Fly: A Cost Effective Low-Diameter Network Topology
TORSTEN HOEFLER, MACIEJ BESTA Slim Fly: A Cost Effective Low-Diameter Network Topology Images belong to their creator! NETWORKS, LIMITS, AND DESIGN SPACE Networks cost 25-30% of a large supercomputer Hard
More informationInterconnection Networks: Topology. Prof. Natalie Enright Jerger
Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design
More informationTopology Awareness in the Tofu Interconnect Series
Topology Awareness in the Tofu Interconnect Series Yuichiro Ajima Senior Architect Next Generation Technical Computing Unit Fujitsu Limited June 23rd, 2016, ExaComm2016 Workshop 0 Introduction Networks
More informationLeveraging Flash in HPC Systems
Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,
More informationCS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control
CS 498 Hot Topics in High Performance Computing Networks and Fault Tolerance 9. Routing and Flow Control Intro What did we learn in the last lecture Topology metrics Including minimum diameter of directed
More informationDetecting and Using Critical Paths at Runtime in Message Driven Parallel Programs
Detecting and Using Critical Paths at Runtime in Message Driven Parallel Programs Isaac Dooley, Laxmikant V. Kale Department of Computer Science University of Illinois idooley2@illinois.edu kale@illinois.edu
More informationThe State and Needs of IO Performance Tools
The State and Needs of IO Performance Tools Scalable Tools Workshop Lake Tahoe, CA August 6 12, 2017 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National
More informationFast-Response Multipath Routing Policy for High-Speed Interconnection Networks
HPI-DC 09 Fast-Response Multipath Routing Policy for High-Speed Interconnection Networks Diego Lugones, Daniel Franco, and Emilio Luque Leonardo Fialho Cluster 09 August 31 New Orleans, USA Outline Scope
More informationData Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling
Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao *University of Edinburgh (UK), Lawrence Livermore National Laboratory
More informationSoftware Topological Message Routing and Aggregation Techniques for Large Scale Parallel Systems
Software Topological Message Routing and Aggregation Techniques for Large Scale Parallel Systems Lukasz Wesolowski Preliminary Examination Report 1 Introduction Supercomputing networks are designed to
More informationTutorial: Application MPI Task Placement
Tutorial: Application MPI Task Placement Juan Galvez Nikhil Jain Palash Sharma PPL, University of Illinois at Urbana-Champaign Tutorial Outline Why Task Mapping on Blue Waters? When to do mapping? How
More informationA PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, Torsten Hoefler Swiss National Supercomputing
More informationRouting in a Delay Tolerant Network
Routing in a Delay Tolerant Network Vladislav Marinov Jacobs University Bremen March 31st, 2008 Vladislav Marinov Routing in a Delay Tolerant Network 1 Internet for a Remote Village Dial-up connection
More informationParallel Computer Architecture Part I. (Part 2 on March 18)
Parallel Computer Architecture Part I (Part 2 on March 18) Latency Defn: Latency is the time it takes one message to travel from source to destination. Includes various overheads. E.g. Taking the T (or
More informationSupervised Learning: K-Nearest Neighbors and Decision Trees
Supervised Learning: K-Nearest Neighbors and Decision Trees Piyush Rai CS5350/6350: Machine Learning August 25, 2011 (CS5350/6350) K-NN and DT August 25, 2011 1 / 20 Supervised Learning Given training
More informationBlue Waters I/O Performance
Blue Waters I/O Performance Mark Swan Performance Group Cray Inc. Saint Paul, Minnesota, USA mswan@cray.com Doug Petesch Performance Group Cray Inc. Saint Paul, Minnesota, USA dpetesch@cray.com Abstract
More informationSystems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications
Systems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications William Gropp University of Illinois, Urbana- Champaign Pavan Balaji, Rajeev Thakur
More informationData Driven Networks. Sachin Katti
Data Driven Networks Sachin Katti Is it possible for to Learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver CAN WE LEARN THE CONTROL
More informationAnalysis of Sorting as a Streaming Application
1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency
More informationpf3d Simulations of Laser-Plasma Interactions in National Ignition Facility Experiments
1 pf3d Simulations of Laser-Plasma Interactions in National Ignition Facility Experiments Steven Langer, Abhinav Bhatele, and Charles H. Still Abstract The laser-plasma interaction code, pf3d, is used
More informationUNIVERSITY OF CASTILLA-LA MANCHA. Computing Systems Department
UNIVERSITY OF CASTILLA-LA MANCHA Computing Systems Department A case study on implementing virtual 5D torus networks using network components of lower dimensionality HiPINEB 2017 Francisco José Andújar
More informationPerformance Study of the MPI and MPI-CH Communication Libraries on the IBM SP
Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu
More informationPerformance and Energy Usage of Workloads on KNL and Haswell Architectures
Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research
More informationData Driven Networks
Data Driven Networks Is it possible for to Learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver CAN WE LEARN THE CONTROL PLANE OF
More informationProceedings of the 2017 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds.
Proceedings of the 217 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds. TOWARD RELIABLE VALIDATION OF HPC NETWORK SIMULATION MODELS Misbah
More informationCombining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing?
Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? J. Flich 1,P.López 1, M. P. Malumbres 1, J. Duato 1, and T. Rokicki 2 1 Dpto. Informática
More informationSUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018
SUPERVISED LEARNING METHODS Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018 2 CHOICE OF ML You cannot know which algorithm will work
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationNetwork Calculus: A Comparison
Time-Division Multiplexing vs Network Calculus: A Comparison Wolfgang Puffitsch, Rasmus Bo Sørensen, Martin Schoeberl RTNS 15, Lille, France Motivation Modern multiprocessors use networks-on-chip Congestion
More informationSelf Programming Networks
Self Programming Networks Is it possible for to Learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver CAN WE LEARN THE CONTROL PLANE
More informationOptimization of the Hop-Byte Metric for Effective Topology Aware Mapping
Optimization of the Hop-Byte Metric for Effective Topology Aware Mapping C. D. Sudheer Department of Mathematics and Computer Science Sri Sathya Sai Institute of Higher Learning, India Email: cdsudheerkumar@sssihl.edu.in
More informationAlgorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
More informationOn the Role of Burst Buffers in Leadership- Class Storage Systems
On the Role of Burst Buffers in Leadership- Class Storage Systems Ning Liu, Jason Cope, Philip Carns, Christopher Carothers, Robert Ross, Gary Grider, Adam Crume, Carlos Maltzahn Contact: liun2@cs.rpi.edu,
More informationThe Impact of Optics on HPC System Interconnects
The Impact of Optics on HPC System Interconnects Mike Parker and Steve Scott Hot Interconnects 2009 Manhattan, NYC Will cost-effective optics fundamentally change the landscape of networking? Yes. Changes
More informationEvaluating Use of Data Flow Systems for Large Graph Analysis
Evaluating Use of Data Flow Systems for Large Graph Analysis Andy Yoo and Ian Kaplan, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department of Energy by under
More informationSCALASCA parallel performance analyses of SPEC MPI2007 applications
Mitglied der Helmholtz-Gemeinschaft SCALASCA parallel performance analyses of SPEC MPI2007 applications 2008-05-22 Zoltán Szebenyi Jülich Supercomputing Centre, Forschungszentrum Jülich Aachen Institute
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationA Dendrogram. Bioinformatics (Lec 17)
A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and
More informationPower Constrained HPC
http://scalability.llnl.gov/ Power Constrained HPC Martin Schulz Center or Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory With many collaborators and Co-PIs, incl.: LLNL: Barry
More informationScalaIOTrace: Scalable I/O Tracing and Analysis
ScalaIOTrace: Scalable I/O Tracing and Analysis Karthik Vijayakumar 1, Frank Mueller 1, Xiaosong Ma 1,2, Philip C. Roth 2 1 Department of Computer Science, NCSU 2 Computer Science and Mathematics Division,
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationUsing the Cray Gemini Performance Counters
Photos placed in horizontal position with even amount of white space between photos and header Using the Cray Gemini Performance Counters 0 1 2 3 4 5 6 7 Backplane Backplane 8 9 10 11 12 13 14 15 Backplane
More informationHistogram sort with Sampling (HSS) Vipul Harsh, Laxmikant Kale
Histogram sort with Sampling (HSS) Vipul Harsh, Laxmikant Kale Parallel sorting in the age of Exascale Charm N-body GrAvity solver Massive Cosmological N-body simulations Parallel sorting in every iteration
More information1.3 Data processing; data storage; data movement; and control.
CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical
More informationMySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona
MySQL Performance Optimization and Troubleshooting with PMM Peter Zaitsev, CEO, Percona In the Presentation Practical approach to deal with some of the common MySQL Issues 2 Assumptions You re looking
More informationMatrix multiplication on multidimensional torus networks
Matrix multiplication on multidimensional torus networks Edgar Solomonik James Demmel Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2012-28
More informationThe Constellation Project. Andrew W. Nash 14 November 2016
The Constellation Project Andrew W. Nash 14 November 2016 The Constellation Project: Representing a High Performance File System as a Graph for Analysis The Titan supercomputer utilizes high performance
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationMapping Applications with Collectives over Sub-communicators on Torus Networks
Mapping Applications with Collectives over Sub-communicators on Torus Networks Abhinav Bhatele, Todd Gamblin, Steven H. Langer, Peer-Timo Bremer,, Erik W. Draeger, Bernd Hamann, Katherine E. Isaacs, Aaditya
More informationCombining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing
Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Jose Flich 1,PedroLópez 1, Manuel. P. Malumbres 1, José Duato 1,andTomRokicki 2 1 Dpto.
More informationOutline. Execution Environments for Parallel Applications. Supercomputers. Supercomputers
Outline Execution Environments for Parallel Applications Master CANS 2007/2008 Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Supercomputers OS abstractions Extended OS
More informationGetting Insider Information via the New MPI Tools Information Interface
Getting Insider Information via the New MPI Tools Information Interface EuroMPI 2016 September 26, 2016 Kathryn Mohror This work was performed under the auspices of the U.S. Department of Energy by Lawrence
More informationEN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to
More informationOutline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationOrder or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations
Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationTopology and affinity aware hierarchical and distributed load-balancing in Charm++
Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationIntroduction to Parallel Programming
University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Section 3. Introduction to Parallel Programming Communication Complexity Analysis of Parallel Algorithms Gergel V.P., Professor,
More informationImproving Data Movement Performance for Sparse Data Patterns on the Blue Gene/Q Supercomputer
Improving Data Movement Performance for Sparse Data Patterns on the Blue Gene/Q Supercomputer Huy Bui, Jason Leigh Electronic Visualization Laboratory (EVL) University of Illinois at Chicago Chicago, IL,
More informationWatch Out for the Bully! Job Interference Study on Dragonfly Network
Watch Out for the Bully! Job Interference Study on Dragonfly Network Xu Yang, John Jenkins, Misbah Mubarak, Robert B. Ross, Zhiling Lan Department of Computer Science, Illinois Institute of Technology,
More informationA Comprehensive Analytical Performance Model of DRAM Caches
A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science
More informationMVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The
More informationEvaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30 Suren Byna and Brian Austin
Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30 Suren Byna and Brian Austin Lawrence Berkeley National Laboratory Energy efficiency at Exascale A design goal for future
More information