PREDICTING COMMUNICATION PERFORMANCE

Size: px
Start display at page:

Download "PREDICTING COMMUNICATION PERFORMANCE"

Transcription

1 PREDICTING COMMUNICATION PERFORMANCE Nikhil Jain CASC Seminar, LLNL This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-ERD

2 SUPERCOMPUTERS 2 2

3 SUPERCOMPUTERS 48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec 420 GB/s, 1-2 microsec 2 2

4 SUPERCOMPUTERS 48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec Larger Bandwidth Lower Latency Fewer hops 420 GB/s, 1-2 microsec 2 2

5 WHY STUDY NETWORK PERFORMANCE? 3 3

6 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion 3 3

7 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini 3 3

8 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini Savings are proportionate to core-count 3 3

9 WHY STUDY NETWORK PERFORMANCE? Peak bandwidth and latency are never realized in presence of congestion High raw bandwidth does not guarantee proportionate observed performance Blue Gene vs Cray s Gemini Savings are proportionate to core-count Most importantly, as a graduate student, I do what I am asked to do! 3 3

10 QUANTIFYING IMPACT Execution time for different mappings of pf3d Default Map Best Map 60% Mapping via logical operations in Rubik Time per iteration (s) What about others mappings? How far are we from the best? Number of cores A. Bhatele, et al Mapping applications with collectives over sub-communicators on torus networks. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 12. IEEE Computer Society, Nov (to appear). LLNL-CONF

11 ALTERNATIVES *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:

12 ALTERNATIVES Theoretically: NP hard Simulations: too slow 15 days to simulate one use case* *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:

13 ALTERNATIVES Theoretically: NP hard Simulations: too slow 15 days to simulate one use case* Real runs: very expensive Application/allocation specific information Intrepid 4.16M 0.73M Mira 0.17M 7.67M Total 4.33M 8.40M 13 million core hours! *Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 11). ACM, New York, NY, USA, 76:1 76:

14 HEURISTICS - KNOWN METRICS Time per iteration (ms) e8 1.2e9 1.6e9 2e e09 3e9 6e9 9e9 Maximum Dilation Average bytes per link Maximum bytes on a link 2D-Halo: predicting performance using a linear regression model for known metrics 6 6

15 SUPERVISED LEARNING Collect/generate data and summarize Build models: train performance prediction based on independent metrics Predict 7 7

16 SUPERVISED LEARNING Collect/generate data and summarize Build models: train performance prediction based on independent metrics Predict 7 7

17 COMMUNICATION 8 8

18 COMMUNICATION 8 8

19 COMMUNICATION 8 8

20 COMMUNICATION 8 8

21 COMMUNICATION Injection FIFO Contention 8 8

22 COMMUNICATION Injection FIFO Contention Link Contention 8 8

23 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Link Contention 8 8

24 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Reception FIFO Contention Link Contention 8 8

25 COMMUNICATION Injection FIFO Contention Receive Buffer Contention Reception FIFO Contention Memory Contention Link Contention Memory Contention 8 8

26 NETWORK COUNTERS OF BLUEGENE/Q 9 9

27 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module 9 9

28 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C 9 9

29 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C Packets received on a link 9 9

30 NETWORK COUNTERS OF BLUEGENE/Q A PMPI based BGQ-Counter collection module Packets sent on links in specific directions: A, B, C, D, E deterministic, dynamic A B C Packets received on a link Packets in buffers 9 9

31 ANALYTICAL TOOL 10 10

32 ANALYTICAL TOOL Simulate the injection mechanism Selection of memory injection FIFO Mapping of memory FIFO to network injection FIFO 10 10

33 ANALYTICAL TOOL Simulate the injection mechanism Selection of memory injection FIFO Mapping of memory FIFO to network injection FIFO Simulate routing to obtain hops/dilation 10 10

34 SUPERVISED LEARNING 11 11

35 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance

36 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links

37 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links. Create a database of derived metrics and performance; we have used 100 mappings

38 SUPERVISED LEARNING Collect raw data - various entities, e.g. bytes on a link, and the observed performance. Derive metrics from the raw data on entities, e.g. average of bytes on links. Create a database of derived metrics and performance; we have used 100 mappings. Select two-third entries as training set; includes derived metrics and performance

39 SUPERVISED LEARNING 12 12

40 SUPERVISED LEARNING The training set is used to create a model for prediction 12 12

41 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics

42 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values

43 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc

44 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc

45 SUPERVISED LEARNING The training set is used to create a model for prediction Remaining entries from the database are used as the test set - only derived metrics. Prediction is compared with observed values. Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc

46 SUPERVISED LEARNING 13 13

47 SUPERVISED LEARNING Decision trees X[1] <= X[0] <= X[0] <= X[1] <= X[0] <= X[0] <= leaf leaf leaf X[1] <= Rest of the tree 13 13

48 SUPERVISED LEARNING Decision trees Randomized forest of trees X[1] <= X[0] <= X[0] <= X[1] <= X[0] <= X[0] <= leaf leaf leaf feature 2 X[1] <= Rest of the tree feature

49 HOW TO JUDGE A PREDICTION Rank Correlation Coefficient : (RCC): fraction of the number of pairs of task mappings whose ranks are in the same partial X } order X { } we define RCC as: in predicted and observed 8 performance list >< 1, if x i >= x j & y i >= y j Absolute Correlation Higher the better! concord ij = RCC = X 1, if x i <x j & y i <y j >: 0, otherwise X 0<=i<n 0<=j<i R 2 (y, ŷ) =1 concord ij /( n(n 1) ) 2 Pi (y i ŷ i ) 2 P i (y i ȳ)

50 RESULTS Three communication kernel Five-point 2D stencil 14-point 3D stencil Sub-communicator all-to-all Four message sizes to span MPI and routing protocols 15 15

51 KNOWN METRICS Entities Bytes on a link Dilation Derivation Methods Maximum Average Sum e09 3e9 6e9 9e9 Maximum bytes on a link 16 16

52 RESULTS KNOWN METRICS 17 17

53 RESULTS KNOWN METRICS max bytes is good, but incorrect in 10% cases 17 17

54 NEW METRICS 18 18

55 NEW METRICS Entities Buffer length (on intermediate nodes) FIFO length (packets in injection FIFO) Delay per link (packets in buffer/packets received) 18 18

56 NEW METRICS Entities Buffer length (on intermediate nodes) FIFO length (packets in injection FIFO) Delay per link (packets in buffer/packets received) Derivation methods Average Outliers (AO) Top Outliers (TO) 18 18

57 RESULTS NEW METRICS 19 19

58 HYBRID METRICS 20 20

59 HYBRID METRICS Combine multiple metrics to complement each other 20 20

60 HYBRID METRICS Combine multiple metrics to complement each other Some combinations avg bytes + max bytes + max FIFO avg bytes + max bytes + avg buffer + max FIFO avg bytes + avg buffer + avg delay AO + sum hops avg bytes TO + avg buffer TO + avg delay TO + sum hops 20 20

61 RESULTS HYBRID METRICS 21 21

62 RESULTS HYBRID METRICS hybrid metrics provide high accuracy 21 21

63 RESULTS - TREND Predicted performance (s) Sorted mappings Sorted mappings Sorted mappings 2D Halo 3D Halo Sub A2A 22 22

64 RESULTS 23 23

65 RESULTS 24 24

66 RESULTS Rank correlation coefficient RCC Combining all benchmarks avg bytes TO sum dilation AO max bytes avg bytes H1 avg buffer H3 H4 H5 H

67 RESULTS Rank correlation coefficient RCC Combining all benchmarks H1 avg bytes TO sum dilation AO max bytes avg bytes avg buffer H3 H4 H5 H6 Rank correlation coefficient RCC Predicting performance on 65,536 cores using max bytes avg bytes H1 avg buffer H3 H4 H5 H6 16,384 cores data for training avg bytes TO sum dilation A

68 RESULTS - PF3D 25 25

69 RESULTS - PF3D RCC Rank correlation coefficient R Absolute performance correlation avg bytes TO sum dilation AO max bytes avg bytes H1 avg buffer H3 H4 H5 H

70 RESULTS - PF3D 1.0 Rank correlation coefficient 31 Blue Gene/Q (16,384 cores) RCC Absolute performance correlation Execution Time (s) R sum dilation AO max bytes avg bytes H1 avg buffer avg bytes TO H3 H4 H5 H Mappings sorted by actual execution times pf3d Observed pf3d Predicted 25 25

71 SUMMARY Communication is not just about peak latency/ bandwidth Simultaneous analysis of various aspects of network is important Complex models are required for accurate prediction There are patterns waiting to be identified! 26 26

72 FUTURE WORK More applications! More metrics Weighted analysis Offline prediction of entities 27 27

73 FUTURE WORK More applications! More metrics Weighted analysis Offline prediction of entities Questions? 27 27

74 28 28

Automated Mapping of Regular Communication Graphs on Mesh Interconnects

Automated Mapping of Regular Communication Graphs on Mesh Interconnects Automated Mapping of Regular Communication Graphs on Mesh Interconnects Abhinav Bhatele, Gagan Gupta, Laxmikant V. Kale and I-Hsin Chung Motivation Running a parallel application on a linear array of processors:

More information

IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers

IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC 08 Outline Why should we consider topology

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Routing, Network Embedding John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 14-15 4,11 October 2018 Topics for Today

More information

Reducing Communication Costs Associated with Parallel Algebraic Multigrid

Reducing Communication Costs Associated with Parallel Algebraic Multigrid Reducing Communication Costs Associated with Parallel Algebraic Multigrid Amanda Bienz, Luke Olson (Advisor) University of Illinois at Urbana-Champaign Urbana, IL 11 I. PROBLEM AND MOTIVATION Algebraic

More information

Toward Runtime Power Management of Exascale Networks by On/Off Control of Links

Toward Runtime Power Management of Exascale Networks by On/Off Control of Links Toward Runtime Power Management of Exascale Networks by On/Off Control of Links, Nikhil Jain, Laxmikant Kale University of Illinois at Urbana-Champaign HPPAC May 20, 2013 1 Power challenge Power is a major

More information

Dr. Gengbin Zheng and Ehsan Totoni. Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Dr. Gengbin Zheng and Ehsan Totoni. Parallel Programming Laboratory University of Illinois at Urbana-Champaign Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011 A function level simulator for parallel applications on peta scale machine An

More information

A Machine-Learning Approach for Communication Prediction of Large-Scale Applications

A Machine-Learning Approach for Communication Prediction of Large-Scale Applications 1 A Machine-Learning Approach for Communication Prediction of Large-Scale Applications Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris National Technical University of Athens School of Electrical

More information

Partitioning Low-diameter Networks to Eliminate Inter-job Interference

Partitioning Low-diameter Networks to Eliminate Inter-job Interference Partitioning Low-diameter Networks to Eliminate Inter-job Interference Nikhil Jain, Abhinav Bhatele, Xiang Ni, Todd Gamblin, Laxmikant V. Kale Center for Applied Scientific Computing, Lawrence Livermore

More information

Heuristic-based techniques for mapping irregular communication graphs to mesh topologies

Heuristic-based techniques for mapping irregular communication graphs to mesh topologies Heuristic-based techniques for mapping irregular communication graphs to mesh topologies Abhinav Bhatele and Laxmikant V. Kale University of Illinois at Urbana-Champaign LLNL-PRES-495871, P. O. Box 808,

More information

Optimizing the performance of parallel applications on a 5D torus via task mapping

Optimizing the performance of parallel applications on a 5D torus via task mapping Optimizing the performance of parallel applications on a D torus via task mapping Abhinav Bhatele, Nikhil Jain,, Katherine E. Isaacs,, Ronak Buch, Todd Gamblin, Steven H. Langer, Laxmikant V. Kale Lawrence

More information

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale Parallel Programming Laboratory Euro-Par 2009 University of Illinois at Urbana-Champaign

More information

Maximizing Throughput on a Dragonfly Network

Maximizing Throughput on a Dragonfly Network Maximizing Throughput on a Dragonfly Network Nikhil Jain, Abhinav Bhatele, Xiang Ni, Nicholas J. Wright, Laxmikant V. Kale Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana,

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Predicting connection quality in peer-to-peer real-time video streaming systems

Predicting connection quality in peer-to-peer real-time video streaming systems Predicting connection quality in peer-to-peer real-time video streaming systems Alex Giladi Jeonghun Noh Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford,

More information

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College

More information

Charm++ for Productivity and Performance

Charm++ for Productivity and Performance Charm++ for Productivity and Performance A Submission to the 2011 HPC Class II Challenge Laxmikant V. Kale Anshu Arya Abhinav Bhatele Abhishek Gupta Nikhil Jain Pritish Jetley Jonathan Lifflander Phil

More information

Scalable Topology Aware Object Mapping for Large Supercomputers

Scalable Topology Aware Object Mapping for Large Supercomputers Scalable Topology Aware Object Mapping for Large Supercomputers Abhinav Bhatele (bhatele@illinois.edu) Department of Computer Science, University of Illinois at Urbana-Champaign Advisor: Laxmikant V. Kale

More information

TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks

TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks Bilge Acun, PhD Candidate Department of Computer Science University of Illinois at Urbana- Champaign *Contributors: Nikhil Jain,

More information

Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations

Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations Bilge Acun 1, Nikhil Jain 1, Abhinav Bhatele 2, Misbah Mubarak 3, Christopher D. Carothers 4, and Laxmikant V. Kale 1

More information

Noise Injection Techniques to Expose Subtle and Unintended Message Races

Noise Injection Techniques to Expose Subtle and Unintended Message Races Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau

More information

Final Report: Kaggle Soil Property Prediction Challenge

Final Report: Kaggle Soil Property Prediction Challenge Final Report: Kaggle Soil Property Prediction Challenge Saurabh Verma (verma076@umn.edu, (612)598-1893) 1 Project Goal Low cost and rapid analysis of soil samples using infrared spectroscopy provide new

More information

Oak Ridge National Laboratory Computing and Computational Sciences

Oak Ridge National Laboratory Computing and Computational Sciences Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman

More information

Introduction to ns3. October 27, Peter D. Barnes, Jr. Staff Scientist

Introduction to ns3. October 27, Peter D. Barnes, Jr. Staff Scientist Introduction to ns3 October 27, 2011 Peter D. Barnes, Jr. Staff Scientist, P. O. Box 808, Livermore, CA 94551! This work performed under the auspices of the U.S. Department of Energy by under Contract

More information

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana Champaign Sameer Kumar IBM T. J. Watson Research Center

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana Champaign Sameer Kumar IBM T. J. Watson Research Center Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana Champaign Sameer Kumar IBM T. J. Watson Research Center Motivation: Contention Experiments Bhatele, A., Kale, L. V. 2008 An Evaluation

More information

Lecture 20: Distributed Memory Parallelism. William Gropp

Lecture 20: Distributed Memory Parallelism. William Gropp Lecture 20: Distributed Parallelism William Gropp www.cs.illinois.edu/~wgropp A Very Short, Very Introductory Introduction We start with a short introduction to parallel computing from scratch in order

More information

Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help? Acceleration of HPC applications on hybrid CPU- systems: When can Multi-Process Service (MPS) help? GTC 2018 March 28, 2018 Olga Pearce (Lawrence Livermore National Laboratory) http://people.llnl.gov/olga

More information

Throughput Models of Interconnection Networks: the Good, the Bad, and the Ugly

Throughput Models of Interconnection Networks: the Good, the Bad, and the Ugly Throughput Models of Interconnection Networks: the Good, the Bad, and the Ugly Peyman Faizian, Md Atiqul Mollah, Md Shafayat Rahman, Xin Yuan Department of Computer Science Florida State University Tallahassee,

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Highly Scalable Parallel Sorting. Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010

Highly Scalable Parallel Sorting. Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010 Highly Scalable Parallel Sorting Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010 1 Outline Parallel sorting background Histogram Sort overview Histogram Sort

More information

Simulating Laser-Plasma Interaction in Experiments at the National Ignition Facility on a Cray XE6

Simulating Laser-Plasma Interaction in Experiments at the National Ignition Facility on a Cray XE6 Simulating Laser-Plasma Interaction in Experiments at the National Ignition Facility on a Cray XE6 Steven Langer, Abhinav Bhatele, Todd Gamblin, Bert Still, Denise Hinkel, Mike Kumbera, Bruce Langdon,

More information

Resource allocation in networks. Resource Allocation in Networks. Resource allocation

Resource allocation in networks. Resource Allocation in Networks. Resource allocation Resource allocation in networks Resource Allocation in Networks Very much like a resource allocation problem in operating systems How is it different? Resources and jobs are different Resources are buffers

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Snowpack: Efficient Parameter Choice for GPU Kernels via Static Analysis and Statistical Prediction ScalA 17, Denver, CO, USA, November 13, 2017

Snowpack: Efficient Parameter Choice for GPU Kernels via Static Analysis and Statistical Prediction ScalA 17, Denver, CO, USA, November 13, 2017 Snowpack: Efficient Parameter Choice for GPU Kernels via Static Analysis and Statistical Prediction ScalA 17, Denver, CO, USA, November 13, 2017 Ignacio Laguna Ranvijay Singh, Paul Wood, Ravi Gupta, Saurabh

More information

Slim Fly: A Cost Effective Low-Diameter Network Topology

Slim Fly: A Cost Effective Low-Diameter Network Topology TORSTEN HOEFLER, MACIEJ BESTA Slim Fly: A Cost Effective Low-Diameter Network Topology Images belong to their creator! NETWORKS, LIMITS, AND DESIGN SPACE Networks cost 25-30% of a large supercomputer Hard

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Topology Awareness in the Tofu Interconnect Series

Topology Awareness in the Tofu Interconnect Series Topology Awareness in the Tofu Interconnect Series Yuichiro Ajima Senior Architect Next Generation Technical Computing Unit Fujitsu Limited June 23rd, 2016, ExaComm2016 Workshop 0 Introduction Networks

More information

Leveraging Flash in HPC Systems

Leveraging Flash in HPC Systems Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,

More information

CS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control

CS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control CS 498 Hot Topics in High Performance Computing Networks and Fault Tolerance 9. Routing and Flow Control Intro What did we learn in the last lecture Topology metrics Including minimum diameter of directed

More information

Detecting and Using Critical Paths at Runtime in Message Driven Parallel Programs

Detecting and Using Critical Paths at Runtime in Message Driven Parallel Programs Detecting and Using Critical Paths at Runtime in Message Driven Parallel Programs Isaac Dooley, Laxmikant V. Kale Department of Computer Science University of Illinois idooley2@illinois.edu kale@illinois.edu

More information

The State and Needs of IO Performance Tools

The State and Needs of IO Performance Tools The State and Needs of IO Performance Tools Scalable Tools Workshop Lake Tahoe, CA August 6 12, 2017 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National

More information

Fast-Response Multipath Routing Policy for High-Speed Interconnection Networks

Fast-Response Multipath Routing Policy for High-Speed Interconnection Networks HPI-DC 09 Fast-Response Multipath Routing Policy for High-Speed Interconnection Networks Diego Lugones, Daniel Franco, and Emilio Luque Leonardo Fialho Cluster 09 August 31 New Orleans, USA Outline Scope

More information

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao *University of Edinburgh (UK), Lawrence Livermore National Laboratory

More information

Software Topological Message Routing and Aggregation Techniques for Large Scale Parallel Systems

Software Topological Message Routing and Aggregation Techniques for Large Scale Parallel Systems Software Topological Message Routing and Aggregation Techniques for Large Scale Parallel Systems Lukasz Wesolowski Preliminary Examination Report 1 Introduction Supercomputing networks are designed to

More information

Tutorial: Application MPI Task Placement

Tutorial: Application MPI Task Placement Tutorial: Application MPI Task Placement Juan Galvez Nikhil Jain Palash Sharma PPL, University of Illinois at Urbana-Champaign Tutorial Outline Why Task Mapping on Blue Waters? When to do mapping? How

More information

A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers

A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, Torsten Hoefler Swiss National Supercomputing

More information

Routing in a Delay Tolerant Network

Routing in a Delay Tolerant Network Routing in a Delay Tolerant Network Vladislav Marinov Jacobs University Bremen March 31st, 2008 Vladislav Marinov Routing in a Delay Tolerant Network 1 Internet for a Remote Village Dial-up connection

More information

Parallel Computer Architecture Part I. (Part 2 on March 18)

Parallel Computer Architecture Part I. (Part 2 on March 18) Parallel Computer Architecture Part I (Part 2 on March 18) Latency Defn: Latency is the time it takes one message to travel from source to destination. Includes various overheads. E.g. Taking the T (or

More information

Supervised Learning: K-Nearest Neighbors and Decision Trees

Supervised Learning: K-Nearest Neighbors and Decision Trees Supervised Learning: K-Nearest Neighbors and Decision Trees Piyush Rai CS5350/6350: Machine Learning August 25, 2011 (CS5350/6350) K-NN and DT August 25, 2011 1 / 20 Supervised Learning Given training

More information

Blue Waters I/O Performance

Blue Waters I/O Performance Blue Waters I/O Performance Mark Swan Performance Group Cray Inc. Saint Paul, Minnesota, USA mswan@cray.com Doug Petesch Performance Group Cray Inc. Saint Paul, Minnesota, USA dpetesch@cray.com Abstract

More information

Systems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications

Systems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications Systems Software for Scalable Applications (or) Super Faster Stronger MPI (and friends) for Blue Waters Applications William Gropp University of Illinois, Urbana- Champaign Pavan Balaji, Rajeev Thakur

More information

Data Driven Networks. Sachin Katti

Data Driven Networks. Sachin Katti Data Driven Networks Sachin Katti Is it possible for to Learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver CAN WE LEARN THE CONTROL

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

pf3d Simulations of Laser-Plasma Interactions in National Ignition Facility Experiments

pf3d Simulations of Laser-Plasma Interactions in National Ignition Facility Experiments 1 pf3d Simulations of Laser-Plasma Interactions in National Ignition Facility Experiments Steven Langer, Abhinav Bhatele, and Charles H. Still Abstract The laser-plasma interaction code, pf3d, is used

More information

UNIVERSITY OF CASTILLA-LA MANCHA. Computing Systems Department

UNIVERSITY OF CASTILLA-LA MANCHA. Computing Systems Department UNIVERSITY OF CASTILLA-LA MANCHA Computing Systems Department A case study on implementing virtual 5D torus networks using network components of lower dimensionality HiPINEB 2017 Francisco José Andújar

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Performance and Energy Usage of Workloads on KNL and Haswell Architectures Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research

More information

Data Driven Networks

Data Driven Networks Data Driven Networks Is it possible for to Learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver CAN WE LEARN THE CONTROL PLANE OF

More information

Proceedings of the 2017 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds.

Proceedings of the 2017 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds. Proceedings of the 217 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds. TOWARD RELIABLE VALIDATION OF HPC NETWORK SIMULATION MODELS Misbah

More information

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing?

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? J. Flich 1,P.López 1, M. P. Malumbres 1, J. Duato 1, and T. Rokicki 2 1 Dpto. Informática

More information

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018 SUPERVISED LEARNING METHODS Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018 2 CHOICE OF ML You cannot know which algorithm will work

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Network Calculus: A Comparison

Network Calculus: A Comparison Time-Division Multiplexing vs Network Calculus: A Comparison Wolfgang Puffitsch, Rasmus Bo Sørensen, Martin Schoeberl RTNS 15, Lille, France Motivation Modern multiprocessors use networks-on-chip Congestion

More information

Self Programming Networks

Self Programming Networks Self Programming Networks Is it possible for to Learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver CAN WE LEARN THE CONTROL PLANE

More information

Optimization of the Hop-Byte Metric for Effective Topology Aware Mapping

Optimization of the Hop-Byte Metric for Effective Topology Aware Mapping Optimization of the Hop-Byte Metric for Effective Topology Aware Mapping C. D. Sudheer Department of Mathematics and Computer Science Sri Sathya Sai Institute of Higher Learning, India Email: cdsudheerkumar@sssihl.edu.in

More information

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?

More information

On the Role of Burst Buffers in Leadership- Class Storage Systems

On the Role of Burst Buffers in Leadership- Class Storage Systems On the Role of Burst Buffers in Leadership- Class Storage Systems Ning Liu, Jason Cope, Philip Carns, Christopher Carothers, Robert Ross, Gary Grider, Adam Crume, Carlos Maltzahn Contact: liun2@cs.rpi.edu,

More information

The Impact of Optics on HPC System Interconnects

The Impact of Optics on HPC System Interconnects The Impact of Optics on HPC System Interconnects Mike Parker and Steve Scott Hot Interconnects 2009 Manhattan, NYC Will cost-effective optics fundamentally change the landscape of networking? Yes. Changes

More information

Evaluating Use of Data Flow Systems for Large Graph Analysis

Evaluating Use of Data Flow Systems for Large Graph Analysis Evaluating Use of Data Flow Systems for Large Graph Analysis Andy Yoo and Ian Kaplan, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department of Energy by under

More information

SCALASCA parallel performance analyses of SPEC MPI2007 applications

SCALASCA parallel performance analyses of SPEC MPI2007 applications Mitglied der Helmholtz-Gemeinschaft SCALASCA parallel performance analyses of SPEC MPI2007 applications 2008-05-22 Zoltán Szebenyi Jülich Supercomputing Centre, Forschungszentrum Jülich Aachen Institute

More information

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Power Constrained HPC

Power Constrained HPC http://scalability.llnl.gov/ Power Constrained HPC Martin Schulz Center or Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory With many collaborators and Co-PIs, incl.: LLNL: Barry

More information

ScalaIOTrace: Scalable I/O Tracing and Analysis

ScalaIOTrace: Scalable I/O Tracing and Analysis ScalaIOTrace: Scalable I/O Tracing and Analysis Karthik Vijayakumar 1, Frank Mueller 1, Xiaosong Ma 1,2, Philip C. Roth 2 1 Department of Computer Science, NCSU 2 Computer Science and Mathematics Division,

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Using the Cray Gemini Performance Counters

Using the Cray Gemini Performance Counters Photos placed in horizontal position with even amount of white space between photos and header Using the Cray Gemini Performance Counters 0 1 2 3 4 5 6 7 Backplane Backplane 8 9 10 11 12 13 14 15 Backplane

More information

Histogram sort with Sampling (HSS) Vipul Harsh, Laxmikant Kale

Histogram sort with Sampling (HSS) Vipul Harsh, Laxmikant Kale Histogram sort with Sampling (HSS) Vipul Harsh, Laxmikant Kale Parallel sorting in the age of Exascale Charm N-body GrAvity solver Massive Cosmological N-body simulations Parallel sorting in every iteration

More information

1.3 Data processing; data storage; data movement; and control.

1.3 Data processing; data storage; data movement; and control. CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical

More information

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona MySQL Performance Optimization and Troubleshooting with PMM Peter Zaitsev, CEO, Percona In the Presentation Practical approach to deal with some of the common MySQL Issues 2 Assumptions You re looking

More information

Matrix multiplication on multidimensional torus networks

Matrix multiplication on multidimensional torus networks Matrix multiplication on multidimensional torus networks Edgar Solomonik James Demmel Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2012-28

More information

The Constellation Project. Andrew W. Nash 14 November 2016

The Constellation Project. Andrew W. Nash 14 November 2016 The Constellation Project Andrew W. Nash 14 November 2016 The Constellation Project: Representing a High Performance File System as a Graph for Analysis The Titan supercomputer utilizes high performance

More information

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP

More information

Mapping Applications with Collectives over Sub-communicators on Torus Networks

Mapping Applications with Collectives over Sub-communicators on Torus Networks Mapping Applications with Collectives over Sub-communicators on Torus Networks Abhinav Bhatele, Todd Gamblin, Steven H. Langer, Peer-Timo Bremer,, Erik W. Draeger, Bernd Hamann, Katherine E. Isaacs, Aaditya

More information

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Jose Flich 1,PedroLópez 1, Manuel. P. Malumbres 1, José Duato 1,andTomRokicki 2 1 Dpto.

More information

Outline. Execution Environments for Parallel Applications. Supercomputers. Supercomputers

Outline. Execution Environments for Parallel Applications. Supercomputers. Supercomputers Outline Execution Environments for Parallel Applications Master CANS 2007/2008 Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Supercomputers OS abstractions Extended OS

More information

Getting Insider Information via the New MPI Tools Information Interface

Getting Insider Information via the New MPI Tools Information Interface Getting Insider Information via the New MPI Tools Information Interface EuroMPI 2016 September 26, 2016 Kathryn Mohror This work was performed under the auspices of the U.S. Department of Energy by Lawrence

More information

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to

More information

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

More information

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Topology and affinity aware hierarchical and distributed load-balancing in Charm++

Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Introduction to Parallel Programming

Introduction to Parallel Programming University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Section 3. Introduction to Parallel Programming Communication Complexity Analysis of Parallel Algorithms Gergel V.P., Professor,

More information

Improving Data Movement Performance for Sparse Data Patterns on the Blue Gene/Q Supercomputer

Improving Data Movement Performance for Sparse Data Patterns on the Blue Gene/Q Supercomputer Improving Data Movement Performance for Sparse Data Patterns on the Blue Gene/Q Supercomputer Huy Bui, Jason Leigh Electronic Visualization Laboratory (EVL) University of Illinois at Chicago Chicago, IL,

More information

Watch Out for the Bully! Job Interference Study on Dragonfly Network

Watch Out for the Bully! Job Interference Study on Dragonfly Network Watch Out for the Bully! Job Interference Study on Dragonfly Network Xu Yang, John Jenkins, Misbah Mubarak, Robert B. Ross, Zhiling Lan Department of Computer Science, Illinois Institute of Technology,

More information

A Comprehensive Analytical Performance Model of DRAM Caches

A Comprehensive Analytical Performance Model of DRAM Caches A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30 Suren Byna and Brian Austin

Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30 Suren Byna and Brian Austin Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30 Suren Byna and Brian Austin Lawrence Berkeley National Laboratory Energy efficiency at Exascale A design goal for future

More information