TORSTEN HOEFLER

Size: px
Start display at page:

Download "TORSTEN HOEFLER"

Transcription

1 TORSTEN HOEFLER Performance Modeling for Future Computing Technologies Presentation at Tsinghua University, Beijing, China Part of the 60 th anniversary celebration of the Computer Science Department (#9) spcl.inf.ethz.ch

2 Changing hardware constraints and the physics of computing 130nm 0.9 V [1] 3-bit FP ADD: 0.9 pj 3-bit FP MUL: 3. pj How to address locality challenges on standard architectures and programming? D. Unat et al.: Trends in Data Locality Abstractions for HPC Systems x3 bit from L1 (8 kib): 10 pj x3 bit from L (1 MiB): 100 pj x3 bit from DRAM: 1.3 nj IEEE Transactions on Parallel and Distributed Systems (TPDS). Vol 8, Nr. 10, IEEE, Oct nm 65nm Three Ls of modern computing: 45nm 3nm [1]: Marc Horowitz, Computing s Energy Problem (and what we can do about it), ISSC 014, plenary []: Moore: Landauer Limit Demonstrated, IEEE Spectrum 01 nm 14nm 10nm

3 Control Locality? Load-store vs. Dataflow architectures spcl.inf.ethz.ch Turing Award 1977 (Backus): "Surely there must be a less primitive way of making big changes in the store than pushing vast numbers of words back and forth through the von Neumann bottleneck." Load-store ( von Neumann ) x=a+b Static Dataflow ( non von Neumann ) y=(a+b)*(c+d) ALU Energy per instruction: 70pJ x Energy per operation: 1-3pJ y x Control Registers Cache a b Source: Mark Horowitz, ISSC 14 a+b + c+d + ld st add r1, a, b, r1, r x r Memory a b a b Memory c d y 3

4 Single Instruction Multiple Data/Threads (SIMD - Vector CPU, SIMT - GPU) ALU ALU ALU ALU ALU Control ALU ALU ALU ALU (High Performance) Computing really ALU 1 MiB: 100 pj x Registers 45nm, 0.9 V [1] Random Access SRAM: 8 kib: 10 pj 3 kib: 0 pj 45nm, 0.9 V [1] Single R/W registers: 3 bit: 0.1 pj became a data management challenge Cache + + Memory a b Memory c d y [1]: Marc Horowitz, Computing s Energy Problem (and what we can do about it), ISSC 014, plenary 4

5 Crystal Ball into the Post-Moore Future (maybe already today?) Neuromorphic Asynchronous CMOS circuits 1000x energy benefit Integrates compute (neurons) and Quantum Completely different paradigm Concept of qubits Bases on quantum mechanics (which only works in isolation) Many different ideas how to build Efficient CMOS Future architectures will force us to manage memory/communication Rest of this (synapses) talk: how do we understand Accelerators which Custom architectures Very specialized parts Network and of storage programs to Ion accelerate trap (ions trapped in fields) on which Reconfigurable device? datapaths Adapt architecture to problem Phrase your problem as Optical Dataflow + Control Flow inference! Spin-based Main challenge Even learning is hard Superconducting Programmability! Comparatively little work Majorana qubits See our SC18 tutorial Suddenly much lower energy (none proven to scale) benefits accelerated heterogeneity FPGAs or CGRAs or GPUs Have been around for a while Use transistors more efficiently Obvious answer: the slow ones! Needs new algorithms to be useful So simply observe Algorithms their performance? are limited Not so fast. Productive Parallel Programming for FPGA image source: 1stcentury.com image source: intel.com image source: intel.com 5

6 What can we learn from High Performance Computing dgemm("n", "N", 50, 50, 50, 1.0, A, 50, B, 50, 1.0, C, 50); >x ~4x 1 ~10 3 ~10 4 ~10 6 ~10 8 ~10 10 ~

7 HPC is used to solve complex problems! Image credit: Serena Donnin, Sarah Rauscher, Ivo Kabashow 7

8 Scientific Performance Engineering 1) Observe ) Model 3) Understand 4) Build 8

9 Part I: Observe Measure systems Experimental design Examine documentation Collect data Gather statistics Document process Factorial design 9

10 usec spcl.inf.ethz.ch Trivial Example: Simple ping-pong latency benchmark The latency of Piz Daint is 1.77us! ~1.ms How did you get this number? I averaged 10 6 experiments, it must be right! Why do you think so? Can I see the data? ~1.77us sample TH, Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15 10

11 Dealing with variation The 99.9% confidence interval is 1.765us to 1.775us Did you assume normality? Ugs, What? the Isn t data that is not always normal at all. The the nonparametric case with many 99.9% CI is much measurements? wider: 1.6us to 1.9us! Can we test for normality? TH, Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15 11

12 Looking at the data in detail This CI makes me nervous. Let s zoom! Clearly, the mean/median are not sufficient! Try quantile regression! S D Image credit: nersc.gov 1

13 Scientific benchmarking of parallel computing systems ACM/IEEE Supercomputing 015 (SC15) + talk online on youtube! Rule 1: When publishing parallel speedup, report if the base case is a single parallel process or best serial execution, as well as Rule the : absolute Specify the execution reason performance only reporting of the subsets base case. of standard benchmarks Rule 3: Use or the applications arithmetic mean or not only using for all summarizing system resources. costs. Use the Rule 4: Avoid summarizing ratios; summarize the costs or rates that harmonic mean for summarizing rates. the ratios base on instead. Only if these are not available use the Rule 5: Report if the measurement values are deterministic. For geometric mean for summarizing ratios. nondeterministic data, report confidence intervals of the Rule 6: Do not assume measurement. normality of collected data (e.g., based on Rule 7: Carefully investigate if measures of central tendency the number of samples) without diagnostic checking. such as Rule mean 8: Carefully or median investigate are useful if to measures report. Some of central problems, tendency such such as worst-case mean latency, median are may useful require to other report. percentiles. Rule 9: Document all varying factors and their Some levels problems, as well as the complete such as worst-case experimental latency, setup may (e.g., require software, other hardware, percentiles. techniques) Rule 10: For parallel time measurements, report all measurement, to facilitate reproducibility and provide interpretability. (optional) Rule 11: If synchronization, possible, show upper and summarization performance bounds techniques. to facilitate Rule 1: Plot as much information as needed to interpret the interpretability of the measured results. experimental results. Only connect measurements by lines if they indicate trends and the interpolation is valid. 13

14 Simplifying Measuring and Reporting: LibSciBench Simple MPI-like C/C+ interface High-resolution timers Flexible data collection Controlled by environment variables Tested up to 51k ranks Parallel timer synchronization R scripts for data analysis and visualization S. Di Girolamo, TH: 14

15 We have the (statistically sound) data, now what? t(n=100)? t(n=1510)? Matrix Multiply t(n) = a*n 3 The 99% confidence interval is within 1% of the reported median. TH, W. Gropp, M. Snir, W. Kramer: Performance Modeling for Systematic Performance Tuning, IEEE/ACM SC11 15

16 We have the (statistically sound) data, now what? t(n=100)=0.667s t(n=1510)=0.48s The 99% confidence interval is within 1% of the reported median. The adjusted R of the model fit is 0.99 TH, W. Gropp, M. Snir, W. Kramer: Performance Modeling for Systematic Performance Tuning, IEEE/ACM SC11 16

17 Part II: Model Model Burnham, Anderson: A model is a simplification or approximation of reality and hence will not reflect all of reality.... Box noted that all models are wrong, but some are useful. While a model can never be truth, a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. This is generally true for all kinds of modeling. We focus on performance modeling in the following! 17

18 Performance Modeling Capability Model Performance Model Requirements Model TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC

19 Requirements modeling I: Six-step performance modeling Input parameters Communication parameters Describe application kernels Fit sequential baseline Communication / computation overlap Communication pattern 10-0% speedup [] [1] TH, W. Gropp, M. Snir and W. Kramer: Performance Modeling for Systematic Performance Tuning, SC11 [] TH and S. Gottlieb: Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient using MPI Datatypes, EuroMPI 10 19

20 Requirements modeling II: Automated best-fit modeling Manual kernel selection and hypothesis generation is time consuming (boring and tricky) Idea: Automatically select best (scalability) model from predefined search space number of terms n = i j e.g., number of f ( p) c p k log k ( p) processes k k = 1 (model) constant n Î i k Î I j k Î J I, J Ì n =1 I = { 0,1, } J = {0,1} p p log(p) p log(p) p log(p) [1]: A. Calotoiu, TH, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, IEEE/ACM SC13 0

21 Requirements modeling II: Automated best-fit modeling Manual kernel selection and hypothesis generation is time consuming (and boring) Idea: Automatically select best model from predefined space n f (p) = åc p i k log j k (p) k c1 log( p) + c k=1 c log( p) + c n = I = { 0,1, } J = {0,1} + c p + c p + c log(p) + c p log(p) + c p log(p) c c c c c c c c log( p) + c log( p) + c p + c p p + c p + c + c p p p log( p) p p p p log( p) + c p log( p) + c p log( p) [1]: A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, IEEE/ACM SC13 p log( p) log( p) p p log( p) log( p) n Î i k Î I j k Î J I, J Ì 1

22 Requirements modeling III: Source-code analysis [1] Extra-P selects model based on best fit to the data What if the data is not sufficient or too noisy? Back to first principles The source code describes all possible executions Describing all possibilities is too expensive, focus on counting loop iterations symbolically for (j = 1; j <= n; j = j*) for (k = j; k <= n; k = k++) OperationInBody(j,k); Parallel program Loop extraction N = ( n + 1)log n n + [1]: TH, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, ACM SPAA 14 Requirements Models W = N p= 1 D = N p Number of iterations

23 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

24 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

25 Capability models for cache-to-cache communication write read X = Invalid read R I = 78 ns = Local read: R L = 8.6 ns Remote read R R = 35 ns S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC 13 6

26 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

27 Part III: Understand Use models to 1. Proof optimality of real implementations Stop optimizing, step back to algorithm level. Design optimal algorithms or systems in the model Can lead to non-intuitive designs Understand Proof optimality of matrix multiplication Intuition: flop rate is the bottleneck t(n) = 76ps * n 3 Flop rate R = flop * n 3 /(76ps * n 3 ) = 7.78 Gflop/s Flop peak: GHz * 8 flops = Gflop/s Achieved ~90% of peak (IBM Power 7 Gets more complex quickly Imagine sparse matrix-vector 8

28 Design algorithms bcast in cache-to-cache model Multi-ary tree example 0 depth d = Tree depth Level size 1 k 1 = k = 3 Tree cost Reached threads S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC 13 30

29 Measured results small broadcast and reduction P=10 4.7x P=58 3.3x Intel Xeon Phi 5110P (60 cores at 105 MHz), Intel MPI v each operation timed separately, reporting maximum across processes S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC 13 31

30 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

31 How to continue from here? DAPP Transformation System DAPP Parallel Language Data-centric, explicit requirements models User-supported, compile- and run-time memlets + operators = DCIR Performance-transparent Platforms HTM [1] RMA fompi-na [] NISA [3] [1]: M. Besta, TH: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, ACM HPDC 15 []: R. Belli, TH: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS 15 [3]: S. Di Girolamo, P. Jolivet, K. D. Underwood, TH: Exploiting Offload Enabled Network Interfaces, IEEE Micro 16 33

TORSTEN HOEFLER, ROBERTO BELLI Disclaimer(s) Scientific Benchmarking of Parallel Computing Systems This is an experience talk (published at SC 15 State of the Practice)! Twelve ways to tell the masses

More information

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL SABELA RAMOS, TORSTEN HOEFLER Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL spcl.inf.ethz.ch Microarchitectures are becoming more and more complex CPU L1 CPU L1 CPU L1 CPU

More information

Performance-Centric System Design

Performance-Centric System Design Performance-Centric System Design Torsten Hoefler Swiss Federal Institute of Technology, ETH Zürich Performance-Centric System Design - modeling, programming, and architecture - Torsten Hoefler ETH Zürich

More information

Scientific Benchmarking of Parallel Computing Systems Twelve ways to tell the masses when reporting performance results

Scientific Benchmarking of Parallel Computing Systems Twelve ways to tell the masses when reporting performance results TORSTEN HOEFLER, ROBERTO BELLI Scientific Benchmarking of Parallel Computing Systems Twelve ways to tell the masses when reporting performance results presented at RWTH Aachen, Jan. 2019 Disclaimer(s)

More information

Performance Modeling for Systematic Performance Tuning

Performance Modeling for Systematic Performance Tuning Performance Modeling for Systematic Performance Tuning Torsten Hoefler with inputs from William Gropp, Marc Snir, Bill Kramer Invited Talk RWTH Aachen University March 30 th, Aachen, Germany All used images

More information

Towards scalable RDMA locking on a NIC

Towards scalable RDMA locking on a NIC TORSTEN HOEFLER spcl.inf.ethz.ch Towards scalable RDMA locking on a NIC with support of Patrick Schmid, Maciej Besta, Salvatore di Girolamo @ SPCL presented at HP Labs, Palo Alto, CA, USA NEED FOR EFFICIENT

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

spin: High-performance streaming Processing in the Network

spin: High-performance streaming Processing in the Network T. HOEFLER, S. DI GIROLAMO, K. TARANOV, R. E. GRANT, R. BRIGHTWELL spin: High-performance streaming Processing in the Network spcl.inf.ethz.ch The Development of High-Performance Networking Interfaces

More information

Exploiting Offload Enabled Network Interfaces

Exploiting Offload Enabled Network Interfaces spcl.inf.ethz.ch S. DI GIROLAMO, P. JOLIVET, K. D. UNDERWOOD, T. HOEFLER Exploiting Offload Enabled Network Interfaces How to We program need an abstraction! QsNet? Lossy Networks Ethernet Lossless Networks

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Productive parallel programming on FPGA using High-level Synthesis

Productive parallel programming on FPGA using High-level Synthesis J. DE FINE LICHT AND T. HOEFLER Productive parallel programming on FPGA using High-level Synthesis spcl.inf.ethz.ch Changing hardware constraints and the physics of computing 130nm 0.9 V [1] 32-bit FP

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

Predicting GPU Performance from CPU Runs Using Machine Learning

Predicting GPU Performance from CPU Runs Using Machine Learning Predicting GPU Performance from CPU Runs Using Machine Learning Ioana Baldini Stephen Fink Erik Altman IBM T. J. Watson Research Center Yorktown Heights, NY USA 1 To exploit GPGPU acceleration need to

More information

ESE532: System-on-a-Chip Architecture. Today. Programmable SoC. Message. Process. Reminder

ESE532: System-on-a-Chip Architecture. Today. Programmable SoC. Message. Process. Reminder ESE532: System-on-a-Chip Architecture Day 5: September 18, 2017 Dataflow Process Model Today Dataflow Process Model Motivation Issues Abstraction Basic Approach Dataflow variants Motivations/demands for

More information

Moore s Law. Computer architect goal Software developer assumption

Moore s Law. Computer architect goal Software developer assumption Moore s Law The number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. Self-fulfilling prophecy Computer architect goal Software developer

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

Trends in the Infrastructure of Computing

Trends in the Infrastructure of Computing Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much

More information

High-Performance Distributed RMA Locks

High-Performance Distributed RMA Locks High-Performance Distributed RMA Locks PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER Presented at ARM Research, Austin, Texas, USA, Feb. 2017 NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION spcl.inf.ethz.ch

More information

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141 EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures TORSTEN HOEFLER MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures with support of Tobias Gysi, Tobias Grosser @ SPCL presented at Guangzhou, China,

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

EECS150 - Digital Design Lecture 09 - Parallelism

EECS150 - Digital Design Lecture 09 - Parallelism EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization

More information

Progress in automatic GPU compilation and why you want to run MPI on your GPU

Progress in automatic GPU compilation and why you want to run MPI on your GPU TORSTEN HOEFLER Progress in automatic GPU compilation and why you want to run MPI on your GPU with Tobias Grosser and Tobias Gysi @ SPCL presented at CCDSC, Lyon, France, 2016 #pragma ivdep 2 !$ACC DATA

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters Torsten Hoefler With lots of help from the AUS Team and Bill Kramer at NCSA! All used images belong

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

Design of Parallel and High-Performance Computing Fall 2017 Lecture: distributed memory

Design of Parallel and High-Performance Computing Fall 2017 Lecture: distributed memory Design of Parallel and High-Performance Computing Fall 2017 Lecture: distributed memory Motivational video: https://www.youtube.com/watch?v=pucx50fdsic Instructor: Torsten Hoefler & Markus Püschel TA:

More information

1.3 Data processing; data storage; data movement; and control.

1.3 Data processing; data storage; data movement; and control. CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical

More information

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Intel K. K. E-mail: hirokazu.kobayashi@intel.com Yoshifumi Nakamura RIKEN AICS E-mail: nakamura@riken.jp Shinji Takeda

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

MPI in 2020: Opportunities and Challenges. William Gropp

MPI in 2020: Opportunities and Challenges. William Gropp MPI in 2020: Opportunities and Challenges William Gropp www.cs.illinois.edu/~wgropp MPI and Supercomputing The Message Passing Interface (MPI) has been amazingly successful First released in 1992, it is

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Performance of computer systems

Performance of computer systems Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Characterizing the Influence of System Noise on Large-Scale Applications by Simulation Torsten Hoefler, Timo Schneider, Andrew Lumsdaine

Characterizing the Influence of System Noise on Large-Scale Applications by Simulation Torsten Hoefler, Timo Schneider, Andrew Lumsdaine Characterizing the Influence of System Noise on Large-Scale Applications by Simulation Torsten Hoefler, Timo Schneider, Andrew Lumsdaine System Noise Introduction and History CPUs are time-shared Deamons,

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,

More information

Parallelism Marco Serafini

Parallelism Marco Serafini Parallelism Marco Serafini COMPSCI 590S Lecture 3 Announcements Reviews First paper posted on website Review due by this Wednesday 11 PM (hard deadline) Data Science Career Mixer (save the date!) November

More information

Parallel Computing Ideas

Parallel Computing Ideas Parallel Computing Ideas K. 1 1 Department of Mathematics 2018 Why When to go for speed Historically: Production code Code takes a long time to run Code runs many times Code is not end in itself 2010:

More information

Architecture without explicit locks for logic simulation on SIMD machines

Architecture without explicit locks for logic simulation on SIMD machines Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

TORSTEN HOEFLER

TORSTEN HOEFLER TORSTEN HOEFLER MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures most work performed by TOBIAS GYSI AND TOBIAS GROSSER Stencil computations (oh no,

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER MPI-3.0 RMA MPI-3.0 supports RMA ( MPI One Sided ) Designed to react to

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

CSCE 626 Experimental Evaluation.

CSCE 626 Experimental Evaluation. CSCE 626 Experimental Evaluation http://parasol.tamu.edu Introduction This lecture discusses how to properly design an experimental setup, measure and analyze the performance of parallel algorithms you

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

A 101 Guide to Heterogeneous, Accelerated, Data Centric Computing Architectures

A 101 Guide to Heterogeneous, Accelerated, Data Centric Computing Architectures A 101 Guide to Heterogeneous, Accelerated, Centric Computing Architectures Allan Cantle President & Founder, Nallatech Join the Conversation #OpenPOWERSummit 2016 OpenPOWER Foundation Buzzword & Acronym

More information

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

More information

GPU-centric communication for improved efficiency

GPU-centric communication for improved efficiency GPU-centric communication for improved efficiency Benjamin Klenk *, Lena Oden, Holger Fröning * * Heidelberg University, Germany Fraunhofer Institute for Industrial Mathematics, Germany GPCDP Workshop

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT 7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9 General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

High Performance Computing: Blue-Gene and Road Runner. Ravi Patel

High Performance Computing: Blue-Gene and Road Runner. Ravi Patel High Performance Computing: Blue-Gene and Road Runner Ravi Patel 1 HPC General Information 2 HPC Considerations Criterion Performance Speed Power Scalability Number of nodes Latency bottlenecks Reliability

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Big Orange Bramble. August 09, 2016

Big Orange Bramble. August 09, 2016 Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University

More information

Chap. 2 part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 2 part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 2 part 1 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Provocative question (p30) How much do we need to know about the HW to write good par. prog.? Chap. gives HW background knowledge

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Non-Blocking Collectives for MPI

Non-Blocking Collectives for MPI Non-Blocking Collectives for MPI overlap at the highest level Torsten Höfler Open Systems Lab Indiana University Bloomington, IN, USA Institut für Wissenschaftliches Rechnen Technische Universität Dresden

More information

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016 Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2016 Primitive-partitioning vs. space-partitioning acceleration structures Primitive partitioning (bounding volume hierarchy): partitions node s primitives

More information

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER MPI-3.0 REMOTE MEMORY ACCESS MPI-3.0 supports RMA ( MPI One Sided ) Designed

More information

Microprocessor Trends and Implications for the Future

Microprocessor Trends and Implications for the Future Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from

More information

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University   Scalable Tools Workshop 7 August 2017 Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA Introduction to the Xeon Phi programming model Fabio AFFINITO, CINECA What is a Xeon Phi? MIC = Many Integrated Core architecture by Intel Other names: KNF, KNC, Xeon Phi... Not a CPU (but somewhat similar

More information

Computer Architecture Crash course

Computer Architecture Crash course Computer Architecture Crash course Frédéric Haziza Department of Computer Systems Uppsala University Summer 2008 Conclusions The multicore era is already here cost of parallelism is dropping

More information

A Global Operating System for HPC Clusters

A Global Operating System for HPC Clusters A Global Operating System Emiliano Betti 1 Marco Cesati 1 Roberto Gioiosa 2 Francesco Piermaria 1 1 System Programming Research Group, University of Rome Tor Vergata 2 BlueGene Software Division, IBM TJ

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

An Overview of Parallel Computing

An Overview of Parallel Computing An Overview of Parallel Computing Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 Plan 1 Hardware 2 Types of Parallelism 3 Concurrency Platforms: Three Examples Cilk CUDA

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?

More information

Efficient Parallel Programming on Xeon Phi for Exascale

Efficient Parallel Programming on Xeon Phi for Exascale Efficient Parallel Programming on Xeon Phi for Exascale Eric Petit, Intel IPAG, Seminar at MDLS, Saclay, 29th November 2016 Legal Disclaimers Intel technologies features and benefits depend on system configuration

More information

Allinea Unified Environment

Allinea Unified Environment Allinea Unified Environment Allinea s unified tools for debugging and profiling HPC Codes Beau Paisley Allinea Software bpaisley@allinea.com 720.583.0380 Today s Challenge Q: What is the impact of current

More information

High-Performance Distributed RMA Locks

High-Performance Distributed RMA Locks High-Performance Distributed RMA Locks PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION spcl.inf.ethz.ch LOCKS An example structure Proc p Proc q Inuitive semantics

More information

Dataflow Supercomputers

Dataflow Supercomputers Dataflow Supercomputers Michael J. Flynn Maxeler Technologies and Stanford University Outline History Dataflow as a supercomputer technology openspl: generalizing the dataflow programming model Optimizing

More information