TORSTEN HOEFLER
Transcription
1 TORSTEN HOEFLER: Performance Modeling for Future Computing Technologies. Presentation at Tsinghua University, Beijing, China, part of the 60th anniversary celebration of the Computer Science Department (#9). spcl.inf.ethz.ch
2 Changing hardware constraints and the physics of computing. Process nodes: 130nm, 90nm, 65nm, 45nm, 32nm, 22nm, 14nm, 10nm. Energy at 45nm, 0.9 V [1]: 32-bit FP ADD: 0.9 pJ; 32-bit FP MUL: 3.2 pJ; reading 32 bit from L1 (8 KiB): 10 pJ; 32 bit from L2 (1 MiB): 100 pJ; 32 bit from DRAM: 1.3 nJ. The three Ls of modern computing. How to address locality challenges on standard architectures and programming? D. Unat et al.: Trends in Data Locality Abstractions for HPC Systems, IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 28, Nr. 10, IEEE, Oct. 2017. [1]: Mark Horowitz: Computing's Energy Problem (and what we can do about it), ISSCC 2014, plenary. [2]: Moore: Landauer Limit Demonstrated, IEEE Spectrum 2012
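For scale (our arithmetic, using the numbers above): fetching 32 bits from DRAM (1.3 nJ) costs about 1400 times as much energy as a 32-bit FP add (0.9 pJ), and even an L2 access (100 pJ) costs more than 100 adds; arithmetic is nearly free compared to data movement.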
3 Control locality? Load-store vs. dataflow architectures. Turing Award 1977 (Backus): "Surely there must be a less primitive way of making big changes in the store than pushing vast numbers of words back and forth through the von Neumann bottleneck." Load-store ("von Neumann"): x=a+b becomes ld r1, a; ld r2, b; add r1, r1, r2; st x, r1, with control logic, registers, cache, and memory on the path of every instruction; energy per instruction: 70pJ. Static dataflow ("non von Neumann"): y=(a+b)*(c+d) maps the + and * operations directly onto ALUs wired together; energy per operation: 1-3pJ. Source: Mark Horowitz, ISSCC'14
4 Single Instruction Multiple Data/Threads (SIMD: vector CPU, SIMT: GPU): one control unit drives many ALUs. (High Performance) Computing really became a data management challenge. At 45nm, 0.9 V [1]: single R/W registers, 32 bit: 0.1 pJ; random-access SRAM: 8 KiB: 10 pJ, 32 KiB: 20 pJ, 1 MiB: 100 pJ. [1]: Mark Horowitz: Computing's Energy Problem (and what we can do about it), ISSCC 2014, plenary
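To make the SIMD point concrete, a minimal sketch (our addition, not from the slides): a loop of independent element-wise additions that a vectorizing compiler can map onto wide SIMD units, amortizing one instruction's control and decode energy across many ALU lanes.

```c
#include <stddef.h>

/* Element-wise addition: each iteration is independent, so a
 * vectorizing compiler (e.g., at -O3) can issue one SIMD add for
 * 4-16 elements, paying instruction-control energy once per vector
 * instead of once per scalar operation. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```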
5 Crystal ball into the post-Moore future (maybe already today?). Efficient CMOS: asynchronous circuits, use transistors more efficiently, suddenly much lower energy benefits, comparatively little work. Neuromorphic: integrates compute (neurons) and storage (synapses), 1000x energy benefit; phrase your problem as inference! Even learning is hard. Quantum: completely different paradigm, concept of qubits, based on quantum mechanics (which only works in isolation); many different ideas how to build: ion trap (ions trapped in fields), optical, spin-based, superconducting, Majorana qubits (none proven to scale); algorithms are limited. Accelerators: custom architectures, very specialized, network and storage, accelerated heterogeneity. Reconfigurable datapaths: adapt the architecture to the problem, dataflow + control flow, FPGAs or CGRAs or GPUs, have been around for a while, need new algorithms to be useful; main challenge: programmability! See our SC18 tutorial: Productive Parallel Programming for FPGA. Future architectures will force us to manage memory/communication. Rest of this talk: how do we understand which parts of programs to accelerate on which device? Obvious answer: the slow ones! So simply observe their performance? Not so fast. Image sources: 21stcentury.com, intel.com
6 What can we learn from High Performance Computing? dgemm("n", "N", 50, 50, 50, 1.0, A, 50, B, 50, 1.0, C, 50); [Figure: successive optimization and parallelization steps improve performance from >2x and ~4x through roughly 10^3, 10^4, 10^6, 10^8, up to ~10^10.]
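As a concrete reference point, a sketch of how such a measurement might be set up in C using the CBLAS interface (the size 50 and leading dimensions follow the call on the slide; the timing harness and input values are our additions):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>  /* link with an optimized BLAS, e.g., -lcblas */

int main(void) {
    const int n = 50;
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = calloc(n * n, sizeof *C);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* C = 1.0*A*B + 1.0*C, column-major, no transposes */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* a square matrix-matrix multiply performs 2*n^3 flops */
    printf("%.3f GFlop/s\n", 2.0 * n * n * n / sec * 1e-9);
    free(A); free(B); free(C);
    return 0;
}
```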
7 HPC is used to solve complex problems! Image credit: Serena Donnin, Sarah Rauscher, Ivo Kabadshow
8 Scientific Performance Engineering: 1) Observe, 2) Model, 3) Understand, 4) Build.
9 Part I: Observe. Measure systems: experimental design, examine documentation, collect data, gather statistics, document the process, factorial design.
10 Trivial example: simple ping-pong latency benchmark. "The latency of Piz Daint is 1.77us!" "How did you get this number?" "I averaged 10^6 experiments, it must be right!" "Why do you think so? Can I see the data?" [Histogram of latency samples in usec: a sharp peak at ~1.77us with rare outliers up to ~1.2ms.] TH, Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15
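A minimal sketch of such a ping-pong benchmark in C with MPI (our addition); unlike the quote above, it keeps every sample instead of only the average, so the full distribution can be analyzed afterwards (message size and repetition count are arbitrary choices here):

```c
#include <mpi.h>
#include <stdio.h>

#define NSAMPLES 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};
    double t[NSAMPLES];

    for (int i = 0; i < NSAMPLES; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        t[i] = (MPI_Wtime() - start) / 2.0;  /* one-way latency estimate */
    }

    if (rank == 0)  /* dump raw samples; summarize offline, not here */
        for (int i = 0; i < NSAMPLES; ++i)
            printf("%d %.9f\n", i, t[i]);

    MPI_Finalize();
    return 0;
}
```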
11 Dealing with variation. "The 99.9% confidence interval is 1.765us to 1.775us." "Did you assume normality? The data is not normal at all." "Ugs, what? Isn't that always the case with many measurements?" "No: in the nonparametric case the 99.9% CI is much wider, 1.6us to 1.9us! Can we test for normality?" TH, Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15
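One standard way to obtain such a nonparametric interval (this formula is our addition, not from the slide): a distribution-free confidence interval for the median is built from order statistics. For n sorted samples x_(1) <= ... <= x_(n), the endpoint ranks follow from the binomial distribution, approximated for large n by

```latex
% Distribution-free CI for the median from order statistics
% (normal approximation to Binomial(n, 1/2)):
l = \left\lfloor \frac{n - z_{1-\alpha/2}\sqrt{n}}{2} \right\rfloor,
\qquad
u = \left\lceil 1 + \frac{n + z_{1-\alpha/2}\sqrt{n}}{2} \right\rceil,
\qquad
\Pr\!\left[x_{(l)} \le \mathrm{median} \le x_{(u)}\right] \ge 1-\alpha .
```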
12 Looking at the data in detail. "This CI makes me nervous. Let's zoom in!" Clearly, the mean/median are not sufficient! Try quantile regression! Image credit: nersc.gov
13 Scientific benchmarking of parallel computing systems, ACM/IEEE Supercomputing 2015 (SC15); talk online on YouTube!
Rule 1: When publishing parallel speedup, report if the base case is a single parallel process or best serial execution, as well as the absolute execution performance of the base case.
Rule 2: Specify the reason for only reporting subsets of standard benchmarks or applications or not using all system resources.
Rule 3: Use the arithmetic mean only for summarizing costs. Use the harmonic mean for summarizing rates.
Rule 4: Avoid summarizing ratios; summarize the costs or rates that the ratios base on instead. Only if these are not available use the geometric mean for summarizing ratios.
Rule 5: Report if the measurement values are deterministic. For nondeterministic data, report confidence intervals of the measurement.
Rule 6: Do not assume normality of collected data (e.g., based on the number of samples) without diagnostic checking.
Rule 7: Compare nondeterministic data in a statistically sound way, e.g., using non-overlapping confidence intervals or ANOVA.
Rule 8: Carefully investigate if measures of central tendency such as mean or median are useful to report. Some problems, such as worst-case latency, may require other percentiles.
Rule 9: Document all varying factors and their levels as well as the complete experimental setup (e.g., software, hardware, techniques) to facilitate reproducibility and provide interpretability.
Rule 10: For parallel time measurements, report all measurement, (optional) synchronization, and summarization techniques.
Rule 11: If possible, show upper performance bounds to facilitate interpretability of the measured results.
Rule 12: Plot as much information as needed to interpret the experimental results. Only connect measurements by lines if they indicate trends and the interpolation is valid.
14 Simplifying measuring and reporting: LibSciBench. Simple MPI-like C/C++ interface; high-resolution timers; flexible data collection controlled by environment variables; tested up to 512k ranks; parallel timer synchronization; R scripts for data analysis and visualization. S. Di Girolamo, TH
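A minimal sketch of how a timing loop with LibSciBench might look; the call names below (LSB_Init, LSB_Res, LSB_Rec, LSB_Finalize) are assumptions based on the library's documented style and should be checked against the actual headers:

```c
#include <mpi.h>
#include <liblsb.h>  /* LibSciBench; API names here are assumed */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    LSB_Init("pingpong", 0);        /* open a trace named "pingpong" */

    for (int i = 0; i < 1000; ++i) {
        LSB_Res();                  /* reset/start the high-resolution timer */
        /* ... code under measurement, e.g., one ping-pong round ... */
        LSB_Rec(i);                 /* record one timed sample with id i */
    }

    LSB_Finalize();                 /* flush data for the R analysis scripts */
    MPI_Finalize();
    return 0;
}
```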
15 We have the (statistically sound) data, now what? t(n=100)? t(n=1510)? Matrix multiply: t(n) = a*n^3. The 99% confidence interval is within 1% of the reported median. TH, W. Gropp, M. Snir, W. Kramer: Performance Modeling for Systematic Performance Tuning, IEEE/ACM SC11
16 We have the (statistically sound) data, now what? t(n=100)=0.667s, t(n=1510)=0.48s. The 99% confidence interval is within 1% of the reported median. The adjusted R^2 of the model fit is 0.99. TH, W. Gropp, M. Snir, W. Kramer: Performance Modeling for Systematic Performance Tuning, IEEE/ACM SC11
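For a single-parameter model like t(n) = a*n^3, the least-squares estimate has a closed form, a = sum_i t_i*n_i^3 / sum_i n_i^6. A minimal sketch (our illustration, with made-up sample data):

```c
#include <stdio.h>

/* Closed-form least-squares fit of t(n) = a * n^3:
 * minimizing sum_i (t_i - a*n_i^3)^2 over a yields
 * a = sum_i t_i*n_i^3 / sum_i n_i^6. */
int main(void) {
    double n[] = {100, 200, 400, 800};          /* hypothetical problem sizes */
    double t[] = {0.0007, 0.0056, 0.0448, 0.358}; /* hypothetical medians (s) */
    double num = 0.0, den = 0.0;
    for (int i = 0; i < 4; ++i) {
        double n3 = n[i] * n[i] * n[i];
        num += t[i] * n3;
        den += n3 * n3;
    }
    double a = num / den;
    printf("a = %.3e s, prediction t(1600) = %.3f s\n",
           a, a * 1600.0 * 1600.0 * 1600.0);
    return 0;
}
```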
17 Part II: Model. Burnham, Anderson: "A model is a simplification or approximation of reality and hence will not reflect all of reality. ... Box noted that all models are wrong, but some are useful. While a model can never be truth, a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless." This is generally true for all kinds of modeling; we focus on performance modeling in the following!
18 Performance Modeling: Capability Model, Performance Model, Requirements Model. TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
19 Requirements modeling I: six-step performance modeling. (1) Identify input parameters; (2) identify communication parameters; (3) describe application kernels; (4) fit the sequential baseline; (5) model communication/computation overlap; (6) model the communication pattern [1]. Applying this process yielded 10-20% speedup [2]. [1] TH, W. Gropp, M. Snir and W. Kramer: Performance Modeling for Systematic Performance Tuning, SC11. [2] TH and S. Gottlieb: Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient using MPI Datatypes, EuroMPI'10
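An illustrative shape of the resulting analytic model (our sketch, not from the talk) for a stencil-like code on p processes with problem size N, combining a fitted sequential baseline with a communication term:

```latex
% Hypothetical six-step result: computation scales with the local
% share N/p; communication has latency (alpha) and bandwidth (beta) parts:
T(N, p) \;=\; \underbrace{c_1 \frac{N}{p}}_{\text{kernels, fitted baseline}}
\;+\; \underbrace{\alpha \log p \;+\; \beta \left(\frac{N}{p}\right)^{2/3}}_{\text{communication pattern}}
```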
20 Requirements modeling II: automated best-fit modeling. Manual kernel selection and hypothesis generation is time consuming (boring and tricky). Idea: automatically select the best (scalability) model from a predefined search space: f(p) = sum_{k=1..n} c_k * p^(i_k) * log2(p)^(j_k), where n is the number of terms, c_k are (model) constants, p is e.g. the number of processes, i_k in I, j_k in J, and I, J are subsets of the rationals. Example: n = 1, I = {0, 1, 2}, J = {0, 1} yields the candidate terms 1, log(p), p, p*log(p), p^2, p^2*log(p). [1]: A. Calotoiu, TH, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, IEEE/ACM SC13
21 Requirements modeling II: automated best-fit modeling. Manual kernel selection and hypothesis generation is time consuming (and boring). Idea: automatically select the best model from the predefined space f(p) = sum_{k=1..n} c_k * p^(i_k) * log2(p)^(j_k) with i_k in I, j_k in J. For n = 2, I = {0, 1, 2}, J = {0, 1}, the hypothesis space contains, among others: c1; c1 + c2*log(p); c1 + c2*p; c1 + c2*p*log(p); c1*log(p) + c2*p; c1*log(p) + c2*p*log(p); c1*p + c2*p*log(p); c1*p + c2*p^2; c1*p*log(p) + c2*p^2; c1*p^2 + c2*p^2*log(p). [1]: A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, IEEE/ACM SC13
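A toy sketch of the best-fit idea (our illustration, far simpler than Extra-P): fit each candidate single-term model c*g(p) by closed-form least squares and keep the one with the smallest squared error.

```c
#include <math.h>
#include <stdio.h>

/* Candidate model shapes g(p) = p^i * log2(p)^j
 * with i in {0,1,2} and j in {0,1}; compile with -lm. */
static double g(int i, int j, double p) {
    return pow(p, i) * pow(log2(p), j);
}

int main(void) {
    double p[] = {2, 4, 8, 16, 32, 64};
    double t[] = {1.1, 2.3, 4.2, 8.5, 16.4, 33.0}; /* hypothetical timings */
    int m = 6, best_i = 0, best_j = 0;
    double best_sse = INFINITY;

    for (int i = 0; i <= 2; ++i)
        for (int j = 0; j <= 1; ++j) {
            double num = 0, den = 0, sse = 0;
            /* least squares for one term: c = sum t*g / sum g^2 */
            for (int k = 0; k < m; ++k) {
                num += t[k] * g(i, j, p[k]);
                den += g(i, j, p[k]) * g(i, j, p[k]);
            }
            double c = num / den;
            for (int k = 0; k < m; ++k) {
                double r = t[k] - c * g(i, j, p[k]);
                sse += r * r;
            }
            if (sse < best_sse) { best_sse = sse; best_i = i; best_j = j; }
        }
    printf("best model: c * p^%d * log2(p)^%d\n", best_i, best_j);
    return 0;
}
```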
22 Requirements modeling III: source-code analysis [1]. Extra-P selects the model based on the best fit to the data. What if the data is not sufficient or too noisy? Back to first principles: the source code describes all possible executions. Describing all possibilities is too expensive, so focus on counting loop iterations symbolically:

for (j = 1; j <= n; j *= 2)
  for (k = j; k <= n; k++)
    OperationInBody(j, k);

Parallel program, loop extraction, requirements models. Number of iterations: N = (n + 1) * log2(n) - n + 2 (for n a power of two); work W = N at p = 1, depth D = N at p = infinity. [1]: TH, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, ACM SPAA'14
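A quick sketch to sanity-check the closed form by brute force (our addition):

```c
#include <stdio.h>

/* Count iterations of the loop nest from the slide and compare with
 * N = (n+1)*log2(n) - n + 2, valid when n is a power of two. */
int main(void) {
    for (long n = 2; n <= 1L << 20; n *= 2) {
        long count = 0, log2n = 0;
        for (long t = n; t > 1; t /= 2) log2n++;
        for (long j = 1; j <= n; j *= 2)
            for (long k = j; k <= n; k++)
                count++;
        long formula = (n + 1) * log2n - n + 2;
        printf("n=%8ld  counted=%12ld  formula=%12ld  %s\n",
               n, count, formula, count == formula ? "ok" : "MISMATCH");
    }
    return 0;
}
```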
23 Performance Modeling (overview): Capability Model; Performance Model (best-fit models of the form sum c_k * p^(i_k) * log^(j_k)(p)); Requirements Model (input parameters, communication parameters, application kernels, sequential baseline, communication/computation overlap, communication pattern). TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
24 Performance Modeling (overview): Capability Model; Performance Model (best-fit models of the form sum c_k * p^(i_k) * log^(j_k)(p)); Requirements Model (input parameters, communication parameters, application kernels, sequential baseline, communication/computation overlap, communication pattern). TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
25 Capability models for cache-to-cache communication. Measured single-cache-line access costs (Xeon Phi): read of an invalid line (from memory): R_I = 278 ns; local read: R_L = 8.6 ns; remote read (line in another core's cache): R_R = 235 ns. S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC'13
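A rough consequence of such a model (our note): a single cache-line handoff between two cores costs about one remote read R_R, more than an order of magnitude above a local read R_L, so an algorithm's on-chip communication cost can be estimated by counting the remote line transfers it performs.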
26 Performance Modeling (overview): Capability Model; Performance Model (best-fit models of the form sum c_k * p^(i_k) * log^(j_k)(p)); Requirements Model (input parameters, communication parameters, application kernels, sequential baseline, communication/computation overlap, communication pattern). TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
27 Part III: Understand. Use models to: 1. prove optimality of real implementations (if optimal, stop optimizing and step back to the algorithm level); 2. design optimal algorithms or systems in the model (can lead to non-intuitive designs). Example: prove optimality of matrix multiplication. Intuition: flop rate is the bottleneck. Measured t(n) = 76 ps * n^3, so the flop rate R = 2 flop * n^3 / (76 ps * n^3) = 26.3 Gflop/s, which is ~90% of the flop peak (8 flops/cycle, IBM Power7). Gets more complex quickly: imagine sparse matrix-vector.
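The argument in symbols (notation ours; the peak value is inferred from the ~90% figure rather than taken from the slide):

```latex
R(n) \;=\; \frac{2 n^3\,\mathrm{flop}}{t(n)}
      \;=\; \frac{2\,\mathrm{flop}}{76\,\mathrm{ps}}
      \;\approx\; 26.3\ \mathrm{Gflop/s},
\qquad
\frac{R}{R_\mathrm{peak}} \;\approx\; \frac{26.3}{29} \;\approx\; 0.9 .
```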
28 Design algorithms (broadcast) in the cache-to-cache model. Multi-ary tree example: a tree of depth d with per-level fan-outs, e.g., d = 2 with k_1 = 2 and k_2 = 3; the model yields the tree cost and the number of reached threads for each configuration, so tree depth and level sizes can be optimized analytically rather than guessed. S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC'13
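As a generic sketch of such a model (our formulation, not the paper's exact one, assuming a uniform fan-out k and that each level's cost is dominated by k remote reads):

```latex
% Threads reached by a k-ary broadcast tree of depth d:
P(k,d) \;=\; \sum_{i=0}^{d} k^{\,i} \;=\; \frac{k^{d+1}-1}{k-1},
\qquad
% and a simple cost estimate, with R_R the remote-read latency:
T(k,d) \;\approx\; d \cdot k \cdot R_R .
```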
29 Measured results: small broadcast and reduction. Speedups over Intel MPI: 4.7x at P=10 and 3.3x at P=58. Intel Xeon Phi 5110P (60 cores at 1.05 GHz), Intel MPI; each operation timed separately, reporting the maximum across processes. S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC'13
30 Performance Modeling (overview): Capability Model; Performance Model (best-fit models of the form sum c_k * p^(i_k) * log^(j_k)(p)); Requirements Model (input parameters, communication parameters, application kernels, sequential baseline, communication/computation overlap, communication pattern). TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
31 How to continue from here? DAPP Transformation System and DAPP Parallel Language: data-centric, explicit requirements models; user-supported, compile- and run-time; memlets + operators = DCIR. Performance-transparent platforms: HTM [1], RMA/foMPI-NA [2], NISA [3]. [1]: M. Besta, TH: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, ACM HPDC'15. [2]: R. Belli, TH: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS'15. [3]: S. Di Girolamo, P. Jolivet, K. D. Underwood, TH: Exploiting Offload Enabled Network Interfaces, IEEE Micro'16