Taming High Performance Computing with Compiler Technology
1 Taming High Performance Computing with Compiler Technology. John Mellor-Crummey, Department of Computer Science, Center for High Performance Software Research, Rice University
2 High Performance Computing Applications: scientific inquiry ranging from elementary particles to cosmology; pollution modeling and remediation planning; storm forecasting and climate prediction; advanced vehicle design; computational chemistry and drug design; molecular nanotechnology; cryptology; nuclear weapons stewardship.
3 High Performance Applications: algorithms, architectures, and data structures. Effective parallelization drives scalability, and single-processor performance can differ by integer factors.
4 Status of Highly-parallel Systems: "[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications." Information Technology Research: Investing in Our Future, PITAC Report to the President, 1999.
5 Challenges for Highly Parallel Computing: effective algorithms for complex problems; programming models and compilers; application development tools; operating systems for large-scale machines; better high-performance architecture designs.
6 Current Research Themes: compiler support for data-parallel programming (implicitly and explicitly parallel global address space languages); technology for auto-tuning software (automatically tailoring code to a microprocessor architecture); performance analysis tools (understanding application behavior on current systems); performance modeling (how will applications perform at different scales and on future systems?); compiler technology for scientific scripting languages (the R language for statistical programming).
7 Outline: motivation; compiler technology for HPC: compiling data-parallel languages, semi-automatic synthesis of performance models; challenges for the future; other work.
8 Compiling Data-parallel Languages: introduction (data parallelism, compiling HPF-like languages); the Rice dhpf compiler; data partitioning research; analysis and code generation; experimental results.
9 Data Parallelism: apply the same operation to many data elements; the operations need not be synchronous, nor completely uniform. Applicable to many problems in science and engineering.
10 Data Parallel Programming Alternatives. Hand-coded parallelization using library-based models: complete applicability, but difficult to design and implement, and all responsibility for tuning falls to the developer. Application frameworks: easy to use, but limited applicability. Single-threaded data-parallel languages: much more flexible than application frameworks, much simpler to use than hand-coded parallelization, and they offload the details of tuning from the developer; however, the compiler significantly determines performance, compilers are enormously complex, and you are out of luck if the compiler doesn't deliver performance.
11 Data Parallel Compilation with High Performance Fortran: the partitioning of data drives the partitioning of computation, communication, and synchronization. An HPF program is a Fortran program plus a data partitioning; compiling it for a parallel machine requires partitioning the computation, inserting communication, and managing storage, while producing the same answers as the sequential program.
12 Example HPF Program

CHPF$ processors P(3,3)
CHPF$ distribute A(block, block) onto P
CHPF$ distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))
        ENDDO
      ENDDO

[Figure: 3x3 processor grid P(0,0) through P(2,2); data for A and B laid out with a (BLOCK,BLOCK) distribution]
13 Compiling HPF-like Languages: partition data; select a mapping of computation to processors; analyze communication requirements; partition computation by reducing loop bounds; insert communication. A sketch of these steps for the example above follows.
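To make these steps concrete, here is a minimal sketch (illustrative Python, not the dhpf implementation) of how a 1D BLOCK distribution determines reduced loop bounds and the ghost rows each processor must receive for the Jacobi example above; all function names are assumptions for illustration.

```python
# Sketch: owner-computes compilation of the Jacobi example under a 1D BLOCK
# distribution of rows. Illustrative only -- not the dhpf algorithm itself.

def block_range(dim_size, nprocs, pid):
    """Index range [lo, hi] of the BLOCK-distributed rows owned by pid."""
    chunk = (dim_size + nprocs - 1) // nprocs
    lo = pid * chunk + 1                      # 1-based, as in the Fortran code
    hi = min((pid + 1) * chunk, dim_size)
    return lo, hi

def local_loop_bounds(n, nprocs, pid):
    """Reduce the global loop 'DO i = 2, n-1' to the iterations pid owns."""
    lo, hi = block_range(n, nprocs, pid)
    return max(lo, 2), min(hi, n - 1)

def ghost_rows_needed(n, nprocs, pid):
    """B(i-1,:) and B(i+1,:) references need one ghost row per partitioned edge."""
    lo, hi = block_range(n, nprocs, pid)
    ghosts = []
    if lo > 2:       # need row lo-1 from the neighbor above
        ghosts.append(lo - 1)
    if hi < n - 1:   # need row hi+1 from the neighbor below
        ghosts.append(hi + 1)
    return ghosts

n, nprocs = 100, 4
for pid in range(nprocs):
    print(pid, local_loop_bounds(n, nprocs, pid), ghost_rows_needed(n, nprocs, pid))
```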
14 The Devil is in the Details. Good data and computation partitionings are a must: without them, parallelism suffers. Excess communication undermines scalability: both frequency and volume must be right. Single-processor efficiency is critical: the code must use caches effectively, and node code must be amenable to optimization. Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations.
15 Rice dhpf Compiler. Achievements: parallelizes sequential codes with minimal rewriting; near hand-coded performance for tightly coupled codes. Innovations: sophisticated data partitionings; an abstract set-based framework for communication analysis and code generation; sophisticated computation partitionings, including partial replication to reduce communication; comprehensive optimizations.
16 Data Partitioning. Good parallel performance requires a suitable partitioning, and tightly-coupled computations are problematic. Line-sweep computations, e.g. ADI integration:

      do j = 1, n
        do i = 2, n
          a(i,j) = a(i-1,j)
        enddo
      enddo

The recurrence on i makes parallelization difficult with BLOCK partitionings.
17 Coarse-Grain Pipelining: compute along partitioned dimensions; partial serialization induces wavefront parallelism with block partitioning. [Figure: a line sweep pipelined across Processors 0-3]
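A toy schedule makes the wavefront visible. This sketch (illustrative Python, assuming unit time per strip) computes when each processor can start each strip of a pipelined sweep: processor p at strip k waits for its own strip k-1 and for strip k of the upstream processor.

```python
# Sketch: wavefront schedule induced by coarse-grain pipelining of a line
# sweep 'a(i,j) = f(a(i-1,j))' over a 1D BLOCK partition of i across procs.
# Each processor sweeps its block of rows one j-strip at a time; strip k on
# processor p can start only after strip k on processor p-1 is done.

def pipeline_schedule(nprocs, nstrips):
    """Return the start time of (proc, strip), assuming unit time per strip."""
    start = {}
    for k in range(nstrips):
        for p in range(nprocs):
            ready_local = start[(p, k - 1)] + 1 if k > 0 else 0
            ready_upstream = start[(p - 1, k)] + 1 if p > 0 else 0
            start[(p, k)] = max(ready_local, ready_upstream)
    return start

s = pipeline_schedule(nprocs=4, nstrips=8)
makespan = max(s.values()) + 1      # (nprocs - 1) + nstrips = 11, not 4 * 8
print(makespan)
```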
19 Parallelizing Line Sweeps. [Figure: comparison of compiler-generated coarse-grain pipelining against hand-coded multipartitioning]
20 Diagonal Multipartitioning: each processor owns one tile between each pair of cuts along each distributed dimension, which enables full parallelism for a sweep along any partitioned dimension. [Figure: diagonal tile-to-processor assignment for Processors 0-3]
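A minimal sketch of the 2D case: with p processors and a p x p tile grid, assigning tile (i, j) to processor (i + j) mod p yields a Latin-square layout in which every tile-row and tile-column contains each processor exactly once (the modular formula and tile counts here are illustrative assumptions for the 2D diagonal case).

```python
# Sketch: diagonal multipartitioning in 2D. The domain is cut into a p x p
# grid of tiles and tile (i, j) is owned by processor (i + j) mod p.

def owner(i, j, p):
    return (i + j) % p

p = 4
for i in range(p):
    print([owner(i, j, p) for j in range(p)])

# Each tile-row (and, by symmetry, each tile-column) is a permutation of
# 0..p-1, so a line sweep along either partitioned dimension keeps all p
# processors busy at every step of the sweep.
assert all(sorted(owner(i, j, p) for j in range(p)) == list(range(p))
           for i in range(p))
```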
22 Generalized Multipartitioning. Given an n-dimensional data domain and p processors, select which λ dimensions to partition (2 ≤ λ ≤ n) and how many cuts to make in each. Partitioning constraints: the number of tiles in each (λ-1)-dimensional hyperplane is a multiple of p, and there are no more cuts than necessary. Mapping constraints: load balance (in a hyperplane, each processor has the same number of tiles) and the neighbor property (in any particular direction, the neighbor of a given processor is the same). Objective function: minimize communication volume by picking the configuration of cuts that minimizes the total cross-section. IPDPS 2002 Best Paper in Algorithms; JPDC 2003.
23 Choosing the Best Partitioning: enumerate all elementary partitionings (the candidates depend on the factorization of p), evaluate their communication cost, and select the minimum-cost partitioning. The worst case occurs when p is a product of unique prime factors: there are d(d-1)/2 choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor, and (1 + o(1)) log p / log log p possible unique factors of p. Very fast in practice; a brute-force version of the search is sketched below.
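The following exhaustive sketch illustrates only the constraint and the objective function; the actual algorithm enumerates candidates from the prime factorization of p rather than searching all tile counts, and the names and bounds here are assumptions for illustration.

```python
# Brute-force sketch of choosing a multipartitioning: search tile counts per
# dimension (1 = dimension not partitioned), keep configurations where every
# (lambda-1)-dim hyperplane of tiles holds a multiple of p tiles, and pick
# the one with the smallest total cut cross-section (communication volume).
from itertools import product

def prod_except(tiles, part, k):
    r = 1
    for i in part:
        if i != k:
            r *= tiles[i]
    return r

def cross_section(dims, k):
    r = 1
    for i, e in enumerate(dims):
        if i != k:
            r *= e
    return r

def best_partitioning(dims, p, max_tiles=16):
    best = None
    d = len(dims)
    for tiles in product(range(1, max_tiles + 1), repeat=d):
        part = [k for k in range(d) if tiles[k] > 1]
        if len(part) < 2:                       # 2 <= lambda <= n
            continue
        if any(prod_except(tiles, part, k) % p != 0 for k in part):
            continue                            # hyperplane not a multiple of p
        cost = sum((tiles[k] - 1) * cross_section(dims, k) for k in range(d))
        if best is None or cost < best[0]:
            best = (cost, tiles)
    return best

# For p = 4 this finds the 2x2x2 three-dimensional multipartitioning.
print(best_partitioning((64, 64, 64), p=4))
```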
24 Mapping Tiles with Modular Mappings. [Figure: a basic tile shape replicated by modular shifts, with an integral number of shapes in each dimension]
25 Formal Compilation Framework. Three types of sets: data, iterations, processors. Three types of mappings: Layout maps data to processors, Reference maps iterations to data, and CompPart maps iterations to processors. Representation: integer tuples with Presburger arithmetic for constraints. Analysis: use set equations to compute the set(s) of interest, e.g. the iterations allocated to a processor and the communication sets. Code generation: synthesize loops from the set(s), e.g. parallel (SPMD) loop nests and message packing and unpacking. [Adve & Mellor-Crummey, PLDI 1998]
26 Why Symbolic Sets?

      processors P(3,3)
      distribute A(block, block) onto P
      distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))

Local section for P(x,y), and the iterations it executes:
{ [i,j] : 20x + 2 ≤ i ≤ 20x + 19 && 30y + 2 ≤ j ≤ 30y + 29 }
[Figure: the 3x3 data/loop partitioning across P(0,0) through P(2,2), highlighting the non-local data accessed by P(x,y) and the iterations that access non-local data]
27 Integer-Set Framework: Example

      real A(100)
      distribute A(BLOCK) on P(4)
      do i = 1, N
        ... = A(i-1) + A(i-2) + ...   ! ON_HOME A(i-1)
      enddo

symbolic N
Layout       := { [pid] -> [i] : 25*pid + 1 ≤ i ≤ 25*pid + 25 }
Loop         := { [i] : 1 ≤ i ≤ N }
CPSubscript  := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }
CompPart     := (Layout o CPSubscript^-1) ∩ Loop
DataAccessed := CompPart o RefSubscript
NonLocalDataAccessed := DataAccessed - Layout
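dhpf manipulates these sets symbolically with Presburger arithmetic; the sketch below (illustrative Python, not the compiler's representation) just enumerates them explicitly for N = 100 and 4 processors to show how the equations compose.

```python
# Sketch: evaluating the slide's set equations with explicit finite sets.

N, P = 100, 4
Layout = {(pid, i) for pid in range(P)
                   for i in range(25 * pid + 1, 25 * pid + 26)}
Loop = set(range(1, N + 1))
CPSubscript = {(i, i - 1) for i in Loop}     # ON_HOME A(i-1)
RefSubscript = {(i, i - 2) for i in Loop}    # the A(i-2) reference

def compose(rel1, rel2):
    """{(a, c) : (a, b) in rel1 and (b, c) in rel2}"""
    return {(a, c) for (a, b) in rel1 for (b2, c) in rel2 if b == b2}

def inverse(rel):
    return {(b, a) for (a, b) in rel}

# CompPart = (Layout o CPSubscript^-1) restricted to the loop's iterations
CompPart = {(pid, i) for (pid, i) in compose(Layout, inverse(CPSubscript))
            if i in Loop}
DataAccessed = compose(CompPart, RefSubscript)
NonLocal = DataAccessed - Layout

print(sorted(i for (pid, i) in CompPart if pid == 1))   # iterations 27..51
print(sorted(a for (pid, a) in NonLocal if pid == 1))   # [25]: one off-proc element
```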
28 Optimizations Using Integer Sets. Partially replicate computation to reduce communication: 66% lower message volume, 38% faster (NAS, 64 procs). Coalesce communication sets for multiple references: 41% lower message volume, 35% faster (NAS, 64 procs). Split loops into local-only and off-processor loops: 10% fewer D-cache misses, 9% faster (NAS). Processor-set constraints on communication sets: 12% fewer I-cache misses, 7% faster (NAS, 64 procs). PACT 2002 Best Student Paper (with Daniel Chavarría-Miranda).
29 Experimental Evaluation. The NAS SP and BT benchmarks from NASA Ames use ADI to solve the Navier-Stokes equation in 3D: forward and backward line sweeps on each dimension in each time step. Compare four variants: MPI hand-coded multipartitioning (NASA); dhpf with multipartitioning; dhpf with a 2D partitioning and coarse-grain pipelining; PGI's pghpf with a 1D partitioning and transpose. Platform: SGI Origin 2000, with SGI compilers + SGI MPI.
30 Efficiency for NAS SP (102³, Class B). [Plot: parallel efficiency; the coarse-grain pipelined version has similar communication volume but more serialization, and the 1D transpose version has more than 2x the multipartitioning communication volume]
31 Efficiency for NAS BT (102³, Class B). [Plot: parallel efficiency; the 1D transpose version has more than 2x the multipartitioning communication volume] Platform: SGI Origin 2000.
32 NAS BT Parallelizations. [Execution traces for NAS BT Class 'A' on 16 processors of an SGI Origin: hand-coded 3D multipartitioning vs. compiler-generated 3D multipartitioning]
33 Observations. High performance requires perfection: parallelism and load balance, communication frequency, communication volume, and scalar performance. Data-parallel compiler technology can ease the programming burden and yield near hand-coded performance.
34 Data-parallel Related Work. Linear equations/set-based compilation [Pugh et al.; Ancourt et al.; Amarasinghe & Lam]. Commercial HPF compilers: xlhpf, pghpf, xhpf; HPF/JA reached 14 teraflops on a code for the Earth Simulator. Lots of research compiler efforts, e.g. Polaris, CAPTOOLS. None support partially-replicated computation, none support multipartitioning, and none achieve linear scaling on tightly-coupled codes.
35 Outline: motivation; compiler technology for HPC: data-parallel programming systems, semi-automatic synthesis of performance models; challenges for the future; other work.
36 Why Performance Modeling? Insight into applications: barriers to scalability, insight into optimizations. Mapping applications to systems: Grid resource selection and scheduling, intelligent run-time adaptation. Workload-based design of future systems.
37 Modeling Challenges. Performance depends on architecture-specific factors, application characteristics, and input data parameters. It is difficult to model execution time directly, and collecting data at scale is expensive.
38 Approach. Separate the contribution of application characteristics; measure the application-specific factors with static and dynamic analysis; construct scalable models; explore interactions with hardware. Use binary analysis and instrumentation for language and programming-model independence. [Marin & Mellor-Crummey, SIGMETRICS 2004]
39 Toolkit Design Overview. [Diagram: object code feeds a binary analyzer (static analysis: control flow graph, loop nesting structure, basic-block instruction mix) and a binary instrumenter; the instrumented code is executed (dynamic analysis: basic-block counts, communication volume and frequency, memory reuse distance); an architecture-neutral model plus a post-processing tool with a scheduler and an architecture description yield a performance prediction for the target architecture]
40 Building Scalable Models. Collect data from multiple runs: n+1 runs to compute a model of degree n. Approximation function over a set of basis functions B_i: F(X) = c_n*B_n(X) + c_(n-1)*B_(n-1)(X) + ... + c_0*B_0(X). Include constraints; the goal is to determine the coefficients, using quadratic programming.
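As a simplified stand-in for the constrained quadratic program, the sketch below fits a degree-2 monomial basis by plain least squares; the data are synthetic, shaped like the degree-2 example on the next slide, and all names are assumptions for illustration.

```python
# Sketch: fit F(X) = c2*X^2 + c1*X + c0 to per-run measurements, then
# extrapolate to an unmeasured problem size. The toolkit instead solves a
# constrained quadratic program (e.g. to keep predicted counts non-negative).
import numpy as np

sizes = np.array([5.0, 10.0, 20.0, 40.0])   # n+1 runs for a degree-n model
counts = 482 * sizes**2 + 964                # synthetic measurements

def basis_row(x):
    # monomial basis B2, B1, B0 evaluated at problem size x
    return np.array([x**2, x, 1.0])

A = np.vstack([basis_row(x) for x in sizes])
coef, *_ = np.linalg.lstsq(A, counts, rcond=None)
print(coef)                       # ~ [482, 0, 964]
print(basis_row(103.0) @ coef)    # predict a larger, unmeasured run
```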
41-44 Execution Frequency Modeling Example. [Plot: execution frequency vs. problem size; collected data fitted with models of increasing degree] Model degree 0: Y = 41416, Err = 131%. Model degree 1: Y = 16776*X - 42366, Err = 60.4%. Model degree 2: Y = 482*X*X + 964, Err = 0%.
45 Predict Schedule Latency for an Architecture. Input: basic-block and edge execution frequencies. Methodology: recover executed paths; translate SPARC instructions to generic RISC; instantiate the scheduler for the target architecture; construct a schedule for the executed paths; determine inefficiencies.
47 Memory Reuse Distance. MRD: the number of unique data blocks referenced since the target block was last accessed.

reference:     I1  I2  I3  I2  I3  I2  I3
memory block:  A   B   A   C   A   B   B

MRD per reference: I1: 1 cold miss. I2: 2 cold misses, distance 2. I3: distance 0, distance 1.
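A small sketch that reproduces the distances for the trace above (illustrative Python; production tools use a tree for O(n log n), this is a quadratic-time scan):

```python
# Sketch: computing memory reuse distance (LRU stack distance) for a trace.
from collections import OrderedDict

def reuse_distances(trace):
    stack = OrderedDict()            # ordered by last access, MRU last
    out = []
    for block in trace:
        if block in stack:
            keys = list(stack)
            # unique blocks touched since this block's last access
            out.append(len(keys) - 1 - keys.index(block))
            del stack[block]
        else:
            out.append(None)         # cold miss
        stack[block] = True          # re-insert as most recently used
    return out

trace = ["A", "B", "A", "C", "A", "B", "B"]
print(reuse_distances(trace))        # [None, None, 1, None, 1, 2, 0]
```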
48 Memory Reuse Distance. [Figure]
49 Modeling Memory Reuse Distance. More complex than execution frequency: cold misses, a histogram of reuse distances, and a number of bins that is not constant. The average reuse distance is misleading: 1 access with distance 10,000 plus 3 accesses with distance 0 average to 2,500; if the cache has 1024 blocks, the average predicts that every access misses, when in fact only one does.
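The histogram, not the average, predicts misses: for a fully associative LRU cache with B blocks, an access misses exactly when its reuse distance is at least B (or it is a cold miss). A minimal sketch for the example above:

```python
# Sketch: predicting misses from reuse distances for a fully associative
# LRU cache; None marks a cold miss.

def predicted_misses(distances, cache_blocks):
    return sum(1 for d in distances if d is None or d >= cache_blocks)

distances = [10_000, 0, 0, 0]        # the slide's example: average is 2,500
print(predicted_misses(distances, cache_blocks=1024))   # 1 miss, not 4
```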
50 Modeling Memory Reuse Distance. [Histogram: normalized frequency (50%, 30%, 20%) vs. reuse distance]
51 Modeling Memory Reuse Distance. [Figure]
52 Predict Number of Cache Misses: instantiate the model for a given problem size. [Figure]
53 Prediction: NAS BT 3.0 Memory Hierarchy Utilization. [Plot: miss count / cell / time step vs. mesh size; L1, L2 (x10), and TLB (x10) misses, measured vs. predicted]
54 Prediction: NAS BT 3.0 Time on SGI Origin. [Plot: cycles / cell / time step vs. mesh size, from SPARC measurements to SGI Origin predictions; measured time vs. predicted time decomposed into scheduler latency and L1, L2, and TLB miss penalties]
55 Open Performance Modeling Issues. Short term: better modeling of the memory subsystem (e.g., the number of outstanding loads, to accurately predict memory latency); explore modeling of irregular applications. Long term: model parallel applications; the present modeling applies between synchronization points; combine with manually constructed parallel models; semi-automatically recover parallel trends; understand dynamic parallelism.
56 Modeling Related Work. Reuse distance: cache utilization [Beyls & D'Hollander]; investigating optimizations [Ding et al.]. Program instrumentation: EEL, QPT [Ball, Larus, Schnarr]. Scalable analytic models [Vernon et al.; Hoisie et al.]. Cross-architecture models at scale [Snavely et al.; Cascaval et al.]. Simulation (trace-based and execution-driven). None yield semi-automatically derived scalable models.
57 HPC Compiler Challenges for the Future. Programming systems for large-scale machines: abstraction and greater expressiveness are needed; potential parallelism must be readily accessible (implicit parallelism or explicit element-wise parallelism); locality and latency tolerance are both critical for performance; dynamic self-scheduled parallelism will be necessary; failure will occur and must be expected and handled; support for self-tuning software for complex architectures. Compiler-based tools: debugging and performance analysis of large-scale software on dynamic systems is a major open problem. Insight into hardware design: understanding the impact of proposed designs on whole programs.
58 Past Work. Multiprocessor synchronization: locks and synchronous barriers [ASPLOS89, TOCS91]; reader-writer synchronization [PPOPP91]; fuzzy barriers [IJPP94]. Parallel debugging: execution replay [JPDC90, TOC87]; software instruction counter [ASPLOS89]; detecting data races [WPDD93, SC91, SC90]. Parallel programming environments: ParaScope [PIEEE93], Dsystem [TPDT94]. Parallel applications: molecular dynamics [JCC92].
59 Ongoing Work. Global address space parallel languages: Co-array Fortran [LCPC03]. Performance analysis [TJS02, LACSI01, ICS01, SIGMETRICS01]. Improving node performance: irregular mesh and particle codes [ICS99, IJPP00]; sparse matrices [LACSI02, IJHPCA04]; multigrid [ICS01]; dense matrices [LACSI03]. Grid computing [IJHPCA01]. Library-based domain languages [JPDC01].