Taming High Performance Computing with Compiler Technology


1 Taming High Performance Computing with Compiler Technology
John Mellor-Crummey
Department of Computer Science
Center for High Performance Software Research

2 High Performance Computing Applications
- Scientific inquiry ranging from elementary particles to cosmology
- Pollution modeling and remediation planning
- Storm forecasting and climate prediction
- Advanced vehicle design
- Computational chemistry and drug design
- Molecular nanotechnology
- Cryptology
- Nuclear weapons stewardship

3 High Performance Applications
- Algorithms, architectures, and data structures
- Effective parallelization determines scalability
- Single-processor performance can differ by integer factors

4 Status of Highly-parallel Systems
"[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications."
Information Technology Research: Investing in our Future
PITAC Report to the President, 1999

5 Challenges for Highly Parallel Computing
- Effective algorithms for complex problems
- Programming models and compilers
- Application development tools
- Operating systems for large-scale machines
- Design of better high-performance architectures

6 Current Research Themes
- Compiler support for data-parallel programming
  - implicitly and explicitly parallel global address space languages
- Technology for auto-tuning software
  - automatically tailor code to a microprocessor architecture
- Performance analysis tools
  - understanding application behavior on current systems
- Performance modeling
  - how will applications perform at different scales and on future systems?
- Compiler technology for scientific scripting languages
  - R language for statistical programming

7 Outline
- Motivation
- Compiler technology for HPC
  - compiling data-parallel languages
  - semi-automatic synthesis of performance models
- Challenges for the future
- Other work

8 Compiling Data-parallel Languages
- Introduction
  - data parallelism
  - compiling HPF-like languages
- Rice dhpf compiler
  - data partitioning research
  - analysis and code generation
  - experimental results

9 Data Parallelism
Apply the same operation to many data elements
- need not be synchronous
- need not be completely uniform
Applicable to many problems in science and engineering

10 Data Parallel Programming Alternatives
- Hand-coded parallelizations using library-based models
  - complete applicability
  - difficult to design and implement
  - all responsibility for tuning falls to the developer
- Application frameworks
  - easy to use
  - limited applicability
- Single-threaded data-parallel languages
  - much more flexible than application frameworks
  - much simpler to use than hand-coded parallelizations
  - the compiler significantly determines performance
    - offloads details of tuning from the developer
    - compilers are enormously complex
    - out of luck if the compiler doesn't deliver performance

11 Data Parallel Compilation
High Performance Fortran: partitioning of data drives partitioning of computation, communication, and synchronization
- Fortran program + data partitioning
- Partition computation
- Insert communication
- Manage storage
- Same answers as the sequential program
[Figure: HPF Program -> Compilation -> Parallel Machine]

12 Example HPF Program

    CHPF$ processors P(3,3)
    CHPF$ distribute A(block, block) onto P
    CHPF$ distribute B(block, block) onto P
          DO i = 2, n - 1
            DO j = 2, n - 1
              A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))

[Figure: 3x3 processor grid P(0,0)..P(2,2); data for A and B with a (BLOCK,BLOCK) distribution]

13 Compiling HPF-like Languages
- Partition data
- Select a mapping of computation to processors
- Analyze communication requirements
- Partition computation by reducing loop bounds (see the sketch below)
- Insert communication
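
To make the bound-reduction step concrete, here is a minimal Python sketch of owner-computes loop-bound reduction for a 1D BLOCK distribution; the function name, the fixed block size, and 1-based indexing are illustrative assumptions, not dHPF's actual implementation.

    def local_bounds(lo, hi, pid, block):
        """Reduced loop bounds under the owner-computes rule for a BLOCK
        distribution: processor pid owns indices pid*block+1 .. (pid+1)*block,
        so it executes only the iterations of lo..hi that fall in that range."""
        return max(lo, pid * block + 1), min(hi, (pid + 1) * block)

    # e.g. DO i = 2, n-1 with n = 61 and a block size of 20 on 3 processors
    for pid in range(3):
        print(pid, local_bounds(2, 60, pid, 20))
    # -> (2, 20), (21, 40), (41, 60)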

14 The Devil is in the Details
- Good data and computation partitionings are a must
  - without good partitionings, parallelism suffers!
- Excess communication undermines scalability
  - both frequency and volume must be right!
- Single-processor efficiency is critical
  - must use caches effectively
  - node code must be amenable to optimization
Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations

15 Rice dhpf Compiler
Achievements
- parallelize sequential codes with minimal rewriting
- near hand-coded performance for tightly coupled codes
Innovations
- sophisticated data partitionings
- abstract set-based framework for communication analysis and code generation
- sophisticated computation partitionings
  - partial replication to reduce communication
- comprehensive optimizations

16 Data Partitioning
- Good parallel performance requires a suitable partitioning
- Tightly-coupled computations are problematic
- Line-sweep computations, e.g., ADI integration:

      do j = 1, n
        do i = 2, n
          a(i,j) = a(i-1,j)

- recurrences make parallelization difficult with BLOCK partitionings

17 Coarse-Grain Pipelining
- Compute along partitioned dimensions
- Partial serialization induces wavefront parallelism with block partitioning
[Figure: wavefront of blocks advancing across Processors 0-3]

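The wavefront behavior can be illustrated with a small scheduling sketch (a toy model, not dHPF output): under coarse-grain pipelining, processor p can start strip b of the sweep only after processor p-1 finishes that strip, so block (p, b) executes at step p + b.

    def wavefront_schedule(num_procs, num_strips):
        """Step at which processor p computes strip b of a pipelined line sweep,
        assuming unit time per block and the dependence block(p, b) -> block(p+1, b)."""
        return {(p, b): p + b for p in range(num_procs) for b in range(num_strips)}

    sched = wavefront_schedule(num_procs=4, num_strips=8)
    steps = max(sched.values()) + 1   # = num_procs + num_strips - 1 = 11
    print(steps)                      # pipeline fill/drain cost vs. 8 steps with no dependence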

19 Parallelizing Line Sweeps
[Figure: compiler-generated coarse-grain pipelining vs. hand-coded multipartitioning]

20 Diagonal Multipartitioning
- Each processor owns 1 tile between each pair of cuts along each distributed dimension
- Enables full parallelism for a sweep along any partitioned dimension
[Figure: diagonal assignment of tiles to Processors 0-3]

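A minimal sketch of the diagonal tile assignment in 2D, assuming p processors and a p x p grid of tiles; the (j - i) mod p rule is the standard diagonal mapping, written here for illustration rather than taken from dHPF source.

    def owner(i, j, p):
        """Diagonal multipartitioning in 2D: tile (i, j) of a p x p tile grid is
        owned by processor (j - i) mod p, so every row and every column of tiles
        contains exactly one tile per processor."""
        return (j - i) % p

    p = 4
    for i in range(p):
        print([owner(i, j, p) for j in range(p)])
    # Each processor appears once in every row and once in every column, so a
    # sweep along either dimension keeps all processors busy.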

22 Generalized Multipartitioning
Given an n-dimensional data domain and p processors, select
- which λ dimensions to partition, 2 <= λ <= n
- how many cuts in each
Partitioning constraints
- the number of tiles in each (λ-1)-dimensional hyperplane is a multiple of p
- no more cuts than necessary
Objective function: minimize communication volume
- pick the configuration of cuts that minimizes the total cross-section
Mapping constraints
- load balance: in a hyperplane, each processor has the same number of tiles
- neighbor: in any particular direction, the neighbor of a given processor is the same
IPDPS 2002 Best Paper in Algorithms; JPDC 2003

23 Choosing the Best Partitioning
- Enumerate all elementary partitionings
  - candidates depend on the factorization of p
- Evaluate their communication cost
- Select the minimum-cost partitioning
Complexity (worst case: p is a product of unique prime factors)
- choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor: d(d-1)/2
- possible unique prime factors of p: (1 + o(1)) log p / log log p
Very fast in practice.
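
The search can be illustrated with a brute-force sketch in Python (a hypothetical helper, not the published algorithm: it enumerates per-dimension tile counts directly instead of working from the prime factorization of p, skips the "no more cuts than necessary" rule, and uses total cut surface area as a stand-in for the communication-volume objective).

    from itertools import product
    from math import prod

    def best_multipartitioning(sizes, p):
        """Enumerate per-dimension tile counts c_i (c_i = 1 means the dimension is
        not partitioned), keep those where every hyperplane orthogonal to a
        partitioned dimension holds a multiple of p tiles, and return the choice
        with the smallest total internal cut area (a proxy for communication)."""
        n = len(sizes)
        best = None
        for cuts in product(range(1, p + 1), repeat=n):
            part = [j for j in range(n) if cuts[j] > 1]
            if len(part) < 2:
                continue
            if any(prod(cuts[i] for i in part if i != j) % p != 0 for j in part):
                continue
            area = sum((cuts[j] - 1) * prod(sizes[i] for i in range(n) if i != j)
                       for j in range(n))
            if best is None or area < best[0]:
                best = (area, cuts)
        return best

    print(best_multipartitioning((64, 64, 64), p=4))   # -> (12288, (2, 2, 2))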

24 Mapping Tiles with Modular Mappings
[Figure: a basic tile shape replicated by modular shifts; an integral number of shapes covers the domain]

25 Formal Compilation Framework
Three types of sets: data, iterations, processors
Three types of mappings
- Layout: data <-> processors
- Reference: iterations <-> data
- CompPart: iterations <-> processors
Representation: integer tuples with Presburger arithmetic for constraints
Analysis: use set equations to compute the set(s) of interest
- iterations allocated to a processor
- communication sets
Code generation: synthesize loops from the set(s), e.g.
- parallel (SPMD) loop nests
- message packing and unpacking
[Adve & Mellor-Crummey, PLDI98]

26 Why Symbolic Sets?

          processors P(3,3)
          distribute A(block, block) onto P
          distribute B(block, block) onto P
          DO i = 2, n - 1
            DO j = 2, n - 1
              A(i, j) = .25 * (B(i-1, j) + B(i+1, j) + B(i, j-1) + B(i, j+1))

Local section for P(x,y) (and iterations executed):
  { [i, j] : 20x + 2 <= i <= 20x + 19 & 30y + 2 <= j <= 30y + 29 }
[Figure: 3x3 processor grid P(0,0)..P(2,2) showing the data/loop partitioning, the non-local data accessed by P(x,y), and the iterations that access non-local data]

27 Integer-Set Framework: Example

          real A(100)
          distribute A(BLOCK) on P(4)
          do i = 1, N
            ... = A(i-1) + A(i-2) + ...   ! ON_HOME A(i-1)
          enddo

symbolic N
Layout       := { [pid] -> [i] : 25*pid + 1 <= i <= 25*pid + 25 }
Loop         := { [i] : 1 <= i <= N }
CPSubscript  := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }
CompPart     := (Layout o CPSubscript^-1) intersect Loop
DataAccessed := CompPart o RefSubscript
NonLocalDataAccessed := DataAccessed - Layout
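
Since all of the sets in this example are affine, they can be enumerated directly for a concrete problem size; a minimal Python sketch, assuming N = 100 and 1-based indexing purely for illustration:

    N, P, BLOCK = 100, 4, 25

    # Layout: processor pid owns A(25*pid+1) .. A(25*pid+25)   (1-based, BLOCK)
    layout = {pid: set(range(BLOCK * pid + 1, BLOCK * pid + BLOCK + 1))
              for pid in range(P)}
    loop = set(range(1, N + 1))                  # Loop := { [i] : 1 <= i <= N }

    for pid in range(P):
        # CompPart := (Layout o CPSubscript^-1) intersect Loop, with ON_HOME A(i-1)
        comp_part = {i for i in loop if (i - 1) in layout[pid]}
        # DataAccessed := CompPart o RefSubscript, for the reference A(i-2)
        accessed = {i - 2 for i in comp_part}
        # NonLocalDataAccessed := DataAccessed - Layout
        non_local = accessed - layout[pid]
        print(pid, (min(comp_part), max(comp_part)), sorted(non_local))

For each interior processor the non-local set is the single boundary element owned by its left neighbor, which is exactly the communication the compiler must generate; the out-of-range A(0) reported for processor 0 is an artifact of the schematic loop bounds on the slide.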

28 Optimizations using Integer Sets
- Partially replicate computation to reduce communication
  - 66% lower message volume, 38% faster: NAS, 64 procs
- Coalesce communication sets for multiple references
  - 41% lower message volume, 35% faster: NAS, 64 procs
- Split loops into local-only and off-processor loops
  - 10% fewer D-cache misses, 9% faster: NAS procs
- Processor-set constraints on communication sets
  - 12% fewer I-cache misses, 7% faster: NAS, 64 procs
PACT 2002 Best Student Paper (with Daniel Chavarria-Miranda)

29 Experimental Evaluation
- NAS SP & BT benchmarks from NASA Ames
  - use ADI to solve the Navier-Stokes equation in 3D
  - forward & backward line sweeps on each dimension, each time step
- Compare four variants
  - MPI hand-coded multipartitioning (NASA)
  - dhpf: multipartitioned
  - dhpf: 2D partitioning, coarse-grain pipelining
  - PGI's pghpf: 1D partitioning with transpose
- Platform
  - SGI Origin 2000: MHz procs.
  - SGI compilers + SGI MPI

30 Efficiency for NAS SP (102^3, class B size)
[Figure: parallel efficiency of the four variants; annotations: "similar comm. volume, more serialization"; "> 2x multipartitioning comm. volume"]

31 Efficiency for NAS BT (102^3, class B size)
[Figure: parallel efficiency of the four variants; annotation: "> 2x multipartitioning comm. volume"]
Platform: SGI Origin 2000

32 NAS BT Parallelizations
[Figure: execution traces for NAS BT class 'A' on 16 processors of an SGI Origin, comparing hand-coded 3D multipartitioning with compiler-generated 3D multipartitioning]

33 Observations
- High performance requires perfection
  - parallelism and load balance
  - communication frequency
  - communication volume
  - scalar performance
- Data-parallel compiler technology can
  - ease the programming burden
  - yield near hand-coded performance

34 Data-parallel Related Work
- Linear equations / set-based compilation [Pugh et al.; Ancourt et al.; Amarasinghe & Lam]
- Commercial HPF compilers: xlhpf, pghpf, xhpf
  - HPF/JA: 14 Teraflops on a code for the Earth Simulator
- Lots of research compiler efforts, e.g., Polaris, CAPTOOLS
- None support partially-replicated computation
- None support multipartitioning
- None achieve linear scaling on tightly-coupled codes

35 Outline
- Motivation
- Compiler technology for HPC
  - data-parallel programming systems
  - semi-automatic synthesis of performance models
- Challenges for the future
- Other work

36 Why Performance Modeling?
- Insight into applications
  - barriers to scalability
  - insight into optimizations
- Mapping applications to systems
  - Grid resource selection & scheduling
  - intelligent run-time adaptation
- Workload-based design of future systems

37 Modeling Challenges
- Performance depends on
  - architecture-specific factors
  - application characteristics
  - input data parameters
- Difficult to model execution time directly
- Collecting data at scale is expensive

38 Approach
- Separate out the contribution of application characteristics
- Measure the application-specific factors
  - static analysis
  - dynamic analysis
- Construct scalable models
- Explore interactions with hardware
- Use binary analysis and instrumentation for language and programming-model independence
[Marin & Mellor-Crummey, SIGMETRICS 04]

39 Toolkit Design Overview
[Diagram: object code is processed in two ways. A binary analyzer performs static analysis, recovering the control flow graph, loop nesting structure, and basic-block instruction mix. A binary instrumenter produces instrumented code that is executed for dynamic analysis, collecting basic-block counts, communication volume & frequency, and memory reuse distance. A post-processing tool combines these into an architecture-neutral model; together with a scheduler and an architecture description, post-processing yields a performance prediction for the target architecture.]

40 Building Scalable Models
- Collect data from multiple runs
  - n+1 runs to compute a model of degree n
- Approximation function over a set of basis functions B_i:
  F(X) = c_n*B_n(X) + c_{n-1}*B_{n-1}(X) + ... + c_0*B_0(X)
- Include constraints
- Goal: determine the coefficients
  - use quadratic programming
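
As a rough illustration of the fitting step, the sketch below recovers coefficients for a polynomial basis with ordinary least squares in NumPy; the real toolkit solves a constrained quadratic program and its basis functions need not be polynomials, so this is only a sketch of the idea.

    import numpy as np

    # Illustrative basis B_0(X)=1, B_1(X)=X, B_2(X)=X^2 (an assumption, not the
    # toolkit's actual basis).
    basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x * x]

    def fit_model(X, counts):
        """Fit F(X) = c_0*B_0(X) + ... + c_n*B_n(X) to counts measured in a few
        runs; needs at least n+1 runs for a degree-n model."""
        A = np.column_stack([b(X) for b in basis])
        coeffs, *_ = np.linalg.lstsq(A, counts, rcond=None)
        return coeffs

    X = np.array([10.0, 20.0, 30.0, 40.0])
    counts = 482 * X * X + 964        # synthetic data shaped like the slide example
    print(fit_model(X, counts))       # ~ [964, 0, 482]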

41-44 Execution Frequency Modeling Example
[Figure: basic-block execution frequency vs. problem size X; collected data points overlaid with fitted models of increasing degree]
- Model degree 0: Y = 41416, Err = 131%
- Model degree 1: Y = 16776*X - 42366, Err = 60.4%
- Model degree 2: Y = 482*X*X + 964, Err = 0%

45 Predict Schedule Latency for an Architecture
Input: basic-block and edge execution frequency
Methodology
- recover executed paths
- translate SPARC instructions to generic RISC
- instantiate the scheduler for the target architecture
- construct a schedule for the executed paths
- determine inefficiencies

46 Toolkit Design Overview (revisited)
[Diagram: the same toolkit pipeline as on slide 39]

47 Memory Reuse Distance
MRD: the number of unique data blocks referenced since the target block was last accessed

    reference:     I1  I2  I3  I2  I3  I2  I3
    memory block:  A   B   A   C   A   B   B

- I1: 1 cold miss
- I2: 2 cold misses, distance 2
- I3: distance 0, distance 1
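
A minimal sketch of how reuse distance can be computed from a block trace (an O(n*m) illustration; production tools use more efficient tree-based algorithms):

    from collections import OrderedDict

    def reuse_distances(trace):
        """Reuse distance of each access in a block trace; None marks a cold miss."""
        recency = OrderedDict()        # blocks from least- to most-recently used
        out = []
        for block in trace:
            if block in recency:
                keys = list(recency.keys())
                # distinct blocks touched since this block's previous access
                out.append(len(keys) - 1 - keys.index(block))
                recency.move_to_end(block)
            else:
                out.append(None)       # first touch: cold miss
                recency[block] = True
        return out

    # The trace from the slide: I1->A, I2->B, I3->A, I2->C, I3->A, I2->B, I3->B
    print(reuse_distances(["A", "B", "A", "C", "A", "B", "B"]))
    # -> [None, None, 1, None, 1, 2, 0]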

48 Memory reuse distance
[Figure]

49 Modeling Memory Reuse Distance
- More complex than execution frequency
  - cold misses
  - a histogram of reuse distances
  - the number of bins is not constant
- The average reuse distance is misleading
  - 1 access with distance 10,000 plus 3 accesses with distance 0 gives an average of 2,500
  - against a cache of 1,024 blocks, the average predicts that every access misses, yet only one actually does

50 Modeling Memory Reuse Distance
[Figure: reuse-distance histogram with normalized frequencies of 50%, 30%, and 20%]

51 Modeling Memory Reuse Distance
[Figure]

52 Predict Number of Cache Misses
- Instantiate the model for the problem size
[Figure: predicted cache misses; labels: %, 74%]
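
Once the reuse-distance histogram has been modeled for a given problem size, predicting misses for a fully associative LRU cache is a threshold test; a minimal sketch (the function name and histogram format are illustrative assumptions):

    def predicted_misses(histogram, cache_blocks, cold_misses=0):
        """Predict misses for a fully associative LRU cache with `cache_blocks`
        blocks: an access misses iff its reuse distance is >= the number of
        cache blocks, plus all cold misses."""
        capacity_misses = sum(count for dist, count in histogram.items()
                              if dist >= cache_blocks)
        return cold_misses + capacity_misses

    # The "misleading average" example from slide 49: 3 accesses at distance 0
    # and 1 at distance 10,000 against a 1,024-block cache -> 1 miss, even
    # though the average distance (2,500) exceeds the cache size.
    print(predicted_misses({0: 3, 10_000: 1}, cache_blocks=1024))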

53 Prediction: NAS BT 3.0 Memory Hierarchy Utilization
[Figure: miss count per cell per time step vs. mesh size; measured and predicted curves for L1, L2 (x10), and TLB (x10)]

54 Prediction: NAS BT 3.0 Time on SGI Origin
[Figure: cycles per cell per time step vs. mesh size for NAS BT 3.0, predicted from SPARC measurements for the SGI Origin; predicted time is decomposed into scheduler latency and L1, L2, and TLB miss penalties, and compared with measured time]

55 Open Performance Modeling Issues
Short term
- better modeling of the memory subsystem
  - model the number of outstanding loads to accurately predict memory latency
- explore modeling of irregular applications
Long term
- model parallel applications
  - present modeling applies between synchronization points
  - combine with manually constructed parallel models
  - semi-automatically recover parallel trends
  - understand dynamic parallelism

56 Modeling Related Work
- Reuse distance
  - cache utilization [Beyls & D'Hollander]
  - investigating optimizations [Ding et al.]
- Program instrumentation: EEL, QPT [Ball, Larus, Schnarr]
- Scalable analytic models [Vernon et al.; Hoisie et al.]
- Cross-architecture models at scale [Snavely et al.; Cascaval et al.]
- Simulation (trace-based and execution-driven)
None yield semi-automatically derived scalable models.

57 HPC Compiler Challenges for the Future
- Programming systems for large-scale machines
  - abstraction and greater expressiveness are needed
  - potential parallelism must be readily accessible: implicit parallelism or explicit element-wise parallelism
  - locality and latency tolerance are both critical for performance
  - dynamic, self-scheduled parallelism will be necessary
  - failure will occur and must be expected and handled
  - support for self-tuning software for complex architectures
- Compiler-based tools
  - debugging and performance analysis of large-scale software on dynamic systems is a major open problem
  - insight into hardware design: understanding the impact of proposed designs on whole programs

58 Past Work
- Multiprocessor synchronization
  - locks, synchronous barriers [ASPLOS89, TOCS91]
  - reader-writer synchronization [PPOPP91]
  - fuzzy barriers [IJPP94]
- Parallel debugging
  - execution replay [JPDC90, TOC87]
  - software instruction counter [ASPLOS89]
  - detecting data races [WPDD93, SC91, SC90]
- Parallel programming environments
  - ParaScope [PIEEE 93], Dsystem [TPDT94]
- Parallel applications
  - molecular dynamics [JCC92]

59 Ongoing Work
- Global address space parallel languages
  - Co-array Fortran [LCPC03]
- Performance analysis [TJS02, LACSI01, ICS01, SIGMETRICS01]
- Improving node performance
  - irregular mesh and particle codes [ICS99, IJPP00]
  - sparse matrices [LACSI02, IJHPCA04]
  - multigrid [ICS01]
  - dense matrices [LACSI03]
- Grid computing [IJHPCA01]
- Library-based domain languages [JPDC01]
