Parallel Computing. Hwansoo Han (SKKU)


Unicore Limitations
- Performance scaling stopped due to:
  - Power consumption
  - Wire delay
  - DRAM latency
  - Limitation in ILP
[Figure: SPEC CINT2000 performance over time (log scale, 1 to 10000), showing the shift to 2 cores/chip with the Xeon 3.0 GHz and Core 2 Duo 2.66 GHz, and the quad-core Intel Nehalem processor. Source: S. Amarasinghe, MIT]

CELL Processor
- 1 + 8 cores
  - One 64-bit PowerPC core: 3.2 GHz, dual issue, two threads, 512KB L2 cache
  - 256 GFLOPS
  - Eight Synergistic Processors: SIMD streaming processors, 256KB local store each
[Figure: IBM Cell processor]

Cell Applications
- Cell-powered products
  - Game consoles: Sony PS3
  - HPC servers: IBM Cell Blade Server
  - HDTVs: Toshiba Cell Regza TV

GPGPU: Nvidia G80
- G80 (GeForce 8800 GTX): 128 stream processors, 518 GFLOPS (single-precision FP)
  - Double-precision FP becomes available on Fermi (512 cores)
- CUDA: extended ANSI C and programming APIs
[Figure: Nvidia G80 architecture]

CUDA Applications
- Ray tracing
- H.264 codec
- High-performance computing

Memory Architectures
- Centralized memory
  - Shared memory with uniform memory access time (UMA)
  - SMP (symmetric multiprocessors), multicores
- Distributed memory
  - Shared memory with non-uniform memory access time (NUMA)
    - CC-NUMA (HP Superdome, SGI Origin), distributed shared memory (DSM)
  - Message passing: separate address space per CPU
[Figures: processors with private caches sharing one memory over a network (centralized) vs. processors each with a local memory connected by a network (distributed)]

HW Cache Coherence
- Bus-based snooping protocol (multicores)
  - Broadcast data accesses to all processors
  - Processors snoop to check if they have a copy
  - Works well with a bus
- Directory-based protocol (CC-NUMA)
  - Usually implemented for distributed memory
  - Keeps track of the sharing of cache lines in a directory
  - Uses point-to-point requests to get directory information and shared data
  - Scales better than snooping

Performance by Amdahl's Law
- Speedup estimation: the parallel portion (P) is improved n times
  Speedup = 1 / ((1 - P) + P/n)
- Speedup limit: as n goes to infinity (the parallel portion is effectively eliminated),
  Speedup = 1 / (1 - P)
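
As a quick illustration, a small C sketch that evaluates the formula and its limit (the helper name and the chosen P are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup when the parallel fraction P is improved n times. */
static double amdahl_speedup(double P, double n) {
    return 1.0 / ((1.0 - P) + P / n);
}

int main(void) {
    double P = 0.8;                       /* 80% parallelizable */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  speedup = %.3f\n", n, amdahl_speedup(P, n));
    printf("limit (n -> inf) = %.3f\n", 1.0 / (1.0 - P));   /* = 5 for P = 0.8 */
    return 0;
}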

Speedup Example
- 80% of the sequential execution time is parallelizable (P = 0.8), ideally parallelized on a quad-core (n = 4)
  Speedup = 1 / ((1 - P) + P/n) = 1 / ((1 - 0.8) + 0.8/4) = 2.5
- Speedup limit (n goes to infinity)
  Speedup = 1 / (1 - P) = 1 / (1 - 0.8) = 5

Speedup Projection
- After some improvement, the dominant part is not dominant any more
[Figure: speedup (1 to 18) vs. n, the improvement factor (1 to 16), with one curve per parallel portion P = 100%, 99%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%]

Gustafson's Law
- Reevaluates Amdahl's Law, which assumes a fixed amount of work
- New assumptions: scale up the problem
  - Parallel computing is needed for larger data
  - The sequential portion stays relatively fixed as the problem size grows
- Speedup = (s + n*p) / (s + p) = s + n*(1 - s), where s + p = 1
- Speedup approaches n as s approaches 0: a sufficiently large problem can be efficiently parallelized
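
A small numeric sketch in C of the scaled-speedup formula; the 5% serial fraction is an assumed example value, not from the slides:

#include <stdio.h>

/* Gustafson's Law: scaled speedup with serial fraction s on n processors. */
static double gustafson_speedup(double s, double n) {
    return s + n * (1.0 - s);
}

int main(void) {
    double s = 0.05;                              /* 5% serial, example value */
    for (int n = 4; n <= 1024; n *= 4)
        printf("n = %4d  scaled speedup = %.2f\n", n, gustafson_speedup(s, n));
    /* As s approaches 0, the scaled speedup approaches n itself. */
    return 0;
}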

Superlinear Speedup
- The original sequential performance is not optimized
  - Thrashing in cache, memory, disk
- More cores mean more caches
  - Avoid thrashing by dividing the work to fit in the local caches
  - Often results in superlinear speedup
[Figure: one CPU whose working set overflows its L1 cache (lots of cache misses) vs. two CPUs whose halves of the working set each fit in an L1 cache]

Load Imbalance
- Is the workload exactly even?
  - Dynamic events (cache misses, branch mispredictions, ...)
  - The algorithm itself
- Remedies: master/worker, OpenMP schedule(dynamic) or schedule(guided), as sketched below
[Figure: four N/4 chunks of uneven duration vs. an improved, balanced assignment]
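
A minimal OpenMP sketch of dynamic scheduling for an irregular loop; the work() function and the chunk size of 16 are illustrative assumptions:

#include <omp.h>
#include <stdlib.h>

#define N 10000

/* Iterations take wildly different amounts of time. */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < (i % 97) * 1000; k++)
        x += (double)k / (k + 1);
    return x;
}

int main(void) {
    double *R = malloc(N * sizeof(double));

    /* schedule(dynamic, 16): idle threads grab the next chunk of 16
       iterations, so slow iterations do not stall the other threads. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++)
        R[i] = work(i);

    free(R);
    return 0;
}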

Parallelization: Loops
- Data parallel loops: perform the same operations on different data
- SPMD (single program, multiple data)
[Figure: for (i=0; i<N; i++) C[i] = A[i] + B[i] with N = 16; a fork creates threads, iterations i = 0..15 run in parallel, and a join (barrier) waits for all of them]
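
A minimal OpenMP rendering of this loop in C; the fork corresponds to entering the parallel region and the join to the implicit barrier at the end of the loop:

#include <omp.h>
#include <stdio.h>

#define N 16

int main(void) {
    int A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    /* Fork: the iterations are divided among the threads.
       Join: an implicit barrier at the end of the loop. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    for (int i = 0; i < N; i++)
        printf("%d ", C[i]);
    printf("\n");
    return 0;
}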

Parallelization: Loops (cont'd)
- Thread pool
  - Use a pool of threads instead of repeated create/join
  - Reduces thread creation/join overhead
[Figure: a pool of worker threads reused across loops; threads without work remain idle]

Reduction
- Reduction analysis
  - Associative operations
  - The intermediate result is never used elsewhere within the loop
- Parallel reduction (see the OpenMP sketch below)
  - Accumulate partial results per thread
  - Combine the partial results into one
[Figure: for i = 1 to N: sum = sum + A[i]; the elements of A[] are accumulated into per-thread tmp_sum[] entries, which are then combined into sum]
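
A minimal OpenMP sketch of the parallel reduction described above; the reduction(+:sum) clause gives each thread a private partial sum and combines the partial sums when the loop ends:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double A[N];
    for (int i = 0; i < N; i++) A[i] = 1.0;

    double sum = 0.0;
    /* Each thread accumulates a private partial sum; the partial sums
       are combined into sum when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += A[i];

    printf("sum = %.1f\n", sum);   /* 1000000.0 */
    return 0;
}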

SW Pipelining
- Can handle loops with loop-carried dependences
- Initiation interval: a regular pattern repeated across multiple iterations
[Figure: stages A, B, C, D of successive loop iterations overlapped in time, one new iteration starting every initiation interval]
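
A hand-written sketch of the idea in C, assuming a loop whose body splits into three stages (stageA, stageB, stageC are hypothetical names): after a short prologue, each steady-state step holds one stage from each of three consecutive iterations, which is the pattern a software-pipelining compiler overlaps:

#include <stdio.h>
#define N 16

/* Hypothetical pipeline stages (illustrative, not from the slides). */
static int  stageA(int i)                   { return i * 2; }   /* e.g., load    */
static int  stageB(int a)                   { return a + 1; }   /* e.g., compute */
static void stageC(int i, int b, int *out)  { out[i] = b; }     /* e.g., store   */

int main(void) {
    int out[N];

    /* Prologue: fill the pipeline. */
    int a0 = stageA(0);          /* iteration 0 has finished stage A */
    int b  = stageB(a0);         /* iteration 0 has finished stage B */
    int a  = stageA(1);          /* iteration 1 has finished stage A */

    /* Steady state: stage C of iteration i, stage B of i+1, and stage A
       of i+2 are independent and could run concurrently each step. */
    for (int i = 0; i + 2 < N; i++) {
        stageC(i, b, out);
        b = stageB(a);
        a = stageA(i + 2);
    }

    /* Epilogue: drain the pipeline. */
    stageC(N - 2, b, out);
    b = stageB(a);
    stageC(N - 1, b, out);

    for (int i = 0; i < N; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}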

Programming Model
- Shared memory
  - Thread libraries: POSIX threads, Java threads
  - OpenMP: directives for compilers
  - Suited to a small number of cores (< 10); not scalable to manycores
- Distributed memory
  - MPI (message passing interface): communication API among multiple processes
  - Proprietary APIs: Cell B/E APIs, ...

Implementation on Shared Memory
- Thread library
  - Library calls; low-level programming
  - Explicit thread creation and work assignment
  - Explicit handling of synchronization
  - Parallelism expression
    - Task: create/join threads
    - Data: detailed programming
  - Design the concurrent version from the start
- OpenMP
  - Compiler directives; higher abstraction
  - Compilers convert the code to use the OpenMP library, which is itself implemented with thread APIs
  - Parallelism expression
    - Task: task/taskwait, parallel sections (see the sketch below)
    - Data: parallel for
  - Incremental development: start with the sequential version and insert the necessary directives
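
A minimal sketch of the task/taskwait style of parallelism expression (the two work functions are hypothetical placeholders):

#include <omp.h>
#include <stdio.h>

/* Hypothetical independent pieces of work. */
static int left_work(void)  { return 1; }
static int right_work(void) { return 2; }

int main(void) {
    int a = 0, b = 0;

    #pragma omp parallel
    #pragma omp single          /* one thread creates the tasks */
    {
        #pragma omp task shared(a)
        a = left_work();

        #pragma omp task shared(b)
        b = right_work();

        #pragma omp taskwait    /* wait for both tasks to finish */
        printf("a + b = %d\n", a + b);
    }
    return 0;
}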

Implementation Examples
- Threaded functions: exploit data parallelism

  node A[N], B[N];

  main() {
      for (i = 0; i < nproc; i++)
          thread_create(par_distance);
      for (i = 0; i < nproc; i++)
          thread_join();
  }

  void par_distance() {
      tid = thread_id();
      n = ceiling(N / nproc);
      s = tid * n;
      e = MIN((tid + 1) * n, N);
      for (i = s; i < e; i++)
          for (j = 0; j < N; j++)
              C[i][j] = distance(A[i], B[j]);
  }

- Parallel loops: exploit data parallelism

  node A[N], B[N];

  #pragma omp parallel for
  for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
          C[i][j] = distance(A[i], B[j]);
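
For comparison, a self-contained POSIX-threads version of the threaded-function sketch might look as follows; the node type, distance() function, N, and NPROC are assumed definitions for illustration:

/* Compile with: cc -pthread example.c -lm */
#include <pthread.h>
#include <math.h>

#define N      1024
#define NPROC  4

typedef struct { double x, y; } node;        /* assumed node type */

node   A[N], B[N];
double C[N][N];

static double distance(node a, node b) {     /* assumed distance function */
    return sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

static void *par_distance(void *arg) {
    long tid   = (long)arg;
    long chunk = (N + NPROC - 1) / NPROC;    /* ceiling(N / NPROC) */
    long s = tid * chunk;
    long e = (tid + 1) * chunk < N ? (tid + 1) * chunk : N;
    for (long i = s; i < e; i++)
        for (long j = 0; j < N; j++)
            C[i][j] = distance(A[i], B[j]);
    return NULL;
}

int main(void) {
    pthread_t threads[NPROC];
    for (long t = 0; t < NPROC; t++)
        pthread_create(&threads[t], NULL, par_distance, (void *)t);
    for (long t = 0; t < NPROC; t++)
        pthread_join(threads[t], NULL);
    return 0;
}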

Implementation on Dist. Memory
- MPI (message passing interface)
  - Language-independent communication library
  - Freely available implementations: MPICH (Argonne National Laboratory), Open MPI

  /* processor 0 */
  node A[N], B[N];
  Dist_calc0() {
      Send(A[N/2 .. N-1], B[0 .. N-1]);
      for (i = 0; i < N/2; i++)
          for (j = 0; j < N; j++)
              C[i][j] = distance(A[i], B[j]);
      Recv(C[N/2 .. N-1][0 .. N-1]);
  }

  /* processor 1 */
  node A[N], B[N];    /* duplicate copies */
  Dist_calc1() {
      Recv(A[N/2 .. N-1], B[0 .. N-1]);
      for (i = N/2; i < N; i++)
          for (j = 0; j < N; j++)
              C[i][j] = distance(A[i], B[j]);
      Send(C[N/2 .. N-1][0 .. N-1]);
  }
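
A sketch of the same two-process split written with actual MPI calls (MPI_Send/MPI_Recv); the node type, distance() function, and message tags are assumptions for illustration, and the program is meant to run with two ranks (e.g., mpirun -np 2):

#include <mpi.h>
#include <math.h>

#define N 1024
typedef struct { double x, y; } node;        /* assumed node type */

node   A[N], B[N];
double C[N][N];

static double distance(node a, node b) {     /* assumed distance function */
    return sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Processor 0: send the upper half of A and all of B to rank 1. */
        MPI_Send(&A[N/2], (N/2) * sizeof(node), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(B, N * sizeof(node), MPI_BYTE, 1, 1, MPI_COMM_WORLD);
        for (int i = 0; i < N/2; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = distance(A[i], B[j]);
        /* Collect rank 1's half of the result matrix. */
        MPI_Recv(&C[N/2][0], (N/2) * N, MPI_DOUBLE, 1, 2,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&A[N/2], (N/2) * sizeof(node), MPI_BYTE, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(B, N * sizeof(node), MPI_BYTE, 0, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = N/2; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = distance(A[i], B[j]);
        MPI_Send(&C[N/2][0], (N/2) * N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}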

Summary
- Parallel architectures
  - Multicores, CC-NUMA, explicitly managed memories (Cell, GPGPU)
- Parallelism in applications
  - Amdahl's law, Gustafson's law
  - Superlinear speedup, load balance
- Parallelism in loops
  - Parallel loops, reduction, SW pipelining
- Programming models
  - Pthreads, OpenMP for shared memory
  - MPI, proprietary APIs for distributed memory