Parallel Programming Models & Their Corresponding HW/SW Implementations

1 Lecture 3: Parallel Programming Models & Their Corresponding HW/SW Implementations Parallel Computer Architecture and Programming

2 Today's theme is a critical idea in this course: abstraction vs. implementation. Conflating abstraction with implementation is a common cause of confusion in this course.

3 An example: Programming with ISPC

4 ISPC: the Intel SPMD Program Compiler. SPMD: single *program*, multiple data.

5 Recall: example program from last class. Compute sin(x) using its Taylor expansion, sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ..., for each element of an array of N floating-point numbers.

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

6 sin(x) in ISPC. Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

SPMD programming abstraction: a call to an ISPC function spawns a "gang" of ISPC program instances. All instances run the ISPC code in parallel. Upon return, all instances have completed.

ISPC code: sinx.ispc

export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount == 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;  // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[idx] = value;
    }
}

7 sin(x) in ISPC. Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

Execution timeline (figure): sequential execution (C code); the call to sinx() begins executing programCount instances of sinx() (ISPC code); sinx() returns once all ISPC program instances have completed; sequential execution (C code) resumes.

SPMD programming abstraction: a call to an ISPC function spawns a gang of ISPC program instances. All instances run ISPC code in parallel. Upon return, all instances have completed.

The ISPC compiler generates a SIMD implementation:
- The number of instances in a gang is the SIMD width of the hardware (or a small multiple of it)
- The ISPC compiler generates a binary (.o) containing SIMD instructions
- C++ code links against the object file as usual
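For concreteness, here is a minimal sketch of what the C++ side of this arrangement might look like. This is not the actual ispc-generated header (whose exact contents depend on the compiler version); it only illustrates the idea that the compiler emits an ordinary declaration for each exported function, and main.cpp calls it like any other externally linked function.

#include <cstdio>

// Hypothetical sketch of the declaration the ISPC compiler places in sinx_ispc.h
namespace ispc {
    extern "C" void sinx(int N, int terms, float* x, float* result);
}

int main() {
    const int N = 1024;
    float* x = new float[N];
    float* result = new float[N];
    for (int i = 0; i < N; i++) x[i] = 0.5f;   // initialize x
    ispc::sinx(N, 5, x, result);               // the gang of ISPC instances runs here
    printf("result[0] = %f\n", result[0]);
    delete[] x;
    delete[] result;
    return 0;
}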

8 sin(x) in ISPC: interleaved assignment of elements to instances.

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC keywords:
- programCount: number of simultaneously executing instances in the gang (a uniform value)
- programIndex: id of the current instance in the gang (a non-uniform, "varying" value)
- uniform: a type modifier. All instances have the same value for this variable. Its use is purely an optimization; it is not needed for correctness.

ISPC code: sinx.ispc

export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assumes N % programCount == 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;  // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[idx] = value;
    }
}

9 sin(x) in ISPC: blocked assignment of elements to instances.

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code: sinx.ispc

export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount == 0
    uniform int count = N / programCount;
    int start = programIndex * count;
    for (uniform int i=0; i<count; i++)
    {
        int idx = start + i;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;  // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[idx] = value;
    }
}
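To make the difference between the two assignments concrete, the following plain C++ sketch (illustrative, not ISPC; N and programCount are small made-up values) prints which array indices each program instance touches under each scheme.

#include <cstdio>

// Which indices does each instance process? Interleaved vs. blocked assignment,
// assuming N = 16 and a gang of programCount = 4 instances.
int main() {
    const int N = 16, programCount = 4;
    for (int programIndex = 0; programIndex < programCount; programIndex++) {
        printf("instance %d  interleaved:", programIndex);
        for (int i = 0; i < N; i += programCount)
            printf(" %2d", i + programIndex);     // 0,4,8,12 / 1,5,9,13 / ...
        int count = N / programCount;
        int start = programIndex * count;
        printf("   blocked:");
        for (int i = 0; i < count; i++)
            printf(" %2d", start + i);            // 0..3 / 4..7 / ...
        printf("\n");
    }
    return 0;
}

With the interleaved assignment, the gang's accesses in a single iteration are contiguous in memory and map onto one packed SIMD load; with the blocked assignment, the instances touch addresses far apart in the same iteration, which requires a more expensive gather.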

10 sin(x) in ISPC. Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

foreach: key ISPC language construct
- Declares parallel loop iterations
- ISPC assigns iterations to program instances in the gang
- (The ISPC implementation performs a static assignment, but nothing in the abstraction prevents a dynamic assignment)

ISPC code: sinx.ispc

export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    foreach (i = 0 ... N)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6;  // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

11 ISPC: abstraction vs. implementation

Single program, multiple data (SPMD) programming model
- This is the programming abstraction
- The program is written in terms of this abstraction

Single instruction, multiple data (SIMD) implementation
- The ISPC compiler emits vector instructions (SSE4 or AVX)
- It handles the mapping of conditional control flow to vector instructions

Semantics of ISPC can be tricky
- SPMD abstraction + uniform values (allows implementation details to peek through the abstraction a bit)

12 ISPC discussion: sum reduction. Compute the sum of all array elements in parallel.

Attempt 1 (does not compile):

export uniform float sumall1(
    uniform int N,
    uniform float* x)
{
    uniform float sum = 0.0f;
    foreach (i = 0 ... N)
    {
        sum += x[i];
    }
    return sum;
}

sum is of type uniform float (one copy of the variable for all program instances). x[i] is a non-uniform ("varying") expression (a different value for each program instance). Result: compile-time type error.

Attempt 2 (correct):

export uniform float sumall2(
    uniform int N,
    uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);
    return sum;
}

13 ISPC discussion: sum reduction. Compute the sum of all array elements in parallel.

Each instance accumulates a private partial sum (no communication). The partial sums are then added together using the reduce_add() cross-instance communication primitive. The result is the same for all instances (uniform).

export uniform float sumall2(
    uniform int N,
    uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);
    return sum;
}

The ISPC code above will execute in a manner similar to the handwritten C + AVX intrinsics implementation below.*

const int N = 1024;
float* x = new float[N];
__m256 partial = _mm256_set1_ps(0.0f);

// populate x

for (int i=0; i<N; i+=8)
    partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));

float sum = 0.f;
for (int i=0; i<8; i++)
    sum += partial[i];

* If you understand why this implementation complies with the semantics of the ISPC gang abstraction, then you've got good command of ISPC.
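Another way to see those semantics is a plain scalar C++ emulation of the gang (illustrative; the function name, the gang size of 8, and the interleaved element assignment are assumptions for the sketch). Each "instance" keeps a private partial sum; the final loop plays the role of reduce_add().

// Scalar C++ sketch emulating the gang's per-instance partial sums
// (assumes N is a multiple of programCount and programCount <= 64).
float sumall_emulated(int N, const float* x, int programCount = 8) {
    float partial[64] = {0.0f};                 // one private partial sum per instance
    for (int i = 0; i < N; i += programCount)
        for (int p = 0; p < programCount; p++)
            partial[p] += x[i + p];             // instance p accumulates its interleaved elements
    float sum = 0.0f;
    for (int p = 0; p < programCount; p++)      // cross-instance reduction (what reduce_add does)
        sum += partial[p];
    return sum;
}

Note that because floating-point addition is not associative, the result may differ slightly from a purely sequential left-to-right sum; the gang abstraction does not promise the sequential summation order.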

14 ISPC tasks. The ISPC gang abstraction is implemented by SIMD instructions on one core. So... all the code I've shown you in the previous slides would have executed on only one of the four cores of the GHC 5205 machines. ISPC contains another abstraction, a "task," that is used to achieve multi-core execution. I'll let you read up about that.

15 Today

Three parallel programming models
- Abstractions presented to the programmer
- Influence how programmers think when writing programs

Three machine architectures
- Abstraction presented by the hardware to low-level software
- Typically reflect implementation

Focus on differences in communication and cooperation

16 System layers: interface, implementation, interface, ...

Parallel applications
  Abstractions for describing concurrent, parallel, or independent computation
  Abstractions for describing communication
  (Programming model: a way of thinking about things)
Language or library primitives/mechanisms (system interface)
Compiler and/or parallel runtime
OS system call API (system interface)
Operating system support
Hardware architecture (HW/SW boundary, system interface)
Micro-architecture (hardware implementation)

17 Example: expressing parallelism with pthreads

Parallel application
  Abstraction for describing parallel computation: the thread
  (Programming model)
pthread_create() (system interface)
pthread library implementation
System call API (system interface)
OS support: kernel thread management
x86-64 modern multi-core CPU
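As an illustrative companion to this layering (a minimal sketch, not from the slides), here is roughly how the pthread interface is used: pthread_create() is the library call named above, and the OS kernel supplies the actual threads.

#include <pthread.h>
#include <cstdio>

// Each spawned thread runs this function; the argument carries its id.
void* hello(void* arg) {
    printf("hello from thread %ld\n", (long)arg);
    return nullptr;
}

int main() {
    pthread_t threads[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], nullptr, hello, (void*)i);  // create kernel-managed threads
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], nullptr);                      // wait for all threads to finish
    return 0;
}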

18 Example: expressing parallelism (ISPC)

Parallel applications
  Abstractions for describing parallel computation:
  1. For specifying simultaneous execution (true parallelism)
  2. For specifying independent work (potentially parallel)
  (Programming model)
ISPC language (call ISPC function, foreach construct) (system interface)
ISPC compiler
System call API (system interface)
OS support
x86-64 (including AVX vector instructions), single core of a CPU

Note: ISPC has the "task" language primitive for multi-core execution (not discussed here).

19 Three models of communication (abstractions) 1. Shared address space 2. Message passing 3. Data parallel

20 Shared address space model (abstraction)

Threads communicate by reading/writing shared variables. Shared variables are like a big bulletin board:
- Any thread can read or write

(Figure: variable x resides in memory shared between Thread 1 and Thread 2.)

Thread 1:
  int x = 0;
  x = 1;

Thread 2:
  while (x == 0);
  print x;

21 Shared address space model (abstraction)

Threads communicate by:
- Reading/writing to shared variables
  - Interprocessor communication is implicit in memory operations
  - Thread 1 stores to X. Later, thread 2 reads X (observes the update)
- Manipulating synchronization primitives
  - e.g., mutual exclusion using locks (see the C++ sketch below)

Natural extension of the sequential programming model
- In fact, all our discussions have assumed a shared address space so far

Think: shared variables are like a big bulletin board
- Any thread can read or write
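A minimal C++ sketch of these two ideas (illustrative, not from the slides): both threads read and write the same variable, and a mutex is the synchronization primitive providing mutual exclusion.

#include <thread>
#include <mutex>
#include <cstdio>

int counter = 0;            // shared variable: the "bulletin board"
std::mutex counter_lock;    // synchronization primitive

void worker() {
    for (int i = 0; i < 100000; i++) {
        std::lock_guard<std::mutex> guard(counter_lock);  // acquire lock (mutual exclusion)
        counter++;                                        // communication is implicit in the memory write
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    printf("counter = %d\n", counter);   // 200000: the lock makes the increments race-free
    return 0;
}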

22 Shared address space (implementation)

Option 1: threads share an address space (all data is sharable).
Option 2: each thread has its own virtual address space; the shared portion of the address spaces maps to the same physical location.

(Figure: virtual address spaces and their physical mapping. Image credit: Culler, Singh, and Gupta)

23 Shared address space HW implementation

Any processor can directly reference any memory location.

(Figure: "dance-hall" organization, with processors and their local caches on one side of an interconnect and memories on the other. Interconnect examples: shared bus, crossbar, multi-stage network.)

Symmetric (shared-memory) multi-processor (SMP):
- Uniform memory access time: the cost of accessing an uncached* memory address is the same for all processors
  (* caching introduces non-uniform access times, but we'll talk about that later)

24 Shared address space architectures: commodity x86 examples

- Intel Core i7 (quad core): cores and memory controller connected by an on-chip network (a ring)
- AMD Phenom II (six core)

25 Sun Niagara 2

Eight cores, each connected through a crossbar switch to banked L2 caches and memories. Note the size of the crossbar: about the die area of one core.

26 Non-uniform memory access (NUMA)

All processors can access any memory location, but... the cost of memory access (latency or bandwidth) is different for different processors.

(Figure: processors with local caches, each attached to its own local memory, connected by an interconnect.)

Problem with preserving uniform access time: scalability
- GOOD: costs are uniform; BAD: memory is uniformly far away

NUMA designs are more scalable
- High bandwidth to local memory; bandwidth scales with the number of nodes if most accesses are local
- Low latency access to local memory

Increased programmer effort: performance tuning
- Finding and exploiting locality

27 Non-uniform memory access (NUMA)

Example: a modern dual-socket configuration. Each group of cores (cores 1-4, cores 5-8) has its own memory controller and local memory; the two are connected by an on-chip network and a point-to-point link (AMD HyperTransport / Intel QuickPath). Latency to access a location X in the first socket's memory is higher from cores 5-8 than from cores 1-4.

28 SGI Altix UV 1000 (PSC's Blacklight)

256 blades, 2 CPUs per blade, 8 cores per CPU = 4096 cores. Single shared address space. Interconnect: fat tree. (Image credit: Pittsburgh Supercomputing Center)

29 Shared address space summary

Communication abstraction
- Threads read/write shared variables
- Threads manipulate synchronization primitives: locks, semaphores, etc.
- Logical extension of uniprocessor programming
- But a NUMA implementation requires reasoning about locality for performance

Hardware support to make implementations efficient
- Any processor can load and store from any address
- NUMA designs are more scalable than uniform memory access
- Even so, costly to scale (see the cost of Blacklight)

30 Message passing model (abstraction)

Threads operate within independent address spaces.

Threads communicate by sending/receiving messages
- Explicit, point-to-point
- send: specifies the buffer to be transmitted, the recipient, and an optional message tag
- receive: specifies the buffer in which to store data, the sender, and an (optional) message tag
- Messages may be synchronous or asynchronous

(Figure: Thread 1's address space holds variable X, Thread 2's holds variable Y; send(X, 2, tag) on thread 1 matches recv(Y, 1, tag) on thread 2. Image credit: Culler, Singh, and Gupta)

31 Message passing (implementation)

Popular library: MPI (Message Passing Interface)

Challenges: buffering messages (until the application initiates a receive), minimizing the cost of memory copies

Hardware need not implement system-wide loads and stores
- Connect complete (often commodity) systems together
- Parallel programs for clusters!

(Examples: a cluster of workstations with an InfiniBand network; the IBM Blue Gene/P supercomputer. Image credit: IBM)
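For concreteness, a minimal MPI program in this style might look like the sketch below (illustrative; the buffer size and tag value are arbitrary). Rank 0 sends an array to rank 1, which receives it into its own, separate address space, matching the send/recv abstraction on the previous slide.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;
    float x[4] = {1, 2, 3, 4};
    if (rank == 0) {
        // send: buffer, element count/type, recipient, tag
        MPI_Send(x, 4, MPI_FLOAT, /*dest=*/1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        float y[4];
        // receive: buffer, element count/type, sender, tag
        MPI_Recv(y, 4, MPI_FLOAT, /*source=*/0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f %f %f %f\n", y[0], y[1], y[2], y[3]);
    }
    MPI_Finalize();
    return 0;
}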

32 Correspondence between programming models and machine types is fuzzy

It is common to implement message passing abstractions on machines that support a shared address space in hardware.

It is possible to implement a shared address space abstraction on machines that do not support it in HW (via a less efficient SW solution):
- Mark all pages containing shared variables as invalid
- The page-fault handler issues the appropriate network requests

Keep in mind what is the programming model (the abstractions used to specify the program) and what is the HW implementation.

33 The data-parallel model

34 Data-parallel model

Rigid computation structure.

Historically: same operation on each element of an array
- Matched capabilities of 1980s SIMD supercomputers
- Connection Machine (CM-1, CM-2): thousands of processors, one instruction
- And also Cray supercomputer vector processors
  - Add(A, B, n): this was one instruction on vectors A, B of length n
- Matlab is another good example: A + B (A, B are vectors of the same length)

Today: often takes the form of SPMD programming
- map(function, collection)
- Where "function" may be a complicated sequence of logic (e.g., a loop body)
- Application of the function to each element of the collection is independent
- In pure form: no communication between iterations of the map
- Synchronization is implicit at the end of the map (see the C++ sketch below)
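A plain C++ illustration of the map(function, collection) formulation (a sketch, not from the slides): std::transform plays the role of map, applying the function independently to each element, with no communication between the per-element invocations.

#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> a = {1, 2, 3, 4}, b = {10, 20, 30, 40}, c(4);
    // map the element-wise "add" function over the two collections: Add(A, B, n)
    std::transform(a.begin(), a.end(), b.begin(), c.begin(),
                   [](float x, float y) { return x + y; });
    for (float v : c) printf("%f\n", v);
    return 0;
}

Because each invocation is independent, an implementation is free to run them in parallel; the sequential std::transform above is just the simplest way to express the structure.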

35 Data parallelism in ISPC

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x here

absolute_value(N, x, y);

Think of the loop body as the function (from the previous slide). The foreach construct is a map. The collection is implicitly defined by the array indexing logic.

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}

36 Data parallelism in ISPC

// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];

// initialize N/2 elements of x here

absolute_repeat(N/2, x, y);

Think of the loop body as the function. The foreach construct is a map. The collection is implicitly defined by the array indexing logic.

This is also a valid program! It takes the absolute value of the elements of x and repeats each of them twice in the output vector y.

// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

37 Data parallelism in ISPC

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

Think of the loop body as the function. The foreach construct is a map. The collection is implicitly defined by the array indexing logic.

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (i > 1 && x[i] < 0)
            y[i-1] = -x[i];
        else
            y[i] = x[i];
    }
}

This program is non-deterministic! It is possible for multiple iterations of the loop body to write to the same memory location. The data-parallel model provides no specification of the order in which iterations occur, and no primitives for fine-grained mutual exclusion/synchronization.
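The following plain C++ sketch (illustrative; the input values are made up) shows the conflict concretely: with x = {5, 7, -3}, iteration i=1 writes y[1] = x[1], while iteration i=2 (since i > 1 and x[2] < 0) writes y[1] = -x[2]. The final value of y[1] depends on which write lands last, an order the data-parallel model does not specify.

#include <cstdio>

int main() {
    float x[3] = {5.0f, 7.0f, -3.0f};
    float y[3];

    // order A: iteration i=1 runs before i=2
    y[1] = x[1];   y[1] = -x[2];
    printf("order A: y[1] = %f\n", y[1]);   // 3.0

    // order B: iteration i=2 runs before i=1
    y[1] = -x[2];  y[1] = x[1];
    printf("order B: y[1] = %f\n", y[1]);   // 7.0
    return 0;
}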

38 Data parallelism the more formal way (note: this is not ISPC syntax)

// main program:
const int N = 1024;
stream<float> x(N);   // define collection
stream<float> y(N);   // define collection

// initialize N elements of x here

// map absolute_value onto x, y
absolute_value(x, y);

// kernel definition
void absolute_value(
    float x,
    float y)
{
    if (x < 0)
        y = -x;
    else
        y = x;
}

Data-parallelism expressed in this functional form is sometimes referred to as the stream programming model.

Streams: collections of elements. Elements can be processed independently.

Kernels: side-effect-free functions. They operate element-wise on collections. Think of a kernel's inputs, outputs, and temporaries for each invocation as a private address space.

39 Stream programming benefits

// main program:
const int N = 1024;
stream<float> input(N);
stream<float> output(N);
stream<float> tmp(N);

foo(input, tmp);
bar(tmp, output);

(Figure: input -> foo -> tmp -> bar -> output)

Functions really are side-effect free! (You cannot write a non-deterministic program.)

Program data flow is known:
- Predictable data access facilitates prefetching. The inputs and outputs of each invocation are known in advance, so prefetching can be employed to hide latency.
- Producer-consumer locality. Code can be structured so that the outputs of the first kernel feed immediately into the second kernel. The intermediate values are never written to memory: this saves bandwidth! (A hand-written C++ illustration of this idea follows below.)

These optimizations are the responsibility of the stream program compiler and require sophisticated compiler analysis.
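Here is a hand-written C++ sketch of the producer-consumer optimization (illustrative; the bodies of foo and bar are assumptions, only the foo/bar names come from the slide). A stream compiler could perform this fusion automatically; here it is written out to show what is saved.

#include <vector>

static float foo(float v) { return v * 2.0f; }   // placeholder producer kernel
static float bar(float v) { return v + 1.0f; }   // placeholder consumer kernel

// Two separate passes: the intermediate stream tmp is written to memory and read back.
void two_pass(const std::vector<float>& in, std::vector<float>& out) {
    std::vector<float> tmp(in.size());
    for (size_t i = 0; i < in.size(); i++) tmp[i] = foo(in[i]);
    for (size_t i = 0; i < in.size(); i++) out[i] = bar(tmp[i]);
}

// Fused: each intermediate value stays in a register, saving memory bandwidth.
void fused(const std::vector<float>& in, std::vector<float>& out) {
    for (size_t i = 0; i < in.size(); i++)
        out[i] = bar(foo(in[i]));
}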

40 Stream programming drawbacks

// main program:
const int N = 1024;
stream<float> input(N/2);
stream<float> tmp(N);
stream<float> output(N);

stream_repeat(2, input, tmp);
absolute_value(tmp, output);

You need a library of ad-hoc operators to describe complex data flows (see the use of the repeat operator above to obtain the same behavior as the indexing code below). Then you cross your fingers and hope the compiler generates code intelligently.

Kayvon's experience: this is the Achilles' heel of all "proper" data-parallel/stream programming systems. "If I just had one more operator..."

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

41 Gather/scatter: two key data-parallel communication primitives

Map absolute_value onto a stream produced by a gather:

// main program:
const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_input(N);
stream<float> output(N);

stream_gather(input, indices, tmp_input);
absolute_value(tmp_input, output);

(ISPC equivalent)
export void absolute_value(
    uniform int N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
    foreach (i = 0 ... N)
    {
        float tmp = input[indices[i]];
        if (tmp < 0)
            output[i] = -tmp;
        else
            output[i] = tmp;
    }
}

Map absolute_value onto a stream, then scatter the results:

// main program:
const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_output(N);
stream<float> output(N);

absolute_value(input, tmp_output);
stream_scatter(tmp_output, indices, output);

(ISPC equivalent)
export void absolute_value(
    uniform int N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
    foreach (i = 0 ... N)
    {
        if (input[i] < 0)
            output[indices[i]] = -input[i];
        else
            output[indices[i]] = input[i];
    }
}

42 Gather instruction: gather(R1, R0, mem_base)

Gather from buffer mem_base into R1 according to the indices specified by R0. (Figure: an array in memory at base address mem_base, an index vector R0, and the resulting gathered vector R1.)

The SSE and AVX instruction sets do not directly support SIMD gather/scatter (they must be implemented as a scalar loop, as sketched below). A gather instruction is coming with AVX2 in 2013. Hardware-supported gather/scatter does exist on GPUs (though it is still an expensive operation).
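A scalar C++ sketch of the two primitives (illustrative; parameter names follow the slide's gather(R1, R0, mem_base) notation, and the loop bound width stands in for the SIMD width, e.g., 8 for AVX). This is essentially the loop a compiler has to emit when no gather/scatter instruction exists.

// Gather: R1[i] = mem_base[R0[i]] for each lane.
void gather(float* R1, const int* R0, const float* mem_base, int width) {
    for (int i = 0; i < width; i++)
        R1[i] = mem_base[R0[i]];
}

// Scatter: mem_base[R0[i]] = R1[i] for each lane.
void scatter(const float* R1, const int* R0, float* mem_base, int width) {
    for (int i = 0; i < width; i++)
        mem_base[R0[i]] = R1[i];
}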

43 Data-parallel model summary

Data-parallelism is about program structure.

In spirit, map a single program onto a large collection of data
- Functional: side-effect-free execution
- No communication among invocations
- In practice, that's how most such programs work

But... most popular languages do not enforce this
- OpenCL, CUDA, ISPC, etc.
- They choose the flexibility/familiarity of imperative syntax over safety and the complex compiler optimizations required for functional syntax
- It's been their key to success (and to the recent adoption of parallel programming)
- Hear that, PL folks! (Functional thinking is great, but lighten up!)

44 Three parallel programming models

Shared address space
- Communication is unstructured, implicit in loads and stores
- Natural way of programming, but you can shoot yourself in the foot easily
- The program might be correct, but not scale

Message passing
- Structured communication as messages
- Often harder to get the first correct program than with a shared address space
- The structure is often helpful in getting to the first correct, scalable program

Data parallel
- Structure the computation as a big "map"
- Assumes a shared address space from which to load inputs / store results, but severely limits communication between iterations of the map (goal: preserve independent processing of iterations)
- Modern embodiments encourage, but don't enforce, this structure

45 Modern trend: hybrid programming models

Shared address space within a multi-core node of a cluster, message passing between nodes
- Very, very common
- Use the convenience of a shared address space where it can be implemented efficiently (within a node)

Data-parallel programming models support synchronization primitives in kernels (CUDA, OpenCL)
- Permits limited forms of communication

CUDA/OpenCL use the data-parallel model to scale to many cores, but adopt a shared-address-space model allowing threads running on the same core to communicate.

46 Los Alamos National Laboratory: Roadrunner

Fastest computer in the world in 2008 (no longer true). A 3,240-node cluster with heterogeneous nodes. Each cluster node combines two dual-core AMD CPUs sharing 16 GB of memory (one address space) with four IBM Cell CPUs (8 cores each, each with its own 4 GB memory and address space); the nodes are connected by a network.

47 Summary

Programming models provide a way to think about parallel programs. They provide abstractions that admit many possible implementations.

But the restrictions imposed by these abstractions are designed to reflect the realities of hardware communication costs
- Shared address space machines
- Message passing machines
- It is desirable to keep the abstraction distance low so programs have predictable performance, but you want it high enough for code flexibility/portability

In practice, you'll need to be able to think in a variety of ways
- Modern machines provide different types of communication at different scales
- Different models fit the machine best at the various scales
- Optimization may require you to think about implementations, not just abstractions
