Chap. 4 Part 1. CIS*3090 Parallel Programming, Fall 2016


Adam Marshall
6 years ago

1 Chap. 4 Part 1
CIS*3090 Parallel Programming, Fall 2016

2 Part 2 of textbook: Parallel Abstractions
- How can we think about conducting computations in parallel before getting down to coding?
- Abstraction: higher-level concepts than program code, with details missing
- Covered by ch. 4, "First Steps toward Parallel Programming," and ch. 5, "Scalable Algorithmic Techniques"
- The authors' parallel pseudocode specifies parallel algorithms without biasing toward a programming language

3 Two basic ways to organize parallel computations
- How are we going to put all these processors to work on this problem? What can we find for them to do?
- Based on analyzing either the data or the process, aka task (in the sense of steps to be taken)

4 Essence of data parallel (DP)
- Apply the same operations at once to many different data items
  - Ultimate example: SIMD instructions
- Master/workers is a typical DP pattern, provided the workers are all doing the same operations (on portions of the data set)
- DP scales by increasing the no. of workers (each processing less data)
- DP wins by applying parallelism to instances of data that can be worked on simultaneously
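
The data-parallel pattern above can be sketched in Python as follows. This is only an illustration of the structure (same operation, many data items, a pool of workers); the names `square` and `data_parallel_map` are hypothetical, and in CPython threads will not actually speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """The single operation that every worker applies to its data items."""
    return x * x

def data_parallel_map(items, num_workers=4):
    """Apply the same operation to every item, spread across num_workers workers."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map() returns results in the same order as the inputs
        return list(pool.map(square, items))

print(data_parallel_map([1, 2, 3, 4, 5]))  # prints [1, 4, 9, 16, 25]
```

Scaling here means raising `num_workers`, so that each worker handles a smaller share of `items`.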

5 Essence of task parallel (TP)
- Tasks (processes, threads) are specialized to different stages of the calculation, through which all data instances pass
- Pipeline is the typical TP pattern
- TP scales by increasing the no. of stages (each performing fewer operations)
- TP wins on the basis of increasing throughput by applying parallelism to the subtasks
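
A pipeline of specialized stages might look like the sketch below, with one thread per stage and queues between stages. The helper names (`stage`, `run_pipeline`, `SENTINEL`) are assumptions made for illustration, not anything from the course or textbook.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data stream

def stage(func, inbox, outbox):
    """One pipeline stage: apply func to each item and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))

def run_pipeline(items, funcs):
    """Push every item through all stages, using one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))  # prints [20, 30, 40]
```

While stage 2 works on item 1, stage 1 can already start on item 2, which is where the throughput gain comes from.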

6 Example: Red Cross blood drive in Peter Clark Hall
- Problem: taking blood donations from a large no. of people without being too time-consuming
- Data set = the donors
- Processors = RC personnel & volunteers
[Figure: donor flow through stations 1a, 1b, Queue 1, 1c, 1d, Queue 2, 2a-e, 3]

7 Operations per donor
[Figure: donor flow through stations 1a, 1b, Queue 1, 1c, 1d, Queue 2, 2a-e, 3]
1) Screening:
   a) Identify & check file
   b) Test sample for iron & blood type
   c) Take temperature & blood pressure
   d) Answer health questions
2) Donation:
   a) Lie down
   b) Sanitize arm
   c) Stick needle
   d) Collect blood (obeying qty & time limits)
   e) Remove needle
3) Recovery:
   a) Rest and snack
Where's the TP? Where's the DP?

8 Pseudocode for expressing parallel algorithms
- Authors' invention: Peril-L
- Represents additional parallel constructs on top of conventional pseudocode
- Conceptually targets the CTA, so it can distinguish local vs. non-local memory refs.
- Will later see how easy it is to translate into certain parallel programming languages
- Peril-L keywords & features

9 forall: parallel fork/join
- Looks like a loop!
- Consider it as spawning N threads in lieu of one thread doing N iterations
- The index variable has a separate value in each thread
[Figure: for i=1..3 shown as one thread stepping through i, vs. forall (i in (1..3)) shown as three threads with i=1, i=2, i=3]
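
One way to picture forall is as a fork/join helper. The sketch below is a rough Python analogue, not a Peril-L implementation; `forall`, `body`, and `results` are hypothetical names.

```python
import threading

def forall(indices, body):
    """Fork one thread per index, run body(i) in each, then join them all."""
    threads = [threading.Thread(target=body, args=(i,)) for i in indices]
    for t in threads:
        t.start()   # fork
    for t in threads:
        t.join()    # join: forall finishes only when every thread has finished

results = {}

def body(i):
    # i is private to this thread; each thread writes a distinct key
    results[i] = i * i

forall(range(1, 4), body)
print(sorted(results.items()))  # prints [(1, 1), (2, 4), (3, 9)]
```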

10 Details left unspecified
- How to spawn/fork & join?
- No. of processors (P)
- Distribution of T threads when P<T (aka "oversubscribed")
  - T/P threads per processor: concurrent, not all truly parallel
  - Choosing P threads from a pool of T, repeating till all T have executed
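
The second distribution option (run at most P logical threads at a time until all T have executed) resembles a fixed-size worker pool. The sketch below assumes that reading; `run_oversubscribed` is a hypothetical name, and Python worker threads stand in for processors.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_oversubscribed(T, P):
    """Run T logical threads on a pool of P workers.

    Returns (results, number_of_distinct_workers_used).
    """
    workers = set()
    lock = threading.Lock()

    def logical_thread(i):
        with lock:
            workers.add(threading.get_ident())  # record which worker ran this task
        return i * 2

    with ThreadPoolExecutor(max_workers=P) as pool:
        results = list(pool.map(logical_thread, range(T)))
    return results, len(workers)

results, used = run_oversubscribed(T=10, P=3)
print(results)  # prints [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

All T tasks complete, but never more than P run at once. Note that this run-to-completion scheme would not work if the logical threads had to meet at a barrier, since the waiting tasks would occupy all P workers.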

11 Inter-thread synchronization
- exclusive: denotes a critical section (implicit mutex)
- barrier: where all threads check in, then all continue
  - For this to work, all threads have to be active even if P<T
  - Suspend a thread that's reached the barrier and run another one; continue till all arrive, then wake all
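
As a rough analogue, Python's `threading.Lock` can play the role of exclusive and `threading.Barrier` the role of barrier. The function `run_counter` below is a hypothetical example, not course code.

```python
import threading

def run_counter(num_threads=4, increments=1000):
    """Each thread increments a shared counter, then waits at a barrier."""
    counter = 0
    lock = threading.Lock()                    # stands in for `exclusive`
    barrier = threading.Barrier(num_threads)   # stands in for `barrier`
    seen = []                                  # counter value each thread sees after the barrier

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:                         # critical section: one thread at a time
                counter += 1
        barrier.wait()                         # check in; released once all threads arrive
        with lock:
            seen.append(counter)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, seen

total, seen = run_counter(4, 1000)
print(total)  # prints 4000
```

Because no thread passes the barrier until every thread has finished its increments, each thread observes the final total afterwards.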

12 Local vs. global variables
- Local if declared inside the forall block
  - Per-thread copies, not visible to other threads or outside the block
- Global (underlined) if declared outside
  - Indicates lambda latency!
- All arrays start with index 0

13 Global memory conventions
- Concurrent reads of the same variable are OK; writes are serialized (last wins)
- But the outcome of concurrent writes is non-deterministic!
  - If you don't like that, insert explicit sync (exclusive)
- Models the worst case that happens with real HW
  - Forces you to pay attention to that and deal with it explicitly at program level

14 Accessing global memory
2 methods; you choose, you pay:
- Just reference a global variable in the pseudocode
  - Pays the lambda penalty on each access!
  - Be careful to use exclusive to ensure consistency!
- Localize some/all global data via an explicit call to the localize() pseudo-function
  - Pays the lambda penalty one time
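
The trade-off between the two methods can be put into a toy cost model. The sketch below assumes (this is not stated on the slide) that a local access costs 1 time unit and that localizing is charged a single λ; both function names are hypothetical.

```python
def direct_access_cost(num_accesses, lam):
    """Every reference to a global variable pays the non-local latency lam."""
    return num_accesses * lam

def localized_cost(num_accesses, lam, local_cost=1):
    """Pay lam once for localize(); each later access costs local_cost.

    Assumption: local_cost=1 is an illustrative unit, not a measured value.
    """
    return lam + num_accesses * local_cost

print(direct_access_cost(1000, 100))  # prints 100000
print(localized_cost(1000, 100))      # prints 1100
```

Under these assumptions, localizing pays off as soon as the data is touched more than a couple of times.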

15 Localization convention (p93 code sample)
  int alldata[n];                           Global data structure for n qty. data
  forall (threadid in (0..p-1))             Spawn P threads
  {
    int size=n/p;                           Compute size of the local allocation
    int locdata[size] = localize(alldata[]);
  }
- In Peril-L pseudocode, represents the programmer's choice to pay the lambda penalty once (=λ size) per thread for global access
- After that, locdata[i] is fast access
- What does it mean? How does it work?

16 Inside the localize() pseudo-func.
- Is a local copy actually made?
  - Conceptually no: locdata is like an alias for that thread's portion of alldata
- What about the mismatch between locdata's size and alldata's?
  - localize() automagically maps the local array to the thread's portion of the global array
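
The alias idea can be mimicked in Python with a `memoryview` slice, which is a view onto the original array rather than a copy. This `localize` is a hypothetical stand-in for the pseudo-function, with the thread's index and the thread count passed explicitly, and it assumes the data divides evenly among threads.

```python
from array import array

def localize(all_data, thread_id, num_threads):
    """Return a view (not a copy) of this thread's contiguous block of all_data.

    Assumption: num_threads divides len(all_data) evenly.
    """
    size = len(all_data) // num_threads
    start = thread_id * size
    return memoryview(all_data)[start:start + size]

all_data = array("i", range(8))
loc_data = localize(all_data, thread_id=1, num_threads=4)
print(loc_data.tolist())   # prints [2, 3]
loc_data[0] = 99           # writing through the alias changes the global array
print(all_data[2])         # prints 99
```

Here `loc_data[0]` and `all_data[2]` name the same storage, which is the index mapping the slide describes as automagic.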

17 Can I call localize()?
- This is pseudocode, not a real library call!
- Represents the mechanism used to access global data on your platform in λ time
  - SMP: main mem -> L1 cache auto transfer
  - non-SM: message from another node
- Reading from localized data is fast
  - SMP: from L1 cache
  - non-SM: from local node's memory

18 Writing to localized data
- Because it's an alias, the corresponding global data also changes (in principle)
  - SMP: cache coherency HW auto-updates main memory (and other L1 caches)
  - non-SMP: requires sending a message
- But a localized write is fast
  - SMP: changes only L1 cache (initially)
  - non-SMP: changes node's local memory

19 Who pays lambda for writing?
- Convention is that the reader of global data will be charged for the sync cost
  - SMP: reflects lazy update of main mem. with relaxed consistency model and MESI protocol
  - non-SMP: only one message per reader needs to be sent

20 Careful writing localized data!
- Updates affect the corresponding global data
- If there is no intention of inter-thread communication, no problem
  - Writes by multiple threads do not interfere with each other's data
- Otherwise, this opens up possible corruption: data races between reading and writing threads
  - Must use some sync mechanism (shown later)

21 "Owner Computes" style (p94)
- Promoted by the localization convention
- Lets a thread take ownership of a portion of the data set
- Avoids the requirement for exclusive lock access to the entire global data structure by partitioning data among threads
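
A minimal sketch of the owner-computes idea follows: each thread updates only the block it owns, so no lock is taken. The function name `owner_computes` is hypothetical, and the sketch assumes the data divides evenly among the threads.

```python
import threading

def owner_computes(data, num_threads):
    """Square every element in place; each thread touches only its own block."""
    size = len(data) // num_threads   # assumes num_threads divides len(data)

    def worker(tid):
        start = tid * size
        for i in range(start, start + size):
            data[i] = data[i] * data[i]   # only the owner writes index i

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data

print(owner_computes([1, 2, 3, 4, 5, 6], 3))  # prints [1, 4, 9, 16, 25, 36]
```

Because the blocks are disjoint, no two threads ever write the same element.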

22 Localize() is smart!
- Forces the programmer to explicitly recognize and plan how to manage the biggest problem of parallel computing: the memory bandwidth bottleneck
- Can manage it at the algo. level with the Peril-L convention
- Another magical pseudo-function: mysize(global data set, my index)
  - For when the data doesn't divide evenly into P chunks
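
The slide does not say how mysize() splits the remainder, so the sketch below assumes one common policy: the first n mod P threads each get one extra item. `my_size` and `my_start` are hypothetical helpers that take the data length rather than the data set itself.

```python
def my_size(n, num_threads, index):
    """Block length for thread `index` when n items are split among num_threads.

    Assumed policy: the first (n % num_threads) threads get one extra item.
    """
    base, extra = divmod(n, num_threads)
    return base + (1 if index < extra else 0)

def my_start(n, num_threads, index):
    """Starting offset of thread `index`'s block under the same policy."""
    base, extra = divmod(n, num_threads)
    return index * base + min(index, extra)

print([my_size(10, 4, i) for i in range(4)])   # prints [3, 3, 2, 2]
print([my_start(10, 4, i) for i in range(4)])  # prints [0, 3, 6, 8]
```

The block sizes always sum to n, so every item has exactly one owner.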

23 Global memory & CTA
- As observed before, the CTA doesn't have global mem (GM) per se
  - Conceptually, it is dispersed to one or more processors, in their respective local mems.
  - To get at it, you have to make a non-local mem ref. via the relevant processor
- Looks like a shell game: first, you have GM (pseudocode); then, you don't (CTA); then, you (may) have it again (multicores/SMP)!

24 Big picture: 3 layers
- Top layer = Peril-L pseudocode
  - Programmer's view is having both global & local memory available
- Middle layer = CTA model, like a VM
  - Doesn't have global mem, but can simulate it
  - Level where we conduct algo. performance estimation: O() complexity and lambda cost
- Bottom layer = physical computer
  - Global mem may (multicore/SMP) or may not (cluster) be available; in the latter case it can be simulated by messaging

25 Benefits of layered approach
- Allows complete disconnection of a parallel algo. from a particular HW platform, while still capturing the key property of the platforms: non-local memory latency
- Makes any pseudocoded algo. portable among a wide variety of platforms

26 Summary
- Building on a generalized model of parallel processors, we started to define a pseudocode for describing parallel algos in a HW-agnostic way that still recognizes the HW issues which affect parallel performance!
