ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Introduction

1 ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture)
Introduction
Copyright 2010 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU/EPFL), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton).

2 General Course Information
Professor: Daniel J. Sorin
Course info
» All lecture notes will be posted here before class
» All readings and assignments are posted here
Google group: you're all already invited!
» You MUST keep up with this Google group
Office hours
» Where: 209C Hudson Hall
» When: TBD

3 Course Objectives
Learn about parallel computer architecture
Learn how to read/evaluate research papers
Learn how to perform research
Learn how to present research

4 Course Guidelines
Students are responsible for:
» Leading discussion(s) of research papers - 10% of grade
» Participating in class discussions - 10% of grade
» Midterm exam - 15% of grade
» Final exam - 25% of grade
» Individual or group project - 40% of grade
How to lead a paper discussion
» Summarize the paper
» Handle questions from class & ask questions of class
» Explain what (you think) is good about the paper
» Explain what (you think) is bad or lacking or confusing
» I'll lead the first discussion, so you'll have an example

5 Project
The project is a semester-long assignment that should reflect the goal of being no more than a stone's throw away from a research paper.
» Written proposal (no more than 3 pages), due Weds March 3
» Written progress report (<= 3 pages), due Mon Apr 5
» Final document in conference/journal format (<= 12 pages), Apr 23
» Final presentation (in class), TBD
Groups of 2 or 3 are encouraged
Get started early! Talk to me about project ideas.

6 Academic Misconduct
I will not tolerate academically dishonest work. This includes cheating on the final exam and plagiarism on the project.
Be careful on the project to cite prior work and to give proper credit to others' research.
Ask me if you have any questions. Not knowing the rules does not make misconduct OK.

7 Course Topics
Parallel programming
Machine organizations
Cache-coherent shared memory machines
Memory consistency models & transactional memory
Interconnection networks
Evaluation tools and methodology
High availability systems
Scalable, non-coherent machines
Novel architectures (vectors, GPU, dataflow, grid)
Interactions with microprocessors and I/O
Impact of new technology (quantum, nano)

8 Where to find multiprocessor research
Where is MP research presented?
» ISCA = International Symposium on Computer Architecture
» ASPLOS = Arch. Support for Programming Languages and OS
» MICRO = International Symposium on Microarchitecture (!!!)
» HPCA = High Performance Computer Architecture
» SPAA = Symposium on Parallel Algorithms and Architecture
» ICS = International Conference on Supercomputing
» PACT = Parallel Architectures and Compilation Techniques
» Etc.
Terminology note: I'm going to use multiprocessor as a catch-all that includes multicore processors, except when the difference is important
» For our purposes, it often doesn't matter if on 1 chip or many chips

9 Outline for Intro to Multiprocessing
Motivation & Applications
Programming Models & A Generic Parallel Machine
» Shared Memory, Message Passing, Data Parallel
Issues in Programming Models
» Function: naming, operations, & ordering
» Performance: latency, bandwidth, etc.

10 Motivation
ECE 252 / CPS 220: focus on uniprocessors
This course uses N processors in a computer to
» Increase Throughput via many jobs in parallel
» Improve Cost-Effectiveness (e.g., adding 3 processors may yield 4X throughput for 2X system cost)
» Reduce Latency for shrink-wrapped software (e.g., databases and web servers)
» Reduce latency through Parallelization of your application (but this is hard)
» Avoid Melting the chip
Need more performance than today's processor core?
» Wait for tomorrow's processor core
» Use more cores in parallel

11 Applications: Science and Engineering
Examples
» Weather prediction
» Evolution of galaxies
» Oil reservoir simulation
» Automobile crash tests
» Drug development
» VLSI CAD
» Nuclear bomb simulation
Typically model physical systems or phenomena
Problems are 2D or 3D
Usually require number crunching
Involve true parallelism

12 Applications: Commercial
Examples
» On-line transaction processing (OLTP)
» Decision support systems (DSS)
» Application servers or middleware (WebSphere)
Involves data movement, not much number crunching
» OLTP has many small queries
» DSS has fewer but larger queries
Involves throughput parallelism
» Inter-query parallelism for OLTP
» Intra-query parallelism for DSS

13 Applications: Multi-media/home
Examples
» Speech recognition
» Audio/video
» Data compression/decompression
» 3D graphics
» Gaming!
Involves everything (crunching, data movement, true parallelism, and throughput parallelism)

14 Outline
Motivation & Applications
Programming Models & A Generic Parallel Machine
Issues in Programming Models

15 In Theory
Sequential
» Time to sum n numbers? O(n)
» Time to sort n numbers? O(n log n)
Parallel
» Time to sum? Tree for O(log n)
» Time to sort? Non-trivially O(log n)
What model?
» PRAM [Fortune, Wyllie STOC78]
» P processors in lock-step
» One memory (e.g., CREW for concurrent read exclusive write)
[Figure: compare-exchange steps on four elements, alternating the pairs 1<->2 and 3<->4 with the pair 2<->3]
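
A minimal sketch of the tree-shaped sum mentioned above, assuming an idealized lock-step machine in which every pair at one level of the tree is added in the same time step. The function name `tree_sum` and the step counter are illustrative, not from the slides; the code runs sequentially and only counts how many parallel steps the tree would need.

```python
def tree_sum(values):
    """Sum values by pairwise combining; return (total, parallel_steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # On an idealized lock-step machine, each pair at this level
        # could be added by a different processor in the same step.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd element carries over unchanged
        level = nxt
        steps += 1
    return (level[0] if level else 0), steps


total, steps = tree_sum(range(1, 9))  # 8 inputs need 3 levels
```

The step count grows as the ceiling of log2(n), while a single processor would need n - 1 additions.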

16 But in Practice, How Do You
Name a datum across processors?
Communicate values?
Coordinate and synchronize?
Select processing node size (few-bit ALU to a PC)?
Select number of nodes in system?

17 Programming Model
Provides a communication abstraction that is a contract between hardware and software (a la ISA)
Programming model != programming language
Current Programming Models
1) Shared Memory
2) Message Passing
3) Data Parallel (Shared Address Space)
4) (Dataflow)

18 Programming Model
How does the programmer view the system?
» Which is NOT the same as how the system actually behaves!!
Shared memory: processors execute instructions and communicate by reading/writing a globally shared memory
Message passing: processors execute instructions and communicate by explicitly sending messages
Data parallel: processors do the same instructions at the same time, but on different data

19 Historical View
Historically: system architecture and programming model were tied together
[Figure: three machine organizations built from processor (P), memory (M), and I/O (IO) blocks, joined at the I/O network, at memory, and at the processor]
Join At:         Program With:
I/O (Network)    Message Passing
Memory           Shared Memory
Processor        Data Parallel, Single-Instruction Multiple-Data (SIMD)

20 Historical View, cont.
Architecture -> Programming Model
» Join at network: program with message passing model
» Join at memory: program with shared memory model
» Join at processor: program with SIMD or data parallel
Programming Model -> Architecture
» Message-passing programs on message-passing arch
» Shared-memory programs on shared-memory arch
» SIMD/data-parallel programs on SIMD/data-parallel arch
» Slippery slope that led to LISP machines
But
» Isn't hardware basically the same? Processors, memory, & I/O?
» Convergence! Why not have generic parallel machine & program with model that fits the problem?

21 A Generic Parallel Machine
[Figure: four nodes (Node 0 to Node 3) joined by an interconnect; each node has a processor, cache ($), memory (Mem), and communication assist (CA)]
Reminder: could be on one chip or many
Separation of programming models from architectures
All models require communication
Node with processor(s), memory, communication assist

22 Today's Parallel Computer Architecture
Extension of core architecture to support communication and cooperation
Communications architecture
[Figure: layered view, top to bottom]
» Programming Model: Multiprogramming, Shared Memory, Message Passing, Data Parallel
» Communication Abstraction (user level / system level): Libraries and Compilers, Operating System Support
» Hardware/Software Boundary: Communication Hardware
» Physical Communication Medium

23 Simple Problem
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
How do I make this parallel?

24 Simple Problem
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
Split the loops
» Independent iterations
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
for i = 1 to N
  sum = sum + A[i]
Data flow graph?
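
A small runnable sketch of the loop split, taking the loop body as A[i] = (A[i] + B[i]) * C[i] followed by sum = sum + A[i]. The function names `fused` and `split` are illustrative. The point is that the two forms compute the same result, while the split form isolates a first loop whose iterations do not depend on each other.

```python
def fused(A, B, C):
    """Original single loop: the update and the accumulation are interleaved."""
    A = list(A)
    total = 0.0
    for i in range(len(A)):
        A[i] = (A[i] + B[i]) * C[i]
        total += A[i]
    return A, total


def split(A, B, C):
    """Split form: an independent element-wise loop, then a reduction loop."""
    # Loop 1: no iteration reads a value written by another iteration,
    # so the iterations could run on different processors.
    A = [(a + b) * c for a, b, c in zip(A, B, C)]
    # Loop 2: a reduction; each iteration depends on the running total.
    total = 0.0
    for a in A:
        total += a
    return A, total


A, B, C = [1, 2, 3, 4], [1, 1, 1, 1], [2, 2, 2, 2]
same = fused(A, B, C) == split(A, B, C)
```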

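25 Data Flow Graph
[Figure: for each i in 0..3, A[i] and B[i] feed a + node whose result, together with C[i], feeds a * node; the four products feed a chain of + nodes that accumulates sum]
2 + N-1 cycles to execute on N processors
But with what assumptions?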
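
One way to read the 2 + N-1 count, under assumptions the slide leaves open: every operation takes one cycle, operands arrive with no communication delay, and the sum is accumulated as a serial chain. The tree variant in the second line is an added comparison, not something the slide states.

```latex
\begin{align*}
T_{\text{chain}}(N) &= \underbrace{1}_{\text{all } A[i] + B[i]}
                     + \underbrace{1}_{\text{all } (\cdot) \times C[i]}
                     + \underbrace{(N-1)}_{\text{serial additions into sum}}
                     = 2 + (N-1) \\
T_{\text{tree}}(N)  &= 2 + \lceil \log_2 N \rceil
                     \qquad \text{(if the additions are arranged as a tree)}
\end{align*}
```

For N = 4 this gives 5 cycles for the chain and 4 for the tree; any real communication or synchronization cost adds to both.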

26 Partitioning of Data Flow Graph
[Figure: the same data flow graph over A[0..3], B[0..3], and C[0..3], partitioned across processors, with a global synch marked in the graph]

27 Shared Memory Architectures
[Figure: per-process virtual address spaces, each with a shared portion and a private portion; loads and stores to the shared portions map to common physical addresses, while the private portions (P0 private to Pn private) map to separate regions of the machine physical address space]
Communication, sharing, and synchronization with loads/stores on shared variables
Must map virtual pages to physical page frames
Consider OS support for good mapping
Examples: most of the servers from Sun, IBM, Intel, Compaq/HP, etc.

28 Page Mapping in Shared Memory MP
[Figure: four nodes on an interconnect, each with a processor, cache ($), memory (Mem), and network interface (NI); Node 0 holds addresses 0 to N-1, Node 1 holds N to 2N-1, Node 2 holds 2N to 3N-1, Node 3 holds 3N to 4N-1]
Keep private data and frequently used shared data on same node as computation
Load by Processor 0 to address N+3 goes to Node 1
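
A minimal sketch of the address-to-node mapping in the figure, assuming a simple block distribution in which node k holds physical addresses kN to (k+1)N-1. The name `home_node` and its parameters are hypothetical; real machines may interleave at page or cache-line granularity instead.

```python
def home_node(addr, words_per_node, num_nodes=4):
    """Return the node whose memory holds physical address addr.

    Assumes node k owns the contiguous range
    [k * words_per_node, (k + 1) * words_per_node - 1].
    """
    if not 0 <= addr < words_per_node * num_nodes:
        raise ValueError("address outside the machine physical address space")
    return addr // words_per_node


N = 1024                      # illustrative memory size per node
node = home_node(N + 3, N)    # the load to address N+3 from the slide
```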

29 Return of the Simple Problem (Shared Memory)
private int i, my_start, my_end, mynode;
shared float A[N], B[N], C[N], sum;
for i = my_start to my_end
  A[i] = (A[i] + B[i]) * C[i]
GLOBAL_SYNCH;
if (mynode == 0)
  for i = 1 to N
    sum = sum + A[i]
Can run this pseudocode on any machine that supports shared memory
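
A runnable sketch of the shared-memory pseudocode using Python threads, assuming a block partition of the index range and using `threading.Barrier` to stand in for GLOBAL_SYNCH. The function name `shared_memory_sum` and the `num_threads` parameter are illustrative. It shows the programming model only; CPython threads will not give a real speedup on this loop.

```python
import threading


def shared_memory_sum(A, B, C, num_threads=4):
    n = len(A)
    A = list(A)                 # shared array, updated in place by all threads
    result = {"sum": 0.0}       # shared scalar
    barrier = threading.Barrier(num_threads)  # stands in for GLOBAL_SYNCH

    def worker(mynode):
        # Private variables: each thread owns one contiguous slice.
        my_start = mynode * n // num_threads
        my_end = (mynode + 1) * n // num_threads
        for i in range(my_start, my_end):
            A[i] = (A[i] + B[i]) * C[i]
        barrier.wait()          # nobody reads A until every slice is written
        if mynode == 0:
            total = 0.0
            for i in range(n):
                total += A[i]   # plain loads of data written by other threads
            result["sum"] = total

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A, result["sum"]


new_A, total = shared_memory_sum(list(range(1, 9)), [1] * 8, [2] * 8)
```

Communication happens only through ordinary reads and writes of `A`; the barrier is the one explicit synchronization point.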

30 Message Passing Architectures
[Figure: four nodes on an interconnect, each with a processor, cache ($), memory (Mem), and communication assist (CA); every node has its own local address range 0 to N-1]
Cannot directly access memory on another node
IBM SP-2, Intel Paragon, clusters of PCs ("beowulf")

31 Message Passing Programming Model
[Figure: process P executes Send x, Q, t from address x in its local process address space; process Q executes Recv y, P, t into address y in its own local process address space; the send and the receive are matched]
User-level send/receive abstraction
» local buffer (x,y), process (P,Q), and tag (t)
» naming and synchronization

32 The Simple Problem Again (Message Passing)
int i, my_start, my_end, mynode;
float A[N/P], B[N/P], C[N/P], sum;
for i = 1 to N/P
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
if (mynode != 0)
  send(sum, 0);
if (mynode == 0)
  for i = 1 to P-1
    recv(tmp, i)
    sum = sum + tmp
Send/Recv communicates and synchronizes processors
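
A runnable sketch of the message-passing version, assuming threads stand in for processes and `queue.Queue` objects stand in for the network. The helpers `send` and `recv` and the function `message_passing_sum` are hypothetical names for illustration, not a real messaging API. Each process works only on its private slice, and the only way a partial sum reaches node 0 is through an explicit message.

```python
import queue
import threading


def message_passing_sum(A, B, C, num_procs=4):
    n = len(A)
    # inbox[src] carries messages from node src to node 0.
    inbox = {src: queue.Queue() for src in range(1, num_procs)}
    result = queue.Queue()

    def send(value, src):
        inbox[src].put(value)

    def recv(src):
        return inbox[src].get()   # blocks until the message from src arrives

    def process(mynode, a, b, c):
        # a, b, c are private copies: no other process can read them.
        total = 0.0
        for i in range(len(a)):
            a[i] = (a[i] + b[i]) * c[i]
            total += a[i]
        if mynode != 0:
            send(total, mynode)
        else:
            for src in range(1, num_procs):
                total += recv(src)   # receiving also synchronizes with src
            result.put(total)

    threads = []
    for k in range(num_procs):
        lo, hi = k * n // num_procs, (k + 1) * n // num_procs
        t = threading.Thread(
            target=process,
            args=(k, list(A[lo:hi]), list(B[lo:hi]), list(C[lo:hi])),
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return result.get()


total = message_passing_sum(list(range(1, 9)), [1] * 8, [2] * 8)
```

No barrier is needed here: the blocking receive on node 0 provides the synchronization that GLOBAL_SYNCH provided in the shared-memory version.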

33 Separation of Architecture from Model
At the lowest level, shared memory systems send messages
» HW is specialized to expedite read/write messages
What programming model / abstraction is supported at user level?
Can I have shared-memory abstraction on message passing HW?
Can I have message passing abstraction on shared memory HW?
Answer: YES! You can mix and match!
Some 1990s research machines integrated both
» MIT Alewife, Wisconsin Tempest/Typhoon, Stanford FLASH

34 Programming Model: Data Parallel
Operations are performed on each element of a large (regular) data structure in a single step
» Arithmetic, global data transfer
Processor is logically associated with each data element
» SIMD architecture: Single instruction, multiple data
Early architectures mirrored programming model
» Many bit-serial processors
Today most architectures have data parallel instrs (e.g., Intel has MMX/SSE, SPARC has VIS)
Can support data parallel model on shared memory or message passing architecture

35 The Simple Problem Strikes Back
Assuming we have N processors
A = (A + B) * C
sum = global_sum (A)
Language supports array assignment
Special HW support for global operations
CM-2: bit-serial
CM-5: 32-bit SPARC processors
» Message Passing and Data Parallel models
» Special control network

36 Aside -- Single Program, Multiple Data
SPMD Programming Model
» Each processor executes the same program on different data
» Many Splash2 benchmarks are in SPMD model
for each molecule at this processor {
  simulate interactions with myproc+1 and myproc-1;
}
Not connected to SIMD architecture model
» Not lockstep instructions
» Could execute different instructions on different processors
» Data dependent branches cause divergent paths

37 Data Flow Architectures
[Figure: data flow graph in which A[i] and B[i] are added and the result is multiplied by C[i], for i in 0..3]
Execute Data Flow Graph
No control sequencing

38 Data Flow Architectures
Explicitly represent data dependencies (dataflow graph)
No artificial constraints, like sequencing instructions!
» Early machines had no registers or cache
Instructions can fire when operands are ready
» Remember Tomasulo's algorithm?
How do we know when operands are ready?
» Matching store
» Large associative search!
Later machines moved to coarser grain (threads)
» Allowed registers and cache for local computation
» Introduced messages (with operations and operands)
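
A toy sketch of the firing rule, assuming a graph given as a dictionary from node name to an operation and its operand names. The function `run_dataflow` and the graph encoding are hypothetical, and the `all(s in values ...)` check is only a software stand-in for a matching store. There is no program counter: a node fires as soon as all of its operands have values.

```python
import operator


def run_dataflow(nodes, inputs):
    """nodes: name -> (op, [operand names]); inputs: name -> value."""
    values = dict(inputs)      # tokens produced so far
    pending = dict(nodes)      # nodes that have not fired yet
    firing_order = []
    while pending:
        # A node is enabled once every one of its operands has a value.
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in values for s in srcs)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        for name in ready:     # all enabled nodes could fire in the same step
            op, srcs = pending.pop(name)
            values[name] = op(*(values[s] for s in srcs))
        firing_order.append(sorted(ready))
    return values, firing_order


# The simple problem for two elements: sum of (A[i] + B[i]) * C[i].
graph = {
    "sum": (operator.add, ["p0", "p1"]),
    "p0": (operator.mul, ["t0", "C0"]),
    "p1": (operator.mul, ["t1", "C1"]),
    "t0": (operator.add, ["A0", "B0"]),
    "t1": (operator.add, ["A1", "B1"]),
}
inputs = {"A0": 1, "B0": 1, "C0": 2, "A1": 2, "B1": 1, "C1": 2}
values, firing_order = run_dataflow(graph, inputs)
```

The order in which nodes are listed in `graph` does not matter; the firing order is determined only by data availability.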

39 Review: Separation of Model and Architecture
Shared Memory
» Single shared address space
» Communicate, synchronize using load / store
» Can support message passing
Message Passing
» Send / Receive
» Communication + synchronization
» Can support shared memory
Data Parallel
» Lock-step execution on regular data structures
» Often requires global operations (sum, max, min...)
» Can support on either shared memory or message passing
Dataflow

40 Review: A Generic Parallel Machine
[Figure: four nodes (Node 0 to Node 3) joined by an interconnect; each node has a processor, cache ($), memory (Mem), and communication assist (CA)]
Separation of programming models from architectures
All models require communication
Node with processor(s), memory, communication assist
