Topic 22: Multi-Processor Parallelism

COS / ELE 375: Computer Architecture and Organization
Princeton University, Fall 2015
Prof. David August

Review: Parallelism
- Independent units of work can execute concurrently if sufficient resources exist.
[Figure: units of work scheduled under two constraints, dependence limited and resource limited]

Review: Where to Find Parallelism?
Parallelism exists at different granularities:
- Instruction level
  - Ex: an add instruction executes with a multiply instruction
  - Compilers and hardware are good at finding this
- Thread level
  - Ex: the screen redraw function executes with recalculate in a spreadsheet
  - Programmers are OK at finding this
- Process level
  - Ex: a simulation job runs on the same machine as a spreadsheet
  - Users are good at creating this

Thread Level Parallelism
- Programmer generally makes TLP explicit
- Compilers can extract threads in regular programs

    for (i = 0; i < 200; i++)
        for (j = 1; j < 20000; j++)
            val[i][j] = val[i][j-1] + 1;

    forall (i = 0; i < 200; i++)
        for (j = 1; j < 20000; j++)
            val[i][j] = val[i][j-1] + 1;

Each row i reads and writes only its own elements, so the outer loop can become a forall. The inner loop stays sequential because val[i][j] depends on val[i][j-1].

Thread Level Parallelism: Synchronization
- Unlike in ILP, flow of data/dependences must be explicit

Single thread:

    while (ptr = ptr->next)
        sum += ptr->val;

Split into a producer thread and a consumer thread:

    while (ptr = ptr->next)
        produce(ptr);
    produce(null);

    while (ptr = consume(ptr))
        sum += ptr->val;

- Communication and synchronization (order and flow)

Multiple Processor Organization: Message Passing/Private Memory
- Threads communicate directly (send, receive)
- Scales relatively well
- No memory coherence problem (for the hardware at least)
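The forall transformation on the loop above can be sketched in C as a split of the row range into chunks. This is a minimal illustration, not the course's code: the names `fill_rows` and `fill_all` are made up here, and the chunks run one after another so the sketch stays free of any threading library. In a threaded version each chunk would go to its own thread, which is safe because no row reads another row's data.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 200
#define COLS 20000

/* Zero-initialized, so val[i][0] == 0 for every row. */
static int val[ROWS][COLS];

/* Process rows [row_begin, row_end). Rows never touch each other's data,
   so disjoint row ranges could run on different threads without locks. */
void fill_rows(size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end && row_end <= ROWS);
    for (size_t i = row_begin; i < row_end; i++)
        for (size_t j = 1; j < COLS; j++)   /* sequential: needs val[i][j-1] */
            val[i][j] = val[i][j - 1] + 1;
}

/* Split the rows into num_workers contiguous chunks (hypothetical helper).
   The chunks are executed in order here; the split is the point. */
void fill_all(size_t num_workers)
{
    assert(num_workers > 0);
    size_t chunk = (ROWS + num_workers - 1) / num_workers;
    for (size_t w = 0; w < num_workers; w++) {
        size_t begin = w * chunk;
        size_t end = begin + chunk;
        if (begin > ROWS) begin = ROWS;
        if (end > ROWS) end = ROWS;
        fill_rows(begin, end);
    }
}
```

Calling `fill_all(4)` hands rows 0 to 49, 50 to 99, 100 to 149, and 150 to 199 to four notional workers.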

Multiple Processor Organization
- May exist on a single chip

Thread Level Parallelism
- Programmer generally makes TLP explicit
- Compilers can extract threads in regular programs

    for (i = 0; i < 200; i++)
        for (j = 1; j < 20000; j++)
            val[i][j] = val[i][j-1] + 1;

    forall (i = 0; i < 200; i++)
        for (j = 1; j < 20000; j++)
            val[i][j] = val[i][j-1] + 1;

Multiple Processor Organizations: Shared Memory/Shared Bus

Multiple Processor Organizations
- May be on a single chip!

Multiple Processor Organizations
- Shared bus traffic
- Synchronization and communication through memory
- The cache coherency problem

Multiple Processor Organizations: Shared Bus
[Figure: processors 1, 2, ..., N with caches on a shared bus; one cache holds A=5, another holds A=7]

    P1: M[A] = 5
    P2: M[A] = 7
    P1: r1 = M[A]   ;; value?
    PN: r1 = M[A]   ;; value?

Cache Coherency Solution
[Figure: processors 1, 2, ..., N; cached copies A=5 and A=7]

    P1: M[A] = 5
    P2: M[A] = 7
    P1: r1 = M[A]   ;; value?
    PN: r1 = M[A]   ;; value?

Cache Coherency Solution: Snoopy Bus

Snooping Protocols
- A variety of protocols minimize traffic for different situations
- Generally many states, including invalid, dirty (read/write), and clean (read-only)
- Reads: just work
- Writes:
  - Write-Invalidate
    - other caches with the address invalidate the line (block)
    - only the first write generates traffic
  - Write-Update
    - other caches with the address update the values in the line (block)
    - like write-through
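The write-invalidate behavior can be sketched as a small single-threaded model in C. This is an illustration under simplifying assumptions, not the protocol from the slides: each cache holds one line for address A, there are only three states, and every name (`system_model`, `cache_read`, `cache_write`, and so on) is invented for the sketch.

```c
#include <assert.h>

#define NUM_CACHES 3

/* invalid, clean (read-only), dirty (read/write) */
enum line_state { INVALID, SHARED, MODIFIED };

struct cache_line { enum line_state state; int value; };

struct system_model {
    struct cache_line cache[NUM_CACHES];
    int memory;              /* backing store for M[A] */
    int bus_transactions;    /* how many times the bus was used */
};

void system_init(struct system_model *s)
{
    for (int p = 0; p < NUM_CACHES; p++) {
        s->cache[p].state = INVALID;
        s->cache[p].value = 0;
    }
    s->memory = 0;
    s->bus_transactions = 0;
}

/* Read: a hit needs no bus traffic; a miss snoops for a dirty copy. */
int cache_read(struct system_model *s, int p)
{
    assert(p >= 0 && p < NUM_CACHES);
    struct cache_line *line = &s->cache[p];
    if (line->state == INVALID) {
        s->bus_transactions++;
        for (int q = 0; q < NUM_CACHES; q++) {
            if (q != p && s->cache[q].state == MODIFIED) {
                s->memory = s->cache[q].value;   /* owner writes back */
                s->cache[q].state = SHARED;
            }
        }
        line->value = s->memory;
        line->state = SHARED;
    }
    return line->value;
}

/* Write: the first write broadcasts an invalidate; later writes to a
   line already in MODIFIED state stay local. */
void cache_write(struct system_model *s, int p, int v)
{
    assert(p >= 0 && p < NUM_CACHES);
    struct cache_line *line = &s->cache[p];
    if (line->state != MODIFIED) {
        s->bus_transactions++;
        for (int q = 0; q < NUM_CACHES; q++) {
            if (q == p) continue;
            if (s->cache[q].state == MODIFIED)
                s->memory = s->cache[q].value;   /* write back before dropping */
            s->cache[q].state = INVALID;
        }
        line->state = MODIFIED;
    }
    line->value = v;
}
```

Replaying the slide's sequence with index 0 as P1, 1 as P2, and 2 as PN: after `cache_write(&s, 0, 5)` and `cache_write(&s, 1, 7)`, both `cache_read(&s, 0)` and `cache_read(&s, 2)` return 7 in this model, because P1's stale copy was invalidated by P2's write.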

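Sample Protocol: Signals From Processor
[Figure: state-transition diagram; contents not recoverable]

Sample Protocol: Signals From Bus
[Figure: state-transition diagram; contents not recoverable]
- Other protocols are MESIer (a pun on MESI: Modified, Exclusive, Shared, Invalid)

Synchronization/Semaphores
- Spin lock
- Atomic exclusive access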
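A spin lock built on an atomic test-and-set can be sketched with the C11 `<stdatomic.h>` primitives. The `spinlock` type and the `spin_*` function names are made up for this sketch; only `atomic_flag`, `atomic_flag_test_and_set_explicit`, and `atomic_flag_clear_explicit` come from the C standard library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct spinlock { atomic_flag held; };

void spin_init(struct spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}

/* Atomically set the flag and look at its previous value.
   A previous value of false means this caller took the lock. */
bool spin_trylock(struct spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Keep retrying the atomic test-and-set until it succeeds. */
void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l))
        ;   /* spin */
}

/* Release ordering makes writes in the critical section visible
   to the next thread that acquires the lock. */
void spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The atomic read-modify-write is what gives exclusive access: two threads cannot both observe the flag as clear. On a snooping bus, each failed attempt is still a write, which is one reason spinning can generate coherence traffic.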

Multiple Processor Organizations: Simultaneous Multithreading ("Hyperthreading")
- Multiple threads in a single core
- Helps when single-thread ILP is low
- Like an ILP processor, but:
  - Multiple PCs, one per thread
  - Instructions are tagged with a thread ID
  - Architectural register file per thread
  - Threads share execution resources
- Cross-thread synchronization and communication through memory/cache

CFGs, PCs, and Cross-Iteration Deps
[Figure: control-flow graph; only fragments such as "1. r1 = r..." and "Branch r1 < ..." survive]

Loop-Level Parallelization: DOALL
[Figure: control-flow graph split across threads; only fragments such as "r1 = 9" and "5. Branch r1 < ..." survive]

Another Example
[Figure: control-flow graph shown over several animation steps; legible fragments include "1. r1 = r1 + 1", "r1 = 9", and "5. Branch r2 == 10"]

Speculation
[Figure: control-flow graph; legible fragments include "1. r1 = 9" and "5. Branch r2 == 10"]

Speculation, Commit, and Recovery
[Figure: two communicating threads; the instruction listing is only partly legible]

    1. r1 = 9
    4. Send{1} r2
    5. Jump

    1. r2 = Receive{1}
    2. Branch r2 != ...
       MEM[r1] = r2
    4. r2 = Receive{2}
    5. Branch r2 != ...
       MEM[r1] = r2
    7. Jump

- Kill and continue

Difficult Dependences

    1. r1 = Head

    1. r1 = MEM[r1]
    2. Branch r1 == 0
    3. r2 = MEM[r1 + 4]
    4. r3 = Work (r2)
    5. Print ( r3 )
    6. Jump

DOACROSS

    1. r1 = Head

    1. r1 = MEM[r1]
    2. Branch r1 == 0
    3. r2 = MEM[r1 + 4]
    4. r3 = Work (r2)
    5. Print ( r3 )
    6. Jump
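The linked-list loop above can be written in C with its cross-iteration dependence separated from the per-node work. This is a sketch under stated assumptions: the `node` layout (next pointer at offset 0, value at offset 4, which implies 32-bit pointers), the `traverse` and `process` names, and the body of `work` are all hypothetical, and `work` is assumed to have no side effects. The Print step is left out; it would have to run in list order.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;   /* MEM[r1] */
    int val;             /* MEM[r1 + 4] in the slide's layout */
};

/* Hypothetical stand-in for Work(); assumed to be side-effect free. */
static int work(int x) { return x * x; }

/* Pointer chasing: each step needs the previous node, so this chain is
   the cross-iteration dependence. Like the slide's loop, it starts from
   the node after head. Returns how many values were stored in out[]. */
size_t traverse(const struct node *head, int *out, size_t capacity)
{
    assert(head != NULL && out != NULL);
    size_t n = 0;
    const struct node *p = head;
    while ((p = p->next) != NULL && n < capacity)
        out[n++] = p->val;
    return n;
}

/* Per-node work: iterations do not depend on each other, so they could
   be spread across cores once the values are known. */
void process(const int *in, int *results, size_t n)
{
    assert(in != NULL && results != NULL);
    for (size_t i = 0; i < n; i++)
        results[i] = work(in[i]);
}
```

Only `traverse` is inherently sequential. The parallelization schemes on these slides differ in how they overlap that sequential chain with the independent work.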


PS-DSWP

    1. r1 = Head

    1. r1 = MEM[r1]
    2. Branch r1 == 0
    3. r2 = MEM[r1 + 4]
    4. r3 = Work (r2)
    5. Print ( r3 )
    6. Jump

[Figure: SPEC CINT performance (log scale) versus year, spanning the CPU92, CPU95, CPU2000, and CPU2006 suites. Annotations: "Golden era of computer architecture"; "Era of DIY: Multicore, Reconfigurable, GPUs, Clusters"; "~3 years behind"; "10 Cores!" (Intel Xeon); "Unparalleled Performance"]

P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)
- Automatic speculation
- Automatic pipelining
- Commit
- Parallel resources
- Automatic allocation/scheduling

MULTICORE ARCHITECTURE (CIRCA 2010)
[Figure labels: Automatic Speculation; Automatic Pipelining; Commit; Parallel Resources; Automatic Allocation/Scheduling; Threads; Parallel Library Calls; Realizable parallelism]
Credit: Jack Dongarra

Compiler Advances Double Computing Power Every 18 Years! (Proebsting's Law)

P6 SUPERSCALAR ARCHITECTURE / Spec-PS-DSWP
[Figure: schedule across cores with stage labels LD:1 to LD:5, W:1 to W:4, and C:1 to C:3]

Example

    A: while (node) {
    B:     node = node->next;
    C:     res = work(node);
    D:     write(res);
       }

Program Dependence Graph
[Figure: nodes A, B, C, D joined by control dependence and data dependence edges; unrolled instances A1, C1, D1, A2, B2]

Spec-DOALL Example

    A: while (node) {
    B:     node = node->next;
    C:     res = work(node);
    D:     write(res);
       }

Program Dependence Graph
[Figure, built up over several animation steps: nodes A, B, C, D joined by control dependence and data dependence edges; unrolled instances A1, C1, D1, A2, B2, A3, B3, and later B4, C3, C4, D3, D4]

In the final step the loop condition is speculated, so "while (node)" is shown replaced by "while (true)". Annotations on that step: "Slowdown" and "197.parser".

Spec-DOACROSS and Spec-DSWP
- Spec-DOACROSS throughput: 1 iter/cycle
- Spec-DSWP throughput: 1 iter/cycle
[Figure: per-core schedules of the B, C, and D instances (B2, C1, D1, B3, ...) for each scheme]

Comparison: Spec-DOACROSS and Spec-DSWP

    Comm. latency    Spec-DOACROSS     Spec-DSWP
    1                1 iter/cycle      1 iter/cycle
    2                0.5 iter/cycle    1 iter/cycle

[Figure: per-core schedules at each communication latency; the Spec-DSWP schedule is annotated "Pipeline Fill time"]

Spec-DOACROSS vs. Spec-DSWP [MICRO 2010]
- Geomean of 11 benchmarks on the same cluster
[Figure: Performance Speedup (X) versus (Number of Total Cores, Number of Nodes); series: TLS and Spec-PS-DSWP]
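The throughput numbers in the comparison can be reproduced with a back-of-the-envelope cycle count. The model below is an assumption of this sketch, not something stated on the slides: every stage takes one cycle, DOACROSS pays the communication latency once per iteration because the cross-core hop is on the loop's critical path, and DSWP pays it once per stage boundary while the pipeline fills. The function names are hypothetical.

```c
#include <assert.h>

/* DOACROSS (simplified): the loop-carried value crosses cores on every
   iteration, so each iteration costs `latency` cycles. */
long doacross_cycles(long iterations, long latency)
{
    assert(iterations >= 0 && latency >= 1);
    return iterations * latency;
}

/* DSWP (simplified): communication flows one way down the pipeline, so
   latency only delays the fill; afterwards one iteration finishes per
   cycle regardless of latency. */
long dswp_cycles(long iterations, long latency, long stages)
{
    assert(iterations >= 0 && latency >= 1 && stages >= 1);
    if (iterations == 0)
        return 0;
    return (stages - 1) * latency + iterations;
}
```

With a latency of 2 and 1000 iterations, this model gives 2000 cycles for DOACROSS (0.5 iter/cycle) and 1004 cycles for a three-stage DSWP pipeline (close to 1 iter/cycle), matching the slide's comparison.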

Performance Relative to Best Sequential
- 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010]

Restoration of Trend
- Compiler Advances Double Computing Power Every 18 Years! (Proebsting's Law)
[Figure labels: Compiler Technology; Architecture/Devices; Era of DIY: Multicore, Reconfigurable, GPUs, Clusters]
- Compiler technology inspired class of architectures?
