CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner


1 CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner

2 Faster and Faster Processors. So much to do, so little time... How can we make computers that execute faster? Faster clock => more instructions/second (limited by technology constraints). Pipelining => faster clock. Execute more than one instruction per cycle (superscalar processor). Use multiple processors and divide the computation (multiprocessors, clustered computing, distributed computing). CPS 104 SS 2

3 Multiple Pipelines: Floating Point. [Figure: pipeline diagram. Labels: IF, ID/RF, EX, MEM, WB; stages M1 to M7; stages A1 to A4; an FP/INT divide unit that is not pipelined and takes 25 clocks.]

4 (CPI < 1): Superscalar Design. Pipelining can achieve CPI = 1 and a fast clock. Can we do better? Superscalar design: execute multiple instructions every clock.
Problems for superscalar design:
Need multiple execution units (pipelines).
Structural hazards: need multiple accesses to register files; might need multiple accesses to caches.
Data hazards: how to deal with data dependencies (keep program semantics)? What to do with stalled instructions?
Control hazards: what to do about conditional branches?
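A small worked example may make the CPI goal concrete. The sketch below uses the standard relation execution time = instruction count x CPI / clock rate. The instruction count, clock rate, and CPI values are made-up illustration numbers, not figures from the lecture, and `execution_time` is a hypothetical helper name.

```python
def execution_time(instruction_count, cpi, clock_hz):
    """Seconds = instructions * (cycles / instruction) / (cycles / second)."""
    return instruction_count * cpi / clock_hz

# Hypothetical workload: 1 billion instructions on a 500 MHz clock.
N = 1_000_000_000
CLOCK = 500_000_000

# Ideal pipelined machine: one instruction completes per cycle.
scalar = execution_time(N, cpi=1.0, clock_hz=CLOCK)
# Superscalar machine that averages two instructions per cycle (CPI = 0.5).
superscalar = execution_time(N, cpi=0.5, clock_hz=CLOCK)

print(scalar, superscalar, scalar / superscalar)
```

With the clock held fixed, halving CPI halves the run time; a real superscalar machine rarely sustains its peak issue rate because of the hazards listed above.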

5 Superscalar Design Solutions. Multiple pipelines are not a problem: we already had them in the regular pipeline design.
Structural hazards: build register files with many read and write ports (e.g., 7 read and 3 write ports). Build multi-port caches.
Data hazards: issue instructions in order and use a scoreboard to eliminate data hazards by stalling instructions. Better solution: issue instructions out of order, use register renaming to avoid data hazards, and graduate instructions in order.
Control hazards: use branch prediction and make sure that the branch is resolved before registers are modified. Or use speculative execution and roll back results if branches were predicted wrong.
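The register renaming idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the mechanism of any particular processor: the `rename` function, the three-operand tuple format, and the unlimited supply of physical registers are all hypothetical (real hardware uses a finite free list and recycles registers when instructions graduate).

```python
from itertools import count

def rename(instructions, num_arch_regs=8):
    """Give every register write a fresh physical register.

    instructions: list of (dest, src1, src2) architectural register names.
    Returns the same program expressed in physical register names.
    """
    fresh = count(num_arch_regs)  # next unused physical register number
    # Initially architectural register rK lives in physical register pK.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read whichever physical register currently holds the value,
        # so true (read-after-write) dependences are preserved.
        s1, s2 = mapping[src1], mapping[src2]
        # A new destination register removes WAR and WAW conflicts with
        # earlier instructions that used the same architectural name.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r5"),  # r4 = r1 op r5  (true dependence on r1)
    ("r1", "r6", "r7"),  # r1 = r6 op r7  (WAW with 1st, WAR with 2nd)
]
renamed = rename(program)
for ins in renamed:
    print(ins)
```

After renaming, the third instruction writes a different physical register than the first, so it could execute before the second instruction reads its operand without changing program semantics.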

6 The Alpha Superscalar. Can issue up to four instructions per clock cycle. Deep pipeline: 7-stage integer, 9-stage floating point, up to 13 stages for on-chip load/store. There are two integer and two floating-point pipelines. In-order issue, in-order execution. Uses a scoreboard to stall instructions with conflicts and to compute all register forwarding operations. The integer register file has 4 read ports and 2 write ports. The floating-point register file has 6 read ports and 3 write ports. Uses branch prediction to keep the pipe full.

7 The Alpha Superscalar Pipeline. [Figure: block diagram. Components shown: instruction cache, instruction TLB, refill buffer, next index logic, PC logic, instruction buffer, instruction slot logic, issue scoreboard logic, integer register file, integer pipelines 1 and 2, integer multiply, floating-point register file, floating-point add pipeline and divide (FP divider), floating-point multiply pipeline, FP store data and integer store paths, store and fill data, data cache, data TLB, and level-2 cache.]

8 Alpha Pipeline Stages
Common front end:
S0 Read instruction cache
S1 Buffer
S2 Instruction slot: steer to execution pipeline
S3 Determine instruction issue, fetch integer registers
Integer pipeline:
S4 First integer pipeline stage
S5 Second integer pipeline stage
S6 Write integer register file
Floating-point pipeline (S4 to S8):
S4 Read floating-point registers
S5 First floating-point pipeline stage
S8 Last floating-point stage, write register
Memory access pipeline:
S4 Calculate virtual address, begin data cache read
S5 End data cache read, translate to physical address
S6 Use data, write store to cache, start L2 tag access
S7 End L2 tag access
S8 Start L2 data access
S9 End L2 data access
S10 Begin data cache fill
S11 End data cache fill
S12 Use L2 data

9 MIPS R Superscalar. Issues 4 instructions at a time. Has 5 execution units: 2 FP units, 2 integer units, and a load/store unit. Out-of-order execution. Speculative execution: predicts up to four branches at a time.

10 MIPS CPU. [Figure: block diagram. Components shown: instruction predecode, 32-KB instruction cache, instruction decode, branch unit, active list (32 entries), free register list, register map table, busy bit table, FP queue, integer queue (16 entries), load-store queue (16 entries), 64x64 FP register file (5 read, 3 write ports), 64x64 integer register file (7 read, 3 write ports), FP adder (Align, Add/N, Pack), FP multiplier (Mult., Sum/N, Pack), Div. and Sqrt units, two integer ALUs, address calculation, TLB (64x2 entries), and 32-KB data cache.]

11 MIPS Pipelines
Stage headings: I-Fetch, I-Decode, I-Issue; Stage 1 through Stage 7.
Instruction fetch and decode pipeline (I-Cache, Branch-Add; Fetch, Dec, Map, Write): fills the queues with 4 instructions in parallel. Up to 4 branch instructions are predicted. Speculative fetching.
Floating point (latency = 2): issue, RF, Alignment, Add, Pack, WB; or issue, RF, Multiply, Sum Prod., Pack, WB.
Load/store (latency = 2): issue, RF, Acalc, D-Cache, Load WB (integer or FP).
Integer (latency = 1): issue, RF, ALU 1, WB; or issue, RF, ALU 2, WB.

12 What is Parallel Computer Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast. How large a collection? How powerful are the elements? How does it scale up? How do they cooperate and communicate? How is data transmitted between processors? What are the primitive abstractions? How does it all translate to performance?

13 Parallel Computation: Why and Why Not?
Pros: performance; cost-effectiveness (commodity parts); smooth upgrade path; fault tolerance.
Cons: difficult to parallelize applications; requires automatic parallelization or parallel program development. Software! AAHHHH!

14 Applications: Science and Engineering. Examples: weather prediction, evolution of galaxies, oil reservoir simulation, automobile crash tests, drug development, VLSI CAD, nuclear BOMBS! These typically model physical systems or phenomena. Problems are 2D or 3D. Usually requires number crunching. Involves true parallelism.

15 Applications: Commercial. Examples: transaction processing, databases, financial models. Involves data movement, not much number crunching. Involves throughput parallelism.

16 Applications: Multimedia/Home. Examples: speech recognition, data compression/decompression, 3D graphics. Will become ubiquitous. Involves everything (crunching, data movement, true parallelism, and throughput parallelism).

17 Flynn Taxonomy
SISD (Single Instruction, Single Data): standard sequential machines.
SIMD (Single Instruction, Multiple Data): early vector computers (CRAY 1, CDC Star); on a single chip today for multimedia (decompression); special applications (graphics, image processing, cryptography).
MIMD (Multiple Instruction, Multiple Data): most of today's parallel machines.

18 Message Passing Architectures. [Figure: four nodes (Node 0 to Node 3), each with processors (labeled P 0,N-1), memory (Mem), and a communication assist (CA), joined by an interconnect.] A node cannot directly access memory on another node. Examples: IBM SP-2, Intel Paragon, cluster of workstations.

19 Message Passing Programming Model. [Figure: process P executes "Send x, Q, t" from address x in its local process address space; process Q executes "Recv y, P, t" into address y in its local process address space; the send and the receive are matched.] User-level send/receive abstraction: names a process-local buffer (x, y), a process (Q, P), and a tag (t). Provides naming and synchronization.
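The send/receive matching can be imitated with threads and queues. This is only a sketch of the abstraction, assuming Python threads stand in for processes P and Q; the `Mailboxes` class and its `send`/`recv` methods are hypothetical names, not a real message-passing library.

```python
import queue
import threading

class Mailboxes:
    """One queue per (source, destination, tag) triple (hypothetical helper)."""

    def __init__(self):
        self._boxes = {}
        self._lock = threading.Lock()

    def _box(self, src, dst, tag):
        with self._lock:
            return self._boxes.setdefault((src, dst, tag), queue.Queue())

    def send(self, data, src, dst, tag):
        # Copy out of the sender's local buffer: no memory is shared.
        self._box(src, dst, tag).put(list(data))

    def recv(self, src, dst, tag):
        # Blocks until a send with the same (src, dst, tag) arrives, so the
        # match provides both naming and synchronization.
        return self._box(src, dst, tag).get()

net = Mailboxes()
result = {}

def process_p():
    x = [1, 2, 3]  # buffer x in P's local address space
    net.send(x, src="P", dst="Q", tag=7)  # Send x, Q, t

def process_q():
    y = net.recv(src="P", dst="Q", tag=7)  # Recv y, P, t
    result["y"] = y

tq = threading.Thread(target=process_q)
tp = threading.Thread(target=process_p)
tq.start()
tp.start()
tq.join()
tp.join()
print(result["y"])
```

The receiver is started first on purpose: it simply waits until the matching send arrives, which is the synchronization the slide refers to.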

20 Single Shared Address Space. [Figure: the virtual address spaces of processes P0 to Pn, each with a shared portion and a private portion. Stores and loads to the shared portions map to common physical addresses in the machine physical address space; the private portions (P0 private, P1 private, P2 private, ..., Pn private) map to separate physical regions.] Communication, sharing, and synchronization use store/load on shared variables. Must map virtual pages to physical page frames. Consider OS support for good mapping.
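As a rough analogy, threads within one program already communicate through ordinary loads and stores on shared variables. The sketch below assumes Python threads stand in for the processes on the slide; the `shared` dictionary, the `worker` function, and the thread and increment counts are illustrative choices, not part of the lecture.

```python
import threading

shared = {"counter": 0}  # stands in for the shared portion of the address space
lock = threading.Lock()  # synchronization on the shared variable

def worker(n_increments):
    local_total = 0  # private to this thread, like the private portion
    for _ in range(n_increments):
        local_total += 1
    with lock:
        # Load, add, and store on the shared variable, done as one unit.
        shared["counter"] += local_total

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])
```

No explicit send or receive appears: the only communication is the store to `shared["counter"]`, and the lock supplies the synchronization that message matching provided in the message-passing model.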

21 Small Scale Shared Memory Multiprocessors. [Figure: processors P0 to PN-1, each with cache(s) and a TLB, connected to one main memory.] A small number of processors is connected to one shared memory. Memory is equidistant from all processors (UMA). The kernel can run on any processor (symmetric MP).

22 Cache Coherence Problem (Initial State). [Figure: processors P1 and P2 connected by a bus to main memory, which holds x; the time axis shows no operations yet.]

23 Cache Coherence Problem (Step 1). [Figure: processors P1 and P2 connected by a bus to main memory, which holds x. Operations so far, in time order: ld r2, x.]

24 Cache Coherence Problem (Step 2). [Figure: processors P1 and P2 connected by a bus to main memory, which holds x. Operations so far, in time order: ld r2, x; ld r2, x.]

25 Cache Coherence Problem (Step 3). [Figure: processors P1 and P2 connected by an interconnection network to main memory, which holds x. Operations so far, in time order: ld r2, x; ld r2, x; add r1, r2, r4; st x, r1.]
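The three steps can be replayed in a toy model to show where the stale value comes from. Everything beyond the instruction sequence is an assumption made for illustration: the `Cache` class, the write-back policy, the initial value of x (10), the value of r4 (5), and the choice that the processor which loaded first is the one that stores.

```python
class Cache:
    """Private write-back cache with no coherence mechanism (deliberately broken)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:  # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]  # hit: value may be stale

    def store(self, addr, value):
        self.lines[addr] = value  # write-back: main memory is not updated yet

memory = {"x": 10}  # assumed initial value of x
c1, c2 = Cache(memory), Cache(memory)

r2_a = c1.load("x")  # step 1: one processor executes ld r2, x
r2_b = c2.load("x")  # step 2: the other processor executes ld r2, x
r4 = 5  # assumed register value
c1.store("x", r2_a + r4)  # step 3: add r1, r2, r4 ; st x, r1

print(c1.load("x"), c2.load("x"), memory["x"])
```

After step 3 the storing processor sees the new value, while the other cache and main memory still hold the old one: three locations, two different values for x.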

26 Snoopy Cache-Coherence Protocols. The bus provides a serialization point for consistency. Each cache controller snoops all bus transactions. A transaction is relevant if it is for a block the cache contains; the controller then takes action to ensure coherence (invalidate, update, or supply the value). The action depends on the state of the block and the protocol. The independent controllers operate simultaneously.
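One of the simplest snooping schemes, write-through with invalidation, can be sketched as follows. It is chosen here for brevity and is not necessarily the protocol the lecture had in mind; the `Bus` and `SnoopyCache` classes and the values used are hypothetical.

```python
class Bus:
    """Shared bus: every write is seen by all attached cache controllers."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, source, addr, value):
        self.memory[addr] = value  # write-through to main memory
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(addr)  # each other controller snoops it

class SnoopyCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:  # miss: read the block from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)  # the transaction appears on the bus

    def snoop_write(self, addr):
        # Relevant only if this cache holds the block; the action is invalidate.
        self.lines.pop(addr, None)

memory = {"x": 10}
bus = Bus(memory)
c1, c2 = SnoopyCache(bus), SnoopyCache(bus)
c1.load("x")
c2.load("x")
c1.store("x", 15)
print(c2.load("x"))
```

Because the second cache drops its copy when it snoops the write, its next load misses and fetches the up-to-date value from memory.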

27 Large Scale Shared Memory Multiprocessors. 100s to 1000s of nodes (processors) with a single shared physical address space. Uses a general-purpose interconnection network. Still has a cache coherence protocol, but uses messages instead of bus transactions. No hardware broadcast. [Figure: several nodes, each with a processor, memory (Mem), and a controller/network interface (Cntrl/NI) acting as the communication assist, attached to an interconnect.]

28 Directory Based Cache Coherence. Avoid broadcasting a request to all nodes on a miss, which costs traffic and time. Maintain a directory of which nodes have cached copies of the block (directory controller + directory state). On a miss, send a message to the directory. The directory determines what protocol action, if any, is required (e.g., invalidation). The directory waits for the protocol actions to finish and then responds to the original request.
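The directory bookkeeping can be sketched with a sharer set per block. This is a simplified model under stated assumptions: a single directory, write-through memory, every store sent to the directory, and invalidations that complete instantly. The `Directory` and `Node` classes and the message log are hypothetical names for illustration.

```python
class Directory:
    """Tracks which nodes hold a cached copy of each block (the sharer set)."""

    def __init__(self, memory):
        self.memory = memory
        self.sharers = {}  # block address -> set of node ids
        self.messages = []  # log of point-to-point messages sent

    def read_miss(self, node, addr):
        self.sharers.setdefault(addr, set()).add(node)
        return self.memory[addr]  # respond to the original request

    def write_miss(self, node, addr, value, nodes):
        # Invalidate only the nodes recorded as sharers, not every node.
        for other in sorted(self.sharers.get(addr, set()) - {node}):
            self.messages.append(("invalidate", other, addr))
            nodes[other].lines.pop(addr, None)
        # Only after the protocol actions finish is the request completed.
        self.sharers[addr] = {node}
        self.memory[addr] = value

class Node:
    def __init__(self, node_id, directory):
        self.id, self.directory, self.lines = node_id, directory, {}

    def load(self, addr):
        if addr not in self.lines:  # miss: send a message to the directory
            self.lines[addr] = self.directory.read_miss(self.id, addr)
        return self.lines[addr]

    def store(self, addr, value, nodes):
        self.directory.write_miss(self.id, addr, value, nodes)
        self.lines[addr] = value

directory = Directory({"x": 10})
nodes = {i: Node(i, directory) for i in range(4)}
nodes[0].load("x")
nodes[2].load("x")
nodes[0].store("x", 15, nodes)
print(directory.messages, nodes[2].load("x"))
```

With four nodes and two sharers, the store generates a single invalidation message rather than a broadcast to all nodes, which is the traffic saving the slide describes.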

29 Today's Parallel Computer Architecture. Extension of traditional computer architecture to support communication and cooperation. [Figure: layered diagram of the communications architecture. Programming models: multiprogramming, shared memory, message passing, data parallel. Communication abstraction at the user level/system level boundary. Library and compiler; operating system support. Hardware/software boundary. Communication hardware. Physical communication medium.]

30 Toward a Generic Parallel Machine. [Figure: Node 0 to Node 3, each with a processor, memory (Mem), and a communication assist (CA), connected by an interconnect.] Separation of programming models from architectures. All models require communication. A node has processor(s), memory, and a communication assist.
