Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Size: px
Start display at page:

Download "Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018"

Transcription

1 Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig April 2018

2 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed Microarchitectures Pipeliig Issues i Pipeliig: Cotrol & Data Depedece Hadlig, State Maiteace ad Recovery, Out-of-Order Executio Other Executio Paradigms 2

3 Remider: Optioal Homeworks Posted olie 3 Optioal Homeworks Optioal Good for your learig =homeworks 3

4 Readigs for Today Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proceedigs of the IEEE, 1995 More advaced pipeliig Iterrupt ad exceptio hadlig Out-of-order ad superscalar executio cocepts H&H Chapters 7.8 ad 7.9 Optioal: Kessler, The Alpha Microprocessor, IEEE Micro

5 Lecture Aoucemet Moday, April 30, :15-17:15 CAB G 61 Apéro after the lecture J Prof. We-Mei Hwu (Uiversity of Illiois at Urbaa-Champaig) D-INFK Distiguished Collouium Iovative Applicatios ad Techology Pivots A Perfect Storm i Computig 5

6 Geeral Suggestio Atted the Computer Sciece Distiguished Collouia Happes o some Modays 16:15-17:15, followed by a apero Great way of learig about key developmets i the field Ad, meetig leadig researchers 6

7 Out-of-Order Executio (Dyamic Istructio Schedulig)

8 Review: Dataflow Graph for Our Example MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 8

9 Review: State of RAT ad RS i Cycle 7 9

10 Review: Correspodig Dataflow Graph 10

11 Some More Questios (Desig Choices) Whe is a reservatio statio etry deallocated? Exactly whe does a istructio broadcast its tag? Should the reservatio statios be dedicated to each fuctioal uit or global across fuctioal uits? Cetralized vs. Distributed: What are the tradeoffs? Should reservatio statios ad ROB store data values or should there be a cetralized physical register file where all data values are stored? What are the tradeoffs? May other desig choices for OoO egies 11

12 For You: A Exercise, w/ Precise Exceptios MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 F D E R W Assume ADD (4 cycle execute), MUL (6 cycle execute) Assume oe adder ad oe multiplier How may cycles i a o-pipelied machie i a i-order-dispatch pipelied machie with reorder buffer (o forwardig ad full forwardig) i a out-of-order dispatch pipelied machie with reorder buffer (full forwardig) 12

13 Out-of-Order Executio with Precise Exceptios Idea: Use a reorder buffer to reorder istructios before committig them to architectural state A istructio updates the RAT whe it completes executio Also called froted register file A istructio updates a separate architectural register file whe it retires i.e., whe it is the oldest i the machie ad has completed executio I other words, the architectural register file is always updated i program order O a exceptio: flush pipelie, copy architectural register file ito froted register file 13

14 Out-of-Order Executio with Precise Exceptios TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) 14

15 Two Humps i a Moder Pipelie TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) Photo credit: 15

16 Moder OoO Executio w/ Precise Exceptios Most moder processors use the followig Reorder buffer to support i-order retiremet of istructios A sigle register file to store all registers Both speculative ad architectural registers INT ad FP are still separate Two register maps Future/froted register map à used for reamig Architectural register map à used for maitaiig precise state 16

17 A Example from Moder Processors Boggs et al., The Microarchitecture of the Petium 4 Processor, Itel Techology Joural,

18 Eablig OoO Executio, Revisited 1. Lik the cosumer of a value to the producer Register reamig: Associate a tag with each data value 2. Buffer istructios util they are ready Isert istructio ito reservatio statios after reamig 3. Keep track of readiess of source values of a istructio Broadcast the tag whe the value is produced Istructios compare their source tags to the broadcast tag à if match, source value becomes ready 4. Whe all source values of a istructio are ready, dispatch the istructio to fuctioal uit (FU) Wakeup ad select/schedule the istructio 18

19 Summary of OOO Executio Cocepts Register reamig elimiates false depedecies, eables likig of producer to cosumers Bufferig eables the pipelie to move for idepedet ops Tag broadcast eables commuicatio (of readiess of produced value) betwee istructios Wakeup ad select eables out-of-order dispatch 19

20 OOO Executio: Restricted Dataflow A out-of-order egie dyamically builds the dataflow graph of a piece of the program which piece? The dataflow graph is limited to the istructio widow Istructio widow: all decoded but ot yet retired istructios Ca we do it for the whole program? Why would we like to? I other words, how ca we have a large istructio widow? Ca we do it efficietly with Tomasulo s algorithm? 20

21 Recall: Dataflow Graph for Our Example MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 21

22 Recall: State of RAT ad RS i Cycle 7 22

23 Recall: Dataflow Graph 23

24 Questios to Poder Why is OoO executio beeficial? What if all operatios take a sigle cycle? Latecy tolerace: OoO executio tolerates the latecy of multi-cycle operatios by executig idepedet operatios cocurretly What if a istructio takes 500 cycles? How large of a istructio widow do we eed to cotiue decodig? How may cycles of latecy ca OoO tolerate? What limits the latecy tolerace scalability of Tomasulo s algorithm? Active/istructio widow size: determied by both schedulig widow ad reorder buffer size 24

25 Geeral Orgaizatio of a OOO Processor Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proc. IEEE, Dec

26 A Moder OoO Desig: Itel Petium 4 26 Boggs et al., The Microarchitecture of the Petium 4 Processor, Itel Techology Joural, 2001.

27 Itel Petium 4 Simplified Mutlu+, Ruahead Executio, HPCA

28 Alpha Kessler, The Alpha Microprocessor, IEEE Micro, March-April

29 MIPS R10000 Yeager, The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April

30 IBM POWER4 Tedler et al., POWER4 system microarchitecture, IBM J R&D,

31 IBM POWER4 2 cores, out-of-order executio 100-etry istructio widow i each core 8-wide istructio fetch, issue, execute Large, local+global hybrid brach predictor 1.5MB, 8-way L2 cache Aggressive stream based prefetchig 31

32 IBM POWER5 Kalla et al., IBM Power5 Chip: A Dual-Core Multithreaded Processor, IEEE Micro

33 Hadlig Out-of-Order Executio of Loads ad Stores

34 Registers versus Memory So far, we cosidered maily registers as part of state What about memory? What are the fudametal differeces betwee registers ad memory? Register depedeces kow statically memory depedeces determied dyamically Register state is small memory state is large Register state is ot visible to other threads/processors memory state is shared betwee threads/processors (i a shared memory multiprocessor) 34

35 Memory Depedece Hadlig (I) Need to obey memory depedeces i a out-of-order machie ad eed to do so while providig high performace Observatio ad Problem: Memory address is ot kow util a load/store executes Corollary 1: Reamig memory addresses is difficult Corollary 2: Determiig depedece or idepedece of loads/stores eed to be hadled after their (partial) executio Corollary 3: Whe a load/store has its address ready, there may be youger/older loads/stores with udetermied addresses i the machie 35

36 Memory Depedece Hadlig (II) Whe do you schedule a load istructio i a OOO egie? Problem: A youger load ca have its address ready before a older store s address is kow Kow as the memory disambiguatio problem or the ukow address problem Approaches Coservative: Stall the load util all previous stores have computed their addresses (or eve retired from the machie) Aggressive: Assume load is idepedet of ukow-address stores ad schedule the load right away Itelliget: Predict (with a more sophisticated predictor) if the load is depedet o the/ay ukow address store 36

37 Hadlig of Store-Load Depedeces A load s depedece status is ot kow util all previous store addresses are available. How does the OOO egie detect depedece of a load istructio o a previous store? Optio 1: Wait util all previous stores committed (o eed to check for address match) Optio 2: Keep a list of pedig stores i a store buffer ad check whether load address matches a previous store address How does the OOO egie treat the schedulig of a load istructio wrt previous stores? Optio 1: Assume load depedet o all previous stores Optio 2: Assume load idepedet of all previous stores Optio 3: Predict the depedece of a load o a outstadig store 37

38 Memory Disambiguatio (I) Optio 1: Assume load is depedet o all previous stores + No eed for recovery -- Too coservative: delays idepedet loads uecessarily Optio 2: Assume load is idepedet of all previous stores + Simple ad ca be commo case: o delay for idepedet loads -- Reuires recovery ad re-executio of load ad depedets o mispredictio Optio 3: Predict the depedece of a load o a outstadig store + More accurate. Load store depedecies persist over time -- Still reuires recovery/re-executio o mispredictio Alpha : Iitially assume load idepedet, delay loads foud to be depedet Moshovos et al., Dyamic speculatio ad sychroizatio of data depedeces, ISCA Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA

39 Memory Disambiguatio (II) Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA Predictig store-load depedecies importat for performace Simple predictors (based o past history) ca achieve most of the potetial performace 39

40 Data Forwardig Betwee Stores ad Loads We caot update memory out of program order à Need to buffer all store ad load istructios i istructio widow Eve if we kow all addresses of past stores whe we geerate the address of a load, two uestios still remai: 1. How do we check whether or ot it is depedet o a store 2. How do we forward data to the load if it is depedet o a store Moder processors use a LQ (load ueue) ad a SQ for this Ca be combied or separate betwee loads ad stores A load searches the SQ after it computes its address. Why? A store searches the LQ after it computes its address. Why? 40

41 Out-of-Order Completio of Memory Ops Whe a store istructio fiishes executio, it writes its address ad data i its reorder buffer etry Whe a later load istructio geerates its address, it: searches the reorder buffer (or the SQ) with its address accesses memory with its address receives the value from the yougest older istructio that wrote to that address (either from ROB or memory) This is a complicated search logic implemeted as a Cotet Addressable Memory Cotet is memory address (but also eed size ad age) Called store-to-load forwardig logic 41

42 Store-Load Forwardig Complexity Cotet Addressable Search (based o Load Address) Rage Search (based o Address ad Size of both the Load ad earlier Stores) Age-Based Search (for last writte values) Load data ca come from a combiatio of multiple places Oe or more stores i the Store Buffer (SQ) Memory/cache 42

43 Other Approaches to Cocurrecy (or Istructio Level Parallelism)

44 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 44

45 Review: Data Flow: Exploitig Irregular Parallelism

46 Data Flow Summary Availability of data determies order of executio A data flow ode fires whe its sources are ready Programs represeted as data flow graphs (of odes) Data Flow at the ISA level has ot bee (as) successful Data Flow implemetatios uder the hood (while preservig seuetial ISA sematics) have bee very successful Out of order executio is the prime example 46

47 Pure Data Flow Advatages/Disadvatages Advatages Very good at exploitig irregular parallelism Oly real depedecies costrai processig More parallelism ca be exposed tha vo Neuma model Disadvatages No precise state sematics Debuggig very difficult Iterrupt/exceptio hadlig is difficult (what is precise state sematics?) Too much parallelism? (Parallelism cotrol eeded) High bookkeepig overhead (tag matchig, data storage) 47

48 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 48

49 Superscalar Executio

50 Superscalar Executio Idea: Fetch, decode, execute, retire multiple istructios per cycle N-wide superscalar à N istructios per cycle Need to add the hardware resources for doig so Hardware performs the depedece checkig betwee cocurretly-fetched istructios Superscalar executio ad out-of-order executio are orthogoal cocepts Ca have all four combiatios of processors: [i-order, out-of-order] x [scalar, superscalar] 50

51 Caregie Mello I-Order Superscalar Processor Example Multiple copies of datapath: Ca issue multiple istructios at per cycle Depedecies make it tricky to issue multiple istructios at oce CLK CLK CLK CLK CLK PC A RD Istructio Memory A1 A2 A3 A4 A5 A6 WD3 WD6 Register File RD1 RD4 RD2 RD5 ALUs A1 RD1 A2 RD2 Data Memory WD1 WD2 Here: Ideal IPC = 2 51

52 Caregie Mello I-Order Superscalar Performace Example lw $t0, 40($s0) add $t1, $s1, $s2 sub $t2, $s1, $s3 ad $t3, $s3, $s4 or $t4, $s1, $s5 sw $s5, 80($s0) Ideal IPC = Time (cycles) lw $t0, 40($s0) add $t1, $s1, $s2 IM lw add RF $s0 40 $s1 $s2 + + DM $t0 $t1 RF sub $t2, $s1, $s3 ad $t3, $s3, $s4 IM $s1 sub $t2 $s3 - RF DM $s3 ad $t3 $s4 & RF or $t4, $s1, $s5 sw $s5, 80($s0) IM or sw RF $s1 $s5 $s DM $s5 $t4 RF Actual IPC = 2 (6 istructios issued i 3 cycles) 52

53 Caregie Mello Superscalar Performace with Depedecies lw $t0, 40($s0) add $t1, $t0, $s1 sub $t0, $s2, $s3 ad $t2, $s4, $t0 or $t3, $s5, $s6 sw $s7, 80($t3) Ideal IPC = Time (cycles) lw $t0, 40($s0) IM lw RF $s DM $t0 RF add $t1, $t0, $s1 sub $t0, $s2, $s3 IM add sub RF $t0 $s1 $s2 $s3 RF $t0 $s1 $s2 $s3 + - DM $t1 $t0 RF ad $t2, $s4, $t0 or $t3, $s5, $s6 Stall ad IM or IM ad or RF $s4 $t0 $s5 $s6 & DM $t2 $t3 RF sw $s7, 80($t3) IM sw RF $t $s7 DM RF Actual IPC = 1.2 (6 istructios issued i 5 cycles) 53

54 Superscalar Tradeoffs Advatages Higher IPC (istructios per cycle) Disadvatages Higher complexity for depedecy checkig Reuire checkig withi a pipelie stage Reamig becomes more complex i a OoO processor More hardware resources eeded 54

55 Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig April 2018

56 We did ot cover the followig slides i lecture. These are for your preparatio for the ext lecture.

57 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 57

58 VLIW

59 VLIW Cocept Superscalar Hardware fetches multiple istructios ad checks depedecies betwee them VLIW (Very Log Istructio Word) Software (compiler) packs idepedet istructios i a larger istructio budle to be fetched ad executed cocurretly Hardware fetches ad executes the istructios i the budle cocurretly No eed for hardware depedecy checkig betwee cocurretly-fetched istructios i the VLIW model 59

60 VLIW Cocept Fisher, Very Log Istructio Word architectures ad the ELI-512, ISCA ELI: Eormously logword istructios (512 bits) 60

61 VLIW (Very Log Istructio Word) A very log istructio word cosists of multiple idepedet istructios packed together by the compiler Packed istructios ca be logically urelated (cotrast with SIMD/vector processors, which we will see soo) Idea: Compiler fids idepedet istructios ad statically schedules (i.e. packs/budles) them ito a sigle VLIW istructio Traditioal Characteristics Multiple fuctioal uits All istructios i a budle are executed i lock step Istructios i a budle statically aliged to be directly fed ito the fuctioal uits 61

62 Caregie Mello VLIW Performace Example (2-wide budles) lw $t0, 40($s0) add $t1, $s1, $s2 sub $t2, $s1, $s3 ad $t3, $s3, $s4 or $t4, $s1, $s5 sw $s5, 80($s0) Ideal IPC = Time (cycles) lw $t0, 40($s0) add $t1, $s1, $s2 IM lw add RF $s0 40 $s1 $s2 + + DM $t0 $t1 RF sub $t2, $s1, $s3 ad $t3, $s3, $s4 IM $s1 sub $t2 $s3 - RF DM $s3 ad $t3 $s4 & RF or $t4, $s1, $s5 sw $s5, 80($s0) IM or sw RF $s1 $s5 $s DM $s5 $t4 RF Actual IPC = 2 (6 istructios issued i 3 cycles) 62

63 VLIW Lock-Step Executio Lock-step (all or oe) executio: If ay operatio i a VLIW istructio stalls, all istructios stall I a truly VLIW machie, the compiler hadles all depedecy-related stalls, hardware does ot perform depedecy checkig What about variable latecy operatios? 63

64 VLIW Philosophy Philosophy similar to RISC (simple istructios ad hardware) Except multiple istructios i parallel RISC (Joh Cocke, 1970s, IBM 801 miicomputer) Compiler does the hard work to traslate high-level laguage code to simple istructios (Joh Cocke: cotrol sigals) Ad, to reorder simple istructios for high performace Hardware does little traslatio/decodig à very simple VLIW (Josh Fisher, ISCA 1983) Compiler does the hard work to fid istructio level parallelism Hardware stays as simple ad streamlied as possible Executes each istructio i a budle i lock step Simple à higher freuecy, easier to desig 64

65 Commercial VLIW Machies Multiflow TRACE, Josh Fisher (7-wide, 28-wide) Cydrome Cydra 5, Bob Rau Trasmeta Crusoe: x86 biary-traslated ito iteral VLIW TI C6000, Trimedia, STMicro (DSP & embedded processors) Most successful commercially Itel IA-64 Not fully VLIW, but based o VLIW priciples EPIC (Explicitly Parallel Istructio Computig) Istructio budles ca have depedet istructios A few bits i the istructio format specify explicitly which istructios i the budle are depedet o which other oes 65

66 VLIW Tradeoffs Advatages + No eed for dyamic schedulig hardware à simple hardware + No eed for depedecy checkig withi a VLIW istructio à simple hardware for multiple istructio issue + o reamig + No eed for istructio aligmet/distributio after fetch to differet fuctioal uits à simple hardware Disadvatages -- Compiler eeds to fid N idepedet operatios per cycle -- If it caot, iserts NOPs i a VLIW istructio -- Parallelism loss AND code size icrease -- Recompilatio reuired whe executio width (N), istructio latecies, fuctioal uits chage (Ulike superscalar processig) -- Lockstep executio causes idepedet operatios to stall -- No istructio ca progress util the logest-latecy istructio completes 66

67 VLIW Summary VLIW simplifies hardware, but reuires complex compiler techiues Solely-compiler approach of VLIW has several dowsides that reduce performace -- Too may NOPs (ot eough parallelism discovered) -- Static schedule itimately tied to microarchitecture -- Code optimized for oe geeratio performs poorly for ext -- No tolerace for variable or log-latecy operatios (lock step) ++ Most compiler optimizatios developed for VLIW employed i optimizig compilers (for superscalar compilatio) Eable code optimizatios ++ VLIW successful whe parallelism is easier to fid by the compiler (traditioally embedded markets, DSPs) 67

68 A Example Work: Superblock Hwu et al., The superblock: A effective techiue for VLIW ad superscalar compilatio. The Joural of Supercomputig, Lecture Video o Static Istructio Schedulig 68

69 Aother Example Work: IMPACT Chag et al., IMPACT: a architectural framework for multiple-istructio-issue processors. ISCA

70 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 70

71 Recall: How to Hadle Data Depedeces Ati ad output depedeces are easier to hadle write to the destiatio i oe stage ad i program order Flow depedeces are more iterestig Five fudametal ways of hadlig flow depedeces Detect ad wait util value is available i register file Detect ad forward/bypass data to depedet istructio Detect ad elimiate the depedece at the software level No eed for the hardware to detect depedece Predict the eeded value(s), execute speculatively, ad verify Do somethig else (fie-graied multithreadig) No eed to detect 71

72 Fie-Graied Multithreadig 72

73 Fie-Graied Multithreadig Idea: Hardware has multiple thread cotexts (PC+registers). Each cycle, fetch egie fetches from a differet thread. By the time the fetched brach/istructio resolves, o istructio is fetched from the same thread Brach/istructio resolutio latecy overlapped with executio of other threads istructios + No logic eeded for hadlig cotrol ad data depedeces withi a thread -- Sigle thread performace suffers -- Extra logic for keepig thread cotexts -- Does ot overlap latecy if ot eough threads to cover the whole pipelie 73

74 Fie-Graied Multithreadig (II) Idea: Switch to aother thread every cycle such that o two istructios from a thread are i the pipelie cocurretly Tolerates the cotrol ad data depedecy latecies by overlappig the latecy with useful work from other threads Improves pipelie utilizatio by takig advatage of multiple threads Thorto, Parallel Operatio i the Cotrol Data 6600, AFIPS Smith, A pipelied, shared resource MIMD computer, ICPP

75 Fie-Graied Multithreadig: History CDC 6600 s peripheral processig uit is fie-graied multithreaded Thorto, Parallel Operatio i the Cotrol Data 6600, AFIPS Processor executes a differet I/O thread every cycle A operatio from the same thread is executed every 10 cycles Deelcor HEP (Heterogeeous Elemet Processor) Smith, A pipelied, shared resource MIMD computer, ICPP threads/processor available ueue vs. uavailable (waitig) ueue for threads each thread ca have oly 1 istructio i the processor pipelie; each thread idepedet to each thread, processor looks like a o-pipelied machie system throughput vs. sigle thread performace tradeoff 75

76 Fie-Graied Multithreadig i HEP Cycle time: 100s 8 stages à 800 s to complete a istructio assumig o memory access No cotrol ad data depedecy checkig Burto Smith ( ) 76

77 Multithreaded Pipelie Example Slide credit: Joel Emer 77

78 Su Niagara Multithreaded Pipelie Kogetira et al., Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro

79 Fie-graied Multithreadig Advatages + No eed for depedecy checkig betwee istructios (oly oe istructio i pipelie from a sigle thread) + No eed for brach predictio logic + Otherwise-bubble cycles used for executig useful istructios from differet threads + Improved system throughput, latecy tolerace, utilizatio Disadvatages - Extra hardware complexity: multiple hardware cotexts (PCs, register files, ), thread selectio logic - Reduced sigle thread performace (oe istructio fetched every N cycles from the same thread) - Resource cotetio betwee threads i caches ad memory - Some depedecy checkig logic betwee threads remais (load/store) 79

80 Moder GPUs are FGMT Machies 80

81 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) = data-parallel (SIMD) fuc. uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 81

82 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp): they execute the same istructio o differet data Up to 32 warps are iterleaved i a FGMT maer Up to 1024 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 82

83 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 30 cores o the GTX 285: 30,720 threads Slide credit: Kayvo Fatahalia 83

84 Ed of Fie-Graied Multithreadig 84

85 I Memory of Burto Smith Burto Smith ( ) 85

86 I Memory of Burto Smith (II) 86

87 Wedesday Keyote (HiPEAC 2015) Resource Maagemet i PACORA Burto J. Smith Microsoft

88 Techical Fellow at Microsoft Burto Smith Past: Co-fouder, chief scietist, chairma of Tera/Cray, Deelcor, Professor at Colorado Eckert-Mauchly Award i 1991, Seymour Cray Award, US Natioal Academy of Egieerig, AAAS/ACM/IEEE Fellow ad may other hoors May wide-rage cotributios spaig architecture, system software, compilers,, icludig: Deelcor HEP, Tera MTA fie-graied sychroizatio, commuicatio, multithreadig parallel architectures, resource maagemet, itercoectio etworks Oe I would like to share: Smith, A pipelied, shared resource MIMD computer, ICPP 1978.

89 Fie-graied Multithreadig i HEP 128 processes (hardware cotexts) Cycle time: 100s 8 stages à 800 s to complete a istructio assumig o memory access No cotrol ad data depedecy checkig 89

90 Wedesday Keyote (HiPEAC 2015) Resource Maagemet i PACORA Burto J. Smith Microsoft

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 16: Out-of-Order Executio Prof. Our Mutlu ETH Zurich Sprig 2018 26 April 2018 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed

More information

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device

More information

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Out-of-Order Execution II Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15 Video

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors CS252 Sprig 2017 Graduate Computer Architecture Lecture 6: Out-of-Order Processors Lisa Wu, Krste Asaovic http://ist.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 2 WU UCB CS252 SP17 Last Time i Lecture

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 4: Pipeliig Prof. Our Mutlu ETH Zurich Sprig 28 9 April 28 Ageda for Today & Next Few Lectures Previous lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Advaced Issues Review: Pipelie Hazards Structural hazards Desig pipelie to elimiate structural hazards.

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 5: Pipeliig Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab1 Due toight Lab2: out later today; due 2 weeks from ow Review sessio this Friday Turig award

More information

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Desig of Digital Circuits Lecture 21: SIMD Processors II ad Graphics Processig Uits Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 17 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018

More information

This Unit: Dynamic Scheduling. Can Hardware Overcome These Limits? Scheduling: Compiler or Hardware. The Problem With In-Order Pipelines

This Unit: Dynamic Scheduling. Can Hardware Overcome These Limits? Scheduling: Compiler or Hardware. The Problem With In-Order Pipelines This Uit: Damic Schedulig CSE 560 Computer Sstems Architecture Damic Schedulig Slides origiall developed b Drew Hilto (IBM) ad Milo Marti (Uiversit of Peslvaia) App App App Sstem software Mem CPU I/O Code

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017 Computer Architecture Lecture 8: SIMD Processors ad GPUs Prof. Our Mutlu ETH Zürich Fall 2017 18 October 2017 Ageda for Today & Next Few Lectures SIMD Processors GPUs Itroductio to GPU Programmig Digitaltechik

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access

More information

Design of Digital Circuits Lecture 19: Approaches to Concurrency. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 19: Approaches to Concurrency. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 19: Approaches to Concurrency Prof. Onur Mutlu ETH Zurich Spring 2017 5 May 2017 Agenda for Today & Next Few Lectures! Single-cycle Microarchitectures! Multi-cycle and

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig 2018 11 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018 2 credit uits Rigorous semiar o fudametal ad

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 Reviews Due next Monday Mutlu et al., Runahead Execution: An Alternative

More information

Uniprocessors. HPC Prof. Robert van Engelen

Uniprocessors. HPC Prof. Robert van Engelen Uiprocessors HPC Prof. Robert va Egele Overview PART I: Uiprocessors PART II: Multiprocessors ad ad Compiler Optimizatios Parallel Programmig Models Uiprocessors Multiprocessors Processor architectures

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 2: ISA Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Admiistrative Stuff Lab1 out toight Due Thursday (10/18) Lab1 review sessio Tomorrow, 10/05,

More information

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings Operatig Systems: Iterals ad Desig Priciples Chapter 4 Threads Nith Editio By William Stalligs Processes ad Threads Resource Owership Process icludes a virtual address space to hold the process image The

More information

CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Multi-Core Prof. Yajig Li Uiversity of Chicago Course Evaluatio Very importat Please fill out! 2 Lab3 Brach Predictio Competitio 8 teams etered the competitio,

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 18-447 Computer Architecture Lecture 13: State Maintenance and Recovery Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Instruction and Data Streams

Instruction and Data Streams Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Data Parallelism 1 (vector & SIMD extesios) (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Istructio ad

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

A collection of open-sourced RISC-V processors

A collection of open-sourced RISC-V processors Riscy Processors A collectio of ope-sourced RISC-V processors Ady Wright, Sizhuo Zhag, Thomas Bourgeat, Murali Vijayaraghava, Jamey Hicks, Arvid Computatio Structures Group, CSAIL, MIT 4 th RISC-V Workshop

More information

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control EE 459/500 HDL Based Digital Desig with Programmable Logic Lecture 13 Cotrol ad Sequecig: Hardwired ad Microprogrammed Cotrol Refereces: Chapter s 4,5 from textbook Chapter 7 of M.M. Mao ad C.R. Kime,

More information

ΕΠΛ 605 Εργαστήριο 5. Παναγιώτα Νικολάου 11/10/18. Slides from: Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin

ΕΠΛ 605 Εργαστήριο 5. Παναγιώτα Νικολάου 11/10/18. Slides from: Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin ΕΠΛ 605 Εργαστήριο 5 Παναγιώτα Νικολάου 11/10/18 Slides from: Rajagopala Desika, Doug Burger, Stephe Keckler, Todd Austi Simulators Simulatio is the process of desigig a model of a real system ad coductig

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Isn t It Time You Got Faster, Quicker?

Isn t It Time You Got Faster, Quicker? Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 3: ISA ad Itroductio to Microarchitecture Prof. Yajig Li Uiversity of Chicago Lecture Outlie ISA uarch (hardware implemetatio of a ISA) Logic desig basics Sigle-cycle

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1 Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)

More information

Design of Digital Circuits Lecture 16: Dependence Handling. Prof. Onur Mutlu ETH Zurich Spring April 2017

Design of Digital Circuits Lecture 16: Dependence Handling. Prof. Onur Mutlu ETH Zurich Spring April 2017 Design of Digital Circuits Lecture 16: Dependence Handling Prof. Onur Mutlu ETH Zurich Spring 2017 27 April 2017 Agenda for Today & Next Few Lectures! Single-cycle Microarchitectures! Multi-cycle and Microprogrammed

More information

COP4020 Programming Languages. Compilers and Interpreters Prof. Robert van Engelen

COP4020 Programming Languages. Compilers and Interpreters Prof. Robert van Engelen COP4020 mig Laguages Compilers ad Iterpreters Prof. Robert va Egele Overview Commo compiler ad iterpreter cofiguratios Virtual machies Itegrated developmet eviromets Compiler phases Lexical aalysis Sytax

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 CPU-Memory Bottleeck Computer Architecture ELEC44 CPU Memory Lecture 8 Cache Dr. Hayde Kwok-Hay So Departmet of Electrical ad Electroic Egieerig Performace of high-speed computers is usually limited by

More information

Computer Architecture Lecture 15: GPUs, VLIW, DAE. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015

Computer Architecture Lecture 15: GPUs, VLIW, DAE. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015 18-447 Computer Architecture Lecture 15: GPUs, VLIW, DAE Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 1 Computers ad Programs 1 Objectives To uderstad the respective roles of hardware ad software i a computig system. To lear what computer scietists

More information

Lecture 1: Introduction and Fundamental Concepts 1

Lecture 1: Introduction and Fundamental Concepts 1 Uderstadig Performace Lecture : Fudametal Cocepts ad Performace Aalysis CENG 332 Algorithm Determies umber of operatios executed Programmig laguage, compiler, architecture Determie umber of machie istructios

More information

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 1: Itroductio Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Lecture Outlie Meet ad greet Computer architecture: overview ad perspectives Course

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Review Istructio Set Architecture Istructio Set The repertoire of istructios of a computer Differet computers have differet istructio

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using

More information

Lecture 28: Data Link Layer

Lecture 28: Data Link Layer Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components Aoucemets Readig Chapter 4 (4.1-4.2) Project #4 is o the web ote policy about project #3 missig compoets Homework #1 Due 11/6/01 Chapter 6: 4, 12, 24, 37 Midterm #2 11/8/01 i class 1 Project #4 otes IPv6Iit,

More information

UNIVERSITY OF MORATUWA

UNIVERSITY OF MORATUWA UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016

More information

Today s objectives. CSE401: Introduction to Compiler Construction. What is a compiler? Administrative Details. Why study compilers?

Today s objectives. CSE401: Introduction to Compiler Construction. What is a compiler? Administrative Details. Why study compilers? CSE401: Itroductio to Compiler Costructio Larry Ruzzo Sprig 2004 Today s objectives Admiistrative details Defie compilers ad why we study them Defie the high-level structure of compilers Associate specific

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Determined by ISA and compiler. Determined by CPU hardware

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Determined by ISA and compiler. Determined by CPU hardware COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface ARM Editio Chapter 4 The Processor Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler CPI ad Cycle time Determied

More information

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University ECE5917 SoC Architecture: MP SoC Part 1 Tae Hee Ha: tha@skku.edu Semicoductor Systems Egieerig Sugkyukwa Uiversity Outlie Overview Parallelism Data-Level Parallelism Istructio-Level Parallelism Thread-Level

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization Ed Semester Examiatio 2013-14 CSE, III Yr. (I Sem), 30002: Computer Orgaizatio Istructios: GROUP -A 1. Write the questio paper group (A, B, C, D), o frot page top of aswer book, as per what is metioed

More information

Computer Architecture Lecture 17: Latency Tolerance and Prefetching. Prof. Onur Mutlu ETH Zürich Fall November 2017

Computer Architecture Lecture 17: Latency Tolerance and Prefetching. Prof. Onur Mutlu ETH Zürich Fall November 2017 Computer Architecture Lecture 17: Latecy Tolerace ad Prefetchig Prof. Our Mutlu ETH Zürich Fall 2017 22 November 2017 Summary of Last Week s Lectures Shared Cache Maagemet Makig Cachig More Effective Heterogeeous

More information

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1 Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable

More information

Outline. CSCI 4730 Operating Systems. Questions. What is an Operating System? Computer System Layers. Computer System Layers

Outline. CSCI 4730 Operating Systems. Questions. What is an Operating System? Computer System Layers. Computer System Layers Outlie CSCI 4730 s! What is a s?!! System Compoet Architecture s Overview Questios What is a?! What are the major operatig system compoets?! What are basic computer system orgaizatios?! How do you commuicate

More information

Arquitectura de Computadores

Arquitectura de Computadores Arquitectura de Computadores Capítulo 2. Procesadores segmetados Based o the origial material of the book: D.A. Patterso y J.L. Heessy Computer Orgaizatio ad Desig: The Hardware/Software Iterface 4 th

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 18 May 2018 Ageda for Today GPU as a accelerator Program structure Bulk sychroous programmig

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware A Overview Graphics System Moitor Iput devices CPU/Memory GPU Raster Graphics System Raster: A array of picture elemets Based o raster-sca TV techology The scree (ad a picture)

More information

1 Tomasulo s Algorithm

1 Tomasulo s Algorithm Design of Digital Circuits (252-0028-00L), Spring 2018 Optional HW 4: Out-of-Order Execution, Dataflow, Branch Prediction, VLIW, and Fine-Grained Multithreading uctor: Prof. Onur Mutlu TAs: Juan Gomez

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 20 Itroductio to Trasactio Processig Cocepts ad Theory Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Trasactio Describes local

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

CS2410 Computer Architecture. Flynn s Taxonomy

CS2410 Computer Architecture. Flynn s Taxonomy CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)

More information

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering EE 4363 1 Uiversity of Miesota Midterm Exam #1 Prof. Matthew O'Keefe TA: Eric Seppae Departmet of Electrical ad Computer Egieerig Uiversity of Miesota Twi Cities Campus EE 4363 Itroductio to Microprocessors

More information

SCI Reflective Memory

SCI Reflective Memory Embedded SCI Solutios SCI Reflective Memory (Experimetal) Atle Vesterkjær Dolphi Itercoect Solutios AS Olaf Helsets vei 6, N-0621 Oslo, Norway Phoe: (47) 23 16 71 42 Fax: (47) 23 16 71 80 Mail: atleve@dolphiics.o

More information

Computer Architecture Lecture 15: Dataflow and SIMD. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/20/2013

Computer Architecture Lecture 15: Dataflow and SIMD. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/20/2013 18-447 Computer Architecture Lecture 15: Dataflow and SIMD Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/20/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed LC-3b,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Chapter 4 The Datapath

Chapter 4 The Datapath The Ageda Chapter 4 The Datapath Based o slides McGraw-Hill Additioal material 24/25/26 Lewis/Marti Additioal material 28 Roth Additioal material 2 Taylor Additioal material 2 Farmer Tae the elemets that

More information

ELEG 5173L Digital Signal Processing Introduction to TMS320C6713 DSK

ELEG 5173L Digital Signal Processing Introduction to TMS320C6713 DSK Departmet of Electrical Egieerig Uiversity of Arasas ELEG 5173L Digital Sigal Processig Itroductio to TMS320C6713 DSK Dr. Jigia Wu wuj@uar.edu ANALOG V.S DIGITAL 2 Aalog sigal processig ASP Aalog sigal

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

1 Enterprise Modeler

1 Enterprise Modeler 1 Eterprise Modeler Itroductio I BaaERP, a Busiess Cotrol Model ad a Eterprise Structure Model for multi-site cofiguratios are itroduced. Eterprise Structure Model Busiess Cotrol Models Busiess Fuctio

More information

15-740/ Computer Architecture

15-740/ Computer Architecture 15-740/18-740 Computer Architecture Lecture 16: Runahead and OoO Wrap-Up Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/17/2011 Review Set 9 Due this Wednesday (October 19) Wilkes, Slave Memories

More information

Portland State University ECE 587/687. Superscalar Issue Logic

Portland State University ECE 587/687. Superscalar Issue Logic Portland State University ECE 587/687 Superscalar Issue Logic Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2017 Instruction Issue Logic (Sohi & Vajapeyam, 1987) After instructions are

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Computer Systems - HS

Computer Systems - HS What have we leared so far? Computer Systems High Level ENGG1203 2d Semester, 2017-18 Applicatios Sigals Systems & Cotrol Systems Computer & Embedded Systems Digital Logic Combiatioal Logic Sequetial Logic

More information

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office

More information

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition Lecture Goals Itroductio to Computig Systems: From Bits ad Gates to C ad Beyod 2 d Editio Yale N. Patt Sajay J. Patel Origial slides from Gregory Byrd, North Carolia State Uiversity Modified slides by

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Behavioral Modeling in Verilog

Behavioral Modeling in Verilog Behavioral Modelig i Verilog COE 202 Digital Logic Desig Dr. Muhamed Mudawar Kig Fahd Uiversity of Petroleum ad Mierals Presetatio Outlie Itroductio to Dataflow ad Behavioral Modelig Verilog Operators

More information