CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

2 Administrative Stuff
- Lab2 due tonight
- Exam I: covers lectures 1-9
  - Open book, open notes, closed device
- Review session: this Friday, 5-6:20pm, Ry 276
- Extra office hours this week

3 Lecture Outline
- Out-of-order (OOO) execution
- Other ways to improve ILP

4 Review: Out-of-Order Execution
- Idea: Move the dependent instructions out of the way of independent ones (so that independent ones can execute)
- Approach
  - Monitor the source values of each instruction
  - When all source values of an instruction are available, fire (i.e., dispatch) the instruction
  - Retire each instruction in program order
- Benefit: Latency tolerance. Allows independent instructions to execute and complete in the presence of a long-latency operation

5 Review: Benefits of OOO
[Figure: pipeline timing diagrams (F, D, E, M, W stages, with STALL and WAIT cycles) comparing in-order and out-of-order execution of the sequence below]
  IMUL R3 <- R1, R2
  ADD  R3 <- R3, R1
  ADD  R1 <- R6, R7
  IMUL R5 <- R6, R8
  ADD  R7 <- R3, R5
- Assume: IMUL takes 4 Ex cycles, ADD takes 1 Ex cycle
- Result: 15 vs. 12 cycles (in-order vs. out-of-order)

6 Review: Tomasulo's Algorithm
1. Link the consumer of a value to the producer
   - Register renaming: Associate a tag with each data value
   - Eliminates false dependencies
2. Buffer instructions until they are ready
   - Insert instruction into reservation stations after renaming
   - Reservation stations are also referred to as issue queues
   - Enables the pipeline to keep moving for independent ops (dynamic scheduling)
3. Keep track of readiness of source values of an instruction
   - Broadcast the tag when the value is produced
   - Instructions compare their source tags to the broadcast tag -> if match, source value becomes ready
4. When all source values of an instruction are ready, dispatch the instruction to a functional unit (FU), which can be out of order
   - Wakeup and select/schedule the instruction
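The renaming and tag-broadcast steps above can be sketched in a few lines of Python. This is a simplified model, not a description of any real machine: the names `RSEntry`, `Renamer`, and `broadcast` are made up for illustration, register values are arbitrary, and functional units, select logic, and the reorder buffer are left out.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class RSEntry:
    """One reservation-station entry: an op waiting for its sources."""
    op: str
    dest_tag: int
    srcs: list  # each source is ("val", value) or ("tag", producer_tag)

    def ready(self):
        # An instruction may fire once every source holds a value.
        return all(kind == "val" for kind, _ in self.srcs)


class Renamer:
    """Maps each architectural register to a value or to its producer's tag."""

    def __init__(self, regs):
        self.table = {r: ("val", v) for r, v in regs.items()}
        self._tags = count(1)

    def rename(self, op, dest, src_regs):
        srcs = [self.table[r] for r in src_regs]  # read sources first
        tag = next(self._tags)
        self.table[dest] = ("tag", tag)           # then rename the destination
        return RSEntry(op, tag, srcs)


def broadcast(stations, table, tag, value):
    """Tag/value broadcast: every consumer waiting on `tag` captures `value`."""
    for entry in stations:
        entry.srcs = [("val", value) if s == ("tag", tag) else s
                      for s in entry.srcs]
    for reg, state in table.items():
        if state == ("tag", tag):
            table[reg] = ("val", value)


# First three instructions of the IMUL/ADD sequence from the OOO review slide.
rn = Renamer({"R1": 2, "R2": 3, "R6": 4, "R7": 5})
imul = rn.rename("IMUL", "R3", ["R1", "R2"])
add1 = rn.rename("ADD", "R3", ["R3", "R1"])  # waits on the IMUL's tag
add2 = rn.rename("ADD", "R1", ["R6", "R7"])  # independent, ready at once
ready_before = [e.ready() for e in (imul, add1, add2)]
broadcast([imul, add1, add2], rn.table, imul.dest_tag, 2 * 3)
ready_after = [e.ready() for e in (imul, add1, add2)]
```

Note that the second ADD writes R1 without disturbing the first ADD, which already captured the old R1 value at rename time. That is the false-dependency elimination from step 1.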

7 What's Missing in Tomasulo's Algorithm?
- Need to preserve program instruction order
  - Also need to enable precise exceptions (we'll get back to this)
- Idea: Use a reorder buffer to reorder instructions before committing them to architectural state
  - An instruction updates the architectural register file when it is the oldest in the machine and has completed execution
  - Program order is preserved

8 Illustration of an OOO Pipeline
[Figure: OOO pipeline. In-order front end (F, D) feeds SCHEDULE; out-of-order execution in functional units of different latencies (integer add, integer mul, FP mul, load/store); results pass through REORDER to in-order writeback (W). A TAG and VALUE broadcast bus is shown above the pipeline.]
- Two humps
  - Hump 1: reservation stations (scheduling window)
  - Hump 2: reorder buffer (instruction window or active window)

9 Reorder Buffer (ROB)
- Idea: Reorder instructions before making results visible to architectural state
  - When an instruction is decoded, it reserves an entry in the ROB
    - Stall if the ROB is full
  - When an instruction completes, it writes its result into its ROB entry
  - When the oldest instruction in the ROB has completed without exceptions, its result is moved to the register file or memory
- ROB holds active instructions
  - Decoded, renamed, but not committed
  - Waiting to be dispatched, or being executed

10 What Does a ROB Look Like?
- ROB entry example: Allocated? | DestID | DestVal | Valid bit for DestVal | PC | Other control bits | Exception?
  - The valid bit for DestVal tracks readiness of the dest result
[Figure: ROB with three allocated entries. One pointer marks the oldest instruction (retire when dest is valid; update arch. state); another marks the first available entry to hold newly decoded instructions.]
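The allocate / complete / retire behavior of the ROB can be sketched as a small queue. This is a minimal model under stated assumptions: the class and field names are hypothetical, each entry keeps only DestID, DestVal, and the valid bit, and exceptions, the PC, and other control bits are omitted.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are allocated and retired strictly in program order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # oldest instruction at the left
        self.arch_regs = {}     # architectural register file

    def allocate(self, dest):
        """Called at decode. Returns None when the ROB is full (stall)."""
        if len(self.entries) >= self.size:
            return None
        entry = {"dest": dest, "value": None, "valid": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Called when execution finishes, possibly out of program order."""
        entry["value"] = value
        entry["valid"] = True

    def retire(self):
        """Commit completed instructions, oldest first, and stop at the
        first entry whose result is not yet valid."""
        retired = []
        while self.entries and self.entries[0]["valid"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired


rob = ReorderBuffer(size=2)
older = rob.allocate("R3")
younger = rob.allocate("R1")
stalled = rob.allocate("R5")          # ROB is full, so decode must stall
rob.complete(younger, 9)              # younger instruction finishes first
retired_early = rob.retire()          # nothing retires: oldest is not done
rob.complete(older, 6)
retired_late = rob.retire()           # both retire, in program order
```

The younger result sits in the ROB until the older instruction completes, which is what keeps the architectural state in program order.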

11 Discussions
- Why is OoO execution beneficial?
  - What if all operations take a single cycle?
  - Latency tolerance: OoO execution tolerates the latency of multi-cycle operations by executing independent operations concurrently
- How many cycles of latency can OoO tolerate?
  - Determined by the active instruction window, or ROB size
    - 126 in Pentium 4
  - But scheduling window size is also important
    - Bigger -> higher chance of finding more independent instructions
- Tradeoffs?
  - Power, cost, complexity, performance

12 Registers vs. Memory
- So far, we considered register-based value communication between instructions
- What about memory?
- What are the fundamental differences between registers and memory?
  - Register dependencies are known statically; memory dependencies are determined dynamically
  - Register state is small; memory state is large
  - Register state is not visible to other threads/processors; memory state is shared between threads/processors (in a shared-memory multiprocessor)

13 Handling Memory Operations in OOO Machines
- Just like register updates, stores should not modify memory until after the instruction is committed
- One idea: Keep store address/data in the reorder buffer
  - How does a load instruction find its data?

14 Store Buffer
- Similar to the reorder buffer, but used only for store instructions
  - Program-order list of uncommitted store operations
- When a store is decoded: Allocate a store buffer entry
- When store address and data become available: Record them in the store buffer entry
  - Store address and data are updated separately
- When the store is the oldest instruction in the pipeline: Update the memory address (i.e., cache) with the data
- If the store is flushed: Clear the valid bit for the corresponding store buffer entry (in addition to clearing the ROB entry)
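The store buffer life cycle described above (allocate at decode, fill address and data separately, write memory only at commit, invalidate on flush) might look like the following sketch. The class and method names are hypothetical, memory is modeled as a plain dictionary, and the caller is assumed to commit a store only once it is the oldest instruction and has both address and data.

```python
class StoreBuffer:
    """Toy store buffer: a program-order list of uncommitted stores."""

    def __init__(self):
        self.entries = []  # oldest store first

    def allocate(self):
        """Called when a store is decoded."""
        entry = {"valid": True, "addr": None, "data": None}
        self.entries.append(entry)
        return entry

    def flush(self, entry):
        """A squashed store only has its valid bit cleared."""
        entry["valid"] = False

    def commit_oldest(self, memory):
        """Called when the oldest store reaches the head of the machine.
        Memory changes only here, and only for a still-valid entry."""
        entry = self.entries.pop(0)
        if entry["valid"]:
            memory[entry["addr"]] = entry["data"]


memory = {}
sb = StoreBuffer()
first = sb.allocate()
second = sb.allocate()
first["data"] = 7        # data and address arrive separately
first["addr"] = 0x100
second["addr"] = 0x108
second["data"] = 9
memory_before_commit = dict(memory)  # still empty: nothing committed yet
sb.flush(second)                     # e.g. second store was on a wrong path
sb.commit_oldest(memory)             # writes 7 to 0x100
sb.commit_oldest(memory)             # flushed entry is dropped silently
```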

15 Illustration of a Store Buffer
[Figure: store buffer with six entries, each holding V and S bits, an address (Addr), and data (Data). A store commit path leads from the buffer to the L1 data cache (tags and data).]

16 Memory Dependencies
  stur x1, [x2, 96]
  ldur x3, [x4, 10]
- Is the load dependent on the store?
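The answer depends on the run-time values of x2 and x4, which is the point of the question. The sketch below makes that concrete under two assumptions that are not on the slide: both accesses are 8 bytes wide (64-bit X registers), and the register values are invented for illustration.

```python
ACCESS_BYTES = 8  # assumption: stur/ldur on X registers move 8 bytes


def overlaps(addr_a, addr_b, size=ACCESS_BYTES):
    """True if the byte ranges [addr, addr + size) intersect."""
    return addr_a < addr_b + size and addr_b < addr_a + size


def load_depends_on_store(regs):
    """Effective addresses for: stur x1, [x2, 96] and ldur x3, [x4, 10]."""
    store_addr = regs["x2"] + 96
    load_addr = regs["x4"] + 10
    return overlaps(store_addr, load_addr)


# Hypothetical register values; only known once the instructions execute.
same_address = load_depends_on_store({"x2": 1000, "x4": 1086})  # 1096 vs 1096
partial = load_depends_on_store({"x2": 1000, "x4": 1090})       # 1096 vs 1100
far_apart = load_depends_on_store({"x2": 1000, "x4": 2000})     # 1096 vs 2010
```

The same two instructions are dependent in the first two cases and independent in the third, so the hardware cannot decide at decode time.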

17 Memory Dependency
- Need to obey memory dependences in an out-of-order machine
  - While providing high performance
- Observation and problem: The memory address is not known until a load/store executes
  - Corollary 1: Renaming memory addresses is difficult
  - Corollary 2: Determining the dependency of loads/stores needs to be handled after their (partial) execution
  - Corollary 3: When a load/store has its address ready, there may be younger/older loads/stores with undetermined addresses in the machine

18 Memory Dependency Handling
- When do you schedule a load instruction in an OOO machine?
  - Problem: A younger load can have its address ready before an older store's address is known
  - Known as the memory disambiguation problem or the unknown address problem
- Approaches
  - Conservative: Stall the load until all previous stores have computed their addresses (or even retired from the machine)
  - Aggressive: Assume the load is independent of unknown-address stores and schedule the load right away
  - Speculative: Predict whether the load is dependent on the unknown-address store

19 Memory Disambiguation Schemes
- Option 1: Assume the load is dependent on all previous stores
  + No need for recovery
  -- Too conservative: delays independent loads unnecessarily
- Option 2: Assume the load is independent of all previous stores
  + Simple and can be the common case: no delay for independent loads
  -- Requires recovery and re-execution of the load and its dependents on misprediction
- Option 3: Predict the dependency of a load on an outstanding store
  + More accurate. Load-store dependencies persist over time
  + Simple predictors (based on past history) can achieve most of the potential performance (Chrysos and Emer, "Memory Dependence Prediction Using Store Sets," ISCA 1998)
  -- Still requires recovery/re-execution on misprediction
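The three options differ only in when a load is allowed to issue. The sketch below encodes that decision. The predictor here is a deliberately crude stand-in, one "has this load ever been caught violating?" flag per load PC, and is much simpler than the store-sets scheme cited above; all names are hypothetical.

```python
class DependencePredictor:
    """Crude history-based predictor: remembers loads that issued too early."""

    def __init__(self):
        self.violated = set()  # PCs of loads that were once mispredicted

    def predicts_independent(self, load_pc):
        return load_pc not in self.violated

    def record_violation(self, load_pc):
        # Called during recovery, after an older store turned out to alias.
        self.violated.add(load_pc)


def can_issue_load(policy, load_pc, older_store_addrs_known, predictor):
    """Decide whether a load with a ready address may issue now."""
    if policy == "conservative":   # Option 1: wait for all older stores
        return older_store_addrs_known
    if policy == "aggressive":     # Option 2: always go, recover if wrong
        return True
    if policy == "predict":        # Option 3: go early only if history allows
        return older_store_addrs_known or predictor.predicts_independent(load_pc)
    raise ValueError(f"unknown policy: {policy}")


pred = DependencePredictor()
LOAD_PC = 0x40  # hypothetical load instruction address
first_try = can_issue_load("predict", LOAD_PC, False, pred)   # issues early
pred.record_violation(LOAD_PC)                                # it was wrong
second_try = can_issue_load("predict", LOAD_PC, False, pred)  # now it waits
```

Options 2 and 3 both need the recovery path; the predictor only reduces how often it is taken.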

20 Data Bypass Between Stores and Loads
- We cannot update memory out of program order -> need to buffer all store and load instructions in the instruction window
- Even if we know all addresses of past stores when we generate the address of a load, two questions still remain:
  1. How do we check whether or not it is dependent on a store?
  2. How do we forward data to the load if it is dependent on a store?
- Modern processors use a LQ (load queue) and a SQ (store queue)
  - Can be combined or separate between loads and stores
  - A load searches the SQ after it computes its address. (Why?)
  - A store searches the LQ after it computes its address. (Why?)

21 Load Bypass from Store Buffer
[Figure: the load address is looked up in both the store buffer (six entries, each with V and S bits, Addr, and Data) and the L1 data cache (tags and data); the selected result is returned as load data.]
- If data hits in both the store buffer and the cache, which to use? The store buffer
- If the same address is in the store buffer twice, which to use? The youngest store older than the load
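Both selection rules fit in one short function. This sketch assumes every store address is already known, that addresses either match exactly or not at all (no partial overlap), and that a sequence number `seq` records program order; those names and the data values are invented for illustration.

```python
def load_value(store_buffer, cache, load_addr, load_seq):
    """Return the value a load should see.

    store_buffer: list of dicts with 'seq', 'addr', 'data' (uncommitted stores)
    cache:        dict mapping address -> committed value
    """
    matches = [s for s in store_buffer
               if s["seq"] < load_seq and s["addr"] == load_addr]
    if matches:
        # Store buffer wins over the cache; among several matching stores,
        # take the youngest one that is still older than the load.
        return max(matches, key=lambda s: s["seq"])["data"]
    return cache[load_addr]


cache = {0x40: 5, 0x80: 6}
stores = [
    {"seq": 1, "addr": 0x40, "data": 11},
    {"seq": 3, "addr": 0x40, "data": 22},
    {"seq": 6, "addr": 0x40, "data": 33},  # younger than the load below
]
forwarded = load_value(stores, cache, 0x40, load_seq=5)   # from store seq 3
from_cache = load_value(stores, cache, 0x80, load_seq=5)  # no matching store
```

The store with `seq` 6 is ignored because it comes after the load in program order, even though it targets the same address.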

22 OoO: Summary
- Improves performance (by a lot)
- A lot more complex than in-order
  - E.g., think about how to implement branch prediction in OoO
- When would you choose an in-order implementation instead of OoO, and vice versa?

23 Approaches to Instruction-Level Parallelism
- Pipelining
- Out-of-order execution
- Others
  - Superscalar
  - SIMD processing
  - VLIW

24 What's a Superscalar?
- Execute more than one instruction in a clock cycle by simultaneously dispatching multiple instructions to different execution units
[Figure omitted. Image source: Wikipedia]

25 SIMD Processing: Exploiting Regular (Data) Parallelism

26 Flynn's Taxonomy of Computers
Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
- SISD: Single instruction operates on single data element
- SIMD: Single instruction operates on multiple data elements
- MISD: Multiple instructions operate on single data element
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

27 Data Parallelism
- Concurrency arises from performing the same operations on different pieces of data
  - Single instruction multiple data (SIMD)
  - E.g., adding two vectors
- SIMD exploits instruction-level parallelism
  - Multiple operations are concurrent: instructions happen to be the same

28 SIMD Processing
- Good at exploiting regular data-level parallelism
  - Same operation performed on many data elements
  - Improves performance, simplifies design (no intra-vector dependencies)
- Performance improvement limited by vectorizability of code
  - Scalar operations limit vector machine performance
  - Amdahl's Law
  - CRAY-1, which is known for its SIMD processing capability, was the fastest SCALAR machine of its time!
- Many existing ISAs include SIMD operations
  - Intel MMX/SSE/AVX, PowerPC AltiVec, ARM Advanced SIMD
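The Amdahl's Law point above can be made numeric. The function below is the standard formula, speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that vectorizes and s is the speedup of that fraction; the sample values are arbitrary and not taken from the lecture.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only `vector_fraction` of the original run time
    benefits from a `vector_speedup`-times faster vector unit."""
    scalar_part = 1.0 - vector_fraction
    vector_part = vector_fraction / vector_speedup
    return 1.0 / (scalar_part + vector_part)


# Illustrative numbers only: half the time vectorizes, 8x faster vector unit.
half_vectorized = amdahl_speedup(0.5, 8)    # 1 / 0.5625, about 1.78x
# Even an enormously fast vector unit cannot beat 1 / (1 - 0.5) = 2x.
near_limit = amdahl_speedup(0.5, 1e9)
```

With half of the run time left scalar, the ceiling is 2x no matter how wide the vector unit is, which is why scalar performance mattered so much on machines like the CRAY-1.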

29 Vector Processors
- A vector is a one-dimensional array of numbers
- A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
- Typical execution flow
  - Load vectors from memory
  - Perform SIMD operations (add, mul, etc.)
    - Using different pipelined functional units
    - Each pipeline stage operates on a different data element
    - Can have multiple functional units that perform the same operation; each is called a lane
  - Store vectors back to memory

30 Parallelism in Vector Processors
- Can overlap execution of multiple vector instructions
  - Example machine has 32 elements per vector register and 8 lanes
  - Completes 24 operations/cycle while issuing 1 vector instruction/cycle
[Figure: timeline of load, mul, and add vector instructions issued in sequence and overlapping in the Load Unit, Multiply Unit, and Add Unit.]
Slide credit: Krste Asanovic
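One way to read the numbers on this slide, offered as an interpretation rather than something the slide states: each vector instruction occupies its unit for 32 / 8 = 4 cycles, and once the load, multiply, and add units are all busy, 3 units x 8 lanes = 24 element operations finish per cycle. The helper names below are hypothetical.

```python
import math


def cycles_per_vector_instruction(vector_length, lanes):
    """Cycles one vector instruction occupies a functional unit."""
    return math.ceil(vector_length / lanes)


def peak_ops_per_cycle(busy_units, lanes):
    """Element operations completed per cycle when `busy_units` functional
    units are each processing a vector instruction across all lanes."""
    return busy_units * lanes


occupancy = cycles_per_vector_instruction(vector_length=32, lanes=8)
peak = peak_ops_per_cycle(busy_units=3, lanes=8)  # load + multiply + add
```

Under this reading, issuing one vector instruction per cycle is more than enough to keep all three units busy, since each instruction holds its unit for several cycles.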

31 Basic Requirements of a Vector Processor
- Need to load/store vectors -> vector registers (contain vectors)
- Need to operate on vectors of different lengths -> vector length register (VLEN)
- Elements of a vector might be stored apart from each other in memory -> vector stride register (VSTR)
  - Stride: distance between two elements of a vector
- Determining which data elements should not be acted upon -> VMASK register, a bit mask
  - Supports conditional operations in a loop
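The roles of VLEN, VSTR, and VMASK can be illustrated with a toy software model. This is only a sketch: memory is a Python list indexed by element rather than by byte, the function names `vload` and `vadd_masked` are made up, and masked-off elements are assumed to keep their old destination value.

```python
def vload(memory, base, vlen, vstr):
    """Strided vector load: VLEN elements, VSTR elements apart."""
    return [memory[base + i * vstr] for i in range(vlen)]


def vadd_masked(a, b, old_dest, vmask):
    """Element-wise add; where the mask bit is 0 the old value is kept."""
    return [x + y if m else old
            for x, y, old, m in zip(a, b, old_dest, vmask)]


memory = list(range(100, 120))                 # memory[i] == 100 + i
a = vload(memory, base=0, vlen=4, vstr=2)      # every other element
b = vload(memory, base=1, vlen=4, vstr=1)      # contiguous elements
result = vadd_masked(a, b, old_dest=[0, 0, 0, 0], vmask=[1, 0, 1, 1])
```

The mask plays the role of an `if` inside a vectorized loop: the add is issued for the whole vector, but only the elements whose mask bit is set are updated.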
