Pipelining: what Seymour Cray taught the laundry industry. How to correctly pipeline circuits


Wesley Boyd
6 years ago


"I've got months' worth of laundry to do tonight..."
"Funny, considering that he's only got one outfit..."

How to correctly pipeline circuits

Acknowledgement: The following slides have been provided by Prof. Ward in September 2004. Reformatting of PowerPoint and addition of two more slides done September 2007 by Jens Sparsø. Slides are used in DTU course 054 Digital Systems Engineering (fall 2008). Due to my (Joachim Rodrigues) position at DTU, I took the freedom to use the slides in EIT5.

Forget EIT5, let's solve a "Real Problem"

INPUT: dirty laundry
OUTPUT: 6 more weeks

Device: Washer
Function: Fill, Agitate, Spin
Washer PD = 30 mins

Device: Dryer
Function: Heat, Spin
Dryer PD = 60 mins

One load at a time

Everyone knows that the real reason that MIT students put off doing laundry so long is not because they procrastinate, are lazy, or even have better things to do. The fact is, doing one load at a time is not smart.

Total = Washer PD + Dryer PD = 90 mins

Doing N loads of laundry

Here's how they do laundry at Harvard, the "combinational" way. (Of course, this is just an urban legend. No one at Harvard actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched in time for afternoon tea.)

[Figure: Steps 1 to 4, each load washed and then dried before the next load starts]

Total = N * (Washer PD + Dryer PD) = N * 90 mins

Doing N loads the MIT way

MIT students "pipeline" the laundry process. That's why we wait!

[Figure: Steps 1 to 3, the washer starts the next load while the dryer is still running]

Total = N * Max(Washer PD, Dryer PD) = N * 60 mins

Actually, it's more like N * 60 + 30 mins if we account for the startup transient correctly. When doing pipeline analysis, we're mostly interested in the "steady state", where we assume we have an infinite supply of inputs.

Some definitions

Latency: The delay from when an input is established until the output associated with that input becomes valid.
(Harvard Laundry = 90 mins)
(MIT Laundry = 120 mins, assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available.)

Throughput: The rate at which inputs or outputs are processed.
(Harvard Laundry = 1/90 outputs/min)
(MIT Laundry = 1/60 outputs/min)

Okay, back to circuits

[Figure: input X feeds F and G; F(X) and G(X) feed H, which produces P(X)]

For combinational logic: latency = t_PD, throughput = 1/t_PD.

We can't get the answer faster, but are we making effective use of our hardware at all times? F and G are "idle", just holding their outputs stable while H performs its computation.
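The laundry totals above can be checked with a small scheduling model. The following is a minimal sketch, assuming one washer (30 mins) and one dryer (60 mins) and that a washed load waits in the washer until the dryer is free; the function names are illustrative and not part of the slides.

```python
def sequential_total(n_loads, washer_pd=30, dryer_pd=60):
    """Total minutes when each load is washed and dried before the next starts."""
    return n_loads * (washer_pd + dryer_pd)


def pipelined_schedule(n_loads, washer_pd=30, dryer_pd=60):
    """Return a list of (wash_start, dry_finish) times, one pair per load.

    The washer starts the next load as soon as it is empty; a finished
    load stays (wet) in the washer until the dryer becomes available.
    """
    washer_free = 0  # time at which the washer can accept a new load
    dryer_free = 0   # time at which the dryer can accept a new load
    schedule = []
    for _ in range(n_loads):
        wash_start = washer_free
        wash_done = wash_start + washer_pd
        dry_start = max(wash_done, dryer_free)
        washer_free = dry_start          # washer is emptied when drying starts
        dryer_free = dry_start + dryer_pd
        schedule.append((wash_start, dryer_free))
    return schedule


def pipelined_total(n_loads, washer_pd=30, dryer_pd=60):
    """Total minutes until the last load leaves the dryer."""
    schedule = pipelined_schedule(n_loads, washer_pd, dryer_pd)
    return schedule[-1][1] if schedule else 0
```

With the default delays, `pipelined_total(n)` follows the N * 60 + 30 pattern, and every load after the first spends 120 minutes between the start of its wash and the end of its drying, which is the steady-state latency quoted in the definitions.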

Pipelined Circuits

Use registers to hold H's input stable!

[Figure: F (15 ns) and G (20 ns) feed registers, whose outputs feed H (25 ns), followed by an output register producing P(X)]

Now F and G can be working on input X_{i+1} while H is performing its computation on X_i. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers:

                     latency      throughput
unpipelined          45           1/45
2-stage pipelined    50 (worse)   1/25 (better)

Pipeline diagrams

Clock cycle    i        i+1        i+2          i+3
Input          X_i      X_{i+1}    X_{i+2}      X_{i+3}
F Reg                   F(X_i)     F(X_{i+1})   F(X_{i+2})
G Reg                   G(X_i)     G(X_{i+1})   G(X_{i+2})
H Reg                              H(X_i)       H(X_{i+1})

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.

Pipeline diagrams (alternative view)

[Diagram: rows are the inputs X_i, X_{i+1}, X_{i+2}; columns are clock cycles i to i+3. Each row shows F(X) and G(X) followed one cycle later by H(X), and each successive row is shifted right by one clock cycle.]

Each row shows the processing of a particular set of input data. (In a processor: the processing of an instruction. You'll see plenty.)

Slide added by J. Sparsø

Pipeline Conventions

DEFINITION: a K-Stage Pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output. A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.

CONVENTION: Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).

ALWAYS: The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register t_PD PLUS (output) register t_SETUP.

The LATENCY of a K-pipeline is K times the period of the clock common to all registers.

The THROUGHPUT of a K-pipeline is the frequency of the clock.
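The clock-period rule and the latency and throughput definitions above can be written as a short calculation. This is a minimal sketch, assuming each stage is described only by its worst-case combinational delay; `pipeline_metrics` and its parameter names are illustrative, not taken from the slides.

```python
def pipeline_metrics(stage_delays_ns, t_reg_pd=0.0, t_setup=0.0):
    """Return (clock_period, latency, throughput) for a K-stage pipeline.

    stage_delays_ns: worst-case combinational delay of each stage, in ns.
    t_reg_pd:        register propagation delay, in ns (0 for ideal registers).
    t_setup:         register setup time, in ns (0 for ideal registers).
    """
    k = len(stage_delays_ns)
    # The clock must cover the slowest stage plus register overheads.
    clock_period = t_reg_pd + max(stage_delays_ns) + t_setup
    latency = k * clock_period        # K clock periods from input to output
    throughput = 1.0 / clock_period   # one result per clock period
    return clock_period, latency, throughput


# F (15 ns) and G (20 ns) work in parallel in stage 1, H (25 ns) is stage 2.
stage_1 = max(15, 20)
stage_2 = 25
period, latency, throughput = pipeline_metrics([stage_1, stage_2])
```

For the F, G, H example with ideal registers this gives a 25 ns clock, 50 ns latency and a throughput of 1/25 per ns, compared with 45 ns and 1/45 for the unpipelined circuit.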

Ill-formed pipelines

Consider a BAD job of pipelining:

[Figure: circuit with inputs X and Y and unevenly placed registers]

For what value of K is the following circuit a K-Pipeline? Answer: none.

Problem: Successive inputs get mixed: e.g., B(A(X_{i+1}), Y_i). This happened because some paths from inputs to outputs had 2 registers, and some had only 1!

Can this happen on a well-formed K-pipeline?

A pipelining methodology

Step 1: Draw a line that crosses every output in the circuit, and mark the endpoints as terminal points.

Step 2: Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction. These lines demarcate pipeline stages.

Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline.

STRATEGY: Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS).

[Figure: example circuit with elements A to F, the slowest taking 8 ns, cut into pipeline stages. T = 1/8 ns, L = 24 ns]

Pipeline Example

[Figure: circuit with inputs X and Y, pipelined to different depths]

            LATENCY   THROUGHPUT
0-pipe:     4         1/4
1-pipe:     4         1/4
2-pipe:     4         1/2
3-pipe:     6         1/2

OBSERVATIONS:
1-pipeline improves neither L nor T.
T improved by breaking long combinational paths, allowing a faster clock.
Too many stages cost L, don't improve T.
Back-to-back registers are often required to keep the pipeline well-formed.

Pipelining Summary

Advantages:
Allows us to increase throughput, by breaking up long combinational paths and (hence) increasing clock frequency.

Disadvantages:
May increase latency...
Only as good as the weakest link: slowest step constrains system throughput.
Increases area.

Isn't there a way around this "weak link" problem?

This bottleneck is the only problem. Which would you choose?
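The K-pipeline definition (exactly K registers on every input-to-output path) can be checked mechanically. The sketch below assumes the circuit is given as an acyclic graph in which each node lists the nodes driving it; the two small test circuits are hypothetical illustrations of the "2 registers on one path, 1 on another" problem, not the circuit drawn on the slide.

```python
from functools import lru_cache


def pipeline_depth(preds, registers, outputs):
    """Return K if the circuit is a well-formed K-pipeline, otherwise None.

    preds:     dict mapping each node name to the list of nodes driving it
               (primary inputs map to an empty list). Must be acyclic.
    registers: set of node names that are registers.
    outputs:   list of output node names.
    """
    @lru_cache(maxsize=None)
    def counts(node):
        # Set of register counts over all paths from any input to this node.
        own = 1 if node in registers else 0
        drivers = preds[node]
        if not drivers:
            return frozenset({own})
        found = set()
        for driver in drivers:
            found.update(c + own for c in counts(driver))
        return frozenset(found)

    all_counts = set()
    for out in outputs:
        all_counts |= counts(out)
    # Well-formed only if every path sees the same number of registers.
    return all_counts.pop() if len(all_counts) == 1 else None


# Hypothetical ill-formed circuit: X passes through R1 and R3, Y only through R3.
bad = {"X": [], "Y": [], "A": ["X"], "R1": ["A"], "B": ["R1", "Y"], "R3": ["B"]}
# Same circuit with a back-to-back register R2 added on the Y path.
good = {"X": [], "Y": [], "A": ["X"], "R1": ["A"], "R2": ["Y"],
        "B": ["R1", "R2"], "R3": ["B"]}
```

Here `pipeline_depth(bad, {"R1", "R3"}, ["R3"])` returns `None`, while the repaired circuit is a 2-pipeline. The added register R2 does no computation; it only keeps Y aligned with X, which is the role of the back-to-back registers mentioned in the observations.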

Pipelined Components

[Figure: a 4-stage pipeline in which one component is itself pipelined]

Pipelined systems can be hierarchical:
Replacing a slow combinational component with a k-pipe version may increase clock frequency.
Must account for new pipeline stages in our plan.

But... but... how can I pipeline a clothes dryer???

How do Aces do laundry?

They work around the bottleneck. First, they find a place with twice as many dryers as washers.

[Figure: Steps 1 to 5, one washer alternating between two dryers]

Throughput = 1/30 loads/min
Latency = 90 mins/load

Back to our bottleneck

Recall our earlier example: C, the slowest component, limits the clock period to 8 ns. HENCE throughput is limited to 1/8 ns.

We could improve throughput by
finding a pipelined version of C
OR
interleaving multiple copies of C.

[Figure: the earlier example circuit. T = 1/8 ns, L = 24 ns]

Circuit Interleaving

We can simulate a pipelined version of a slow component by replicating the critical element and alternating inputs between the various copies.

[Figure: input X_i feeds two latched copies of C; a 1-bit register Q, clocked by clk, selects between them. This is a simple 2-state FSM that alternates between 0 and 1 on each clock.]
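The effect of adding a second dryer can be expressed as a bottleneck calculation. This is a minimal sketch, assuming each stage has some number of identical copies that can work on different loads at the same time; the function name and the `(delay, copies)` representation are illustrative assumptions.

```python
from fractions import Fraction


def steady_state_throughput(stages):
    """Return items per time unit for a chain of replicated stages.

    stages: list of (delay, copies) pairs. A stage with `copies` identical
            units, each taking `delay` time units per item, can sustain
            copies/delay items per time unit. The slowest stage wins.
    """
    return min(Fraction(copies, delay) for delay, copies in stages)


one_dryer = steady_state_throughput([(30, 1), (60, 1)])   # washer, dryer
two_dryers = steady_state_throughput([(30, 1), (60, 2)])  # washer, 2 dryers
```

With one dryer the dryer is the bottleneck at 1/60 loads/min; with two dryers the washer and the dryer pair are balanced at 1/30 loads/min, and no load has to wait, so latency stays at 30 + 60 = 90 mins.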

Circuit Interleaving (continued)

When Q is 1 the lower path is combinational (the latch is open), yet the output of the upper path will be enabled onto the input of the output register, ready for the NEXT clock edge. Meanwhile, the other latch maintains the input from the last clock.

[Figure: X_i feeds an upper (even) and a lower (odd) copy of C; Q drives the latches and the output mux, which feeds the output register]

It acts like a 2-stage pipeline: in by t_i, out by t_{i+2}.

N-way interleaving is equivalent to N pipeline stages...

Latency = 2 clocks

Clock period 0: X_0 presented at input, propagates through upper latch, C_0.
Clock period 1: X_1 presented at input, propagates through lower latch, C_1. C_0(X_0) propagates to register inputs.
Clock period 2: X_2 presented at input, propagates through upper latch, C_0. C_0(X_0) loaded into register, appears at output.

Combining techniques

We can combine interleaving and pipelining. Here, C' interleaves two C elements with a propagation delay of 8 ns. The resulting C' circuit has twice the throughput of a single C element, and a latency of 8 ns.

This can be considered as an extra pipelining stage that passes through the middle of the C' module. One of our separation lines must pass through this pipeline stage.

By combining interleaving with pipelining we move the bottleneck from the C element to the F element.

[Figure: the example circuit with C replaced by C' (2 x 4 ns); F takes 5 ns. T = 1/5 ns, L = 25 ns]

And a little parallelism

We can combine interleaving and pipelining with parallelism.

[Figure: Steps 1 to 5, two washers feeding four dryers]

Throughput = 2/30 = 1/15 loads/min
Latency = 90 min
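The clock-by-clock walkthrough above can be reproduced with a small behavioural model. This is a sketch under simplifying assumptions: each copy of the slow element is given `n_copies` clock periods to finish, latches and the mux are ideal, and the model tracks only which value the output register holds in each clock period. The name `interleave` is illustrative.

```python
def interleave(inputs, func, n_copies=2):
    """Model N-way interleaving of a slow function, one input per clock.

    Returns the value held by the output register in each clock period,
    with None while the structure is still filling.
    """
    held = [None] * n_copies   # input currently held by each copy's latch
    outputs = []
    for cycle, x in enumerate(inputs):
        copy = cycle % n_copies          # copies take turns, like the Q FSM
        if cycle >= n_copies:
            # This copy has had n_copies periods to finish its previous
            # input; the output register captures that result now.
            outputs.append(func(held[copy]))
        else:
            outputs.append(None)
        held[copy] = x                   # latch the new input into this copy
    return outputs


result = interleave([1, 2, 3, 4, 5], lambda v: v * v)
```

The first result appears in clock period 2, matching the 2-clock latency in the walkthrough, and after that a new result appears every clock even though each copy takes two clock periods per input.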

Summary

Latency (L) = time it takes for a given input to arrive at the output
Throughput (T) = rate at which new outputs appear

For combinational circuits:
    L = t_PD of circuit, T = 1/L

For K-pipelines (K > 0):
    always have a register on the output(s)
    K registers on every path from input to output
    inputs available shortly after clock i, outputs available shortly after clock (i+K)
    T = 1/t_CLK = 1/(t_PD,REG + t_PD of slowest pipeline stage + t_SETUP)
        more throughput: split slowest pipeline stage(s)
        use replication/interleaving if no further splits possible
    L = K / T
        pipelined latency >= combinational latency
