Prediction Router: Yet another low-latency on-chip router architecture

1 Prediction Router: Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan), Michihiro Koibuchi (NII, Japan), Hideharu Amano (Keio Univ., Japan), Tsutomu Yoshinaga (UEC, Japan)


2 Why is a low-latency router needed?
Tile architecture
  Many cores (e.g., processors & caches)
  On-chip interconnection network [Dally, DAC 01]
[Figure: 16-core tile architecture; each tile pairs a core with a router, and the routers form a packet-switched network]
On-chip router affects the performance and cost of the chip


3 Why is a low-latency router needed?
System               Topology            Routing                   Switching            Flow ctrl
MIT RAW              2D mesh (32bit)     XY DOR                    WH, no VC            Credit
UPMC SPIN            Fat Tree (32bit)    Up*/down*                 WH, no VC            Credit
QuickSilver ACM      H-Tree (32bit)      Up*/down*                 1-flit, no VC        Credit
UMass Amherst aSoC   2D mesh             Shortest-path             Pipelined CS, no VC  Timeslot
Sun T1               Crossbar (128bit)   -                         -                    Handshake
Cell BE EIB          Ring (128bit)       Shortest-path             Pipelined CS, no VC  Credit
TRIPS (operand)      2D mesh (109bit)    YX DOR                    1-flit, no VC        On/off
TRIPS (on-chip)      2D mesh (128bit)    YX DOR                    WH, 4 VCs            Credit
Intel SCC            2D torus (32bit)    XY,YX DOR, odd-even TM    WH, no VC            Stall/go
Number of cores increases (e.g., 64-core or more?)
Number of hops increases
Their communication latency is a crucial problem
Low-latency router architecture has been extensively studied

4 Outline: Prediction router for low-latency NoC
Existing low-latency routers
  Speculative router
  Look-ahead router
  Bypassing router
Prediction router
  Architecture and the prediction algorithms
  Hit rate analysis
Evaluations
  Hit rate, gate count, and energy consumption
  Case study 1: 2-D mesh (small core size)
  Case study 2: 2-D mesh (large core size)
  Case study 3: Fat tree network

5 Wormhole router: Hardware structure
1) Selecting an output channel
2) Arbitration for the selected output channel
3) Sending the packet to the output channel
[Figure: input ports (X+, X-, Y+, Y-, CORE), each with a FIFO; an ARBITER issuing GRANT; a 5x5 CROSSBAR; output ports (X+, X-, Y+, Y-, CORE)]
Routing, arbitration, and switch traversal are performed in a pipelined manner

6 Pipeline structure: 3-cycle router
Speculative router: VA/SA in parallel [Peh, HPCA 01]
At least 3 cycles for traversing a router
  RC (Routing computation)
  VSA (Virtual channel & switch allocations)
  ST (Switch traversal)
VA & SA are speculatively performed in parallel
[Figure: timing chart of a packet transfer (HEAD, DATA 1, DATA 2, DATA 3) from router (a) to router (c); x-axis: elapsed time [cycle]. The head flit goes through RC, VSA, and ST at each router; each data flit goes through SA and ST]
At least 12 cycles for transferring a packet from router (a) to router (c)
To perform RC and VSA in parallel, look-ahead routing is used

7 Look-ahead router: RC/VA in parallel
At least 3 cycles for traversing a router
  NRC (Next routing computation)
  VSA (Virtual channel & switch allocations)
  ST (Switch traversal)
Routing computation for the next hop
  Output port of router (i+1) is selected by router (i)
[Figure: timing chart of HEAD, DATA 1, DATA 2, DATA 3 through three routers; the head flit goes through NRC, VSA, and ST at each router; x-axis: elapsed time [cycle]]
VSA can be performed w/o waiting for NRC

8 Look-ahead router: RC/VA in parallel
At least 2 cycles for traversing a router
  NRC + VSA (Next routing computation / arbitrations)
  ST (Switch traversal)
No dependency between NRC & VSA, so NRC & VSA are performed in parallel
[Figure: timing chart of HEAD, DATA 1, DATA 2, DATA 3 through three routers; x-axis: elapsed time [cycle]]
At least 9 cycles for transferring a packet from router (a) to router (c)
Typical example of a 2-cycle router [Dally's book, 2004]
Packing NRC, VSA, and ST into a single stage harms the operating frequency
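The 12-cycle and 9-cycle figures on the pipeline slides can be reproduced with a small calculation. The sketch below is illustrative only: it assumes no contention, ignores link delay, and assumes each data flit follows the head flit one cycle apart. The function name and parameters are hypothetical, not taken from the talk.

```python
def zero_load_packet_cycles(routers: int, stages_per_router: int, flits: int) -> int:
    """Cycles until the last flit of a packet leaves the last router.

    Assumptions (not from the slides): no contention, zero link delay,
    and data flits trail the head flit by one cycle each.
    """
    if routers < 1 or stages_per_router < 1 or flits < 1:
        raise ValueError("all arguments must be positive")
    head_cycles = routers * stages_per_router  # head flit pays every stage
    return head_cycles + (flits - 1)           # remaining flits stream behind it


# 4-flit packet (HEAD + DATA 1..3) crossing routers (a), (b), (c)
print(zero_load_packet_cycles(routers=3, stages_per_router=3, flits=4))  # 12
print(zero_load_packet_cycles(routers=3, stages_per_router=2, flits=4))  # 9
```

Under the same assumptions, a router that traverses in one cycle at every hop would need 6 cycles for the same packet.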

9-10 Bypassing router: skip some stages
Bypassing between intermediate nodes
  E.g., Express VCs [Kumar, ISCA 07]: virtual bypassing paths
Pipeline bypassing utilizing the regularity of DOR
  E.g., Mad postman [Izu, PDP 94]
Pipeline stages on frequently used paths are skipped
  E.g., Dynamic fast path [Park, HOTI 07]
Pipeline stages on user-specific paths are skipped
  E.g., Preferred path [Michelogiannakis, NOCS 07]
  E.g., DBP [Koibuchi, NOCS 08]
[Figure: path from SRC to the destination; non-bypassed routers take 3 cycles, bypassed routers take 1 cycle]
We propose a low-latency router based on multiple predictors


12-15 Prediction router for 1-cycle transfer [Yoshinaga, IWIA 06] [Yoshinaga, IWIA 07]
Each input channel has predictors
When an input channel is idle,
  Predict an output port to be used (RC pre-execution)
  Arbitrate for the predicted port (SA pre-execution)
RC & VSA are skipped if the prediction hits: 1-cycle transfer
[Figure: timing chart of a packet (HEAD, DATA 1, DATA 2, DATA 3) through three routers with prediction results MISS, HIT, HIT; the head flit goes through RC and VSA only at the router where the prediction misses; x-axis: elapsed time [cycle]]
E.g., we can expect 1.6-cycle transfer per router if 70% of predictions hit (0.7 x 1 cycle + 0.3 x 3 cycles)
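The 1.6-cycle figure is the expected per-router latency when a hit costs 1 cycle and a miss costs 3 cycles. A minimal sketch of that arithmetic follows; the function name and default arguments are illustrative assumptions.

```python
def expected_cycles_per_hop(hit_rate: float, hit_cycles: int = 1, miss_cycles: int = 3) -> float:
    """Average router traversal time for a given prediction hit rate.

    hit_cycles / miss_cycles default to the 1-cycle (hit) and 3-cycle (miss)
    figures quoted on the slides.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be within [0, 1]")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


for rate in (0.0, 0.7, 1.0):
    print(f"hit rate {rate:.0%}: {expected_cycles_per_hop(rate):.1f} cycles per hop")
```

At a 0% hit rate the result falls back to the 3 cycles of the original router, which matches the claim that a misprediction costs no extra cycles.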


16 Prediction router: Prediction algorithms
Efficient predictor is key
Prediction router [Yoshinaga, IWIA 06] [Yoshinaga, IWIA 07]
  1. Random
  2. Static Straight (SS)
     An output channel on the same dimension is selected (exploiting the regularity of DOR)
  3. Custom
     User can specify which output channel is accelerated
  4. Latest Port (LP)
     Previously used output channel is selected
  5. Finite Context Method (FCM) [Burtscher, TC 02]
     The most frequently appearing pattern of the n-context sequence (n = 0, 1, 2, ...)
  6. Sampled Pattern Match (SPM) [Jacquet, TIT 02]
     Pattern matching using a record table
A single predictor isn't enough for applications with different traffic patterns
Multiple predictors (A, B, C) for each input channel
  Select one of them in response to a given network environment
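To make three of the listed algorithms concrete, here is a software sketch of SS, LP, and a 0-context FCM. It is not the authors' hardware design. The class names, the `predict`/`update` interface, and the port-naming convention (a flit entering through the port on the X- side goes straight by leaving through X+) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

# Assumed convention: ports are named after the router side they sit on,
# so "going straight" means leaving through the opposite side.
OPPOSITE = {"X+": "X-", "X-": "X+", "Y+": "Y-", "Y-": "Y+"}


class StaticStraight:
    """SS: always predict the output on the same dimension (go straight)."""

    def __init__(self, input_port: str) -> None:
        self.input_port = input_port

    def predict(self) -> Optional[str]:
        return OPPOSITE.get(self.input_port)  # None for the CORE input

    def update(self, actual: str) -> None:
        pass  # static predictor: nothing to learn


class LatestPort:
    """LP: predict the output port used by the previous packet."""

    def __init__(self) -> None:
        self.last: Optional[str] = None

    def predict(self) -> Optional[str]:
        return self.last

    def update(self, actual: str) -> None:
        self.last = actual


class ZeroContextFCM:
    """FCM with n = 0: predict the most frequently used output port so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self) -> Optional[str]:
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, actual: str) -> None:
        self.counts[actual] += 1


def hit_rate(predictor, outputs: Sequence[str]) -> float:
    """Replay a sequence of actual output ports and count correct predictions."""
    hits = 0
    for actual in outputs:
        if predictor.predict() == actual:
            hits += 1
        predictor.update(actual)
    return hits / len(outputs)


trace = ["X+", "X+", "Y+", "X+", "X+"]  # made-up trace for one input channel
for name, p in [("SS", StaticStraight("X-")), ("LP", LatestPort()), ("FCM0", ZeroContextFCM())]:
    print(name, hit_rate(p, trace))
```

The trace is invented. Real hit rates depend on topology, routing, and traffic, which is the reason the slide gives for keeping several predictors per input channel.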

17 Basic operation: Correct prediction
Idle state: Output port X+ is selected and reserved
1st cycle: Incoming flit is transferred to X+ without RC and VSA
1st cycle: RC is performed; the prediction is correct!
2nd cycle: Next flit is transferred to X+ without RC and VSA
[Figure: router with predictors A, B, C at the input channels, FIFOs, an ARBITER, and a 5x5 XBAR; the crossbar port to X+ is reserved]
1-cycle transfer using the reserved crossbar port when the prediction hits

18 Basic operation: Misprediction
Idle state: Output port X+ is selected and reserved
1st cycle: Incoming flit is transferred to X+ without RC and VSA
1st cycle: RC is performed; the prediction is wrong! (X- is correct)
  Kill signal to X+ is asserted
2nd/3rd cycle: Dead flit is removed; retransmission to the correct port
[Figure: router with predictors A, B, C, FIFOs, an ARBITER, and a 5x5 XBAR; a KILL signal removes the dead flit sent toward X+, and X- is the correct port]
More energy is consumed for the retransmission
Even with a misprediction, a flit is transferred in 3 cycles, as in the original router
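The hit and miss cases can be summarized as a small behavioral model. This is a sketch of the sequence described on the slides, not of the actual RTL; the function name and the event strings are hypothetical.

```python
from typing import List, Tuple


def transfer_head_flit(predicted: str, actual: str) -> Tuple[int, List[str]]:
    """Return (cycles spent in the router, event log) for one head flit.

    predicted: output port reserved while the input channel was idle
    actual:    output port that routing computation (RC) selects
    """
    events = [f"cycle 1: flit forwarded to reserved port {predicted}; RC selects {actual}"]
    if predicted == actual:
        # Hit: the speculative transfer was right, so RC and VSA are skipped.
        return 1, events
    # Miss: cancel the speculative transfer and fall back to the normal pipeline.
    events.append(f"cycle 1: KILL asserted on {predicted}")
    events.append(f"cycles 2-3: dead flit removed; flit retransmitted to {actual}")
    return 3, events


for predicted, actual in [("X+", "X+"), ("X+", "X-")]:
    cycles, log = transfer_head_flit(predicted, actual)
    print(f"predicted={predicted} actual={actual}: {cycles} cycle(s)")
    for line in log:
        print("  " + line)
```

The model reflects the point made on the slide: a miss costs extra energy for the dead flit and retransmission, but not extra cycles relative to the 3-cycle original router.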


20 Prediction hit rate analysis
Formulas to calculate the prediction hit rates on
  2-D torus (Random, LP, SS, FCM, and SPM)
  2-D mesh (Random, LP, SS, FCM, and SPM)
  Fat tree (Random and LRU)
To forecast which prediction algorithm is suited for a given network environment w/o simulations
Accuracy of the analytical model is confirmed through simulations
Derivation of the formulas is omitted in this talk (see Section 4 of our paper for more detail)
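The paper's formulas are not reproduced in the talk. As a rough stand-in, the sketch below counts SS hits by brute force on a k x k mesh with XY dimension-order routing and uniform traffic. The hit definition is an assumption of this sketch, not the paper's model: a traversal counts as a hit only at an intermediate router where the packet keeps going in the same direction, so source routers, turning routers, and destination routers count as misses.

```python
from itertools import product


def ss_hit_rate_mesh(k: int) -> float:
    """Fraction of router traversals where 'go straight' is correct.

    Enumerates every ordered source/destination pair on a k x k mesh
    (k >= 2) under XY dimension-order routing.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    hits = 0
    traversals = 0
    nodes = list(product(range(k), repeat=2))
    for (src_x, src_y), (dst_x, dst_y) in product(nodes, repeat=2):
        if (src_x, src_y) == (dst_x, dst_y):
            continue
        x_hops = abs(dst_x - src_x)
        y_hops = abs(dst_y - src_y)
        traversals += x_hops + y_hops + 1  # source, intermediates, destination
        # Routers passed straight through within each dimension's segment.
        hits += max(x_hops - 1, 0) + max(y_hops - 1, 0)
    return hits / traversals


for k in (4, 8, 16):
    print(f"{k}x{k} mesh: SS hit rate = {ss_hit_rate_mesh(k):.3f}")
```

Under these assumptions the hit rate grows with k, because longer paths contain proportionally more straight segments.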


22 Evaluation items
Evaluation flow
  Flit-level network simulation (how many cycles? hit or miss at each router)
  Design compiler (synthesis) with Fujitsu 65nm library
  Astro (place & route)
  NC-Verilog (simulation), SAIF, SDF
  Power compiler
Evaluation items
  Hit rate / Comm. latency
  Area (gate count)
  Energy cons. [pJ / bit]
Table 1: Router & network parameters
  Packet length          4-flit (1-flit: 64 bit)
  Switching technique    wormhole
  Channel buffer size    4-flit / VC
  Number of VCs          1 or 2 VCs
  Cycle / hop (miss)     3 stage
  Cycle / hop (hit)      1 stage
  * Topology and traffic are mentioned later
Table 2: Process library
  CMOS process           65nm
  Core voltage           1.20V
  Temperature            25C
Table 3: CAD tools used
  Design compiler, Astro

23 3 case studies of prediction router
2-D mesh network: Case study 1 & 2
  The most popular network topology
    MIT's RAW [Taylor, ISCA 04]
    Intel's 80-core [Vangal, ISSCC 07]
  Dimension-order routing (XY routing)
  Here, we show the results of case studies 1 and 2 together
Fat tree network: Case study 3

24 Case study 1: Zero-load comm. latency
Uniform random traffic on 4x4 to 16x16 meshes
1-cycle transfer for correct prediction, 3-cycle for wrong prediction
[Figure: comm. latency [cycles] vs. network size (k-ary 2-mesh) for Original router, Pred router (SS), and Pred router (100% hit)]
Simulation results (analytical model also shows the same result)
  35.8% reduced for 8x8 cores
  48.2% reduced for 16x16 cores
More latency is reduced as the network size increases (48% for k=16)

25-27 Case study 2: Hit rate @ 8x8 mesh
[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]
SS: go straight. Efficient for long straight comm.
LP: the last one. Efficient for short repeated comm.
FCM: frequently used pattern. All-arounder!
Existing bypassing routers use
  Only a static or a single bypassing policy
  However, effective bypassing policy depends on traffic patterns
Prediction router supports
  Multiple predictors which can be switched in a cycle
  To accelerate a wider range of applications

28-29 Case study 2: Area & Energy
Area (gate count)
  Verilog-HDL designs, synthesized with 65nm library
  [Figure: router area [kilo gates] for Original router, Pred router (SS + LP), and Pred router (SS+LP+FCM)]
  Light-weight (small overhead)
  FCM is an all-arounder, but requires counters
  The area increase depends on the type and number of predictors
Energy consumption
  [Figure: flit switching energy [pJ / bit] for Original router, Pred router (70% hit), and Pred router (100% hit)]
  Misprediction consumes power; 9.5% increased if hit rate is 70%
  This estimation is pessimistic:
    1. More energy is consumed in links, so the effect of router energy overhead is reduced
    2. Application will be finished early, so more energy is saved
Latency 35.8%-48.2% saved w/ reasonable area/energy overheads


31-32 Case study 3: Fat tree network
1. LRU algorithm
   LRU output port is selected for upward transfer
2. LRU + LP algorithm
   Plus, LP for downward transfer
[Figure: fat tree with Up and Down directions; comm. latency [cycles] vs. network size (# of cores) for Original router, Pred router (LRU), and Pred router (LRU + LP)]
Latency 30.7% reduced @ 256-core; small area overhead (7.8%)
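The LRU policy for upward transfer can be sketched as a queue of upward ports ordered by last use. This is an illustrative software model; the class name, the port labels, and the `predict`/`update` interface are assumptions, and the slides do not describe how the hardware tracks recency.

```python
from collections import deque
from typing import Iterable


class LRUUpPortPredictor:
    """Predict the least recently used upward output port."""

    def __init__(self, up_ports: Iterable[str]) -> None:
        # Leftmost entry is the least recently used port.
        self.order = deque(up_ports)
        if not self.order:
            raise ValueError("at least one upward port is required")

    def predict(self) -> str:
        return self.order[0]

    def update(self, used_port: str) -> None:
        # Move the port that was actually used to the most-recent end.
        self.order.remove(used_port)
        self.order.append(used_port)


predictor = LRUUpPortPredictor(["UP0", "UP1"])  # hypothetical port labels
for _ in range(4):
    port = predictor.predict()
    print("reserve", port)
    predictor.update(port)
```

Successive predictions alternate between the upward ports, spreading upward traffic across them. The downward direction is handled separately (by LP in the LRU + LP variant).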

33 Summary of the prediction router
Prediction router for low-latency NoCs
  Multiple predictors, which can be switched in a cycle
  Architecture and six prediction algorithms
  Analytical model of prediction hit rates
Evaluations of prediction router
  Case study 1: 2-D mesh (small core size)
  Case study 2: 2-D mesh (large core size)
  Case study 3: Fat tree network
From three case studies
  Area overhead: 6.4% (SS+LP)
  Energy overhead: 9.5% (worst)
  Latency reduction: up to 48%
  (from case studies 1 & 2)
Results
  1. Prediction router can be applied to various NoCs
  2. Communication latency is reduced with small overheads
  3. Prediction router with multiple predictors can accelerate a wider range of applications

34 Thank you for your attention It would be very helpful if you would speak slowly. Thank you in advance.

35 Prediction router: New modifications
Predictors for each input channel
Kill mechanism to remove dead flits
Two-level arbiter
  Reservation: higher priority
  Tentative reservation: by the pre-execution of VSA
[Figure: router with predictors A, B, C at the input channels (X+, X-, Y+, Y-, CORE), FIFOs, KILL signals, an ARBITER, and a 5x5 XBAR]
Currently, the critical path is related to the arbiter

36 Prediction router: Predictor selection
Static scheme
  A predictor is selected by the user per application
  Configuration table
    Application 1: Predictor B
    Application 2: Predictor A
    Application 3: Predictor C
  Simple; pre-analysis is needed
Dynamic scheme
  A predictor is adaptively selected
  Count up if each predictor hits
    Predictor A: 100
    Predictor B: 80
    Predictor C: 120
  A predictor is selected every n cycles (e.g., n = 10,000)
  Flexible; more energy
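The dynamic scheme amounts to per-predictor hit counters that are compared every n cycles. The sketch below is a software model under stated assumptions: the class and method names are hypothetical, and resetting the counters after each selection is a choice made here that the slide does not specify.

```python
from typing import Dict, Sequence


class DynamicPredictorSelector:
    """Pick the predictor with the most hits once every `interval` cycles."""

    def __init__(self, names: Sequence[str], interval: int = 10_000) -> None:
        if not names or interval < 1:
            raise ValueError("need at least one predictor and a positive interval")
        self.hits: Dict[str, int] = {name: 0 for name in names}
        self.interval = interval
        self.cycle = 0
        self.active = names[0]

    def record_hit(self, name: str) -> None:
        # Called whenever predictor `name` would have predicted correctly.
        self.hits[name] += 1

    def tick(self) -> str:
        """Advance one cycle; re-select at the end of each interval."""
        self.cycle += 1
        if self.cycle % self.interval == 0:
            self.active = max(self.hits, key=self.hits.get)
            # Assumption: counters restart so each interval is judged on its own.
            self.hits = dict.fromkeys(self.hits, 0)
        return self.active


selector = DynamicPredictorSelector(["A", "B", "C"], interval=10_000)
for name, count in {"A": 100, "B": 80, "C": 120}.items():  # counts from the slide
    for _ in range(count):
        selector.record_hit(name)
for _ in range(10_000):
    active = selector.tick()
print("selected predictor:", active)  # C
```

The static scheme needs none of this machinery: it is a lookup from application to predictor, which is why the slide calls it simple but dependent on pre-analysis.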

37 Case study 1: Router critical path
RC: Routing comp.; VSA: Arbitration; ST: Switch traversal
ST can occur in these stages of the prediction router
[Figure: stage delay [FO4s] for Original router and Pred router (SS)]
Critical path delay increased by 6.2% compared with the original router

38 Case study 2: Hit rate @ 8x8 mesh
[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]
SS: go straight. Efficient for long straight comm.
LP: the last one. Efficient for short repeated comm.
FCM: frequently used pattern. All-arounder!
Custom: user-specific path. Efficient for simple comm.

39-40 Case study 4: Spidergon network
Spidergon topology
  Ring + across links [Coppola, ISSOC 04]
  Each router has 3 ports
  Mesh-like 2-D layout
  Across-first routing
Hit rate @ Uniform
  SS: Go straight
  LP: Last used one
  FCM: Frequently used one
  [Figure: prediction hit rate [%] vs. network size (# of cores)]
  Hit rates of SS & FCM are almost the same
  High hit rate is achieved (80% for 64-core; 94% for 256-core)

41 4 case studies of prediction router
Evaluation items
  Hit rate / Comm. latency
  Area (gate count)
  Energy cons. [pJ / bit]
2-D mesh network: Case study 1 & 2
Fat tree network: Case study 3
Spidergon network: Case study 4
