Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Size: px

Start display at page:

Download "Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture"

Megan George
6 years ago
Views:

1 Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on on-chip Architecture Avinash Kodi, Ashwini Sarathy * and Ahmed Louri * Department of Electrical Engineering and Computer Science, Ohio University, Athens, OH 457 * Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ kodi@ohio.edu, sarathya@ece.arizona.edu, louri@ece.arizona.edu Sponsored: National Science Foundation (NSF) grant ECCS (at the High Performance Computing Architectures and Technologies Lab, University of Arizona, Tucson) ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 7) Dec 3-4, 27

2 Talk Outline Motivation & Introduction ideal Inter-router Dual-function Energy and Area-efficient Links for NoC architectures Link and Router Architecture Performance Evaluation Power & Area estimation for the Links & Routers Simulation results for Throughput, Latency & Overall network power Conclusions 2

3 Motivation System-on-Chip Chip (SoC) ( SoC) ) paradigm Processing Elements (Processors, DSPs, Peripheral Controllers, Memory Subsystems) UART / GPIO Processor Cores USB / Ethernet controllers (,3) (,3) (2,3) (3,3) (,2) (,2) (2,2) (3,2) NOC Router Channels or Links SRAM/Flash & Memory Controllers (,) (,) (2,) (3,) - Increasing wire delay with decreasing feature size - Scalable, modular interconnect Network-on on-chip (NoC( NoC) (,) (,) (2,) (3,) 3

4 Motivation Generic NoC Router Recent Recent NSF-sponsored NSF-sponsored workshop workshop on on On- On- Chip Chip Interconnection Interconnection Networks Networks :: + x Route Computation (RC) Input Buffers Virtual Channel (VC) Switch Allocator (SA) Crossbar Switch The The most most important important technology technology constraint constraint for for on-chip on-chip networks networks is ispower consumption. consumption. Power Power consumption consumption of of OCINs OCINsimplemented with with current current techniques techniques exceeds exceeds expected expected needs needs by by aa factor factor of of.. -x Power Break-up in the NoC Router + y -y Buffers, 46% Clock Buffer, 6% Arbiter, 3% Processing Element (PE) Crossbar, 35%. Reference : J.D.Owens, W.J.Dally, R.Ho, D.N.Jayasimha, S.W.Keckler and L.S.Peh, Research Challenges for On-Chip Interconnection Networks, IEEE Micro, vol. 27, no. 5, pp. 96 8, September-October 27. 4

ideal Inter-router router Dual-function Energy and Area- efficient Links for NoC architectures ideal ideal Methodology (circuit and and architectural techniques) --Reduce Reduce the the number of of

5 ideal Inter-router router Dual-function Energy and Area- efficient Links for NoC architectures ideal ideal Methodology (circuit and and architectural techniques) --Reduce Reduce the the number of of router router buffers buffers --To To prevent performance degradation, use use adaptive channel buffers buffers to to store store data data along along the the links links when when required -- Dynamic buffer buffer allocation within within the the router router buffers buffers + x -x Route Computation (RC) Input Buffers Virtual Channel (VC) Switch Allocator (SA) Crossbar Switch Adaptive channel buffers along the link -x Route Computation (RC) Input Buffers Virtual Channel (VC) Switch Allocator (SA) Crossbar Switch + y -y Reduced router buffer size + y -y Processing Element (PE) Generic NoC architecture Processing Element (PE) ideal architecture 5

6 Conventional Links Output Port of Router A Input Port of Router B 6

7 ideal Channel Buffer Design (/2) Output Port of Router A Input Port of Router B Control block Control block Congestion 7

8 ideal Channel Buffer Design (2/2) Control block Functions as a conventional repeater when there is no congestion. Control block Repeater tri-stated and holds the sampled value, during congestion. Control block is turned OFF. Control block is turned ON. 8

9 ideal Control Block O/P Port Router A I/P Port Router A Congestion signal CLK CLK2 CLK CLK2 CLK --Power Power efficient --Stable Stable at at varying varying frequencies 9

10 ideal : Dual-function Link Cycle Data-In Cycle 2 Data-In 3 2 Congestion Signal Cycle 3 Data-In 3 2 Congestion Signal Data-Out 3 2 Congestion Release

segment(chl-buffer) (leakage, control block) Control block Control block

11 Link - Power & Area Estimation Output Port of Router A Input Port of Router B P segment(repeater) (Dynamic, leakage, short-circuit) P segment(chl-buffer) (leakage, control block) Control block Control block Congestion P control-blk (inverters, clock, switched-cap.) CLK CLK2 Congestion CLK

ideal Router Buffer Design Input Port P VCID DEMUX Flit Flit r VC State Table Flit vc vc v MUX Static buffer allocation - Fixed number of buffers per VC - HoL blocking Flit r Credit Return VC State

12 ideal Router Buffer Design Input Port P VCID DEMUX Flit Flit r VC State Table Flit vc vc v MUX Static buffer allocation - Fixed number of buffers per VC - HoL blocking Flit r Credit Return VC State Table Congestion Control VC RP WP OP OVC CR C* Status v RP RP = read read pointer, pointer, WP WP = write write pointer, pointer, OP OP = output output port, port, OVC OVC = output output VC, VC, CR CR = credits, credits, C* C* = congestion congestion Status Status = status status of of the the VC VC (idle, (idle, waiting, waiting, RC, RC, VA, VA, SA, SA, ST) ST) 2

13 ideal Router Buffer Design Input Port P Flit Dynamic buffer allocation DEMUX Flit r Flit r+ Flit 2r MUX - Approximately (z + c)/v buffers per VC (z = router buffers, c = channel buffers, v = # of VCs) Credit Return Input Flit Tracking Flit (v-) r + Flit z Write Pointer Unified VC State Table Congestion Control Buffer Slot Availability Read Pointer Output Flit Tracking VC v RP 3 6 WP OP OVC CR Status F F F (z+c)/v N N 5 N Buffer Slot 2 Free Y N N N 3 6 N 5 N N N z N RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion Status = status of the VC (idle, waiting, RC, VA, SA, ST) 3

14 ideal Router Buffer Design Example illustrating Dynamic buffer allocation in ideal Buffer Slot Free N N N NY N N YN 7 N Input Flit Tracking Write Pointer Unified VC State Table Buffer Slot Availability Read Pointer VC 2 3 Output Flit Tracking RP WP OP OVC CR Status F F F 3 F 4 3 N 4 SA ST N 2 N 4 VC N6 N N N N N N 4 Idle N N N N N SA 5 N N Incoming flit (VCID = ) Congestion Control RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion Status = status of the VC (idle, waiting, RC, VA, SA, ST) 4

(SA) Crossbar Switch Crossbar Power Bitlines (Switch + Arbiter) Wordlines

15 Router - Power & Area Estimation Route Computation (RC) Virtual Channel (VC) 6T SRAM cell Buffer Power (P write + P read ) Input Buffers Switch Allocator (SA) Crossbar Switch Crossbar Power Bitlines (Switch + Arbiter) Wordlines Sense Amp Processing Element (PE) Power reduces on decreasing the buffer size 5

16 Performance Evaluation Evaluated on a cycle-accurate on-chip network simulator Simulated 8 x 8 Mesh and 8 x 8 Folded Torus topologies Synthetic benchmarks such as uniform, and non-uniform workloads (Butterfly, Complement, Perfect Shuffle, Matrix Transpose, Bit Reversal) were evaluated Parameters evaluated include throughput, latency and overall network power Considered 5 different configurations (vn V rn R cn C ) (n V = No. of VCs per input port, n R = No. of router buffers per VC, n C = number of channel buffers) Baseline = , 428, 344, 53 6

17 Power Estimation - Summary vnv rnr - cnc Buffer Power (mw) % Change Mesh % Change Folded Torus % Change Link + Control Link + Control Power (mw) Power (mw) v4-r4-c v4-r3-c v4-r2-c v4-r2-c v3-r4-c v3-r3-c v5-r2-c v5-r3-c n V = number of VCs per input port n R = number of router buffers per VC n C = number of channel buffers 7

18 Buffer Power 8x8 Mesh and Folded Torus Buffer Power (8x8 Mesh) UN - Dynamic Buffer Power (8x8 Folded Torus) UN - Dynamic.8.8 Power (watts).6.4 Power (watts) v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Configuration v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Configuration Uniformly distributed traffic traffic Nearly 4% power savings for for 5% buffer size size reduction (428), using Dynamic buffer allocation (428 (428 = 4 VCs VCs per per port, port, 2 router router buffers buffers per per VC, VC, 8 channel buffers) 8

19 Throughput 8x8 Mesh and Folded Torus Throughput (GBps) Throughput (8x8 Mesh) UN - Dynamic v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Throughput (GBps) Throughput (8x8 Folded Torus) UN - Dynamic v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Offered Load (as a fraction of network capacity) Offered Load (as a fraction of network capacity) Uniformly distributed traffic traffic Only about 5% 5% drop in in throughput for for the the case (Dynamic buffer allocation) (428 (428 = 4 VCs VCs per per port, port, 2 router router buffers buffers per per VC, VC, 8 channel buffers) 9

20 Overall Network Power 8x8 Mesh and Folded Torus Total Power (8x8 Mesh) UN - Dynamic Total Power (8x8 Folded Torus) UN - Dynamic Power (watts).5.5 Congestion Power Link Power Crossbar Power Buffer Power Power (watts) Congestion Power Link Power Crossbar Power Buffer Power v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Configuration v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Configuration Total Total power power consumed for for a network load load of of.5.5 Nearly 2% savings for for the the 428, using Dynamic buffer allocation (428 (428 = 4 VCs VCs per per port, port, 2 router router buffers buffers per per VC, VC, 8 channel buffers) 2

21 Buffer Power 8x8 Mesh all Traffic Patterns Power (watts) v4-r4-c v4-r3-c4 v4-r2-c8 Buffer Power (8x8 Mesh) at an offered load =.5 S - UN S - CO S - TO S - PS S - BR S - MT S - NE S - BU D - UN D - CO D - TO D - PS D - BR D - MT D - NE D - BU Ptt Traffic Pattern Reduction Reduction in in power power for for all all configurations, configurations, under under all all traffic traffic patterns, patterns, compared compared to to the the baseline baseline (44) (44) For For example, example, under under Complement Complement traffic traffic the the configuration configuration achieves achieves 45% 45% savings savings under under Static Static allocation allocation and and 37.5% 37.5% savings savings under under Dynamic Dynamic allocation allocation 2

22 Throughput 8x8 Mesh all Traffic Patterns Throughput (GBps) v4-r4-c v4-r3-c4 vr-r2-c8 Throughput (8x8 Mesh) at an offered load =.5 S - UN S - CO S - TO S - PS S - BR S - MT S - NE S - BU Traffic Pattern D - UN D - CO D - TO D - PS D - BR D - MT D - NE D - BU No No significant significant decrease decrease in in throughput throughput under under any any traffic traffic pattern, pattern, using using Dynamic Dynamic allocation allocation 22

23 Conclusion ideal architecture provides a Low-Power Area-efficient solution for NoCs, by reducing power consumption through circuit-level and architecture-level techniques. Simulation results show that by reducing the buffer size in half, a 4-52% savings in power is achieved, with a significant reduction in router area. There is only a marginal -5% drop in performance, under dynamic buffer allocation. Future work will involve (a) Simulation using real-application traces (b) Exploring architectural improvements such as aggressive speculation in the credit loop 23

24 Backup Slides 24

25 Area Estimation Summary with values from Synopsys Design Compiler vnv rnr - cnc Buffer Area (μm 2 ) Link Repeater Area (μm 2 ) Total Buffer + Link Area (μm 2 ) % Change v4-r4-c 8, ,439 - v4-r3-c4 63, , -2.4 v4-r2-c8 48, , v4-r2-c8 48, , v3-r4-c4 63, , v3-r3-c7 5, , v5-r2-c6 53, , v5-r3-c 73, , n V = number of VCs per input port, n R = number of router buffers per VC, n C = number of channel buffers

26 Latency 8x8 Mesh and Folded Torus Average Latency (microsec) Average Latency (8x8 Mesh) - UN - Dynamic v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Offered Load (as a fraction of network capacity) Average Latency (microsec) Average Latency (8x8 Folded Torus) UN - Dynamic v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v5-r3-c Offered Load (as a fraction of network capacity) Uniformly distributed traffic traffic For all all cases (except 53), saturation for for a network load load of of about.3.3 in in case case of of Mesh and and about.4.4 in in case case of of Folded torus 26

27 Comparison with FC-CB CB and DAMQ Throughput (in GBps) Comparison of Saturation Throughput (8x8 Mesh) - Uniform Traffic v4-r3-c4 v4-r2-c8 FC-CB DAMQ Configuration FC-CB FC-CB shows shows similar similar performance as as the the dynamically allocated case case and and achieve nearly nearly 4% 4% increaseinin saturation throughput compared to to FC-CB FC-CB achieves nearly nearly 2.5% 2.5% improvementinin saturation throughput compared to to DAMQ DAMQ Average Latency (in microsec) Comparison of Average Latency (8x8 Mesh) - Uniform Traffic v4-r3-c4 v4-r2-c8 FC-CB DAMQ Offered Traffic (as a fraction of network capacity) 27

28 Power (Watts) Buffer Power (8x8 Mesh) - Uniform Traffic Power calculations using Synopsys Power Compiler Leakage Power Dynamic Power case case shows shows nearly nearly 4% 4% reduction in in buffer buffer power power alone alone Nearly Nearly 3% 3% decrease in in overall overall network power power for for the the case case v4-r4-c v5-r3-c v3-r4-c4 v4-r3-c4 v4-r2-c8 Configuration Power (Watts) Total Power (8x8 Mesh) - Uniform Traffic Control Link Switch Arbiter Buffer 2 v4-r4-c v5-r3-c v3-r4-c4 v4-r3-c4 v4-r2-c8 Configuration 28

Data flow Control Simulated with Synopsys VCS 5 MHz Clock Signal Data_in Congestion input Congestion at stage Congestion at stage 2 Congestion at

29 Data flow Control Simulated with Synopsys VCS 5 MHz Clock Signal Data_in Congestion input Congestion at stage Congestion at stage 2 Congestion at stage 3 Congestion at stage 4 Data_out from stage Data_out2 from stage 2 Data_out3 from stage 3 Data_out4 from stage Time (ns) 29

30 Router - Power Estimation Component Power / Area Calculation C buf (/2 x W x L x C ox ) + (W x L ov x C ox ) Ṕ dynamic a x [k(c o + C p + C buf ) + lc w ] x V DD2 x freq Ṕ leakage Ṕ short-ckt 2 x [/2 x V DD x (I off (W n + W p )k)] a x t rise x W n x k x V DD x I sc x freq Explanation C buf = additional capacitance due to three-state repeater along the links W, L = Width & Length of min. sized inverter C ox = oxide capacitance L ov = gate-drain/source overlap length a = activity factor, k = repeater sizing, l = repeater spacing C o = diffusion capacitance C p = gate capacitance C w = wire capacitance V DD = supply voltage freq = operating frequency I off = subthreshold leakage current W n (W p ) = width of the NMOS (PMOS) in the repeater t rise = rise time of the short-ckt current I sc 3

31 ideal Control Block Self-checking Double-sampling technique for the Control block Output Port of Router A Input Port of Router B MUX MUX Double-sampling the congestion input D Flip-Flop D Flip-Flop Error XOR Clock Error XOR Congestion Clock Delay Buffer Slightly more power (.2 uw v/s.6 uw) and area, but more reliable 3

32 Aggressive Speculation Throughput (in GBps) Saturation Throughput (8x8 Folded Torus) - Uniform Traffic v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v3-r3-c7 Configuration Aggressive speculation by by increasing the the number of of credits credits available to to 8 Additional credits credits are are accounted for for by by the the channel buffers buffers Saturation throughput improves by by % % for for the the case case Average Latency (in microsec) Average Latency (8x8 Folded Torus) - Uniform Traffic v4-r4-c v4-r3-c4 v4-r2-c8 v3-r4-c4 v3-r3-c7 Power (Watts) Total Power (8x8 Folded Torus) - Uniform Traffic Control Link Switch Arbiter Buffer Offered Traffic (as a fraction of network capacity) v4-r4-c v3-r4-c4 v4-r3-c4 v3-r3-c7 v4-r2-c8 Configuration 32

33 Power Estimation Summary with values from Synopsys Power Compiler vnv rnr - cnc Buffer Power (mw) Mesh Link + Control Power (mw) Total Power (Buffer + Link) (mw) % Change v4-r4-c v4-r3-c v4-r2-c v4-r2-c8 v4-r2-c v3-r4-c v3-r3-c v5-r2-c v5-r3-c n V = number of VCs per input port, n R = number of router buffers per VC, n C = number of channel buffers

TECHNOLOGY scaling is expected to continue into the deep

TECHNOLOGY scaling is expected to continue into the deep IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 9, SEPTEMBER 2008 1169 Adaptive Channel Buffers in On-Chip Interconnection Networks A Power and Performance Analysis Avinash Karanth Kodi, Member, IEEE, Ashwini