Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1

Size: px

Start display at page:

Download "Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1"

Alaina Parsons
5 years ago
Views:

1 Future of Interconnect Fabric A ontrarian View Shekhar Borkar June 13, 2010 Intel orp. 1

2 Outline Evolution of interconnect fabric On die network challenges Some simple contrarian proposals Evaluation and benefits Summary 2

Good-old Bus hip hip hip hip Good at board level,

processing power Frequency Width Energy/bit Data

3 Good-old Bus hip hip hip hip Good at board level, does not extend well Transmission lines Loss, signal integrity Need signal termination Limited frequency Width limited by pins, board area Signal processing power Frequency Width Energy/bit Data rate Power MHz bit 1 nj 1-10 GB/sec 10+ W 3

complexity and latency in each node Frequency 200-3000 MHz Width 8-32 bit Energy/bit 0.

4 Point-to-point Bus hip hip Fast signaling, long distance Between chips, boards, racks High frequency, narrow links 1D Ring, 2D Mesh and Torus to reduce latency Higher logic complexity and latency in each node Frequency MHz Width 8-32 bit Energy/bit 0.5 nj Bisection bandwidth 50+ GB/sec Power 200+ W Emergence of packet switched network 4

5 ASI Red The first TFLOP 32 X 32 X 2 Mesh Interconnect Simultaneous bidirectional signaling 50 GB/sec bisection bandwidth 5

6 21.72mm The 80-ore TFLOP hip (2006) 12.64mm I/O Area single tile 1.5mm 2.0mm 8 X 10 Mesh PLL I/O Area TAP 32 bit links 320 GB/sec bisection 5 GHz 6

7 DDR3 M DDR3 M 21.4mm DDR3 M DDR3 M 48 ore Single hip loud (2009) 26.5mm TIL E PLL TIL E JTAG VR System Interface + I/O 2 ore clusters in 6 X 4 Mesh (why not 6 x 8?) 128 bit links 256 GB/sec bisection 2 GHz 7

8 Power Breakdown 80 ore TFLOP hip 48 ore S hip lock dist. 11% Dual FPMAs 36% M & DDR % ores 70% Router + Links 28% 10-port RF 4% IMEM + DMEM 21% Routers & 2Dmesh 10% Global locking 1% Does pt-to-pt packet switched network on a chip make sense? 8

9 A Sample Multi-core System 10mm ore ache 50% 50% 65nm, 4 ores 1V, 3GHz 10mm die, 5mm each core ore Logic: 6MT, ache: 44MT Total transistors: 200M 45nm 10mm 32nm 10mm 22nm 10mm 16nm 10mm 8 ores, 1V, 3GHz 3.5mm each core 16 ores, 1V, 3GHz 2.5mm each core 32 ores, 1V, 3GHz 1.8mm each core 64 ores, 1V, 3GHz 1.3mm each core Total: 400MT Total: 800MT Total: 1.6BT Total: 3.2BT 9

10 A Sample M Network 5mm 0.4mm Packet Switched Mesh 16B=128 bit each direction 1.5u pitch 192GB/s Bisection BW Tech ore (mm) Port size (mm) 65nm nm nm nm nm Bisection BW GB/sec@3GHz 10

11 Switch Energy/bit (pj) Network Power (W) Mesh 3GHz, 1V Measured X 1.4X 1 Extrapolated u 0.18u 65nm 22nm 8nm 0 65nm 45nm 32nm 22nm 16nm 1. Network power will be too high 2. Worse if link width scales up each generation 3. ache coherency mechanism is complex 11

12 Delay (ps) Energy/bit (pj) Interconnect Delay & Energy u Pitch Length (mm)

13 Delay (ps) mw/bit Interconnect Delay & Power nm, 3GHz Router Delay Length (mm) 0 13

14 Packet Switched Interconnect STOP STOP STOP STOP 5mm 3-5 lk <1 lk 3-5 lk 5 lk 0.5mW 0.5mW 0.5mW 10mm 2-3 lk 3-5 lk 3-5 lk 6-7 lk 0.5mW 1mW 0.5mW 1. Router acts like STOP signs adds latency 2. Each hop consumes power (unnecessary) 14

15 Bus The Other Extreme Issues: Slow, < 300MHz Shared, limited scalability? Solutions: Repeaters to increase freq Wide busses for bandwidth Multiple busses for scalability Benefits: Power? Simpler cache coherency Move away from frequency, embrace parallelism 15

5u bus pitch 50ps repeater delay ore (mm) Bus Seg Delay (ps) Max Bus Freq (GHz) 65nm 5 195 2.2 45nm 3.5 99 2 32nm 2.5 51 1.8 22nm 1.8 26 1.

16 Repeated Bus (ircuit Switched) O R R R R R Arbitration: Each cycle for the next cycle Decision visible to all nodes Repeaters: Align repeater direction No driving contention R R R Assume: 10mm die, 1.5u bus pitch 50ps repeater delay ore (mm) Bus Seg Delay (ps) Max Bus Freq (GHz) 65nm nm nm nm nm Anders et al, A 2.9Tb/s 8W 64-ore ircuit-switched Network-on-hip in 45nm MOS, ESSIR 2008 Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming ircuit-switched 8 8 Mesh Network-on-hip in 45nm MOS, ISS

17 Example of a Bus Repeater 17

Other Bus Enhancements Differential, low voltage swing Twisted to reduce cross-talk Optimal repeater placement Not necessarily at the core Higher bus

18 Other Bus Enhancements Differential, low voltage swing Twisted to reduce cross-talk Optimal repeater placement Not necessarily at the core Higher bus frequency Wide bus, 1024 bit or more, transfer lots of data in one cycle Multiple busses for concurrency Employ interconnect engineering techniques 18

19 Bisection BW (GB/s) Bus BW (GB/s) Full Swing Bus Power (W) 100mV Swing Bus Power (W) Bus Power and Bandwidth X Includes bus and repeater power Full Swing 6 1X 0.1V Differential 5 1.4X nm 45nm 32nm 22nm 16nm 7 1.4X 65nm 45nm 32nm 22nm 16nm X 1.4X Mesh X 1.4X Bus nm 45nm 32nm 22nm 16nm 0 65nm 45nm 32nm 22nm 16nm 19

20 A ircuit Switched Network Routers 8x8 ircuitswitched No Packet-switched Request Plk Src 0 1 n Dest ircuit-switched Acknowledge lk ircuit-switched Data Transmission Routers lk 2mm links ircuit-switched No eliminates intra-route data storage Packet-switching used only for channel requests High bandwidth and energy efficiency (1.6 to 0.6 pj/bit) Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming ircuit-switched 8 8 Mesh Network-on-hip in 45nm MOS, ISS

21 hannel Selection & Transmission ircuit-switched hannel Selection Router Queues 2 hannel Not Ready 1 2 Requested hannel Selected Slot hannel Ready Src Ack 1 Dest Ack Queue slots store multiple channel requests Two acknowledges required to indicate ready channel More possible channels 87% throughput increase 21

22 8 x 8 NO Die Photo lock 8x8 No I/O Traffic Generation and Measurement Ports ore North South East West Process 45nm Hi-K/MG MOS, 9 Metal u Nominal Supply 1.1V Arbitration and Router Logic Supports 512b data Data Bus Width 1b (2mm link distance) Number of Transistors 2.85M Die Area 6.25mm 2 22

23 Factors Affecting Latency Packet Switched Arbitration in each node, multiple arbitration cycles ircuit Switched Single arbitration for entire transaction Multiple hops from source to destination Pipelined data flow 3-5 lock latency in each node One time latency to establish a circuit Fast clock (3 GHz) Slow clock (1 GHz) One source and destination Broadcast possible 23

24 Truth in between Bus R Bus R Bus Bus to connect over short distances Bus R 2 nd Level Bus Bus R Hierarchy of Busses Or hierarchical circuit and packet switched networks 24

25 Link Width (a.u.) Bytes/Op Hierarchical & Heterogeneous Local Regional luster Global Local. Wide Slow 25

26 But wait, what about Optical? 65nm Optical: Pre-driver Driver VSEL TIA LA DeMUX DR lk Buffer Possible today Hopeful Source: Hirotaka Tamure (Fujitsu), ISS 08 Workshop on HS Transceivers 26

27 Summary Point to point busses are not necessary for on-die networks Rings and meshes were devised for point to point busses over long distances overkill for on chip network? Router power could be prohibitive Wide busses, circuit switched networks, show promise Hierarchical, heterogeneous, tapered, circuit and packet switched schemes suitable for on-die networks Go slower, wider, simpler, and efficient 27

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp. Networks for Multi-core hips A A ontrarian View Shekhar Borkar Aug 27, 2007 Intel orp. 1 Outline Multi-core system outlook On die network challenges A simple contrarian proposal Benefits Summary 2 A Sample