Hybrid On-chip Data Networks
Gilbert Hendry, Keren Bergman
Lightwave Research Lab, Columbia University
Chip-Scale Interconnection Networks
- Chip multiprocessors create a need for high-performance interconnects (pictured: Intel Polaris, IBM Cell, AMD Opteron)
- Performance bottleneck of on-chip networks and I/O
- Power dissipation constraints of the chip package
- More than 50% of total power comes from interconnects*
* N. Magen et al., Interconnect-power dissipation in a microprocessor, SLIP 2004.
Motivation
- CMPs of the future = 3D stacking
- Lots of data on chip
- Photonics offers key advantages
Why Photonics?
Photonics changes the rules for bandwidth, energy, and distance.
ELECTRONICS:
- Buffer, receive, and re-transmit at every router.
- Each bus lane is routed independently (P ∝ N_LANES).
- Off-chip bandwidth is pin-limited and power hungry.
OPTICS:
- Modulate and receive a high-bandwidth data stream once per communication event.
- A broadband switch routes the entire multiwavelength stream.
- Off-chip bandwidth = on-chip bandwidth for nearly the same power.
Hybrid Network Premise
- Optical processing is difficult and limited
- Source, destination routing is inefficient
- Use electronics for routing, optics for switching and transmission
- Hybrid circuit-switching
Hybrid Circuit-Switched Networks
Step 1: Path SETUP request. An electronic SETUP message travels from the source core to the destination core.
Step 2: Path ACK. An electronic ACK message returns to the source.
Step 3: Transmit data (figure label: Photonic Switch Use Information).
Meanwhile: Path contention. A contending request receives a path BLOCKED message and backs off.
Step 4: Path TEARDOWN. An electronic TEARDOWN message travels from the source core to the destination core.
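The SETUP / ACK / BLOCKED / TEARDOWN sequence can be illustrated with a minimal sketch. This is not the authors' simulator. The class name `HybridNetwork`, the `(x, y)` node coordinates, and the use of dimension-ordered (XY) routing are all assumptions made for illustration; the slides do not name a routing algorithm.

```python
from dataclasses import dataclass, field


def xy_route(src, dst):
    """Directed links on a dimension-ordered path from src to dst.

    XY routing is an assumption; it is not stated in the slides.
    """
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


@dataclass
class HybridNetwork:
    # Maps each reserved photonic link to the circuit that owns it.
    reserved: dict = field(default_factory=dict)

    def setup(self, circuit_id, src, dst):
        """Steps 1 and 2: the electronic SETUP reserves links hop by hop.

        Returns "ACK" if the whole path was reserved, or "BLOCKED" if any
        link is already held, in which case the partial path is released.
        """
        held = []
        for link in xy_route(src, dst):
            if link in self.reserved:
                for taken in held:
                    del self.reserved[taken]
                return "BLOCKED"
            self.reserved[link] = circuit_id
            held.append(link)
        return "ACK"

    def teardown(self, circuit_id):
        """Step 4: the electronic TEARDOWN frees every link of the circuit."""
        owned = [l for l, owner in self.reserved.items() if owner == circuit_id]
        for link in owned:
            del self.reserved[link]


net = HybridNetwork()
print(net.setup("A", (0, 0), (2, 0)))  # path is free
print(net.setup("B", (1, 0), (2, 0)))  # shares a link with circuit A
net.teardown("A")
print(net.setup("B", (1, 0), (2, 0)))  # retry after backoff
```

In this sketch the data transmission of step 3 would happen between a successful `setup` and the matching `teardown`; the retry of circuit B stands in for the backoff after a BLOCKED message.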
Hybrid Circuit-Switched Networks
Pros:
- Energy-efficient end-to-end transmission
- High bandwidth through WDM
- Electronic network still available for small control messages*
- Network-level support for secure regions
Cons:
- Path setup latency
- Path setup contention (no fairness)
* [G. Hendry et al. Analysis of Photonic Networks for a Chip Multiprocessor Using Scientific Applications. In NOCS, 2009]
Programming and Communication
Shared Memory
"scaling [OpenMP on large systems] often performs worse than message passing due to a combination of false sharing, coherence traffic, contention, and system issues that arise from the difference in scheduling and network interface moderation" ~ Exascale Report
[Figure: spectrum from Implicit Communication to Explicit Communication]
Partitioned Global Address Space
Access         Method
Local Read     Optical receive
Local Write    Optical send
Remote Read    Electronic request, optical receive
Remote Write   Optical send
Shared R/W?
[G. Hendry et al. Circuit-Switched Memory Access in Photonic Interconnection Networks for HPEC. In Supercomputing, Nov. 2010]
Message Passing
- Complex, dynamic access patterns
- Relatively larger blocks of data
- Scientific computing
* [G. Hendry et al. Analysis of Photonic Networks for a Chip Multiprocessor Using Scientific Applications. In NOCS, 2009]
Streaming
- Embedded / specialized systems (graphics, image and signal processing)
- Execution mode of general-purpose systems (Cell processor)
- Persistent optical circuits
[Figure: four numbered nodes (1 to 4) between input data and output data]
Electronic Plane
Electronic Router
[Figure: control router block diagram. Labels: buffers, buffer control, routing logic, arbiter, flow control (credits in), crossbar allocation, crossbar control, data switch, data path, ring control, request bus]
- Low-frequency operation (~1 GHz)
- 1 VC (typically)
- Small buffers (64-28)
- Narrow channels (8-32)
Network Gateway
[Figure: network gateway. Labels: cores with external concentration; network interface with serialization, deserialization, drivers, receivers, and Tx/Rx; bidirectional electronic channel to/from the control plane (control router, electronic crossbar); bidirectional waveguide to/from the data plane (5-port photonic switch)]
[P. Kumar et al. Exploring concentration and channel slicing in on-chip network router. In NOCS, 2009]
The Photonic Plane
Silicon Photonic Waveguide Technology
- Silicon photonic waveguides provide low-power optical interconnects in a CMOS-compatible platform.
- Low-loss (1.7 dB/cm), high-bandwidth (> 200 nm) silicon photonic waveguides can be fabricated in a commercial CMOS process.
- 1.28 Tb/s data transmission experiment (occupies a small slice of the available waveguide bandwidth) [B. G. Lee et al., Photon. Technol. Lett. 20 (10) 767 (2008)]
[Figure: eye diagrams (100 ps scale) for channels C23 (1559 nm), C28 (1555 nm), C46 (1541 nm), and C51 (1537 nm), before injection into the waveguide and after the 5-cm waveguide and EDFA]
[Vlasov and McNab, Optics Express 12 (8) 1622 (2004)]
Silicon Photonic Modulator and Detector Technology
[Figure: (CW) laser, modulator, detector]
Modulators:
- 18 Gb/s demonstrated
- 85 fJ/bit demonstrated at 10 Gb/s; scalable to < 25 fJ/bit
Ge-on-Si detectors:
- 40-GHz bandwidths
- 1 A/W responsivities
Receivers (detectors with CMOS amplifiers):
- 1.1 pJ/bit demonstrated at 10 Gb/s; scalable to < 50 fJ/bit
[M Lipson, Optics Express (2007)] [M Watts, Group Four Photonics (2008)] [S Koester, J. Lightw. Technol. (2007)]
Silicon Photonic Micro-Ring Switch Explanation
[Figure: transmission (in_i to out_i) of a micro-ring switch with ports in0, in1, out0, out1. Labels: bar state; cross state; no current, on-resonance; current, off-resonance]
- Fast control of the resonance wavelength via carrier injection
Higher Order Switch Designs
On-Chip Topology Exploration
- Photonic Torus, Nonblocking Photonic Torus [A. Shacham et al., Trans. on Comput., 2008] [M. Petracca et al. IEEE Micro, 2008]
- TorusNX, Square Root [J. Chan et al. JLT, May 2010]
Photonic Plane Characteristics
- Insertion Loss
- Noise
- Power
Insertion Loss and Optical Power Budget
[Figure: optical power budget. Labels: nonlinear effects, WDM factor, worst-case insertion loss, detector sensitivity]
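One common way to combine the four quantities named on this slide is sketched below. It assumes that the budget is the gap between the power limit set by nonlinear effects and the detector sensitivity, and that the budget must cover the worst-case insertion loss plus a WDM factor of 10·log10(N) for N wavelengths. That relation, the function name, and the numbers in the usage lines are assumptions for illustration, not values from the slides.

```python
import math


def max_wavelengths(nonlinear_limit_dbm, detector_sensitivity_dbm, worst_case_il_db):
    """Largest wavelength count that fits the optical power budget.

    Assumed relation (not stated on the slide):
        budget = nonlinear_limit - detector_sensitivity
        budget >= worst_case_il + 10 * log10(num_wavelengths)
    """
    budget_db = nonlinear_limit_dbm - detector_sensitivity_dbm
    margin_db = budget_db - worst_case_il_db
    if margin_db < 0:
        return 0  # even a single wavelength does not close the link
    return math.floor(10 ** (margin_db / 10))


# Hypothetical limits: +20 dBm nonlinear limit, -20 dBm sensitivity (40 dB budget).
print(max_wavelengths(20, -20, 30))  # 10 dB of margin
print(max_wavelengths(20, -20, 20))  # 20 dB of margin
print(max_wavelengths(20, -20, 45))  # loss exceeds the budget
```

Under this assumed model every 10 dB of insertion loss removed from the worst-case path allows ten times as many wavelength channels, which is why insertion loss and bandwidth trade off directly.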
Insertion Loss vs. Bandwidth
- Number of λ
- Topologies
- Network Size
Simulation Results
[Figure: insertion loss (dB) versus topology size (nodes), one panel per topology, broken down into propagation, crossing, and dropping into a ring]
Labelled insertion loss in dB (values matched to sizes in increasing order):
Size     Torus   Non-Blocking Torus   TorusNX
4×4      20.6    18.7                 15.8
6×6      25.6    25.3                 19.5
8×8      31.2    31.5                 23.2
10×10    37.0    38.0                 27.1
12×12    42.8    44.1                 31.0
14×14    48.6    50.6                 34.9
16×16    54.5    56.8                 38.8
18×18    60.3    63.2                 42.7
Square Root: 12.2 (4×4), 21.5 (8×8), 30.6 (16×16)
Simulation Results
"Original" is based on the insertion loss results from the previous slide; "Improved" is based on a hypothetical improvement in crossing loss from 0.15 dB to 0.05 dB.
[Figure: number of wavelength channels (log scale, 1 to 100) versus number of access points for the Torus, Non-Blocking Torus, TorusNX, and Square Root topologies, limited by the optical power budget]
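A short worked example shows why the per-crossing loss matters so much. The crossing count of 100 is a hypothetical path, not a number from the slides, and the conversion from saved decibels to a wavelength multiplier assumes that the WDM factor scales as 10·log10(N).

```python
def crossing_loss_db(num_crossings, loss_per_crossing_db):
    """Total loss contributed by waveguide crossings along one path."""
    return num_crossings * loss_per_crossing_db


def wavelength_multiplier(saved_db):
    """Factor by which the wavelength count can grow for a given dB saving.

    Assumes the WDM factor is 10 * log10(number of wavelengths).
    """
    return 10 ** (saved_db / 10)


crossings = 100  # hypothetical worst-case path
original = crossing_loss_db(crossings, 0.15)  # "Original" crossing loss
improved = crossing_loss_db(crossings, 0.05)  # "Improved" crossing loss
saved = original - improved

print(f"original: {original:.1f} dB, improved: {improved:.1f} dB")
print(f"saved: {saved:.1f} dB, wavelength multiplier: {wavelength_multiplier(saved):.1f}x")
```

For this hypothetical path the crossing loss drops from about 15 dB to about 5 dB, and the roughly 10 dB saved corresponds to about ten times as many wavelength channels under the stated assumption.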
Noise and Crosstalk
[Figure: noise sources in the photonic plane. Labels: laser noise, modulation noise, inter-message crosstalk, intra-message crosstalk, crosstalk, filter, coherent noise, incoherent noise]
Optical SNR
Effects of Noise
- Network Size
- Number of λ
- Network Load
Simulation Results
Results:
- Results are plotted for a network size of 8×8 at saturation, measured at the detectors.
- Maximum OSNR = ~45 dB (due to laser noise)
- Minimum OSNR < 17 dB (due to message-to-message crosstalk)
- Variations between networks are due to the varying likelihood of two messages intersecting on each network topology.
System Performance:
- SNR measures the likelihood of error-free transmission. Lower-SNR designs will require additional retransmission, resulting in lower throughput.
- The line at OSNR = 16.9 dB is where a bit-error rate of 10^-12 can be achieved, assuming an ideal binary receiver circuit and orthogonal signaling.
[Figure: optical SNR (dB, 0 to 50) versus message size (bit, 10^0 to 10^7) for Torus, Non-blocking Torus, TorusNX, and Square Root]
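The 16.9 dB threshold can be checked with the textbook error-rate expression for binary orthogonal signaling in Gaussian noise, BER = 0.5·erfc(sqrt(SNR/2)). Whether the authors used exactly this expression is an assumption; the sketch only shows that it is consistent with the quoted threshold. The function name is illustrative.

```python
import math


def ber_from_snr_db(snr_db):
    """Bit-error rate for binary orthogonal signaling with an ideal receiver.

    Uses BER = 0.5 * erfc(sqrt(SNR / 2)) with SNR as a linear power ratio.
    This is the textbook formula, assumed here rather than taken from the slides.
    """
    snr_linear = 10 ** (snr_db / 10)
    return 0.5 * math.erfc(math.sqrt(snr_linear / 2))


for snr_db in (14.0, 16.9, 20.0):
    print(f"OSNR {snr_db:5.1f} dB -> BER {ber_from_snr_db(snr_db):.2e}")
```

At 16.9 dB the expression gives a bit-error rate on the order of 10^-12, and the error rate falls steeply above that point, so designs whose minimum OSNR sits near 17 dB have little margin.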
Power Usage
- Laser power
- Active power: modulating, detecting, broadband
- Static power: thermal tuning
- Tx/Rx power: drivers, TIAs
[Figure: ring resonator with electronic control (p-region and n-region, 0 V and 1 V) and thermal control (ohmic heater, 0 V and 1 V); transmission plot showing the off-resonance and on-resonance profiles against the injected wavelengths]
Energy Per Bit
[Figure: energy per bit (J/bit, 10^-13 to 10^-7) versus message size (bit, 10^0 to 10^7) for Torus, Non-blocking Torus, TorusNX, and Square Root]
Power Breakdown
Component            Torus     Nonblocking Torus   Square Root   TorusNX
Modulator            4%        4%                  14%           17%
Detector             3%        2%                  8%            10%
PSE                  2%        2%                  2%            1%
Thermal              1%        1%                  4%            3%
Electronic Wire      3%        2%                  7%            1%
Router Logic         43%       45%                 34%           37%
Router Buffer        44%       44%                 31%           31%
Wavelengths          12        7                   27            38
Power Dissipation    4.31 W    1.59 W              1.89 W        3.22 W
Each wavelength runs at 10 Gbps. Results are based on randomly generated traffic with message sizes of 100 kbit, with the network in saturation. Data was collected on 64-node topologies constrained to a total surface area of 2 cm × 2 cm.
Performance
Other Interesting Issues
Memory Access
[Figure: chip layout with processor cores, network routers, and memory access points]
[G. Hendry et al. Circuit-Switched Memory Access in Photonic Interconnection Networks for HPEC. In Supercomputing, Nov. 2010]
Other Arbitration Means - TDM
[G. Hendry et al. Silicon Nanophotonic Network-On-Chip Using TDM Arbitration. In HOTI, Aug. 2010]
Wavelength Granularity
[Figure: original design and re-design, each annotated with λ]
- Scalable number of WDM channels
Conclusion
- Some applications and programming models are well suited to a circuit-switched photonic network
- Interesting tradeoffs and design space