PIC training: Interconnect System Design


1 PIC training: Interconnect System Design Keren Bergman PhoenixSim Optical hardware Meisam Bahadori, Sébastien Rumley Lightwave Research Laboratory Columbia University Network Application

2 Silicon Photonics for Computing DRAM CMPs 3DI Stack Exaflop-scale high-performance computing system Silicon Photonic Interconnection Network Seamless hierarchical photonic cross-layer communication to the chip Memory Stack CMPs Photonic interconnects support inter-rack communications

3 HPC and Data Centers toward Exascale in a nutshell Exascale equates to 10^18 FLoating point OPerations per second (FLOP/s). Reaching Exascale requires: one CPU performing 10 FLOPs per cycle, clocked at 10^8 GHz, OR 10^8 such CPUs clocked at 1 GHz. Consider 1,000 CPUs placed in a drawer: that's 100K drawers. With 100 drawers per rack, that's still 1,000 racks.

4 Supercomputing Performance Current world top supercomputers are Petascale: #1) Tianhe-2 (China), peak 55 PetaFLOPs (PF); #2) Titan (US), 27 PF; #3) Sequoia (US), 20 PF. Need a 20x improvement factor to reach Exascale. [Figure: average computing performance of the top 3 supercomputers over the past decade]

5 The Major Lag in Data Communications [Figure: top 10 supercomputers' computation capabilities over the past 5 years] Vast increases in parallelism require ever more communication, but bandwidth has stagnated. Over the past 5 years, while system compute power grew by 13X, node I/O bandwidth increased by only < 2X. Data movement is too expensive! ($ and energy)

6 The Real Performance in Decline Since 2010, a growing gap between computing operations and bandwidth: deterioration of Byte/FLOP ratios (communication bytes per computation FLOP). [Figure: Byte/FLOP of the top 10 supercomputers]

7 The Photonic Opportunity for Data Movement Energy-efficient, low-latency, high-bandwidth data interconnectivity is the core challenge to continued scalability across computing platforms. Energy consumption is completely dominated by the costs of data movement, and the bandwidth taper from chip to system forces extreme locality. Goals: reduce energy consumption; eliminate the bandwidth taper.

8 Current interconnect and memory bandwidths Memory interfaces: 100s of Gb/s to terabit/s. DDR4: 200 Gb/s; Wide I/O 2: 500 Gb/s; High Bandwidth Memory: 1-2 Tb/s; Hybrid Memory Cube: 1-4 Tb/s. Network links: 100G is the new standard in HPC (InfiniBand 4x EDR, Intel Omni-Path, Bull Exascale Interconnect); higher bandwidths proposed: 12x25 = 300G, 12x50 = 600G (InfiniBand, 2017). Router chip envelopes: several Tb/s (Cray Aries: 2.2 Tb/s; upcoming Intel Omni-Path: 4.8 Tb/s). Director switch envelope: 64 Tb/s for Mellanox's biggest switch. Era of multi-Tb/s!

9 Estimating bandwidth needs Bandwidth can be related to compute power through the verbosity metric, byte/flop (B/F). Memory bandwidth requirement: ideally up to 8 B/F (for the most demanding algorithms); can be reduced to 0.5 B/F with an HMC cache for fast/near RAM; can be less for bulk DRAM/NVRAM memory (~0.1 B/F). Interconnect requirement: ideally the same as bulk memory (0.1 B/F), but even 0.02 B/F would be progress. Corresponding global (link) bandwidths at Exascale: memory ~500 PB/s (0.5 B/F); interconnect ~400 PB/s (100 PB/s injected at 0.1 B/F, multiplied by 3-4 hops!)
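The verbosity arithmetic above can be sketched as follows (a minimal sketch; the function name and the 4-hop multiplier are illustrative):

```python
# Global bandwidth implied by a byte/flop verbosity at Exascale compute power.

EXAFLOPS = 1e18  # flop/s

def global_bandwidth_bytes(verbosity_bf, flops=EXAFLOPS):
    """Aggregate bandwidth (bytes/s) implied by a byte/flop verbosity."""
    return verbosity_bf * flops

mem_bw = global_bandwidth_bytes(0.5)    # near memory at 0.5 B/F -> 500 PB/s
injected = global_bandwidth_bytes(0.1)  # interconnect at 0.1 B/F -> 100 PB/s
link_bw = injected * 4                  # x 3-4 hops -> ~400 PB/s of link traffic

print(mem_bw / 1e15, injected / 1e15, link_bw / 1e15)  # in PB/s
```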

10 Supercomputing node architecture Exascale system: 20k to 100k such nodes. Multi-CPU die delivering 10s of TF; 3D-stacked near-memory modules such as the Hybrid Memory Cube; bulk and far memory (conventional DRAM or NVRAM); interconnect switch (opaque or transparent); optical network interface (O-NIC); photonic memory links.

11 Node-level bandwidth requirements Assume a 10 Teraflop (TF) node (Exascale with 100K nodes). Near-memory bandwidth: 10 TF x 0.5 B/F x 8 bit/byte = 40 Tb/s (split over ~6-10 individual ~5 Tb/s interfaces). Interconnect bandwidth: 0.01 B/F = 0.8 Tb/s; 0.05 B/F = 4 Tb/s. Bulk memory bandwidth: 0.1 B/F = 8 Tb/s; 0.2 B/F = 16 Tb/s (split over ~1-6 links).

12 Power requirements Today's largest envelopes: Tianhe-2 = 17 MW; RIKEN = 12 MW. Exascale at 100 MW is the maximal consideration: 10 GigaFLOP/J. A 20 MW total system power envelope is preferred: 50 GigaFLOP/J. [Figure: energy efficiencies for the Green500 benchmark (June 2015)]

13 System components power budget Need for 10 to 50 GigaFLOP/J in the next 5 years. ~30-50% of power is non-IT (cooling, power delivery, etc.) [1]

| Power envelope | 10 GigaFLOP/J | 50 GigaFLOP/J | 50 GigaFLOP/J |
| Budget per flop | 100 pJ | 20 pJ | 20 pJ |
| Network % of power | 10% | 10% | 10% |
| Networking budget per flop | 10 pJ | 2 pJ | 2 pJ |
| Network verbosity | 0.01 byte/flop | 0.01 byte/flop | 0.1 byte/flop |
| Budget for a network byte | 1 nJ/byte | 200 pJ/byte | 20 pJ/byte |
| Budget for a network bit | 125 pJ/bit | 25 pJ/bit | 2.5 pJ/bit |
| Memory % of power | 15% | 15% | 15% |
| Memory budget per flop | 15 pJ | 3 pJ | 3 pJ |
| Memory verbosity | 0.5 byte/flop | 0.5 byte/flop | 1 byte/flop |
| Budget for a memory byte | 30 pJ/byte | 6 pJ/byte | 3 pJ/byte |
| Budget for a memory bit | 3.75 pJ/bit | 0.75 pJ/bit | 0.375 pJ/bit |

[1] C.-H. Hsu, S.W. Poole, D. Maxwell, The Energy Efficiency of the Jaguar Supercomputer.
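The per-bit budgets in the table follow directly from the envelope split; a minimal sketch (the function name is illustrative):

```python
# Budget per transported bit = (share of envelope) * (energy per flop)
# / verbosity / 8. Inputs: efficiency (GigaFLOP/J), power share, verbosity (B/F).

def budget_per_bit_pj(gflop_per_joule, share, verbosity):
    """Energy budget per transported bit, in pJ."""
    pj_per_flop = 1e3 / gflop_per_joule        # e.g. 10 GF/J -> 100 pJ/flop
    pj_per_byte = share * pj_per_flop / verbosity
    return round(pj_per_byte / 8, 3)

print(budget_per_bit_pj(10, 0.10, 0.01))   # network column 1: 125.0 pJ/bit
print(budget_per_bit_pj(50, 0.10, 0.01))   # network column 2: 25.0 pJ/bit
print(budget_per_bit_pj(50, 0.10, 0.1))    # network column 3: 2.5 pJ/bit
print(budget_per_bit_pj(10, 0.15, 0.5))    # memory column 1:  3.75 pJ/bit
```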

14 Network energy budget [Figure: energy budget per networking bit (pJ) vs. verbosity (byte/flop), for 10 GigaFLOP/J and 50 GigaFLOP/J at 10% and 15% of the envelope] For verbosities below 0.05 B/F, the energy budget can be ~50 pJ/bit. Above 0.1 B/F, the total network budget falls to ~10 pJ/bit for 10 GF/J and ~2 pJ/bit for 50 GF/J.

15 Network energy requirements End-to-end data movement energy budget. [Figure: energy budget per bit (pJ) vs. verbosity (byte/flop), for 10 GigaFLOP/J and 50 GigaFLOP/J at 10% and 15% of the envelope] Budgets range from 100s of pJ/bit down to 10s of pJ, single pJ, and even sub-pJ (0.25 pJ/bit) at the highest verbosities.

16 Interconnection network energy budget breakdown Example path from source to destination compute node: N = 2 hops in the topology, traversing N+1 = 3 switches over N+2 = 4 links. Budget_network = (N+2) x Budget_link + (N+1) x Budget_switch + 2 x Budget_interface, with Budget_interface = 0 for simplification. Budget_switch: ~50 pJ/bit (today's Cray Aries); ~20 pJ/bit (upcoming Intel Omni-Path); ~5 pJ/bit (minimum for Exascale); ~1 pJ/bit (target for Exascale). What's the remaining link budget? S. Rumley et al., "Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems," ISC-HPC 2015.
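The budget decomposition above, written as a function (a sketch; the interface budget is zeroed as on the slide, and all figures are in pJ/bit):

```python
# Total network budget along an N-hop path:
# (N+2) links + (N+1) switches + 2 interfaces.

def network_budget(n_hops, link, switch, interface=0.0):
    """End-to-end energy budget (pJ/bit) for an n_hops path."""
    return (n_hops + 2) * link + (n_hops + 1) * switch + 2 * interface

# With today's ~50 pJ/bit switches (Cray Aries), the switch term alone
# already blows any Exascale budget for N = 2:
print(network_budget(2, link=0.0, switch=50.0))   # 150.0 pJ/bit
# At the ~1 pJ/bit Exascale switch target, a 0.25 pJ/bit link budget gives:
print(network_budget(2, link=0.25, switch=1.0))   # 4.0 pJ/bit end to end
```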

17 Link energy budget (network portion: 10% in all cases) [Table: for each combination of verbosity (Byte/Flop) and energy efficiency (GigaFLOP/J), the total network budget, switch budget (50 pJ/bit, 5 pJ/bit, or 1 pJ/bit), hop count N, and the resulting per-link budget; with 5 pJ/bit switches the link budgets are a few pJ/bit, and with 1 pJ/bit switches they fall to 100s of fJ/bit] N = 2 requires switch radix ~96; N = 3, switch radix ~48. N = 2: 3 switches, 4 links; N = 3: 4 switches, 5 links.

18 Interconnect costs The network is ~15% of total system cost. $200M is considered a typical Exascale price, so $30M max for the network. Total interconnect bandwidth ~300 PB/s (0.1 B/F): $30M / 300 PB/s = $1 per 10 GB/s = 1.25 cents/Gb/s. Cost reduction required: >100X for 0.1 B/F, >10X for 0.01 B/F. [1] M. Besta, T. Hoefler, "Slim Fly: A Cost Effective Low-Diameter Network Topology," Supercomputing 2014.
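The cost arithmetic above in a few lines (pure arithmetic on the slide's figures):

```python
# $30M network budget spread over ~300 PB/s of interconnect bandwidth.
network_cost = 30e6            # dollars
total_bw = 300e15              # bytes per second (0.1 B/F at Exascale)

usd_per_gbyte_s = network_cost / (total_bw / 1e9)        # $ per GB/s
cents_per_gbit_s = round(usd_per_gbyte_s / 8 * 100, 4)   # cents per Gb/s

print(usd_per_gbyte_s, cents_per_gbit_s)  # $0.1 per GB/s, 1.25 cents per Gb/s
```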

19 Realizing high-BW, low-energy links Not bandwidth per se: what matters is Gb/s/mW and Gb/s/$. This requires exploring the complete design space: relationships between materials/geometries, optical/electrical parameters, thermal effects, optical losses, and energy consumption; the impact of fabrication variability and limitations; applied to subsystems and systems [1]. Design trade-off example: a higher driver voltage increases ring modulator consumption but decreases laser consumption due to improved extinction ratio (ER) [1]. [1] R. Wu, C.-H. Chen, J.-M. Fideli, M. Fournier, R.G. Beausoleil, K.-T. Cheng, "Compact modeling and system implications of microring modulators in nanophotonic interconnects," ACM SLIP 2015.

20 Beyond wire replacement Optics-enabled system architecture transformations: distance-independent, cut-through, bufferless. Conventional hop-by-hop data movement (on-chip, short-distance PCB, long-distance PCB): 12 conversions! Fully flattened end-to-end data movement over an optical link: no conversions!

21 Columbia PhoenixSim: Integrated Multi-Level Modeling and Design Environment Novel design environment enabling HFI across three layers: (1) Application IO primitives: copy a memory array to a remote location; send, multicast, broadcast messages; thread synchronization (e.g., barrier). (2) Network architecture and protocols: link locking mechanisms (frame detection), network topology (routing), arbitration of shared buses and switches. (3) Si photonic hardware implementations: silicon photonic modulators, switches. A complete toolbox of models at each layer ensures interoperability among models and avoids manual adaptation of data between distinct software.

22 Multi-layer environment Application layer (per thread, identified by rank):

    void work_in_parallel(int rank) {
        int[] array = calculate_local_array(rank);
        int dest = determine_next_dest(array);
        copy_array_remote(array, dest, address);
    }

Network layer: handshake, payload transmission, flow control, integrity check; routing, path arbitration, and optical path setup between oNIC 1 (rank) and oNIC 2 (dest) through the switch; data transmission. Hardware layer: SiP switch and SiP WDM demux (ns timescales).

23 Cross-layer iterative optimization [Diagram: the PhoenixSim environment links the application layer (traces via SST [9]/SuperScalar and SST/Macro [10]; application characteristics [4,5]; IO requests), the network layer (optically-sound network architectures; network performance/cost trade-offs [1-3]; DSENT [8]; interface models), and the physical layer (device models [6]; FDTD circuit models [7]; Verilog; hardware-validated and optimized models [7]), exchanging key parameters and application needs across layers]
[1] K. Wen, S. Rumley, K. Bergman, "Reducing Energy per Delivered Bit in Silicon Photonic Interconnection Networks," Optical Interconnects 2014.
[2] S. Rumley, et al., "Low Latency, Rack Scale Optical Interconnection Network for Data Center Applications," ECOC 2013.
[3] R. Hendry, et al., "Modeling and Evaluation of Chip-to-Chip Scale Silicon Photonic Networks," IEEE Hot Interconnects 2014.
[4] S. Rumley, L. Pinals, G. Hendry, K. Bergman, "A Synthetic Task Model for HPC-Grade Optical Network Performance Evaluation," IA^
[5] K. Wen, et al., "Reuse Distance Based Circuit Replacement in Silicon Photonic Interconnection Networks for HPC," IEEE Hot Interconnects.
[6] D. Nikolova, R. Hendry, S. Rumley, K. Bergman, "Scalability of Silicon Photonic Microring Based Switch," ICTON 2014.
[7] S. Rumley, R. Hendry, K. Bergman, "Fast Exploration of Silicon Photonic Network Designs for Exascale Systems," ASCR ModSim Workshop.
[8] C. Sun, et al., "DSENT: A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling," NoCS 2012.
[9] S. Hammond, et al., "Towards a standard architectural simulation framework," Workshop on Modeling & Simulation of Exascale Systems & Applications.
[10] C. L. Janssen, et al., "A simulator for large-scale parallel architectures," International Journal of Parallel and Distributed Systems, 1(2):57-73, 2010.

24 Graphical interface Configuration of the kernel model Configuration of cross-layer parameters Configuration of networking aspects (e.g. switch arbitration) Configuration of hardware parameters and settings

25 Si Photonic physical hardware layer: current features Silicon Photonic WDM links and switch fabrics: Optical signal quality determinants (crosstalk, optical losses, etc.) Photonic network power consumption Photodetectors External laser Chip 1 Optical switch Chip 2 Other chips

26 Physical layer parameters

27 Multi-Level Modeling Environment: Interface Photodetectors External laser Chip 1 Chip 2

28 Multi-Level Modeling Environment: Interface Photodetectors External laser Chip 1 Chip 2

29 Multi-Level Modeling Environment: Interface

30 Multi-Level Modeling Environment: Interface [Diagram: link impairments modeled along the path: OOK modulation, imperfect ER, couplers, jitter, on-chip waveguide, modulator IL, demux truncation, intermodulation crosstalk, demux crosstalk, demux IL]

31 Environment automated optimization: Q-factor Finding the key parameters of ring resonators: the size of the ring impacts resonance and power consumption; size, internal geometry, and proximity to the waveguide impact the quality factor. Ring parameters (doping, size, proximity) are optimized for each architecture design and implementation, balancing per-channel requirements, ideal Q, global conditions, fabrication parameters and limitations, power, and signal quality.

32 Optimization of ring-based demultiplexers The quality factor (Q-factor) is the main ring parameter: Q = lambda/(Delta lambda), the inverse of the ring's 3 dB bandwidth. It must be optimized for each link format. Example: filtering at the demux is subject to a trade-off between truncation of the signal and crosstalk from other signals. [Figure: a low-Q ring causes high truncation of the signal spectrum but very small leakage into other channels; a high-Q ring causes low truncation but crosstalk due to leakage]

33 Optimization of ring-based modulators Penalty trade-offs: insertion loss vs. extinction ratio vs. multiplexing crosstalk. Parameter trade-offs: channel spacing vs. resonance shift vs. Q-factor. Example: low-Q ring (Q = 6000) vs. high-Q ring (Q = 15000) at 1 nm channel spacing. [Table: resonance shift (nm), insertion loss (dB), extinction ratio penalty (dB), crosstalk penalty (dB), ON-OFF penalty (dB), total penalty (dB), and optimum value, for Q = 6000 and Q = 15000] [Bahadori, JLT (under revision)]

34 Example end-to-end results Analysis of the demultiplexing power penalty (PP) for 1 Tb/s (includes filter Q-factor optimization); design optimized for link throughput. [Bahadori, Optical Interconnects 2015] [Bahadori, JLT (under revision)]

35 Network layer [Same multi-layer view as slide 22, now highlighting the network layer: handshake, payload transmission, flow control, integrity check; routing, path arbitration, and optical path setup between oNIC 1 (rank) and oNIC 2 (dest) through the switch; data transmission over the SiP switch and SiP WDM demux hardware]

36 Cross-layer software integration: 6-node example We assume 6 independent ranks (threads), each running on a distinct node. Nodes are connected in a peer-to-peer fashion (all-to-all). Hardware layer: point-to-point (chip-to-chip) SiP WDM links. Network layer: no arbitration, no flow control (simple design). Application layer: a test distributed algorithm. After an initialization phase, the algorithm has N rounds; during each round i, rank R sends a message to destination (R+i), waits until it receives a message from (R-i), then does some processing.

37 Timeline: thread activity visualization [Figure: per-thread processing timeline, with the total time-to-solution indicated]

38 Optimized network link designs for 0.5 Tb/s, 1.0 Tb/s, and 1.5 Tb/s bandwidth densities: 0.5 Terabit/s: 20 wavelengths x 25 Gb/s, 2.39 pJ/bit, 2710 ns. 1 Terabit/s: 38 wavelengths x ~26 Gb/s, 2.64 pJ/bit, 1510 ns. 1.5 Terabit/s: 54 wavelengths x ~28 Gb/s, 2.9 pJ/bit, 1110 ns.

39 Energy/time-to-solution Pareto fronts [Figure: Pareto-optimal vs. sub-optimal designs. 0.5 Terabit/s: 20 wavelengths x 25 Gb/s, 2.39 pJ/bit, 2710 ns. 1 Terabit/s: 38 wavelengths x ~26 Gb/s, 2.64 pJ/bit, 1510 ns. 1.5 Terabit/s: 54 wavelengths x ~28 Gb/s, 2.9 pJ/bit, 1110 ns]

40 Photonic link power-bandwidth trade-off [Figure: energy per bit vs. aggregate line rate (Gb/s), for channel rates below and at 25 Gb/s] Design assumptions: 20% laser wall-plug efficiency; 0.5 mW ring stabilization; 2 mW detector; 1.2 V modulator drive voltage; 100 fF modulator capacitance. System-wide optimizations realize multi-Tb/s links at < pJ/bit.

41 Interconnect power consumption with transparent optical switching Dimensioning an HPC/data center transparent optical network: 40k compute nodes (25 TF each), 0.05 B/F, 10 Tb/s per node. Topology: distance-optimized, uniform traffic at max rate [1]. Transceivers: silicon photonic energy-optimized WDM transceivers (Q_max = 15k, ring stabilization: 0.5 mW). Two switch cases: (1) MEMS-based switch: 320 ports, 0.14 W and 3 Tb/s per port (46 fJ/bit); assume 3 switches traversed at ~150 fJ/bit; 3.5 dB power penalty [2]. (2) Hybrid SOA/MZI switch fabric: ~1 W/port for 64-radix; ~6 dB power penalty; faster, ns-scale switching [3]. Transceiver launch power is calculated for the worst-case path; the laser is assumed always ON at 5% wall-plug efficiency. Model total consumption: (transceivers + switches) / (node injection bandwidth). [1] S. Rumley et al., ISC High Performance 2015. [2] Calient S320 OCS switch. [3] Q. Cheng, A. Wonfor, J.L. Wei, R.V. Penty, I.H. White, Optics Letters 39(18), 2014.

42 PhoenixSim Exascale interconnect power consumption Case 1: MEMS based 134 nodes (25 TFs) connected to each switch (4 depicted here) 299 switches per plane (10 here) 0.05 B/F 53,521 inter-switch connections per plane Up to 3 switches to be traversed (10.5dB) Aggregate line rate: 2 Tb/s (80 x 25Gb/s) 5 planes required to reach 10 Tb/s Launch power per wavelength: -2.22dBm Total consumption: 364 kw Resulting efficiency (end-to-end): 0.91 pj/bit
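The reported efficiency can be cross-checked from the slide's own figures (node count and injection rate from slide 41; a pure arithmetic sketch):

```python
# End-to-end efficiency = total network power / total injected bandwidth.

nodes = 40_000                 # compute nodes
injection_per_node = 10e12     # 10 Tb/s injection per node
total_power = 364e3            # 364 kW for Case 1 (MEMS based)

total_bits_per_s = nodes * injection_per_node        # 4e17 b/s
pj_per_bit = round(total_power / total_bits_per_s * 1e12, 2)
print(pj_per_bit)  # 0.91 pJ/bit, matching the slide
```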

43 PhoenixSim component breakdown analysis, Case 1: MEMS based [Bar chart: end-to-end energy efficiency (pJ/bit), broken into mod/demod, laser, and switch contributions, for: baseline; 3 dB PP (instead of 3.5 dB); 3D-MEMS consumption only; 10% laser WPE (instead of 5%); 960 ports (instead of 320)]

44 PhoenixSim Exascale interconnect power consumption, Case 2: SOA-MZI based 8 nodes connected to each switch (2 depicted here); 5000 switches (32 ports) per plane; 0.05 B/F. Up to 4 switches to be traversed (19.6 dB); 120,000 inter-switch connections per plane. 34 channels at ~59 Gb/s, totaling 2 Tb/s; 5 parallel planes. Launch power per wavelength: 6.27 dBm. Total power consumption: ~2.8 MW. Resulting efficiency (end-to-end): 6.9 pJ/bit.

45 Another factor: optical circuit switching Optical circuit switching has inherently low average utilization. Low utilization is a direct consequence of circuit switching: streaming circuit data cannot be slowed once in motion.

46 OCS: why low average utilizations The optical circuit is the transmission link. When a switch turns, no transmission can occur: turning the switch means breaking circuits, so no circuits can be active over a turning switch. Unless the circuit is never reconfigured, a circuit switch cannot be 100% utilized. Utilization can be high if reconfiguration time << circuit ON time; it is poor if reconfiguration time >= circuit ON time. [Diagram: optical switching (a unique circuit end to end) vs. packet (electrical) switching (input circuit, Xbar circuit, output circuit)]

47 Packet durations shrink with increased bandwidth Packet durations will trend to ~1-10 ns.

| Aggregate line rate | 100 B | 1 KB | 10 KB | 100 KB |
| 100 Gb/s | 8 ns | 80 ns | 800 ns | 8 us |
| 400 Gb/s | 2 ns | 20 ns | 200 ns | 2 us |
| 1 Tb/s | 800 ps | 8 ns | 80 ns | 800 ns |
| 2.5 Tb/s | 320 ps | 3.2 ns | 32 ns | 320 ns |
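The table above follows from duration = packet bits / line rate; a quick sketch (pure arithmetic):

```python
# Serialization time of a packet at a given aggregate line rate.

def duration_ns(packet_bytes, line_rate_bps):
    """Packet serialization time in ns."""
    return round(packet_bytes * 8 / line_rate_bps * 1e9, 3)

print(duration_ns(100, 100e9))       # 8.0 ns
print(duration_ns(100_000, 100e9))   # 8000.0 ns = 8 us
print(duration_ns(100, 2.5e12))      # 0.32 ns = 320 ps
print(duration_ns(1_000, 1e12))      # 8.0 ns
```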

48 Impact of optical circuit switching on utilization Link unavailability time is composed of switch configuration (optical path) and link re-establishment (equilibration, preamble, etc.).

Resulting utilization, worst case (switch turns after every packet):
| Packet duration \ unavailability | 1 ns | 10 ns | 100 ns |
| 100 ns | 99% | 91% | 50% |
| 10 ns | 91% | 50% | 9% |
| 1 ns | 50% | 9% | 1% |

Resulting utilization (switch turns after every second packet):
| Packet duration \ unavailability | 1 ns | 10 ns | 100 ns |
| 100 ns | 99% | 95% | 66% |
| 10 ns | 95% | 66% | 16% |
| 1 ns | 66% | 16% | 2% |

Need circuit down time of no more than ~1 ns!
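The two tables come from simple duty-cycle formulas; a minimal sketch (function names are illustrative):

```python
# Link utilization under circuit switching, given packet duration and
# link unavailability (switch turn + re-establishment), both in ns.

def util_every_packet(packet_ns, downtime_ns):
    """Worst case: the switch turns after every packet."""
    return packet_ns / (packet_ns + downtime_ns)

def util_every_second_packet(packet_ns, downtime_ns):
    """The switch turns after every second packet."""
    return 2 * packet_ns / (2 * packet_ns + downtime_ns)

print(round(util_every_packet(100, 100), 2))        # 0.5  (50%)
print(round(util_every_packet(10, 1), 2))           # 0.91 (91%)
print(round(util_every_second_packet(10, 10), 2))   # 0.67 (~66%)
```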

49 What about the laser energy consumption? Baseline case: 10 Gb/s per wavelength; detector sensitivity: -20 dBm; link optical budget including modulation: 10 dB. Launch power: -10 dBm = 0.1 mW. Laser "wall plug" efficiency: 10%, so laser power: 1 mW. Laser contribution to energy consumption: 0.1 pJ/bit* (*assuming no additional power penalties due to WDM).
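The budget chain above, end to end (a sketch using only the slide's numbers):

```python
# Laser energy per bit from receiver sensitivity, link budget,
# wall-plug efficiency, and line rate.

sensitivity_dbm = -20.0
link_budget_db = 10.0
wall_plug_eff = 0.10
line_rate = 10e9                                   # 10 Gb/s per wavelength

launch_dbm = sensitivity_dbm + link_budget_db      # -10 dBm
launch_w = 10 ** (launch_dbm / 10) * 1e-3          # 0.1 mW optical
laser_w = launch_w / wall_plug_eff                 # 1 mW electrical
pj_per_bit = round(laser_w / line_rate * 1e12, 3)
print(pj_per_bit)  # 0.1 pJ/bit at full utilization
```

At 10% link utilization the same laser power is amortized over 10x fewer bits, which is exactly the 10 dB penalty discussed on the next slides.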

50 The role of link utilization in energy consumption Assume the laser is ON continuously but the link carries real data traffic only 10% of the time. Energy efficiency is inversely proportional to utilization: with 10% utilization, the laser alone consumes the full 1 pJ/bit budget.

51 Laser energy consumption vs. utilization trade-off [Figure: energy efficiency (pJ/bit) vs. link utilization (1% to 100%)] 10% utilization adds 10 dB. Increase energy efficiency by: improved laser efficiency, reduced launch power, better receiver sensitivity, reduced link power penalties. Need a combined factor of 10X improvement to achieve 0.1 pJ/bit at 10% network utilization.

52 Low average utilization is desirable for performance Why is low utilization advantageous? Consider a close-to-100% utilization case: low utilization is needed to guarantee low queuing delays. In particular, queuing of synchronization messages threatens parallel efficiency. S. Rumley et al., "A Synthetic Task Model for HPC-Grade Optical Network Performance Evaluation," IA^

53 Need for ns-scale energy proportionality [Figure: transmission efficiency (pJ/b) vs. number of 10 Gb/s channels, for 1 KB and 100 KB packets, with setup times of 10 ns, 100 ns, 1 us, and 10 us, and with the laser always on] 1 KB packets require setup times of at most ~100 ns, and ~10 ns dynamic proportionality is optimal.

54 Latency performance impact [Figure: latency for 100 KB and 1 KB packets] Head-to-tail latency includes both queuing and serialization times. Keeping the laser ON yields the best performance but the highest energy cost. Adding channels improves performance (reduces serialization times). A laser setup time > 100 ns inflicts a substantial penalty.

55 Performance-energy WIN with dynamic proportionality [Figure: transmission efficiency (pJ/b) vs. average head-to-tail latency (ns) for 1 KB packets at Tb/s rates, with setup times of 10 ns, 100 ns, 1 us, and 10 us, and with the laser always on; adding wavelengths moves along each curve] For Tb/s rates, 1 KB packets require dynamic energy proportionality of ~10 ns. High performance, ultra-low latency AND low energy/bit with dynamic, energy-proportional sources.

56 Summary HPC scalability drives increased interconnect bandwidth: aggregated compute power (needed Byte/s) and growing parallelism and distributed algorithms (B/F). System-wide connectivity and data movement bandwidth are key to performance and scalability. Energy consumption, total interconnection network budget: 0.1 B/F at 50 GigaFLOP/J implies 1 pJ/bit switches and 0.25 pJ/bit links. Laser power: at 1 mW and 10% wall-plug efficiency, the laser consumes 0.1 pJ/bit at 100% utilization; 10% network utilization adds 10 dB, to 1 pJ/bit; a combined 10X improvement is needed to regain 0.1 pJ/bit at 10% network utilization. Unless a circuit is never reconfigured, it cannot be 100% utilized: utilization is high only if reconfiguration << circuit ON time, poor if reconfiguration >= circuit ON time. Packets last 1-10 ns for 1 KB at ~Tbit/s scale, so circuit down time must be minimized, and traffic patterns impact arbitration. Energy proportionality is key.


More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

Optical Interconnection Networks in Data Centers: Recent Trends and Future Challenges

Optical Interconnection Networks in Data Centers: Recent Trends and Future Challenges Optical Interconnection Networks in Data Centers: Recent Trends and Future Challenges Speaker: Lin Wang Research Advisor: Biswanath Mukherjee Kachris C, Kanonakis K, Tomkos I. Optical interconnection networks

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip

More information

CMOS Photonic Processor-Memory Networks

CMOS Photonic Processor-Memory Networks CMOS Photonic Processor-Memory Networks Vladimir Stojanović Integrated Systems Group Massachusetts Institute of Technology Acknowledgments Krste Asanović, Rajeev Ram, Franz Kaertner, Judy Hoyt, Henry Smith,

More information

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability

More information

Network-on-Chip Architecture

Network-on-Chip Architecture Multiple Processor Systems(CMPE-655) Network-on-Chip Architecture Performance aspect and Firefly network architecture By Siva Shankar Chandrasekaran and SreeGowri Shankar Agenda (Enhancing performance)

More information

Network on Chip Architecture: An Overview

Network on Chip Architecture: An Overview Network on Chip Architecture: An Overview Md Shahriar Shamim & Naseef Mansoor 12/5/2014 1 Overview Introduction Multi core chip Challenges Network on Chip Architecture Regular Topology Irregular Topology

More information

The Road from Peta to ExaFlop

The Road from Peta to ExaFlop The Road from Peta to ExaFlop Andreas Bechtolsheim June 23, 2009 HPC Driving the Computer Business Server Unit Mix (IDC 2008) Enterprise HPC Web 100 75 50 25 0 2003 2008 2013 HPC grew from 13% of units

More information

Microprocessor Trends and Implications for the Future

Microprocessor Trends and Implications for the Future Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Switch Datapath in the Stanford Phictious Optical Router (SPOR)

Switch Datapath in the Stanford Phictious Optical Router (SPOR) Switch Datapath in the Stanford Phictious Optical Router (SPOR) H. Volkan Demir, Micah Yairi, Vijit Sabnis Arpan Shah, Azita Emami, Hossein Kakavand, Kyoungsik Yu, Paulina Kuo, Uma Srinivasan Optics and

More information

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp. Networks for Multi-core hips A A ontrarian View Shekhar Borkar Aug 27, 2007 Intel orp. 1 Outline Multi-core system outlook On die network challenges A simple contrarian proposal Benefits Summary 2 A Sample

More information

A Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper

A Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper A Low Latency Solution Stack for High Frequency Trading White Paper High-Frequency Trading High-frequency trading has gained a strong foothold in financial markets, driven by several factors including

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 14: Photonic Interconnect Instructor: Ron Dreslinski Winter 2016 1 1 Announcements 2 Remaining lecture schedule 3/15: Photonics

More information

ECE 486/586. Computer Architecture. Lecture # 2

ECE 486/586. Computer Architecture. Lecture # 2 ECE 486/586 Computer Architecture Lecture # 2 Spring 2015 Portland State University Recap of Last Lecture Old view of computer architecture: Instruction Set Architecture (ISA) design Real computer architecture:

More information

Steve Scott, Tesla CTO SC 11 November 15, 2011

Steve Scott, Tesla CTO SC 11 November 15, 2011 Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost

More information

Silicon Based Packaging for 400/800/1600 Gb/s Optical Interconnects

Silicon Based Packaging for 400/800/1600 Gb/s Optical Interconnects Silicon Based Packaging for 400/800/1600 Gb/s Optical Interconnects The Low Cost Solution for Parallel Optical Interconnects Into the Terabit per Second Age Executive Summary White Paper PhotonX Networks

More information

1. NoCs: What s the point?

1. NoCs: What s the point? 1. Nos: What s the point? What is the role of networks-on-chip in future many-core systems? What topologies are most promising for performance? What about for energy scaling? How heavily utilized are Nos

More information

Multi-Core Microprocessor Chips: Motivation & Challenges

Multi-Core Microprocessor Chips: Motivation & Challenges Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

FUTURE high-performance computers (HPCs) and data. Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture

FUTURE high-performance computers (HPCs) and data. Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture Chao Chen, Student Member, IEEE, and Ajay Joshi, Member, IEEE (Invited Paper) Abstract Silicon-photonic links have been proposed

More information

Moving Forward with the IPI Photonics Roadmap

Moving Forward with the IPI Photonics Roadmap Moving Forward with the IPI Photonics Roadmap TWG Chairs: Rich Grzybowski, Corning (acting) Rick Clayton, Clayton Associates Integration, Packaging & Interconnection: How does the chip get to the outside

More information

170 Index. Delta networks, DENS methodology

170 Index. Delta networks, DENS methodology Index A ACK messages, 99 adaptive timeout algorithm, 109 format and semantics, 107 pending packets, 105 piggybacking, 107 schematic represenation, 105 source adapter, 108 ACK overhead, 107 109, 112 Active

More information

Hybrid Integration of a Semiconductor Optical Amplifier for High Throughput Optical Packet Switched Interconnection Networks

Hybrid Integration of a Semiconductor Optical Amplifier for High Throughput Optical Packet Switched Interconnection Networks Hybrid Integration of a Semiconductor Optical Amplifier for High Throughput Optical Packet Switched Interconnection Networks Odile Liboiron-Ladouceur* and Keren Bergman Columbia University, 500 West 120

More information

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary

More information

100 Gbps Open-Source Software Router? It's Here. Jim Thompson, CTO, Netgate

100 Gbps Open-Source Software Router? It's Here. Jim Thompson, CTO, Netgate 100 Gbps Open-Source Software Router? It's Here. Jim Thompson, CTO, Netgate @gonzopancho Agenda Edge Router Use Cases Need for Speed Cost, Flexibility, Control, Evolution The Engineering Challenge Solution

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad NoC Round Table / ESA Sep. 2009 Asynchronous Three Dimensional Networks on on Chip Frédéric ric PétrotP Outline Three Dimensional Integration Clock Distribution and GALS Paradigm Contribution of the Third

More information

OPTICAL INTERCONNECTS IN DATA CENTER. Tanjila Ahmed

OPTICAL INTERCONNECTS IN DATA CENTER. Tanjila Ahmed OPTICAL INTERCONNECTS IN DATA CENTER Tanjila Ahmed Challenges for Today s Data Centers Challenges to be Addressed : Scalability Low latency Energy Efficiency Lower Cost Challenges for Today s Data Center

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models

Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models Lecture: Memory, Multiprocessors Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row

More information

AIM Photonics: Manufacturing Challenges for Photonic Integrated Circuits

AIM Photonics: Manufacturing Challenges for Photonic Integrated Circuits AIM Photonics: Manufacturing Challenges for Photonic Integrated Circuits November 16, 2017 Michael Liehr Industry Driving Force EXA FLOP SCALE SYSTEM Blades SiPh Interconnect Network Memory Stack HP HyperX

More information

InfiniBand SDR, DDR, and QDR Technology Guide

InfiniBand SDR, DDR, and QDR Technology Guide White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses

More information

EXASCALE COMPUTING: WHERE OPTICS MEETS ELECTRONICS

EXASCALE COMPUTING: WHERE OPTICS MEETS ELECTRONICS EXASCALE COMPUTING: WHERE OPTICS MEETS ELECTRONICS Overview of OFC Workshop: Organizers: Norm Jouppi HP Labs, Moray McLaren HP Labs, Madeleine Glick Intel Labs March 7, 2011 1 AGENDA Introduction. Moray

More information

Johnnie Chan and Keren Bergman VOL. 4, NO. 3/MARCH 2012/J. OPT. COMMUN. NETW. 189

Johnnie Chan and Keren Bergman VOL. 4, NO. 3/MARCH 2012/J. OPT. COMMUN. NETW. 189 Johnnie Chan and Keren Bergman VOL. 4, NO. 3/MARCH 212/J. OPT. COMMUN. NETW. 189 Photonic Interconnection Network Architectures Using Wavelength-Selective Spatial Routing for Chip-Scale Communications

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Future Routing Schemes in Petascale clusters

Future Routing Schemes in Petascale clusters Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Inter/Intra-Chip Optical Network for Manycore Processors Xiaowen Wu, Student Member, IEEE, JiangXu,Member, IEEE, Yaoyao Ye, Student

More information

CS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control

CS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control CS 498 Hot Topics in High Performance Computing Networks and Fault Tolerance 9. Routing and Flow Control Intro What did we learn in the last lecture Topology metrics Including minimum diameter of directed

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Analyzing the Effectiveness of On-chip Photonic Interconnects with a Hybrid Photo-electrical Topology

Analyzing the Effectiveness of On-chip Photonic Interconnects with a Hybrid Photo-electrical Topology Analyzing the Effectiveness of On-chip Photonic Interconnects with a Hybrid Photo-electrical Topology Yong-jin Kwon Department of EECS, University of California, Berkeley, CA Abstract To improve performance

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

PSMC Roadmap For Integrated Photonics Manufacturing

PSMC Roadmap For Integrated Photonics Manufacturing PSMC Roadmap For Integrated Photonics Manufacturing Richard Otte Promex Industries Inc. Santa Clara California For the Photonics Systems Manufacturing Consortium April 21, 2016 Meeting the Grand Challenges

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect

Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect Today s Focus Areas For Discussion Will look at various technologies

More information

100 Gbit/s Computer Optical Interconnect

100 Gbit/s Computer Optical Interconnect 100 Gbit/s Computer Optical Interconnect Ivan Glesk, Robert J. Runser, Kung-Li Deng, and Paul R. Prucnal Department of Electrical Engineering, Princeton University, Princeton, NJ08544 glesk@ee.princeton.edu

More information

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing

More information

Electro-optic Switches Based on Space Switching of Multiplexed WDM Signals: Blocking vs Non-blocking Design Trade-offs

Electro-optic Switches Based on Space Switching of Multiplexed WDM Signals: Blocking vs Non-blocking Design Trade-offs 1 Electro-optic Switches Based on Space Switching of Multiplexed WDM Signals: Blocking vs Non-blocking Design Trade-offs Apostolos Siokis a,c, Konstantinos Christodoulopoulos b,c, Nikos Pleros d, Emmanouel

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies Mohsin Y Ahmed Conlan Wesson Overview NoC: Future generation of many core processor on a single chip

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Update on technical feasibility for PAM modulation

Update on technical feasibility for PAM modulation Update on technical feasibility for PAM modulation Gary Nicholl, Chris Fludger Cisco IEEE 80.3 NG00GE PMD Study Group March 0 PAM Architecture Overview [Gary Nicholl] PAM Link Modeling Analysis [Chris

More information

ECE/CS 757: Advanced Computer Architecture II Interconnects

ECE/CS 757: Advanced Computer Architecture II Interconnects ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction

More information

Hardware Evolution in Data Centers

Hardware Evolution in Data Centers Hardware Evolution in Data Centers 2004 2008 2011 2000 2013 2014 Trend towards customization Increase work done per dollar (CapEx + OpEx) Paolo Costa Rethinking the Network Stack for Rack-scale Computers

More information

The way toward peta-flops

The way toward peta-flops The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe Where things started from DESIGN CONCEPTS 2 New challenges and requirements! Optimal sustained flops

More information

Index 283. F Fault model, 121 FDMA. See Frequency-division multipleaccess

Index 283. F Fault model, 121 FDMA. See Frequency-division multipleaccess Index A Active buffer window (ABW), 34 35, 37, 39, 40 Adaptive data compression, 151 172 Adaptive routing, 26, 100, 114, 116 119, 121 123, 126 128, 135 137, 139, 144, 146, 158 Adaptive voltage scaling,

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer

More information

Package level Interconnect Options

Package level Interconnect Options Package level Interconnect Options J.Balachandran,S.Brebels,G.Carchon, W.De Raedt, B.Nauwelaers,E.Beyne imec 2005 SLIP 2005 April 2 3 Sanfrancisco,USA Challenges in Nanometer Era Integration capacity F

More information

New Approaches to Optical Packet Switching in Carrier Networks. Thomas C. McDermott Chiaro Networks Richardson, Texas

New Approaches to Optical Packet Switching in Carrier Networks. Thomas C. McDermott Chiaro Networks Richardson, Texas New Approaches to Optical Packet Switching in Carrier Networks Thomas C. McDermott Chiaro Networks Richardson, Texas Outline Introduction, Vision, Problem statement Approaches to Optical Packet Switching

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

1. INTRODUCTION light tree First Generation Second Generation Third Generation

1. INTRODUCTION light tree First Generation Second Generation Third Generation 1. INTRODUCTION Today, there is a general consensus that, in the near future, wide area networks (WAN)(such as, a nation wide backbone network) will be based on Wavelength Division Multiplexed (WDM) optical

More information

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to

More information

Exploiting Dark Silicon in Server Design. Nikos Hardavellas Northwestern University, EECS

Exploiting Dark Silicon in Server Design. Nikos Hardavellas Northwestern University, EECS Exploiting Dark Silicon in Server Design Nikos Hardavellas Northwestern University, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 45nm 32nm 22nm 16nm

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Photonics in computing: use more than a link for getting more than Moore

Photonics in computing: use more than a link for getting more than Moore Photonics in computing: use more than a link for getting more than Moore Nikos Pleros Photonics Systems and Networks (PhosNET) research group Dept. of Informatics, Aristotle Univ. of Thessaloniki, Center

More information

EDA for ONoCs: Achievements, Challenges, and Opportunities. Ulf Schlichtmann Dresden, March 23, 2018

EDA for ONoCs: Achievements, Challenges, and Opportunities. Ulf Schlichtmann Dresden, March 23, 2018 EDA for ONoCs: Achievements, Challenges, and Opportunities Ulf Schlichtmann Dresden, March 23, 2018 1 Outline Placement PROTON (nonlinear) PLATON (force-directed) Maze Routing PlanarONoC Challenges Opportunities

More information