Analyzing the Effectiveness of On-chip Photonic Interconnects with a Hybrid Photo-electrical Topology



Yong-jin Kwon
Department of EECS, University of California, Berkeley, CA

Abstract
To improve performance and energy efficiency in future manycore processors, the on-chip interconnect must be optimized. Electrical interconnects are well suited to short-distance communication. Silicon photonics, by contrast, is a promising new interconnect technology that offers lower power, higher bandwidth density, and shorter latencies. However, the trade-offs between these two interconnect technologies are not as clear-cut as one might think. To assess the future of silicon photonics, we use analytical modeling to evaluate a hybrid opto-electrical topology that makes the best use of both electrical and photonic technology. The results show that without large improvements in silicon-photonic technology, even the most effective use of on-chip photonic interconnects might not be worth the investment.

1. Introduction
As the number of available transistors increases with technology scaling, the number of on-chip processor cores is expected to keep rising. As a result, the on-chip interconnect network is becoming a significant factor in a processor's performance, power consumption, and area. An ideal network has low latency and high bandwidth while consuming minimal on-chip power and area. Unfortunately, current electrical on-chip networks cannot balance these factors effectively because of the limitations of electrical link technology. A possible alternative is to use on-chip silicon-photonic technology to build energy- and area-efficient links for interconnect networks. A wavelength-division multiplexed (WDM) photonic link provides high bandwidth density, and its dynamic energy scales well with utilization.
However, two shortcomings of photonic links must be addressed before photonic interconnects can be used effectively. 1) At low link utilization, the static power overhead of the optical components dominates and drives power consumption above that of the electrical counterpart. In today's systems, much of the processor is idle at any given time, so this static overhead must be reduced to implement an efficient photonic interconnect network. 2) Previous studies of photonic interconnects have shown that photonics is only worth the investment for long global connections [1]; short local wires are better implemented electrically for optimum power and latency. These two disadvantages of photonic interconnects are the focus of our analysis. To analyze the potential benefits of photonic interconnects, we propose a hybrid photo-electrical topology that leverages the strengths of both electrical and photonic links. Because the photonic and electrical networks share router resources, we can turn off the photonic network during periods of low utilization and save static power that would otherwise be wasted. The photonic channels must also be long to be potentially viable in the on-chip domain. These requirements impose two constraints on our network: 1) the electrical network must provide all-to-all connectivity when the optical network is off, and 2) the photonic network must be global, so that it benefits the electrical network when turned on. Prior work [2] has shown that the concentrated mesh is an effective electrical network topology. To satisfy both conditions with the least power and area overhead, we propose an electrical concentrated mesh with added photonic express channels. This paper begins by identifying our target system and outlining the important technology attributes. We then analyze the topologies used in related work, which motivates our design.
We then describe our proposed hybrid topology and present simulation results for it. Our results show that significant advances in on-chip photonic interconnect technology are needed before photonics can be used effectively in future manycore processors.

2. Target System
Silicon-photonic technology for on-chip communication is still in its formative stages. With recent technology advances, however, we project that photonics might be viable in the late 2010s. Trends show that by this time, hundreds of cores will be integrated on a single die. To reduce design and verification complexity, these cores and/or memory will most likely be clustered into tiles. The tiles are then replicated across the chip and interconnected with a well-structured on-chip network. The exact nature of the tiles and the inter-tile communication paradigm are still active areas of research. The tiles might be homogeneous, with each tile including some number of cores and a slice of the on-chip memory. Alternatively, they might be heterogeneous, with a mix of compute and memory tiles. The global on-chip network might be used to implement shared memory, message passing, or both. Regardless of the exact configuration, all future systems will require some form of on-chip network that provides low-latency, high-throughput communication at low energy and in a small area. For this paper we assume a target system with tiles operating at 5 GHz on a 400 mm² chip. We limit our study to 64 tiles because the projected chip area will not increase much in the near future, but our analysis scales to a chip with more tiles.

3. Underlying Technology
This section summarizes the properties of electrical and photonic technologies projected for the late-2010s timeline. From this analytical basis, we draw conclusions about related work in Sections 4 and 5 that ultimately motivate this paper.

3.1 Electrical Technology
For this work we use the 22 nm predictive technology models [3] and interconnect projections from [4] and the ITRS.
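The electrical channel figures in this subsection follow from the target system of Section 2, and a few lines of arithmetic show how. The sketch below is a back-of-the-envelope check, not the model used in this paper. It makes four assumptions:

- tiles are square and laid out on an 8×8 grid;
- each flit bit uses its own wire;
- data is uniformly random, so each wire toggles with probability 0.5;
- the 20 fJ fixed cost quoted in the next paragraph is charged per wire per cycle.

The constant and function names are illustrative.

```python
import math

# Target system (Section 2): 64 tiles on a 400 mm^2 die, clocked at 5 GHz.
DIE_AREA_MM2 = 400.0
NUM_TILES = 64
CLOCK_HZ = 5e9

# Electrical channel figures for a 2.5 mm repeated wire.
E_TRANSITION_FJ = 160.0        # energy per bit transition over 2.5 mm in one cycle
E_FIXED_FJ_PER_CYCLE = 20.0    # leakage + clocking; ASSUMED to be per wire per cycle

def tile_width_mm(die_area_mm2: float, num_tiles: int) -> float:
    """Edge length of a square tile, assuming the tiles form a square grid."""
    return math.sqrt(die_area_mm2 / num_tiles)

def cycle_time_ps(clock_hz: float) -> float:
    """Clock period in picoseconds."""
    return 1e12 / clock_hz

def flit_channel_energy_pj(flit_bits: int, toggle_prob: float = 0.5) -> float:
    """Energy (pJ) to send one flit over a 2.5 mm channel.

    ASSUMES one wire per bit and uniformly random data, so each wire
    toggles with probability `toggle_prob` (0.5 for random data).
    """
    dynamic_fj = flit_bits * toggle_prob * E_TRANSITION_FJ
    fixed_fj = flit_bits * E_FIXED_FJ_PER_CYCLE
    return (dynamic_fj + fixed_fj) / 1000.0

if __name__ == "__main__":
    print(f"tile width: {tile_width_mm(DIE_AREA_MM2, NUM_TILES):.2f} mm")
    print(f"cycle time: {cycle_time_ps(CLOCK_HZ):.0f} ps")
    print(f"128 b flit: {flit_channel_energy_pj(128):.1f} pJ")
```

Under these assumptions, the tile edge comes out at 2.5 mm and the cycle time at 200 ps. A 128 b flit costs about 12.8 pJ (64 toggling wires × 160 fJ plus 128 wires × 20 fJ). These values are consistent with the 2.5 mm, 200 ps, and roughly 13 pJ figures quoted below.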
All of our inter-router channels are implemented in semi-global metal layers with standard repeated wires. For medium-length wires (2–3 mm, or approximately the width of a tile), the repeater sizing and spacing are chosen to minimize energy at the target cycle time. Longer wires are energy-optimized as well as pipelined to maintain throughput. The average energy to transmit a bit transition over a distance of 2.5 mm in 200 ps is roughly 160 fJ, while the fixed link cost due to leakage and clocking is 20 fJ per cycle. The wire pitch is only 500 nm, so ten thousand wires can be supported across the bisection of our target chip, even with extra space for power distribution and vias. Given the abundance of on-chip wiring resources, interconnect power dissipation will likely be a more serious constraint than bisection bandwidth for most network topologies.

We assume a relatively simple router microarchitecture that includes input queues, round-robin arbitration, a distributed tristate crossbar, and output buffers. The routers in our multi-hop networks have similar radices, so we fix the router latency at two cycles. For a 5×5 router with 128 b flits of uniformly random data, we estimate the energy to be 16 pJ/flit. Sending a 128 b flit across a 2.5 mm channel consumes roughly 13 pJ, which is comparable to the energy required to move the flit through a simple router. Future on-chip network designs must therefore carefully consider both channel and router energy, and to a lesser extent area.

3.2 Photonic Technology
A simple wavelength-division multiplexed (WDM) photonic link with typical components is shown in Figure 1. A laser source generates two wavelengths (λ1, λ2) on an optical fiber that is coupled into an on-chip waveguide. The waveguide carries the light past a series of transmitters, each of which uses a resonant ring modulator to imprint data on the corresponding wavelength. The modulated light continues through the waveguide to the destination. There, each of the two receivers uses a tuned resonant ring filter to drop the corresponding wavelength from the waveguide into a local photodetector. The photodetector converts the optical signal into an electrical one, which the electrical receiver then senses.

Figure 1: Photonic Components. Two point-to-point photonic links implemented with WDM.

There are two proposed implementations of the photonic waveguide: 3D integration and monolithic integration. Because both methods are in their early stages, each has significant pros and cons. With 3D integration, a separate specialized die or layer is used for photonic devices. Devices can be implemented in monocrystalline silicon-on-insulator (SOI) dies with thick layers of buried oxide (BOX) [5], or in a separate layer of silicon nitride (SiN) deposited on top of the metal stack [6]. In this separate die or layer, customized processing steps can be used to optimize device performance. However, customized processing increases the number of processing steps and hence manufacturing costs. In addition, the circuits required to interface the two chips can consume significant area and power.

With monolithic integration, photonic devices are designed using the existing process layers of a standard logic process. The photonic devices can be implemented in polysilicon on top of the shallow-trench isolation in a standard bulk CMOS process [7], or in monocrystalline silicon with advanced thin-BOX SOI. Monolithic integration may require some post-processing, but its manufacturing costs can still be lower than those of 3D integration. Monolithic integration decreases the area and energy required to interface electrical and photonic devices, but it requires active area for waveguides and other photonic devices. Since the waveguide is made from the silicon substrate, the optical loss is potentially greater for monolithic integration.

Regardless of the integration methodology, WDM optical links share many of the components that contribute to optical loss (Table 1). Optical loss is an important system design parameter because it sets the required optical laser power and, correspondingly, the electrical laser power. One benefit of photonic interconnects is that many of the losses along the critical path are independent of the network layout, size, and topology. These losses include coupler loss, non-linearity, photodetector loss, and filter drop loss.

Table 1: Optical Loss Ranges per Component

In addition to optical loss along the waveguide, there is a thermal cost. Ring filters and modulators have to be thermally tuned to maintain their resonance under on-die temperature variations. Thanks to its thermally isolating trenches and in-plane heaters, monolithic integration gives the most optimistic ring heating efficiency, about 1 µW per ring per K.

Based on our analysis of projections for various photonic technologies and integration approaches, we make the following assumptions:

- Wavelengths: with double-ring filters and a 4 THz free-spectral range, up to 64 wavelengths can be placed on each waveguide in each direction. Interleaving the wavelengths relaxes filter roll-off requirements and reduces crosstalk.
- Non-linearity: a limit of 30 mW at 1 dB loss is assumed for the waveguides.
- Waveguides: they are single-mode, and a pitch of 4 µm minimizes crosstalk between neighboring waveguides.
- Rings: the ring diameters are 10 µm.
- Latency: a global photonic link takes 3 cycles (1 cycle in flight and 1 cycle each for electrical-to-optical and optical-to-electrical conversion).
- Layout: for monolithic integration, we assume a 5 µm separation between the photonic and electrical devices to maintain signal integrity. For 3D integration, the photonic devices are designed on a separate specialized layer.

Table 2 shows our assumptions for the photonic link energy and electrical laser power.

Table 2: Aggressive and Conservative Energy and Power Projections for Photonic Devices. fJ/bt = average energy per bit-time, DDE = data-traffic-dependent energy, FE = fixed energy, TTE = thermal tuning energy (~20 K), ELP = electrical laser power budget.

3.3 Technology Comparison
The comparison between electrical and optical links covers three aspects: latency, bandwidth density, and power.

The latency of an electrical channel depends on the length of the wire, while the latency of an optical channel is always three cycles (one for signal propagation, one for electrical-to-optical conversion, and one for optical-to-electrical conversion). Therefore, for short distances, an electrical link has lower latency than a comparable optical link. As distance grows, however, the latency of the electrical link increases linearly while that of the optical link stays constant.

The bandwidth density of an optical link is much greater than that of an electrical link: thanks to WDM, optical links have a 30× improvement in bandwidth density over a comparable electrical link. However, if the design is not area-constrained, the benefit of higher bandwidth density is less significant than it seems.

Both electrical and optical networks have static and dynamic energy components. Static energy is the energy spent in a network regardless of traffic. It includes leakage and clock energy for electrical links, and laser and thermal tuning energy for photonic links. Dynamic energy is the energy spent only when there is traffic in the network. The total power dissipated in an on-chip photonic network can also be split into two parts. The first part consists of power dissipated in the photonic components, i.e., power at the laser source and power dissipated in thermal tuning.
The second part consists of electrical power dissipated in the modulator drivers, receivers, and arbitration circuits.

Figure 2 outlines the energy differences between 2.5 mm and 10 mm electrical and optical links. The blue bars show that, under both the aggressive and conservative projections, the dynamic energy of the 10 mm optical link is far less than that of a 10 mm electrical link. For a 2.5 mm link, however, the picture depends on the projection. The aggressive projection costs around half the dynamic energy of the 2.5 mm electrical link, while the conservative projection costs about 1.5 times as much. The static component of the optical link is potentially much greater than that of its electrical counterpart. It also includes electrical laser power, which the graph omits because that power depends on the topology.

There are two important takeaways:

1) Under light load, a photonic network might be power-hungry due to its static power overhead.
2) Regardless of the load, short links are better implemented with electrical rather than optical technology.

Using these insights, we analyze related work and its topologies in the next two sections.

Figure 2: Electrical versus Optical Link Comparison

4. Network Topologies
Figure 3 illustrates the four topologies discussed in this paper: global crossbars, two-dimensional meshes, concentrated meshes, and Clos networks. Table 1 shows some key parameters for these topologies assuming an MTBw system.
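The hop-count argument in this section can be put in numbers. The sketch below counts router-to-router channel hops under uniform random traffic for two of the Figure 3 topologies: the 8×8 mesh and the 4×4 CMesh formed by 4× concentration. It makes three assumptions:

- routing is dimension-ordered, so the hop count is the Manhattan distance between routers;
- every source/destination router pair is equally likely, including pairs that share a router (zero hops);
- the function name is hypothetical.

This is an illustration, not the simulator used in this paper.

```python
from fractions import Fraction
from itertools import product

def mesh_hop_stats(k: int) -> tuple[Fraction, int]:
    """Average and maximum channel hops in a k x k mesh under uniform random traffic.

    ASSUMES dimension-ordered routing (hops = Manhattan distance) and that every
    (source router, destination router) pair is equally likely, including a
    router sending to itself.
    """
    routers = list(product(range(k), repeat=2))
    total, worst = 0, 0
    for (sx, sy), (dx, dy) in product(routers, repeat=2):
        hops = abs(sx - dx) + abs(sy - dy)
        total += hops
        worst = max(worst, hops)
    return Fraction(total, len(routers) ** 2), worst

# 64 tiles: a plain 8x8 mesh (one tile per router) versus a CMesh with
# 4x concentration (four tiles per router, i.e. a 4x4 grid of routers).
# Uniform traffic over tiles is uniform over routers because every router
# serves the same number of tiles.
mesh_avg, mesh_max = mesh_hop_stats(8)
cmesh_avg, cmesh_max = mesh_hop_stats(4)
print(f"8x8 mesh : avg {float(mesh_avg):.3f} hops, max {mesh_max}")
print(f"4x4 CMesh: avg {float(cmesh_avg):.3f} hops, max {cmesh_max}")
```

Under these assumptions, the average drops from 5.25 hops to 2.5 and the diameter from 14 hops to 6. This is in line with the expectation, discussed below, that a CMesh roughly halves latency relative to a mesh, at the cost of longer channels and higher-radix routers. A latency estimate would still need to add per-hop router delay and channel length.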


Figure 3: Logical View of 64-Tile Network Topologies. (a) 2D 8×8 mesh, (b) concentrated mesh (CMesh) with 4× concentration, (c) 8-ary, 3-stage Clos network with eight middle routers, (d) 64×64 distributed tristate global crossbar. In all four figures: squares = tiles, dots = routers, triangles = tristate buffers. In (a) and (b), inter-dot lines = two opposite-direction channels. In (c) and (d), inter-dot lines = unidirectional channels.

For systems with few tiles, a simple global crossbar is one of the most efficient network topologies and presents a simple performance model to software [8]. Such crossbars are strictly non-blocking: as long as an output is not oversubscribed, every input can send messages to its desired output without contention. Small crossbars can have very low latency and high throughput, but they are difficult to scale to tens or hundreds of tiles. Figure 3d illustrates a 64×64 crossbar network implemented with distributed tristate buses. Although such a network provides strictly non-blocking connectivity, it also requires a large number of global buses across long distances. These buses are hard to lay out and require global arbitration, which can add significant latency and must itself be pipelined. The global control and data wires cause significant power consumption even for communication between neighboring tiles. Thus, despite its benefits, a global crossbar is an unlikely choice for future manycore on-chip networks.

Figure 3a shows a two-dimensional mesh network. This topology is highly popular in systems with more tiles because it is simple in terms of design, wire routing, and decentralized flow control [9] [10]. Unfortunately, the mesh topology introduces high hop counts, which result in long latencies and significant router and channel energy consumption. Higher-dimensional mesh networks reduce the network diameter, but they require long channels when mapped to a 2D substrate.
Higher-dimensional meshes also need higher-radix routers, which increases power and area. Instead of adding network dimensions, adding concentration can help reduce hop count [2]. Figure 3b illustrates a two-dimensional mesh with a concentration factor of four. One disadvantage of the CMesh topology is that its channels are wider than those of the mesh topology. To improve channel utilization for shorter messages, the CMesh can be divided into multiple parallel CMesh networks with narrower channels. The CMesh topology should achieve throughput similar to a standard mesh with half the latency, at the cost of longer channels and higher-radix routers. CMesh topologies still require careful application mapping for good performance.

Lastly, Figure 3c illustrates an 8-ary 3-stage Clos topology, which reduces the hop count but requires longer point-to-point channels. The Clos network is an interesting intermediate point between the high-radix, low-diameter crossbar topology and the low-radix, high-diameter mesh topology [11]. Clos networks use many small routers and offer extensive path diversity. Unfortunately, Clos networks still require global point-to-point channels, which can be difficult to lay out and have significant energy costs.

5. Related Works
In the research community, proposed photonic interconnect topologies include all four topologies from the previous section. At the high-radix, low-diameter end of the spectrum, Vantrease et al. have proposed a global 64×64 photonic crossbar requiring about a million rings [12].
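The ring count is what makes such a crossbar expensive to keep tuned. A rough sketch can estimate heater power from the number of rings. It uses the monolithic heating efficiency of about 1 µW per ring per K quoted in the photonic technology discussion and the ~20 K tuning range noted in Table 2. It assumes, as a simplification, that every ring is heated across the full range. The function name is hypothetical.

```python
def thermal_tuning_power_w(num_rings: int,
                           uw_per_ring_per_k: float = 1.0,
                           tuning_range_k: float = 20.0) -> float:
    """Static heater power (W) needed to hold `num_rings` rings on resonance.

    ASSUMES every ring is heated across the full tuning range. The defaults are
    the ~1 uW/ring/K monolithic efficiency and ~20 K range quoted in the text.
    """
    microwatts = num_rings * uw_per_ring_per_k * tuning_range_k
    return microwatts / 1e6

# Roughly one million rings for a 64x64 photonic crossbar.
print(f"{thermal_tuning_power_w(1_000_000):.1f} W of heater power")
```

The result is about 20 W of heater power alone, before any laser power. Monolithic integration gives the most optimistic heating efficiency, so a 3D-integrated design would presumably need more. This illustrates why such a crossbar is hard to justify on power.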


In [1], the authors show that a crossbar is not a scalable photonic topology. Photonic crossbar implementations require so many rings that monolithic integration is impractical from an area perspective, and 3D integration is expensive due to the power cost of thermal tuning. At the other end of the spectrum, [13] and [14] proposed a low-radix, high-diameter 2D mesh topology. The problem with such a topology was outlined in the previous sections: low-radix, high-diameter topologies rely on short wires for communication, so electrical links are more sensible than optical ones. The Clos network is explored for on-chip photonic technology in [1]. That paper starts from a similar premise: neither extreme of the topology spectrum is ideal for silicon photonics. Figure 4 shows latency versus offered bandwidth for the mesh, CMesh, and Clos topologies, and Figure 5 shows the corresponding power for each topology. These graphs make two points evident. 1) It is extremely difficult to beat the performance and energy of an electrical mesh for local communication. 2) A fully photonic interconnect is not energy-efficient at low utilization because of its static power overhead.

6. Proposed Design
The insights from the previous section show what an optimized photonic topology needs. The first insight is that it is nearly impossible to beat short electrical wires in terms of performance and energy. Our topology therefore uses an underlying electrical mesh for local communication. The second insight is that at low utilization, photonic interconnects burn more power than their electrical counterparts. To get the most out of photonics, we propose a topology with long photonic channels that can be turned off during low utilization. Together, these two insights lead to a concentrated mesh with photonic express channels.
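The first insight can be illustrated with the link latencies from Section 3. The sketch below makes three assumptions:

- an electrical channel is pipelined at one stage per 2.5 mm, the distance covered in one 200 ps cycle in the electrical technology figures;
- a photonic channel always costs 3 cycles, regardless of length;
- router latency is ignored.

The constant and function names are hypothetical.

```python
import math

ELECTRICAL_MM_PER_CYCLE = 2.5  # ASSUMPTION: one pipeline stage per 2.5 mm (one tile width)
OPTICAL_LINK_CYCLES = 3        # E/O conversion + time of flight + O/E conversion

def electrical_link_cycles(length_mm: float) -> int:
    """Cycles for a pipelined electrical channel of the given length (at least 1)."""
    return max(1, math.ceil(length_mm / ELECTRICAL_MM_PER_CYCLE))

def faster_link(length_mm: float) -> str:
    """Technology with the lower channel latency; ties go to electrical."""
    if electrical_link_cycles(length_mm) <= OPTICAL_LINK_CYCLES:
        return "electrical"
    return "photonic"

for length in (2.5, 5.0, 7.5, 10.0, 20.0):
    print(f"{length:5.1f} mm: electrical {electrical_link_cycles(length)} cycles, "
          f"optical {OPTICAL_LINK_CYCLES} cycles -> {faster_link(length)}")
```

Under these assumptions, the photonic link wins on latency only beyond about 7.5 mm, i.e., for channels spanning more than three tiles. This is consistent with reserving photonics for long express channels and keeping local traffic on electrical wires.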
Two logical views of such a topology are shown in Figure 6. These topologies, identified as express1 and express2, are used for the rest of the paper. Both topologies can also be physically implemented as shown in Figure 6.

Figure 4: Latency vs. Offered Bandwidth. The traffic patterns shown (UR, P2D, P8C, P8D) are explained in Section 7 of this paper.

Figure 5: Power Dissipation vs. Offered Bandwidth. The 3.3 W laser power is not included for the pclos-a (aggressive) topology.

Figure 6: The logical and physical views of the two proposed networks, Express1 (on the right) and Express2 (on the left). The physical view is shown below.

A CMesh with photonic express channels is an effective optical topology for two reasons. The short channels are implemented using energy- and performance-efficient electrical technology, while the long channels are implemented using photonics. In addition, the underlying CMesh has all-to-all connectivity, and the routers are shared between the optical and electrical links. Under low utilization, the laser and thermal tuning heaters can therefore be turned off to save power.

7. Simulation Platform
We use a detailed cycle-accurate microarchitectural simulator to study the performance and power of various electrical and photonic networks for a 64-tile system with 512 b messages. Our model includes pipeline latencies, router contention, flow control, and serialization overheads. To determine the latency at a given injection rate accurately, we used warm-up, measurement, and drain phases of several thousand cycles, together with infinite source queues. Various events (such as channel utilization, queue accesses, and arbitration) were counted during simulation and then multiplied by energy values derived from first-order gate-level models.

We use synthetic traffic patterns based on a partitioned application model. Each traffic pattern has some number of logical partitions, and tiles randomly communicate only with other tiles in the same partition. These logical partitions are then mapped to physical tiles in one of two ways. In a co-located mapping, the tiles within a partition are physically grouped together. In a distributed mapping, the tiles in a partition are distributed across the chip. We believe these partitioned traffic patterns capture the varying locality present in manycore programs.
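The partitioned traffic model described above can be sketched as a destination generator. This is a hypothetical illustration, not the simulator used in this paper. It assumes the following:

- tiles are numbered 0–63 in row-major order on the 8×8 grid;
- a co-located mapping assigns consecutive tile numbers to the same partition, so with eight partitions each partition is one row;
- a distributed mapping assigns tiles round-robin across partitions, so with eight partitions each partition is one column spanning the chip.

The exact layouts of the named patterns (for example, the optimal co-location of P8C or the diagonally opposite quadrants of P2D) are not reproduced.

```python
import random

NUM_TILES = 64  # 8 x 8 grid, tiles numbered 0..63 in row-major order

def partition_of(tile: int, num_partitions: int, mapping: str) -> int:
    """Logical partition of a tile under a HYPOTHETICAL mapping.

    `num_partitions` must divide NUM_TILES evenly.
    """
    size = NUM_TILES // num_partitions
    if mapping == "colocated":
        return tile // size            # consecutive tile numbers share a partition
    if mapping == "distributed":
        return tile % num_partitions   # partition members are strided across the chip
    raise ValueError(f"unknown mapping: {mapping}")

def pick_destination(src: int, num_partitions: int, mapping: str,
                     rng: random.Random) -> int:
    """Pick a random destination in the same partition as `src`, excluding `src`."""
    part = partition_of(src, num_partitions, mapping)
    peers = [t for t in range(NUM_TILES)
             if t != src and partition_of(t, num_partitions, mapping) == part]
    return rng.choice(peers)

rng = random.Random(0)
print(pick_destination(0, 1, "colocated", rng))    # single partition: uniform random
print(pick_destination(0, 8, "colocated", rng))    # one of tiles 1..7
print(pick_destination(0, 8, "distributed", rng))  # one of tiles 8, 16, ..., 56
print(pick_destination(0, 32, "distributed", rng)) # two-tile partition: always tile 32
```

A single partition reduces to uniform random traffic, as with the UR pattern. Changing only the mapping, while keeping the partition count fixed, changes the locality of the traffic without changing how many peers each tile has.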
Although we studied various partition sizes and mappings, this paper focuses on four representative patterns:

- UR: a single global partition, identical to the standard uniform random traffic pattern.
- P8C: eight partitions, each with eight tiles optimally co-located together.
- P8D: the same eight partitions, striped across the chip.
- P2D: 32 partitions, each with two tiles mapped to diagonally opposite quadrants of the chip.

8. Simulation Results
Using this simulation platform, we implement a CMesh with 256 b channels as the baseline for our analysis. We then compare the simulated CMesh to all-electrical versions of express1 and express2 with 256 b express channels. Finally, we compare the electrical implementations of express1 and express2 with the hybrid opto-electrical express1 and express2 topologies.

8.1 Electrical Topologies
All-electrical versions of the CMesh, express1, and express2 are analyzed using our simulation platform. Figure 7a shows latency versus offered bandwidth for the CMesh topology. As expected, the P8C traffic pattern performs best because it involves only local communication. Conversely, the P2D traffic pattern requires only global communication, which explains its degraded performance. Figure 7b shows latency versus offered bandwidth for the express1 topology. The P8C and P2D traffic patterns show no performance improvement because these patterns constrain all messages to the local channels. The UR and P8D traffic patterns show some improvement compared to the CMesh. In general, the express1 topology gains little performance for the area spent on the express channels.

Figure 7: Latency vs. Offered Bandwidth for our target system.

Figure 7c shows latency versus offered bandwidth for the express2 topology. The express2 topology reduces the hop count for every source and destination pair to at most 2 hops. The variation in saturation bandwidth across patterns is therefore due to their different mixes of 1-hop and 2-hop routes. Performance increases greatly across the traffic patterns. This is partly due to the extra bandwidth of the added express channels. The extra area is not a significant concern as long as the power budget is met, since die area is abundant.

Figure 8 shows power dissipation versus offered bandwidth for the three electrical topologies. We compute power dissipation from the simulation event data extracted for each of the points in Figure 7. At the same offered bandwidth, the plots show only a slight, insignificant increase in power as more express channels are added. This is expected: electrical power depends mostly on dynamic factors, so the power should be similar at the same offered bandwidth. However, if the designer wants higher offered bandwidth, the power will inevitably increase.

Figure 8: Power Dissipation versus Offered Bandwidth for our target systems implemented all-electrically.

Even when added performance is not the goal, the designer should add express channels. They move the network's operating point further from the knee of the latency versus offered bandwidth curve, and latency variance increases dramatically as the network approaches the knee. The express2 topology minimizes latency variation because its knee is furthest away for the same offered bandwidth.
Figure 7 shows this effect: across traffic patterns, the latencies of express2 are spread significantly less widely than those of the CMesh.

8.2 Hybrid Topologies
In this section we compare the simulation results for the hybrid express1 and express2 topologies with those for the all-electrical express1 and express2 topologies. The latency versus offered bandwidth behavior changes little between the hybrid and electrical implementations, apart from a slight reduction in latency in the hybrid case, so we omit those results.
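Whether to turn the photonic channels off comes down to comparing two power curves, each roughly a static floor plus a dynamic slope proportional to offered bandwidth. The sketch below finds the offered bandwidth above which the hybrid network draws less power than the all-electrical one. All parameter values are hypothetical placeholders chosen only to show the shape of the trade-off. They are not read from Figure 9, and real curves bend upward near saturation. The class and function names are also hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LinearPowerModel:
    """P(B) = static_w + dynamic_w_per_unit * B, for offered bandwidth B."""
    static_w: float
    dynamic_w_per_unit: float

    def power(self, bandwidth: float) -> float:
        return self.static_w + self.dynamic_w_per_unit * bandwidth

def crossover_bandwidth(electrical: LinearPowerModel, hybrid: LinearPowerModel) -> float:
    """Offered bandwidth above which the hybrid network draws less power.

    ASSUMES the case described in the text: photonics adds static power
    (laser, thermal tuning) but lowers the dynamic slope.
    """
    static_penalty = hybrid.static_w - electrical.static_w
    slope_gain = electrical.dynamic_w_per_unit - hybrid.dynamic_w_per_unit
    if static_penalty < 0 or slope_gain <= 0:
        raise ValueError("expected higher static power and a lower dynamic slope for hybrid")
    return static_penalty / slope_gain

def use_photonics(bandwidth: float, electrical: LinearPowerModel,
                  hybrid: LinearPowerModel) -> bool:
    """Turn the photonic express channels on only where they strictly save power."""
    return hybrid.power(bandwidth) < electrical.power(bandwidth)

# HYPOTHETICAL parameters: watts, and watts per unit of offered bandwidth.
electrical = LinearPowerModel(static_w=4.0, dynamic_w_per_unit=2.0)
hybrid = LinearPowerModel(static_w=10.0, dynamic_w_per_unit=0.5)
print(f"crossover at {crossover_bandwidth(electrical, hybrid):.1f} bandwidth units")
for b in (1.0, 4.0, 8.0):
    print(b, "photonics on" if use_photonics(b, electrical, hybrid) else "photonics off")
```

The larger the hybrid network's static penalty relative to its dynamic savings, the further the crossover moves to the right. When it moves beyond every relevant offered bandwidth, the all-electrical network always wins. A real controller would likely also need hysteresis so that it does not toggle rapidly around the crossover; this sketch ignores that.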

Figure 9 shows power dissipation versus offered bandwidth for the hybrid express1 and express2 topologies under the aggressive and conservative photonic technology projections. With aggressive technological advances in photonics, the hybrid topologies show desirable properties. The express1 topology shows no improvement in power because its global links are rarely used. The express2 topology, however, uses noticeably less power when implemented with photonics. Interestingly, when photonic links are used only for long channels, the static power overhead is not much greater than that of the electrical counterpart. Turning off the photonic channels at offered bandwidths of 0–5 kb/s could save a small amount of power (about 6 W). It is unclear whether this saving is worth the additional hardware needed to toggle between the all-electrical and hybrid states. Nevertheless, if photonic device engineers can meet the aggressive technology projections, these results show large potential for on-chip photonics. If the devices only meet the conservative projections, however, photonics is clearly not worth the investment: the static power is so high that the all-electrical version uses less power at all relevant offered bandwidths. If the technology lands somewhere between the aggressive and conservative projections, it will be sensible to turn off the photonic channels during low utilization, since the power savings for the express2 topology could be up to 15 W. Turning the photonic channels back on when needed will improve power as well.

9. Conclusion
In this paper, we evaluated the potential of silicon-photonic technology by designing a hybrid opto-electrical network optimized for both electrical and photonic technologies. An electrical CMesh with photonic express channels uses photonics optimally because only global connections are implemented in photonics.
The photonic network can also be turned off to reduce the static power overhead during low utilization. The proposed network also uses electrical wires optimally, since local communication always travels over the underlying electrical CMesh.

Figure 9: The Aggressive and Conservative Power Analysis for our target systems implemented with photonic express channels.

We analyzed related work and showed that it does not use electrical and photonic technology in an optimal fashion. We conclude that even for a highly optimized network, the future of silicon photonics depends critically on advances in photonic technology. Depending on how much the power improves, silicon photonics could be a worthwhile investment in the future.

Acknowledgements
Thanks to Ajay Joshi for showing me how to compute the power numbers from the events collected from the simulator.

References
[1] Joshi, A., et al. "Silicon-Photonic Clos Networks for Global On-Chip Communication." International Symposium on Networks-on-Chip.
[2] Balfour, J., and W. J. Dally. "Design Tradeoffs for tiled CMP on-chip networks." Int'l Conf. on Supercomputing.
[3] Zhao, W., and Y. Cao. "New generation of predictive technology model for sub-45 nm early design exploration." Trans. on Electron Devices, 2006: 53(11).
[4] Kim, B., and V. Stojanovic. "Characterization of equalized and repeated interconnects for NoC applications." IEEE Design and Test of Computers, 2008: 25(5).
[5] Gunn, C. "CMOS photonics for high-speed interconnects." IEEE Micro, 2006: 26(2).
[6] Barwicz, T., et al. "Silicon Photonics for Compact, Energy Efficient Interconnects." Journal of Optical Networking, 2007: 6(1): 63–73.
[7] Orcutt, J., et al. "Demonstration of an electronic photonic integrated circuit in a commercial scaled bulk CMOS process." Conf. on Lasers and Electro-Optics.
[8] Nawathe, U., et al. "An 8-core 64-thread 64b power efficient SPARC SoC." Int'l Solid-State Circuits Conf.
[9] Bell, S., et al. "TILE64 processor: A 64-core SoC with mesh interconnect." Int'l Solid-State Circuits Conf.
[10] Vangal, S., et al. "80-tile 1.28 TFlops network-on-chip in 65 nm CMOS." Int'l Solid-State Circuits Conf.
[11] Clos, C. "A study of non-blocking switching networks." Bell System Technical Journal, 1953: 32.
[12] Vantrease, D., et al. "Corona: System implications of emerging nanophotonic technology." Int'l Symp. on Computer Architecture.
