Low-Power Low-Latency Optical Network Architecture for Memory Access Communication

Size: px

Start display at page:

Download "Low-Power Low-Latency Optical Network Architecture for Memory Access Communication"

Ashley Atkinson
5 years ago
Views:

1 Wang et al. VOL. 8, NO. 10/OCTOBER 2016/J. OPT. COMMUN. NETW. 757 Low-Power Low-Latency Optical Network Architecture for Memory Access Communication Yue Wang, Huaxi Gu, Kang Wang, Yintang Yang, and Kun Wang Abstract The interconnection network plays a vital role in improving the performance of modern computing systems. Traditional electronic interconnect is subject to latency, power consumption, and bandwidth problems. A low-power low-latency optical network architecture is proposed in this paper to interconnect cores and memory. The proposed architecture is made up of optical subnetworks using seven wavelengths. The optical subnetwork is constructed by some switching blocks, which are able to provide the memory access communication from all cores to ranks at the same time. Compared with traditional electronic bus-based core-to-memory architecture, the simulation results based on the PARSEC benchmark show that the average latency decreases by 54.05%, and the average power consumption decreases by 86.25%. Due to the enhancement of parallel access, the total runtime of applications decreases by 66.43%. Index Terms Memory access communication; Network on chip; Optical interconnect; Power consumption. I. INTRODUCTION N owadays, the performance of computing systems is developing rapidly with the aid of technology innovations. The storage system is an integral part of computing systems. In the traditional memory system, multiple DRAM devices are interconnected together to form a single memory system that is managed by a single memory controller [1]. Moreover, only one rank can communicate with one core through the electronic bus at the same time. As the number of cores are scaling, more cores will attempt to communicate with memory at the same time. The communication latency and bandwidth will become the bottleneck of core-to-memory communication. Thus, the traditional electronic bus-based interconnection architecture faces the challenge of a high-performance computing system and the memory wall problem [2]. Power consumption is another crucial problem for core-to-memory communication. Manuscript received April 13, 2016; revised August 26, 2016; accepted August 26, 2016; published September 21, 2016 (Doc. ID ). Y. Wang, H. Gu ( hxgu@xidian.edu.cn), and K. Wang are with the State Key Laboratory of Integrated Service Networks, Xidian University, Xi an, China. Y. Yang is with the Institute of Microelectronics, Xidian University, Xi an, China. K. Wang is with the School of Computer Science, Xidian University, Xi an, China. As the storage capacity of the memory increases, more electronic buses will be added to the system with higher power consumption and access latency. A traditional electrical bus-based interconnection network cannot fulfill the power consumption and latency requirements in a many-core system. Optical network-on-chip (ONoC) is a promising interconnection solution, which is popularly used to solve problems faced by electronic on-chip communications [3]. Over the last few years, many research groups [4 7] have explored the design and innovation of network-on-chip, especially the architectures, topologies, routing algorithms, etc. The proposals related to ONoC are mainly focused on the communication between different cores, such as [8], [9], and [10]. The advantages of these proposals are 1) high bandwidth due to the use of wavelength division multiplexing technology, 2) low power consumption thanks to the data bit-rate independent of distance in optical interconnect, and 3) high path diversity compared with traditional bus interconnect. The architecture Flyover [11] uses a cluster structure with time-division multiplex (TDM) wavelength division multiplex (WDM) technology, which improves network scalability and flexibility. A wavelength-reused hierarchical architecture WRH-ONoC meets the demands of high-throughput data communication between connected cores [12]. Besides, the innovation of topologies, e.g., DMesh topology [13] and MSB topology [14], also achieves higher performance of the network and quality of communication. The performance of modern computing systems is determined by the characteristics of the interconnection network, which are used to provide communication links between on-chip cores and off-chip memory [15]. To improve the memory access performance of future CMPs, some researchers have introduced the emerging technology such as optical interconnect and network design into offchip memory access [16,17]. The discrepancy between on- and off-chip communication bandwidth is one of the problems widely noted. It continues to widen due to the power and spatial constraints of electronic off-chip signaling [16]. The design of a photonic network-on-chip with integrated DRAM I/O interfaces, along with the performance comparison between similar electronic solutions and the authors design, is obviously presented. SAMNoC [17] is an optical bus-based on-chip network. The bus-based /16/ Journal 2016 Optical Society of America

2 758 J. OPT. COMMUN. NETW./VOL. 8, NO. 10/OCTOBER 2016 Wang et al. structure may have less parallelism and more MRs to implement the functions. Considering all the problems above, in this paper, a new optical network is specially designed for the memory access communication between cores and memory systems. We focus on improving the performance of the power consumption, memory access latency, and parallelism for memory transactions. Compared with other ONoCs, the new optical network made up of switching blocks uses fewer microring resonators (MRs), which are fully utilized. By using different wavelengths and subnetworks, all cores are able to send memory transactions simultaneously. The optical subnetworks not only save the power consumption during the communication but also reduce the transmission latency. To the best of our knowledge, the off-chip core-to-memory interconnect presented in this paper takes advantage of the three above-mentioned key advantages of ONoC. Only seven wavelengths are used in our design to realize the interference-free memory access communication between different cores. As for the optical insertion loss [18], we use fewer MRs to reduce the MR coupling loss and MR through loss. These features are achieved thanks to the use of optical interconnections and the network design. In other words, these benefits are not observable in architectures that only employ optical interconnect [19]. II. OPTICAL INTERCONNECTED ARCHITECTURE A. Overview of the Architecture An optical interconnect is designed for an N-core system to provide a novel method of memory access communication. Generally, the proposed optical interconnect is a hierarchical architecture, which is built by N/4 subnetworks. For each subnetwork, two switching blocks are responsible for two-directional memory access communications. The 3D stacked optical interconnect consists of one electrical layer and two optical layers. The electrical layer contains N cores that are divided into N/4 groups. The corresponding subnetworks are placed on the two optical layers. One optical layer contains N/4 switching blocks, which are responsible for the communication from cores to ranks. The other optical layer contains N/4 switching blocks that are responsible for the communication responses from ranks to cores. Take the 64-core four-rank system as an example, Fig. 1(a) shows the cross-sectional view of our target system. The electrical layer contains 16 groups of cores, and each group has four cores. The laser layer mainly consists of the laser sources, many modulators, and detectors. The number of laser sources is related to the number of cores and ranks. The optical layer 1 shown in Fig. 1(a) consists of 16 optical subnetworks that make the communication from cores to ranks possible. The communication from memory to core relies on the optical subnetworks on optical layer 2. Both optical layer 1 and 2 are linked with the cores or ranks on an electrical layer by the means of throughsilicon vias (TSVs). Layout details for optical layer 1, optical layer 2, and the electrical layer are shown in Fig. 1(b). Each optical layer consists of 16 new optical subnetworks that are connected to the TSV access points. On the electrical layer, two data lines are used between a core and a TSV access point because of the two-way memory access communications. Heat sink Optical layer 2 (optical memory-to-core networks) Optical layer 1 (optical core-to-memory networks) OE/EO layer (O/E and E/O transceivers) Electrical layer (cores and ranks) TSV access points TSVs Optical layer 2 Optical layer 1 Electrical layer (a) (b) Fig. 1. Layout of the proposed optical interconnect for 64-core four-rank system. (a) Cross-sectional view of the proposed architecture. (b) Layout details for optical layer 1, optical layer 2, and electrical layer.

3 Wang et al. VOL. 8, NO. 10/OCTOBER 2016/J. OPT. COMMUN. NETW. 759 B. Optical Subnetwork The critical component of optical interconnect between cores and ranks is the new optical subnetwork. The N-core system has N 4 optical subnetworks that are constructed by some switching units, sending/receiving supply units, and power units. 1) Switching Unit: The switching unit is the key point of the new optical subnetwork. It is used to transmit an optical signal to a certain rank. Each optical subnetwork has two switching blocks that consist of eight waveguides and 12 MRs. The two switching blocks are the same; each has four waveguides and six different MRs. A switching block has four input ports and four output ports connected to four cores and four ranks, respectively. The six MRs in each block have different resonance wavelengths, distinguished by the number of MRs. In total, seven wavelengths are used in the optical interconnect architecture for all core-tomemory communications. Table I shows that four input ports of a switching block apply different wavelengths to four output ports. 2) Sending/Receiving Unit: Each core and rank are equipped with a sending/receiving unit. The main devices of the sending/receiving unit are modulators/demodulators, MRs, and a control subunit. Modulators and demodulators are arranged on both sides of cores and ranks responsible for data modulation and demodulation. Each core has four MRs to couple four different wavelengths, which are controlled by a control subunit to communicate with different ranks of the memory system. Each rank also has four MRs to couple four different wavelengths, which are controlled by the control subunit to communicate with different cores in one group. On the side of the memory system, each rank also has several broadband microring resonators (BMRs). These BMRs are used for the communication from ranks to cores in different groups. The control subunit and BMRs are controlled by ranks, based on special memory response transaction. On the side of cores, the control subunit and MRs are controlled by cores, based on core-to-memory transactions. 3) Power Supply Unit: Any optical signal transmitted in the network needs the optical power supply unit. Thus, the laser source becomes the main part of the optical power supply unit. Considering the power required by communication from core to memory, which means the losses can still reach the receiver sensitivity after being cut down, the off-chip laser sources provide enough power for a 4N-core 4M-rank system to afford the one-way communication from ranks back to the cores. A wavelength-selective TABLE I MAPPING INFORMATION OF WAVELENGTHS Input Port Corresponding Wavelength Input1 λ 1 λ 3 λ 5 λ 7 Input2 λ 2 λ 4 λ 5 λ 7 Input3 λ 1 λ 2 λ 6 λ 7 Input4 λ 3 λ 4 λ 6 λ 7 on-chip optical broadcasting system proposed by Su et al. [20] is also used and contained in the power supply unit beside the laser sources. The wavelength-selective on-chip optical broadcasting system is used to distribute laser power to eight wavelength channels, which are connected to eight groups of cores. Eight partial drop filter designs consist of asymmetric coupling to the bus and drop waveguides. C. Case Study A 64-core four-rank system is taken as an example to show the details of the proposed optical interconnection for core-to-memory communications. Figure 2 Illustrates the optical subnetwork between one group of cores and four ranks. In the power supply unit, two laser sources are used for the core-to-memory one-way communication. Eight waveguide channels of a wavelength-selective on-chip optical broadcasting system connect eight groups of cores and the laser sources can be verified to provide enough power for the core-to-memory one-way communication of 32 cores. One laser source is used for the memory-to-core one-way communication. In the sending/receiving unit, each core has a control subunit controlling four MRs. The modulators/demodulators are shown in Fig. 2(a). The key component of the switching block is the parallel switching elements shown in Fig. 2(c), which are fabricated by two waveguides and one MR. Figure 2(a) shows the two similar blocks of the new optical network using 12 parallel switching elements. In our new optical subnetwork, the six MRs in one switching block have different resonance frequencies. The optical signal with wavelength λ i will be coupled by MRi. When MRi is powered on and the wavelength of the light signal is λ i, the incident light signal will be coupled into MRi and transmitted to the given output port. On the other hand, the optical signal with wavelength λ j (j i) will propagate from the input port to the output port without being coupled into MRi [Fig. 2(c)]. The excellent feature of the new subnetwork is the full usage of the MRs and fewer number of MRs. As [21] indicated, the MR could be used from both sides at one time. Hence, both the optical paths in a switching block, which are from input port 1 to output port 1 and from input port 2 to output port 2, use MR5 to couple optical signals. It does not affect overall system performance. As is shown in Fig. 2(a), the new optical network, which uses a fewer number of MRs, provides reliable communications from four cores to four ranks. In the switching unit, according to Table I, core 1 is connected to input port 1 using wavelengths λ 1, λ 3, λ 5, and λ 7 to communicate with the four ranks, respectively. Core 2 is connected to the input port 2 using wavelengths λ 2, λ 4, λ 5, and λ 7. Core 3 is connected to the input port 3 using wavelengths λ 1, λ 2, λ 6, and λ 7. Core 4 is connected to the input port 4 using wavelengths λ 3, λ 4, λ 6, and λ 7. The new optical network does not have a MR with resonance wavelength λ 7, so the light signal with wavelength λ 7 is used to transmit data directly with no coupling needed to change the waveguide. There are 16 optical paths from four inputs of the optical network

4 760 J. OPT. COMMUN. NETW./VOL. 8, NO. 10/OCTOBER 2016 Wang et al. Fig. 2. Optical subnetwork for 64-core four-rank system-on-chip. (a) Overall architecture of four cores in the same group communicating with four ranks. (b) Connection details of four ranks in the new core-to-memory interconnection. (c) Operation principle of the parallel switching element. to four outputs of the optical network. The different statuses presented in Table II use different MRs to couple different wavelengths. The details of the 16 statuses are shown in Table II. The cores with the same number in different groups use the same wavelength, while the cores with different numbers in one group use different wavelengths. The connection between optical networks and ranks is also shown in Fig. 2(a). Rank 0 is connected to output port 2 of the switching block. Rank 1 is connected to output port 1. Rank 2 is connected to output port 4, and rank 3 is connected to output port 3. Figure 2(b) shows that the four ranks also use the different wavelengths to communicate with different cores. According to Table I, rank 0, which is connected to input port 1, requires four wavelengths λ 1, λ 3, λ 5, and λ 7 to respond to any of the cores. Rank 1 TABLE II ANALYSIS OF THE 16 STATUSES Status Optical Path Number of MRs Wavelength 1 Input1 Output1 5 λ 5 2 Input1 Output2 λ 7 3 Input1 Output3 3 λ 3 4 Input1 Output4 1 λ 1 5 Input2 Output1 λ 7 6 Input2 Output2 5 λ 5 7 Input2 Output3 4 λ 4 8 Input2 Output4 2 λ 2 9 Input3 Output1 2 λ 2 10 Input3 Output2 1 λ 1 11 Input3 Output3 6 λ 6 12 Input3 Output4 λ 7 13 Input4 Output1 4 λ 4 14 Input4 Output2 3 λ 3 15 Input4 Output3 λ 7 16 Input4 Output4 6 λ 6 requires four wavelengths λ 2, λ 4, λ 5, and λ 7. Rank 2 requires four wavelengths λ 1, λ 2, λ 6, and λ 7. Rank 3 requires four wavelengths λ 3, λ 4, λ 6, and λ 7. The statuses of optical paths from ranks to cores are the same as those from cores to ranks. III. COMMUNICATION METHOD The two-way communication between cores and memory is designed for the new interconnect architecture. When cores need to send transactions to the memory, the control units will switch on the certain MRi to couple the optical signal with wavelength λ i. The modulators convert the electronic signal to the optical signal. Before the optical signal is injected into the optical network, the signal will be modulated by the modulator and carry the messages. After the optical signal transmitting through the optical network, the detectors on the side of ranks will convert the optical signal to the electronic signal. The electronic signal will be sent to the certain memory module to process the transaction. Each rank is receiving many transactions from different cores at the same time, so it will record which network this transaction is from (i.e., optical waveguide) and process the transactions in chronological order. Figure 2(a) shows the two-way communication architecture of four cores in the same group communicating with four ranks as an example. When core 1 wants to send a memory transaction to rank 3, the control unit will switch on MR3 to couple the optical signal with wavelength λ 3. Then, the modulator will modulate the optical signal, i.e., it will convert the electronic signal to the optical signal. After transmitting the optical signal through the optical network, the detector on the side of rank 3 will convert the optical signal to the electronic signal. Finally, the

5 Wang et al. VOL. 8, NO. 10/OCTOBER 2016/J. OPT. COMMUN. NETW. 761 electronic signal will be sent to a certain memory module to process the transaction. As an advantage of a new optical subnetwork, the cores can send transactions to four ranks without any interference based on different wavelengths and networks. If rank 0 needs to send the response signal to core 4 of group 3 [Fig. 2(b)], it will power on MR3 to couple the wavelength λ 3 and modulate the light signal. Then, the rank should power on the third microring resonator according to the previous records of waveguide, in order to send the light signal to the subnetwork connected to group 3. The optical signal will be modulated and sent to the optical network. MR3 will couple the optical signal with wavelength λ 3.At the end of the response communication, the detector will convert the optical signal to the electronic signal that is sent to core 4. IV. SIMULATION AND ANALYSIS In this section, we compare the new optical network interconnection architecture for a 64-core four-rank system with the traditional electronic bus-based interconnection architecture. DRAMSim2 [22] is used to simulate the transaction transmission process between cores and ranks and analyze the performance improvement of the proposed network. We use the real traffic derived from Netrace [23], which is based on the PARSEC benchmark [24] in the DRAMSim2 to achieve reliable results. Three applications from the PARSEC benchmark are chosen to test the performance of the new optical interconnect architecture, and each application extracts 100,000 core-to-memory traces from Netrace in chronological order. Then, performance results such as electronic power consumption, the number of returned transactions, and average latency are shown and analyzed. A. Power Consumption Analysis The total power consumption of the system with optical network interconnection architecture can be calculated by the following equation: P optical-total P laser P tuning P modulation P detector P memory ; (1) where P laser is the power required by all lasers. P tuning is the power consumed by tuning MRs. All the modulators and detectors in the optical interconnection will generate the modulation power consumption P modulation and detection power consumption P detector. P memory denotes the total power consumed by the memory system, including the stacked memory modules, memory controllers, etc. Considering the differences between core-to-memory communication and memory-to-core communication, we calculate the power required by three lasers separately. When each core s power consumption is 0.4 W, the laser source achieves a peak efficiency of 8.2% and provides a laser output power of 23 mw per wavelength [25]. To gain the real output power needed in the system, Eq. (2) is developed as follows: P out S L total 10 lg n; (2) where P out is the output power of the laser. S is the detector sensitivity, and n is the number of wavelengths. L total is the total power loss of all the pairs of optical paths between the cores and ranks. The main sources of L total are the laser power and the circuits of modulator, detector, and tuning. The computing formulas of total optical power loss and total loss per wavelength are shown in Eq. (3): L total-per-wavelengh L insertion L modulation L detection L distribution ; (3) where L modulation and L detection are the optical power loss during the modulation and detection. L distribution is the total distribution loss in the optical path. When the core communicates with the rank, the laser power is first distributed to eight groups and then distributed to four cores [Fig. 2(a)]. When the rank returns transactions to the core, the laser power is only distributed once to four ranks. L insertion is the optical power insertion loss, which contains the MR drop loss, the MR through loss, the waveguide crossing loss, and the waveguide bend loss [Eq. (4)]: L insertion X l drop X l through X l crossing X l bend : (4) In Eq. (4), l drop is the power loss from MR coupling. l through is the power loss from passing through an MR. l crossing is the power loss experienced at the crossing point of two waveguides, and l bend is the power loss from a bending waveguide. The parameters and values are listed in Table III. In the architecture with the new optical interconnect, seven wavelengths are used in the core-to-memory communication. The optical loss of the case that one core using one wavelength to send the transaction to one rank is easily calculated. The aggregate excess loss of the parallel drop in the device is only 1.1 db (L distribution ) with an optical power variation of only 0.11 db between drops [20]. The optical insertion loss L insertion of all 16 optical paths in the network TABLE III PARAMETERS OF OPTICAL DEVICES Parameter Value MR drop port 1.5 db [26] MR through port 0.01 db [26] Waveguide crossing 0.05 db [26] Waveguide bend db/90 [27] Modulator through port 1.2 db [27] Detector drop port 0.5 db [27] Detector through port 0.05 db [27] Detector sensitivity 22 dbm [28] Laser efficiency 8.2% [25]

762 J. OPT. COMMUN. NETW./VOL. 8, NO. 10/OCTOBER 2016 Wang et al. Fig. 3. Power consumption of three applications under new optical interconnected architecture. Fig. 4.

6 762 J. OPT. COMMUN. NETW./VOL. 8, NO. 10/OCTOBER 2016 Wang et al. Fig. 3. Power consumption of three applications under new optical interconnected architecture. Fig. 4. Power consumption comparison results between traditional electronic bus architecture and the new optical network architecture. whole process that a core using one wavelength communicates with a rank is mw. In the optical interconnect architecture, one laser is used for the eight groups (32 cores). The wavelength λ 7 may be used by 32 cores at the same time while other wavelengths may be used by as much as 16 cores at the same time. The total power consumption of two lasers responsible for the cores is P laser mw. According to the structure in Fig. 1(c) and the communication method, the power consumption of the laser responsible for the ranks is calculated as shown in Fig. 3. The tuning, modulation, and detection power consumptions are calculated based on the total average bandwidth of each application. The total average bandwidths of applications 1, 2, and 3 are 5.365, 6.564, and GB/s, respectively. The maximum value of tuning power is about 250 fj/bit from Manipatruni et al. [29]. The maximum total value of modulation power and detection power is 2 pj/bit [30 33,34]. The power consumption details of total-perwavelength are shown in Fig. 3. In the proposed structure, memory modules are placed vertically using 3D integration technology [35]. The total power consumption of 3D stacked memory is 500 fj/bit based on [34]. Figure 4 shows the comparison results of power consumptions between two architectures. The bar graph shows each value of power consumptions in Eq. (1) while the line chart depicted by the right y axis shows the total power consumption under two architectures. Compared with the traditional electronic bus-based interconnection architecture, the 64-core four-rank system using our new optical network interconnection architecture decreases the total power consumption significantly. The total power consumption of application 1 decreases by 88.33%. The total power consumptions of applications 1, 2, and 3 decrease by 88.33%, 87.15%, and 83.26%, respectively. is analyzed. The max value of 1.75 db generated in the optical path from input port 3 to output port 3 is assumed. From Eq. (3), L total-per-wavelength is 5.8 db. P total-per-wavelength is 16.2 db. In other words, the power consumed by the B. Latency and Parallel Access Analysis Latency is an important indicator of measuring the performance of a communication network. In this section, the (a) (b) (c) Fig. 5. Comparison of latency between two architectures. (a) Latency comparison of traces in application 1. (b) Latency comparison of traces in application 2. (c) Latency comparison of traces in application 3.

7 Wang et al. VOL. 8, NO. 10/OCTOBER 2016/J. OPT. COMMUN. NETW. 763 achieves lower power consumption, lower latency, and better parallel memory access. ACKNOWLEDGMENT This work was supported by the National Natural Science Foundation of China (NSFC), grant nos , , and ; the Fundamental Research Funds for the Central Universities, grant no. JB150318; and the 111 Project, grant no. B REFERENCES Fig. 6. Average latency and the maximum latency comparison of three applications. average latencies and the maximum latencies in the three applications are simulated and shown in Figs. 5 and 6. The x axis in Fig. 5 expresses the running cycle, while the y axis stands for the number of traces that can be accomplished in the corresponding cycle interval. Compared with electronic interconnection, the new optical network interconnection significantly reduces the latency. Figure 5 shows that large numbers of traces under optical network architecture can be finished with low latency. The average latencies and the maximum latencies in the three applications are compared and shown in Fig. 6. It can be observed that the results of the new optical interconnect are better than that of electronic interconnect. In application 1, the average latency of electronic interconnect is ns, while the average latency of optical interconnect is ns. The average latency is reduced by 58.09%. Similarly, the average latency in application 2 is reduced by 52.90%, and the average latency in application 3 is reduced by 51.16%. From this simulation, it can be observed that the maximum latency is also reduced. The maximum latencies in the three applications are reduced by 48.45%, 46.60%, and 48.66%, respectively. By using the optical network interconnect, the parallel access characteristic also is significantly improved. The high parallelism enables an obvious latency reduction and allows the application to be finished in a short time. V. CONCLUSION This work presents a low-power low-latency optical network for core-to-memory interconnect. The new optical subnetwork is designed to improve the communication performance between cores and memory systems. A 64-core four-rank system based on multiple optical subnetworks is shown as an example. Seven wavelengths are used in communication ensuring no interference. The simulation results of power consumption and latency show that the optical network interconnection architecture exhibits better performance in benchmark applications, which [1] B. Jacob, S. W. Ng, and D. Wang, DRAM memory system organization, in Memory System: Cache, DRAM, Disk, 1st ed. Burlington, VT, 2008, chap. 10, sec. 2, pp [2] Y. Xie, Future memory and interconnect technologies, in Design, Automation and Test in Europe Conf. and Exhibition (DATE), Grenoble, France, 2013, pp [3] I. O Connor, Optical solutions for system-level interconnect, in Int. Workshop on System Level Interconnect Prediction, Paris, France, 2004, pp [4] S. Pasricha and N. Dutt, Trends in emerging on-chip interconnect technologies, IPSJ Trans. Syst. LSI Design Method., vol. 1, pp. 2 17, [5] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, A 5-GHz mesh interconnect for a teraops processor, IEEE Micro, vol. 27, pp , Sept [6] Z. Li, C. Zhu, L. Shang, R. Dick, and Y. Sun, Transactionaware network-on-chip resource reservation, IEEE Comput. Archit. Lett., vol. 7, pp , July [7] Y. Xie, W. Xu, W. Zhao, Y. Huang, T. Song, and M. Guo, Performance optimization and evaluation for torus-based optical networks-on-chip, J. Lightwave Technol., vol. 33, pp , Sept [8] Z. Chen, H. Gu, Y. Yang, and D. Fan, A hierarchical optical network-on-chip using central-controlled subnet and wavelength assignment, J. Lightwave Technol., vol. 32, pp , Mar [9] J. Chan and K. Bergman, Photonic interconnection network architectures using wavelength-selective spatial routing for chip-scale communications, J. Opt. Commun. Netw., vol. 4, pp , Mar [10] G. Hendry, E. Robinson, V. Gleyzer, J. Chan, L. Carloni, N. Bliss, and K. Bergman, Circuit-switched memory access in photonic interconnection networks for high-performance embedded computing, in Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2010, pp [11] B. Zhang, H. Gu, Y. Yang, K. Chen, and Q. Hao, Flyover architecture for cluster and TDM-based optical network-onchip, IEEE Photon. Technol. Lett., vol. 26, pp , Dec [12] F. Liu, H. Zhang, Y. Chen, Z. Huang, and H. Gu, WRH-ONoC: A wavelength-reused hierarchical architecture for optical network on chips, in IEEE Conf. on Computer Communications (INFOCOM), Apr. 2015, pp [13] Z. Chen, H. Gu, Y. Yang, L. Bai, and H. Li, A power efficient and compact optical interconnect for network-on-chip, IEEE Comput. Archit. Lett., vol. 13, pp. 5 8, Jan [14] I. Datta, D. Datta, and P. Pratim Pande, Design methodology for optical interconnect topologies in NoCs with BER and

8 764 J. OPT. COMMUN. NETW./VOL. 8, NO. 10/OCTOBER 2016 Wang et al. transmit power constraints, J. Lightwave Technol., vol. 32, pp , Jan [15] M. Petracca, B. G. Lee, K. Bergman, and L. P. Carloni, Photonic NoCs: System-level design exploration, IEEE Micro, vol. 29, pp , [16] X. Wu, J. Xu, Y. Ye, X. Wang, M. Nikdast, Z. Wang, and Z. Wang, An inter/intra-chip optical network for manycore processors, IEEE Trans. Very Large Scale Integr. Syst., vol. 23, pp , Apr [17] W. Fu, M. Yuan, T. Chen, L. Liu, and M. Wu, SAMNoC: A novel optical network-on-chip for energy-efficient memory access, in IEEE 8th Int. Symp. on Embedded Multicore/ Manycore SoCs (MCSoc), Sept. 2014, pp [18] L. Zhang, Y. Man, X. Tan, M. Yang, T. Hu, J. Yang, and Y. Jiang, On reducing insertion loss in wavelength-routed optical network-on-chip architecture, J. Opt. Commun. Netw., vol. 6, pp , Oct [19] K. Wang, H. Gu, Y. Yang, and K. Wang, Optical interconnection network for parallel access to multi-rank memory in future computing systems, Opt. Express, vol. 23, pp , [20] Z. Su, E. Timurdogan, J. Sun, M. Moresco, G. Leake, D. Coolbaugh, and M. R. Watts, An on-chip partial drop wavelength selective broadcast network, in Conf. on Lasers and Electro-Optics (CLEO), June 2014, paper SF2O.4. [21] H. Jia, L. Yang, Y. C. Zhao, and Q. S. Chen, Reconfigurable non-blocking four-port optical router based on microring resonators, in Proc. IEEE Group IV Photonics (GFP) Conf., Aug. 2015, pp [22] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, DRAMSim2: A cycle accurate memory system simulator, IEEE Comput. Archit. Lett., vol. 10, pp , Jan [23] J. Hestness and S. W. Keckler, Netrace: Dependency-tracking traces for efficient network-on-chip experimentation, Department of Computer Science, The University of Texas at Austin, Tech. Rep., [24] C. Bienia and K. Li, Fidelity and scaling of the PARSEC benchmark inputs, in IEEE Int. Symp. on Workload Characterization (IISWC), Dec. 2010, pp [25] C. Chen, T. S. Zhang, P. Contu, J. Klamkin, A. K. Coskun, and A. Joshi, Sharing and placement of on-chip laser sources in silicon-photonic NoCs, in IEEE/ACM Int. Symp. on Networks-on-Chip (NoCS), Sept. 2014, pp [26] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, and F. Kartner, Building manycore processor-to-dram networks with monolithic silicon photonics, in IEEE Symp. on High Performance Interconnects, 2008, pp [27] A. Biberman, K. Preston, G. Hendry, N. Sherwood-Droz, J. Chan, J. S. Levy, and K. Bergman, Photonic network-on-chip architectures using multilayer deposited silicon materials for high-performance chip multiprocessors, ACM J. Emerging Technol. Comput. Syst., vol. 7, article no. 7, [28] R. Hendry, D. Nikolova, S. Rumley, and K. Bergman, Modeling and evaluation of chip-to-chip scale silicon photonic networks, in IEEE 22nd Annu. Symp. on High- Performance Interconnects (HOTI), Mountain View, CA, 2014, pp [29] S. Manipatruni, M. Lipson, and I. A. Young, Device scaling considerations for nanophotonic CMOS global interconnects, IEEE J. Sel. Top. Quantum Electron., vol. 19, , Mar [30] Y. Audzevich, P. M. Watts, A. West, A. Mujumdar, S. W. Moore, and A. W. Moore, Power optimized transceivers for future switched networks, IEEE Trans. Very Large Scale Integr., vol. 22, pp , Oct [31] G. Li, X. Zheng, J. Lexau, Y. Luo, H. Thacker, P. Dong, S. Liao, D. Feng, D. Zheng, R. Shafiiha, M. Asghari, J. Yao, J. Shi, P. Amberg, N. Pinckney, K. Raj, R. Ho, J. Cunningham, and A. V. Krishnamoorthy, Ultralow-power high-performance Si photonic transmitter, in Optical Fiber Communication Conf. and the Nat. Fiber Optic Engineers Conf. (OFC/NFOEC), Mar. 2010, paper OMI2. [32] X. Zheng, F. Liu, D. Patil, H. Thacker, Y. Luo, T. Pinguet, and K. Raj, A sub-picojoule-per-bit CMOS photonic receiver for densely integrated systems, Opt. Express, vol. 18, pp , [33] E. E. Hussein, S. Safwat, M. Ghoneima, and Y. Ismail, A new signaling technique for a low power on-chip SerDes transceivers, in Int. Conf. on Energy Aware Computing (ICEAC), Dec. 2010, pp [34] K. Bergman, G. Hendry, P. Hargrove, J. Shalf, B. Jacob, K. S. Hemmert, A. Rodrigues, and D. Resnick, Let there be light! The future of memory systems is photonics and 3D stacking, in ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, 2011, pp [35] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carson, W. Dally, and K. Hill, Exascale computing study: Technology challenges in achieving exascale systems, Department of Computer Science and Engineering, University of Notre Dame, Tech. Rep. TR , 2008.

NETWORKS-ON-CHIP (NoC) plays an important role in

NETWORKS-ON-CHIP (NoC) plays an important role in 3736 JOURNAL OF LIGHTWAVE TECHNOLOGY, VOL. 30, NO. 23, DECEMBER 1, 2012 A Universal Method for Constructing N-Port Nonblocking Optical Router for Photonic Networks-On-Chip Rui Min, Ruiqiang Ji, Qiaoshan