
Dynamic Bandwidth and Laser Scaling for CPU-GPU Heterogenous Network-on-Chip Architectures
A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University
In partial fulfillment of the requirements for the degree Master of Science
Scott E. Van Winkle
August 2017. © 2017 Scott E. Van Winkle. All Rights Reserved.

This thesis titled Dynamic Bandwidth and Laser Scaling for CPU-GPU Heterogenous Network-on-Chip Architectures by SCOTT E. VAN WINKLE has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by Avinash Kodi, Associate Professor of Electrical Engineering and Computer Science, and Dennis Irwin, Dean, Russ College of Engineering and Technology.

ABSTRACT VAN WINKLE, SCOTT E., M.S., August 2017, Electrical Engineering Dynamic Bandwidth and Laser Scaling for CPU-GPU Heterogenous Network-on-Chip Architectures (88 pp.) Director of Thesis: Avinash Kodi As the relentless quest for higher throughput and lower energy cost continues in heterogeneous multicores, there is a strong demand for energy-efficient and high-performance Network-on-Chip (NoC) architectures. Heterogeneous architectures that can simultaneously exploit both the serialized nature of the CPU and the thread-level parallelism of the GPU are gaining traction in industry. A critical issue with heterogeneous architectures is finding an optimal way to utilize the shared resources, such as the last-level cache (LLC) and the NoC, without hindering the performance of either the CPU or the GPU cores. Photonic interconnects are a disruptive technology solution that has the potential to increase bandwidth, reduce latency, and improve energy efficiency over traditional metallic interconnects. In this thesis, we propose a CPU-GPU heterogeneous architecture called SHARP (Shared Heterogeneous Architecture with Reconfigurable Photonic Network-on-Chip) that combines CPU and GPU cores around the same router. The SHARP architecture is designed as a Single-Writer Multiple-Reader (SWMR) crossbar with reservation assist to connect the CPU/GPU cores. The architecture consists of 32 CPU cores and 64 GPU compute units. As network traffic exhibits temporal and spatial fluctuations due to application behavior, SHARP can dynamically reallocate bandwidth and thereby adapt to application demands. In this thesis, we propose to dynamically reallocate bandwidth and reduce power consumption by evaluating buffer utilization. While buffer utilization is a reactive technique that deals with fluctuations in application demands, we also propose a proactive technique wherein we use machine learning (ML) to optimize the bandwidth and power consumption.

With ML, instead of predicting the buffer utilization, we predict the number of packets that will be generated by the heterogeneous cluster. Simulations were evaluated using the PARSEC 2.1 and SPLASH2 benchmark suites for the CPU and the OpenCL SDK benchmark suite for the GPU. Our simulation results demonstrate a 34% performance (throughput) improvement over a baseline electrical CMESH while consuming 25% less energy per bit when dynamically reallocating bandwidth. Further simulation results show a 6.9% to 14.9% performance improvement over other flavors of the proposed SHARP architecture without dynamic bandwidth allocation. When dynamically scaling laser power, laser power scaling without ML demonstrates an 8.2% throughput loss with a 60.5% laser power improvement over a high laser power baseline when the reservation window is set to 25 cycles. With a reservation window size of 2000 cycles, laser power scaling with ML has a negligible throughput loss compared to the high laser power baseline, with a 42% laser power savings. With a reservation window size of 500 cycles, laser power scaling with ML demonstrates a 14.6% throughput loss with a 65.5% power savings.

I dedicate my thesis to my wife, parents, friends, and family who have supported and encouraged me through this long endeavor.

ACKNOWLEDGMENTS This work is partially supported by the National Science Foundation (NSF) grants CCF (CAREER), CCF, CCF, and CCF.

TABLE OF CONTENTS
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
List of Acronyms
1 Introduction
  1.1 Why Multicores and Heterogeneous Architectures
  1.2 Network Topologies
  1.3 Why Silicon Photonics
  1.4 Bandwidth and Power Scaling
  1.5 Reasons for Machine Learning
  1.6 Contributions of this Work
  1.7 Organization of Thesis
2 Proposed SHARP Architecture: Bandwidth Reconfiguration and Power Scaling Techniques
  2.1 Prior Work
  2.2 Proposed Architecture (SHARP)
  2.3 Dynamic Bandwidth Scaling
    2.3.1 Variations of the SHARP Architecture
  2.4 Dynamic Power Scaling
  2.5 Machine Learning for Bandwidth Scaling
    2.5.1 Regression Model Derivation
  2.6 Machine Learning for Power Scaling
    2.6.1 Regression Model Derivation
3 Performance Evaluation
  3.1 SHARP Simulation Methodology
  3.2 Machine Learning Simulation Methodology
    3.2.1 Feature Engineering for Dynamic Bandwidth Scaling
    3.2.2 Training and Testing for Dynamic Bandwidth Scaling
    3.2.3 Feature Engineering for Dynamic Power Scaling
    3.2.4 Training and Testing for Dynamic Power Scaling
  3.3 Results for Dynamic Bandwidth Scaling
    3.3.1 Synthetic Traffic Throughput and Latency Results
    3.3.2 Real Traffic Throughput and Latency Results
    3.3.3 Area and Power Results for Real Traffic
    3.3.4 Machine Learning Throughput and Error Results
  3.4 Results on Power Scaling
    3.4.1 Throughput and Power Results
4 Conclusions and Future Work
References

LIST OF TABLES
2.1 Dynamic Bandwidth Allocation Algorithm
2.2 Dynamic Bandwidth Allocation Algorithm with Power Scaling
2.3 Laser Power for the Five Laser Power States
3.1 Benchmark Information
3.2 Architecture Specifications
3.3 Loss and Power Values for Optical Energy Calculations
3.4 Area for SHARP-Dyn [LAS + 09, LHE + 13, Hor14a]
3.5 Dynamic Bandwidth Feature List
3.6 Dynamic Laser Scaling Feature List
3.7 Loss and Power Values for Optical Energy Calculations

LIST OF FIGURES
1.1 Depiction of Moore's law and Dennard scaling [Moo98]
1.2 Comparison of how CPUs and GPUs handle data
1.3 Communication path between CPU and GPU cores
1.4 Examples of different network topologies
1.5 Silicon photonic communication utilizing micro ring resonators to send and receive data
1.6 Bandwidth density comparison between Electrical Interconnects (EI) and silicon photonics (Silicon OI) [HCC + 06]
2.1 256-core Corona architecture [VSM + 08]
2.2 128 compute unit GPU with two photonic crossbars connecting the L1 and L2 caches [ZAU + 15]
2.3 Cluster consists of 2 CPU cores with private L1 data and instruction caches, 4 GPU CUs with private L1 caches, and L2 caches for CPU and GPU all centered around a single router. There are 16 clusters and all clusters are connected to an L3 router via optical waveguides
2.4 3D die layout consisting of three layers: core and cache layer, router and optical transceiver layer, and optical layer containing the data and reservation waveguides
2.5 Electrical to optical and optical to electrical router microarchitecture with reservation waveguide interface: IB = Input Buffer, OB = Output Buffer, O/E = Optical to Electrical conversion, and E/O = Electrical to Optical conversion
2.6 (a) Router Pipeline. (b) Reservation packet. (c) Communication path for CPU-GPU R-SWMR link
2.7 SHARP-CoSeg chip layout. The CPU and GPU cores and their resources are separated between two halves of the chip
3.1 CPU-GPU packet breakdown for each traffic trace
3.2 Synthetic Traffic Latency: (a) Average time a CPU packet spends in the network for 64 wavelengths. (b) Average time a GPU packet spends in the network for 64 wavelengths. Shuff = Perfect Shuffle, HTSP = Random Traffic with a Hot Spot, and UniRnd = Uniform Random
3.3 Synthetic traffic throughput: Shuff = Perfect Shuffle, HTSP = Random Traffic with a Hot Spot, and UniRnd = Uniform Random
3.4 Real traffic latency vs. Workload Injection Rate: (a) Average time a CPU packet spends in the network for 64 wavelengths. (b) Average time a GPU packet spends in the network for 64 wavelengths
3.5 Real Traffic Latency: (a) Average time a CPU packet spends in the network for 64 wavelengths. (b) Average time a GPU packet spends in the network for 64 wavelengths
3.6 Real traffic throughput results for 64 wavelengths
3.7 The mean throughput results with an increase in injection rate for 64 wavelengths
3.8 Real traffic percent throughput increase for 64 wavelengths when comparing SHARP-Dyn to SHARP-BanSp, SHARP-CoSeg, and SHARP-FCFS
3.9 Real traffic percent throughput increase for 64, 32, and 16 wavelengths when comparing SHARP-Dyn to SHARP-BanSp, SHARP-CoSeg, and SHARP-FCFS
3.10 Energy per bit based off the average throughput of each network
3.11 Energy per bit based off the average throughput of each network
3.12 Machine Learning Dynamic Bandwidth Throughput Comparison
3.13 Comparison of the normalized RMSE for the training, testing, and validation data for the machine learning dynamic bandwidth scaling
3.14 Testing data error for each benchmark pair for the machine learning dynamic bandwidth
3.15 Average NRMSE for dynamic power and machine learning
3.16 Average laser power consumption per router for each benchmark
3.17 Average network throughput for dynamic laser scaling
3.18 Wavelength breakdown for the dynamic laser scaling with 500 cycle reservation window
3.19 Wavelength breakdown for the dynamic laser scaling with 25 cycle reservation window
3.20 Wavelength breakdown for the dynamic laser scaling with 2000 cycle reservation window
3.21 Wavelength breakdown for the machine learning laser power scaling without 8 wavelength state and a 500 cycle reservation window
3.22 Wavelength breakdown for the machine learning laser power scaling with 500 cycle reservation window
3.23 Wavelength breakdown for the machine learning laser power scaling with 25 cycle reservation window
3.24 Wavelength breakdown for the machine learning laser power scaling with 2000 cycle reservation window

LIST OF ACRONYMS
BAC - Bandwidth Allocation Controller
BW_S - Buffer Write Source
BW_D - Buffer Write Destination
CMESH - Concentrated MESH
DCT - Discrete Cosine Transforms
Dwrt - One-dimensional Haar Wavelet Transform
DVFS - Dynamic Voltage and Frequency Scaling
DWDM - Dense Wavelength Division Multiplexing
E/O - Electrical to Optical Conversion
EDP - Energy Delay Product
FA - Fluid Animate
fmm - Fast Multipole Method
HTSP - Random Traffic with a Hot Spot
IB - Input Buffer
MIMD - Multiple Instruction Multiple Data
ML - Machine Learning
MRRs - Micro-Ring Resonators
MWSR - Multiple Write Single Read
NRMSE - Normalized Root-Mean-Square Error
NoCs - Networks-on-Chip
O/E - Optical to Electrical Conversion
OB - Output Buffer
OL - Optical Link Traversal
QRS - Quasi Random Sequence
RB - Reservation Broadcast
RC - Route Computation
R-SWMR - Reservation-assisted Single Write Multiple Read
Rad - Radiosity
Reduc - Reduction
RMSE - Root-Mean-Square Error
RW - Reservation Window
SA - Switch Allocation
SHARP - Shared Heterogeneous Architecture with Reconfigurable Photonic Network-on-Chip
SHARP-BanSp - SHARP Bandwidth Split
SHARP-CoSeg - SHARP Core Segregation
SHARP-Dyn - SHARP with Dynamic Bandwidth Allocation
SHARP-FCFS - SHARP First-Come-First-Serve
SIMD - Single Instruction Multiple Data
Shuff - Perfect Shuffle
SWMR - Single Write Multiple Reader
UniRnd - Uniform Random
WL - Wavelength
x264 - x264

1 INTRODUCTION
1.1 Why Multicores and Heterogeneous Architectures
For the past several decades, Moore's Law and Dennard scaling have defined the design and development of microprocessors. Moore observed that the transistor count in integrated circuits doubled approximately every two years [Moo98]. This observation, made in 1965, held true for the next 50 years, as shown in Figure 1.1. Increases in transistor density have opened an enormous range of possibilities for computer chip design. Robert Dennard observed that as transistor size continues to decrease, the total power dissipation per transistor should also decrease [Opa15]. Scaling the technology down reduces the voltage required across the transistor and enables a higher clock frequency. In 2006, however, Dennard scaling came to an end. Transistor sizes continued to shrink, but Dennard's model did not account for the transistor's leakage current, which grows steadily with technology scaling. This led to a power wall at around a 4 GHz clock frequency, with static power dominating the total power consumption of integrated circuits. In order to maintain the performance postulated by Moore's Law, microprocessor manufacturers started integrating multiple cores onto a single chip [Moo98]. Industry has been forced to fabricate multiple cores on a single chip to achieve higher performance via parallelism rather than higher clock frequency; the multicore design paradigm is therefore the way to circumvent the limitations of Dennard scaling while satisfying Moore's Law. Furthermore, many current chips integrate general-purpose CPU cores with specialized GPU cores. Due to the thread-level parallelism of the GPU, it has been shown that a GPU's throughput can outperform a CPU's on similar tasks [GH11]. Figure 1.2b shows a Multiple Instruction Multiple Data (MIMD) configuration, typically used in a CPU core to handle data and instructions. Figure 1.2a shows a Single Instruction Multiple Data (SIMD) configuration, typically used in a GPU core to handle instructions and data.

Figure 1.1: Depiction of Moore's law and Dennard scaling [Moo98].
A SIMD-style processor is best at handling large vectors that require the same operation on each data point. Additionally, SIMD processors typically have a smaller instruction set than MIMD processors, which gives the SIMD processor a smaller footprint [Chr98]. For example, a SIMD processor would be ideal for modifying pixels while processing images. Since images are comprised of an array of pixel colors, increasing the brightness of an entire image requires modifying each pixel in the array. A SIMD processor would require one instruction to modify the image and could change each pixel in parallel. The SIMD's smaller core size allows for increased core counts; therefore, along with its parallel nature, the SIMD processor will outperform the MIMD in image processing. As CPUs and GPUs offer significant variability in performance, an application programmer has different options for how to execute an application effectively.

Figure 1.2: Comparison of how CPUs and GPUs handle data: (a) SIMD (GPU); (b) MIMD (CPU).
Research has demonstrated that when the CPU needs to communicate off chip, GPUs tend to show increased latency for data transfers [GH11]. However, placing CPUs and GPUs on an integrated platform and sharing the common network resources can decrease the data transfer latency for both core types. Figure 1.3 compares the communication paths between CPU and GPU cores when the GPU is off-chip and when it is on-chip. Maximizing throughput and energy efficiency for the network, cache, and I/O must be considered when implementing these heterogeneous architectures. Research into CPU-GPU heterogeneous architectures has grown extensively in the past several years in an effort to take advantage of the strengths of both processor types. Commercial chips demonstrating the power of CPU-GPU heterogeneous architectures are released every year, including Intel's Broadwell [Intb] and Skylake [Inta], NVIDIA's Tegra X1 [NVI], and AMD's Carrizo [AMD].

Figure 1.3: Communication path between CPU and GPU cores: (a) off-chip communication; (b) on-chip communication.
CPU and GPU cores can share multiple resources within the chip's architecture, including the network bandwidth, last-level cache, memory controllers, and main memory. Several critical challenges remain, such as power optimization, resource allocation, resource management, and the interaction between these resources. Addressing each of these challenges will help improve the overall system performance of heterogeneous multicores.

1.2 Network Topologies
As silicon-based processors become more complex and consume increased power due to static components, the data path between the heterogeneous cores and memory becomes the lifeline to higher performance. The design and implementation of the on-chip communication fabric is therefore extremely critical. Networks-on-Chip (NoCs) have become the de facto standard for interconnecting multiple heterogeneous cores. Previous research has shown that, for 8 to 16 cores, the NoC consumes on average between 15% and 30% of the total chip power, and for some applications up to 50% [FAA08]. The many components in the NoC router can consume a substantial amount of the NoC's static and dynamic power. The main components of the router are the buffers, where packets are stored; the crossbar, where packets are switched; and the links, over which packets are routed between routers. Given this, increasing the core count beyond 16 cores can have a problematic effect on the total energy consumed by the NoC. Several processors currently on the market have tens to hundreds of cores, and researchers at the University of California, Davis have created a 1000-core processor called KiloCore [BSP + 16]. Novel solutions that increase the efficiency of the NoC architecture will therefore play an important role in the total power consumption of the chip. Determining the optimal power-area-latency point for the NoC depends heavily on the network topology. A bus network is used in multicores with 4-8 processing cores; Figure 1.4a represents a bus network. The advantage of using a bus for small core counts is its simplicity and minimal hardware requirements. Due to parasitic capacitance and arbitration complexity, the bus topology has severe limitations as the processor count continues to increase. At higher core counts, latency becomes a fundamental limitation, since packets must wait through longer arbitration cycles. Figure 1.4b represents a ring topology, which is comprised of nodes each connected to two other nodes in the clockwise and counterclockwise directions.

Figure 1.4: Examples of different network topologies: (a) bus; (b) ring; (c) crossbar; (d) mesh; (e) torus.
The ring network is not hindered by the large-scale arbitration issues of a bus network; its two connections per node minimize arbitration complexity. However, ring topologies do not scale well to larger core counts due to the number of hops needed to reach the other side of the network. Figure 1.4c shows a crossbar topology, set up such that any node can reach its destination node by traversing a single switch. Every node is one hop away from every other node, and there is contention in the network only if multiple source nodes need to reach the same destination. However, the number of switch points grows as O(N^2), where N is the number of nodes connected to the network. Figure 1.4d represents a mesh topology, which is conceptually similar to the ring topology without limiting each node to two connections. In the 2-dimensional mesh shown, each node has two to four connections to adjacent nodes. The mesh can be implemented in higher dimensions, but this complicates the routing mechanisms and hardware implementation. Figure 1.4e is a torus topology, in which each node has exactly four connections.
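To make these hop-count and scaling differences concrete, the following sketch computes average shortest-path hop counts for a bidirectional ring and a 2D mesh under uniform traffic and contrasts them with the crossbar's constant single hop. It is an illustrative approximation, not an analysis from the thesis.

```python
import itertools

def avg_hops_ring(n):
    """Average shortest-path hops in an n-node bidirectional ring."""
    nodes = range(n)
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    total = sum(min(abs(s - d), n - abs(s - d)) for s, d in pairs)
    return total / len(pairs)

def avg_hops_mesh(k):
    """Average shortest-path hops in a k x k 2D mesh (Manhattan distance)."""
    coords = list(itertools.product(range(k), range(k)))
    pairs = [(s, d) for s in coords for d in coords if s != d]
    total = sum(abs(s[0] - d[0]) + abs(s[1] - d[1]) for s, d in pairs)
    return total / len(pairs)

# A crossbar is always one hop; ring and mesh hop counts grow with node count.
for n in (16, 64):
    k = int(n ** 0.5)
    print(f"{n} nodes: ring {avg_hops_ring(n):.2f} hops, "
          f"mesh {avg_hops_mesh(k):.2f} hops, crossbar 1 hop")
```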

Figure 1.5: Silicon photonic communication utilizing micro ring resonators to send and receive data.
The key advantage of the mesh and torus topologies over the ring topology is the decreased hop count. Although the mesh and torus topologies cannot compete with the latency of a crossbar, their footprints are significantly smaller than the crossbar's [MB06].
1.3 Why Silicon Photonics
Due to the high energy demands of the NoC, new emerging technologies are needed to decrease interconnect energy costs while maintaining performance trends. Photonic interconnects for NoC architectures are a developing technology that can support this endeavor. At larger scales, photonics outperforms electrical interconnects in both speed and energy efficiency. Large-scale computer networks, such as high-speed data-center networks and the Internet, already utilize dense wavelength division multiplexing (DWDM) to increase network throughput. DWDM allows multiple channels on a single waveguide using separate wavelengths, providing a substantial gain in bandwidth and energy efficiency for photonic interconnects.
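As a back-of-the-envelope illustration of the DWDM gain (the per-channel rate below is an assumed value, not a figure from the thesis), the aggregate bandwidth of a waveguide is simply the channel count times the per-channel data rate:

```python
def waveguide_bandwidth_gbps(num_wavelengths, channel_rate_gbps):
    """Aggregate DWDM bandwidth: independent wavelength channels
    multiplexed onto a single waveguide add up linearly."""
    return num_wavelengths * channel_rate_gbps

# Example: 64 wavelengths (the channel count SHARP's links use)
# at an assumed 10 Gb/s per channel.
print(waveguide_bandwidth_gbps(64, 10))  # -> 640 Gb/s on one waveguide
```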

Figure 1.6: Bandwidth density comparison between Electrical Interconnects (EI) and silicon photonics (Silicon OI) [HCC + 06].
Photonic technology has the potential to meet the bandwidth demands of future integrated chips while decreasing energy costs relative to its electrical counterpart. Figure 1.5 represents a photonic interconnect between a source and a destination. In this example, an external laser is coupled to a waveguide. The source's ring resonator modulates the data by coupling the light; the light is then coupled into the destination's germanium photodetector and converted into a voltage. Research such as [TWG + 12, Sor06] has shown the potential of photonic devices and has successfully demonstrated modulation rates in excess of 40 Gb/s.

Other research has demonstrated a 7.9 fJ/bit energy consumption with a 2.5 µm radius micro-ring resonator [MPCL10, XZL + 12]. Furthermore, photonics offers a lower latency per millimeter than electrical interconnects [HCC + 06], making it suitable for on-chip communication. Photonic interconnects built for silicon chips do not come without drawbacks or complications in the fabrication process. To ensure successful communication with minimal crosstalk, waveguides need to be spaced on the order of micrometers apart [HCC + 06]. As seen in Figure 1.6, for a single wavelength this places silicon photonics at a clear disadvantage in bandwidth density; this issue can be remedied with DWDM. Crosstalk also becomes an issue with ring resonators because temperature affects the refractive index of the device, which has the potential to increase the bit error rate of the communication network. High-performance processors can exhibit a wide range of temperature fluctuations, causing complications for the photonic hardware. Techniques have been developed to thermally isolate ring resonators, which can decrease the thermal fluctuations. An additional solution to the thermal issue is to provide each ring resonator with an individual temperature-control mechanism [GLM + 11]. This thermal tuning mechanism adds an additional energy cost to operating the photonic NoC correctly.
1.4 Bandwidth and Power Scaling
Conserving power and increasing performance has many benefits within the digital-electronics paradigm. Whether conserving battery life on a cellular phone or saving research costs by decreasing the computational time and energy consumed by modern supercomputers, dynamic scaling techniques will continue to play a pivotal role in future multicores. Research by [CTD + 16] proposed a novel topology partitioning method that can save the Tianhe-2 supercomputer half a million dollars per year in energy costs, which clearly demonstrates the power and fiscal burden these computers can place on the researchers and companies that maintain them.

Given the amount of power needed to operate these computers, the optimization of every component within the system will pay for itself. There are many ways to improve the performance of a NoC architecture. One is to identify an underutilized resource within the network and develop a method to extract its potential, such as dynamically allocating the unused resource to another part of the network. Within heterogeneous architectures, dynamically allocating resources between two different core types can change the behavior of the NoC, and dynamically managing resources can minimize energy consumption while maximizing the utilization of network resources. Another approach to managing unused resources is to decrease or even eliminate their power consumption. Dynamic voltage and frequency scaling (DVFS) decreases energy consumption by reducing the frequency and voltage at which the transistor gates operate. The transistor's power behavior can be described by the following equation [LSH10]:

P = C f V^2 + P_static    (1.1)

where P is the total power consumed by the transistor, C is the gate capacitance, f is the clock frequency, V is the gate voltage, and P_static is the static power consumed by the transistor. As the equation shows, decreasing the frequency reduces the power consumed by the transistor, and when the frequency is decreased, the voltage needed to operate the transistor can also be reduced, dramatically decreasing the power due to the V^2 term. With DVFS, voltage and frequency levels can be changed to cater to the application's demands, conserving network power during low-utilization periods. Using a technique called power gating, voltage levels are reduced to zero by removing the supply voltage from a component in order to eliminate its energy consumption [AMP + 15]. However, power gating has a latency cost that must be accounted for when the component needs to be reactivated.
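A small numeric sketch of Equation 1.1; all parameter values here are assumed for illustration and are not taken from the thesis:

```python
def transistor_power(c_gate, freq, voltage, p_static):
    """Equation 1.1: dynamic power C*f*V^2 plus static power."""
    return c_gate * freq * voltage**2 + p_static

# Assumed values: 1 fF effective capacitance, 0.5 uW static power,
# comparing 4 GHz at 0.9 V against a DVFS-scaled 2 GHz at 0.7 V.
nominal = transistor_power(1e-15, 4e9, 0.9, 0.5e-6)
scaled = transistor_power(1e-15, 2e9, 0.7, 0.5e-6)
print(f"nominal: {nominal * 1e6:.2f} uW, scaled: {scaled * 1e6:.2f} uW")
# Halving f alone would halve dynamic power; lowering V helps quadratically.
```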

When considering these methods, there is a tradeoff between power consumption and performance. DVFS and power gating are mostly ad-hoc techniques that rely on usage or load to tune the voltage or frequency in response to application behavior; data-driven methods have the potential to outperform such ad-hoc techniques.
1.5 Reasons for Machine Learning
Machine learning is a form of pattern recognition that utilizes computer algorithms to identify regularities in sets of data. Once the identification process has occurred, these regularities can be used to assign labels to each set of data. There are three forms of machine learning: supervised, unsupervised, and reinforcement learning. Supervised learning utilizes a set of data that has a predetermined outcome called a label. Unsupervised learning classifies data that does not have a label. Reinforcement learning, like unsupervised learning, does not have a label; the model is instead derived through a series of rewards and penalties based on event outcomes. The focus of this thesis is supervised learning, which can be divided into two categories, classification and regression. Classification is used when the label for the data has a finite set of possible outcomes, while regression is used when the possible outcomes for a given set of data are continuous [Bis06]. Machine learning has the potential to provide an additional advantage over dynamic allocation mechanisms; used in tandem with a dynamic allocation mechanism, machine learning can aid performance improvements within a NoC architecture. Machine learning is being utilized throughout the scientific and engineering communities, and its extensive use has generated hardware specifically designed for the task: Google has introduced the Tensor Processing Unit, which is said to have a 10x increase in efficiency over hardware not specialized for machine learning [Arm].

Within the NoC community, machine learning can be used to create a proactive technique for predicting the network's behavior, in contrast to reacting to what has already occurred. Reactive techniques can only extrapolate from what has already happened; proactively predicting what will occur based on past patterns has the potential to dramatically improve the dynamic scaling techniques discussed in the previous section.
1.6 Contributions of this Work
In this thesis, we focus on improving the performance of CPU-GPU heterogeneous 3D-NoC architectures that utilize silicon photonics as the interconnection network. The proposed SHARP architecture utilizes a novel checkerboard pattern to place CPU and GPU cores at each router in order to share network bandwidth between the two core types. This approach is engineered such that the dynamic bandwidth allocation mechanism can be executed locally within each router, without requiring global coordination across multiple routers. This fine-grain bandwidth reconfiguration is achieved by considering a sliding window of buffer utilization for each core type, thereby balancing the network bandwidth with application demands. While dynamic bandwidth allocation is implemented to improve performance, we safeguard CPU applications from being overrun by GPU applications by giving priority to CPU traffic. We also propose 3D integration to minimize the distance electrical signals must travel to communicate between the optical and electrical layers; 3D integration also reduces the potential for waveguide crossings, lowering the laser power needed to drive the photonic link. We further explore machine learning methods to improve the performance of the SHARP architecture. We propose a polynomial regression model to predict buffer link utilization, from which we can further extrapolate optimal buffer threshold levels to increase throughput.

These buffer threshold levels dictate the amount of bandwidth allocated to each core type within the reservation window. We also explore two techniques to dynamically allocate laser power at run time with on-chip lasers. The first is a reactive buffer-based prediction scheme that utilizes buffer occupancy to determine the amount of laser power used for a reservation window; this prediction scheme trades throughput for power savings. The second utilizes machine learning, specifically linear ridge regression, as a proactive technique for improving the power consumption of a heterogeneous CPU-GPU photonic NoC architecture. The machine learning method uses linear ridge regression to predict the number of packets that will be injected into each router; this information determines how many wavelengths will be needed within the next reservation window.
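A minimal sketch of this kind of ridge-regression predictor, using scikit-learn; the features and data below are hypothetical stand-ins rather than the thesis's actual feature set:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical training set: per-window features for one router
# (e.g., packets injected in recent windows, summed buffer occupancy),
# labeled with the packet count observed in the following window.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 200, size=(1000, 3)).astype(float)
y_train = 0.6 * X_train[:, 0] + 0.3 * X_train[:, 1] + rng.normal(0, 5, 1000)

model = Ridge(alpha=1.0)  # the L2 penalty keeps the learned weights stable
model.fit(X_train, y_train)

# Predict next-window packet injections for one router; a controller would
# then map this count onto a wavelength state for the next window.
predicted = model.predict(np.array([[120.0, 80.0, 40.0]]))[0]
print(f"predicted packets in next window: {predicted:.1f}")
```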

All simulations were executed with 32 CPU cores and 64 GPU compute units. The CPU benchmarks were chosen from the PARSEC 2.1 benchmark suite [BL09] and the SPLASH2 benchmark suite [WOT + 95], and the GPU benchmarks from the OpenCL SDK benchmarks provided by AMD through [UJM + 12]. Our simulation results demonstrate a 34% performance improvement over a baseline electrical CMESH while consuming 25% less energy per bit when dynamically reallocating bandwidth. SHARP has shown a 6.9% to 14.9% performance improvement over other photonic architectures without dynamic bandwidth allocation. When dynamically scaling laser power, the reactive buffer-based dynamic prediction method demonstrates an 8.2% throughput loss with a 60.5% laser power improvement over a high laser power baseline when the reservation window is set to 25 cycles. With a reservation window size of 2000 cycles, the ML laser power scaling method has a negligible throughput loss compared to the high laser power baseline, with a 42% laser power savings. With a reservation window size of 500 cycles, the machine learning method demonstrates a 14.6% throughput loss with a 65.5% power savings.
1.7 Organization of Thesis
The thesis is organized as follows. Chapter 2 discusses the proposed architecture, beginning with the prior work that this research builds upon; details of the SHARP architecture follow, and the latter half of the chapter focuses on the dynamic and machine learning methods used to further improve the performance and power-efficiency of the SHARP architecture. Chapter 3 presents the simulation methodology for the SHARP architecture, describes the competing architectures, and expands upon the dynamic allocation and machine learning methods used within the simulations; the chapter concludes with the throughput, latency, and power results. Chapter 4 concludes the thesis with closing remarks and possible future work.

2 PROPOSED SHARP ARCHITECTURE: BANDWIDTH RECONFIGURATION AND POWER SCALING TECHNIQUES
This chapter covers the architecture specification for SHARP, the dynamic bandwidth and power allocation, and the machine learning techniques. The chapter is divided into six sections. The first section covers the prior work related to this thesis and the mechanisms that are built upon. Sections two and three describe the SHARP architecture and the dynamic bandwidth allocation mechanism. Section four focuses on the dynamic power scaling integrated into the SHARP architecture. Sections five and six discuss the machine learning techniques used to further improve the dynamic scaling methods of sections three and four.
2.1 Prior Work
There has been extensive research investigating architectures that implement photonic interconnects for communication between multiple cores. Previous research such as Corona [VSM + 08] and the hybrid GPU design by [ZAU + 15], along with many others [KM10, PKK + 09, MKL12b], has shown that both CPUs and GPUs with large core counts can take advantage of the power efficiency, lower latencies, and higher bandwidth yielded by photonic NoCs. Corona [VSM + 08], seen in Figure 2.1, is a 256-core CPU with a total of 1024 threads that uses a Multiple Write Single Read (MWSR) photonic crossbar interconnect. The MWSR photonic crossbar connects multiple source routers to a single destination router, which requires an arbitration mechanism to prevent multiple routers from communicating on the photonic connection at the same time. The purpose of the Corona architecture is to show the advantages of connecting the cores, cache, and main memory with a nanophotonic interconnect. Each waveguide can support 64 wavelengths using Dense Wavelength Division Multiplexing (DWDM), and the ability to write on a wavelength is governed by a token arbitration scheme [PKK + 09, VSM + 08].

Figure 2.1: 256-core Corona architecture [VSM + 08].
The results show that Corona has a 2x to 6x performance improvement over equivalent electrical architectures on applications with heavy memory accesses, and energy costs for the applications were significantly decreased relative to the electrical interconnect. Firefly is a nanophotonic architecture that focuses on a novel implementation of a Single Write Multiple Reader (SWMR) crossbar, which connects a single source router to multiple destination routers. In Firefly, the power efficiency of the nanophotonic interconnect is improved by implementing a Reservation-assisted Single Write Multiple Read (R-SWMR) crossbar network. The advantage of the R-SWMR link is the decrease in energy costs due to the reservation packet, whose purpose is to inform the rest of the network which router will receive the next data packet. This has two key effects on network communication.

Figure 2.2: 128 compute unit GPU with two photonic crossbars connecting the L1 and L2 caches [ZAU + 15].
First, the on-chip network no longer needs the complex token arbitration mechanism associated with MWSR. Second, the R-SWMR decreases the laser energy cost on the data waveguide by having only one router listening on the channel to receive the packet. Firefly shows a 57% performance improvement over an electrical concentrated-mesh network and a 54% performance improvement over a nanophotonic crossbar network, and it demonstrates an increase in energy efficiency over both. The work done by [ZAU + 15] connected a 128 compute unit (CU) GPU with two photonic crossbar implementations; the architecture layout can be seen in Figure 2.2. The first crossbar communicates between L1 and L2 with a MWSR communication scheme and, similar to Corona, uses token passing for arbitration. Having a token arbitration scheme from L1 to L2 has minimal effect on the performance of the CUs due to the GPU programming model: no other CU should be stalled awaiting completion of a write-back transfer [ZAU + 15]. The second crossbar implementation uses a SWMR communication scheme to pass data from L2 to L1.

There is a destination address in the header of each packet to indicate which node should receive it. Each node reads the packet header to determine whether it is the intended recipient, and the nodes that should not be reading the packet stop listening. This is implemented to minimize the power lost to having multiple listeners on the line [ZAU + 15]. The data transferred from L2 to L1 is more latency-sensitive than the L1-to-L2 traffic, and having a destination in the packet header with the SWMR connection eliminates any latency overhead for token arbitration. This architecture takes advantage of the many-to-few traffic characteristic discussed in [BKA10]. The results of this study show that the photonic GPU outperforms a GPU with an electrical crossbar by an average of 43% in performance while averaging a 3% power improvement [ZAU + 15]. Dynamically managing resources can minimize energy consumption and maximize the utilization of network resources. The 3D-NoC proposed by [MKL12a] utilizes a 3D multilayered design to implement dynamic bandwidth allocation. This design uses link utilization and buffer utilization to route traffic to different layers of the chip, dynamically reconfiguring the NoC to maximize throughput. Dynamic reconfiguration of the network not only optimizes throughput but also helps to tolerate faults. 3D-NoC demonstrates a 10% to 25% performance improvement for a 64-core network, and its simulation results have shown a 25% performance improvement while demonstrating a 6% to 36% energy savings. Techniques used for dynamic bandwidth allocation in homogeneous architectures can easily be adapted to heterogeneous architectures. A feedback-directed virtual channel partitioning mechanism that allocates different numbers of virtual channels to each core type has been proposed in [LLKY13]. This mechanism allocates at least one virtual channel to the CPU and thereby prevents memory-intensive GPU traffic from starving CPU traffic of network resources. Furthermore, there have been several proposed ways to manage the cache, main memory, and data transfer mechanisms in heterogeneous architectures [LLKY13, JJP + 12, KLJK14].

Some research by [SMJ + 14] has investigated dynamically allocating bandwidth between CPU and GPU core types using photonic interconnects, demonstrating the benefits of dynamic bandwidth allocation for a heterogeneous photonic NoC. The dynamic allocation is done using an R-SWMR link similar to [PKK + 09], and the bandwidth is allocated every time there is a change of task on a core. The bandwidth request from the core is stored in a table at the router to which the core is connected. After a task change, the bandwidth is allocated using a token arbitration mechanism implemented on a separate controller waveguide. Dynamically adapting to traffic demands can be beneficial to power consumption, and when implemented properly it can have a minor impact on throughput and latency. In PROBE [ZK13], an adaptive photonic bandwidth scaling technique was proposed that utilizes a table-based prediction scheme. PROBE is a 64-core architecture that implements a binary-tree-based waveguide system with adaptive channels to allocate bandwidth on demand. With the use of three bandwidth modes, PROBE utilizes a history-based prediction method to determine the bandwidth setting based on traffic fluctuations. This work demonstrates a 60% laser power saving with a maximum of 11% throughput loss. Applying a power-gating technique to a photonic network has shown potential to increase energy performance for NoC architectures by reducing the static laser power. Previous research such as SLaC [DH16] has demonstrated the use of on-chip InP lasers directly integrated into silicon to conserve energy by power gating the laser. As the laser is directly integrated, energy efficiency can be greatly improved due to a 2 ns laser turn-on time. SLaC utilizes a flattened butterfly topology in order to provide multiple paths to the destination router. In low-traffic situations, several of the integrated lasers can be turned off, while some lasers are left on to maintain communication paths and avoid the laser turn-on delay. The increased number of paths from source to destination afforded by the flattened butterfly enables SLaC to avoid most laser turn-on delays.

The routing mechanism within SLaC avoids the links that have been turned off and turns on additional lasers during higher-traffic periods. SLaC has demonstrated 43%-57% energy savings while exhibiting a 2% loss in performance on real-world applications. The use of machine learning is beginning to gain traction in the NoC community. Machine learning has been utilized in electrical NoCs to optimize the last-level cache, Dynamic Voltage and Frequency Scaling (DVFS), and fault detection [JPS16, GP16, DBKL16]. Machine learning has the potential to optimize NoC systems beyond what is possible with reactive techniques; research into machine learning for NoC architectures is currently in its infancy and has the potential to further improve electrical and photonic NoC performance. SHARP builds upon many of the concepts proposed by the above work. Given the advantages of R-SWMR, our proposed SHARP design adapts the R-SWMR link to a heterogeneous CPU-GPU architecture that also implements dynamic bandwidth reconfiguration, and we modify the reservation packet to aid in the dynamic allocation of bandwidth between the two core types. Corona and the work done by [ZAU + 15] demonstrate the advantages photonics can offer CPU and GPU architectures. Although heterogeneous architectures have been explored in past works [ZKDI11, USM + 14, FLL + 15], the use of photonic NoCs with a CPU-GPU heterogeneous architecture can be further expanded upon in future work. The latency and throughput benefits provided by photonics can be optimized such that resources are shared effectively between the CPU and GPU cores. The PROBE architecture demonstrates the benefits of dynamically allocating photonic bandwidth to decrease power consumption. The research proposed in this thesis expands upon dynamic bandwidth scaling and supplements it with an on-chip laser as proposed by SLaC, enabling dynamic laser power scaling in addition to bandwidth scaling to further increase the flexibility of the network.

Figure 2.3: Cluster consists of 2 CPU cores with private L1 data and instruction caches, 4 GPU CUs with private L1 caches, and L2 caches for the CPU and GPU, all centered around a single router. There are 16 clusters, and all clusters are connected to an L3 router via optical waveguides.
2.2 Proposed Architecture (SHARP)
SHARP Layout: Figure 2.3 represents our proposed SHARP architecture, which consists of 32 CPU cores and 64 GPU cores. In SHARP, a cluster is made up of two CPU cores, four GPU compute units (CUs), one router, and the corresponding L1/L2 caches. We propose a checkerboard pattern such that each router is directly connected to two CPU cores and four GPU CUs. In this design, CPUs and GPUs contend locally at the router. Under high-traffic scenarios for one core type, contention is more manageable since both core types contend at each router. If the two core types were segregated such that all CPUs were in one half of the chip and all GPUs in the other, there would be a traffic imbalance, with one half seeing high traffic while the other half sees low traffic.

In our checkerboard design, this scenario is avoided because contention is balanced under all traffic conditions. Each CPU core has its own private L1 instruction and data caches, and each GPU CU has its own private L1 cache. Within each cluster, there is a shared CPU L2 cache and a shared GPU L2 cache. The router at each cluster connects to the shared L3 cache. With multiple caches spread throughout the architecture, cache coherence becomes pivotal to multiprocessor applications: a piece of data used by multiple cores can reside in multiple caches, and a cache coherence protocol ensures that the data received is the most recent version. The cache coherence protocol used with SHARP is NMOESI [UJM + 12]. All 16 routers are organized in a 4x4 grid, and the shared L3 cache is connected using an optical crossbar employing a SWMR approach. The purpose of the L3 cache is to decrease the time needed to communicate between the CPU and GPU sides of the chip. The L3 cache is split evenly between the CPU and GPU cores. For the CPU to communicate with the GPU, the necessary data is copied from the CPU bank of the L3 cache to the GPU bank; for the GPU to communicate with the CPU, the data is copied from the GPU bank to the CPU bank. To scale the design to larger core counts, more optical layers could be added to communicate between different layers of the chip, similar to the 3D-NoC architecture [MKL12b]. We propose a 3D layout that consists of three layers, as shown in Figure 2.4. The first layer consists of the CPU cores, GPU CUs, and the caches. The second layer is the optical interface layer connecting the routers to the waveguides. The last layer consists of the optical interconnects, divided into data and reservation waveguides. The proposed 3D layout can speed up communication by increasing throughput and reducing wire lengths with the use of Through-Silicon Vias (TSVs) to drive the optical circuitry. Additionally, the 3D architecture reduces waveguide crossings, which in turn reduces the laser energy needed for data communication.

Figure 2.4: 3D die layout consisting of three layers: the core and cache layer, the router and optical transceiver layer, and the optical layer containing the data and reservation waveguides.
Figure 2.5: Electrical-to-optical and optical-to-electrical router microarchitecture with reservation waveguide interface: IB = Input Buffer, OB = Output Buffer, O/E = Optical to Electrical conversion, and E/O = Electrical to Optical conversion.

Inter-Router Communication: We chose an optical link with reservation assist (R-SWMR) for implementing inter-router communication. Under R-SWMR, the transmitting router uses the reservation waveguide to broadcast to the remaining routers connected on the optical link, informing them of the intended destination; then only the intended destination listens on the channel while the transmitter sends the data. Figure 2.5 shows the router architecture for SHARP, and Figure 2.6a represents the pipeline used within the router architecture. When a packet is generated by either a CPU or a GPU core, it is placed in an input buffer and its route is computed (RC). Next, the reservation broadcast (RB) is converted into optical format (E/O) and coupled to the reservation waveguide. The packet then requests the crossbar in switch allocation (SA) and traverses the crossbar (BW_S). The E/O conversion occurs by using the electrical drivers to modulate the ring resonators, coupling the signal to the optical link; this is when the optical link traversal portion of the router pipeline occurs (OL). At the destination, the reverse process takes place: the optical signal is filtered and converted into electrical format using a combination of a photodetector, a TIA, and voltage amplifiers (O/E). The packet is written into the buffer (BW_D), from which it is transferred to its intended destination via switch allocation (SA). Figure 2.6b gives a routing example for SHARP. After RC occurs for each core type, the RB is coupled to the reservation waveguide; this process is represented by the yellow line highlighting the reservation waveguide in Figure 2.6b and notifies the network which routers are to receive the CPU and GPU packets. Each router reads the reservation packet to determine whether it needs to listen to the data line in the subsequent cycles. As shown in Figure 2.6b, routers R_1 and R_3 ignore the reservation packet because they are not the destination for either the CPU or GPU data coming from R_0; correspondingly, the R_1 and R_3 Micro-Ring Resonators (MRRs) are switched off in order to save energy. R_2 (receiving the CPU data) and R_N-1 (receiving the GPU data) tune their MRRs per the dynamic allocation bits set by the sending router in the reservation packet, giving R_2 and R_N-1 the ability to receive the corresponding data from the sending router.

Figure 2.6: (a) Router pipeline. (b) Reservation packet, with fields: CPU destination address (log(N) bits), GPU destination address (log(N) bits), CPU packet size (log(S_CPU) bits), GPU packet size (log(S_GPU) bits), dynamic allocation (log(D) bits), and L3 destination bit (log(N_L3) bits). (c) Communication path for the CPU-GPU R-SWMR link.
2.3 Dynamic Bandwidth Scaling
As the GPU has a tendency to flood the network [MV15], care must be taken when designing the dynamic bandwidth allocation algorithm to prevent the GPU from starving the CPU of network resources. Our proposed dynamic bandwidth allocation algorithm was designed with the following goals (a code sketch of the resulting decision logic appears after the algorithm description below):
- The algorithm should work with minimal hardware additions.

- The algorithm should operate locally within the confines of each router, thereby avoiding complex global management.
- The algorithm should prevent the GPU from blocking CPU traffic within the router architecture.
- The algorithm should allow simultaneous transmission of CPU and GPU packets regardless of the packets' destinations.
We propose a dynamic variant of the SHARP architecture, which we have labeled SHARP-Dyn. Local management of resources was chosen for SHARP-Dyn to mitigate the overhead (e.g., write contention) often associated with global bandwidth management techniques [PKK + 09]. Figure 2.5 shows the minimal hardware additions needed to implement the R-SWMR link. When a packet is generated by the L2 caches or by one of the cores, it is placed in an input buffer. Credit counters track the number of packets occupying the input buffers at each router. The buffer occupancy calculator sums the values of the credit counters for the CPU and GPU packets at the router and sends the information to the bandwidth allocation controller (BAC). The BAC determines the amount of bandwidth to assign to each core type using the number of buffer slots occupied, and then generates a reservation packet as seen in Figure 2.6c [PKK + 09]. Figure 2.6c shows the reservation packet used to implement the dynamic bandwidth allocation mechanism in SHARP-Dyn. The number of bits in the reservation packet can be calculated as

ResPacket_size = log2(N^2 · S_CPU · S_GPU · D · N_L3)

where N is the number of non-L3 routers in the network, S_CPU is the number of different CPU packet types (i.e., request and response) that can be sent through the network, S_GPU is the number of different GPU packet types that can be sent through the network, D is the number of different dynamic allocation possibilities for the data being sent (D = 5 for the proposed algorithm), and N_L3 is the number of L3 routers in the network.
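A quick numeric check of this formula (a sketch only; the parameter values below are assumptions chosen to be plausible for SHARP's 16-router grid, not values stated in the thesis):

```python
import math

def res_packet_bits(n, s_cpu, s_gpu, d, n_l3):
    """Reservation packet size: the field widths of Figure 2.6b summed,
    i.e., log2(N^2 * S_CPU * S_GPU * D * N_L3) bits, rounded up."""
    return math.ceil(2 * math.log2(n)       # two destination addresses
                     + math.log2(s_cpu)     # CPU packet-size field
                     + math.log2(s_gpu)     # GPU packet-size field
                     + math.log2(d)         # dynamic-allocation field
                     + math.log2(n_l3))     # L3 destination field

# Assumed: 16 non-L3 routers, 2 packet types per core type, D = 5,
# and a 1-bit L3 destination field (n_l3 = 2 encodings).
print(res_packet_bits(16, 2, 2, 5, 2))  # -> 14 bits
```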

The number of wavelengths needed for the reservation waveguide can then be determined using ResPacket_size, the optical data rate, the network frequency, and the number of routers.

Table 2.1: Dynamic Bandwidth Allocation Algorithm
Step 0: For each individual router R_0 through R_N-1, complete Steps 1 through 7.
Step 1: Calculate the buffer occupancy β_Ocup for each input buffer β_Ocup-0 through β_Ocup-(j-1) in router R_ω.
Step 2: The β_Ocup values for buffers β_Ocup-0 through β_Ocup-(j-1) are sent to the Buffer Occupancy Calculator.
Step 3: Calculate β_CPU using β_Ocup-0 through β_Ocup-(k-1).
Step 4: Calculate β_GPU using β_Ocup-k through β_Ocup-(j-1).
Step 5: Determine the amount of bandwidth to allocate to the CPU and GPU core types:
  If β_GPU = 0 and β_CPU > 0: GPU bandwidth = 0%, CPU bandwidth = 100%
  Else if β_CPU = 0 and β_GPU > 0: GPU bandwidth = 100%, CPU bandwidth = 0%
  Else if β_GPU < β_GPU-UpperBound: GPU bandwidth = 25%, CPU bandwidth = 75%
  Else if β_CPU < β_CPU-UpperBound: GPU bandwidth = 75%, CPU bandwidth = 25%
  Else: GPU bandwidth = 50%, CPU bandwidth = 50%
Step 6: Send the reservation packet via the reservation-assisted SWMR link.
Step 7: Transmit data using the specified wavelengths on a first-come first-served basis.

The proposed algorithm in Table 2.1 is executed by every router R_N during each cycle. In the first step, the algorithm calculates the CPU and GPU buffer occupancies as follows:

β_Ocup(CPU) = (1/j) · Σ_{i=0}^{j-1} Buf_i · a_i    (2.1)

β_Ocup(GPU) = (1/k) · Σ_{i=0}^{k-1} Buf_i · a_i    (2.2)

where j is the total number of CPU buffers, k is the total number of GPU buffers, and a_i is 1 if the buffer slot is occupied and 0 otherwise. The proposed algorithm in Table 2.1 is executed by every router during every cycle in the network. In the first step of the algorithm, the router calculates the buffer occupancy. This occupancy calculation is done for each input buffer β_Ocup-i that is connected to a core or L2 cache (0 ≤ i ≤ input buffers per router - 1) using the following formula:

β_Ocup-i = β_current / β_total    (2.3)

To determine the amount of bandwidth to assign to each core type, β_CPU and β_GPU must be calculated by summing the buffer occupancy value β_Ocup-i for each core type at the router. In Table 2.1, buffer k represents the first GPU buffer and j represents the total number of input buffers within a router that are associated with the CPU and GPU cores; the last CPU buffer can thus be represented by (k-1). β_CPU-UpperBound and β_GPU-UpperBound drive the decision process within the algorithm. Using a brute-force method, the optimal β_CPU-UpperBound and β_GPU-UpperBound were determined experimentally on a separate set of benchmarks from the ones used in the results chapter of this thesis. The optimal β_CPU-UpperBound was determined to be 16% of the total CPU input buffer space, while the optimal β_GPU-UpperBound was determined to be 6% of the total GPU input buffer space. Using the calculated values of β_CPU and β_GPU in comparison with β_CPU-UpperBound and β_GPU-UpperBound, the bandwidth is assigned to each core type and the reservation packet is created. Due to the temporal sensitivity of the CPU, it is given precedence over the GPU by being considered first for the 75% bandwidth allocation within Step 5 of Table 2.1. After sending and receiving the reservation packet, the corresponding routers tune their MRRs to receive the data packets.
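A minimal sketch of the per-cycle decision in Steps 3-5 of Table 2.1. The buffer layout and threshold fractions follow the text above, but expressing the thresholds as summed occupancy fractions is an assumption of this sketch, not the thesis's hardware implementation:

```python
def allocate_bandwidth(buffers, k, cpu_upper, gpu_upper):
    """Steps 3-5 of Table 2.1: buffers[0:k] hold the CPU input-buffer
    occupancy fractions, buffers[k:] the GPU ones. Returns the
    (CPU %, GPU %) bandwidth split for the next reservation packet."""
    beta_cpu = sum(buffers[:k])   # Step 3
    beta_gpu = sum(buffers[k:])   # Step 4
    if beta_gpu == 0 and beta_cpu > 0:      # Step 5
        return 100, 0
    if beta_cpu == 0 and beta_gpu > 0:
        return 0, 100
    if beta_gpu < gpu_upper:                # light GPU load: favor the CPU
        return 75, 25
    if beta_cpu < cpu_upper:                # light CPU load: favor the GPU
        return 25, 75
    return 50, 50                           # both heavily loaded

# Thresholds from the text: 16% of CPU buffer space, 6% of GPU buffer space.
print(allocate_bandwidth([0.05, 0.02, 0.0, 0.0, 0.3, 0.4], k=4,
                         cpu_upper=0.16 * 4, gpu_upper=0.06 * 2))
# -> (25, 75): the CPU is lightly loaded, so the GPU receives 75%.
```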

2.3.1 Variations of the SHARP Architecture
In this subsection, we describe a few variations of the SHARP architecture that will be compared against the proposed SHARP-Dyn architecture. The first architecture, SHARP Core Segregation (SHARP-CoSeg), seen in Figure 2.7, uses a photonic crossbar and segregates the CPU and GPU cores into two halves of the chip. By segregating the CPU and GPU cores, we avoid all interaction between the two core types (L3 cache router and main memory interactions excluded). The second architecture, SHARP first-come-first-serve (SHARP-FCFS), has a similar layout to SHARP-Dyn (see Figure 2.3). The key difference between SHARP-Dyn and SHARP-FCFS is that SHARP-FCFS has no dynamic bandwidth allocation; packets are sent on a first-come first-served basis using 100% of the link bandwidth regardless of which core injects the packet. The third architecture, SHARP Bandwidth Split (SHARP-BanSp), also has a similar layout to SHARP-Dyn (see Figure 2.3); it divides the link bandwidth evenly between the CPU and GPU cores at each router regardless of traffic flow or buffer utilization. All four architectures utilize R-SWMR links [PKK + 09]. The reservation packets in SHARP-CoSeg and SHARP-FCFS are similar to Figure 2.6c but consist of a single destination address, a single packet size, and the L3 destination bit, which reduces the reservation packet by almost half. For the SHARP-FCFS architecture, the CPU and GPU data are never sent in the same cycle, eliminating the need for two destination addresses and packet sizes in the reservation packet; for the SHARP-CoSeg architecture, the two core types do not reside at the same router, likewise eliminating the need for two destination addresses and packet sizes in the reservation packet.

Figure 2.7: SHARP-CoSeg chip layout. The CPU and GPU cores and their resources are separated between the two halves of the chip (GPU cluster on one half, CPU cluster on the other).

2.4 Dynamic Power Scaling

The dynamic power scaling scheme described in this section is a reactive technique that utilizes the buffer occupancy within a reservation window to determine the ensuing reservation window's laser power. This concept is built on top of the SHARP architecture and works in tandem with the algorithm in Table 2.1. The additional steps added to the algorithm can be seen in Table 2.2. Steps 0 through 8 are executed at a fine granularity on a per-cycle basis. Step 8 checks to see if the network has reached the end of a reservation window. If Step 8 evaluates to true, the coarse-grain power scaling mechanism executes Steps 9 and 10. Step 9 sums up the running buffer occupancy total for all the buffers across the reservation window. The power scaling has four thresholds, which create five laser power states. The power values for each laser power state can be seen in Table 2.3.

Table 2.2: Dynamic Bandwidth Allocation Algorithm with Power Scaling

Step 0: For each individual router R_0 through R_(N-1), complete Steps 1 through 9.
Step 1: Calculate the buffer occupancy β_Ocup for each input buffer β_Ocup-0 through β_Ocup-(j-1) in router R_ω.
Step 2: The β_Ocup values β_Ocup-0 through β_Ocup-(j-1) are sent to the Buffer Occupancy Calculator.
Step 3: Calculate β_CPU using β_Ocup-0 through β_Ocup-(k-1).
Step 4: Calculate β_GPU using β_Ocup-k through β_Ocup-(j-1).
Step 5: Determine the amount of bandwidth to be allocated to the CPU and GPU core types:
    If β_GPU = 0 and β_CPU > 0: GPU Bandwidth = 0%, CPU Bandwidth = 100%
    Else if β_CPU = 0 and β_GPU > 0: GPU Bandwidth = 100%, CPU Bandwidth = 0%
    Else if β_GPU < β_GPU-UpperBound: GPU Bandwidth = 25%, CPU Bandwidth = 75%
    Else if β_CPU < β_CPU-UpperBound: GPU Bandwidth = 75%, CPU Bandwidth = 25%
    Else: GPU Bandwidth = 50%, CPU Bandwidth = 50%
Step 6: Send the reservation packet via the reservation-assisted SWMR link.
Step 7: Transmit data using the specified wavelengths on a first-come first-serve basis.
Step 8: If CurrentCycle mod RW = 0, proceed to Steps 9 and 10; else return to Step 0.
Step 9: For each reservation window RW, sum the total buffer occupancy β_Total for each cycle.
Step 10: At the end of RW, determine the number of wavelengths WL for the outgoing waveguide at router R_ω:
    If β_Total > Threshold_upper: WL = 64 wavelengths
    Else if β_Total > Threshold_mid-upper: WL = 48 wavelengths
    Else if β_Total > Threshold_mid-lower: WL = 32 wavelengths
    Else if β_Total > Threshold_lower: WL = 16 wavelengths
    Else: WL = 8 wavelengths
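As a companion to Steps 8 through 10, the following sketch shows the coarse-grain wavelength selection in Python. The window length and threshold names mirror Table 2.2, but the concrete values are placeholders, since the tuned thresholds are not reproduced here.

```python
# Coarse-grain laser power scaling from Steps 8-10 of Table 2.2.
# RW and the threshold values are illustrative placeholders.

RW = 500  # reservation window length in network cycles

def end_of_window(cycle):
    # Step 8: the coarse-grain path runs once per reservation window.
    return cycle % RW == 0

def wavelength_state(beta_total, upper, mid_upper, mid_lower, lower):
    # Step 10: map the windowed occupancy sum onto one of the five
    # laser power states.
    if beta_total > upper:
        return 64
    if beta_total > mid_upper:
        return 48
    if beta_total > mid_lower:
        return 32
    if beta_total > lower:
        return 16
    return 8
```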

Table 2.3: Laser Power for the Five Laser Power States (Wavelengths per Waveguide vs. Power in Watts)

These laser power states consist of 64, 48, 32, 16, and 8 wavelengths, as seen in Step 10 of Table 2.2. Minimal hardware is needed in order to implement the dynamic power scaling. An on-chip laser was utilized for this thesis. We assume a 2 ns turn-on delay [DH16] for all laser scaling applications. The thresholds used for this section were chosen to balance throughput loss and power savings and can be changed to favor throughput or power. The main contribution to the laser power is the laser power loss calculated using the values from Table 2.3. The laser power used increases almost linearly with the number of wavelengths.

2.5 Machine Learning for Bandwidth Scaling

In this section, we will discuss the machine learning techniques utilized along with the proposed SHARP architecture. Additionally, we will discuss how the machine learning algorithm works to improve the design described within the previous sections. Machine learning can be employed in a variety of different ways. The base machine learning algorithm utilized throughout this thesis is ridge regression. The reason we chose regression over other methods is the continuous nature of the data collected from the network topology. From an implementation standpoint, we felt the additions and multiplications associated with the predictions for regression would be less complicated to implement in hardware. The error function for the ridge regression method is as follows

[Bis06]:

$$\tilde{E}(w) = \frac{1}{2}\sum_{n=1}^{N} \{w^T \phi(x_n) - t_n\}^2 + \frac{\lambda}{2}\|w\|^2 \tag{2.4}$$

where $\|w\|^2 = w^T w = w_0^2 + w_1^2 + \cdots + w_N^2$ and λ is the regularization coefficient. This error function and the derivations needed for this thesis will be discussed in the later sections.

The architectures that exploit this machine learning model will be compared to the SHARP-Dyn architecture running at 64 wavelengths per waveguide. Both machine learning architectures are built on top of the SHARP-Dyn architecture. The dynamic bandwidth allocation mechanism discussed in the previous sections is present and further expanded upon within the machine learning architectures. The machine learning model described in this section was designed to further improve the performance of the algorithm in Table 2.1. The model is used to predict the optimal β_CPU-UpperBound and β_GPU-UpperBound values that will maximize the link utilization at each router. The model will be trained and validated to predict link utilization. During testing, equations derived from the trained regression model will be used to predict the β_UpperBound values.

Regression Model Derivation

The training process used to predict the β_UpperBound values is a polynomial ridge regression algorithm. In the following formulas, β_CPU-UpperBound is represented by up_1 and β_GPU-UpperBound is represented by up_2. The purpose of this algorithm is to generate the up_1 and up_2 values such that the link utilization $\hat{lu}$ is maximized. The initial formula used to evaluate the link utilization for the next reservation window is as follows¹:

$$\hat{lu}(x, up_1, up_2) = w^T\phi(x) + up_1\, u^T\phi(x) + \frac{\alpha_1}{2}up_1^2 + up_2\, v^T\phi(x) + \frac{\alpha_2}{2}up_2^2 \tag{2.5}$$

¹ The formula derivations within this subsection were contributed by Dr. Razvan Bunescu, Associate Professor of Electrical Engineering and Computer Science at Ohio University.

where α_1 and α_2 are hyperparameters used to aid in the prediction of the up_1 and up_2 values, and w, u, and v are weight vectors generated by the regression model. $\hat{lu}(x, up_1, up_2)$ can be rewritten as follows:

$$\hat{lu}(x, up_1, up_2) = \omega^T \psi(x) \tag{2.6}$$

$$\omega = [\alpha_1,\ \alpha_2,\ w,\ u,\ v] \tag{2.7}$$

$$\psi(x) = \left[\frac{1}{2}up_1^2,\ \frac{1}{2}up_2^2,\ \phi(x),\ up_1\phi(x),\ up_2\phi(x)\right] \tag{2.8}$$

The next step is to solve for ω such that the error between $\hat{lu}(x, up_1, up_2)$ and the label $lu_n$ is minimized. This can be represented by the following equation:

$$\omega = \underset{\omega}{\operatorname{argmin}}\ \frac{1}{2}\sum_{n=1}^{N}\left(\hat{lu}(x_n, up_1, up_2) - lu_n\right)^2 + \frac{\lambda}{2}\|\omega\|^2 \tag{2.9}$$

In order to maximize the link utilization $\hat{lu}$, the following formula needs to be evaluated:

$$\hat{up}_1, \hat{up}_2 = \underset{up_1,\, up_2}{\operatorname{argmax}}\ \hat{lu}(x, up_1, up_2) \tag{2.10}$$

To maximize $\hat{lu}$, the gradient can be taken with respect to up_1 and up_2 and set to zero. The resulting equations can be solved for up_1 and up_2 to produce the following:

$$\frac{\partial \hat{lu}}{\partial up_1} = 0 \implies \alpha_1 up_1 + u^T\phi(x) = 0 \implies \hat{up}_1 = -\left(\frac{u}{\alpha_1}\right)^T \phi(x) \tag{2.11}$$

$$\frac{\partial \hat{lu}}{\partial up_2} = 0 \implies \alpha_2 up_2 + v^T\phi(x) = 0 \implies \hat{up}_2 = -\left(\frac{v}{\alpha_2}\right)^T \phi(x) \tag{2.12}$$

The solutions for up_1 and up_2 above can be used to predict the β_UpperBound values that will produce the maximum link utilization.
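At run time, Equations 2.11 and 2.12 reduce to two inner products per router. A minimal sketch follows; the trained quantities α_1, α_2, u, and v are assumed to be available from the fitted ω vector, and for these stationary points to be maxima the learned α values must be negative.

```python
import numpy as np

def predict_upper_bounds(alpha1, alpha2, u, v, phi_x):
    """Equations 2.11-2.12: upper-bound predictions from the trained
    weights. phi_x is the feature vector for the current reservation
    window; alpha1 and alpha2 must be negative for these stationary
    points to maximize the predicted link utilization."""
    up1 = -(u / alpha1) @ phi_x   # beta_CPU-UpperBound prediction
    up2 = -(v / alpha2) @ phi_x   # beta_GPU-UpperBound prediction
    return up1, up2
```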

2.6 Machine Learning for Power Scaling

The purpose of machine learning within the context of this section is to provide a proactive method for determining the amount of laser power needed at each router within a specified reservation window. This model will be used to predict the number of packets that will be injected into each router. This process will take the place of the dynamic power scaling method discussed in Steps 8 through 10 of Table 2.2. Once the number of packets is determined, one of five wavelength states will be selected for each router. This machine learning technique will be used in place of the dynamic power scaling model discussed in the previous section.

Regression Model Derivation

A linear regression model was utilized to predict the number of packets injected into the network for a given reservation window. The following formula can be used to calculate the error for a set of predicted values [Bis06]:

$$\tilde{E}(w) = \frac{1}{2}\sum_{n=1}^{N} \{w^T \phi(x_n) - t_n\}^2 + \frac{\lambda}{2}\|w\|^2 \tag{2.13}$$

where $\|w\|^2 = w^T w = w_0^2 + w_1^2 + \cdots + w_N^2$, with w the weight vector calculated for the regression model, and λ is the regularization coefficient. The $t_n$ represents the label for the given set of features, and $w^T\phi(x_n)$ represents the prediction of the regression model. The weight vector w can be represented by the following formula [Bis06]:

$$w = \underset{w}{\operatorname{argmin}}\ \tilde{E}(w) \tag{2.14}$$

In order to solve the above equation, the gradient of $\tilde{E}(w)$ must be taken and set to zero. The solution for w is as follows [Bis06]:

$$w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t \tag{2.15}$$

This function is used to calculate the weights for each feature in the design matrix Φ. Furthermore, the weight vector is a representation of the weight combination that minimizes the regularized error function in Equation 2.13.
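A minimal sketch of Equation 2.15 follows, where Phi is the N×M design matrix of feature vectors, t is the vector of labels, and lam is the regularization coefficient tuned on the validation data. Solving the linear system directly is used here in place of forming the matrix inverse explicitly, which is numerically preferable but otherwise equivalent.

```python
import numpy as np

def ridge_weights(Phi, t, lam):
    """Equation 2.15: w = (lam*I + Phi^T Phi)^{-1} Phi^T t, computed by
    solving the linear system rather than inverting the matrix."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

def predict(w, phi_x):
    # The model's prediction w^T phi(x) for one feature vector.
    return w @ phi_x
```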

3 PERFORMANCE EVALUATION

3.1 SHARP Simulation Methodology

In this section, we evaluate the performance of four variations (SHARP-Dyn, SHARP-FCFS, SHARP-CoSeg, SHARP-BanSp) of our proposed SHARP architecture. We also compare SHARP and its variations against a CMESH architecture. With each architecture, we run a multitude of CPU and GPU simulations. Each simulation combines one CPU and one GPU benchmark to make a benchmark pair. Each benchmark pair will be evaluated by varying both the number of wavelengths per waveguide and the injection rate to gauge the performance of the networks. Lastly, we evaluate and compare the area and energy consumption and compute the Energy-Delay Product (EDP) per packet of each network. As the variation in area requirement for SHARP-Dyn and the other architectures only changes within the router and optical interface layer, we have omitted this plot, but present the data in Table 3.4.

Table 3.1: Benchmark Information

Core Type   Abbreviation   Benchmark Name
CPU         FA             Fluid Animate
            fmm            Fast Multipole Method
            Rad            Radiosity
            x264           x264
GPU         DCT            Discrete Cosine Transforms
            Dwrt           One-dimensional Haar Wavelet Transform
            QRS            Quasi Random Sequence
            Reduc          Reduction

All real network simulation traces were collected using Multi2Sim [UJM+12]. Multi2Sim is a cycle-accurate simulation framework that captures every network transaction and its effect on the CPU and GPU pipelines, cache interactions, and router pipelines. Each network trace was collected by running a single benchmark on Multi2Sim, for a total of 4 CPU benchmarks and 4 GPU benchmarks.

Table 3.2: Architecture Specifications

CPU: Cores 32; Threads per Core 4; Frequency 4 GHz; L1 Instruction Cache 32 kB; L1 Data Cache 64 kB; L2 Cache 256 kB
GPU: Computation Units 64; Frequency 2 GHz; L1 Cache 64 kB; L2 Cache 512 kB
Shared Components: Network Frequency 2 GHz; L3 Cache 8 MB; Main Memory 16 GB

The CPU benchmarks were chosen from the PARSEC 2.1 benchmark suite [BL09] and the SPLASH2 benchmark suite [WOT+95]. The 4 CPU benchmarks were chosen for their variation in parallelization, size of problem set, and data usage [BL09]. The GPU benchmarks were chosen from the OpenCL SDK benchmarks provided by AMD through [UJM+12]. The goal of the simulations is to see the impact on network throughput and latency for various combinations of CPU and GPU workloads. Each line in a trace is made up of a source, a destination, a packet type (request or response), and a network cycle number corresponding to when the packet was sent into the network. One CPU and one GPU trace are combined using the network cycle number given to each packet by Multi2Sim. The simulation specifications can be seen in Table 3.2. The traces were combined with the intent of mimicking high workloads for server applications and demonstrating the outcome when there is heavy interaction between the two core types. Abbreviations for each benchmark can be seen in Table 3.1. Figure 3.1 shows the packet percentage for each core type in the real traffic benchmark pairs. These benchmark pairs were chosen for their wide variation of packet densities with respect to the two core types.

Table 3.3: Loss and Power Values for Optical Energy Calculations

Component              Value     Unit
Modulator Insertion    1         dB
Waveguide              1.0       dB/cm
Coupler                1         dB
Splitter               0.2       dB
Filter Through         1.00e-3   dB
Filter Drop            1.5       dB
Photodetector          0.1       dB
Receiver Sensitivity   -15       dBm
Ring Heating           26        µW/ring
Ring Modulating        500       µW/ring

Nine synthetic traffic benchmark pairs were used in this thesis to demonstrate how each architecture performs under controlled traffic patterns. The CPU's synthetic traffic timing was constant throughout each benchmark, while the GPU's synthetic traffic timing was varied to mimic the bursty nature of the GPU core type. Each benchmark pair was run through a cycle-accurate network simulator to compare the throughput and latency of all architectures. Using traffic traces with a cycle-accurate network simulator allowed us to isolate the network interaction and gauge the impact of the two traffic types on the network. Each simulation was run with 64, 32, and 16 wavelengths, with each wavelength operating at a 16 Gbps data rate. The aggressive 16 Gbps data rate was chosen to enable the network to send one packet per cycle at the 2 GHz network clock frequency. To further gauge the interaction between the two core types, the simulations were run with increased injection rates to determine how each architecture responds under higher load. The injection rate variation is based on the trace packet's injection time; we inject packets at a higher rate, and this translates into 2, 4, 8, 16, and 32 times the rate of the trace collection. We refer to this as the Workload Injection Rate. The router energy calculations were acquired through DSENT 0.91 [SCK+12]. The optical power calculations were attained using the values from Table 3.7 [MKL12b, ZAU+15].

Figure 3.1: CPU-GPU packet breakdown (percentage of CPU packets vs. GPU packets) for each traffic trace.

The laser power results, along with the ring heating power [MKL12b], were used to calculate the energy per bit for the photonic interconnects. The energy calculation for SHARP-Dyn's dynamic allocation mechanism was done using 8-bit adds at 0.03 pJ per add [Hor14b]. The energy per bit results for SHARP-Dyn's dynamic allocation mechanism are not displayed because the mean energy consumed by this process accounts for 0.824% of the total energy per bit for the 64-wavelength simulations. The values in Figure 3.10 show the average energy per bit to operate the network connections and the routers. The EDP per packet represents the product of the average energy consumed by each packet with the time each packet spends in the network.

3.2 Machine Learning Simulation Methodology

Machine learning algorithms can improve the performance and power consumption of the proposed SHARP architecture. In order to fully utilize machine learning, there must be a favorable tradeoff between delay and energy consumption. In this section, we discuss the delay and energy needed to achieve the dynamic laser scaling in addition to the time

and energy needed to fully utilize the machine learning within this architecture. The results in this section compare both dynamic laser scaling methods to the baseline architecture using a constant 64 wavelengths throughout the simulation. All three architectures utilize the dynamic bandwidth allocation mechanism described in Table 2.1. In this section, we evaluate the performance of the prediction models using a Normalized Root-Mean-Square Error (NRMSE) in which 1 is a perfect fit and values at or below zero indicate a bad fit. We further demonstrate the difference between the prediction methods by displaying the amount of simulation time each method spends in the wavelength states. Lastly, we compare the throughput and power performance with the same architecture using a constant 64 wavelengths. The optical component specifications can be seen in Table 3.7 [ACP11, AFB+09].

The number of features dictates the number of adds and multiplies needed to implement the machine learning aspect of this architecture. In order to compute the number of packets injected into a router, 28 multiplies and 27 adds are needed. Assuming 16-bit numbers, the total energy needed is 43 pJ [Hor14a] at a computation time of 5 nanoseconds. The computation time was estimated using Synopsys Design Compiler. The router area needed for the machine learning computation can be seen in Table 3.4.

The off-chip comb laser turn-on delay can reach 1 µs [HB14]. The wavelength of a comb laser is temperature dependent; to obtain the desired wavelength, the comb laser must have time to reach the correct temperature. The on-chip Fabry-Perot laser turn-on delay is 2 ns [EKK12]. The Fabry-Perot laser turn-on time depends on the silicon doping levels, which is what decreases the on-chip laser's turn-on time. The 2 ns turn-on time is the reason why the Fabry-Perot laser is utilized in this thesis.
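The exact normalization behind the NRMSE numbers is not spelled out above; the sketch below assumes the common fit-style normalization, which matches the reporting convention used here (1 is a perfect fit, and values can fall below zero for predictions worse than the label mean).

```python
import numpy as np

def nrmse_fit(t, t_hat):
    """Assumed fit-style NRMSE: 1 - ||t - t_hat|| / ||t - mean(t)||.
    A perfect prediction returns 1; a prediction worse than guessing
    the mean of the labels returns a value below 0."""
    t = np.asarray(t, dtype=float)
    t_hat = np.asarray(t_hat, dtype=float)
    return 1.0 - np.linalg.norm(t - t_hat) / np.linalg.norm(t - t.mean())
```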

Table 3.4: Area for SHARP-Dyn [LAS+09, LHE+13, Hor14a]

Photonic and Electronic Component            Area
Cluster                                      25 mm²
L2 Cache per Cluster                         2.1 mm²
Optical Components (MRRs and Waveguides)     24.4 mm²
Waveguide Width [SCK+12]                     5.28 µm
MRR Diameter [VSM+08]                        3.3 µm
L3 Cache                                     8.5 mm²
Router                                       mm²
Dynamic Allocation                           mm²
Machine Learning [Hor14a]                    mm²

Feature Engineering for Dynamic Bandwidth Scaling

The main constraint on feature selection revolved around the hardware limitations. The β_UpperBound values generated by the regression model are produced at every router in the network. The features were selected by using the information already present at each router; any additional information from outside an individual router would require additional communication hardware and latency to access that feature. Table 3.5 lists the features used in the dynamic bandwidth scaling model. The number of features was kept to a minimum in order to decrease the amount of time needed to calculate the β_UpperBound values at run time. The L3 router feature (feature 1) is a binary feature used to distinguish between the 16 cluster routers and the L3 cache router. The core input buffer utilization features (features 2 and 4) represent the buffers connected to the cores at any given router, while the other-router input buffer utilization features (features 3 and 5) represent the input buffers connected to links from other routers. Features 6 through 13 could have been made core-type specific, but this would have increased the number of features by eight and further increased the delay and energy needed to predict the β_UpperBound values. There should be enough information in the buffer utilization features to negate splitting features 6 through 13 into core-type-specific features. Features 6 and 7 keep track of the packets

Table 3.5: Dynamic Bandwidth Feature List

1. L3 router
2. CPU Core Input Buffer Utilization
3. Other Router CPU Input Buffer Utilization
4. GPU Core Input Buffer Utilization
5. Other Router GPU Input Buffer Utilization
6. Outgoing Link Utilization
7. Number of Packets Sent to a Core
8. Incoming Packets from Other Routers
9. Incoming Packets from the Cores
10. Requests Sent
11. Requests Received
12. Responses Sent
13. Responses Received
14. Algorithm State 1
15. Algorithm State 2
16. Algorithm State 3
17. Algorithm State 4
18. Algorithm State 5
19. β_CPU-UpperBound
20. β_GPU-UpperBound

that stay within the cluster versus the packets that get sent out into the network. Features 8 and 9 sum up the number of packets received from other routers within the network and the number of packets generated at the given router. Features 10 through 13 sum up the total number of request and response packets that move through the router. Features 14 through 18 represent the bandwidth states from Step 5 in Table 2.1; these features sum up the number of cycles the router is in each state within the reservation window. Features 19 and 20 represent the β_UpperBound values for the given reservation window.

Training and Testing for Dynamic Bandwidth Scaling

Two simulators were needed to implement the data collection for this process. The feature data was collected from a modified network simulator running real network traffic. The network traffic was acquired from the Multi2Sim [UJM+12] full-system simulator.

Each traffic file consisted of one CPU benchmark run simultaneously with one GPU benchmark. Features from 8 CPU and 8 GPU benchmarks were collected for the training and validation process, while 4 CPU and 4 GPU benchmarks were used to test the machine learning model. The CPU benchmarks used for training, validation, and testing were selected from the PARSEC benchmark suite [BL09] and the SPLASH-2 benchmark suite. The GPU benchmarks were selected from the OpenCL SDK provided by AMD. The traffic files were imported into the network simulator to gauge the interaction between the CPU and GPU traffic. This was done for the purpose of emulating an on-chip network without having to simulate the entire processor. The network simulator was used to collect feature data at each router. Each feature set is then given a label, and the feature counters are reset to zero. The reservation window for the data collection was fixed at 500 network cycles. The feature collection for each router is offset by 10 network cycles to prevent all the routers from changing their β_UpperBound values at the same time. The initial feature data was collected using randomly generated β_UpperBound values. This was done to mimic the changing β_UpperBound values of the regression model and to avoid influencing the machine learning process with a predefined pattern. Once an initial regression model was generated, a second feature collection process was performed, this time with the β_UpperBound values being generated by the first regression model. The data collection processes were designed to best mimic the testing environment for the final results.
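A sketch of this collection loop is shown below. The simulator hooks (`features()`, `link_utilization()`, and the counter reset) are hypothetical stand-ins for the modified network simulator's counters; only the 500-cycle window, the 10-cycle per-router offset, and the stage-one random β_UpperBound values come from the description above.

```python
import random

RW, OFFSET = 500, 10   # window length and per-router offset in cycles
dataset = []           # (feature vector, link utilization label) pairs

def random_upper_bounds():
    # Stage one: randomly generated beta_UpperBound values (the range
    # used here is an illustrative assumption).
    return random.random(), random.random()

def on_cycle(cycle, routers):
    for i, router in enumerate(routers):
        # Each router's window boundary is offset by 10 cycles so the
        # routers do not all change their upper bounds at once.
        if (cycle - i * OFFSET) % RW == 0:
            dataset.append((router.features(), router.link_utilization()))
            router.reset_feature_counters()
            router.set_upper_bounds(*random_upper_bounds())
```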

Feature Engineering for Dynamic Power Scaling

Feature selection for a regression model can be equally or even more important than the algorithm used to predict the outcome of an event. If the features do not correlate with the label, then the algorithm chosen to create the model will not be able to predict the outcome. Therefore, a great deal of thought needs to be placed in the features selected for a machine learning model.

Table 3.6: Dynamic Laser Scaling Feature List

1. L3 router
2. CPU Core Input Buffer Utilization
3. Other Router CPU Input Buffer Utilization
4. GPU Core Input Buffer Utilization
5. Other Router GPU Input Buffer Utilization
6. Outgoing Link Utilization
7. Number of Packets Sent to a Core
8. Incoming Packets from Other Routers
9. Incoming Packets from the Cores
10. Requests Sent
11. Requests Received
12. Responses Sent
13. Responses Received
14. Request CPU L1 Instruction
15. Request CPU L1 Data
16. Request CPU L2 Up
17. Request CPU L2 Down
18. Request GPU L1
19. Request GPU L2 Up
20. Request GPU L2 Down
21. Response CPU L1 Instruction
22. Response CPU L1 Data
23. Response CPU L2 Up
24. Response CPU L2 Down
25. Response GPU L1
26. Response GPU L2 Up
27. Response GPU L2 Down

The features selected for the dynamic laser scaling represent the packets that have been injected into the network. Features 1 through 13 are reused from the dynamic bandwidth allocation section. Features 14 through 27 are used to track packet movement throughout the network. Each feature is labeled as request or response: a request packet is requesting data, while a response packet carries data. Additionally, each feature is labeled with the core type and the cache with which it is associated. The L2 cache features are labeled with an up or down, corresponding to the packet either going up to an L1 cache or down to the L3 cache.

Training and Testing for Dynamic Power Scaling

Each set of features must have a label in order to train the regression model. The number of packets injected into the router was used as the label in the machine learning power scaling scheme. This label was chosen over other metrics (i.e., buffer utilization, router utilization, or link utilization) to minimize the effect the wavelength state has on the outcome of the prediction. The buffer, link, and router utilizations vary significantly based on the number of wavelengths assigned to each waveguide. Packets received from other routers are going to dictate the injection of packets at the local router, thus minimizing the effect the number of wavelengths has on the local router's injection of packets. If the model were trying to predict the buffer utilization, setting the laser power state to 8 wavelengths would cause the buffers to fill at an increased rate due to the small number of packets leaving the router. In contrast, if the laser power state were set to 64 wavelengths, the buffers would have a greater chance of being empty. When predicting the number of packets that will be injected, the core will try to inject a packet regardless of the laser power state.

In order to create an accurate machine learning model, three data sets must be created: training, validation, and testing. This process requires data to be gathered from a variety of different benchmarks. A total of 12 CPU benchmarks were selected for this process; the CPU benchmarks were acquired from the PARSEC 2.1 [BL09] and SPLASH2 [WOT+95] benchmark suites. There were 12 GPU benchmarks selected from the OpenCL SDK benchmark suite. Each data set needs to have its own benchmarks. The training data was created with 6 CPU benchmarks and 6 GPU benchmarks, which were combined to create 36 benchmark pairs. The validation data was created using 2 CPU and 2 GPU benchmarks; it is used to tune the λ regularization coefficient from Equation 2.13. The CPU and GPU benchmarks used for the validation data were combined to make 4 benchmark pairs. Lastly, the testing data was created using 4 CPU and 4 GPU benchmarks.

These benchmarks were combined to make 16 benchmark pairs. The CPU and GPU packet breakdown can be seen in Figure 3.1. The training and validation benchmark pairs are used to gather the feature data described in Table 3.6.

Two dynamic laser scaling schemes are compared to a baseline heterogenous architecture set to a static 64 wavelengths throughout the simulation. The first dynamic laser scaling scheme is a reactive mechanism based on buffer utilization: it takes the buffer utilization of the previous reservation window and uses it to determine how many wavelengths to power on for the next reservation window. The second dynamic laser scaling mechanism utilizes machine learning to predict how many packets are going to be injected into the network at each router. There are five different wavelength states considered for the dynamic laser scaling: 64, 48, 32, 16, and 8. An aggressive 16 Gbps data rate per wavelength [Mil09] was chosen to achieve the network frequency from Table 3.2. The reactive buffer-based prediction method utilizes all five wavelength states. The machine learning based prediction method utilizes 4 of the 5 wavelength states: the 8-wavelength state was initially omitted during the training and validation of the machine learning model due to the significant decrease in prediction accuracy it caused, and it was reintroduced after the model was computed to help save power. The thresholds that change the wavelength states can change the performance of the network. The buffer-based prediction method's thresholds were chosen to balance the throughput and power of the simulation; these values can be varied to favor power savings or performance. The thresholds for the machine learning prediction method were chosen based on the number of packets that can exit the router during any given reservation window. If the predicted number of packets injected into the router exceeds the amount of data a state can send out of the data waveguide, then the router increases the number of wavelengths used for that reservation window. This can be described by the following formula:

$$PredictPkt \times PktSz \leq WL_{state} \times DataRate_{WL} \tag{3.1}$$

where PredictPkt is the number of packets the machine learning model predicts, PktSz is the size of the packets sent, WL_state is the number of wavelengths in the comparison state (in our case 16, 32, 48, and 64), and DataRate_WL is the data rate chosen for a single wavelength. An estimate of the power needed for the machine learning calculation is 95.2 microwatts. This calculation is based on a 500-cycle reservation window and the energy estimates from [Hor14a].
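A sketch of this state selection follows: it walks the trained states from smallest to largest and picks the first whose outgoing capacity covers the predicted traffic, falling back to the 8-wavelength near-power-gating state for near-zero predictions. The capacity expression (scaling the per-wavelength data rate by the window length) and the near-zero cutoff are illustrative assumptions.

```python
# Wavelength-state selection built around Equation 3.1. The states used
# during training are 16/32/48/64; 8 acts as a near-power-gating state.

TRAINED_STATES = (16, 32, 48, 64)

def pick_state(predict_pkt, pkt_sz_bits, rate_bits_per_cycle_per_wl,
               rw_cycles, gating_cutoff=1):
    if predict_pkt < gating_cutoff:      # near-zero prediction: gate down
        return 8
    demand_bits = predict_pkt * pkt_sz_bits
    for wl in TRAINED_STATES:
        # Illustrative capacity of one state over a reservation window.
        capacity_bits = wl * rate_bits_per_cycle_per_wl * rw_cycles
        if demand_bits <= capacity_bits:
            return wl
    return 64                            # demand exceeds every state
```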

3.3 Results for Dynamic Bandwidth Scaling

The simulation results are broken up into four categories: throughput, latency, energy per bit, and Energy-Delay Product (EDP) per packet. The throughput is shown in bytes per cycle in order to have a fair comparison of the network throughput without the data being skewed by the different packet sizes. Latency represents the amount of time a packet spends within the network. The energy per bit demonstrates the amount of energy used by the network with reference to the throughput of the network. The EDP per packet is used to compare the average energy spent per packet with the average time each packet spends within the network. The four optical architectures are compared to a baseline concentrated MESH (CMESH) design, in which the CPU and GPU core types are segregated to separate halves of the chip, with the interaction between the two core types residing at the L3 cache. The channel width for the CMESH was set at 256 bits to balance the throughput and energy consumption when compared to the photonic architectures. Due to the multi-hop nature and the increased latency per packet of the CMESH architecture, the latency numbers for the CMESH were not included in the latency plots; this was done to make the plots easier to read.

Figure 3.2: Synthetic traffic latency. (a) Average time a CPU packet spends in the network for 64 wavelengths. (b) Average time a GPU packet spends in the network for 64 wavelengths. Shuff = Perfect Shuffle, HTSP = Random Traffic with a Hot Spot, and UniRnd = Uniform Random.

Figure 3.3: Synthetic traffic throughput. Shuff = Perfect Shuffle, HTSP = Random Traffic with a Hot Spot, and UniRnd = Uniform Random.

Synthetic Traffic Throughput and Latency Results

Figures 3.3 and 3.2 present the throughput and latency for synthetically generated network traffic. The three traffic patterns chosen are Perfect Shuffle (Shuff), Uniform Random (UniRnd), and Random Traffic with a Hot Spot (HTSP). Synthetic traffic demonstrates the network's and the dynamic allocation's performance under controlled traffic patterns. The CPU injection rate is set to force the network into saturation, while the GPU injection time varies randomly within the simulation. The GPU synthetic traffic is designed to have periods of time with a high injection rate; in contrast, there are periods of time when the GPU cores do not inject any packets. This demonstrates how SHARP-Dyn's dynamic allocation mechanism handles the bursty nature of the GPU traffic while the network is at saturation from the CPU traffic. SHARP-Dyn demonstrates a 22.3% throughput increase over CMESH on average for all synthetic benchmark pairs, with a maximum increase of 104%. The mean throughput improvement when comparing SHARP-Dyn to SHARP-CoSeg is 80.9%, while SHARP-Dyn demonstrates a 14.9% improvement over SHARP-FCFS.

Figure 3.4: Real traffic latency vs. Workload Injection Rate. (a) Average time a CPU packet spends in the network for 64 wavelengths. (b) Average time a GPU packet spends in the network for 64 wavelengths.

Additionally, SHARP-FCFS exhibits a 57.4% throughput increase over SHARP-CoSeg. This demonstrates the potential benefit of clustering the two core types over alienating them to separate halves of the chip. SHARP-Dyn's mean throughput improvement over SHARP-BanSp is 0.19%. The minimal throughput improvement is due to the network running at saturation, which in turn creates high network contention between the two core types.

Figure 3.5: Real traffic latency. (a) Average time a CPU packet spends in the network for 64 wavelengths. (b) Average time a GPU packet spends in the network for 64 wavelengths.

When there is high network contention and increased buffer occupancy between the CPU and GPU, SHARP-Dyn more often resides within the fifth bandwidth allocation state from Step 5 in Table 2.1.

Figure 3.6: Real traffic throughput results for 64 wavelengths.

When comparing the latencies in Figure 3.2, SHARP-Dyn has a 34.2% CPU latency decrease over SHARP-CoSeg. SHARP-FCFS's CPU latency improvement over SHARP-CoSeg demonstrates that the checkerboard pattern spreads the network load across all the routers and decreases the amount of time a packet spends within the network. The high injection rate of the synthetic traffic causes the SHARP-CoSeg architecture to increase the contention at the CPU routers, which increases the delay of the CPU packets. Coupling the checkerboard layout with SHARP-Dyn's dynamic allocation mechanism, SHARP-Dyn demonstrates an 11.0% CPU latency improvement over SHARP-FCFS.

Real Traffic Throughput and Latency Results

Figure 3.6 shows the throughput for the various architectures under different workloads using 64 wavelengths per waveguide. The workload distribution per core type can be seen in Figure 3.1. The means in Figures 3.6 and 3.5 are averages over all 16 real traffic simulations. It can be seen in Figure 3.6 that SHARP-Dyn outperforms the other three architectures on average.

SHARP-Dyn outperforms CMESH by 34% on average. SHARP-Dyn's mean improvements in throughput over the SHARP-BanSp, SHARP-CoSeg, and SHARP-FCFS architectures are 6.9%, 14.9%, and 11.2%, respectively. Figure 3.5 (a) presents the CPU network latency for all four photonic architectures. SHARP-Dyn's average CPU latency improvement is 76.6% and its GPU latency improvement is 70.4% over CMESH. When comparing SHARP-Dyn to the three other photonic architectures, SHARP-Dyn has a maximum CPU latency improvement of 58.4% for the CPU FA GPU DCT real traffic benchmark pair. The GPU latency for SHARP-CoSeg demonstrates an improvement over SHARP-Dyn, but this comes to the detriment of the latency-sensitive CPU cores, as seen in Figure 3.5. Figure 3.4 presents the CPU and GPU average packet latency as the injection rate increases. As the contention increases in the network, the SHARP-Dyn architecture demonstrates the precedence given to the CPU core type. SHARP-BanSp performs similarly to SHARP-Dyn due to the high contention in the network, but when the network is at lower contention, the performance gap between SHARP-Dyn and SHARP-BanSp increases.

Figure 3.7 shows the mean throughput for each architecture at 64 wavelengths as the injection rate is increased. SHARP-Dyn demonstrates a 7.6% throughput increase over SHARP-BanSp when the Workload Injection Rate is at 32 times the initial injection rate. The minimum average throughput gain that SHARP-Dyn demonstrates over SHARP-BanSp, at 5.9%, occurs when the Workload Injection Rate is increased by 4 times. The increased contention has an adverse effect on SHARP-CoSeg and SHARP-FCFS. With a Workload Injection Rate of 16, SHARP-Dyn has a throughput increase over SHARP-CoSeg and SHARP-FCFS of 16.9% and 19.5%, respectively. The increase in injection rate has a minimal effect on the CMESH architecture, with an increase of less than 1% from a Workload Injection Rate of 1 to 32. When the network is at high contention with the Workload Injection Rate at 32, SHARP-Dyn demonstrates a 45.6% throughput increase over CMESH.

Figure 3.7: The mean throughput results with an increase in injection rate for 64 wavelengths.

Figure 3.8: Real traffic percent throughput increase for 64 wavelengths when comparing SHARP-Dyn to SHARP-BanSp, SHARP-CoSeg, and SHARP-FCFS.

Figure 3.8 presents the percent increase in throughput for all 16 real traffic benchmark pairs when the bandwidth is set to 64, 32, and 16 wavelengths, compared against an increase in the Workload Injection Rate.

Figure 3.9: Real traffic percent throughput increase for 64, 32, and 16 wavelengths when comparing SHARP-Dyn to SHARP-BanSp, SHARP-CoSeg, and SHARP-FCFS.

As the contention increases, the SHARP-Dyn architecture spends more time in the 50/50 bandwidth state from Table 2.1. As demonstrated in Figure 3.8, when the network contention increases due to the bandwidth decrease, SHARP-Dyn's throughput improvement over SHARP-BanSp shrinks. This demonstrates the benefit of the dynamic bandwidth allocation at lower contention and SHARP-Dyn's behavior when the network is at higher contention. Figure 3.9 presents the percent throughput increase for SHARP-Dyn over the other four architectures. As expected, the gap between SHARP-Dyn and SHARP-BanSp decreases as the contention in the network increases. The percent throughput increase from 64 to 32 wavelengths demonstrates SHARP-Dyn's ability to handle the network load.

Area and Power Results for Real Traffic

Table 3.4 shows the area overhead for the proposed SHARP-Dyn architecture, which was calculated using McPAT and GPUWattch [LAS+09, LHE+13]. The router and link energy per bit calculations were determined using DSENT 0.91 [SCK+12] for all five architectures.

Table 3.7: Loss and Power Values for Optical Energy Calculations

Component              Value     Unit
Modulator Insertion    1         dB
Waveguide              1.0       dB/cm
Coupler                1         dB
Splitter               0.2       dB
Filter Through         1.00e-3   dB
Filter Drop            1.5       dB
Photodetector          0.1       dB
Receiver Sensitivity   -15       dBm
Ring Heating           26        µW/ring
Ring Modulating        500       µW/ring

The optical power calculations were attained using 1 dB for modulator insertion, 1 dB/cm waveguide loss, 1 dB for coupling loss, 0.1 dB for photodetection, 500 µW/ring for ring modulation, and 26 µW/ring for ring heating, as seen in Table 3.7. The laser power results, along with the ring heating power [MKL12b, NFA11], were used to calculate the energy per bit for the photonic interconnect. The values in Figure 3.10 show the average energy per bit expended to operate the network connections and the routers. Each photonic architecture's energy per bit cost is shown with bandwidth constraints of 16, 32, and 64 wavelengths per waveguide. The CMESH bandwidth constraints are relative to the number of wavelengths, starting with 256-bit electrical connections. The energy per bit decrease from CMESH to SHARP-Dyn is 25%, while the EDP per packet decreases by 80%. Additionally, the average energy per bit in Figure 3.10 shows a 6.4% decrease from the SHARP-Dyn to the SHARP-CoSeg network for 64 wavelengths, while demonstrating a 9.1% increase in EDP per packet. The energy per bit decrease is due to the decrease in hardware needed for SHARP-CoSeg: the segregated core types communicate with half of the chip, which uses less laser power than the other photonic architectures. This result demonstrates that a decrease in the energy spent on a packet doesn't necessarily decrease the amount of time the packet spends within the network.
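The per-wavelength laser power behind these numbers follows a standard link-budget calculation from the dB losses in Table 3.7: the required laser output must cover the receiver sensitivity plus the accumulated losses along the path. The sketch below assumes an illustrative loss path (one modulator, one coupler, a few centimeters of waveguide, one splitter, one drop filter, one photodetector); the exact path lengths and component counts used for the thesis results are not reproduced here.

```python
# Illustrative photonic link budget using the Table 3.7 loss values.
# The chosen path (components traversed and waveguide length) is an
# assumption, not the exact SHARP loss path.

RECEIVER_SENSITIVITY_DBM = -15.0

def laser_power_per_wavelength_mw(waveguide_cm=2.0):
    total_loss_db = (1.0                   # modulator insertion
                     + 1.0                 # coupler
                     + 1.0 * waveguide_cm  # waveguide, 1.0 dB/cm
                     + 0.2                 # splitter
                     + 1.5                 # filter drop
                     + 0.1)                # photodetector
    required_dbm = RECEIVER_SENSITIVITY_DBM + total_loss_db
    return 10 ** (required_dbm / 10.0)     # convert dBm to mW

# Example: a 2 cm path gives 10**(-9.2/10), roughly 0.12 mW per wavelength.
```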

Figure 3.10: Energy per bit based off the average throughput of each network (components: data waveguide, reservation waveguide, router, ring heating, ring modulation, and electrical link).

Furthermore, Figure 3.10 shows a 6.5% energy per bit decrease from SHARP-BanSp to SHARP-Dyn and a 9.7% energy per bit decrease from SHARP-FCFS to SHARP-Dyn for 64 wavelengths. When the bandwidth is constrained from 64 to 32 wavelengths, SHARP-Dyn shows a 19.7% and 3.2% energy per bit decrease when compared to SHARP-BanSp and SHARP-FCFS, respectively. SHARP-Dyn demonstrates a 40.7% and 34.4% decrease in energy per bit when compared to CMESH for 32 and 16 wavelengths, respectively. Additionally, SHARP-Dyn demonstrates a 91.9% and 88.8% decrease in EDP per packet when compared to CMESH for 32 and 16 wavelengths, respectively.

Machine Learning Throughput and Error Results

The goal of the machine learning used with the dynamic bandwidth scaling was to predict the optimal threshold values that would produce the highest link utilization. In this section, we look at the throughput and error results for the bandwidth scaling machine learning. We compare the machine learning data to SHARP-Dyn because the machine learning utilized within this section is layered on top of the SHARP-Dyn architecture. The results shown in this section did not come out favorably for the machine learning architecture.

Figure 3.11: EDP per packet for each network at 64, 32, and 16 wavelengths.

Figure 3.12: Machine learning dynamic bandwidth throughput comparison (SHARP-Dyn vs. ML bandwidth allocation).

When comparing the machine learning dynamic allocation method to SHARP-Dyn in Figure 3.12, there is an 8.5% throughput loss. We believe this performance loss is caused by the 0.46 NRMSE error seen in Figure 3.13. Based on the error results, the machine learning model does not have any problem predicting the outcome of the training and validation data. The issue arises when the model generated by the training data is integrated back into the simulation.

Figure 3.13: Comparison of the normalized RMSE for the training, testing, and validation data for the machine learning dynamic bandwidth scaling.

It can be seen in Figure 3.14 that the error stays below a consistent bound across all of the benchmark pairs.

Results on Power Scaling

In this section, we compare the dynamic laser power scaling (Dyn) method to the machine learning (ML) model. We have broken the evaluation into three different reservation window (RW) sizes: 25, 500, and 2000 cycles. The training of the machine learning algorithm was done with only the 64, 48, 32, and 16 wavelength states for all reservation window sizes. The accuracy of the predictions was much lower when the 8-wavelength state was included in the training and validation processes for any of the data seen in this thesis. A comparison between the testing data with and without the 8-wavelength state is done for a reservation window of 500.

Figure 3.14: Testing data error for each benchmark pair for the machine learning dynamic bandwidth scaling.

The 8-wavelength state is used when the prediction is almost zero; it serves as a near-power-gating state to aid in the conservation of energy. The machine learning runs with the 25- and 2000-cycle reservation windows have the 8-wavelength state already included. The dynamic laser scaling method maintained the same threshold values throughout all reservation window sizes.

When developing any dynamic power scheme for a NoC architecture, the focus is on how much the energy consumption can be decreased while minimizing the impact on the throughput of the network. Both prediction methods are compared to an architecture with a static laser state set at 64 wavelengths. Setting the laser state to a static 64 wavelengths guarantees the best performance while consuming the greatest amount of energy. The area calculations can be seen in Table 3.4. The additions and multiplications for the machine learning were estimated using the numbers from [Hor14a].

Figure 3.15: Average NRMSE for the dynamic power and machine learning prediction methods (training, validation, and testing).

Throughput and Power Results

We chose to compare prediction error using NRMSE in order to have a fair comparison of the two prediction methods. The error for the dynamic laser scaling method was computed by comparing the difference between the current reservation window and the previous one. Figure 3.15 presents the NRMSE values for both the dynamic and machine learning models. ML RW 25 was not included in Figure 3.15 because its NRMSE value was less than zero. Figures 3.18 through 3.24 present the percentage of the simulation time each configuration resides in each dynamic laser scaling state. These figures are included as a reference for the discussion of the laser power and throughput. The power comparison can be seen in Figure 3.16 and the throughput comparison in Figure 3.17.

When looking at the throughput, the best performing configuration was ML RW 2000, with a 0.3% throughput loss compared to the 64 WL baseline. The prediction error for ML RW 2000 did not influence the throughput of the network.

ML RW 2000 had a 99.8% accuracy when predicting the 64-wavelength state. The error did, however, have a significant effect on the power consumption of ML RW 2000. When comparing the laser power consumption, ML RW 2000 performed the worst, with a 42% improvement over the 64 WL baseline. The reason for this can be seen in Figure 3.24: ML RW 2000 spends just under 30% of the simulation in the 64 WL state. ML RW 25 performed the best when comparing the laser power consumption in Figure 3.16. The reason for this can be seen in Figure 3.23: ML RW 25 spends most of the simulation in the 8-wavelength state and demonstrates a 79% power improvement over the 64 WL baseline. As seen in Figure 3.17, the throughput for ML RW 25 suffers because of the prediction error. The error in the prediction caused ML RW 25 to predict the 8-wavelength state when it should have been in a state with higher bandwidth. Additionally, in Figure 3.23, ML RW 25 spends 75% of the simulation time in the 8 WL state; in comparison, in Figure 3.19, Dyn RW 25 spends 48% of the simulation time in the 8 WL state.

The change in reservation window size did not have much of an effect on the throughput of the dynamic laser scaling, with a variation of 1.3% in throughput loss over the 64 WL baseline. As the NRMSE increases, the laser power consumption decreases. Dyn RW 500 had the lowest NRMSE and a 46% power savings over the 64 WL baseline. Dyn RW 25 had the highest NRMSE and demonstrated a 60.5% power savings over the 64 WL baseline. The ML RW 500 configurations perform the same when comparing throughput: when comparing ML RW 500 with and without the 8 WL state, there is almost no throughput loss. When the 8 WL state is included, ML RW 500 demonstrates a 65.5% power savings over the 64 WL baseline, compared to ML RW 500 No 8's 60.7% power savings. This demonstrates that the 8 WL state can improve the power savings when the prediction accuracy is high enough. The best performing of the dynamic laser scaling configurations is Dyn RW 25, with an 8.2% throughput loss and a 60.5% power savings when compared to the 64 WL baseline. If the application needs to maintain the throughput of the network, ML RW 2000 demonstrates a negligible throughput loss with a power savings of 42% when compared to the 64 WL baseline.

Figure 3.16: Average laser power consumption per router for each benchmark (ML RW 500 No 8, ML RW 500, ML RW 2000, ML RW 25, Dyn RW 500, Dyn RW 2000, Dyn RW 25, and the 64 WL baseline).

Figure 3.17: Average network throughput for dynamic laser scaling.

Figure 3.18: Wavelength state breakdown for the dynamic laser scaling with a 500-cycle reservation window.

Figure 3.19: Wavelength state breakdown for the dynamic laser scaling with a 25-cycle reservation window.

Figure 3.20: Wavelength state breakdown for the dynamic laser scaling with a 2000-cycle reservation window.

Figure 3.21: Wavelength state breakdown for the machine learning laser power scaling without the 8-wavelength state and with a 500-cycle reservation window.

Figure 3.22: Wavelength state breakdown for the machine learning laser power scaling with a 500-cycle reservation window.

Figure 3.23: Wavelength state breakdown for the machine learning laser power scaling with a 25-cycle reservation window.

Figure 3.24: Wavelength state breakdown for the machine learning laser power scaling with a 2000-cycle reservation window.

4 CONCLUSIONS AND FUTURE WORK

In this thesis, we have demonstrated the advantage of utilizing photonic interconnects for a heterogenous architecture while implementing dynamic bandwidth allocation using an R-SWMR approach. SHARP-Dyn and SHARP-FCFS demonstrate the advantages of clustering the two core types around one router. On synthetic traffic, SHARP-FCFS exhibits a 57.4% increase in throughput over SHARP-CoSeg, while SHARP-Dyn demonstrates an 80.9% throughput increase over SHARP-CoSeg. The high injection rate of the synthetic traffic creates contention for the resources at each router. The results on the synthetic benchmark pairs show that clustering the two core types at one router can spread the traffic across the network, alleviating the contention at each router. SHARP-Dyn increases the throughput on real traffic while decreasing the energy per bit cost over SHARP-FCFS, SHARP-BanSp, SHARP-CoSeg, and CMESH for varying link bandwidths. The real traffic simulation results produced a 6.9%-14.9% throughput increase for SHARP-Dyn over the other three photonic architectures at 64 wavelengths. Additionally, SHARP-Dyn demonstrates a 34% performance improvement over the baseline CMESH while decreasing the energy per bit by 25%. SHARP-Dyn has shown a 6.4% and 9.7% energy per bit improvement over SHARP-BanSp and SHARP-FCFS, respectively, while the bandwidth is set to 64 wavelengths.

The dynamic power scaling portion of this work has shown promise. We have proposed several different methods to predict when to scale the number of wavelengths per waveguide. Among the dynamic laser scaling configurations, Dyn RW 25 demonstrated the best tradeoff against the 64 WL baseline, with a 60.5% laser power decrease and an 8.2% throughput loss. When considering the machine learning, the needs of the application should be taken into account. ML RW 2000 has a negligible throughput loss when compared to the 64 WL baseline with a 42% laser power

savings. When prioritizing the power savings, ML RW 500 demonstrates a 14.6% throughput loss with a 65.5% power savings.

The machine learning technique used for the dynamic bandwidth scaling did not perform as well as expected. The expectation was to improve upon the performance of the SHARP-Dyn architecture, and the results fell short of that expectation. The issue could be caused by the NRMSE drop from the validation data to the testing data. Future work for this machine learning algorithm could apply the mathematics to a different application; it could be applied to anything with two thresholds that could optimize a network component.

When considering machine learning for real-time systems, the most difficult part is generating a perfect set of training examples. I believe the results in this thesis could be significantly improved if a set of training data could be generated that behaved exactly the way the testing simulation is expected to behave. The training and validation data collected for this work was gathered in a two-stage process. In the first stage, the data was collected using random wavelength states. In the second stage of the training process, a machine learning model was generated from the random data and used to create a second set of data for the final training. The issue arises because the final training data set is not a representation of the ideal way the simulation should perform. When considering machine learning in future work, a better data representation would be preferred for each benchmark pair. For the machine learning for bandwidth scaling, the ideal data set would contain the β_UpperBound values that produce the maximum link utilization. For the machine learning for laser power scaling, the ideal data set would have the highest possible throughput with the lowest laser power consumption. The only way to do this is to design a simulation that can iteratively select the ideal parameters for each reservation window.

REFERENCES

[ACP11] Konstantinos Aisopos, Chia-Hsin Owen Chen, and Li-Shiuan Peh. Enabling system-level modeling of variation-induced faults in networks-on-chips. In Proceedings of the 48th Design Automation Conference (DAC '11), New York, NY, USA, 2011. ACM.

[AFB+09] J. Ahn, M. Fiorentino, R. G. Beausoleil, N. Binkert, A. Davis, D. Fattal, N. P. Jouppi, M. McLaren, C. M. Santori, R. S. Schreiber, S. M. Spillane, D. Vantrease, and Q. Xu. Devices and architectures for photonic chip-scale integration. Applied Physics A, 95(4), 2009.

[AMD] AMD. Carrizo.

[AMP+15] M. Arora, S. Manne, I. Paul, N. Jayasena, and D. M. Tullsen. Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU integrated systems. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), February 2015.

[Arm] Lucian Armasu. Google's big chip unveil for machine learning: Tensor Processing Unit with 10x better efficiency. URL: com/news/google-tensor-processing-unit-machine-learning,31834.html

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[BKA10] Ali Bakhoda, John Kim, and Tor M. Aamodt. On-chip network design considerations for compute accelerators. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10), 2010.

[BL09] C. Bienia and K. Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.

[BSP+16] B. Bohnenstiehl, A. Stillmaker, J. Pimentel, T. Andreas, B. Liu, A. Tran, E. Adeagbo, and B. Baas. KiloCore: A 32 nm 1000-processor array. In IEEE HotChips Symposium on High-Performance Chips, August 2016.

[Chr98] Gary E. Christensen. MIMD vs. SIMD parallel processing: A case study in 3D medical image registration. Parallel Computing, 24, 1998.

[CTD+16] J. Chen, Y. Tang, Y. Dong, J. Xue, Z. Wang, and W. Zhou. Reducing static energy in supercomputer interconnection networks using topology-aware partitioning. IEEE Transactions on Computers, 65(8), August 2016.

[DBKL16] D. DiTomaso, T. Boraten, A. Kodi, and A. Louri. Dynamic error mitigation in NoCs using intelligent prediction techniques. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1-12, October 2016.

[DH16] Y. Demir and N. Hardavellas. SLaC: Stage laser control for a flattened butterfly network. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016.

[EKK12] E. Kotelnikov, A. Katsnelson, K. Patel, and I. Kudryashov. High-power single-mode InGaAsP/InP laser diodes for pulsed operation. Proceedings of SPIE, pages 1-6, 2012.

[FAA08] Antonio Flores, Juan L. Aragón, and Manuel E. Acacio. An energy consumption characterization of on-chip interconnection networks for tiled CMP architectures. Journal of Supercomputing, 45(3), September 2008.

[FLL+15] Juan Fang, Zhen-Yu Leng, Si-Tong Liu, Zhi-Cheng Yao, and Xiu-Feng Sui. Exploring heterogeneous NoC design space in heterogeneous GPU-CPU architectures. Journal of Computer Science and Technology, 30(1):74-83, 2015.

[GH11] C. Gregg and K. Hazelwood. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2011.

[GLM+11] M. Georgas, J. Leu, B. Moss, C. Sun, and V. Stojanović. Addressing link-level design tradeoffs for integrated photonic interconnects. In 2011 IEEE Custom Integrated Circuits Conference (CICC), pages 1-8, September 2011.

[GP16] J. Guo and M. Potkonjak. Coarse-grained learning-based dynamic voltage frequency scaling for video decoding. In 2016 International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pages 84-91, September 2016.

[HB14] M. J. R. Heck and J. E. Bowers. Energy efficient and energy proportional optical interconnects for multi-core processors: Driving the need for on-chip sources. IEEE Journal of Selected Topics in Quantum Electronics, 20(4), July 2014.
[HCC+06] M. Haurylau, G. Chen, H. Chen, J. Zhang, N. A. Nelson, D. H. Albonesi, E. G. Friedman, and P. M. Fauchet. On-chip optical interconnect roadmap: Challenges and critical directions. IEEE Journal of Selected Topics in Quantum Electronics, 12(6), November 2006.
[Hor14a] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10-14, February 2014.
[Hor14b] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10-14, February 2014.
[Inta] Intel. 6th generation Intel Core i7 processors (formerly Skylake). URL: html?wapkw=skylake
[Intb] Intel. The next generation of computing has arrived: Performance to power amazing experiences. URL: mobile-5th-gen-core-app-power-guidelines-addendum.pdf
[JJP+12] Thomas B. Jablin, James A. Jablin, Prakash Prabhu, Feng Liu, and David I. August. Dynamically managed data for CPU-GPU architectures. In CGO '12, New York, NY, USA, 2012. ACM.
[JPS16] R. Jain, P. R. Panda, and S. Subramoney. Machine learned machines: Adaptive co-optimization of caches, cores, and on-chip network. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2016.
[KLJK14] Y. Kim, J. Lee, J. E. Jo, and J. Kim. GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), February 2014.
[KM10] Nevin Kirman and José F. Martínez. A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing. SIGARCH Computer Architecture News, 38(1):15-28, March 2010.

[LAS+09] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42), New York, NY, USA, 2009. ACM.
[LHE+13] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. SIGARCH Computer Architecture News, 41(3), June 2013.
[LLKY13] Jaekyu Lee, Si Li, Hyesoon Kim, and Sudhakar Yalamanchili. Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures. ACM Transactions on Design Automation of Electronic Systems, 18(4):48:1-48:28, October 2013.
[LSH10] Etienne Le Sueur and Gernot Heiser. Dynamic voltage and frequency scaling: The laws of diminishing returns. In Proceedings of the 2010 International Conference on Power Aware Computing and Systems (HotPower '10), pages 1-8, Berkeley, CA, USA, 2010. USENIX Association.
[MB06] Giovanni De Micheli and Luca Benini. Networks on Chips: Technology and Tools. Systems on Silicon. Morgan Kaufmann Publishers, 2006.
[Mil09] D. A. B. Miller. Device requirements for optical interconnects to silicon chips. Proceedings of the IEEE, 97(7), July 2009.
[MKL12a] R. Morris, A. K. Kodi, and A. Louri. 3D-NoC: Reconfigurable 3D photonic on-chip interconnect for multicores. In 2012 IEEE 30th International Conference on Computer Design (ICCD), September 2012.
[MKL12b] R. Morris, A. K. Kodi, and A. Louri. Dynamic reconfiguration of 3D photonic networks-on-chip for maximizing performance and improving fault tolerance. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012.
[Moo98] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82-85, January 1998.
[MPCL10] Sasikanth Manipatruni, Kyle Preston, Long Chen, and Michal Lipson. Ultra-low voltage, ultra-small mode volume silicon microring modulator. Optics Express, 18(17), 2010.

[MV15] Sparsh Mittal and Jeffrey S. Vetter. A survey of CPU-GPU heterogeneous computing techniques. ACM Computing Surveys, 47(4):69:1-69:35, July 2015.
[NFA11] C. Nitta, M. Farrens, and V. Akella. Addressing system-level trimming issues in on-chip nanophotonic networks. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), February 2011.
[NVI] NVIDIA. Tegra X1.
[OPA15] O. Parekh, J. Wendt, L. Shulenburger, A. Landahl, J. Moussa, and J. Aidun. Benchmarking adiabatic quantum optimization for complex network analysis. Technical report, Sandia National Laboratories, 2015.
[PKK+09] Yan Pan, Prabhat Kumar, John Kim, Gokhan Memik, Yu Zhang, and Alok Choudhary. Firefly: Illuminating future network-on-chip with nanophotonics. SIGARCH Computer Architecture News, 37(3), June 2009.
[SCK+12] Chen Sun, C.-H. O. Chen, G. Kurian, Lan Wei, J. Miller, A. Agarwal, Li-Shiuan Peh, and V. Stojanović. DSENT: A tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In 2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), 2012.
[SMJ+14] A. Shah, N. Mansoor, B. Johnstone, A. Ganguly, and S. L. Alarcon. Heterogeneous photonic network-on-chip with dynamic bandwidth allocation. In 2014 IEEE International System-on-Chip Conference (SOCC), September 2014.
[Sor06] R. Soref. The past, present, and future of silicon photonics. IEEE Journal of Selected Topics in Quantum Electronics, 12(6), November 2006.
[TWG+12] H. Tian, G. Winzer, A. Gajda, K. Petermann, B. Tillack, and L. Zimmermann. Fabrication of low-loss SOI nano-waveguides including BEOL processes for nonlinear applications. Journal of the European Optical Society - Rapid Publications, 7, 2012.
[UJM+12] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2Sim: A simulation framework for CPU-GPU computing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12), 2012.
[USM+14] Rafael Ubal, Dana Schaa, Perhaad Mistry, Xiang Gong, Yash Ukidave, Zhongliang Chen, Gunar Schirner, and David Kaeli. Exploring the heterogeneous design space for both performance and reliability. In Proceedings of the 51st Annual Design Automation Conference (DAC '14), pages 181:1-181:6, 2014.
[VSM+08] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. P. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. G. Beausoleil, and J. H. Ahn. Corona: System implications of emerging nanophotonic technology. In 2008 35th International Symposium on Computer Architecture (ISCA), June 2008.
[WOT+95] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.
[XZL+12] L. Xu, W. Zhang, Q. Li, J. Chan, H. L. R. Lira, M. Lipson, and K. Bergman. 40-Gb/s DPSK data transmission through a silicon microring switch. IEEE Photonics Technology Letters, 24(6), March 2012.
[ZAU+15] Amir Kavyan Ziabari, José L. Abellán, Rafael Ubal, Chao Chen, Ajay Joshi, and David Kaeli. Leveraging silicon-photonic NoC for designing scalable GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15), 2015.
[ZK13] L. Zhou and A. K. Kodi. PROBE: Prediction-based optical bandwidth scaling for energy-efficient NoCs. In 2013 Seventh IEEE/ACM International Symposium on Networks-on-Chip (NoCS), pages 1-8, April 2013.
[ZKDI11] Hui Zhao, Mahmut Kandemir, Wei Ding, and Mary Jane Irwin. Exploring heterogeneous NoC design space. In Proceedings of the International Conference on Computer-Aided Design (ICCAD '11), 2011.
