Flow-controlled and SDN-enabled optical switching system for high-capacity and low-latency flat data center networks Miao, W.


Published: 25/01/2017
Document version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers).

Flow-controlled and SDN-enabled Optical Switching System for High-capacity and Low-latency Flat Data Center Networks

THESIS

to obtain the degree of doctor at the Eindhoven University of Technology, on the authority of the Rector Magnificus, prof.dr.ir. F.P.T. Baaijens, to be defended in public before a committee appointed by the Doctorate Board on Wednesday 25 January 2017 by

Wang Miao

born in Taiyuan, China

This thesis has been approved by the promotors, and the composition of the doctoral committee is as follows:

chairman:      prof.dr.ir. A.B. Smolders
1st promotor:  prof.ir. A.M.J. Koonen
copromotor:    dr. N. Calabretta
members:       prof.dr. D. Simeonidou (University of Bristol)
               prof.dr. S. Spadaro (Universitat Politècnica de Catalunya)
               prof.dr. J. Bauwelinck (Ghent University)
               prof.dr. J.J. Lukkien
advisor:       dr. O. Raz

The research or design described in this thesis has been carried out in accordance with the TU/e Code of Scientific Conduct.

A catalogue record is available from the Eindhoven University of Technology Library.

Title: Flow-controlled and SDN-enabled Optical Switching System for High-capacity and Low-latency Flat Data Center Networks
Author: Wang Miao
Eindhoven University of Technology, 2017
ISBN:
NUR 959
Keywords: Data center network / Optical switching / Software defined networking / Wavelength division multiplexing

Copyright 2017 by Wang Miao
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the prior written consent of the author.


Scientists investigate that which already is;
Engineers create that which has never been.

- Albert Einstein


Abstract

Driven by cloud computing, the Internet of Things, and emerging big-data applications, increasingly stringent requirements in terms of high bandwidth, low latency and large interconnectivity are imposed on the communications within data centers. The traditional intra-data center network based on electronic switches is organized in a hierarchical topology, which suffers from communication bottlenecks and poor power efficiency. Architectural and technological innovations are therefore desired to enable scalable growth both in the number of connected endpoints and in the exchanged traffic volume. Transparent optical switching is considered an attractive technology in this respect, providing data-rate and data-format agnostic operation while eliminating power-consuming transceivers as well as format-dependent interfaces.

The research work described in this thesis first focuses on the design, implementation and assessment of a high-capacity and scalable fast optical switch node. These activities were carried out in the framework of the EU FP7 project LIGHTNESS, targeting the demonstration of a high-performance and programmable optical data center network. The fast optical switch is a key element in the data plane, utilized to effectively cope with the burst-like data center traffic. The parallel processing of the RF-tone label, the distributed control resulting from the modular structure, and the nanoseconds operation of the semiconductor optical amplifier (SOA) gates allow for a nanoseconds reconfiguration time of the switch to support statistically multiplexed traffic.

An optical flow control mechanism employing bidirectional transmission of the in-band label and the flow control signal has been proposed and investigated. For packets contended at the optical switch, fast retransmission of the electronically stored copy is realized, mitigating the lack of optical buffers in a closed environment such as a data center network. Moreover, to facilitate the flexible provisioning and configuration of the network enabled by the software-defined networking (SDN)-based control scheme, a dedicated SDN-enabled control interface has been developed for the fast optical switch node. A control agent is used to bridge the SDN control plane and the optical switch controller, handling the extended OpenFlow protocol and the proprietary interface of the two entities, respectively. Programmable functionalities improving the network performance can therefore be supported by updating the look-up table and monitoring the statistics kept in the optical switch controller. The flow-controlled and SDN-enabled optical switch node has been prototyped, and the experimental assessments have shown dynamic switching operation of high-capacity packetized data with low-latency delivery and potential scalability to a large number of ports. Considering a practical implementation incorporating a burst-mode receiver, the required preamble length has been experimentally investigated. The 4×4 prototype has also been successfully employed as the optical packet switch node in the final testbed and demonstration of the LIGHTNESS project. The combination of the innovative data plane devices and the unified SDN control plane has demonstrated an SDN-enabled and fully programmable optical data center network.

The fast transparent switching capability of the optical switch node is then exploited in research activities beyond the LIGHTNESS project. A novel data center network architecture, OPSquare, based on two parallel switching networks is proposed and investigated. The fast flow-controlled optical switch nodes are used as the switching elements, which allow for flexible switching capability in both the wavelength and time domains. Benefiting from the scalability enabled by the architecture and the transmitter wavelength assignment for the grouped top-of-the-rack (ToR) switches, large interconnectivity can be achieved by utilizing moderate port-count optical switches with low broadcasting ratios. The switching performance is evaluated by simulations with realistic data center traffic as well as by experiments using the prototyped switches. The potential of providing Petabit/s capacity and low-latency switching is studied with higher-order modulation formats and waveband signals along with optical switch port-count scaling.

To enhance the efficiency and agility of data center interconnect networks, an analysis of an optical label switched add-drop node utilizing fast SOA gates is presented. Fast power equalization of the channels is implemented by monitoring the in-band optical label and adjusting the bias current of the SOA. Packet-based add/drop operation for high-capacity and statistically multiplexed data traffic is enabled, significantly improving the flexibility and resource utilization of the network. Scalability in terms of capacity and number of crossed nodes is also experimentally assessed.

Exploiting the capability for photonic integration featured by the modular optical switch node, a photonic integrated circuit (PIC) has been fabricated and assessed. More than a hundred optical components are integrated on the same chip, resulting in reduced footprint as well as power consumption, which are necessities in data center networks. The dynamic switching operation of high data-rate and multi-level modulated traffic is experimentally investigated, indicating the potential scalability to higher data rates and larger port-counts.

Overall, the results achieved in this thesis have demonstrated the high-capacity, low-latency and programmable switching capabilities of the fast optical switch node under investigation. In combination with the port-count scalability enabled by the SOA-based modular structure as well as the capability for photonic integration, it serves as a promising candidate for next-generation data center networking solutions.


Contents

Abstract

Chapter 1 Introduction
   Data centers
   Data center network
   Optics in data centers
   Data center traffic
   Requirements for data center networks
   Challenges for current data center network
   Optical data center network
   Optical switching technologies
   Optical data center network: state-of-the-art
   Technical challenges
   Scope of this thesis
   Groundwork: LIGHTNESS project
   Novel contributions to the field
   Outline of the thesis

Chapter 2 SOA-based fast optical switch with modular architecture
   Switching operation
   Modular architecture
   Synchronous operation
   Contention resolution
   Optical in-band RF tone labeling
   Labeling technique
   Label processor
   Label generator
   FPGA-based switch controller
   1×F optical switch
   Broadcast-and-select structure
   SOA gate
   Summary

Chapter 3 Low latency and efficient optical flow control
   Optical flow control
   Experimental demonstration
   Experiment set-up
   Dynamic operation
   Payload and flow control signal performance
   Variable link length
   Summary

Chapter 4 Software defined networking enabled control interface
   Fast optical switch based VN
   SDN control plane
   Fast optical switch based VN
   SDN-enabled control interface
   Control architecture
   OpenFlow extensions
   Experimental evaluation
   Virtual network reconfiguration
   Priority assignment
   Statistics report and load balancing
   Summary

Chapter 5 4×4 fast optical switch prototype and performance assessments
   4×4 fast optical switching system
   Traffic generation
   SDN-enabled control
   Dynamic switching
   Packet loss and latency
   Scalability
   Prototype implementation
   Label processor
   SOA/DML driver
   Implemented prototype and power consumption
   Performance assessment with burst-mode receiver
   Case I - minimum preamble length
   Case II - asynchronous packets
   Case III - clock data recovery
   LIGHTNESS testbed and final demonstration
   Testbed set-up
   VDC management and applications
   OPS/OCS operation
   LIGHTNESS final demo
   Summary

Chapter 6 OPSquare DCN based on flow-controlled fast optical switches
   OPSquare DCN architecture
   ToR switch
   Fast optical switch node
   Numerical investigation
   Simulation set-up
   Traffic distribution
   Buffer dimensioning
   Interconnectivity scalability
   Comparison of latency and power consumption
   Experimental investigation
   Scalability investigation
   Optical switch for 4096-ToR interconnectivity
   Multi-level modulation formats and waveband switching
   Summary

Chapter 7 Optical label switched add-drop node for data center interconnect
   Optical label switched add-drop node
   Flexibility and efficiency required in DCI networks
   Structure and operation
   Assessment with 50 Gb/s data traffic
   Experiment set-up
   Dynamic operation
   Statistic investigation
   Assessment with high-capacity waveband traffic
   Experiment set-up
   Dynamic operation
   Scalability investigation
   Summary

Chapter 8 Photonic integrated fast optical switch
   Fast optical switch PIC
   Gb/s, 20 Gb/s and 40 Gb/s NRZ-OOK traffic
   Experimental set-up
   Dynamic switching
   BER performance
   High data-rate and multi-level modulated traffic
   Experimental set-up
   Assessment results
   Summary

Chapter 9 Summary and outlook
   Summary
   Outlook

Appendix A
   A.1 Label processor
   A.2 Label generator

Bibliography
Acronyms
List of Publications
Acknowledgments
Curriculum Vitae


Chapter 1 Introduction

1.1 Data centers

Data centers (DCs) play an important role in modern IT infrastructures. A data center is home to the computational power, storage, and applications necessary to support the services of individuals and organizations (such as enterprises, banks, academic institutions, etc.). Data centers process billions of Internet activities every day. The rise of cloud computing, the Internet of Things, and emerging big-data applications has significantly increased the traffic between and within data centers [1]. As indicated in Figure 1.1, annual global data center IP traffic is projected to reach 10.4 Zettabytes by the end of 2019, which represents a three-fold increase from 2014 [2]. Most (three-quarters) of the consumer and business traffic flowing in data centers resides within the data center. Therefore, as the beating hearts of the companies they serve, data centers are growing phenomenally in size and complexity [3, 4]. Large data centers with hundreds of thousands of servers are being deployed by renowned ICT organizations such as IBM, Microsoft, Amazon, Facebook, and Google [5, 6].

Figure 1.1: Data center traffic growth and the destination distributions.

Besides the growth in size, data centers are also witnessing increasing demands for more powerful computational performance, mainly driven by data-intensive applications as well as high-density virtualization [7, 8]. Here virtualization refers to the act of decoupling the logical requirements of the computation from the actual physical infrastructure with faithful abstraction (resource pooling), which allows for efficient resource sharing over the same physical infrastructure. In consequence, high-performance and energy-efficient multi-core processors are being developed aggressively to provide higher processing capability [9]. Foreseeing the preservation of Moore's law through chip-level parallelism, multi-core products are expected to scale unabated in computing systems [10, 11]. However, the overall performance of a computing system such as a data center is not only determined by its computing capacity but is also inextricably dependent on the capabilities and performance of the interconnection network infrastructure. As indicated by Amdahl's balanced system law, the interconnects and switching elements should guarantee a balanced I/O bandwidth performance among the underlying compute nodes [12].

Data center network

Data centers consist of a multitude of servers as computing nodes and storage subsystems, interconnected with the appropriate networking hardware and accompanied by highly-engineered power and cooling subsystems [13]. The data center network (DCN) is responsible for supporting the large amounts of workload exchanged among the parallel server machines. The traditional DCN uses a multi-tier tree-based architecture, as depicted in Figure 1.2. Tens of servers are housed in individual racks and interconnected by the top-of-the-rack (ToR) switches via copper or optical links. Racks are grouped together into clusters. The inter-rack communication is handled by layers of electronic switches. Normally each ToR switch is connected to more than one set of aggregation-level switches in the same cluster to provide redundancy and path diversity. The aggregation switches are further connected with each other through the core switches via optical transceivers. This architecture has two main advantages. First, it can easily be scaled to provide large connectivity in a cost-effective way. The second is the fault tolerance enabled by the extensive path diversity, which enables graceful degradation of the network performance even under failure.

Ideally, the networking fabric, including the switches, connecting every server to every other server in a data center should provide full bisection bandwidth (i.e., the available bandwidth between two bisected sections equals the aggregate bandwidth of the servers in one of the sections) [13]. In this case, the oversubscription, defined as the ratio of the worst-case achievable bandwidth among the servers to the total bisection bandwidth, is 1:1, indicating high server utilization and computation efficiency. However, due to the super-linear costs associated with scaling the bandwidth and port density of conventional electronic switches, such a design would be prohibitively expensive for a current large-scale DCN. In practice, data centers tend to enforce a certain degree of oversubscription, with a factor from 4 to 10 [14, 15]. For instance, a ToR switch featuring 48 ports commonly assigns 40 ports for interconnecting servers and 8 for connecting aggregation switches, leading to an oversubscription of 5:1 at the ToR level. There is more bandwidth available for intra-rack communication than for inter-rack communication, and a similar trend is found at higher switching layers.

Figure 1.2: Traditional tree-like data center network.

Currently the DCN has evolved into folded-Clos and leaf-spine topologies, which are schematically shown in Figure 1.3. The folded-Clos architecture, also known as fat-tree, uses small and identical commodity switches. Although the same aggregation bandwidth can be achieved at each layer, the folded-Clos suffers from daunting cabling complexity when scaling the network interconnectivity, in which case a tremendous amount of optical transceivers is required between the switching layers. In the leaf-spine architecture, a series of leaf switches accessing the servers is fully meshed to a series of spine switches. The leaf switches are no more than one hop away from one another, minimizing the latency and the likelihood of bottlenecks. However, with the increase in the number of connected nodes and operating speed, huge capacity is required for the spine switch, which forces the deployment of oversubscription and an extra layer of super-spine switches [16].
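
As a quick check on the oversubscription figures above, the ratio follows directly from the port allocation of the 48-port ToR switch. The short calculation below is only an illustration and assumes that all ports run at the same line rate (e.g., 10 Gb/s); with 40 server-facing ports and 8 uplinks it reproduces the 5:1 figure.

    def oversubscription(server_ports: int, uplink_ports: int,
                         server_rate_gbps: float = 10.0,
                         uplink_rate_gbps: float = 10.0) -> float:
        """Ratio of worst-case offered server bandwidth to available uplink capacity."""
        offered = server_ports * server_rate_gbps    # traffic the servers can offer
        uplink = uplink_ports * uplink_rate_gbps     # capacity towards the aggregation layer
        return offered / uplink

    # 48-port ToR switch: 40 server-facing ports and 8 uplinks, all at 10 Gb/s
    print(oversubscription(40, 8))   # -> 5.0, i.e. an oversubscription of 5:1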

Figure 1.3: Folded-Clos and leaf-spine data center network architectures.

Optics in data centers

Fiber optics has already played a critical role, mostly as an interconnect medium, in delivering on the potential of the data center network. At 10 Gb/s and beyond, passive and active copper cables are impractical beyond a few meters of reach due to their bulky size, frequency-dependent loss and the high power consumption of the transceivers [17]. With the ever-increasing demand for higher speed, optical interconnects will eventually replace the traditional copper-based solutions even for the links between the servers and the ToR switches. The rapid development of vertical-cavity surface-emitting laser (VCSEL)-based and silicon photonics (SP)-based optical interconnects has accelerated this transition.

VCSEL-based multi-mode fiber (MMF) 10 Gb/s short-reach optical interconnects are commonly used due to the VCSEL's circular output beam (easy coupling into optical fiber), low threshold current (energy-efficient and low-cost driver) as well as easy mass fabrication and wafer-level testability (cost-efficiency). Especially the 850 nm VCSELs together with optimized multi-mode fibers (MMFs) have dominated the short-reach (up to hundreds of meters) optical interconnects in data center applications [18, 19]. However, modal dispersion ultimately limits the reach-bandwidth product of VCSEL-based active optical links [20]. To scale the data rate (> 25 Gb/s) and the reach (> 300 m) to accommodate larger data center networks, high-power distributed feedback (DFB) lasers together with single-mode fiber (SMF) need to be used [21]. The narrower spectrum in comparison with their VCSEL-based counterparts also allows for future-proof scalability by exploiting wavelength division multiplexing (WDM) [22]. In the past decade, significant advances have been made in SP-based WDM interconnects to further scale the bandwidth density and address the energy efficiency and cost of optical transceivers [23, 24].

Although not the material of choice for semiconductor lasers due to its indirect bandgap, silicon has good thermal conductivity, transparency at the traditional telecom wavelengths and, most importantly, allows leveraging the highly-developed silicon CMOS fabrication process for electronic-photonic convergence [17]. Photonic integration on silicon, combining active devices (e.g., lasers and modulators) and passive building blocks (e.g., wavelength multiplexers and grating couplers), is extensively investigated as the emerging solution for cost-effective and scalable WDM interconnects in data center networks [25, 26]. Despite the increased power consumption and fabrication complexity, the high capacity achieved within a compact integrated device has pushed forward the commercialization and deployment of SP-based WDM interconnects [27, 28]. For instance, SP-based 100 Gigabit Ethernet (GbE) solutions (e.g., QSFP28 and CFP4) containing four 25 Gb/s coarse WDM channels are being considered aggressively for data centers [29]. With the maturing of the technology, higher capacity with an increasing bit-rate per channel and a larger number of WDM channels is expected in the coming years, providing a total capacity per link reaching Terabit/s while maintaining the same form factor, power consumption and cost [30].

Data center traffic

Data centers are revolutionizing the way in which communication networks are built, triggered by the ever-increasing bandwidth demands of cloud computing. The appropriate design, scale, and even technology of a data center network depend intimately on the traffic demands of the services it hosts. Therefore, before putting a great deal of effort into identifying effective topologies for data center networks, a firm and clear understanding of the data center traffic is necessary. There are several studies of data center traffic features conducted via simulations [31], on testbeds [32] and on production data centers such as Microsoft [33, 34] and Facebook [35]. Although there are differences among the various types of applications and hosting environments, some interesting findings including representative commonalities are noted as follows.

- Traffic flow locality: A traffic flow is specified as an established link between two servers. The traffic flow locality describes whether the traffic generated by the servers in a rack is directed to servers located in the same rack (intra-rack traffic) or to other racks (inter-rack traffic). The locality of a rack's traffic is environment-dependent and related to its constituent servers (e.g., Web server or cache). In general, the inter-rack traffic fluctuates from 10 to 80% of the total traffic [36]. Recent studies have shown a continuous increase of the inter-rack traffic, with a clear majority of traffic being intra-cluster (> 50%) but not intra-rack (< 15%) [35]. Instead of being rack-local as reported in the literature [37], the traffic flows are more cluster-local, in which case high-speed networks are required between the racks while lower-cost commodity switches can be used inside the rack.

- Utilization: Utilization here is the percentage of the data throughput relative to the maximum capacity of the link. The overall access-link (i.e., links between servers and ToR switches) utilization is quite low, with 99% of the links typically lightly (10%) loaded [35]. However, the probability of having traffic output from the servers (i.e., the load) can still vary considerably over time, resulting in temporarily high (> 25%) utilization rates. The network experiences congestion drops in such conditions, mainly caused by the inherent burstiness of flows and the intentionally oversubscribed network [38]. This can be mitigated by upgrading the interconnect links to higher data-rate capacity and spreading the traffic load with an evenly distributed utilization across the network. Utilization also rises at higher levels of aggregation, with busy links reaching up to 50% utilization [35, 39]. Higher-bandwidth interconnects in combination with high-capacity switching elements are required, especially for inter-rack and inter-cluster communications.

- Traffic flow size and packet size: Most flows in data centers are short; the majority send less than 10 KB and last less than 10 seconds. Larger flows lasting for minutes are more likely to be seen at cache servers. However, regardless of the size, most flows are active (i.e., actually transmit packets) only during distinct millisecond-scale intervals with large intervening gaps; they tend to be internally bursty. Regarding the packet size, Hadoop traffic (MapReduce jobs) shows a bimodal distribution, with one mode around 1500 bytes corresponding to the maximum transmission unit (MTU) length of an Ethernet frame and the other at the small sizes of control packets (e.g., TCP ACKs) [14, 35]. Packets of the other services have a wider distribution, but the median size is less than 200 bytes, with only 5-10% of the packets fully utilizing the MTU.

- Concurrent flows: The number of concurrent traffic flows per server is also of paramount importance for the design of the network topology. The servers in data centers have tens to hundreds of concurrent flows on average [35, 40]. Considering the small fraction of intra-rack traffic, almost all flows will traverse an up-link at the ToR switch as inter-rack communication. Therefore, the number of interconnections supported by the aggregation switching network should be large enough to accommodate the number of concurrent flows.
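
The statistics above are also the kind of input needed for the packet-level simulations reported later in this thesis. The sketch below is a minimal, illustrative ON/OFF traffic generator loosely parameterized on these findings (bimodal packet sizes, millisecond-scale bursts, mostly small flows); all distributions and parameter values are assumptions chosen for illustration and do not describe the simulator used in Chapter 6.

    import random

    MTU_BYTES = 1500          # Ethernet MTU mode of the bimodal packet-size distribution
    ACK_BYTES = 64            # small control packets (e.g., TCP ACKs), size assumed
    SMALL_PKT_FRACTION = 0.9  # most packets stay far below the MTU

    def packet_size() -> int:
        """Draw a packet size from a simple bimodal distribution."""
        return ACK_BYTES if random.random() < SMALL_PKT_FRACTION else MTU_BYTES

    def generate_flow():
        """Yield (time_ms, size_bytes) packets of one internally bursty flow.

        The flow is active in a few millisecond-scale ON periods separated by long
        OFF gaps, and its total volume stays below roughly 10 KB.
        """
        t_ms, total = 0.0, 0
        for _ in range(random.randint(1, 5)):           # a handful of ON periods
            end = t_ms + random.uniform(0.5, 2.0)       # ms-scale burst duration
            while t_ms < end and total < 10_000:
                size = packet_size()
                total += size
                yield t_ms, size
                t_ms += random.expovariate(10.0)        # ~0.1 ms mean inter-packet gap
            t_ms += random.uniform(10.0, 100.0)         # long OFF gap between bursts

    packets = list(generate_flow())
    print(len(packets), "packets,", sum(s for _, s in packets), "bytes in total")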

As can be seen from these empirical results, high-performance switching networks are necessary to effectively handle the increasing inter-rack traffic and the burstiness of the traffic flows. In addition, fine switching granularity (e.g., deploying statistical multiplexing) and low switching latency are preferable to improve the efficiency in connection with the bursty flows and variable packet sizes. The large number of concurrent flows makes large interconnectivity a necessity for each server, in which case circuit-based approaches may be challenging to employ.

Requirements for data center networks

The cloud computing paradigm imposes a set of stringent requirements on data center networks. A few key requirements are briefly summarized here.

Capacity: An increasing fraction of data centers is migrating to warehouse scale. Although substantial traffic will continue to cross between users and data centers, the vast majority of the data communication takes place within the data center [2]. For instance, a data center hosting over 100,000 servers, each capable of 10 Gb/s bandwidth, would require an internal Petabit/s capacity to support full-bandwidth communication among the servers.

Latency: Packet latency in a data center can be defined as the time it takes for a packet to traverse the network from the sender to the receiver node (end-to-end latency), which includes both the propagation latency and the switch latency. In a closed environment like a data center, the latency is dominated by the switch latency, mainly contributed by buffering, the routing algorithm, and arbitration. Low latency is a critical performance requirement, especially for mission-critical and latency-sensitive applications where microseconds matter. These range from financial networking to cloud networking applications that demand the lowest possible latency.

Scalability: The network architecture should enable scaling to a large number of nodes to address future capacity needs in a cost-efficient manner. Extension of an existing network in terms of both node count and bandwidth in an incremental fashion is preferable, i.e., without having to replace a disproportionate amount of the installed hardware.

Flexibility: Data centers are expected to adopt technologies that allow them to flexibly manage service delivery and quickly adapt to changing needs. Virtualization can enhance such automated provisioning, with the resources (such as computing, storage and network) pooled and dynamically optimized by the control plane through software configuration. To this end, the interconnecting network consisting of switching elements needs to be fully virtualized to enable the creation of virtual networks and their subsequent management (e.g., security, load balancing). In addition, open standards, open protocols and open-source development are more and more involved to facilitate and speed up the deployment, operation and management of the virtual networks.

Power/cost-efficiency: A data center represents a significant investment, of which the data center network occupies a significant portion [41]. Besides the costs for hardware and software installation, running a large-scale data center is mainly a power consumption matter. Power-efficiency is a key target for reducing the energy-related costs and scaling the bandwidth by improving the power-density performance. In this sense, significant efforts have been made towards the employment of optical technology and virtualization, leading to enhancements in power and cost efficiency [42, 43].

Challenges for current data center network

Facing the scaling number of server nodes with rapid upgrades in I/O bandwidth [44], the abovementioned requirements are quite challenging for the current data center network, in terms of both the switching nodes and the network architecture.

Electronic switch

It is difficult for the electronic switch to satisfy future bandwidth needs. The increasing parallelism in microprocessors has enabled continued advancements in computational density [45]. A server blade with 4 multi-processor cards, each accommodating 4 quad-core 2.5 GHz processors, can provide a processing power of 160 GHz, which in turn demands an I/O bandwidth higher than 100 Gb/s to achieve balanced computing performance. A ToR switch grouping 40 server blades would then have a total aggregation bandwidth of multi-Terabit/s. With the help of optical interconnects, each aggregation switch node needs to handle tens of Tb/s of aggregated traffic. Despite the continuous efforts of merchant silicon providers towards the development of application-specific integrated circuits (ASICs), the implementation of a high-bandwidth electronic switch node is limited by the switch ASIC I/O bandwidth (to roughly 8 Tb/s) due to the scaling issues of the ball grid array (BGA) package [46]. Higher bandwidth is achievable by stacking several ASICs in a multi-tier structure, but at the expense of larger latency, higher cost, and power consumption.

Another limiting factor is the power consumption. As an electronic switch has to store and transmit each bit of information, it dissipates energy with each bit transition, resulting in a power consumption at least proportional to the bit-rate of the information it carries. The power consumed by electronic switches is prohibitively high: a single switch located in the higher network tiers can reach tens of kilowatts of power consumption when the dedicated cooling system is also considered [47]. With the scaling in port-count and capacity, this value keeps growing, greatly deteriorating the power-efficiency and cost-efficiency performance.

Hierarchical network topology

Interconnecting thousands of ToRs, each with multi-Tb/s aggregated traffic, would put enormous pressure on the multi-tier tree-like topology employed by current data center networks. Due to the limited performance in terms of bandwidth and port density of conventional electronic switches, the network is commonly arranged with oversubscription [13]. Consequently, data-intensive computations become bottlenecked, especially for the communication between servers residing in different racks/clusters. The multiple layers of switches also introduce large latency when a packet traverses the aggregation and core switches to reach its destination, mainly caused by the queueing delay of buffer-related processing.

Besides the power-hungry electronic switching fabrics, the electrical-to-optical (E/O) and optical-to-electrical (O/E) conversions between switching layers actually dissipate a large portion of the total consumed power in the data center network [48]. In addition, fueled by the demand for higher data rates, it becomes beneficial for transceivers to employ WDM and multi-level modulation schemes instead of simple on-off keying (OOK) modulation. Therefore, dedicated parallel optics and electronic circuits for processing the format-dependent signals need to be further included as front-ends of the electronic switches, which contribute unnecessarily to a higher power consumption and higher cost in the hierarchical network.

Therefore, to effectively address the bandwidth, latency, scalability and power requirements imposed by next-generation data center networks, innovations in switching technology and network architecture are of paramount significance.
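
To put the numbers from the electronic-switch discussion together, the back-of-the-envelope estimate below shows how quickly the aggregate bandwidth outgrows the roughly 8 Tb/s I/O of a single switch ASIC. The per-blade figures are the ones quoted above; the fan-in of ToRs per aggregation switch is an assumed example value.

    # Illustrative bandwidth estimate for the hierarchy described above.
    cores_per_blade = 4 * 4 * 4                 # 4 cards x 4 quad-core processors
    processing_ghz = cores_per_blade * 2.5      # 160 "GHz" of aggregate compute
    blade_io_gbps = 100                         # balanced-system I/O demand per blade

    blades_per_tor = 40
    tor_tbps = blades_per_tor * blade_io_gbps / 1000        # 4 Tb/s per ToR switch

    tors_per_aggregation = 10                   # assumed fan-in, for illustration only
    aggregation_tbps = tors_per_aggregation * tor_tbps      # 40 Tb/s to be switched

    asic_io_limit_tbps = 8                      # approximate single-ASIC I/O limit [46]

    print(f"processing power per blade: {processing_ghz:.0f} GHz")
    print(f"ToR aggregate bandwidth   : {tor_tbps:.0f} Tb/s")
    print(f"aggregation switch load   : {aggregation_tbps:.0f} Tb/s "
          f"(vs ~{asic_io_limit_tbps} Tb/s per ASIC)")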

1.2 Optical data center network

With the prevalence of high-capacity optical interconnects, optically switched data center networks have been proposed as a solution to overcome the potential scaling issues of the electronic switch and the traditional tree-like topology [49, 50]. The switching network handles the data traffic in the optical domain, thus avoiding the power-consuming O/E and E/O conversions at the transceivers. It also eliminates the dedicated interfaces for modulation-dependent processing, achieving better efficiency and less complexity. Moreover, benefiting from optical transparency, the switching operation (including the power consumption) is independent of the bit-rate of the information. Scalability to higher bandwidth and the employment of WDM technology can be seamlessly supported, enabling superior power-per-unit-bandwidth performance.

Optical switching technologies

Various optical switching techniques have been investigated for data center applications.

Optical circuit switching (OCS)
In an OCS-based solution, the connectivity path between the source and the destination is established using a two-way reservation, which in general is a time-consuming procedure [51]. One or more wavelengths can be allocated to a connection, so the bandwidth granularity is at the wavelength level. However, due to the slow configuration (on the order of milliseconds), OCS has poor bandwidth utilization and is unable to support applications with fast-changing traffic demands. OCS behaves more efficiently when supporting long-lived, smooth data flows, for which quality of service (QoS) guarantees are ensured. The switching fabrics for OCS are mostly based on 3D micro-electro-mechanical systems (MEMS) [52] and piezoelectric beam steering [53].

Optical burst switching (OBS)
Bursts of traffic are transmitted through the optical network by setting up a connection and reserving end-to-end resources in the OBS system [54]. Prior to data transmission, a burst control header (BCH) is created and sent towards the destination. The BCH is processed electronically at the OBS routers; it informs each node of the arrival of the data burst and drives the allocation of an optical connection path. OBS enables sub-wavelength granularity by reserving the bandwidth only for the duration of the actual data transfer. Although it is more efficient than OCS in terms of bandwidth utilization, the long reservation time still cannot satisfy the burstiness and latency requirements of data center networks.

Optical packet switching (OPS)
OPS technology makes it possible to achieve sub-wavelength bandwidth granularity and low end-to-end latency by exploiting statistical multiplexing of bursty flows. In OPS networks, one or more electrical data packets with similar attributes are aggregated into an optical packet and attached with an optical label indicating the destination. The optical packet switch processes the label and forwards the optical packet to the right output port. Advance reservation of the connection is not needed and the bandwidth can be utilized in the most flexible way. These features make OPS a suitable candidate for data center applications, which require transmission of small data sets in an on-demand manner [55, 56]. Practical realization of OPS relies heavily on the implementation of the optical labeling technique and contention resolution. The switching fabrics are normally built based on arrayed waveguide grating routers (AWGRs) [57] or broadcast-and-select (B&S) switching [58]. It is worth noting that the finest granularity also allows optical packet switches to be employed in OCS and OBS applications.

Optical data center network: state-of-the-art

Several solutions using a large number of small commodity electronic switches have been proposed, such as DCell [59], BCube [60], VL2 [15], CamCube [61] and Jellyfish [62], trying to overcome the oversubscription-induced communication bandwidth bottleneck of traditional data center networks. However, the scaling limitations of the electronic switches in terms of bandwidth and power consumption still exist. The redundant interconnectivity also results in overhead in the forwarding and routing strategy and in wiring complexity as the scale of the network increases. Therefore, optical switching technologies providing high bandwidth and power efficiency have been adopted in recent research on optical data center networks [63, 64]. According to the switching technologies used, the optical data center networks can be classified into different categories.

Slow optical switch
This category includes the proposals using optical switches with sub-millisecond/milliseconds switching speed [65, 66]. The most commonly used optical switch in this category is the MEMS switch. In OSA [67], the authors proposed an all-optical switching architecture, where the ToR switches are connected to a central MEMS switch through optical MUX/DEMUX and switching components. Like the prior work Proteus [68], communication between any two ToR switches is established by stitching multiple optical hops. 3D MEMS switches with 320 ports are commercially available and higher radix is expected in the future [52]. A key challenge here is the slow reconfiguration time, including both the hardware switching time (milliseconds) and the control overhead (100 ms-1 s) [69]. Another option for the switching element is the wavelength selective switch (WSS). Mordia [69], WaveCube [70], RODA [71] and OPMDC [72] are all based on WSS switches utilizing either a ring topology or a multi-dimensional cube structure, improving the fault tolerance of the network. Although a relatively fast reconfiguration time (sub-microseconds) can be achieved at small port-counts, the WSS is not suitable for supporting bursty traffic flows in data center scenarios. The limited port-count also requires cascading and stacking of the WSSs, resulting in performance degradation in terms of flexibility, latency and power losses.

Fast optical switch
This category consists of the solutions based on optical switches with fast (nanoseconds) switching speed, in support of packet-level operations. There have been several studies on AWGRs along with tunable lasers (TLs) or tunable wavelength converters (TWCs), including IRIS [73], DOS [74] and the follow-up work LIONS [75]. Petabit [76] proposed a three-stage Clos network and Hi-LION [77, 78] proposed a mesh-like network using both local and global AWGRs. The interconnection scale and performance are largely dependent on the capability of the costly TLs and TWCs. The wavelength-related operation also limits the application of WDM technology to further scale the capacity, as has emerged in data center optical interconnect solutions. The other schemes mainly use semiconductor optical amplifiers (SOAs) as switching gates [79, 80]. SOAs can provide fast, nanosecond switching. Data Vortex [81] is entirely composed of 2×2 switching nodes arranged as concentric cylinders. As the number of nodes increases, the packets have to traverse several nodes before reaching the destination, causing increased and non-deterministic latency as well as deteriorated signal quality. OSMOSIS [82] utilizes WDM and B&S stages to enable high-capacity and low-latency forwarding of synchronously arriving fixed-length optical packets. The scalability of OSMOSIS is mainly limited by the hardware complexity. In [83], a modular switching architecture with SOA-based B&S has been proposed and investigated for the data center network. Distributed control and an advanced labeling technique allow for nanoseconds reconfiguration time regardless of the port-count. Despite the gain offered by the SOA, practical implementation of extremely large port-counts (> 256) is limited by the splitting losses [84].
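
The splitting-loss limitation can be made concrete with a simple power-budget estimate: an ideal 1:F broadcast stage attenuates each copy by 10·log10(F) dB, which has to be recovered by the SOA gain. The figures below are illustrative only; the 25 dB gain budget is an assumed value chosen to be roughly consistent with the ~256-port limit quoted above, not a measured parameter of the switch studied in this thesis.

    import math

    def splitting_loss_db(fanout: int) -> float:
        """Ideal 1:F broadcast splitting loss in dB (excess losses not included)."""
        return 10 * math.log10(fanout)

    GAIN_BUDGET_DB = 25.0   # assumed gain available from the SOA gate(s)

    for ports in (16, 64, 256, 512, 1024):
        loss = splitting_loss_db(ports)
        verdict = "within budget" if loss <= GAIN_BUDGET_DB else "exceeds budget"
        print(f"1:{ports:<5d} split: {loss:5.1f} dB  ({verdict})")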

Hybrid electronic/optical switches
Various hybrid electronic/optical interconnect architectures have been proposed for data centers and high-performance computing systems [85]. The electronic switches typically connect all the servers in a multi-level hierarchy and provide large connectivity with short reconfiguration times to handle short-lived mouse flows, while the optical network, implemented by a single or an array of slow optical switches, provides large-capacity links for long-lived elephant flows, as well explained in Helios [86], c-Through [87] and HydRA [88]. The slow optical switches connect at a specific level of the electronic network (e.g., ToR). Due to the milliseconds reconfiguration time and the scalability issues regarding the port-count, these schemes are ideal for scheduled applications such as data migration, storage backup, and virtual machine (VM) migration [89], where the point-to-point and high-bandwidth connections last more than a couple of seconds, in order to compensate for the reconfiguration overhead.

Hybrid slow/fast optical switches
In the hybrid electronic/optical schemes, the packet-switched network still follows the traditional design rules and thus encounters the scaling issues of the electronic switches. To fully exploit optical transparency and enable high-capacity and power-efficient service delivery, all-optical architectures based on slow and fast optical switches have been investigated [63]. The essence is to combine the advantages of both optical switching technologies to fulfil the requirements of emerging applications running in data centers in terms of ultra-high bandwidth and low network latency. The work presented in [90] and the Torus topology [91] utilize fast optical switching fabrics, i.e., SOA and electro-absorption modulator (EAM) gates, to support both packet and circuit switching operations. Archon [92] employs beam-steering large-port-count slow fiber switches as the central cluster switches to serve the long-lived data flows and a PLZT-based fast switch as a plug-in module to provide low-latency time division multiplexing (TDM) functions. HOS [93] and HOSA [94] propose to use fast switches and slow MEMS switches in parallel. A centralized control plane is responsible for routing, scheduling and even switch configuration, which results in increased complexity when scaling the network and thus compromises the flexibility offered by the fast optical switches.

Technical challenges

As indicated by the referenced works, introducing optical switching technology has the potential of increasing the capacity and improving the power efficiency of data center networks [95], mainly due to the eminently advantageous property of optical transparency. Commercial MEMS-based slow switches are also seeing penetration into data centers, providing reconfigurable high-bandwidth and data-rate agnostic channels [96, 97]. However, the limited port-count and the tens-of-milliseconds reconfiguration time have strictly confined their applications to well-scheduled and long-lived tasks [98, 99]. Considering the bursty traffic features as well as the high fan-in/out hotspot patterns in data center networks, slow optical switches, providing static-like and pairwise interconnections, would only be beneficial as supplementary switching elements. Fast optical switches, allowing for on-demand resource utilization and highly flexible connectivity while overcoming the bandwidth scaling issues of their electronic counterparts, are becoming the appealing switching scheme for data center networks. Despite the promise held by fast optical switching technology, the implementation of fast optical switches supporting packet-based operation faces several challenges.

Fast control
To fully benefit from the flexibility enabled by statistical multiplexing, in addition to a fast optical switch featuring nanoseconds hardware switching speed, a fast control mechanism is another key factor to guarantee the fast forwarding of the data packets. Typically an optical label or header is associated with the data packets, carrying the required destination information to be processed by the switch controller. Due to the limited processing capabilities realizable in the optical domain, all-optical label processing has only been achieved with small dimensions, complex structures, and power inefficiency [100]. Therefore, the label processing for fast optical switches is mostly realized in an electro-optical manner, with the label information optically encoded employing TDM, WDM, subcarrier multiplexing (SCM), optical code division multiplexing (OCDM) or orthogonal labeling [101]. Regarding data center network applications, the implementation of the labeling technique should follow the increase of the network scale and optical switch port-count and, more importantly, occupy as few resources as possible.

Lack of optical memory
Electronic switches employ random access memories (RAM) to buffer the data packets while the routing decisions are made and contention resolution is performed. As no effective RAM exists in the optical domain, contention resolution is one of the most critical functionalities that need to be addressed for fast optical switches [102]. Several approaches have been proposed to overcome this issue, based either on optical fiber delay lines (FDLs) [103] or on deflection routing [81]. However, these techniques significantly increase the system complexity in terms of routing control and demanding synchronization. Moreover, the power and quality of the signal are affected, which results in limited and fixed buffering time. A more promising solution is to push the buffer to the edge nodes in the electronic domain [104]. To minimize the introduced latency, the optical switch should stay relatively close to the edge nodes and fast decision-making is required. Intra-DC networks are suitable for this scheme, with interconnects ranging from a few to hundreds of meters.

SDN compatibility
Besides providing high-performance switching capabilities, optical data center networks are embracing automation for the provisioning of flexible and highly reliable connectivity, overcoming the limitations of the currently deployed static and semi-automated management and control frameworks. To this aim, software-defined networking (SDN) is imperative to enable such dynamic environments with flexible network infrastructures to optimize performance and resource utilization [105]. The SDN-enabled data center network employs an open, standard, vendor-independent and technology-agnostic southbound interface to configure and monitor the underlying switching network [106, 107]. Therefore, to facilitate the realization and deep integration of the SDN control framework, the fast optical switches should be compatible with the open interface exhibited by the SDN control plane. Proper extensions of the open protocols (e.g., OpenFlow [108]) have to be designed and implemented, taking into account the specificity of the used technology. Moreover, efforts need to be made to bridge the standard interface and the proprietary ones exposed by the specific optical switches.

Burst-mode receiver
The traditional electronic switches deployed in data centers are based on Ethernet technology. Synchronization exists between any connected ports, with idle periods filled with pulse transitions, allowing continuous high-quality clock recovery [109]. An optical switching system deals with a different situation, where links are not always kept alive and the source and destination nodes cannot be well synchronized. Especially in a fast packet-switched network, the clock and phase of the data signal may vary packet by packet. Thus, for practical implementation, packet-based networks need burst-mode receivers (BM-RXs) to synchronize the phase and clock of the received data packets. Moreover, although the DCN is a closed environment with more controlled optical power variation, the receiver should still be capable of handling packets with different lengths and different optical power levels. Those functions contribute an overhead to the overall BM-RX operational time, which should be minimized to achieve higher throughput and lower latency. This is especially important in an intra-DC scenario where many applications produce short traffic flows.

Scalability
Depending on the design and technology employed in fast optical switches, signal impairment and distortion are inevitably observed due to effects such as noise and optical nonlinearities. Consequently, fast optical switches are realized with limited port-count. Scaling the network interconnectivity while maintaining the performance requires the switches to have a port-count as large as possible and to be intelligently connected to avoid the hierarchical structure. The flat topology also brings the benefits of simplified control and buffering, which may otherwise be problematic for fast optical switches [102]. On the other hand, optical transparency and WDM technology would benefit the DCN in the context of scaling up the bandwidth density [17, 50]. Further improvements could be made by means of photonic integration, which greatly reduces the optical switch footprint and power consumption [110]. The potential for photonic integration therefore becomes a desirable feature for fast optical switches, with all the components co-fabricated on a single device.

1.3 Scope of this thesis

Groundwork: LIGHTNESS project

The Low Latency and High Throughput Dynamic Network Infrastructures for High Performance Datacentre Interconnects (LIGHTNESS) project [111], under the European Union's Seventh Framework Programme for Research (EU FP7), started in November 2012 and targeted the design, implementation and experimental demonstration of a high-performance flattened optical data center network to effectively cope with the increasing popularity of cloud computing and big-data applications. The LIGHTNESS DCN comprises innovative optical switching technologies and leverages an SDN-based control plane providing highly flexible and programmable connectivity services. The research work reported in this thesis was mainly carried out within the LIGHTNESS framework, specifically to investigate innovative photonic switching technologies providing high-capacity, low-latency, scalable and programmable optical packet switching capabilities for DCNs. It also includes the implementation of such a fast optical switch node as well as the interfaces with the remaining network components and the SDN-based control plane. After the project was successfully concluded by the end of 2015, more research work was done towards further improvement of the intra-/inter-DC network performance by exploiting the developed fast optical switching system, as reported in this thesis.

Novel contributions to the field

The main contributions of this thesis include:

- A low-latency and efficient optical flow-control mechanism is investigated for fast buffer-less optical switches. Spectrum-efficient bi-directional transmission of label and flow-control signals is realized. Triggered by the flow-control signal, the blocked data packets are retransmitted, improving the packet-loss and latency performance.

- An SDN control interface is designed and implemented for the fast optical switch, facilitating the provisioning and management of virtual networks (VNs). The developed agent bridges the SDN control plane (through an extended OpenFlow protocol) and the field-programmable gate array (FPGA) switch controller, enabling the communication between the two entities.

- A 4×4 fast optical switch for the DCN is prototyped with discrete components, based on which the system performance is fully assessed. Dynamic switching operation with flow control and a burst-mode receiver has been experimentally validated. Flexible and programmable management is also achieved through the SDN-based controller.

- Using the fast optical switches, a novel DCN architecture is proposed and investigated. Two parallel layers of optical switches are employed and organized in a flat topology, handling the intra-cluster and inter-cluster traffic, respectively. Performance in terms of packet loss, latency, buffer dimensioning and scalability is analyzed through simulation and experiment.

- An optical label switched add-drop node is proposed for data center interconnect (DCI) metro networks. Statistical multiplexing of data packets enables efficient and flexible bandwidth utilization. The in-band optical label is utilized for fast switching control and fast power equalization of high-capacity traffic.

- Towards integration, a photonic integrated fast optical switch with the studied architecture is employed for performance assessment. Experimental results show error-free dynamic switching operation with nanoseconds latency, indicating the potential of further improving the bandwidth density and power efficiency of optical DCNs.

Outline of the thesis

The thesis is structured as follows. In Chapter 2, the architecture of the fast optical switch under investigation is presented. The choice of each sub-system is motivated by explaining its functionalities. Chapter 3 introduces the proposed optical flow control mechanism to mitigate the lack of optical buffering. The performance of the flow-controlled link in dynamic switching operation is reported. The effects caused by the bi-directional transmission of label and flow control on the data payload and the acknowledgement signal are also discussed. In Chapter 4, the developed SDN-enabled interface for the fast optical switch is explained in detail. In cooperation with the SDN controller, the creation and optimization of the virtual networks become directly programmable. Network reconfiguration, QoS guarantees through priority assignment and load balancing operation are validated with selected cases. Chapter 5 first describes the set-up of the 4×4 fast optical switching system, followed by the performance investigation including the communication with the SDN control plane and the burst-mode receiver. Then, the prototyped node with board-level integration is introduced. Based on this, the LIGHTNESS final demonstration including all the data plane and control plane components is presented. In Chapter 6, the novel DCN architecture OPSquare exploiting the fast and flow-controlled optical switches is proposed. The suitability for DC applications is examined in both simulations and experiments. Scalability to larger scale and higher capacity by employing WDM and higher-order modulation formats is also investigated. In Chapter 7, a study on deploying the fast optical switch in the DCI metro network is given. The flexible and efficient packet-based operation as well as the fast per-channel power equalization enabled by the optical label switched node is explained. An experimental demonstration with the prototyped node in a metro ring network is performed and scalability to a larger number of nodes is analyzed. Chapter 8 presents the experimental assessment of the photonic integrated 4×4 fast WDM optical switch. Promising results with error-free switching operation have been achieved for high-capacity WDM data packets, which indicates the potential of the studied fast optical switch for further improving the bandwidth and power-efficiency of optical DCNs. The thesis concludes with Chapter 9, where the main results are summarized and an outlook on future work is discussed.

Chapter 2

SOA-based fast optical switch with modular architecture

In this Chapter, the architecture of the employed fast optical switch is introduced. As described in Chapter 1, among the schemes used for designing fast optical switches, the semiconductor optical amplifier (SOA) based solution has the advantages of nanoseconds switching speed, wide operational bandwidth, and loss compensation. Additionally, it holds the promise of photonic integration, allowing for realizing the whole switch fabric in a single chip. Therefore, in the LIGHTNESS project, an SOA-based modular switch architecture has been proposed to perform the fast switching of short-lived data packets. In the following, the modular architecture and the switching operation are described in Section 2.1. The utilized labeling technique is detailed in Section 2.2, followed by the preliminary implementation of the label generator and label processor. As the main sub-systems, the FPGA-based switch controller and the SOA-based 1×F switching module are introduced in Section 2.3 and Section 2.4, respectively.

2.1 Switching operation

Modular architecture

The fast optical switch under investigation has a modular structure, and the schematic of an F×F switch node is shown in Figure 2.1. The fast optical switch node interconnects F edge nodes, and the traffic from each node is handled by one of the F independent modules. The data packets generated at the edge nodes are carried by M WDM channels, with wavelengths centered at λ_1, λ_2, ..., λ_M, respectively. At the optical switch node, the packets are first processed by a label extractor that separates the optical label from the optical payload (effective data). The label information is then detected and processed by the switch controller. The optical payload remains in the optical domain and is fed into the 1×F optical switch. According to the destination information indicated by the

label and the stored look-up table, the switch controller controls the 1×F optical switch to forward the packets to the right output port. Arrayed waveguide gratings (AWGs) are used at the switch output to group the traffic destined to the same edge node.

Figure 2.1: Schematic of the F×F fast optical switch.

It should be noted that during the switching operation, the modules operate in an independent and parallel way with respect to each other, which introduces important features such as distributed control of the optical switch, leading to the following advantages. Firstly, the overall performance of the optical switch can be evaluated by testing a single module. Secondly, the parallel and independent operation of the modules makes the control complexity and the switching time (latency) of the entire switch independent of the port-count and equal to the switching time of a single module. Furthermore, scaling the port-count leads to only a linear increase in components and energy consumption, by employing copies of the identical modules.

Synchronous operation

A fast optical switching network can be operated either in synchronous or asynchronous mode [58]. In a synchronous network, time is slotted, and the switch fabric is reconfigured at the beginning of each slot. All the packets have the same size, and the slot duration also includes an appropriate guard

band. Due to variable link propagation delays, efforts need to be made to align the data packets arriving from different edge nodes. In an asynchronous system, packets are of variable size, switching may take place at any point in time, and there is no need to align arriving packets at the switch input. Although asynchronous operation could provide more flexibility without requiring segmentation and alignment, a synchronous slotted system has been chosen as the operation mode for the fast optical switch in this work, mainly due to its easier management, higher bandwidth utilization and much lower contention probability. Especially for the case where packets are electronically buffered at the edge node, a low contention rate is of paramount importance to avoid time-consuming retransmissions. For the modular switch architecture described in the previous section, alignment is only needed for the packets arriving at the same module, which are essentially from the same edge node.

Contention resolution

The packets carried on different wavelengths destined to the same edge node are statistically multiplexed by using the output AWG. The AWG is used here to achieve low-loss coupling (rather than WDM multiplexing), therefore contention can still happen if packets from two or more wavelength channels are present in the same slot. In case of contention, a priority scheme is established to resolve the contention: the switch controller allows only the data packet with the highest priority to pass through the 1×F optical switch. The other packets will be blocked and dropped. Since no optical buffer is used, to avoid packet loss, a copy of the transmitted packet is stored electronically at the edge node. A novel optical flow control mechanism is proposed and implemented in this work. While the switch controller sets the 1×F optical switch to forward the data packet after processing the labels, a flow control signal is generated for each packet, indicating the successful forwarding or the blocking of the data packet. Such a flow control signal is sent back to inform the edge node about the contention resolution and, according to the received ACK/NACK, the edge node will release or retransmit the corresponding packet. The latency of this flow control signal greatly affects the system latency and determines the size of the costly electronic buffer. The detailed implementation of the flow control mechanism (ACK inserter in Figure 2.1) and its performance assessment will be presented in Chapter 3. The priority scheme can be pre-defined (e.g., Round Robin) or dynamically controlled from the control plane, as will be explained in Chapter 4.
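To make the contention-resolution logic concrete, the following minimal Python sketch models one time slot in which packets from several wavelength channels contend for output ports: only the highest-priority packet per output is forwarded, and ACK/NACK flow-control decisions are returned for every packet. The data structures and function names are illustrative assumptions, not the FPGA implementation used in this work.

    from dataclasses import dataclass

    @dataclass
    class Packet:
        channel: int      # input wavelength channel index
        dest: int         # destination output port decoded from the label
        priority: int     # lower value = higher priority (e.g., set by the control plane)

    def resolve_slot(packets):
        """Resolve one synchronous slot: forward at most one packet per output port.

        Returns a dict {channel: 'ACK' or 'NACK'} emulating the per-packet
        flow-control signal generated by the switch controller."""
        winners = {}
        for p in packets:
            best = winners.get(p.dest)
            if best is None or p.priority < best.priority:
                winners[p.dest] = p
        return {p.channel: ('ACK' if winners[p.dest] is p else 'NACK') for p in packets}

    # Example: two channels contend for output 2; the higher-priority one wins.
    slot = [Packet(channel=0, dest=2, priority=0), Packet(channel=1, dest=2, priority=1)]
    print(resolve_slot(slot))   # {0: 'ACK', 1: 'NACK'}

A dynamic priority scheme (as configured from the control plane in Chapter 4) simply corresponds to changing the priority values assigned to the flows.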

2.2 Optical in-band RF tone labeling

Labeling technique

The modular structure of the optical switch allows for the independent and parallel control of all the 1×F optical switch modules. As a consequence, the switch reconfiguration time (switching speed) is mainly determined by the label processing time. The most widely employed optical labeling techniques to date are time-serial labeling and wavelength labeling [101]. In time-serial labeling, the label is usually placed before the payload on the same wavelength in the time domain. With wavelength labeling, the label is carried on different wavelength channel(s) running in parallel to the data channel. The optical labels can be either in-band or out-of-band, and can be extracted by optical filtering. As time-serial and wavelength labeling occupy extra bandwidth resources, in-band optical labeling provides a more efficient solution, especially for high bit-rate payloads which have a wide spectrum. Targeting a large-scale and high-performance data center network, it is of key importance that the labeling technique allows for extremely low latency (in nanoseconds) while addressing a large number of destination ports.

Figure 2.2: The RF tone in-band labeling technique: (a) in the spectral domain and (b) in the time domain.

In [112] an optical in-band RF tone labeling technique that allows for nanoseconds parallel label processing and scaling to over thousands of ports has been presented. This is the labeling technique employed in this work. As shown in Figure 2.2 (a), in this technique, N label wavelengths, indicated by λ_L1 to λ_LN, are inserted within the payload spectral bandwidth, and each label wavelength carries M RF tones, indicated by f_i^1 to f_i^M. Each of the RF tones is binary coded and represents one bit of the label. N in-band wavelengths, each with M RF tones, are able to provide N × M label bits for each packet. Therefore, addressing a larger number of ports can be achieved by adding label wavelengths and

RF tones. All the label bits have the same duration as the payload, as illustrated in Figure 2.2 (b). The edge node should manage the optical label so that it is carefully aligned with the respective payload. Receiving all the parallel label bits at the same time, the switch controller can make instant switching and contention-resolution decisions based on simple logic. Another advantage of this labeling technique is that the starting and ending times of the payload are indicated by the label, allowing for handling packets with variable lengths and avoiding strict bit-level synchronization. Low-speed optical-to-electrical (O/E) conversion is sufficient for detecting the optical label. The scalability and the adaptability to different modulation formats have been reported in the referenced work, and the rest of this section will focus on the hardware implementation of the label generator and label processor.

Label processor

When the data packet arrives at the optical switch, the label extractor first extracts the in-band label wavelength, while the optical payload of the packet is transparently forwarded to the 1×F optical switch. The label extractor mainly consists of a series of passive narrow band-pass optical filters, such as cascaded fiber Bragg gratings (FBGs) or integrated comb filters. After the O/E conversion, the optical label carrying multiple RF tones is sent to the label processor (LP), which recovers the baseband label bits and sends them to the switch controller. Figure 2.3 presents the detailed implementation of the RF tone LP for one of the label wavelengths, λ_Lj, which is filtered out by the FBG.

Figure 2.3: Implementation scheme of the label processor.

The RF tones are first divided into parallel paths by using a power splitter (PS) and band-pass filters (BPFs) with a central frequency of f_i^j (i ∈ [1, M], j ∈ [1, N]) to select the corresponding RF tone. After the amplifier (Amp), the

envelope detector recovers the envelope, and the baseband label bits are then shaped by the comparator after the low-pass filter (LPF). Finally, the baseband label bits are sent to the switch controller. It should be noted that the RF tone processor blocks process all the RF tone labels in parallel. The processing time is kept constant regardless of the number of label wavelengths and RF tones. With a limited increase in latency and complexity, an exponentially larger number of ports can be addressed by this labeling technique.

The design and realization of the LP should follow certain specifications to guarantee the quality of the recovered label bits. Most importantly, to ensure fast switching operation, the delay should be consistent for all the tones and on the order of a few tens of nanoseconds. An 8-bit LP has been designed and implemented on a compact printed circuit board (PCB), as shown in Figure 2.4.

Figure 2.4: PCB layout and the prototype of the label processor.

The envelope detector utilizes a Schottky-diode-based scheme together with a proper impedance matching network. The RF tones are located at 130, 253, 410, 615, 820, 940, 1189 and 1400 MHz (Channel 1-8), respectively. A preliminary test has been performed and the results are shown in Figure 2.5. For the recovered label bits, a processing delay of < 20 ns has been observed, which is in line with the requirement. The rising edge lasts 10 ns to 15 ns and the falling time ranges from 12 ns to 16 ns. Except for Channel 6 and Channel 8, a detectable range of more than 50 mV has been achieved for input powers higher than −15 dBm. The worse performance of these two channels is mainly due to a defective impedance matching network resulting in additional losses. The total power consumption of the board is around 2 W. Appendix A includes the components used in the design, and more detailed information on the assessment of the board can be found in [113]. The label processor described in this section has been utilized for the research work reported in Chapter 3 and Chapter 4.
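To make the addressing scheme concrete, the following Python sketch encodes a destination port index into the binary on/off states of the N × M RF tone bits (and decodes it back), using the tone plan of the 8-bit prototype as an example. The mapping of port numbers to bit patterns is an illustrative assumption rather than the look-up table actually used in this work.

    TONES_MHZ = [130, 253, 410, 615, 820, 940, 1189, 1400]  # Channel 1-8 of the prototype

    def encode_label(port, n_wavelengths=1, m_tones=8):
        """Encode a destination port as N*M binary label bits (one bit per RF tone)."""
        n_bits = n_wavelengths * m_tones
        assert 0 <= port < 2 ** n_bits, "port index exceeds the N*M-bit label space"
        bits = [(port >> i) & 1 for i in range(n_bits)]
        # group the bits per label wavelength
        return [bits[w * m_tones:(w + 1) * m_tones] for w in range(n_wavelengths)]

    def decode_label(label_bits):
        """Inverse operation performed by the label processor and switch controller."""
        flat = [b for wavelength in label_bits for b in wavelength]
        return sum(b << i for i, b in enumerate(flat))

    label = encode_label(13)                       # 1 label wavelength x 8 tones -> 256 ports
    on_tones = [TONES_MHZ[i] for i, b in enumerate(label[0]) if b]
    print(label, decode_label(label), on_tones)    # [[1, 0, 1, 1, 0, 0, 0, 0]] 13 [130, 410, 615]

Adding label wavelengths or tones per wavelength enlarges the addressable port count exponentially, while the per-bit processing remains parallel.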

Figure 2.5: (a) Delay and rising/falling time of the 8 channels; (b) detectable range.

Label generator

Paired with the LP in the fast optical switch, the edge node hosts a label generator (LG) module that up-converts the digital label bits into the desired RF tones that can be carried by a single optical wavelength. The schematic of the employed LG, based on mixers and oscillators, is shown in Figure 2.6. The input signal consists of the parallel baseband label bits which represent the destination of the data packet. The input power is controlled by the attenuator to equalize the generated RF tones and, more importantly, to avoid the distortion caused by saturation. The tones are combined by a coupler, and the power loss is compensated by an RF amplifier. As tone purity is one of the primary goals here, the higher-order components generated during the mixing would act as noise for the other tone channels. Besides the power control, a BPF or LPF placed after the mixer may help improve the signal-to-noise ratio (SNR). The RF amplifier should also be carefully selected to avoid intermodulation between the multiple RF tones. The RF tones coming out of the LG then drive a directly modulated laser (DML) to generate the optical label, which is inserted into the optical payload and transmitted to the optical switch node.

Figure 2.6: Schematic of the label generator.
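The combined label waveform produced by the LG can be emulated numerically as a sum of on/off-keyed sinusoids gated over the packet duration, which is convenient for checking tone spacing and peak power before building hardware. The sketch below is a simplified baseband model (ideal mixing, no amplifier nonlinearity); the sampling parameters are arbitrary.

    import numpy as np

    TONES_MHZ = [130, 253, 410, 615, 820, 940, 1189, 1400]   # prototype tone plan

    def label_waveform(bits, duration_ns=600.0, fs_ghz=10.0):
        """Sum of binary-keyed RF tones, aligned with the packet duration."""
        t = np.arange(0.0, duration_ns * 1e-9, 1.0 / (fs_ghz * 1e9))
        s = sum(b * np.cos(2 * np.pi * f * 1e6 * t) for b, f in zip(bits, TONES_MHZ))
        return t, s

    t, s = label_waveform([1, 0, 1, 1, 0, 0, 0, 1])     # 4 tones switched on
    print(f"{len(t)} samples, peak amplitude {np.abs(s).max():.2f}")

The peak amplitude grows with the number of tones switched on, which relates to the peak-power and clipping considerations discussed below.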

Corresponding to the 8-bit LP, an 8-tone LG has been designed and implemented, as shown in Figure 2.7. The frequencies of the 8 tones match the working frequencies of the LP. Variable attenuators have been used to adjust the power of the baseband signal to guarantee low distortion. As no filters have been deployed in the design, the delay is within 5 ns, with a rise/fall time increase of 1 ns. Tone purity is ensured by power control and sufficient tone spacing. The output RF tones can then directly drive the low-speed DML. For the optical in-band labeling technique, the wavelength of the DML should be stabilized to guarantee proper filtering at the label extractor. This typically requires precise temperature control of the laser. Grouping more tones would cause a higher peak power, in which case limited clipping would be acceptable to improve the performance. In Appendix A the components used for the implementation are summarized, and more detailed analytical investigations are reported in [114].

Figure 2.7: PCB layout and the prototype of the label generator.

2.3 FPGA-based switch controller

The recovered label bits from the label processor are sent to the switch controller for destination recognition and contention resolution. To maximize the benefits of fast processing brought by parallelism, an FPGA has been selected as the switch controller for the fast optical switch node. It stores the look-up table, the entries of which are matched with the incoming label bits to determine the forwarding output ports. In case of contention, where multiple data packets are destined to the same output, the FPGA-based switch controller forwards the data packet with the highest priority and blocks the rest. Corresponding digital control signals are generated for the driving circuits of

the 1×F optical switching modules. Meanwhile, the switch controller generates the flow control signals for the received data packets, indicating the successful forwarding or the blocking due to contention. The flow control signals are sent back to the edge node, according to which the data packet stored in the buffer will be released or retransmitted.

The FPGA-based switch controller also implements a communication interface to the control plane. Through this interface, the look-up table can be dynamically updated to change the network interconnectivity (generate or reconfigure virtual DC networks) and the priority assignment (Class of Service), as will be discussed in Chapter 4. Moreover, statistics regarding the received packets and contentions can be collected for monitoring purposes. Further actions can be performed to facilitate long-term network management and optimization based on these statistics. It is worth noting that the operation through the communication interface does not affect the fast nanoseconds forwarding of the data packets.

2.4 1×F optical switch

Broadcast-and-select structure

After the label extractor, the optical payload enters the 1×F optical switch to be forwarded to one of the F possible output ports. We employ a broadcast-and-select structure for the 1×F switching module, as illustrated in Figure 2.8. A coupler (splitter) first broadcasts the input signal to F paths, and the gate on each path selects the data packet appropriately by letting it through or blocking it. Multicast to several outputs is supported by this structure. The well-known disadvantage is the power loss caused by the broadcasting: each broadcasting path experiences at least 3·log2(F) dB of loss. Addressing a large number of ports would therefore require an amplifier to boost the power level, introducing extra noise. The gates should also provide as low a loss as possible and feature fast nanosecond switching operation.

Figure 2.8: 1×F optical switch based on the broadcast-and-select structure.
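To put the 3·log2(F) dB figure in perspective, the short sketch below tabulates the ideal splitting loss for several port counts and compares it with an assumed SOA gate gain of 10 dB (an illustrative number, not a measured parameter of the devices used in this work).

    import math

    ASSUMED_SOA_GAIN_DB = 10.0   # illustrative gate gain, not a measured device value

    def splitting_loss_db(ports):
        """Ideal 1-to-F broadcast loss: 3 dB per splitting stage, i.e. 3*log2(F) dB."""
        return 3.0 * math.log2(ports)

    for f in (4, 8, 16, 32, 64):
        loss = splitting_loss_db(f)
        margin = ASSUMED_SOA_GAIN_DB - loss
        print(f"1x{f:<3d} broadcast loss {loss:5.1f} dB, "
              f"margin with a {ASSUMED_SOA_GAIN_DB:.0f} dB gate: {margin:+5.1f} dB")

The rapidly shrinking margin is precisely why the gate element must combine low loss, gain and fast switching, which motivates the SOA gate discussed next.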

SOA gate

Although the SOA shows less ideal transmission properties than fiber amplifiers in terms of noise figure, lower saturation output power, and pattern dependence due to its short time constant [115], it has the advantages of low cost, small size, and high potential for photonic integration. Moreover, the coverage of the low-loss transmission bands allows for operation at 1300 nm, which is dispersion-free in standard single mode fiber. Therefore, in our work, the SOA has been used as the selecting gate element in the 1×F switching module, to provide the fast (< 1 ns) on/off switching operation and to compensate for the broadcasting loss. In the following, useful information for the understanding of the SOA-related results presented in this thesis is given. Further information on SOA operating principles, different technological aspects, and detailed explanations of the underlying semiconductor physics can be found in [116].

Operation and gain characteristics

An SOA is a semiconductor waveguide with a gain medium. It can be regarded as a semiconductor laser diode without reflective mirrors at its facets. A bulk active layer, multiple quantum wells, or quantum dot active layers can be used in the SOA structure. The gain is obtained by injecting carriers from a current source into the active region. These injected carriers occupy energy states in the conduction band of the active material, leaving holes in the valence band. Electrons and holes recombine either non-radiatively or radiatively, in the latter case releasing the recombination energy in the form of a photon. Three radiative processes are important in such structures, namely stimulated absorption, spontaneous emission and stimulated emission of photons.

The most important parameters characterizing optical amplifiers are: small-signal gain, saturation output power, amplification bandwidth, polarization dependent loss (PDL), and noise figure. At a high output power, the SOA gain is saturated and compressed. A common parameter for quantifying gain saturation is the 3 dB saturation output power P_sat, defined as the amplifier output power at which the amplifier small-signal gain is reduced by a factor of two (has dropped by 3 dB). The amplifier gain can be written implicitly as a function of the ratio of the output power P_out to P_sat:

G = G_0 \exp\left( -\frac{G-1}{G} \frac{P_{out}}{P_{sat}} \right)     (2.1)

where G_0 is the small-signal gain. An example showing the typical gain versus output power characteristics of an SOA is represented in Figure 2.9 (a). The typical optical amplification bandwidth of a 1550 nm SOA is presented in Figure 2.9 (b). The estimated 3 dB optical amplification bandwidth is about 50 nm. The short-wavelength tail of the gain profile saturates faster and therefore provides less gain, which is caused by the lower density of carriers in the conduction band at the corresponding higher energy levels. Hence, the long-wavelength tail of the SOA gain profile should be used as the operating regime in amplification applications. However, in applications like wavelength conversion based on cross-gain modulation (XGM), operation at shorter wavelengths is preferred due to the pronounced saturation [117].

Figure 2.9: SOA gain as a function of (a) output signal power and (b) wavelength.

The polarization state of the input signal can vary with time and wavelength. Due to the waveguide structure and the gain material, the SOA is polarization dependent. Cascading SOAs makes the polarization dependence more pronounced. The PDL is defined as the difference between the gains provided for the orthogonal polarization modes. Several techniques for realizing SOAs with low PDL (< 1 dB) have been developed, for instance, square cross-section waveguides, ridge waveguides, and strained-layer super-lattice structures [115].
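Because Equation (2.1) defines the gain only implicitly, a numerical solution is convenient when estimating how much gain compression a given output power causes. The sketch below solves the reconstructed form of Eq. (2.1) by simple fixed-point iteration; the 20 dB small-signal gain and 10 dBm saturation power are example numbers, not parameters of the SOAs used in this work.

    import math

    def saturated_gain(g0_db, p_out_dbm, p_sat_dbm, iterations=50):
        """Solve G = G0 * exp(-((G-1)/G) * P_out/P_sat) by fixed-point iteration."""
        g0 = 10 ** (g0_db / 10)
        p_out = 10 ** (p_out_dbm / 10)      # mW
        p_sat = 10 ** (p_sat_dbm / 10)      # mW
        g = g0
        for _ in range(iterations):
            g = g0 * math.exp(-((g - 1) / g) * (p_out / p_sat))
        return 10 * math.log10(g)

    # Example: a 20 dB small-signal gain SOA driven to 10 dBm output, with P_sat = 10 dBm
    print(f"compressed gain: {saturated_gain(20, 10, 10):.1f} dB")   # roughly 15.8 dB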

Noise figure and OSNR

For signal amplification, only the photons from stimulated emission are desired. However, the spontaneous emission of photons cannot be avoided in an SOA, which gives rise to the so-called amplified spontaneous emission (ASE) noise. The amount of ASE noise added by the SOA is described by its noise factor F or its noise figure NF:

F = \frac{SNR_{in}}{SNR_{out}}     (2.2)

NF = 10 \log_{10}(F)     (2.3)

The noise factor can be expressed through the population inversion factor n_sp and the single-pass gain G:

F = 2 n_{sp} \frac{G-1}{G} + \frac{1}{G}     (2.4)

The noise factor of an amplifier is minimized by using a population inversion factor n_sp approaching 1. The minimum achievable noise factor for an optical amplifier with a large gain (G >> 1) is F = 2 (or NF = 3 dB). The SOA noise figure is usually between 7 and 10 dB.

The optical signal-to-noise ratio (OSNR) is another key parameter, indicating the ratio of signal power to noise power in an optical channel. The OSNR at the output of a chain of M identical amplifiers can be calculated as [118]

OSNR = \frac{P_{out}}{M F G h \nu B_0}     (2.5)

where P_out is the output power, F is the amplifier noise factor, G is the amplifier gain, h is Planck's constant, ν is the light frequency, and B_0 is the optical bandwidth. The OSNR in dB can be written as

OSNR_{dB} = P_{out} - L_S - NF - 10 \log_{10}(M) - 10 \log_{10}(h \nu B_0)     (2.6)

where all terms are in dB and it is assumed that the gain G of each amplifier compensates for the loss of the previous span L_S. From Equation 2.6 it can be concluded that the OSNR declines with the number of amplifiers in the transmission line.
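As a worked example of Equations (2.5)/(2.6), the sketch below computes the OSNR after a chain of identical amplifiers, assuming each amplifier exactly compensates the preceding span loss. The chosen numbers (0 dBm output power, 8 dB noise figure, 10 dB span loss, 0.1 nm reference bandwidth at 1550 nm) are illustrative, not measured values from this thesis.

    import math

    H = 6.626e-34          # Planck's constant [J s]
    C = 3.0e8              # speed of light [m/s]

    def osnr_db(p_out_dbm, span_loss_db, nf_db, n_amps, wavelength_nm=1550.0, b0_ghz=12.5):
        """OSNR of a chain of identical amplifiers, each compensating the span loss (Eq. 2.5)."""
        p_out = 1e-3 * 10 ** (p_out_dbm / 10)          # W
        f = 10 ** (nf_db / 10)                         # noise factor
        g = 10 ** (span_loss_db / 10)                  # gain = span loss (linear)
        nu = C / (wavelength_nm * 1e-9)                # optical frequency [Hz]
        osnr = p_out / (n_amps * f * g * H * nu * b0_ghz * 1e9)
        return 10 * math.log10(osnr)

    for m in (1, 2, 4, 8):
        print(f"{m} amplifier(s): OSNR = {osnr_db(0, 10, 8, m):.1f} dB")

Doubling the number of amplifiers in the chain reduces the OSNR by 3 dB, consistent with the 10·log10(M) term in Eq. (2.6).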

Dynamic optical power operation

With decreasing input optical power, the ASE power becomes relevant, leading to OSNR degradation. As indicated by Equation 2.6, the OSNR increases with the output power of the signal. Hence, a higher input power, leading to a higher output power, also brings better OSNR performance. On the other hand, with increasing input optical power the SOA gain is reduced due to gain saturation, which leads to distortion of the waveforms. Especially for modulation formats of higher order than OOK, the transitions between symbols are affected by the complex SOA response. Both the amplitude and phase fidelity of the amplification process are impaired to a different degree. Saturation may also induce inter-channel crosstalk caused by XGM and the four-wave mixing effect if several WDM channels are present simultaneously. Therefore, the dynamic range of operation is limited by the SOA noise for low input power, and by signal distortions as well as non-linear effects for high input power. A trade-off exists between the OSNR and the saturation effect, and thus the input signal power should be well controlled within a certain range.

2.5 Summary

In this Chapter, the overall architecture and the sub-systems of the employed fast optical switch have been presented. The modular structure enables highly distributed control of the traffic coming from different edge nodes, which greatly reduces the switch control time. Contention in the synchronous slotted operation is resolved by the optical flow control, informing the edge node about the successful transmission or asking for retransmission in case of blocking. The optical in-band RF tone labeling technique adopted in the system can potentially address a large number of ports while avoiding extra bandwidth and space resources. A preliminary implementation of the label processor and label generator has been carried out, which shows successful handling of an 8-bit RF tone label. The modular structure and the RF tone labeling allow for parallel processing of each input as well as of each label bit, leading to a nanoseconds, port-count-independent reconfiguration time. The SOA is selected as the gating element in the 1×F broadcast-and-select structure, to provide fast switching and to compensate for the loss of the broadcasting stage. Due to the transmission properties of the SOA, the influence of noise and saturation effects on the data signal has to be considered.

Employing the fast optical switch in data center networks would bring the benefits of high capacity and fast nanoseconds switching. For the proposed optical flow control, the effects on the physical-layer performance of the payload and on the statistics (e.g., packet loss and latency) are of significant interest for the practical implementation. Moreover, to fully build a flat network with high interconnectivity and flexibility, investigation of more aspects regarding the scalability of the switch as well as the control plane management is necessary.


Chapter 3

Low latency and efficient optical flow control

In this Chapter, the concept and implementation of an efficient optical flow control for the fast optical switching system are introduced¹. Due to the lack of a practical optical buffer, the data traffic is stored electronically at the edge node of the optical switching network. The flow control informs the edge node of the status of the transmitted traffic, which is either successful forwarding or blocking resulting from contention. In a closed environment like a data center network, a limited latency overhead caused by the retransmission is expected. By reusing the label signal, we experimentally demonstrate the operation of an efficient optical flow control without occupying extra wavelength and space resources. In Section 3.1, the proposed bi-directional transmission of the RF tone label and the optical flow control is presented. The experimental set-up used for investigating the quality of the flow control signal and the effect on the payload data is reported in Section 3.2, followed by the assessment results. Finally, the flow control for the case of variable distance between the fast optical switch and the edge node is studied and discussed in Section 3.3.

¹ Parts of this chapter are based on the results published in [125].

3.1 Optical flow control

Fast optical switching technologies have been extensively considered for flat data center networks (DCNs), providing high capacity, eliminating the O/E and E/O conversions, and overcoming the communication bottleneck and latency issues of traditional solutions. However, as stated in Chapter 1, one of the biggest challenges for fast optical switches is the lack of optical memory. Considering the practical implementation in DCNs, we propose an optical flow control mechanism to mitigate this issue. The data traffic is stored in an electronic buffer, waiting for the flow control signal coming from the optical switch node. A positive acknowledgement (ACK) stands for the successful

forwarding, and the data is released from the buffer. In response to a negative acknowledgement (NACK), which indicates dropping due to contention, the data is retransmitted and goes through the same procedure again. Such flow control is feasible in a DCN, where the links between the edge node and the optical switch range from tens of meters to hundreds of meters. Considering the requirement of low delivery latency, the contention resolution and the flow control operation should be accomplished as fast as possible. During the contention resolution and, eventually, the retransmission, the data packets are stored in the costly electronic buffers at the edge nodes and will only be released from the buffers in response to the ACKs. Minimizing the overall buffer size will result in a system not only with lower cost and power consumption, but also with lower latency. At the physical layer, the size of the electronic buffer at the edge nodes depends on the data traffic forwarding time (reconfiguration of the switch), the delay of the flow control signaling, the retransmission time, and the target traffic load. Although the highly distributed control of the fast optical switch allows the forwarding of the packets within 20 ns, the flow control latency significantly affects the buffer size.

Figure 3.1: Optical flow control technique.

Therefore, we employ a novel optical flow control technique to efficiently transmit the in-band optical label information and the flow control signals in a single wavelength channel, as schematically reported in Figure 3.1. The optical packets, consisting of a synchronized data payload and an in-band RF tone label signal, are transmitted to the optical switch node. The in-band RF tone label technique has the advantages of efficient bandwidth utilization and parallel processing of each bit, allowing for fast processing regardless of the bit count. The RF tones are placed in the high-frequency region (> 100 MHz), while the baseband is virtually empty and can be employed to transmit back the flow control signal. The in-band label can be extracted by using a narrow pass-band filter such

as a fiber Bragg grating (FBG) or an integrated micro-ring resonator (MRR) [119]. After label extraction, the power of the extracted label is split into two parts. The first part is used for label detection at the label processor (LP) and is then processed by the switch controller. The other part is sent to an SOA-based ACK re-modulator driven by the baseband flow control signal generated by the switch controller. To transmit the ACK/NACK signal, we exploit the available baseband bandwidth, avoiding potential crosstalk with the RF tones that are transmitted at frequencies above 100 MHz. This allows for simply re-using the same label wavelength, without an additional label eraser or the need for extra lasers and the corresponding wavelength registration circuitries. The flow control signal on the re-modulated label wavelength is sent back to the edge node through the same WDM channel by using a circulator. At the edge node, an ACK receiver (RX) detects the flow control signal and then instructs the buffer manager to complete the flow control. The proposed bi-directional system can effectively reduce the system complexity by transporting the flow control signal in the data plane network.

3.2 Experimental demonstration

Experiment Set-up

Based on the proposed scheme, we experimentally demonstrate and evaluate the performance of the flow control mechanism in the fast optical switching system. The set-up used here is shown in Figure 3.2. At the edge node, an FPGA acts as the buffer manager, which stores the transmitted label information and processes the ACK signals acknowledging whether the transmission was successful or not. Two channels of 40 Gb/s non-return-to-zero (NRZ) WDM packets at λ_P1 = nm and λ_P2 = nm with 600 ns duration and 40 ns guard time are generated. The wavelengths of the in-band labels are centered at λ_L1 = nm and λ_L2 = nm to match the pass-bands of the FBGs. For the proof of a 4×4 switch system, each label wavelength carries two binary-coded RF tones. The number of tones can potentially scale up to at least 30, which can represent a large number of ports with constant processing time [112]. Figure 3.3 (a) illustrates the electrical spectrum of the two RF tones (f1 = 280 MHz, f2 = 650 MHz) of the label used in this experiment.
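The behaviour of this FPGA buffer manager boils down to a small state machine: keep a copy of each transmitted packet until its ACK/NACK arrives, release it on an ACK and retransmit it on a NACK. The Python sketch below models that behaviour for the simple case where the acknowledgement returns before the next slot starts (the variable-distance case is treated in Section 3.3); the class and method names are illustrative, not the FPGA implementation.

    from collections import deque

    class BufferManager:
        """Minimal edge-node buffer manager for the optical flow control."""

        def __init__(self):
            self.pending = deque()   # packets waiting to be (re)transmitted
            self.in_flight = None    # copy of the packet sent in the current slot

        def next_transmission(self):
            """Called at the start of a slot: retransmit a blocked packet or send a new one."""
            if self.in_flight is None and self.pending:
                self.in_flight = self.pending.popleft()
            return self.in_flight

        def on_flow_control(self, ack):
            """Called when the ACK (True) / NACK (False) for the in-flight packet is detected."""
            if ack:
                self.in_flight = None        # release the stored copy
            # on NACK the copy stays in in_flight and is retransmitted next slot

    # Example: the first transmission is blocked (NACK), the retransmission succeeds (ACK).
    mgr = BufferManager()
    mgr.pending.append("packet-A")
    for ack in (False, True):
        print("sending", mgr.next_transmission())
        mgr.on_flow_control(ack)
    print("buffer empty:", mgr.in_flight is None and not mgr.pending)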

Figure 3.2: Experimental set-up of the optical flow control system.

The average optical power of the payload and the label is 2.5 dBm and 0 dBm, respectively. The two channels are combined by an arrayed waveguide grating (AWG) and sent to the fast optical switch node. There, the two channels are de-multiplexed and the labels are extracted by FBG1 and FBG2, centered at λ_L1 and λ_L2, respectively. The FBGs have a 3 dB bandwidth of 6 GHz to avoid spectral distortion of the optical payload. Figure 3.3 (b) shows the optical spectrum of the signals before and after the label extractor for Channel 1. The optical power of the extracted label is split by a 20:80 coupler. 80% of the power (−8 dBm) is detected and processed by an FPGA-based switch controller. The remaining 20% is fed into an SOA employed as an optical modulator, driven by the FPGA switch controller to generate the flow control signals. The typical driving current required by the SOA is 30 mA, which can be directly provided by the digital pins of the FPGA switch controller. The re-modulated label wavelengths at λ_L1 and λ_L2, carrying the ACK/NACK signals, are sent back to the edge node via a second pair of FBGs with characteristics similar to FBG1 and FBG2, respectively. The use of two FBGs per channel allows investigating the effect of double-pass filtering, as would occur if a photonic integrated MRR were used in the set-up. The ACK/NACK is then transmitted back over the same optical link to the edge nodes within the same WDM channel, with no extra cost in bandwidth and space. Figure 3.3 (c) shows the electrical spectrum of the re-modulated signal after detection at the edge node. The two RF tones of the label at 280 MHz and 650 MHz, as well as the re-modulated baseband ACK signal, are clearly visible. The electrical bandwidth is well exploited, which also prevents possible crosstalk caused by the re-modulation. The flow control signal can easily be retrieved by using a 50 MHz LPF, as shown in Figure 3.3 (d). According to the positive or negative value of the flow

control signal, the buffer manager releases the stored label (ACK) or retransmits the packet (NACK).

Figure 3.3: (a) Electrical spectrum of the RF tone label. (b) Optical spectrum before and after the label extractor. (c) Electrical spectrum of the detected re-modulated signal. (d) Electrical spectrum of the ACK after the LPF.

Dynamic operation

To show the dynamic operation of the flow control, periodic label patterns have been used for both channels. As illustrated in Figure 3.4, the original label bits to be transmitted are given at the top. They are sent out and stored in the buffer in case a contention happens. The contention resolution algorithm is based on a fixed priority: packets at λ_P1 have higher priority. When a contention occurs between packets from the two different channels destined to the same output, the packet at λ_P1 will always be forwarded to the destination, and a pulse signal with the same duration as the packet drives the SOA-based modulator to generate the positive ACK signal, while the packet at λ_P2 will be blocked and a NACK will be sent back requesting packet retransmission. If the buffer manager at the edge node detects a NACK, the corresponding stored label information will be retransmitted. The actual transmitted labels, including retransmissions, are shown in the middle, in which the contended packets are marked. The bottom of Figure 3.4 shows the flow control signals ACK1 and ACK2 for the two

channels detected at the edge node, which are generated by the switch controller and applied as driving signals to the SOA-based modulators. We can also see from the figure that it takes 160 ns for the labels to be transmitted and detected by the switch controller, which includes 40 ns of processing time. A further 105 ns, including 5 ns of processing time, is needed from the generation of the ACK until it is finally detected by the buffer manager. The latency, or round trip time (RTT), of the flow control operation is therefore 265 ns.

Figure 3.4: Dynamic operation including retransmission.

Payload and flow control signal performance

The separated payloads at λ_P1 and λ_P2, to be fed into the 1×4 optical switches (not employed in this experiment), are then detected and analyzed by a bit error rate (BER) tester. The purpose is to evaluate the in-band filtering effect caused by the FBGs and the possible crosstalk caused by the bi-directional transmission of the data traffic and the flow control signal. Figure 3.5 illustrates the BER curves of the two WDM channels after the label extractor, as well as the back-to-back (B2B) ones employed as a reference. Error-free operation with only 0.5 dB power

penalty for the 40 Gb/s payload has been measured, indicating that no distortion has been caused by the label and ACK transmission, which is also confirmed by the eye diagrams.

Figure 3.5: 40 Gb/s payload BER curves and eye diagrams.

Although the DCN is a closed environment compared with a telecom network, variations in distance and in power can still occur. Therefore, we further estimated the eye-opening factor [120] and the amplitude of the detected ACK as a function of the label power fed into the SOA re-modulator and of the SOA driving current. The results are presented in Figure 3.6. When the SOA is driven at a 30 mA current, the eye-opening factor is higher than 0.8 for input optical powers ranging from −30 dBm to −18 dBm. An even larger dynamic range is possible with a higher driving current. As a result, the optical bi-directional system is quite robust to power fluctuations and thus to distance variations within the DCN. The low required input optical power indicates that only 1% of the label power would be sufficient for the flow control operation, so the label detection is not impacted. Moreover, the low SOA driving current shows that this technique is not only spectrally efficient but also power efficient, as the FPGA pins can directly drive the SOA to re-modulate the label signal. Considering the contributions to the energy consumption given by the low-speed O/E converter for ACK detection (540 mW) and the SOA for ACK re-modulation (80 mW), the flow control operation introduces only 15.5 pJ/bit of additional energy consumption compared with the WDM fast optical switching systems presented in [84].
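The 15.5 pJ/bit figure follows directly from dividing the added electrical power by the 40 Gb/s payload rate, as the short check below shows.

    # Energy overhead of the flow control per payload bit
    ack_receiver_w = 0.540     # low-speed O/E converter for ACK detection [W]
    ack_remod_soa_w = 0.080    # SOA used for ACK re-modulation [W]
    payload_rate_bps = 40e9    # 40 Gb/s payload per channel

    energy_per_bit = (ack_receiver_w + ack_remod_soa_w) / payload_rate_bps
    print(f"{energy_per_bit * 1e12:.1f} pJ/bit")   # 15.5 pJ/bit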

Figure 3.6: Eye-opening and amplitude of the detected ACK.

3.3 Variable link length

The dynamic operation investigated in Section 3.2 assumes that the links between the transmitting side and the optical switch have the same length and that the packet duration is longer than the RTT, which means that the buffer manager at the edge node has already detected the flow control information before the end of the packet transmission. Therefore, sending a new packet or retransmitting the blocked one is decided before the next time slot starts. However, in reality the distance between the edge node and the optical switch node may vary from several tens of meters up to kilometers in a DCN. In the case that the RTT is longer than the packet duration, the corresponding flow control signal will arrive after the entire packet has been sent out. A bigger buffer providing a longer buffering time (typically at least equal to the RTT) is needed and, more importantly, the buffer manager should correctly judge to which transmitted packet the received ACK corresponds.

To this aim, we implemented a time-slot-based buffering technique that dynamically configures the size of a shifting buffer according to the tested RTT. The operation is the following. At the beginning of the system installation, the buffer manager first tests the RTT of the flow control operation by sending a probe pulse signal to the switch node. The switch controller receives the probe signal and immediately sends back the same signal through the flow control mechanism. The buffer manager records this RTT and adapts its implementation mainly

in two aspects. First, in the synchronous slotted operation, to guarantee that the packets from different edge nodes arrive at the optical switch at the same time, a time offset T_offset with respect to the system clock is applied at each edge node, which can be obtained from the modulo operation

T_{offset} = T_{RTT} \bmod T_{slot}     (3.1)

where T_RTT is the RTT value and T_slot is the slot duration determined by the system clock. Second, a shifting buffer is utilized to match the received flow control signal with the corresponding stored data packet. When the label is sent out, a copy is fed into the shifting buffer and forwarded to the next stage at the beginning of each slot. The number of stages N_shift is determined by T_RTT and the packet duration T_slot:

N_{shift} = \lceil T_{RTT} / T_{slot} \rceil     (3.2)

The detected flow control signal therefore refers specifically to the label emerging from the shifting buffer. A new packet will be transmitted and stored in the buffer in response to a positive ACK; otherwise, the emerging label will be sent out again and go through the same procedure.

Using the experimental set-up depicted in Figure 3.2, we increase the distance between the edge node and the optical switch by adding a 1 km single mode fiber (SMF) to evaluate the flow control operation with a larger RTT, as shown in Figure 3.7 (a). The RTT is first tested, according to which the shifting buffer is configured. The traces of the transmitted label and the detected flow control signal (ACK) at the edge node are given on the right side. The reception of the ACK is delayed by 10.5 μs (around 17 slots), mainly due to propagation. During this time the transmitted data is stored and shifted in the buffer until the corresponding flow control signal arrives. At the next time slot, a new data packet or the previously contended one is transmitted according to the received ACK/NACK. Figure 3.7 (b) illustrates the results when the two channels have different distances (1 km / 450 m) from the fast optical switch node. In this case, the buffer manager sets a different number of stages for each of them to match the respective RTT. Since it takes around 4 slots longer for packets from Channel 1 to reach the optical switch, contention may happen between a packet from Node 1 and a packet sent from Node 2 four slots later. From the above results we can see that, by using the shifting buffer, the edge node can handle variable distances between the transmitting side and the optical switch.
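The timing bookkeeping of Equations (3.1) and (3.2) can be captured in a few lines, as in the Python sketch below. It uses the ceiling reconstruction of Eq. (3.2), and the deque-based buffer and the example numbers are illustrative; the FPGA implementation differs.

    import math
    from collections import deque

    def timing_parameters(t_rtt_ns, t_slot_ns):
        """Eq. (3.1)/(3.2): slot-alignment offset and number of shifting-buffer stages."""
        t_offset = t_rtt_ns % t_slot_ns
        n_shift = math.ceil(t_rtt_ns / t_slot_ns)
        return t_offset, n_shift

    # Example: 640 ns slots (600 ns packet + 40 ns guard time), measured RTT of 10.5 us
    t_offset, n_shift = timing_parameters(t_rtt_ns=10500, t_slot_ns=640)
    print(t_offset, n_shift)            # 260 ns offset, 17 stages (around 17 slots)

    # The shifting buffer holds one label copy per outstanding slot.
    shift_buffer = deque(maxlen=n_shift)
    shift_buffer.append("label of slot k")   # pushed when the packet is sent
    # ... n_shift slots later, the arriving ACK/NACK refers to shift_buffer[0]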

The size of this shifting buffer is configured according to the RTT to match the timing of the detected flow control signal with the corresponding stored data packet.

Figure 3.7: Dynamic operation with variable distance. (a) Both channels with 1 km from the optical switch. (b) Channel 1 with 1 km and Channel 2 with 400 m.

3.4 Summary

The flow control mechanism facilitates the retransmission in the fast optical switching system. The flow control is implemented through the same optical channel, without requiring a dedicated interconnection network. A limited portion (< 20%) of the in-band optical label power is reused, which does not affect the label detection and, at the same time, simplifies the system by avoiding the need for wavelength registration circuitries. Few extra components are required for this implementation: an SOA as a low-speed modulator and an optical circulator to couple the optical flow control signal into the data transmission channel. The additional FBG used for testing the double filtering effect could be removed. At the edge node, an additional low-speed photodiode in combination with an LPF is needed to detect the ACK/NACK message.

The performance of the optical flow control technique is experimentally investigated. Benefiting from the fast contention resolution, dynamic flow control operation with low latency is demonstrated. Assessment of the 40 Gb/s NRZ OOK payload indicates that no degradation has been caused by the bi-

directional transmission of the payload with the in-band label and the optical flow control signal. Investigation of the quality of the flow control signal shows a large dynamic range at a low SOA bias current, proving the system to be robust against power fluctuations in a DC environment. A flow control scheme handling variations in the distance from the edge node to the optical switch node is also considered, providing a spectrally efficient and energy-efficient solution for the practical employment of fast optical switches in DCNs.


Chapter 4

Software defined networking enabled control interface

In this Chapter, the software defined networking (SDN) enabled control interface for the fast optical switch is presented². The increasing demand for high efficiency and agility in data center networks (DCNs) is posing great challenges to optical switching based solutions. In response, LIGHTNESS has chosen SDN as the base control technology to facilitate network provisioning and virtualization. The rationale is to offload the intelligence of the network control from the switch controller to the SDN control plane. As a data plane element, the optical switch is dedicated to data traffic forwarding. With an overview of the network functionalities and of the requirements exposed by the applications, the centralized SDN controller can dynamically manage and optimize the performance of the underlying data plane. To this purpose, an SDN-enabled interface allowing for flexible monitoring and management from the control plane is developed for the fast optical switch, which can fit seamlessly into an optical data center network by exploiting open-source software and standard protocols. The creation of multiple independent Virtual Networks (VNs) is enabled, which improves the resource utilization in combination with statistical multiplexing. Here, a VN stands for a collection of flows associated with a single tenant. Statistical multiplexing allows for on-demand resource sharing according to a certain scheduling discipline (e.g., first-come-first-served or fair queuing). Moreover, with extensions of the standard protocol, more functionality can be developed to enhance the performance of the traffic running within the VN.

² Parts of this chapter are based on the results published in [127].

In the following, the VN based on the fast optical switch in the context of LIGHTNESS is presented in Section 4.1. Then, the detailed implementation of the SDN-enabled control interface, including the operation of the agent, is explained in Section 4.2. To validate and assess the VN generation and

reconfiguration, experimental results showing the quality-of-service (QoS) guarantee are reported in Section 4.3.

4.1 Fast optical switch based VN

SDN control plane

Multi-tenancy, enabling efficient resource utilization, is considered a key requirement for next-generation DCNs, resulting from the growing demand for services and applications. Therefore, virtualization mechanisms and technologies are widely deployed by data center providers to efficiently multiplex customers within their physical network and IT infrastructures [43, 121]. End users and business customers no longer need to maintain their own physical IT infrastructures, while being offered a scalable, simple-to-provision and cost-effective virtual solution for computing, storage and integrated applications [122]. In this scenario, a key innovation is the creation of multiple independent VNs, which allows the sharing of the infrastructure by multiple coexisting tenants. The required interconnectivity resources can be logically ensured. Dedicated routing policies can be applied to different tenants based on abstracted switching resources, like traffic flows. DCN virtualization therefore enables additional benefits for service providers and enterprises: improved resource utilization, rapid service delivery, and flexibility and mobility of applications within DCs [7].

While the servers in a DC have long been virtualized, the virtualization of the DCN has not attracted much attention until recently [7]. It involves virtualizing the network switching nodes in the data plane, which requires a proper abstraction of the resources first, and a configuration according to the requirements later. The proprietary control interfaces of the switches are no longer efficient or effective for this next DCN paradigm. On the contrary, a standardized solution could help deliver significant levels of agility, speed, and manageability [123]. SDN has been chosen as the base control technology to facilitate network provisioning and virtualization. By fully abstracting the underlying data plane devices, including the fast optical switch node, the SDN control plane can effectively create and manage multiple VNs, each standing for a slice of the resources. The look-up table (LUT) determining the forwarding output of the data packets is populated by the SDN control plane. In the data plane, the switching operation relying on statistical multiplexing is conducted at hardware speed (nanoseconds), decoupled from the millisecond-scale control of the SDN control plane.

Figure 4.1: SDN reference model.

The SDN control plane is deployed on top of the flat optical DCN, as depicted in Figure 4.1. It consists of a centralized SDN controller interacting with the DC network devices through the southbound interface (SBI), which in our case implements an extended version of the OpenFlow (OF) protocol [108]. A dedicated OF agent is deployed on top of each switching device as a mediation entity between the SDN controller and the proprietary control interface exposed by the device. Each agent provides an abstraction (e.g., port count, supported number of flows, and capabilities) of the corresponding DCN device and enables a uniform resource description at the SDN controller level. The SDN controller translates the requirements generated by the application plane and configures the data plane devices accordingly. Facilitated by the SDN control and the related protocols, once the virtual data center infrastructure is provisioned, the traffic flows generated by the applications are automatically classified, recognized and associated with the given VN. Logical isolation (e.g., VLANs) can be applied to avoid interference among the traffic carried in different VNs.

Fast optical switch based VN

The work presented in this chapter focuses on the performance assessment of the fast optical switch-based VNs enabled by the SDN control. The fast reconfiguration time of the optical switch node provides high flexibility and utilization for the VNs. Triggered by the DC operator or by an external application leveraging the programmability exposed at the northbound interface, the SDN controller can dynamically configure the VNs through the SBI and, eventually, the agent, following a top-down approach. The VN is created and managed by the DC infrastructure provider. In turn, a flow is defined here as a set of application data packets that are aggregated into optical packets containing the same optical label, with a certain load. As will be further explained, the load and the priority of the flows can be manipulated to provide guaranteed QoS in terms

of packet loss and latency. The tenant is allowed to run applications on its VN and to make changes to it when given the authority.

To set up a fast optical switch-based VN, the SDN controller configures the LUTs of the top-of-rack (ToR) switches (acting as edge nodes) through the SBI for those racks whose servers have to be interconnected. The controller also configures the LUT of the fast optical switch to interconnect those ToRs. Figure 4.2 presents an example of VN creation and reconfiguration in an optical DCN based on fast optical switches. VN1 connects ToR1 with ToR2, while VN2 interconnects ToR2, ToR3 and ToRN, with ToR2 belonging to both VNs. Here, we assume that the tenant owning VN1 intends to run a new application flow in a virtual machine (VM) hosted in Rack 3. In this case, ToR3, which connects Rack 3 to the DCN, has to be included in VN1, so a network reconfiguration is required. As mentioned, a top-down approach is used here:

1) The operation is triggered by the DC management by means of the SDN controller;
2) The SDN controller in turn updates the LUTs of the ToRs (ToR1, ToR2 and ToR3) and of the optical switch involved in the updated VN1;
3) Once the VN has been provisioned, application data are exchanged between the ToRs;
4) New flows with the same destination are directly forwarded at a sub-microsecond time scale, decoupled from the SDN controller operating at a milliseconds time scale.

Figure 4.2: Virtual Network creation and reconfiguration.

Moreover, exploiting the statistical multiplexing enabled by the fast optical switch, on-demand bandwidth/wavelength resource sharing is possible between flows associated either with the same VN or with different VNs. This enables the dynamic creation and reconfiguration of multiple VNs and the

optimization of the DCN resource utilization, which leads to a high tenant density. All these operations are enabled by the SBI and by the agent residing on top of the fast optical switch node, which acts as a mediation entity between the standard protocol running on the SBI and the proprietary interface exposed by the optical switch controller.

4.2 SDN-enabled control interface

Control architecture

Figure 4.3 depicts the control architecture deployed for the scenario under investigation. The SDN controller is implemented by means of the OpenDaylight (ODL) platform, an open-source initiative hosted by the Linux Foundation, in its Hydrogen release (Base edition) [124]. ODL has an extensible modular architecture as well as a wide set of available services, appliances and northbound control primitives suitable for the SDN-enabled fast optical switch. However, the ODL controller requires a set of extensions in support of the fast optical switching technology. First, the OF protocol driver at the SBI side of the ODL controller needs to be modified by applying a set of fast-optical-switch extensions defined for the OF protocol. This enables the control and provisioning of optical switch nodes from the controller. Afterwards, a set of core services of the ODL controller, including the Forwarding Rules Manager, the Service Abstraction Layer and the Topology Manager (responsible for managing the topology), are also extended in support of the optical switch nodes and the managed flows. Finally, the ODL management Graphical User Interface (GUI) is extended to allow the management of fast optical switch nodes as well as to configure and manage the related traffic flows.

Figure 4.3: Control architecture.

Besides this, the ToRs and the optical switch node are equipped with OF agents that enable the communication with the ODL controller through the SBI, thereby bridging the control and monitoring mechanisms and primitives at both sides. More specifically, the OF agents, running in dedicated servers, map the proprietary interfaces exposed by the optical switch controller and by the aggregation controller in the ToR onto the extended version of the OF protocol implemented at the ODL SBI, as illustrated in Figure 4.3. The aggregation controller is responsible for aggregating the traffic coming from the servers of the rack, generating the optical packets and assigning the appropriate label to them (i.e., the flow generation process). In this way, the agents translate the OF messages coming from the controller into a set of actions performed on the underlying device through the proprietary interface, and vice versa. Hence, the SDN controller is able to flexibly configure the optical switch-based DCN by updating the LUTs while, on the other hand, the status of the underlying hardware switches is reported up to the SDN controller for monitoring purposes. The ODL management GUI allows for triggering and visualizing these actions.

OpenFlow extensions

The OF protocol has arisen as the de facto standard to implement the SBI defined by the SDN paradigm. OF allows moving the network control out of the nodes into the control plane, since it enables the manipulation of the forwarding tables of the network devices. In this way, data flows can be automatically and dynamically configured to satisfy the requirements of the users. To this end, the OF protocol defines a set of messages and attributes, exchanged between the controller and the network elements, to configure the data flows. The messages of the protocol most relevant to the investigated optical switch are the FEATURES_REQUEST/REPLY message pair, the FLOW_MOD message and the STATS_REQUEST/REPLY pair. The FEATURES_REQUEST message is sent by the SDN controller to the network device to request the capabilities (i.e., switching technology, number of ports, etc.) of the device. The network element sends back a FEATURES_REPLY in response to the request from the controller. The controller uses the FLOW_MOD message to configure, modify or delete data flows. To this end, the operation (i.e., New Flow, Flow Modify or Delete Flow) and the characteristics of the flow are conveyed in the message. Finally, the controller requests statistics from the data plane by means of the STATS_REQUEST message. Different types of statistics (e.g., port, table, and flow) can be requested. Upon the reception of these requests, the network

68 Software defined networking enabled control interface 51 devices send back the associated STATS_REPLY messages containing the required values. However, the current OF specification focuses on the electronic devices, so some modifications are needed to enable OF in the proposed optical data plane. To this end, the OF v1.0 protocol is extended to fully support the fast optical switching technology and its specific switching paradigm. In particular, the fast optical switching feature is added to the ofp_capability attribute, which is conveyed in the FEATURES_REPLY message, so that the controller can recognize devices implementing this new switching technology. The ofp_match and ofp_action fields are extended as well to enable the corresponding flow configuration by means of the FLOW_MOD message. More specifically, the optical label and wavelength attributes extend the ofp_match, and the load of the traffic flow can be set thanks to a new action added to the ofp_action field. These extensions aim to support the configuration of the LUTs in both the optical switch node and the ToR. During operation, the ToR uses the information stored in the LUT to configure the labels of the optical packets, and the optical switch node uses it to switch the incoming packets to the appropriate output port. Table 4.1 lists the extensions made to support the functionalities of fast optical switch node. Table 4.1: OF protocol extensions to support the fast optical switch node Message Attributes Extensions FEATURES_REPLY ofp_capability Add fast optical switching feature FLOW_MOD ofp_match Add optical label and wavelength as new matching fields FLOW_MOD ofp_action Add the setting of the traffic load Figure 4.4 (a) depicts the extended ODL management GUI. Concretely, it shows the ToR and the optical switch node (the OPS node) that have been detected by the controller with the extended FEATURES_REPLY message. In addition, Figure 4.4 (b) depicts an example of an extended FLOW_MOD message received by the optical switch node to configure a new flow. Specifically, the figure presents the extended OF match field conveying the optical label with a value of 13 and the wavelength bitmap. It is worth noting here that the wavelength bitmap follows the same format as the one proposed in the optical circuit switching extensions addendum for the OF protocol v1.0 [125]. The new OF action, which allows for setting the load (20) to the traffic flow, is also shown in the figure.
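Since the exact wire layout of these extensions is not reproduced here, the following sketch only illustrates how the extended match (optical label plus wavelength bitmap) and the new load-setting action could be packed; the field widths and the vendor action code are assumptions, loosely inspired by the OF v1.0 circuit switching addendum rather than copied from it.

```python
# Illustrative packing of the extended match/action fields from Table 4.1.
# The exact wire layout is an assumption for illustration only.
import struct

def pack_optical_match(in_port: int, optical_label: int, wavelength_bitmap: int) -> bytes:
    # in_port: 16 bit, optical_label: 16 bit, wavelength bitmap: 32 bit (one bit per channel)
    return struct.pack("!HHI", in_port, optical_label, wavelength_bitmap)

def pack_set_load_action(load_percent: int) -> bytes:
    # hypothetical vendor action: type (16 bit), length (16 bit), load (16 bit) + padding
    ACTION_SET_LOAD = 0xFF01  # assumed experimenter/vendor action code
    return struct.pack("!HHHxx", ACTION_SET_LOAD, 8, load_percent)

# Example values from Figure 4.4 (b): label 13, one wavelength channel, load 20
match = pack_optical_match(in_port=1, optical_label=13, wavelength_bitmap=0b01)
action = pack_set_load_action(20)
print(match.hex(), action.hex())
```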

69 52 Experimental evaluation Figure 4.4: (a) OpenDaylight GUI. (b) OpenFlow FLOW_MOD extended message. The standard STATS_REQUEST/REPLY message pair is used for the collection of statistics, since it already copes with the needs of the scenario under study. The frequency of the statistic report can be set from the control plane according to the specific requirements of the applications. It is also worth noting here that only port statistics are considered. 4.3 Experimental evaluation Facilitated by the implemented OF agent and extended OF protocol, the SDNbased control functions are enabled for the fast optical switch node. VNs can be created and managed remotely through the SDN control plane. To validate the VN flexibility and agility and QoS guarantee benefited from this, in the following, the experimental investigation on both data plane and control plane operations including VN reconfiguration, priority assignment and load balancing based on statistics collection have been presented. The set-up is illustrated in Figure 4.5. The FPGA-emulated ToR performs the statistical multiplexing of the packets associated to the traffic flows and transmits them to the fast optical switch node that, in turn, forwards the packets to the proper destination according to the attached optical labels. The FPGAbased switch controllers are interfaced to the OF agents through the USB connections. The attached 4-bit label contains the forwarding information (2 bits) and the class of priority (2 bits) which is assigned by the ODL controller according to the application requirements. The ToR is equipped with the

70 Software defined networking enabled control interface 53 aggregation controller. Furthermore, it implements the flow control mechanism at the ToR side. In particular, the buffer manager inside the aggregation controller stores the label information and performs the packet (re-)transmission according to the ACK/NACK sent by the optical switch node. The gates used for controlling the transmission of packetized 40 Gb/s NRZ-OOK payloads (460 ns duration and 40 ns guard time) are triggered by the buffer manager in case of (re-)transmission. The fast optical switch has an SOA-based broadcastand-select structure as explained in detail in Chapter 2. Figure 4.5: Set-up for the experimental validation Virtual network reconfiguration The aggregation controller at the ToR side allocates a certain label for the incoming packet by matching the destination requirements with the LUT. Upon the reception of the packet, the optical switch processes the label and forwards it to the corresponding output port according to the information provided by the LUT. However, as the demands of users and applications change, the created VNs need to be flexibly reconfigured and adapted to the dynamic requirements of the applications. In this case, the LUTs of both the ToR and the optical switch nodes can be updated by means of the SDN controller to reconfigure the interconnection of the VNs according to the new requirements. In the example depicted in Figure 4.6, the ODL controller has originally provisioned VN 1. Application flows, Flow 1 and Flow 2, are statistically multiplexed on the same wavelength λ 1 and switched to different output ports. Different priority levels are assigned to each flow in case of potential resource competition between the flows. To support a newly generated Flow 3, a reconfiguration of VN 1 is required to provision the connectivity with output Port 3. To this aim, the DC management uses the ODL controller to update the

71 54 Experimental evaluation LUTs in the ToR and the optical switch through the OF interfaces exposed by the agents. Figure 4.6: Virtual networks reconfiguration In this procedure, an OF FLOW_MOD message carrying the command of New Flow is sent to the OF-agents, which process the command. As illustrated in Figure 4.4 (b), the message specifies the input and output optical switch ports, and the proper label including the class of priority for the corresponding packets. Then the agents execute the configuration instructions for the ToRs and the optical switch to update their LUTs so that one more LUT entry will be added. At this point, the VN has been reconfigured, and the optical switch node supports the delivery for the new flow. Additionally, the ODL controller can be also used to disable a certain flow or to make modifications (such as adjusting priority) by deleting or editing the entries of the LUT, respectively. Figure 4.4 (b) illustrates a detail of the OF FLOW_MOD message conveying a New Flow command. A summary of created flows listed on the GUI is given in Figure 4.4 (a). In the data plane, Figure 4.7 shows the LUT update for the original VN 1 (LUT) and the reconfigured VN 1 (LUT ). There, xx (L 4 L 3 ) represent the priority that, in case of collision, will be referenced to classify the priority in the order of 11 > 10 > 01 > 00. Note that, within VN 1, there are no entry routes to the output Port 3 in the LUT. The time traces of label L 2 L 1 (L 4 L 3 omitted), the incoming packets to the optical switch node (Flow in) marked with destination and outputs for the three ports are also plotted. The figure clearly shows that before the reconfiguration (left side), flows destined to ToR 3 are dropped since no matching label is found in the LUT of the optical switch. On the contrary, once VN 1 is reconfigured and the LUT is updated (right side), the flows towards

72 Software defined networking enabled control interface 55 ToR 3 are then properly delivered with the L 2 L 1 labeled to 11. The update process, which includes the communication between the ODL controller and both the ToR and the optical switch, takes around 110 milliseconds; after that, flows are statistically multiplexed and switched. It is worth to note that the reconfiguration process does not affect other flows, so packets destined to ToR 1 and ToR 2 perform hitless switching during the VN reconfiguration time. Figure 4.7: Time traces for labels and packets before/after VN reconfiguration Priority Assignment With statistical multiplexing, fast optical switch-based VNs allow for efficient resource sharing thus achieving high tenant density. However, as the traffic load increases, the competition for the physical resources may result into contention at the optical switch node. The flow control mechanism introduced in Chapter 3 aims at avoiding the data loss associated to contention. Nonetheless, this mechanism deteriorates the end-to-end latency performance due to the retransmissions and, once the buffer at the ToR side is fully occupied, the new coming packets will be lost. By assigning class of priority to data flows, the ones with higher priority will be directly forwarded without any retransmission. Following the top-down approach, the DC operator triggers the assignment of priority to a flow through the ODL controller. The extended OF enables this feature since the label information can be carried within the OF FLOW_MOD to configure the data plane. The label bits L 4 L 3 define four different priority

73 56 Experimental evaluation classes and the contention between the packets with the same priority is resolved here by means of round-robin scheduling. As illustrated in Figure 4.8, Flow 4 and Flow 5 are heading to the same output port (Port 3 ) on different wavelengths. As they come from the same ToR and reach the same module of the optical switch, there is a contention happening, and thus the priority class determines which packets are delivered and which ones are blocked and then retransmitted. Flow 4 has been assigned with higher priority (L 4 L 3 = 11 ) than Flow 5 (L 4 L 3 = 00 ). Therefore, in case of contention at the optical switch, packets associated to Flow 4 will be forwarded to the output Port 3 to avoid packet loss and higher latency caused by retransmission, while the ones associated to Flow 5 will be blocked and then retransmitted. Figure 4.8: Priority assignment. Figure 4.9 shows the label bits (L 4 L 3 L 2 L 1 ), the flow control signals (ACKs), and the switching results for the two contented flows. The ACK signals for Flow 4 are always positive (always forwarded) which means that all the packets are successfully delivered. Flow 5 packets labeled with x are blocked due to the contention and a corresponding NACK is generated to ask for the retransmission.
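The contention resolution logic applied at each output port can be sketched as follows; this is a simplified software model of the FPGA behaviour described above (priority order 11 > 10 > 01 > 00, round-robin among equal priorities), with illustrative function and field names.

```python
# Minimal sketch of the per-output contention resolution described above.
# Packets are (client_id, priority) pairs; priority is the 2-bit L4L3 value,
# ordered 11 > 10 > 01 > 00. Ties are broken round-robin, as in the experiment.
# Function and field names are illustrative, not the FPGA implementation.

def resolve_contention(requests, rr_pointer=0):
    """requests: list of (client_id, priority) competing for one output port.
    Returns (forwarded_client, acks) where acks maps client -> True (ACK) / False (NACK)."""
    if not requests:
        return None, {}
    top = max(p for _, p in requests)
    candidates = sorted(c for c, p in requests if p == top)
    winner = candidates[rr_pointer % len(candidates)]   # round-robin among equals
    acks = {c: (c == winner) for c, _ in requests}
    return winner, acks

# Flow 4 (priority 0b11) and Flow 5 (priority 0b00) contend for Port 3:
winner, acks = resolve_contention([("flow4", 0b11), ("flow5", 0b00)])
print(winner, acks)   # flow4 forwarded (ACK), flow5 blocked (NACK -> retransmission)
```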

74 Software defined networking enabled control interface 57 Figure 4.9: Time traces of Flow 4 and Flow 5. Figure 4.10 shows the packet loss and latency for both flows with a uniformly distributed load. The packet loss curves confirm no packet loss for Flow 4, while the 16-packet buffer employed at the ToR side prevents packet loss up to a load of 0.4 for Flow 5. For higher values of the load, as the buffer starts to be fully occupied, the packet loss increases linearly. The retransmissions observed for the blocked packets of Flow 5 lead to an exponential increase of the latency. On the contrary, the priority assignment guarantees a minimum latency of around 300 ns and thus a high QoS for Flow 4, which includes the transmission on the 2 25 m links as well as the label processing time. Figure 4.10: Packet loss and latency for Flow 4 and Flow 5.
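The qualitative behaviour of the lower-priority flow, buffering and retransmitting until the output becomes free and losing packets once the 16-packet buffer overflows, can be reproduced with a simple slotted-time model. The sketch below uses a simplified Bernoulli traffic and contention model, so it only illustrates the trend of Figure 4.10 rather than the measured values.

```python
# Qualitative slotted-time sketch of the lower-priority flow's behaviour:
# each slot a new packet arrives with probability `load`; the head-of-line packet
# is retransmitted until the contending high-priority flow leaves the slot free.
# The traffic model and contention rule are simplified assumptions.
import random

def simulate(load, buffer_size=16, slots=200_000, seed=1):
    random.seed(seed)
    queue, lost, generated, delays = [], 0, 0, []
    for t in range(slots):
        if random.random() < load:                 # new low-priority packet
            generated += 1
            if len(queue) < buffer_size:
                queue.append(t)
            else:
                lost += 1                          # buffer overflow -> packet loss
        # high-priority flow occupies the output with probability `load`
        if queue and random.random() >= load:
            delays.append(t - queue.pop(0))        # head-of-line packet finally served
    loss_ratio = lost / generated if generated else 0.0
    avg_wait = sum(delays) / len(delays) if delays else 0.0
    return loss_ratio, avg_wait

for load in (0.2, 0.4, 0.6, 0.8):
    loss, wait = simulate(load)
    print(f"load={load:.1f}  loss={loss:.4f}  mean wait={wait:.1f} slots")
```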

75 58 Experimental evaluation Statistics Report and Load Balancing As the centralized controller of the whole DCN, ODL is also in charge of monitoring the status of all the underlying devices. Based on the collected realtime information, the ODL controller could dynamically provision VNs updates to fully exploit the network resource. Moreover, adjustments like balancing the traffic load can be triggered by ODL controller, further improving the network efficiency and utilization. To this purpose, OF STATS_REQUEST/REPLY message pairs are exchanged between the ODL controller and the data plane devices, where the OF STATS_REPLY messages contain the statistical information provided by the optical devices. In particular, the fast optical switch and ToR nodes collect the amount of processed data (in Kbytes) for both received and forwarded packets. The number of retransmissions due to the contention, which is essentially the number of NACKs, is also reported to the agent and included in the collisions field of the OF STATS_REPLY message. Figure 4.11: OF STATS_REPLY messages for (a) optical switch and (b) ToR. Figure 4.11 illustrates the statistics collection messages for both the optical switch node (a) and the ToR (b). The example depicts the scenario where two

active flows face contention. The optical switch controller records the number of received packets as well as the NACK signals for each flow. Upon receiving a request to gather the port statistics, the OF-agent reads the counters from its controlled device and reports the aggregated per-port values to the ODL controller through the OF STATS_REPLY message. Hence, the counters are translated into received and transmitted packets, and collisions (i.e., NACKs). Figure 4.11 (a) presents the detail of the OF STATS_REPLY message carrying the port statistics of the fast optical switch. This information is then depicted in the ODL GUI. For evaluation purposes, the packet loss, which affects the QoS significantly, is a parameter that needs to be tracked. Since the buffer is implemented at the ToR side, the packet loss performance can only be collected and reported to ODL from the ToR OF-agent. This has been implemented by utilizing the TX dropped field of the OF STATS_REPLY message, as shown in Figure 4.11 (b). The ODL controller can then be used to optimize the system performance according to the application requirements, based on the statistical information reported by the OF-agents.

Figure 4.12: Load balancing operation.

An example of the load balancing operation based on statistics collection and flow modification is given in Figure 4.12. Two flows belonging to two different VNs share the common output Port 2. As the load increases, contention at Port 2 would cause high packet loss for both flows. Upon reception of the real-time status of the per-port packet loss and the occupancy of each alternative port from the ToR, the ODL controller can balance the load towards the less-used ports. As can be seen in Figure 4.13 (a), the load of both flows has been increased from 0 to 0.8, with 50% of the packets destined to Port 2 at the beginning.
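On top of the reported counters, the balancing logic applied by the controller can be sketched as a small decision function: when the per-port retransmission rate exceeds a threshold, a Flow Modify is issued that shifts part of the load to an alternative port. The dictionary shapes and function names below are illustrative; the 10% threshold and the 0.15 balancing step correspond to the policy used in the experiment discussed next.

```python
# Sketch of the statistics-driven balancing loop run on top of the controller.
# The stats dictionaries mimic the per-port counters carried in STATS_REPLY
# (tx/rx packets, collisions = NACKs, tx_dropped = ToR buffer losses); the
# function names and message shapes are illustrative, not the ODL internals.

RETX_THRESHOLD = 0.10     # act when >10% of packets needed retransmission
BALANCE_STEP = 0.15       # fraction of load moved away from the congested port

def retransmission_rate(port_stats):
    tx = port_stats["tx_packets"]
    return port_stats["collisions"] / tx if tx else 0.0

def balance(flow, stats_by_port, alternative_port):
    """Return a Flow-Modify-like dict if the flow's output port is congested."""
    rate = retransmission_rate(stats_by_port[flow["out_port"]])
    if rate <= RETX_THRESHOLD:
        return None                                    # nothing to do
    return {"command": "MODIFY",
            "match": flow["match"],
            "action": {"out_port": alternative_port,
                       "load_shift": BALANCE_STEP}}    # move 0.15 of the load

flow6 = {"match": {"in_port": 1, "optical_label": 5}, "out_port": 2}
stats = {1: {"tx_packets": 800, "collisions": 20, "tx_dropped": 0},
         2: {"tx_packets": 1000, "collisions": 150, "tx_dropped": 12}}
print(balance(flow6, stats, alternative_port=1))
```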

77 60 Summary If ODL does not update any of the LUTs, high packet loss is observed. In comparison, targeting a packet loss threshold of , once the reported statistics tend to exceed this value, the load at Port 2 will be balanced to Port 1 (for Flow 6 ) and Port 3 (for Flow 7 ) through the Flow Modify command. In this case, the adjustment is proactive when the detected retransmission rate (contention possibility) is higher than 10%. A balancing step of 0.15 has been set to properly avoid possible performance degradation with the given load increasing speed. According to the QoS settings, the packet loss < and latency < 340 ns are guaranteed as shown in Figure 4.13 (b). Figure 4.13: Packet loss and latency changes without adjusting (a) and with balancing step of 0.15 (b). 4.4 Summary Following the SDN-based control technique selected in LIGHTNESS DCN, investigation on the SDN-enabled control interface for the fast optical switch is presented in this chapter. An OF-agent is developed to bridge the communication between the SDN controller and the optical switch node. The agent keeps an information model of the optical switch and communicates with the FPGA-based switch controller using the proprietary interface. On the other side, the OF-agent communicates with the SDN controller by means of extended version of OpenFlow protocol through the SBI. Therefore, based on

78 Software defined networking enabled control interface 61 the abstraction held in the agent, the SDN controller could configure and manage the optical switch-based virtual networks and corresponding data flows by updating the LUT of the switch controller. Moreover, statistics such as average load and contention rate can be collected through this interface and sent to the SDN controller for further performance optimization and management. The implementation of such SDN-enabled control interface is the key enabler to fulfil the SDN control framework in the optical switch-based DCN. Based on the developed control interface, the experimental assessment demonstrates the creation and reconfiguration of VNs by updating the LUTs stored in the switching nodes. For application flows with high priority, QoS can be guaranteed with proper priority assignment in the label field avoiding the performance degradation caused by the competition. In addition, the SDN controller is able to monitor the network by collecting real-time per-port statistics through the OpenFlow protocol. The load balancing operation can be introduced to further provide the QoS support which is a valuable feature in challenging situations. Equipped with a standard control interface, the fast optical switch could be integrated into an SDN-enabled control framework. The deployment of the SDN controller decouples the control plane from the underlying data plane so that the decisions are made based on the functional abstractions of the optical switch without interfering the fast forwarding. The flow-controlled fast optical switch with distributed control allows for sub-microsecond latency switching and large connectivity enabled by statistical multiplexing. On the other hand, the SDN-enabled VNs can be flexibly reconfigured and managed significantly improving the DC agility and controllability.


Chapter 5

4×4 fast optical switch prototype and performance assessments

In this chapter, the implementation of a 4×4 fast optical switch prototype and its performance evaluation are presented³. By integrating the flow control mechanism and the SDN-enabled control interface detailed in Chapter 3 and Chapter 4, a comprehensive investigation has been carried out for the 4×4 fast optical switching system. The physical layer performance in terms of dynamic switching with optical flow control, as well as a statistical study of the packet loss and latency, are analyzed. Based on this, a switch prototype including all the functionalities of a 4×4 optical switch is developed. Board-level integration of the label processor and the SOA/laser driver is implemented, both of which are interfaced to the FPGA switch controller. The performance of burst-mode reception of statistically multiplexed optical packets is assessed. The prototype is also employed in the LIGHTNESS final demo, which covers the network elements in both the data plane and the control plane, providing a promising solution for next-generation optical data center networking.

³ Parts of this chapter are based on the results published in [139], [133] and [150].

In the following, Section 5.1 introduces the set-up used to analyze the 4×4 fast optical switching system and reports the evaluation results. The detailed implementation of the switch prototype is presented in Section 5.2. The assessment of the required preamble length in combination with the burst-mode receiver is discussed in Section 5.3, followed by Section 5.4, which provides the detailed set-up of the LIGHTNESS testbed and highlights the SDN-enabled optical switch prototype in the final demonstration.

5.1 4×4 fast optical switching system

The schematic of the 4×4 fast optical switching system is shown in Figure 5.1. At each ToR, the aggregation controller assigns the inter-ToR traffic (payload packets) to a certain wavelength (λ1 or λ2) and attaches to it an optical in-

81 fast optical switching system band label according to the internal LUT. The LUT can be remotely configured through the SDN-based control interface and can be seamlessly updated when necessary. Besides the forwarding information, the LUT can also specify the class of priority to enable the flow-based operation as explained in Section 4.3. The binary label bits are coded into RF tones carried by the in-band optical label wavelength [112]. Due to the lack of an optical buffer, a copy of the transmitted packet is electrically stored at the aggregation controller and a fast optical flow control between the optical switch and the ToRs is implemented for packet re-transmission in case of contention [126]. Figure 5.1: Optical switch node with SDN-enabled control interface. The fast optical switch node consists of two identical modules each operating independently. It allows for a highly distributed control which makes the reconfiguration time of the switch port-count independent. This is especially important to ensure low latency as the switch scales to large number of ports. WDM packets are de-multiplexed and processed in parallel. Payload is fed into the SOA-based broadcast-and-select 1 2 switch to transparently forward the payload to one of the output ports. The label is extracted by a fiber Bragg grating (FBG) with a narrow passband, and then fed into the label processor (LP). The recovered digital label bits are then detected and matched with the LUT (provisioned through the SDN-based control interface) by the switch controller. It also resolves the possible packet contention, and configures the 1 2 switch to forward the packets with highest priority. In case of contention, the packets with lower priority are blocked and an optical flow control signal (NACK) is sent back to the ToR side asking for packet retransmission. A low-

82 4 4 fast optical switch prototype and performance assessments 65 speed directly-modulated laser (DML) driven by the switch controller is used to generate the flow control signal. This efficient optical flow control link ensures fast retransmission to minimize the packet loss and the latency. The nanoseconds switching time of the SOA in combination with the parallel processing of the label bits allows for a reconfiguration time of the switch in the order of few tens of nanoseconds. The optical switch node and the aggregation controller are equipped with a dedicated OF-agent to facilitate the SDN-based control including the reconfiguration of the LUT and monitoring of the statistics. It enables the programmable and flexible access from the SDN controller to the underlying optical switch node and the per-node settings could be efficiently optimized to meet the desired network performance [127]. As explained in Chapter 4, such OF control interface is aiming at the long-term management which has been fully decoupled with the fast and flow-controlled optical switching. Even though the SDN-based control takes much longer time (at milliseconds time scale), the latter one works seamlessly at the hardware speed (few tens of nanoseconds) based on the optical label Traffic generation The dynamic operation including flow control is validated by using packetized 40 Gb/s NRZ-OOK payloads carrying PRBS data which is externallymodulated on the wavelength of nm and nm for the two clients of each ToR. The packet has 540 ns duration and 60 ns guard time. An FPGA that acts as the aggregation controller generates the label associated to each packet and simultaneously provides a gating signal to enable the payload transmission. Buffers with limited 16-packet size are implemented by the FPGA to store the label for each client. Labels will be removed from the queue in response to a positive ACK, otherwise they will be retransmitted and the gate will be retriggered to implement packet retransmission. The label wavelengths, each carrying four RF tones (130, 253, 410 and 615 MHz), are centered at nm and nm, respectively. The electrical spectrum of the coded RF tone label is shown in Figure 5.2 (a). Two bits, L 3 and L 4, represent the class of priority, while L 1 and L 2 carry the forwarding information ( 01 for Port 1 and 10 for Port 2 ). Both priority and destination can be remotely managed through the OF-based control interface. The average optical power of the payload and the label at the optical switch input is 2.5 dbm and -2 dbm, respectively. Pass band of FBG is centered at the label wavelength and has a 3 db bandwidth of 6 GHz. This narrow bandwidth could avoid

83 fast optical switching system spectral distortion of the payload. Optical spectra of the packets before and after label extractor for the two clients are reported in Figure 5.2 (b). Figure 5.2: (a) Electrical spectrum for the 4-bit RF-tone label. (b) Optical spectrum before and after label extractor SDN-enabled control The FPGA-based switch controller has a dedicated OF-agent interface enabling the communication with the SDN controller. More specifically, the agent maps the USB link-based proprietary interface into the OF protocol implemented at the SDN controller. As explained in Section 4.2.2, extensions have been made for the OF protocol to facilitate the control and provisioning of the fast optical switch node. The agent translates OF messages coming from the controller into a set of actions performed over the underlying switch controller and vice versa. Therefore, the LUT is able to be flexibly configured, while the status of the switching operation is reported to the SDN controller for monitoring purposes through such interface. Figure 5.3 (a) illustrates two examples of exchanged messages through the OF-based interface to the agents for adding/deleting the LUT entries. As highlighted, the input port and label information are used to specify the LUT entry. After receiving the OF messages and configuring the FPGA controller correspondingly, the agent also reports feedbacks of the executed commands for

84 4 4 fast optical switch prototype and performance assessments 67 monitoring purpose as shown in Figure 5.3 (b). After the LUTs are properly configured, the generated packets are transmitted to the optical switch node for investigation of the switching performance. Figure 5.3: (a) Captured OpenFlow messages and (b) agent reports for adding/deleting LUT entry Dynamic switching To investigate the dynamic operation of the flow control and the payload switching of the system, optical packets are generated with a traffic load of 0.5 at the ToR side. Label bits L 3 and L 4 representing the class of priority have been fixed as 11 and 00 for the Client 1 and Client 2, respectively. Therefore in case of collision, packets from Client 1 have higher priority over the ones from Client 2. Figure 5.4 shows the dynamic generation/retransmission of the label (L 3 L 4 omitted) and the payload from both ToRs (each color represents one client). The time traces of the label detected by the switch controller and the flow control detected by the aggregation controller at the transmitter side are reported at the top. Label bits L 1 and L 2 carrying the forwarding information bring up 3 possibilities of switching since 00 represents no packet. 01 stands for Output1, 10 for Output2, and 11 for multicasting the payload to both ports. If two packets from different clients have the same destination, packet from Client 1 will be forwarded at the output while the packet from Client 2 will be blocked and a NACK will be sent back requesting packet retransmission. If Client 1 is in multicast, any data in Client 2 will be blocked.
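The per-slot forwarding decision described above (L2L1 selecting Output1, Output2 or both, with Client 1 taking precedence) can be captured by a short sketch; the data structures are illustrative and the fixed Client 1 priority reflects this particular experiment, not the general round-robin policy.

```python
# Sketch of the per-module forwarding decision for one time slot, following the
# rules above: L2L1 = 01 -> Output 1, 10 -> Output 2, 11 -> multicast, 00 -> idle.
# Client 1 always wins here (as in this experiment); the structures are
# illustrative, not the FPGA switch-controller implementation.

def ports_from_label(l2l1: int) -> set:
    return {1} if l2l1 == 0b01 else {2} if l2l1 == 0b10 else {1, 2} if l2l1 == 0b11 else set()

def schedule_slot(client1_l2l1: int, client2_l2l1: int):
    p1 = ports_from_label(client1_l2l1)
    p2 = ports_from_label(client2_l2l1)
    client2_ok = bool(p2) and not (p1 & p2)   # Client 2 served only if no output overlaps
    return {
        "client1": {"ack": bool(p1), "soa_gates_on": sorted(p1)},
        "client2": {"ack": client2_ok,
                    "soa_gates_on": sorted(p2) if client2_ok else [],
                    "nack": bool(p2) and not client2_ok},
    }

print(schedule_slot(0b11, 0b01))   # Client 1 multicasts -> Client 2 blocked, NACK sent
print(schedule_slot(0b01, 0b10))   # disjoint outputs -> both forwarded
```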

85 fast optical switching system Multicasting for Client 2 will only be approved if Client 1 is not transmitting any packet. Figure 5.4: Dynamic operation of labels and payloads from both ToRs. One or both of the SOAs will be switched on to forward the packets to the right destination. The waveforms of the transmitted packets (including retransmitted packets for Client 2) and the switch outputs are shown at bottom of Figure 5.4. Flag M stands for the multicast packets, which should be forwarded to both output ports. If Client 2 contends with Client 1 the packets will be blocked (shown with unmarked packets). In this case, a NACK is generated to inform the buffer manager of the transmitter that the packets have to be retransmitted. Figure 5.4 clearly shows the successful optical flow control and multicasting operation. The minimum latency (no retransmission) is 300 ns including 250 ns propagation delay provided by 2 25 m link between the ToR and optical switch. At switch output, a bit-error-rate (BER) analyzer is used to evaluate the quality of the detected 40 Gb/s payload. Figure 5.5 shows the BER curves and eye diagrams for packets from 4 different clients. Test results for back-to-back (B2B) as well as the signal after the transmission gate are also reported as reference. It clearly shows that the transmission gate used to set the traffic load does not cause any deterioration of the signal quality. Error free operation has been obtained with 1 db penalty at BER of after switch. The slight degradation is mainly due to the in-band filtering caused by label extractor and noise introduced by SOA switch, which is also confirmed by the eye diagrams. It proves that high data-rate operation is supported by our system and no

86 4 4 fast optical switch prototype and performance assessments 69 distortion has been introduced by the bi-directional transmission of label and flow control signal. Figure 5.5: BER curves and eye diagrams for 40 Gb/s payload Packet loss and latency To further investigate the performance of the 4 4 system with the flow control mechanism, the packet loss and the average latency are tested. Packet loss happens when the buffer is already fully occupied and there is a new packet to be served at the next time slot, in which case this packet is instantly dropped and considered lost due to buffer overflowing. The packet loss is then calculated as the ratio of the number of lost packets to total number of generated packets. The aggregation controller generates a packet for each different client with the same average traffic load. The destinations decided by the label pattern are chosen randomly between the two possible outputs (including multicast) according to a uniform distribution. Instead of using a fixed priority for the contention resolution algorithm at the optical switch, a round robin scheme is employed as priority policy to efficiently balance the utilization of the buffer and the latency between the two clients. This means that the priority will be assigned slot by slot. As a result, a packet in the buffer will be definitely sent to the proper destination within two time slots, and the respective buffer cell will be released. Figure 5.6 (a) shows the packet loss for different input loads and buffer sizes. The total amount of time considered is time slots. As expected the packet loss increases with the input load. Larger buffer size could improve the packet loss performance for input loads smaller than 0.7. Larger buffer capacity

87 fast optical switching system does not bring significant improvement when the load 0.7 because the buffer is always full and overflowing causing high packet loss. Figure 5.6 (b) presents the buffer occupancies when traffic load equals to 0.5, 0.6, 0.7 and 1, respectively. For the first 200 time slots, it is clear that for load of 1, the 16- packet buffer is rapidly filled up and for load of 0.7 the buffer is fully occupied most of the time. Figure 5.6: (a) Packet loss vs. load. (b) Buffer queue occupancy for different input load. (c) Average latency with buffer size of 16 packets. Average end-to-end latency for the system with a buffer size of 16 packets is reported in Figure 5.6 (c). The number of transmitted packets, retransmissions and slots spent in the buffer for each packet are recorded and employed to calculate the average latency. The lost packets are not considered in the latency calculation. Similarly to the packet loss curves, the average latency increases with the load. As the traffic becomes heavier especially when the load is higher than 0.6, the possibilities of contention increase rapidly which results in more retransmissions and longer waiting time in the buffer, and thus larger latencies. From Figure 5.6 it can be concluded that a packet loss ratio lower than and an average end-to-end latency less than 520 ns (including 250 ns offset for 2 25 m transmission link) could be achieved under relatively high traffic load

88 4 4 fast optical switch prototype and performance assessments 71 of 0.5 and the buffer capacity of 16 packets. These results are in agreement with the numerical studies conducted to investigate the scalability to hundreds of port-count [128]. In the real-time operation, the FPGA-based switch controller also counts the number of received packets, transmitted packets and retransmissions (collisions) due to contention. The values are stored in the registers which can be accessed by the SDN controller through the OF-agent. In Figure 5.7, the captured OpenFlow message carrying the statistics is illustrated. Besides the notification to the SDN controller, the agent also reports the statistics for monitoring purpose Scalability Figure 5.7: OpenFlow message and agent report for statistics. In this section we first experimentally investigate the switching performance with scaling to a large port-count. Due to the parallel modular architecture shown in Figure 2.1, the performance assessment of the system can be investigated by considering the performance assessment of one 1 F optical switch module. From the performance point of view, scaling the fast optical switch is mostly limited by the splitting loss experienced by the payload during the 1 F broadcast-and-select stage. Here we employ a variable optical attenuator (VOA) to emulate the splitting losses of the broadcasting coupler, as schematically reported in Figure 5.8 (a). The packetized 40 Gb/s NRZ-OOK payload carrying PRBS is sent to the 1 F optical switch combined with

89 fast optical switching system the optical label. After the label being filtered out, the BER and the OSNR of the payload are measured at the output of the switch. Figure 5.8: (a) Set-up for scalability investigation. (b) Gain characteristic with bias current of the SOA. (c) Penalty and OSNR vs. scale of 1 F switch. The input optical power of the 1 F optical switch is 0 dbm and the attenuation caused by the VOA is set to be 3 log 2 F (db). The SOA gate is switched on to forward the packet, and in the meantime to amplify the signal. The current applied to the SOA has been also varied to investigate the effect on the OSNR degradation. Figure 5.8 (b) gives the gain characteristic versus bias current of the SOA from which we could see that the SOA operates transparently at 30 ma and 18 db amplification could be supplied when biased at 70 ma. Considering the splitting loss, the SOA could compensate the 18 db loss caused by the 1 64 broadcast stage resulting in a lossless 1 64 optical switch. Figure 5.8 (c) shows the power penalty (measured at BER= ), and the OSNR of the switched output as a function of F for different SOA bias currents. A power penalty of < 1.5 db for F up to 64 is measured regardless of the bias current of the SOA. For N > 64 the penalty is mainly caused by the deterioration of OSNR as a result of increased splitting loss. The results indicate that the port-count of the fast optical switch could be potentially scaled up to a large number of ports at the expense of limited extra penalty. In addition, by increasing the bias current of the SOA, a lossless system could be achieved without extra amplification.
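The loss budget behind these numbers can be checked directly: the broadcast stage of a 1×F module introduces roughly 3·log2(F) dB of splitting loss, so the 18 dB SOA gain measured at 70 mA keeps the module lossless up to F = 64. The short calculation below assumes an ideal splitter and neglects excess loss.

```python
# Quick check of the loss budget discussed above: the broadcast stage of a 1xF
# module costs about 3*log2(F) dB, and the SOA gate provides up to ~18 dB of
# gain at 70 mA, so the switch stays lossless up to F = 64. The 3 dB/stage
# figure treats the splitter as ideal; excess loss is neglected in this sketch.
import math

SOA_GAIN_DB = 18.0   # measured gain at 70 mA bias (Figure 5.8 (b))

for F in (2, 4, 8, 16, 32, 64, 128):
    split_loss_db = 3 * math.log2(F)
    net_db = SOA_GAIN_DB - split_loss_db
    print(f"F={F:4d}  splitting loss={split_loss_db:5.1f} dB  "
          f"net gain={net_db:+5.1f} dB  {'lossless' if net_db >= 0 else 'net loss'}")
```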

90 4 4 fast optical switch prototype and performance assessments 73 Then the scalability to a large port-count as a function of traffic load is investigated by evaluating the packet loss ratio and latency performance of the switching system. Different numbers of modules (F) and wavelength channels (M) are considered, with F = M = 2, 4, 8, 16, 32 and 64, which correspond to a total port-count of 4, 16, 64, 256, 1024 and The distance between the ToRs and fast optical switch is set to 50 m and the data-rate at each input is 40 Gb/s. The round-trip time is set to 560 ns including the propagation delay and label processing time. The buffer capacity of each wavelength channel is 20 kb and a realistic traffic pattern has been employed here [128]. The simulation results on the packet loss ratio and latency are reported in Figure 5.9. Figure 5.9: (a) Packet loss ratio and (b) latency when scaling the port-count. The contention probability at the switch output increases with the scaling of the port-count. This translates into more packet retransmissions and higher buffer occupancy, compromising the packet loss performance of the system, as shown in Figure 5.9 (a). Similar behavior is also found in the latency performance as illustrated in Figure 5.9 (b), due to the retransmissions and waiting time in the buffer. At load of 0.35, lower than packet loss ratio and 1000 ns latency have been achieved for the port-count of Further increase to 4096 ports only leads to slight performance degradation. The results on the physical layer switching performance and statistic investigation confirm that the fast optical switch under investigation can potentially scale to large port-count. 5.2 Prototype implementation Based on the design scheme in Figure 5.1 and preliminary implementation results shown in Section 2.2, the 4 4 fast optical switch prototype has been finalized and the implementation details are described in this section.

Label processor

The RF-tone labeling technique is used to transmit the label signal efficiently while preserving instant parallel processing, which significantly reduces the latency. The label processor detects the optical label signal and converts it into digital label bits. In the prototype, the performance of the label processor depends significantly on the RF electronics, which therefore require a dedicated design. Compared with the single module introduced in Section 2.2, the label processor in the prototype offers more complete functionality. First, it groups all four processing modules for the four input channels on the same PCB board. Secondly, it also includes the optical-to-electrical (O/E) interfaces, so that the extracted optical label can be plugged directly into the PIN photodiode (PD) on the board. Moreover, the detected baseband analog label signals are further digitized so that the output of the label processor can be directly interfaced to the FPGA switch controller.

Figure 5.10: PCB board of the label processor.

The fabricated label processor PCB board is shown in Figure 5.10. It is capable of processing 4 optical label signals in parallel, each carrying an 8-bit RF-tone label. As an example, Module 1 is highlighted and the functional blocks for processing one bit are depicted. The PD first converts the optical RF-tone label into an electrical signal. A band-pass filter (BPF) then separates each tone from the others, and the envelope detector (ED) in

92 4 4 fast optical switch prototype and performance assessments 75 combination with impedance matching circuit demodulates the baseband label bit. At the shaping stage which consists of an amplifier and a compactor, the analog signal will be digitalized and output to the interface with the FPGA. The compact design greatly simplifies the interfaces and saves the needed space SOA/DML driver After the label bits are detected, the FPGA performs the contention resolution and controls the SOA gates. For a 4 4 fast optical switch, 8 SOAs are needed to compose four 1 2 switches. The discrete SOAs are mounted on the 14-pin butterfly socket in the driver board. Coarse temperature control (± 1 C ) and adjustable driving current are provided. The on/off of the current is controlled by the FPGA switch controller. Besides the 8 SOAs, there are 4 DMLs to generate the optical flow control signal. As the wavelength of the DML should match with the WDM channel, fine temperature control (± 0.01 C ) is provided to stabilize the output wavelength. The fabricated SOA driver board is shown in Figure 5.11 in which the function modules are highlighted. Figure 5.11: PCB board of the SOA driver Implemented prototype and power consumption The label processor and SOA driver boards are soldered with the components, and the optical switch prototype is assembled in a 2U rack case by integrating the FPGA switch controller, passive optical components and power supply. Figure 5.12 shows the photos of the final prototype. The FPGA switch

93 76 Prototype implementation controller receives the digital label bits from the label processor and resolves the contention. It provides the control signal to SOA driver to forward or block the packets. Moreover, the FPGA is responsible of communicating with the SDN controller through the OF-agent to update the stored LUT and report the collected statistics. Passive optical devices (circulators, couplers, FBGs, etc.) are arranged in a separate space. The prototype have utilized in other experimental validation tasks as will be discussed in the following sections. Figure 5.12: Optical switch prototype. Considering the power consumption of the prototype, Table 5.1 summarizes the contribution of each sub-module. The FPGA switch controller has been implemented by a development board from NALLATECH [129], which integrates many hardware components providing unnecessary functionalities. By using a customized FPGA board or simplified microcontroller board, the switch control including label detection, contention resolution, SOA/DML control, and communication with OF-agent could be realized with less power consumption. The power consumed by the label processor is mainly due to the RF amplifiers. The values for the SOA/DML driver are tested when 70 ma current is driven continuously. During the switching operation, only the SOAs need to forward packets and DMLs need to transmit ACKs would be switched on, in which case the power consumption is lower than the maximum values shown in the table. Average energy consumption is estimated for the temperature control, which actually depends on the operation driving current, setting temperature and environment temperature. Table 5.1: Contribution for power consumption

94 4 4 fast optical switch prototype and performance assessments 77 Sub-module Power/unit (W) Quantity Power (W) FPGA controller Label processor SOA driver DML driver Temperature control (coarse) Temperature control (fine) Total 38 W In total, the 4 4 fast optical switch prototype has a power consumption of 38 W. This value could be maintained even when handling traffic with higher datarate and higher-order modulation formats, benefited from the transparent optical switching. Due to the modular structure, scaling to a larger number of ports would cause a linear increase of the power consumption. Besides the employment of optimized switch controller, further improvement could be achieved by using photonic integrated circuit (PIC) and application-specific integrated circuit (ASIC) for the label processing as well as the switching control. 5.3 Performance assessment with burst-mode receiver Although data center network (DCN) is a closed environment with controlled optical link length, in the optical DCN, the receiver at the edge node (i.e., ToR) should be able to handle the packets with length ranging from sub-microseconds to tens of microseconds. Especially in a fast optical switch based network where packets from different edge nodes are statistically multiplexed, moderate variation in optical power levels, phase synchronization, and clock are happening packet by packet. Moreover, signal impairments due to the SOA switches such as pattern dependent amplification, may affect the performance of the receiver. Therefore, for proper detection of the received data traffic in practical implementations, fast optical switch based DCN requires burst mode receivers (BM-RXs). Typical BM-RX includes several functions such as fast automatic gain control (AGC) and decision threshold extraction, clock and phase synchronization [130]. Each of those functions contributes, with a different overhead, to the overall BM-RX operational time. This time determines the minimum length of the preamble and the packet guard-time to properly detect the signals. From a network point of view, minimizing the preamble and packet

95 78 Performance assessment with burst-mode receiver guard-time (packet overhead) would result in higher throughput and lower latency. This is especially important in an intra-dc scenario where many applications produce short sub-microseconds traffic flows. Figure 5.13: Experimental set-up for evaluation with burst-mode receiver. To investigate the individual time contributions of the AGC and phase synchronization as well as the switching performance when cooperating with a 10 Gb/s BM-RX, the fast optical switch prototype depicted in Section 5.2 is utilized in the experimental set-up shown in Figure Input packets at different input ports are switched by the fast optical switch node according to the attached optical label. Each data packet starts with a sequence of 1010 as preamble. A certain guard-time is placed in between consecutive packets. At the output, the switched packets are detected by the BM-RX. It consists of a BM trans-impedance amplifier (BM-TIA) featured with fast gain setting and a BM limiting-amplifier (BM-LA) recovering the amplitude [131]. A reset signal is applied externally in the experiments. Oscilloscope and BER tester are used for qualitative and quantitative evaluation of the detected switched packets. Three operational cases have been considered to experimentally evaluate the required preamble length including the effects caused by the power variation, guard-time and SOA impairments. The 10Gb/s NRZ OOK packets are generated at λ 1 = nm and λ 2 = nm consisting of 1200 ns (1500 Bytes) payload Case I minimum preamble length Case I determines the preamble as a function of the dynamic power range and guard-time. In the set-up shown in Figure 5.13, packets at input Port 1 are switched to output Port 1 and Port 3 at a certain power level. There should be enough preamble length and guard-time for the proper operation of BM-RX and at the same time, without compromising the throughput and latency performance. The preamble length is investigated as a function of received

96 4 4 fast optical switch prototype and performance assessments 79 optical power and guard-time and it would also provide the information on the minimum guard-time that could be placed in between the packets. Figure 5.14: Waveforms detected by BM-RXs. (b) Minimum preamble length vs. input power at different guard-time. The waveforms of the input packets, the switched packets at output Port 1 and Port 3, and the detected packets by the BM-RXs are shown in Figure 5.14 (a). The zoom-in at the starting and ending of a packet provides a better vision of the preamble and the clear payload bits indicates correct amplitude recovery. Original empty switched time slot are then filled with false transient signals due to high gain of the BM-LA. These results have been obtained for 50 ns guardtime and -22 dbm input power. BER curves for back-to-back signal and switched packets after BM-RX are reported in Figure Error free operation has been obtained for both output ports with 1 db power penalty. In the next experiment, the guard-time and the optical power of the packets are varied to investigate the required preamble length that guarantees the BER of The guard-times considered are 25 ns, 50 ns, 100 ns, 200 ns, and 1000 ns. The optical power varies from -21 dbm to -13 dbm. The preamble length is optimized with 1 ns step. Figure 5.14 (b) indicates that input optical power ranging from -21 dbm to -13 dbm ensures a dynamic range larger than 8 db for the BM-RX with several nanoseconds decrease of the preamble. For a guardtime of 25 ns, a preamble of 25.6 ns could guarantee BER < Larger guard-time slightly reduces the required preamble because the charges at parasitic nodes in the circuits will be fully discharged. It also illustrates that the BM-RX would function properly after long empty packets sequence, as in the case of low traffic load.
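Using the Case I numbers, the impact of preamble and guard-time on the usable throughput can be quantified with a one-line efficiency formula; the estimate below counts only payload, preamble and guard-time and ignores any additional sync or key characters.

```python
# Framing-efficiency sketch using the Case I numbers: 1200 ns payload,
# 25.6 ns preamble, 25 ns guard-time. Only these three contributions are
# counted; additional sync/key characters are ignored in this estimate.
# The 150 ns preamble entry anticipates the BM-CDR case discussed below.

def framing_efficiency(payload_ns, preamble_ns, guard_ns):
    return payload_ns / (payload_ns + preamble_ns + guard_ns)

payload = 1200.0
for preamble, guard in [(25.6, 25.0), (25.6, 1000.0), (150.0, 50.0)]:
    eff = framing_efficiency(payload, preamble, guard)
    print(f"preamble={preamble:7.1f} ns  guard={guard:7.1f} ns  "
          f"efficiency={eff:.1%}  effective rate at 10 Gb/s ≈ {10*eff:.2f} Gb/s")
```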

Figure 5.15: BER curves for B2B and 3 cases.

Case II - asynchronous packets

Case II investigates the capability of the BM-RX to detect asynchronous switched packets. Packet flows 2 and 3 at the input Port 3 and Port 4 in the set-up shown in Figure 5.13 represent the situation in which asynchronous packets with different wavelengths and power levels (representing packets that experience different link distances) are forwarded by the fast optical switch to the same output Port 2. As one of the key functions of the BM-RX, the gain and threshold should settle quickly to equalize the power fluctuation of the incoming packets.

Figure 5.16: Waveforms of asynchronous packets equalized by the BM-RX.

The waveforms of asynchronous packets at different wavelengths and optical power detected by the BM-RX are shown in Figure 5.16.

Fast AGC is of great significance to guarantee that the BM-RX can handle the power fluctuations that occur in the optical DCN. The BM-RX output trace shows that the power levels of the two flows are equalized. The zoom-in also indicates the successful recovery of both asynchronous flows coming from different sources with different power levels. The false transient signal caused by the reset settling of the BM-TIA has also been observed in the guard-time. BER curves for the two asynchronous packets are plotted in Figure 5.15. A power penalty of 1 dB due to the switching operation has been measured.

Case III - clock data recovery

Case III investigates the preamble as a function of the clock data recovery (CDR) locking time. After the amplitude recovery conducted by the BM-RX, the clock phase alignment should also be realized by a BM-CDR. The employed BM-CDR is a fast-lock phase-locked loop (PLL)-based CDR which is AC coupled to the BM-RX with a time constant of ~100 ns. The preamble length is therefore increased to ~150 ns, mainly in line with the time constant of the CDR settling time. The output waveform of the BM-CDR is reported in Figure 5.17, which clearly shows the transient response after the AC coupling. The BER curves as a function of different preamble lengths are also shown in Figure 5.15. A preamble length shorter than 128 ns is not sufficient for proper CDR locking, with an error floor existing in the BER curve. For preamble lengths longer than ns (including the 25.6 ns due to the BM-RX), error-free operation is achieved with 1 dB penalty with respect to the B2B signal. To further improve performance, a gated voltage-controlled oscillator (VCO) based CDR or an over-sampling based CDR would be a better solution to ultimately decrease the preamble contributed by the BM-CDR [130].

Figure 5.17: Waveform of BM-CDR output.

99 82 LIGHTNESS testbed and final demonstration 5.4 LIGHTNESS testbed and final demonstration As the main task of the LIGHTNESS project, the final testbed integrating the data plane and control plane has been built and the overall testbed set-up is illustrated in Figure 5.18, which features with a fully SDN-programmable intra- DCN architecture including network function virtualization capabilities. The 4 4 fast optical switch prototype has been employed as the optical packet switch (OPS) node to handle the short-lived bursty traffic flows. In the data plane, the programmable network interface card (NIC) is plugged into the server, and all the traffic from the same rack are groomed by the optical ToR switch and further interconnected to the optical backplane with on-demand OPS and OCS switching capabilities. The SDN-based DCN control plane builds and reconfigures the physical layer topology, by dynamically provisioning appropriate cross-connections in the optical data plane to match the requirements of various applications. The SDN agents are employed to supply the information translation between the network elements (Programmable NIC, ToR switch, OPS, OCS) and the OpenDaylight SDN controller. The powerful SDN control plane enables the virtualization of computing and network resources creating a virtual data center (VDC) and virtual network functions (VNF) on top of the data plane. The VDC is a pool of virtual compute, memory, storage and virtual network resources abstracted from the physical layer. The construction and reconfiguration of multiple VDCs are supported by the LIGHTNESS architecture. Figure 5.18: Set-up for LIGHTNESS final demonstration.

100 4 4 fast optical switch prototype and performance assessments Testbed set-up The testbed is composed of a set of interconnected hardware equipment. Four rack-mounted PowerEdge T630 servers are equipped with an FPGA-based NIC board utilizing 10G SFP+ transceivers, providing an interface to the optical network. The FPGA-based NIC using NETFPGA SUME development board has been designed to plug directly into a server, and replaces the traditional NIC. In the prototype design, it has an 8-lane Gen3 PCIe interface for DRAM communication, one 10 Gb/s interface for getting commands from SDN control agent and sending feedback, two OPS/OCS hybrid 10 Gb/s ports for interrack/cluster communication and an OPS label pin interface connecting to the optical label generator. All servers are connected to a port Polatis circuit switch, which acts as a ToR switch and as an optical backplane on top of each cluster. A 1 4 optical power splitter, two 1 4 wavelength selective switches (WSS) and the 4 4 fast optical switch prototype for OPS function, which logically performs as two 2 2 OPS switches (as shown in the set-up in Figure 5.18), are attached to the optical backplane. The splitter is used to accomplish OCS broadcasting scenarios and the WSS is for grooming inter-cluster channels from different servers/rack to the destination cluster. The OPS switch is able to rapidly switch optical packets with a reconfiguration time of 20 ns. The label bits are generated by each NIC, then an optical RF tone label is created by the prototyped label generator (detailed in Section 2.2.3) and attached to each optical packet. At the OPS, the label is extracted, processed and matched with the stored LUT by the switch controller to determine the packets destination. As shown in the time traces of Figure 5.19, depending on the combination of the values of the optical labels, different (or no) outputs are activated for each 38.4 µs time slot. Multicasting is enabled when two label bits have been set as 11. Figure 5.19: Time traces of labels and OPS switch outputs. For this experiment, OpenDaylight (ODL) is used as the SDN controller, and OF agents for Polatis, OPS switch, WSS and hybrid OPS/OCS NIC were

101 84 LIGHTNESS testbed and final demonstration developed to enable the SDN-based programmability, as shown in Figure Also, ODL internal software modules are extended to support some network device specific features. For example, regarding the OPS and WSS ports, the switch manager and service abstract layer were extended to record the supported wavelength and supported spectrum range, respectively, both of which are used to validate the configuration. Furthermore, as previously reported in Section 4.3.3, the transmitted optical packet statistics can be collected and maintained by the statistics manager. In order to properly configure the above optical devices, the forwarding rules manager has been extended to construct the required set of configuration information, e.g., label and output for the OPS switch; central frequency, bandwidth and output for the WSS and label and output for the NIC which is required for OPS operation. For OPS connection establishment, it takes around 420 ms for the configuration, including 210 ms for ODL controller to process the requests, 200 ms for the OF configuration commands to reach the OF-agent and 10 ms for the OF-agent configuring the FPGA-based switch controller VDC management and applications Figure 5.20: VDC planner application with OPS or OCS multicasting options. For the purposes of the experimental tests, two control plane applications have been implemented and deployed on top of the ODL: a VDC composition and a monitoring virtual network function (monitoring VNF). The user/vdc management deploys the VDC planner to dynamically provision virtual network slices within the DCN enabling thus multi-tenant data centers. It consists of a graphical user interface (GUI) interacting with the ODL controller. The user can access to the GUI with any existing browser and dynamically create a VDC request, as depicted in Figure The parameters that can be specified are: i)

virtual machines (VMs) to be used in the virtual slice; ii) virtual links to be created to connect those VMs; iii) technology (OPS or OCS) for each link; iv) multicast properties for VMs and virtual links. According to the request, the VDC planner generates flows to be programmed into the network, distributed among the different technologies (NIC, OPS, OCS and backplane).

Figure 5.21: Monitoring VNF. Monitored power of the Polatis ports associated to the output of the OPS (top) and monitored packet rate, threshold and theoretical maximum (bottom).

When a VDC has been deployed (e.g., a multicast VDC using OPS resources), the user can request a dynamic VNF to monitor various VDC parameters. The request is handled by a network function virtualization (NFV) server that creates the monitoring function. The created VNF retrieves network information via two different interfaces: the RESTful northbound interface to the ODL controller, to gather OPS packet statistics and generate a graph of the OPS packet rate; and the backplane (Polatis power monitoring), to monitor the power of every link and attached device. Figure 5.21 shows a screenshot of the VDC monitoring function, where the information is plotted in a web interface showing two graphs: the optical power received on two multicast ports (in dBm), and the packet rate in the OPS switch, together with a theoretical maximum and a configurable threshold. The monitoring VNF allows

the VDC tenant to take two different recovery choices based on the monitored data: (i) switch the multicast traffic to a second OPS switch, by making a backup copy of the content between two servers inside the same rack, when the detected optical power drops below an expected value; and (ii) switch the multicast traffic to the optical splitter (OCS multicasting) whenever the OPS packet rate exceeds the threshold, meaning that the OPS resources are not sufficient to cope with the desired VDC service.

OPS/OCS operation

Performance results have been collected to validate several interconnection scenarios based on the requests of the VDC applications and NFV functions. First of all, we evaluated the DCN physical layer for intra-rack, inter-rack and inter-cluster unicast and multicast communication by measuring the BER for both OPS and OCS switching technologies, using real traffic with scrambled PRBS payload from the traffic analyzer. The traffic analyzer feeds the FPGA-based NIC with 10 Gb/s Ethernet traffic, and the NIC then pushes the data to one of its hybrid OCS/OPS ports. When the OPS mode is chosen, the NIC, depending on the configuration received from the SDN controller, sets up the optical packet duration, encapsulates a certain number of Ethernet frames and releases the optical packet, while the label is generated and combined in parallel. The OPS BER test results plotted in Figure 5.22 (a) show 1 dB and 3 dB penalties when passing through one (for intra-cluster) and two (for inter-cluster) switches, respectively. Penalties of less than 2 dB are observed for the OCS interconnection scenarios, as reported in Figure 5.22 (b).

Figure 5.22: (a) BER curves for intra-/inter-cluster traffic through WSS and one/two OPS switches. (b) BER curves for intra-/inter-rack and inter-cluster traffic.
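Returning to the monitoring VNF described above, the two recovery choices can be summarised as a simple threshold policy. The sketch below is only an illustration of that policy; the function names and numerical thresholds are assumptions and are not part of the LIGHTNESS implementation.

```python
# Hedged sketch of the monitoring-VNF switch-over policy described above.
# Threshold values and function names are illustrative placeholders.

POWER_THRESHOLD_DBM = -15.0   # assumed minimum acceptable port power
RATE_FRACTION = 0.9           # assumed fraction of the theoretical maximum rate

def recovery_decision(port_power_dbm: float, ops_rate: float, ops_rate_max: float) -> str:
    """Return the recovery action for one monitoring interval."""
    if port_power_dbm < POWER_THRESHOLD_DBM:
        # (i) optical power dropped: back up the content and use a second OPS switch
        return "switch_multicast_to_backup_OPS"
    if ops_rate > RATE_FRACTION * ops_rate_max:
        # (ii) OPS packet rate exceeds the threshold: fall back to OCS multicasting
        return "switch_multicast_to_OCS_splitter"
    return "keep_current_VDC_configuration"

# Example: a low received power triggers the back-up OPS path.
print(recovery_decision(port_power_dbm=-18.2, ops_rate=4.0e6, ops_rate_max=1.0e7))
```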

In addition to the BER test of the physical connections, we investigated the network Layer 2 results regarding the interconnection latency between two NICs. The tests and measurements are based on the best possible latency at maximum bit-rate, and all measured latency values include the FPGA processing time, which can vary depending on the implementation design. For the OPS switching scenarios, where Ethernet frames are not transmitted directly but aggregated into optical packets, a larger end-to-end latency of around 200 μs has been observed. The reason is mainly the aggregation, segregation and buffering operations required for the optical packets. In the FPGA, the segregation/aggregation part uses numerous FIFOs and store-and-forward techniques, which could be dramatically optimized with a dedicated design (e.g., a packet fragmentation ASIC [132]). Another contribution to the latency is the extra overhead caused by the long preamble key characters, which are inserted for burst-mode reception and clock synchronization. A preamble length of 25.6 μs has been used in the experiment, which could be drastically reduced to less than 150 ns by employing dedicated clock and data recovery circuits [133]. Therefore, much higher throughput can be achieved with a shorter packet length, which in turn significantly improves the latency performance by reducing the long aggregation/segregation and buffering times.

LIGHTNESS final demo

As the final assessment and validation step for the work carried out during the project, the LIGHTNESS final demonstration, joining the efforts of all the partners, was carried out at the exhibition of the 41st European Conference on Optical Communications (ECOC 2015) in Valencia, Spain. In the dedicated booth, the full LIGHTNESS system was integrated on site and successful live demonstrations were performed during the three-day event at ECOC. The LIGHTNESS demonstration presented the SDN-enabled and Programmable Optical Data Centre with OPS/OCS Multicast/Unicast Switch-over. The demonstration set-up is based on the testbed shown in Figure 5.18, and a photo of the booth is given in Figure 5.23. The scope of the demonstration was twofold: first, to show the programmable transport, switching and OpenFlow configuration of data flows over the LIGHTNESS hybrid optical flat DCN, with a full on-site integration of the LIGHTNESS extended OpenDaylight SDN controller and the optical data plane. As clearly shown in Figure 5.23, the data plane employs the fast optical switch prototype as OPS switch, the Polatis switch as backplane and OCS switch, programmable optical NICs and optical ToRs. The label generator used for interfacing the NIC in OPS operation is also highlighted. Second, on top of the OpenFlow-based provisioning features,

the demonstration showed applications for: i) on-demand VDC provisioning and reconfiguration, with the creation of multicast virtual slices using OPS resources; ii) a monitoring VNF to retrieve OPS statistics and OCS port power status, and automated OCS/OPS and multicast/unicast switch-over. The VDC composition and the monitoring VNF can be assessed through the developed GUI. The LIGHTNESS demo attracted a lot of interest from the audience, who provided good feedback on the overall achievements.

Figure 5.23: LIGHTNESS final demo at ECOC 2015.

Summary

The assessment of the 4×4 fast optical switch node has been performed in this chapter. Based on the optical flow control mechanism and the SDN-enabled control interface, the dynamic switching operation with LUT configuration and statistics reporting has first been demonstrated. Exploiting the highly distributed control, the RF tone in-band labeling technique and the nanosecond switching of the SOAs, a 20 ns reconfiguration time has been achieved. A packet loss ratio lower than and an average end-to-end latency less than 520 ns (including 250

ns offset for the 2 × 25 m transmission link) have been tested under a relatively high traffic load of 0.5 and a limited buffer capacity of 16 packets. Increasing the buffer size could improve the performance in terms of packet loss for load values smaller than 0.7. The investigation on the scalability indicates that scaling up to ports is possible with limited extra power penalty while maintaining the same latency performance. The amplification introduced by the SOA switch can compensate the splitting loss of the broadcast stage, resulting in a lossless fast optical switching system.

Then the work on prototyping the 4×4 fast optical switch node has been presented. Board-level integration of the label processor as well as the SOA/DML drivers has been implemented with proper interfaces to the FPGA-based switch controller, which also provides access to the OF agent for SDN-enabled control of LUT configuration and statistics collection. The power consumed by the prototype is mainly contributed by the FPGA board and the label processor, and a linear increase with the port-count scaling is expected. Incorporating a BM-RX, the developed 4×4 fast optical switch prototype has been experimentally investigated with respect to the preamble length required for packet-based operation. The results indicate that a preamble of 25.6 ns allows error-free operation of 10 Gb/s asynchronously switched packets with a 25 ns minimum guard-time. Deployment of a fast-lock PLL-based BM-CDR introduces an extra 128 ns of preamble. A gated-VCO based CDR or an oversampling based CDR would be a better solution to ultimately decrease the preamble contributed by the BM-CDR.

The 4×4 fast optical switch prototype has also been successfully employed as the OPS node in the final testbed and demonstration of the LIGHTNESS project. Owing to the prototyped label generator interfaced with the NIC, optical packets can be flexibly handled by the fast optical switch. The developed SDN-enabled control interface and the OF agent for the fast optical switch prototype facilitate the integration with the SDN control plane, based on which the VDC provisioning and reconfiguration as well as a monitoring VNF to retrieve OPS statistics have been demonstrated. With the integration of the extended SDN controller and the innovative optical data plane, LIGHTNESS provides an SDN-enabled and fully programmable optical DCN.


Chapter 6

OPSquare DCN based on flow-controlled fast optical switches

In this chapter, a novel all-optical DCN architecture, OPSquare, is presented and performance assessments by means of simulation as well as experiment are carried out⁴. OPSquare employs parallel intra-/inter-cluster switching networks, guaranteeing a flat topology. The fast flow-controlled optical switch nodes depicted in Chapter 3 are used as the switching elements, which allow for flexible switching capability in both the wavelength and time domains. Large interconnectivity can be achieved by utilizing moderate port-count optical switches with low broadcasting ratios, benefiting from the transceiver wavelength assignment implemented at the ToR for addressing the corresponding grouped racks. First, the packet loss, latency, throughput and scalability are numerically investigated under a realistic data center traffic model developed in OMNeT++ [134]. Then, the experimental evaluation of the OPSquare DCN employing 4×4 fast optical switch prototypes shows multi-path dynamic switching with flow control operation. The cases deploying 32×32 and 64×64 optical switches connecting 1024 and 4096 ToRs are emulated and limited performance degradation has been observed. The potential of switching traffic with higher-order modulation formats and waveband signals along with optical switch port-count scaling has been experimentally investigated, demonstrating the suitability of the OPSquare architecture for addressing the scaling challenges of traditional DCNs in terms of both network architecture and switching technology, by providing potentially Petabit/s capacity and low-latency switching capabilities.

In the following, the OPSquare architecture, including the operation of the ToR and the flow-controlled fast optical switches, is introduced in Section 6.1. Section 6.2 presents the simulation layout and the overall performance analyses under a realistic DC traffic pattern, followed by the comparison with current DCNs based on electrical switches in terms of latency and power consumption.

⁴ Parts of this chapter are based on the results published in [154].

In Section 6.3, the experimental demonstration exploiting the fast optical switch prototypes is reported. The capability of switching multi-level signals and waveband channels, with potential scalability of the optical switches to large port-counts, is discussed in Section 6.4.

6.1 OPSquare DCN architecture

The OPSquare flat DCN architecture under investigation is shown in Figure 6.1. It consists of N clusters and each cluster groups M racks. Each rack contains K servers interconnected via an electronic ToR switch. Two WDM bi-directional optical links are equipped at the ToR to access the parallel intra- and inter-cluster switching networks. As shown in Figure 6.1, the N M×M intra-cluster optical switches (ISs) and the M N×N inter-cluster optical switches (ESs) are dedicated to intra-cluster and inter-cluster communication, respectively. The i-th ES interconnects the i-th ToR of each cluster, with i = 1, 2, …, N. Single-hop direct interconnection is provided for ToRs within the same cluster, while at most two hops are sufficient to interconnect ToRs residing in different clusters. It is worth noticing that multiple paths are available for each pair of ToRs, increasing the network fault-tolerance. The number of interconnected ToRs (and servers) scales as N × M, so that by using moderate port-count ISs and ESs, up to 1024 ToRs (40960 servers in the case of 40 servers per rack) can be interconnected.

Figure 6.1: OPSquare DCN architecture built on fast optical switches.
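To make the N × M scaling concrete, the short sketch below only restates the dimensioning rules of this section for a given choice of N, M and K; the printed example is the 1024-ToR case quoted above.

```python
# Sketch of the OPSquare dimensioning rules described in this section.
def opsquare_dimensions(N: int, M: int, K: int) -> dict:
    """N clusters, M racks per cluster, K servers per rack."""
    tors = N * M                       # interconnected ToRs scale as N x M
    return {
        "ToRs": tors,
        "servers": tors * K,
        "intra-cluster switches (M x M)": N,   # one IS per cluster
        "inter-cluster switches (N x N)": M,   # one ES per ToR index
    }

# 1024-ToR example from the text: N = M = 32 and 40 servers per rack
# -> 1024 ToRs and 40960 servers using moderate 32x32 port-count switches.
print(opsquare_dimensions(N=32, M=32, K=40))
```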

ToR switch

The schematic showing the functional blocks of the ToR is presented in Figure 6.2. As the edge node of the optical DCN, the ToR consists of a server interface and a network interface. Part of the server traffic is exchanged between intra-rack servers (intra-rack traffic), and the rest is directed to servers in the same cluster (intra-cluster traffic) or in different clusters (inter-cluster traffic). When an electrical data packet from a server arrives at the ToR, the packet header carrying the destination information is checked by the head processor. For the intra-rack traffic, the ToR directly processes and forwards the traffic to the destination server. Intra-rack contention is solved by using K intra-rack buffer queues at the server interface.

Figure 6.2: Schematic of the ToR switch.

The network interface targets the intra-/inter-cluster communication. It consists of p WDM transceivers (TX and RX) with dedicated electronic buffers to interconnect the ToR to the IS optical switch through the optical flow-controlled link [126], while q WDM transceivers interconnect the ToR to the ES optical switch. The reason for employing multiple (p and q) WDM transceivers is twofold. First, WDM allows for scaling the communication bandwidth between the ToRs and the optical network by employing multiple wavelengths to generate a high-capacity channel (waveband super-channel). For instance, if an oversubscription of 1 is desired for the ToR grouping K = 40 servers each with

a bandwidth of B_SERV = 10 Gb/s, the capacity offered by the p + q WDM transceivers should be K × B_SERV = 400 Gb/s. The oversubscription is defined here as the ratio between the bidirectional aggregate bandwidth of the servers and the bidirectional capacity provisioned to the optical switching network. Multiple 25 Gb/s, 40 Gb/s or, in the near future, 100 Gb/s WDM transceivers can be tailored according to the intra-/inter-cluster traffic ratio. Second, each of the WDM transceivers is dedicated to the communication with a different group of ToRs. For the intra-cluster network, the M ToRs are thus divided into p groups and each group contains F = M/p ToRs. One of the p WDM TXs addresses F (instead of M) possible destination ToRs, in combination with the 1×F switch at the IS. As an example shown in Figure 6.2, with regard to ToR F+1 in Cluster 2, TX 1 at λ1 communicates with Group 1 (ToR 1, ToR 2, …, ToR F), TX 2 with Group 2 and TX p with Group p. The structure and operation of the inter-cluster interface are similar to the intra-cluster ones.

After being processed at the head processor, the inter-rack data packets are forwarded to one of the WDM interfaces according to the header destination. In particular, a data packet is stored in the buffer of TX i of the p WDM interfaces if the data is destined to the intra-cluster ToR Group i, or in the buffer of TX j of the q WDM interfaces if destined to the inter-cluster ToR Group j. An optical packet is then formed and a copy is sent to the destination ToR via the fast optical switch by the WDM TX. An optical label identifying the destination ToR is attached to the packet. The optical in-band RF tone labeling technique, which has been detailed in Section 2.2, is used to convert the digital label bits into the RF tone format. The schematic of the label generator (LG) and the prototyped 8-bit LG are presented in Figure 6.3. The RF tone label is carried by the in-band direct-modulated laser (DML), and will be extracted and processed at the fast optical switch node.

Figure 6.3: Label generator.
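The group-based forwarding rule just described (destination group selects the WDM TX buffer) can be written in a few lines. The indexing convention below (ToRs and groups numbered from 1) is an assumption for illustration only.

```python
# Sketch of the group-based TX selection at the ToR network interface.
def select_tx(dest_tor: int, M: int, p: int) -> int:
    """Return the index of the WDM TX (and group) serving an intra-cluster destination.

    The M ToRs of a cluster are divided into p groups of F = M // p ToRs each;
    TX_i addresses Group i, i.e. ToRs (i-1)*F + 1 .. i*F.
    """
    F = M // p
    return (dest_tor - 1) // F + 1

# Example with M = 32 ToRs per cluster and p = 4 intra-cluster TXs:
# ToR 9 belongs to Group 2, so its packets are queued in the buffer of TX_2.
print(select_tx(dest_tor=9, M=32, p=4))   # -> 2
```

The same rule applies to the q inter-cluster transceivers with F = N/q.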

Fast optical switch node

The schematic of the fast optical switch node acting as IS/ES is shown in Figure 6.4. The optical switching is realized by the SOA-based broadcast-and-select architecture, which has been explained in Chapter 2. In the OPSquare DCN, the different WDM TXs at the ToR are assigned to the communication with different groups of ToRs. As shown in Figure 6.4, the fast optical switch node has a modular structure and each module consists of F units, each of which handles the WDM traffic from one of the F ToRs in a single group. The WDM inputs are processed in parallel and the packets are directed to the output port according to the optical label. The label extractor (LE) separates the optical label from the payload. The extracted label is processed on-the-fly by the switch controller, while the payload is fed into the 1×F switch. The operation of the LE and the 1×F switch has been explained in Chapter 2. The fixed fiber delay line (FDL) is used to match the switching time while the label is processed. According to the label information, the switch controller enables the SOA gates of the 1×F switch. The SOA has a fast, nanosecond switching speed and can provide optical amplification to compensate the splitting losses caused by the broadcasting architecture.

Figure 6.4: Flow-controlled fast optical switch node.
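The gate-enable logic described above can be illustrated as follows. A simple bitmask label encoding (one bit per output, which also allows multicast) is assumed here purely for illustration; the actual RF-tone label format used by the prototype is described in Chapter 2.

```python
# Hedged sketch of the 1xF switch control: the recovered label bits select which
# SOA gates the switch controller enables. The bitmask encoding is an assumption.

def soa_gates(label_bits: str, F: int) -> list[bool]:
    """Return the enable signal for each of the F SOA gates of one 1xF switch."""
    assert len(label_bits) == F
    return [bit == "1" for bit in label_bits]

# Example for F = 4: label "0100" forwards the packet to output 2 only,
# while "0110" would duplicate it to outputs 2 and 3 (multicast).
print(soa_gates("0100", F=4))   # -> [False, True, False, False]
```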

It is worth noticing that, benefiting from the WDM TX wavelength assignment for the grouped ToRs and the wavelength switching capability enabled by the optical switch node, the splitting loss of the broadcast-and-select switch is 3·log2(F) dB (with F = M/p for the IS switch and F = N/q for the ES switch), which is much less than the 3·log2(M) dB and 3·log2(N) dB required in the switch architecture reported in [135]. Lower splitting losses lead to less OSNR degradation and a significant improvement of the scalability and feasibility of the network.

The WDM optical switch works in a synchronous manner and, owing to the modular structure, only the ToRs in the same group need to be synchronized to make sure that the packets arrive at the switch module at the same time. The synchronization among the groups can be realized through coordination by the optical switch controller or the control plane. The traffic coming from the same group and addressing the same destination ToR is multiplexed by an F-port AWG and received by one of the p RX links. Compared with an F×1 coupler, the AWG has much lower losses, especially when F scales up. Figure 6.4 depicts the connection of AWG 1 of each unit, which corresponds to the traffic from Group 1. To be combined by the output AWG, the wavelengths from different ToRs in the same group should not overlap with each other, and therefore a simple round-robin mapping has been applied, as defined in Figure 6.5. Two different cases, p larger than or equal to F, and p smaller than F, are considered. For ToR 1, ToR 2, …, ToR F, the wavelengths used to address the same destination group differ from each other and can thus be multiplexed by the output AWG. When p is equal to or larger than F, each output AWG groups F out of the p wavelength channels. In case p is smaller than F, to avoid wavelength repetition at the F-port AWGs, F wavelength channels in total are utilized in the system and each ToR selects p of them, again arranged according to the round-robin mapping.

Figure 6.5: Wavelength mapping rule.
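The mapping of Figure 6.5 can be summarised with a simple rule. The sketch below is one possible formalisation of the round-robin assignment (wavelength indices starting at 1 are an assumption, and the exact indexing in the thesis figure may differ); it also evaluates the 3·log2(F) dB splitting-loss estimate given above.

```python
import math

# One possible formalisation of the round-robin wavelength mapping of Figure 6.5.
# W = max(p, F) wavelengths are used, so the F ToRs of a group never reuse a
# wavelength towards the same destination group and can be multiplexed by the AWG.

def tx_wavelength(tor: int, group: int, p: int, F: int) -> int:
    """Wavelength index used by ToR `tor` (1..F) towards destination group `group` (1..p)."""
    W = max(p, F)
    return (tor - 1 + group - 1) % W + 1

def splitting_loss_db(F: int) -> float:
    """Splitting loss of the 1xF broadcast stage, 3 * log2(F) dB."""
    return 3 * math.log2(F)

# Example: p = 4 TXs per ToR and groups of F = 8 ToRs (i.e. a 32x32 IS).
# No two ToRs of the group use the same wavelength towards Group 1:
print([tx_wavelength(t, group=1, p=4, F=8) for t in range(1, 9)])  # 1..8, all distinct
print(round(splitting_loss_db(8), 1), "dB")                        # -> 9.0 dB
```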

Due to the statistical multiplexing, contention may happen at the IS output because of time-domain collision of the traffic coming from different units. Therefore, different classes of priority can be applied to guarantee the successful forwarding of the traffic flows with more stringent QoS requirements. The control plane defines the priority by updating the look-up table in the switch controller. The packet with the highest priority is forwarded to one or several of the F output ports even in the case of contention. The implemented flow control helps to avoid the loss of the blocked packets. As described in Chapter 3, signals indicating the successful forwarding or blocking (ACK or NACK) are generated by controlling a DML and sent back to the ToRs within the same WDM channel. According to the received flow control signal, the flow controller at the ToR releases the packets stored in the buffers (for ACK) or triggers the retransmission of the blocked packets (for NACK).

6.2 Numerical investigation

Simulation set-up

To fully investigate the performance of the OPSquare architecture, the OMNeT++ network simulation framework has been employed to model the DCN operation. 40 servers per rack, each with a 10 Gb/s uplink, are programmed to create data packets of variable length at a certain load. A data-center-like traffic pattern is generated and Figure 6.6 (a) illustrates the cumulative distribution function (CDF) and the histogram of the packet length generated during the simulations [14]. Packet arrival times are modeled with ON/OFF periods (with/without data packet transmission), matching the traffic behavior found in DCs [14]. The CDFs of both these periods are presented in Figure 6.6 (b). The ON periods follow the same length distribution regardless of the simulated input traffic load value, while the OFF period length is adjusted according to the chosen simulation load value. We assume that the traffic in the network is equally spread over all the servers. All the packets in every transmission (ON) period are randomly sent to one of the possible destination servers. The round-trip time (RTT) between the ToR and the optical switch node is 560 ns, which includes transmission over the 2 × 50 m distance between the optical switch and the ToRs and a 60 ns delay caused by the label processor as well as the flow control operation [126]. In all simulations, each server transmits an average of data packets, which corresponds to around optical packet time-slots at a load of 0.1. The ToRs in the same group are synchronized to transmit the optical packets at the same time. In practice, the ToRs from the same group could be synchronized, for instance, by a distributed clock.
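A minimal sketch of such an ON/OFF source is given below. The exponential period lengths and all numeric values are illustrative assumptions; the thesis model draws the ON/OFF durations from the measured CDFs of [14], with the OFF duration adjusted to set the offered load.

```python
import random

# Minimal, hedged sketch of an ON/OFF packet source (illustrative distributions only).
def on_off_source(load: float, mean_on_us: float = 10.0, seed: int = 1):
    """Yield (time_us, link_is_on) transitions for one 10 Gb/s server link."""
    rng = random.Random(seed)
    # Longer OFF periods -> lower offered load: mean_on / (mean_on + mean_off) = load.
    mean_off_us = mean_on_us * (1.0 - load) / load
    t = 0.0
    while True:
        on = rng.expovariate(1.0 / mean_on_us)    # packets are generated during ON
        off = rng.expovariate(1.0 / mean_off_us)  # link idles during OFF
        yield (t, True)
        t += on
        yield (t, False)
        t += off

gen = on_off_source(load=0.3)
print([next(gen) for _ in range(4)])   # first two ON/OFF cycles
```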

Figure 6.6: (a) CDF and histogram of the packet length. (b) CDF of ON/OFF periods.

Considering that most of the traffic resides inside the cluster, four transceivers have been assigned to the IS and one to the ES (p = 4 and q = 1). For the inter-rack communication, the data packets are forwarded to the ports associated with the intra-/inter-cluster network interface and stored in 64-byte buffer cells. The number of cells N_cell that a data packet occupies is given by

N_cell = ⌈L_packet / 64⌉    (6.1)

where L_packet is the length of the packet in bytes. Five cells with the same destination are aggregated to compose a 320-byte optical packet to be transmitted in the fixed 51.2 ns time-slot. The delays caused by the head processing and the buffering at the ToR input are taken as 80 ns and 51.2 ns, respectively. With each ToR grouping K = 40 servers, each with a bandwidth of B_SERV = 10 Gb/s, and each transceiver operating at D_rate = 50 Gb/s, the oversubscription (OV) at the ToR level is

OV = (K × B_SERV × 2) / ((p + q) × D_rate × 2) = 1.6:1    (6.2)

Traffic distribution

First we analyze the performance under different intra-/inter-cluster traffic ratios. An OPSquare DCN consisting of 256 racks (10240 servers) organized in

clusters has been considered. Optical switches are employed as ISs and ESs to interconnect the ToRs. Half of the total traffic is assigned as intra-ToR and the other half as intra-/inter-cluster. Two traffic cases have been studied: 1) 37.5% intra-cluster traffic and 12.5% inter-cluster traffic (3:1 case); 2) 40% intra-cluster traffic and 10% inter-cluster traffic (4:1 case). Figure 6.7 (a) shows the results for the packet loss and the server end-to-end latency as a function of the load. The size of the buffer at each TX is 20 KB. As expected, the packet loss and latency performance experience degradation when the load increases, due to the contention and the high occupancy of the buffer. It can be seen that the two different intra-/inter-cluster traffic ratios give similar performance. Lower than 2 µs server end-to-end latency and packet loss ratio are achieved at a load of 0.3. The latency is mainly contributed by the processing and buffering time at the ToR, the propagation delay (560 ns for intra-cluster traffic and 1.12 µs for inter-cluster traffic) and the retransmissions of the blocked data packets. Considering that the traffic in data centers does not exceed 30% of the maximum network capacity for most of the time, as indicated in [35, 38], OPSquare can effectively handle the data center traffic by using moderate port-count optical switches. The performance can be further improved by employing bundles of parallel fibers and multiple receivers at the ToR switch, in which case more packets can be forwarded and received at the same time [128].

Figure 6.7: Packet loss ratio and server end-to-end latency for different (a) intra-/inter-cluster traffic ratios and (b) buffer sizes.

Buffer dimensioning

Secondly, we analyze the performance as a function of the buffer size. The electronic buffer dedicated to the TX stores the data packet cells in case of

retransmission. When the buffer is full, a newly arrived data packet is discarded. The size of the buffer is one of the most critical hardware dimensions, as a larger buffer can improve the packet loss but, in principle, would deteriorate the latency performance at higher load. As clearly shown in Figure 6.7 (b), for loads lower than 0.5, setting a larger buffer size improves the packet loss performance at the expense of a slightly larger latency. For loads higher than 0.5, packet loss is unavoidable and a larger buffer size does not help to improve the performance. This is due to the heavy contention and the full buffer occupation caused by retransmissions. The latency performance becomes worse because a larger buffer only leads to a longer waiting time in the queue.

Table 6.1: Parameter settings
Scale (number of servers): 2560 to 40960
Servers per rack K: 40
Transceivers per ToR: p = 4, q = 1
Clusters N, racks per cluster M: scale-dependent
Groups F (IS), F (ES): scale-dependent

Interconnectivity scalability

The system performance as a function of the DCN size, ranging from 2560 to 40960 servers, has also been investigated. The number of racks varies from 64 to 1024 and, consequently, ES and IS optical switches with port-counts of 8×8 to 32×32 are needed to achieve the desired DCN size, as listed in Table 6.1. The buffer is set to 20 KB for each TX and the intra:inter traffic ratio of 3:1 is employed. The packet loss ratio and the server end-to-end latency as a function of the number of servers are given in Figure 6.8 (a). It can be observed that the performance is maintained as the number of servers increases. The packet loss ratio is smaller than and the server end-to-end latency is lower than 2 µs at a load of 0.3 for all scales, which indicates the fine scalability of the OPSquare architecture towards a Petabit/s DCN. Similar results have been achieved for the throughput performance, as clearly shown in Figure 6.8 (b). The network starts to saturate at a load of 0.5 and only a slight difference among all the scales has been obtained.

Figure 6.8: (a) Packet loss ratio and server end-to-end latency and (b) throughput when the number of servers scales.

Comparison of latency and power consumption

We have compared the performance of OPSquare in terms of latency and power consumption with the leaf-spine architecture, which is widely used in current DCNs. A DCN consisting of 1024 ToRs (40960 servers) is considered. For a 2:1 oversubscribed leaf-spine DCN, leaf switches and spine switches are needed [16]. The latency of an electronic switch increases super-linearly with the port-count, and the latencies of the leaf switches and the spine switches are taken as 500 ns and 5 µs, respectively, based on commercial products [136]. Assuming that the distance between the leaf and spine switches is 50 m, such a leaf-spine DCN has an average latency of ~6.5 µs, which is mainly contributed by the spine switches with large port-count. The latency value here is based on cut-through switches under full load (load = 1). A comparable result is achieved for the fully loaded OPSquare network and, benefiting from the flat topology and the fast switch control mechanism, the OPSquare architecture has lower than 2.5 µs latency at a load of 0.4.

Table 6.2: Power consumption of the components
Component | Symbol | Power (W)
Electrical switch | P_ES | 12.5/port
10 Gb/s transceiver | P_TRX-10 | 1
Optical switch controller | P_CTRL | 25
Label processor | P_LP | 6
SOA driver | P_SOA | 1
ACK driver and processor | P_ACK | 1.5
Label generator | P_LG | 3.5
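Using the per-component figures of Table 6.2, the sketch below reproduces the OPSquare switching-network total arrived at in Eq. (6.4) below for the 1024-ToR case (32×32 ISs/ESs, p = 4). The per-port breakdown (four label processors, four 1×8 SOA stages and the associated ACK and label-generation circuitry per input) is an interpretation of the node structure of Section 6.1 rather than an exact bill of materials.

```python
# Hedged sketch of the OPSquare switching-network power estimate (cf. Eq. (6.4)),
# using the component values of Table 6.2. The per-port breakdown is an
# interpretation of the modular node structure, not an exact bill of materials.

P_CTRL, P_LP, P_SOA, P_ACK, P_LG = 25.0, 6.0, 1.0, 1.5, 3.5   # Watts (Table 6.2)

def opsquare_switch_power(num_switches=2 * 32, ports=32, p=4, F=8) -> float:
    per_channel = P_LP + F * P_SOA + P_ACK + P_LG   # one WDM channel of one input port
    per_switch = P_CTRL + ports * p * per_channel
    return num_switches * per_switch                 # total in Watts

print(f"{opsquare_switch_power() / 1e6:.2f} MW")     # ~0.16 MW, matching Eq. (6.4)
```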

As the leaf-spine and OPSquare DCNs have the same number (1024) and the same port-count (64×64) of ToR (leaf) switches, the power consumption is compared for the interconnection network. The breakdown contributions of the components are listed in Table 6.2. For the 2:1 oversubscribed leaf-spine DCN, the power is consumed by the spine switches and the transceivers:

P_leaf-spine = P_spine + P_TRX = 0.28 MW    (6.3)

where P_spine and P_TRX are the power consumption of the spine switches (P_ES per port) and of the transceivers (P_TRX-10 each), respectively. For the OPSquare architecture, the port-count of the IS and ES is 32×32 (M = N = 32). The power consumption of the WDM fast optical switch nodes can be calculated by using the following equation:

P_OPSquare = P_OS = 2 × 32 × [P_CTRL + 32 × 4 × (P_LP + 8 × P_SOA + P_ACK + P_LG)] = 0.16 MW    (6.4)

where P_OS is the power consumed by the optical switch nodes. An energy saving of 40% is achieved for OPSquare, which mainly results from the elimination of a large number of transceivers [137]. The optical transparency enabled by the fast optical switches allows for further scaling to higher data rates and higher-order modulation formats without implementing dedicated power-consuming circuits, maintaining the same amount of consumed power.

6.3 Experimental investigation

Figure 6.9: Set-up for experimental assessment.

Besides the numerical assessment, we also experimentally investigate the performance of the OPSquare architecture by using the set-up illustrated in Figure 6.9. It consists of 3 clusters, each formed by 3 ToRs, and the fast optical switch prototypes introduced in Section 5.2 are employed as the optical switches (IS and ES). It is worth noting that the evaluation here mainly focuses on the intra-/inter-cluster interface and the parallel optical switching networks.

The schematic of the implemented ToR is depicted in Figure 6.10. At the intra-/inter-cluster interface, two payload TXs generate 40 Gb/s NRZ-OOK packets with 540 ns duration and 60 ns guard time. For each port of the optical switches (IS and ES), the arriving traffic is allocated to the wavelength channels λ1, λ2 and λ3, respectively. An FPGA acting as a local controller stores the look-up table with the packet destinations, according to which the 4-bit label (2 bits for the inter- and 2 bits for the intra-cluster destination ToR) is generated. The label is first stored in a first-in-first-out (FIFO) queue with a size of 16 packets and will be removed from the queue in response to a positive flow control (ACK) signal. The label TX transforms the digital 4-bit label into an in-band RF tone optical label and attaches it to the corresponding payload packet. At the WDM optical switch, the optical label is extracted and the digital label bits are recovered for contention resolution. The switch controller then controls the 1×2 (F = 2) switches accordingly, to forward the packet to the destination port and block the contending ones with lower priority. The optical flow control signal is sent back to the ToR through the same WDM channel, where the ACK RX detects the ACK/NACK signal. The local controller removes the label from the buffer in response to an ACK, while it retransmits the label and re-triggers the payload transmission in case a NACK is received.

Figure 6.10: Schematic of the implemented ToR.
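The label-queue handling just described can be summarised as follows. This is a simplified sketch only: the actual logic runs in the FPGA local controller and also re-triggers the 40 Gb/s payload transmission, which is not modelled here.

```python
from collections import deque

# Simplified sketch of the ToR flow-control handling: 4-bit labels wait in a
# 16-entry FIFO and are only released on a positive flow-control (ACK) signal.

class LabelQueue:
    def __init__(self, depth: int = 16):
        self.fifo = deque(maxlen=depth)

    def enqueue(self, label: str) -> bool:
        """Queue a label; returns False when the FIFO is full (packet dropped)."""
        if len(self.fifo) == self.fifo.maxlen:
            return False
        self.fifo.append(label)
        return True

    def on_flow_control(self, ack: bool):
        """ACK: release the head-of-line label. NACK: keep it for retransmission."""
        if not self.fifo:
            return None
        return self.fifo.popleft() if ack else self.fifo[0]

q = LabelQueue()
q.enqueue("0110")                    # 2 bits inter- + 2 bits intra-cluster destination
print(q.on_flow_control(ack=False))  # NACK -> '0110' stays queued and is retransmitted
print(q.on_flow_control(ack=True))   # ACK  -> '0110' is released from the buffer
```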

When a packet arrives at the ToR after passing through the optical switch, the optical label is first processed to check whether it belongs to the current ToR (single hop); otherwise (two hops) it is optically bridged to the next switch node (IS or ES) and a copy is stored. As shown in Figure 6.10, the locally generated traffic and the packets to be bridged are scheduled by the 1×2 switch, and the bridged packets are given the higher priority when no retransmission of local traffic is in progress. The blocked traffic waits for the next available time-slot. At the destination ToR, the payload is received after the label has been removed by the FBG.

The recorded packets and flow control signals validating the intra- and inter-cluster dynamic switching operations are shown in Figure 6.11. As a selected case, inter-cluster communication from ToR 1 to ToR 5 via two different paths, i.e., Path 1 (ES 1 → ToR 4 → IS 2) and Path 2 (IS 1 → ToR 2 → ES 2), is investigated. The traffic generated by the ToRs is reported and each packet has been labeled with its destination ToR (1 to 9). For Path 1, the packets from ToR 1 and ToR 7 (higher priority) with destination ToR 5 are transmitted to ToR 4 through ES 1. In case of contention, the traffic from ToR 1 is blocked, a NACK is sent to ToR 1 and the blocked packet is retransmitted in the next available time-slot. At ToR 4, packets with destination ToR 5 and ToR 6 are optically forwarded to IS 2 and stored in case retransmission is needed. At IS 2, packets coming from ToR 4 and ToR 6 (higher priority) and destined for ToR 5 might contend. In case of contention, the optical flow control provides the ACK signals to ToR 4, as shown in Figure 6.11, where the packets received by ToR 5 are also reported. As an alternative, Path 2 first accesses the IS for the intra-cluster communication instead of the inter-cluster connection. The traces of the dynamic switching via Path 2 are also illustrated in Figure 6.11.

Figure 6.11: Traces for multi-path dynamic switching.

The BER performance and the eye diagrams for the 40 Gb/s payload are presented in Figure 6.12. The gate SOA has a noise figure of 7 dB and is biased at 60 mA, providing 15 dB gain. Lower than 1.5 dB power penalty at a BER of and 31.3 dB OSNR have been obtained for the indirect connection (longest path). The noise introduced by the SOA gates slightly degrades the payload quality, as also confirmed by the eye diagrams.

Figure 6.12: BER curves and eye diagrams.

Figure 6.13 (a) and (b) report the measured experimental results of the 9-ToR network in terms of packet loss and end-to-end latency under a uniform traffic distribution. For a load of 0.4 and a buffer size of 16 packets (equivalent to 45 KB) at each TX, a packet loss ratio lower than and an average ToR end-to-end latency lower than 1.4 µs have been achieved.

Figure 6.13: (a) Packet loss ratio and (b) end-to-end latency for 9, 1024 and 4096 ToRs.

6.4 Scalability investigation

64×64 optical switch for 4096-ToR interconnectivity

Towards Petabit/s and even larger scale DCNs, fast optical switch nodes with higher port-count are needed to scale out the number of connected ToRs. By deploying 32×32 / 64×64 optical switches, the square (port-count) scalability allows for the flat interconnection of 1024/4096 ToRs. Following the study carried out for the 9-ToR network, the performance of the 1024-ToR and 4096-ToR cases is experimentally emulated. By using the system modeling environment provided in System Generator [138], the complete network, including the traffic generators at the ToRs and the optical switches, is modeled. Bit- and cycle-accurate simulation studies of the statistics are then performed. As shown in Figure 6.13 (a) and (b), the packet loss ratio and latency behave similarly to the 9-ToR network, resulting from the port-count-independent processing latency. Limited degradation, caused by the slight increase of the contention probability, has been observed. The packet loss ratio is smaller than and while the ToR end-to-end latency is lower than 2.5 µs and 3.1 µs at a load of 0.4 for 1024-ToR and 4096-ToR, respectively. The achieved result is in line with the simulation study presented in Section 6.2, where similar trends for the packet loss ratio and latency have been observed.

Considering the practical implementation, due to the identical modular structure, the performance of an optical switch of higher port-count would mainly depend on the 1×F broadcast-and-select optical switch. While the splitting loss caused by the broadcasting stage can be compensated by the SOA gates, guaranteeing loss-less operation and a sufficient power level at the receiver side [139], the noise added by the SOAs will lead to a degradation in OSNR. Therefore, the main limiting factor for scaling out the port-count is the OSNR degradation after the splitting loss experienced by the payload. As one of the appealing advantages of the OPSquare architecture, the grouping mechanism and the wavelength switching capability allow for a lower broadcasting ratio F (M/p for the IS and N/q for the ES). The lower splitting loss results in improved OSNR performance, supporting a higher port-count with a larger input dynamic range. Different combinations of p, q and F can be selected to achieve the desired ToR connectivity. The performance of the 40 Gb/s payload switched by the corresponding 1×F broadcast-and-select optical switches is reported in Figure 6.14. The bias current of the SOA gates is 70 mA. As clearly shown in the figure, increasing the number of transceivers at the ToR results in a smaller F, and a larger input dynamic range can be achieved while guaranteeing a limited power penalty. When p and q are equal to 4 and a 1×8

broadcast-and-select optical switch is employed, a 12 dB input dynamic range is obtained with less than 1.5 dB power penalty. Besides the improvement of the payload performance, larger p and q can dramatically boost the bandwidth of the intra-/inter-cluster communication and lower the power consumption at the SOA gates for compensating the splitting loss, which are also desired features.

Figure 6.14: Power penalty for different implementations of the optical switch.

Multi-level modulation formats and waveband switching

Scaling up the speed per link is an alternative solution for increasing the total capacity of the DCN. The transparency to the data rate/format enabled by the fast optical switches allows for an immediate capacity upgrade while maintaining the same switching infrastructure, without dedicated parallel optics and format-dependent interfaces to be further included as front-ends. Nowadays, the simple on-off keying modulation widely used for DC interconnects may not be sufficient for future bandwidth requirements. Therefore, schemes like 4-level pulse-amplitude modulation (PAM4) and discrete multi-tone (DMT), featuring intensity modulation and direct detection (IM/DD), as well as WDM waveband channels, have been considered as potential candidates to effectively increase the data rate of the interconnects in DCNs [140]. In this respect, three types of directly modulated traffic, namely 28 Gb/s PAM4, 40 Gb/s DMT and 4 × 25 Gb/s NRZ-OOK, all featuring IM/DD, have been investigated for the OPSquare architecture.

Exploiting the flat topology and the modular structure of the optical switch, the switching performance in OPSquare mainly depends on the 1×F broadcast-and-select switch and will be limited by the splitting loss experienced by the payload. Using the prototyped optical switch described in Section 5.2 and the experimental setup shown in Figure 6.15, the switching performance and the port-count scalability for realizing a large-scale OPSquare DCN have been assessed. The broadcasting

splitter is emulated by a variable optical attenuator (VOA) and another SOA has been added before the 1×F optical switch as a pre-amplifier of the input signal. It improves the input dynamic range and can potentially be photonically integrated with the optical switch [141]. The gating SOA in the prototype guarantees nanosecond operation and compensates the broadcasting loss.

Figure 6.15: Experimental set-up for PAM4, DMT and waveband switching. (AWG: arbitrary waveform generator; BERT: BER tester)

Switching of 28 Gb/s PAM4 traffic

PAM4 with forward error correction (FEC) can effectively increase the transmission bit-rate in an IM/DD manner with low-complexity, real-time digital signal processing (DSP), making PAM4 a promising candidate for short-reach interconnect solutions. Due to the bandwidth limitation of the DML, the switching performance of 14 Gbaud (28 Gb/s) PAM4 traffic has been evaluated, instead of the 50 Gb/s being considered for 200 Gigabit Ethernet (GbE) and higher speeds. As illustrated in Figure 6.15, the 28 Gb/s PAM4 traffic is generated by driving the DML from a 60 GSa/s arbitrary waveform generator. At the switch output, the traffic is received by a PIN photodiode (PD) integrated with a trans-impedance amplifier (TIA), after which the signal is captured by a real-time 50 GSa/s digital phosphor oscilloscope (DPO) for offline DSP. The DML and PD+TIA are commercially mature 10G-class components. Digital pre-compensation is applied at the arbitrary waveform generator and digital filtering is applied at the receiver side.

The BER curves of the switched 28 Gb/s PAM4 traffic as well as the eye diagrams for a 4-port switch configuration (6 dB splitting loss) are shown in Figure 6.16 (a). The input optical power is 0 dBm (without the pre-amplifier SOA). A power penalty of 0.8 dB at the typically targeted pre-FEC BER of has been observed, which is also confirmed by the eye diagrams. The pre-amplifier SOA is then added for the investigation of the optical power dynamic range.

The scales of 1×8 and 1×16 for the 1×F switch, corresponding to actual 32×32 and 64×64 optical switches when employing 4 WDM transceivers (p = q = 4), are considered. The bias currents of the pre-amplifier and the gate SOA have been adjusted to optimize the performance. For the 32×32 scale, a 10 dB dynamic range has been measured with < 1.5 dB penalty, while for the 64×64 case, an 8 dB dynamic range has been obtained with 3 dB penalty. Due to the low OSNR tolerance, the OSNR degradation caused by the high splitting loss greatly affects the power penalty, such that 1–2 dB of extra penalty has been observed when scaling from 32×32 to 64×64 ports.

Figure 6.16: (a) BER curves and (b) input dynamic range for 28 Gb/s PAM4.

Switching of 40 Gb/s DMT traffic

DMT can make optimal use of the bandwidth by adapting the modulation format and power of each subcarrier to the characteristics of the transmission channel. As illustrated in Figure 6.15, an arbitrary waveform generator with a sample rate of 24 GSa/s drives the DML to generate the DMT traffic. The switch output is received and captured by the DPO for offline DSP. A probe signal with uniform power and QPSK bit loading is first sent through the optical switching system and the signal-to-noise ratio (SNR) at each subcarrier is estimated. Then the bit loading algorithm is used to obtain the bit and power allocation for the evaluated DMT signal. Here 512 subcarriers are used, targeting a line rate of 40 Gb/s. An example of the optimal bit allocation after bit loading is shown in Figure 6.17 (a). It can be seen that the modulation format changes from 128-QAM to BPSK as the SNR decreases towards the high-frequency subcarriers.
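The principle of the bit-loading step can be illustrated with a simple SNR-gap rule: subcarriers with higher measured SNR carry larger constellations. The sketch below is a textbook simplification for illustration only, not the loading algorithm actually used in the experiment, and all numeric values are assumptions.

```python
import math

# Illustrative per-subcarrier bit loading based on a simple SNR-gap rule.
def bits_per_subcarrier(snr_db: float, gap_db: float = 6.0, max_bits: int = 7) -> int:
    """Bits/symbol from b = log2(1 + SNR/gap), capped at 128-QAM (7 bits)."""
    snr = 10 ** (snr_db / 10)
    gap = 10 ** (gap_db / 10)
    return max(0, min(max_bits, int(math.log2(1 + snr / gap))))

# Example SNR profile decreasing towards the high-frequency subcarriers:
snrs_db = [28, 24, 20, 16, 12, 9]                     # illustrative values
print([bits_per_subcarrier(s) for s in snrs_db])      # 7 (128-QAM) down to 1 (BPSK)
```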

Figure 6.17: 40 Gb/s DMT traffic. (a) Bit allocation per subcarrier; (b) BER curves; (c) input dynamic range; (d) throughput vs. received optical power.

The BER curve for the 4-port configuration is depicted in Figure 6.17 (b). The power penalty at a BER of is less than 0.6 dB after the switching system. The optical power dynamic range for the 32×32 and 64×64 scales, targeting a line rate of 40 Gb/s at the target BER, is presented in Figure 6.17 (c). The penalty increases rapidly when the input optical power is reduced below −6 dBm. The port-count of 64×64 (1×16 broadcasting ratio) could assure an 8 dB dynamic range within 3 dB power penalty. Despite the loss-less operation guaranteed by the gain compensation, the high power level needed at the receiver side leads to a limited dynamic range and poor power efficiency. Larger port-counts would cause worse performance, putting more pressure on the link power budget. Figure 6.17 (d) reports the flexible data rate adaptation enabled by the DMT system. The target BER has been set and the received optical power is varied from −3 dBm to 0 dBm. By updating the optimal bit allocation, the bit-rate can be dynamically adjusted, with an increase from 41 Gb/s to 43.5 Gb/s (corresponding to the net bit-rate after 7% hard-decision FEC).

The assessments carried out for the PAM4 and DMT traffic have validated that the fast optical switches in the OPSquare network are potentially capable of

handling multi-level modulated traffic with a large port-count. By employing 4 WDM transceivers (p = q = 4) per ToR, each operating at 40 Gb/s, and 64×64 fast optical switches (1×16 broadcasting ratio), an OPSquare DCN comprising 4096 ToRs, each with 320 Gb/s aggregation bandwidth, would have a capacity larger than 1.3 Petabit/s. Larger interconnectivity can be achieved either by increasing the broadcasting ratio of the 1×F switch, with limited performance degradation as shown in Figure 6.16 and Figure 6.17, or by increasing the number of transceivers per ToR, which would also improve the bandwidth performance.

Multi-λ (WDM 4 × 25 Gb/s) waveband switching

Besides the employment of higher-order modulation schemes, grouping multiple wavelength channels is a straightforward way to increase the link speed without implementing sophisticated driving interfaces. Current 100 GbE solutions (e.g., QSFP28 and CFP4) follow this trend by grouping 4 × 25 Gb/s lanes. Such waveband signals fit naturally into the OPSquare architecture, in which case the AWGs at the input and output ports of the optical switch should support coarse WDM (CWDM) applications. Similar to the PAM4 and DMT schemes, the network capacity can be increased without extra engineering effort on the switching infrastructure by adding wavelength channels to the waveband group. The multiple wavelength channels are treated as an entity and fed into the SOA gate simultaneously. The performance of waveband switching, enabled by the broadband operation of the SOA-based switch, is investigated for the 1×F optical switch. As shown in Figure 6.15, two DMLs and two electro-absorption modulated lasers (EMLs) with 1.6 nm spacing are driven by the 25 Gb/s PRBS NRZ-OOK packetized payload from the bit pattern generator (BPG) and then decorrelated to generate the waveband 4 × 25 Gb/s data. The waveband is then fed into the emulated 1×F optical switch. At the output port, each channel is filtered out by an optical band-pass filter (OBPF) with a 3 dB bandwidth of 1 nm and sent to the 40 Gb/s photo-receiver.

The BER curves and eye diagrams of the four wavelength channels for a 4-port optical switch configuration are presented in Figure 6.18 (a). In DCs the target BER for 100 GbE standards is , for which less than 0.8 dB power penalty can be achieved for all four wavelength channels.

Figure 6.18: (a) BER curves and (b) input dynamic range for the 4 × 25 Gb/s waveband.

The power penalty at a BER of for the 4 × 25 Gb/s NRZ-OOK traffic with different input optical powers is reported in Figure 6.18 (b). The scales of the optical switch are 32×32, 64×64 and when employing 4 wavebands (p = q = 4). A 16 dB input dynamic range is achieved with less than 2 dB power penalty. The optimal input optical power is around −5 dBm; both lower and higher input powers result in a higher penalty, mainly due to the noise and the saturation, respectively. Scaling the port-count from to only results in < 0.6 dB extra penalty. Scaling to larger port-counts and more wavelength channels is expected to further increase the capacity at the expense of a limited additional power penalty and a smaller dynamic range.

Here each waveband has a 100 Gb/s capacity, which can be increased by inserting more wavelength channels. Since each SOA gate only deals with a certain waveband, the maximum number of wavebands (equivalent to p) and of channels inside each band depends on the operating bandwidth of the SOA gate and the wavelength channel spacing. The SOA typically has a gain bandwidth larger than 40 nm, which is sufficient for accommodating more than GHz-spaced channels. With four 100 Gb/s wavebands per ToR and 64×64 optical switches (1×16 broadcasting ratio), an interconnectivity of 64² = 4096 ToRs and a capacity > 3.2 Petabit/s can be achieved, benefiting from the transparent optical switching and the WDM TX wavelength assignment for the grouped ToRs featured by the OPSquare architecture.
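The capacity figures quoted in this section follow directly from the dimensioning rules of Section 6.1. The sketch below reproduces them; counting the aggregate ToR bandwidth as (p + q) waveband links per ToR is the convention that yields the > 3.2 Petabit/s figure above.

```python
# Sketch reproducing the capacity figures of this section.
def opsquare_capacity(F: int, p: int, q: int, gbps_per_link: float):
    M, N = F * p, F * q              # IS port count M = F*p, ES port count N = F*q
    tors = M * N                     # flat interconnection of N x M ToRs
    capacity_pbps = tors * (p + q) * gbps_per_link / 1e6   # Gb/s -> Pb/s
    return tors, capacity_pbps

# 64x64 switches (1x16 broadcasting ratio) with p = q = 4 transceivers per ToR:
print(opsquare_capacity(F=16, p=4, q=4, gbps_per_link=100))  # (4096, ~3.28 Pb/s)
print(opsquare_capacity(F=16, p=4, q=4, gbps_per_link=40))   # (4096, ~1.31 Pb/s)
```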

Summary

In this chapter, the operation of the novel optical DCN architecture OPSquare, based on the developed fast optical switch node, has been presented. Benefiting from the parallel optical switching structure, high-capacity and low-latency switching capabilities can be achieved by employing moderate port-count optical switches. The optical packet crosses a single optical switch node in the flat topology, facilitating the employment of a buffer-less optical switch with fast optical flow control. OPSquare also introduces the WDM TX wavelength assignment for the grouped ToRs, which, in combination with the wavelength switching enabled by the fast optical switch, guarantees a lower broadcasting ratio for realizing the same port-count. Therefore, lower splitting losses are experienced by the data payload, leading to less OSNR degradation and a significant improvement of the scalability and feasibility of the network.

Numerical and experimental investigations of the OPSquare architecture towards the implementation of a low-latency and Petabit/s-scale DCN have been carried out. Simulation models have been built in the OMNeT++ environment and realistic traffic is considered. The assessment results show lower than 2.4 µs server end-to-end latency and packet loss ratio at 0.4 load with a 50 KB buffer per TX for a server DCN. Scaling to servers shows slight performance degradation. Performance comparisons with a leaf-spine DCN indicate 5 µs lower latency at a load of 0.4 and 40% power saving for the OPSquare architecture. The experimental validation is performed by utilizing the 4×4 WDM optical switch prototypes described in Section 5.2. Multi-path flow-controlled dynamic switching operation has been demonstrated. The experimental investigation on scaling from 9 ToRs to 4096 ToRs has shown limited degradation, with lower than packet loss ratio and 3.1 µs latency. Increasing the number of WDM transceivers equipped at the ToR would greatly improve the bandwidth performance and at the same time lower the broadcasting ratio of the 1×F optical switch. The capability of the OPSquare architecture to switch high-capacity PAM4, DMT and waveband OOK traffic when scaling up the port-count has been experimentally assessed. Employing 4 WDM TXs (waveband channels) per ToR (p = q = 4) and 64×64 optical switches (1×16 broadcasting ratio), OPSquare could provide an interconnectivity of 64² = 4096 ToRs and Petabit/s capacity. Potential scaling to larger port-counts and higher capacity is enabled by the transparent optical switching, the scalability of the architecture and the TX wavelength assignment for the grouped ToRs.


Chapter 7

Optical label switched add-drop node for data center interconnect

In this chapter, a novel optical add-drop node based on optical label switching for data center interconnect (DCI) metropolitan area networks (MANs) is presented⁵. The optical in-band RF tone label technique introduced in Section 2.2 has been employed as the control scheme, which saves the scarce spectrum resources for the WDM traffic. Combining the parallel processing of the labels and the fast nanosecond switching of the SOAs, the node can provide packet-based add/drop operation for high-capacity and statistically multiplexed data traffic, significantly improving the flexibility and resource utilization of the network. The low-latency property is further guaranteed by the fast power equalization of the WDM channels, which exploits the power indication carried by the in-band optical label. Besides the blocking/forwarding function, the bias current of the SOA gate is thus dynamically adjusted to provide individual gain compensation for each channel.

The performance of the optical label switched (OLS) add-drop node has been evaluated in a ring network. By using the developed node prototypes, the dynamic add/drop operation, including multicasting and fast power equalization, has been experimentally demonstrated. The statistics investigation confirms the efficient bandwidth utilization resulting from the statistical multiplexing, with fewer wavelength channels required for loss-free and low-latency delivery. Targeting the increasing demand for higher capacity and larger interconnectivity in DCI MANs, the capability of switching Terabit/s QPSK waveband signals across a large number of crossed nodes is also assessed.

In the following, Section 7.1 introduces the structure of the optical label switched add-drop node, based on which the nanosecond reconfiguration time and the add/drop operation are supported. The system evaluations of a high-capacity ring network employing four node prototypes, showing fast wavelength reuse and channel equalization for 50 Gb/s data packets, are presented in Section 7.2.

⁵ Parts of this chapter are based on the results published in [155] and [156].

The capability of dynamically switching high-capacity waveband traffic, and the potential scalability to a larger number of waveband channels and interconnected nodes, are experimentally studied in Section 7.3.

7.1 Optical label switched add-drop node

Flexibility and efficiency required in DCI networks

Driven by cloud computing and the upcoming 5th generation (5G) of mobile communications, data-intensive services are imposing more stringent bandwidth requirements on the communications inside the data centers as well as on the communications between separated data centers, which are supported by the DCI metro networks [142]. These metro data centers are geographically distributed, with distances ranging from several kilometers to a few tens of kilometers. Besides the lower capital and operating expenditure at certain locations, the services are brought closer to the end-users, thus reducing the service delivery time. These data centers actively communicate over the DCI transport network for services such as backup, virtual machine (VM) migration, video streaming and fault/disaster recovery. As high bandwidth is required for these applications, WDM deployment incorporating transparent optical switching nodes has been considered an attractive option for DCI metro networks [143].

Compared with long-haul networks, the traffic distribution changes much more rapidly in the DCI metro networks. This leads to the necessity of reconfigurable network elements which can quickly and efficiently rearrange the network capacity as demanded. Such increased flexibility also brings efficient management of the network resources with finer granularity, which is of paramount importance to limit the cost of the infrastructure and, in the meantime, to satisfy the demanding bandwidth requirements. In addition, some of the data transfers along the DCI require guaranteed low delays. Examples are the migration of VMs and the transfer of virtualized 5G network services [144]. Therefore, network elements, and especially add-drop nodes, that provide low delivery latency and efficient resource usage are more and more required for metro and DCI networks.

Current add-drop nodes based on electronic switches with dedicated data-rate and format O/E/O front cards can effectively realize a packet-switched MAN. However, foreseeing the deployment of bandwidth-adaptable systems, low-power, data-rate and format independent operation is required. Optical add-drop multiplexers (OADMs) based on optical circuit switches can

Optical add-drop multiplexers (OADMs) based on optical circuit switches can transparently route signals with variable data rates and formats at low power consumption, but their reconfiguration time of tens of milliseconds results in inefficient performance and poor utilization of the network resources [51]. The milliseconds-long configuration time is mainly due to the setting of the channel wavelength (destination) of the edge-node transceivers, the optical add-drop switching system, and the per-channel power-level equalization needed to prevent transmission impairments. As high capacity is demanded, the wavelengths become scarce resources that ultimately limit the capacity and scalability of the DCI networks. In this respect, by offering fast switching control and fine-grained on-demand wavelength utilization (statistical multiplexing), an add-drop node based on fast optical switches and supporting packet-based operation could be the key enabler for improving the flexibility and efficiency of DCI metro networks.

7.1.2 Structure and operation

To speed up the switching and power equalization operations of the add-drop node, and thus lower the latency of the network, fast optical label processing for each optical channel is performed in parallel by the node controller. The optical label provides information on the destination node, and therefore no specific wavelength has to be assigned to a destination node. This results in a better utilization of the wavelength resources, which can be employed according to availability to increase the performance, flexibility and capacity (e.g., by generating WDM super-channels) of the network. Moreover, the optical label is processed on the fly (within a few tens of nanoseconds) and in an asynchronous fashion by the controller. No time- and power-hungry clock synchronization circuits are required, which further speeds up the operation of the add-drop node. Since the label information addresses the destination node, it also enables multicasting in the optical domain by operating the add-drop node in drop-and-continue mode. Last but not least, the in-band optical label can be exploited for power-level monitoring of each optical channel. The proposed OLS technique therefore facilitates fast equalization by properly adjusting the gain of the SOA gates.

The detailed structure of the proposed OLS add-drop node is shown in Figure 7.1. At each node, the WDM channels are first de-multiplexed and then divided by a 1 × 2 coupler into a drop arm and a continue arm. Each channel contains one or multiple wavelengths, which can be tailored according to the desired capacity. In the continue arm, the channel is fed into the SOA-based optical gate, which is controlled by the FPGA-based node controller on a nanoseconds scale.

SOA technology guarantees nanosecond switching and adjustable optical gain for the power equalization operation. In the drop arm, an optical Label Extractor (LE) separates the in-band optical label from the payload by means of a narrow-band fiber Bragg grating (FBG). The payload is then fed into the SOA-based gate and eventually forwarded to the fast burst-mode receiver (BM-RX). As explained in Section 5.3, automatic gain control is applied here, and clock and phase synchronization are performed for correct reception of the data packets. An example of a BM-RX able to lock to the incoming packet within 3 ns is reported in [145]. Compared with an electrical-switch-based ring, the reduction in the number of WDM transponders at each node can significantly lower the power consumption. No dedicated power-hungry circuits are needed, even when scaling to higher data rates and higher-order modulation formats.

Figure 7.1: Schematic of the optical label switched add-drop node.

The power of the extracted optical label is split into two parts, as illustrated in Figure 7.1. The first part is used for real-time monitoring to implement rapid channel equalization: analogue equalizer circuits feed back the SOA bias current and thus adjust the optical gain experienced by the continue traffic. The other part of the extracted label is processed by the Label Processor (LP) to recover the RF tone label bits [112]. The recovered bits are processed by the node controller. According to the label information, the controller determines whether the packet has to be dropped, forwarded, or dropped and forwarded (multicasting) by controlling the SOAs in the drop and continue arms. Note that the labels of all the WDM channels are processed in parallel. This allows the controller to have a full view of the available slots and to reconfigure the node on a nanoseconds scale. On both arms, fixed delay lines are added to synchronize the traffic with the generated control signals.
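To make the forwarding step concrete, the sketch below models the per-channel decision in software (illustrative names only; in the real node the decision is taken combinatorially in the FPGA, not in Python). A label is represented as the set of destination node IDs it addresses, and the two booleans drive the drop-arm and continue-arm SOA gates, covering the drop-and-continue case used for multicasting.

```python
# Simplified model of the node controller's per-channel forwarding decision.
# Assumption: each channel label encodes a set of destination node IDs
# (a multicast label addresses several nodes). Names are illustrative only.

from dataclasses import dataclass

@dataclass
class GateControl:
    drop_gate: bool      # SOA in the drop arm (towards the local BM-RX)
    continue_gate: bool  # SOA in the continue arm (towards the next node)

def decide(label_destinations: set[int], my_node_id: int) -> GateControl:
    """Decide drop / continue / drop-and-continue for one wavelength channel."""
    addressed_here = my_node_id in label_destinations
    others_remaining = bool(label_destinations - {my_node_id})
    return GateControl(drop_gate=addressed_here,
                       continue_gate=others_remaining)

def control_node(labels_per_channel: list[set[int]], my_node_id: int) -> list[GateControl]:
    """On the FPGA all channel labels are processed in parallel; here we simply
    iterate, since the decisions are independent per channel."""
    return [decide(dests, my_node_id) for dests in labels_per_channel]

# Example: node 2 sees four channels; channel 0 is unicast to node 2,
# channel 1 is multicast to nodes 2 and 4, channel 2 is for node 3 only,
# channel 3 carries an empty slot.
print(control_node([{2}, {2, 4}, {3}, set()], my_node_id=2))
```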

In the add operation, the controller sets the fast (nanosecond-scale) tunable transmitter (TTX) [146] and the Label Generator (LG) for transmitting the packets and the corresponding labels stored in the node electrical buffer. The LG is the same as the one described in Section 2.2: the binary label bits generated by the node controller are coded onto multiple RF tones, and a fast tunable laser centered on the optical label wavelength is modulated with these tones.

The 4 × 4 optical switch prototype developed in Section 5.2 has a structure similar to the OLS add-drop node. Slight modifications, namely adding the equalizer circuits and replacing the output AWG, have been made to implement the 4-channel OLS add-drop node prototype. Compared with the optical packet and circuit integrated ring node developed in [147], which uses a serial header for control, the OLS add-drop node does not require power- and time-consuming clock and data recovery to retrieve the header bits. The packet OADM in [146], enabling sub-wavelength granularity and optical transparency, uses a supervisory wavelength channel to carry the control information; as the number of channels scales, longer headers are required to address all channel destinations, which results in much longer header processing latency. Instead, the OLS add-drop node employs parallel labeling to allow for scalable and ultrafast operation. The monitoring of the optical label power and the fast response of the SOA also enable fast per-channel power equalization, which serves as an essential building block to fully achieve a flexible add-drop node with a nanoseconds-scale reconfiguration time.

7.2 Assessment with 50 Gb/s data traffic

7.2.1 Experiment set-up

The performance of the proposed OLS add-drop node has been assessed in a network scenario in which 4 prototypes have been realized and interconnected in a unidirectional ring topology, as shown in Figure 7.2 (a). Four laser sources are used to generate the four 100-GHz-spaced WDM optical channels, each carrying a packetized NRZ-OOK (PRBS 2^9-1) payload at a data rate of 50 Gb/s. To facilitate the network management and guarantee an efficient, low-latency adding operation of new packets [146], synchronous slotted operation with a fixed packet duration of 2 µs and a guard time of 100 ns has been utilized. Due to the lack of fast tunable transmitters, the WDM data packets are distributed to the 4 nodes as adding traffic. As illustrated in Figure 7.2 (b) and (c), the FPGA-based node controller enables the optical gates to emulate the fast tunable transmitters and the added local traffic ADD 1-ADD 4.

Figure 7.2: (a) 4-node ring network. (b) 50 Gb/s payload generation. (c) Schematic of Node 1 in the experimental assessment.

Figure 7.2 (c) presents the detailed schematic of Node 1 in the experimental set-up. The link span between each pair of nodes is 10 km of SMF. An Erbium-doped fiber amplifier (EDFA) for channel amplification and a piece of dispersion compensating fiber (DCF) for chromatic dispersion compensation are placed in each span. Note that the OLS add-drop node provides fast equalization of the channels, and therefore no adjustment of the EDFA gain profile is needed. Each payload packet is attached to an optical in-band RF tone label [112] on its own label wavelength. A 4-bit label, carried by RF tones at 130 MHz, 253 MHz, 410 MHz and 615 MHz, is used in this experiment to address the four possible destination nodes as well as the multicast destination.
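To illustrate how such a multi-tone label can be generated and recovered, the following sketch sums one RF tone per '1' bit and detects the bits from the power found at each tone frequency. The tone frequencies are those quoted above; the sampling rate, observation window, noise level and threshold are arbitrary illustrative choices, not the parameters of the actual label generator and label processor.

```python
import numpy as np

TONES_HZ = [130e6, 253e6, 410e6, 615e6]   # RF tones used for the 4-bit label
FS = 2e9                                   # illustrative sampling rate
N = 4096                                   # samples per label observation window

def encode_label(bits, fs=FS, n=N):
    """Sum one sinusoid per '1' bit; this baseband waveform would modulate
    the label laser in the Label Generator."""
    t = np.arange(n) / fs
    return sum(b * np.cos(2 * np.pi * f * t) for b, f in zip(bits, TONES_HZ))

def decode_label(waveform, fs=FS, threshold=0.25):
    """Recover the bits from the normalized power found at each tone frequency."""
    spectrum = np.abs(np.fft.rfft(waveform)) / len(waveform)
    freqs = np.fft.rfftfreq(len(waveform), 1 / fs)
    powers = [spectrum[np.argmin(np.abs(freqs - f))] for f in TONES_HZ]
    ref = max(powers) if max(powers) > 0 else 1.0
    return [int(p / ref > threshold) for p in powers]

label = [1, 0, 1, 1]
rx = encode_label(label) + 0.05 * np.random.default_rng(0).standard_normal(N)
print(decode_label(rx))    # -> [1, 0, 1, 1]
```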

At each node, the optical labels are simultaneously filtered out by the LEs using an FBG with a 3-dB bandwidth of 6 GHz. The FPGA node controller, fed with the recovered label bits, controls the SOA gates to drop, continue, or drop-and-continue the packets. The destinations of the packets to be added are checked against those of the packets in the continue arms for possible contention. When a slot at any wavelength is available, the controller starts the add operation by triggering the emulated TTX and LG for the creation of the in-band optical label. Benefiting from the parallel labeling technique, less than 20 ns delay is observed for both the add and drop operations. A portion of the in-band optical label power is used for real-time monitoring and equalization. The power equalization optoelectronic circuits, mainly consisting of a photodetector, a logarithmic amplifier and operational amplifiers, have a response time of 100 ns and a dynamic range of up to 10 dB. Faster response and a larger dynamic range could be achieved with a dedicated circuit design instead of the off-the-shelf components used here.

7.2.2 Dynamic operation

Figure 7.3 illustrates the optical spectra at the input and output of OLS add-drop Node 3. At the input, the four channels have a maximum power-level difference of around 10 dB. By employing the feedback bias generated by the electronic power equalizer circuits, the channels are equalized as a result of the automatic adjustment of the SOA gain, as clearly shown in the output spectrum. The red broken line in Figure 7.3 shows the spectrum of the dropped channel; in this case, the SOA gate on the continue arm is switched off by the node controller. A contrast ratio of around 40 dB has been measured, guaranteeing that no crosstalk is applied to the newly added traffic.

Figure 7.3: Spectra of Node 3 input and output.

Figure 7.4 reports the traces recorded in the dynamic add/drop operation. The packets are labeled with destinations Node 1-4, and B1-B4 stand for the multicast packets originating from Node 1-4. At the last destination node, a multicast packet is blocked at the continue arm and the slot becomes available for adding packets. Before adding any packet to an available slot, the node controller first checks for possible contention between the local traffic and the packets in the continue wavelength channels. In the experiment, only one receiver has been used per node; therefore, if the local traffic has the same destination as any of the existing packets, it is kept in the buffer. If more receivers were available, more channels could be added and dropped simultaneously.
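The add-side contention check described above can be summarized as follows (hypothetical names; single-receiver case as in the experiment): the head-of-line local packet is added only when a free wavelength slot passes the node and no packet on the continue arms already carries the same destination.

```python
from collections import deque
from typing import Optional

def try_add(local_buffer: deque,
            continue_destinations: list) -> Optional[int]:
    """Return the wavelength index used for the add, or None if the packet stays buffered.
    continue_destinations[w] is the destination of the packet that keeps travelling
    on wavelength w after this node (None means that slot is free)."""
    if not local_buffer:
        return None
    dst = local_buffer[0]
    # single receiver at the destination: do not add a packet whose destination
    # already appears on one of the continue channels in this slot
    if dst in (d for d in continue_destinations if d is not None):
        return None
    for w, d in enumerate(continue_destinations):
        if d is None:                      # free slot found
            local_buffer.popleft()
            return w
    return None

print(try_add(deque([3]), [2, None, 4, 1]))   # -> 1 (added on wavelength 1)
print(try_add(deque([2]), [2, None, 4, 1]))   # -> None (contention with node 2)
```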

Figure 7.4: Dynamic add/drop operation for the four nodes.

BER curves for the dropped 50 Gb/s channels at the different nodes of the ring, together with the back-to-back reference, are shown in Figure 7.5. Multicast traffic is added into the ring at Node 1 on the channel where arriving packets have just been dropped. As clearly shown in Figure 7.5, the traffic received at Node 4 has a power penalty of 2.5 dB, which is mainly contributed by the residual chromatic dispersion and the accumulated ASE noise from the SOAs. The eye diagrams also confirm the slight degradation of the dropped 50 Gb/s payload at the different nodes. Scaling to a larger number of nodes is ultimately limited by the optical signal-to-noise ratio (OSNR) degradation caused by the accumulated ASE noise of the EDFAs and SOAs used to compensate the losses of the transmission link and of the node. By optimizing the input optical power at the SOA, > 22 dB OSNR has been demonstrated in a recirculating loop cascading 24 SOAs (sufficient for 24 OLS add-drop nodes) [148].

In our system, the AWGs acting as filters can further improve the OSNR, so that the system could potentially scale beyond 24 nodes with adequate OSNR for error-free operation.

Figure 7.5: BER curves and eye diagrams for data dropped at different nodes.

7.2.3 Statistical investigation

The statistics of the ring network based on the OLS add-drop nodes have been investigated for different traffic loads. Each node generates uniformly distributed traffic at varying load, with a multicast (to the other 3 nodes) ratio of 5%, 15% and 25%, respectively. The performance in terms of packet loss, average delivery time and throughput of the 4-node ring network is reported in Figures 7.6 (a), (b) and (c), respectively. It can be observed from the figures that for a load of 0.4 and a multicast ratio of 25%, the packet loss remains very low. The average delivery time, which includes the waiting time for an available slot in the buffer and the propagation along the ring, increases slowly to 110 µs for loads up to 0.5; it then grows rapidly at high load, mainly due to the long waiting time caused by the 16-packet buffer. Accordingly, the throughput saturates at a load of 0.5, and higher throughput is achieved when a lower multicast ratio is set.
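For readers who want to reproduce the qualitative trends, the sketch below is a deliberately simplified slot-level Monte Carlo model of the ring (one add per node per slot, head-of-line buffering, no multicast, no single-receiver constraint). The absolute numbers therefore do not match Figure 7.6; the snippet only illustrates the kind of slot-level simulation behind the reported statistics.

```python
import random
from collections import deque

def simulate_ring(n_nodes=4, n_wavelengths=4, load=0.4, buffer_size=16,
                  slot_us=2.1, n_slots=100_000, seed=1):
    """Rough slotted abstraction: each slot, every node may generate one packet
    (probability = load, uniform destination) and tries to add its head-of-line
    packet into any free wavelength passing the node; otherwise it is buffered,
    or lost if the 16-packet buffer is full."""
    rng = random.Random(seed)
    # ring[w][p] holds (destination, creation_slot) of the packet on wavelength w at position p
    ring = [[None] * n_nodes for _ in range(n_wavelengths)]
    buffers = [deque() for _ in range(n_nodes)]
    generated = lost = delivered = 0
    total_delay_slots = 0.0

    for t in range(n_slots):
        for w in range(n_wavelengths):                       # advance one hop per slot
            ring[w] = [ring[w][-1]] + ring[w][:-1]
        for node in range(n_nodes):
            for w in range(n_wavelengths):                   # drop packets destined here
                pkt = ring[w][node]
                if pkt is not None and pkt[0] == node:
                    delivered += 1
                    total_delay_slots += t - pkt[1]
                    ring[w][node] = None
            if rng.random() < load:                          # new local traffic
                generated += 1
                dst = rng.choice([d for d in range(n_nodes) if d != node])
                if len(buffers[node]) < buffer_size:
                    buffers[node].append((dst, t))
                else:
                    lost += 1
            if buffers[node]:                                # add into any free slot
                for w in range(n_wavelengths):
                    if ring[w][node] is None:
                        ring[w][node] = buffers[node].popleft()
                        break

    return {"loss_ratio": lost / max(generated, 1),
            "avg_delay_us": slot_us * total_delay_slots / max(delivered, 1)}

if __name__ == "__main__":
    for load in (0.2, 0.4, 0.6, 0.8):
        print(load, simulate_ring(load=load))
```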

Figure 7.6: Statistical analyses of the 4-node ring network. (a) Packet loss. (b) Average delivery time. (c) Throughput.

Owing to the efficient bandwidth utilization resulting from the statistical multiplexing, efficient service delivery can be achieved with limited wavelength resources. Considering ring networks of 4, 8 and 16 OLS add-drop nodes, we have performed bit- and cycle-accurate simulation studies on the minimum number of wavelength channels needed to achieve loss-free operation. In this case, each node is equipped with as many receivers as there are wavelength channels, instead of the single receiver used in the experimental assessment. As shown in Figure 7.7, more wavelength channels are necessary to boost the capacity of the network as the load increases. For a ring network containing N (4, 8 and 16) nodes, N/2 wavelength channels are adequate for loss-free and low-latency delivery at loads up to 0.6. The statistical multiplexing enabled by the fast-reconfigurable OLS add-drop node indeed brings flexible and efficient bandwidth utilization to the network.

Figure 7.7: Minimum number of wavelength channels needed for loss-free operation in ring networks of (a) 4 nodes; (b) 8 nodes; (c) 16 nodes.

7.3 Assessment with high-capacity waveband traffic

7.3.1 Experiment set-up

Facing the increasing demand for higher capacity in DCI networks, the capability of the OLS add-drop node to switch high-capacity Terabit/s waveband signals with a higher-order modulation format is assessed. Similar to the work carried out for the 50 Gb/s traffic, a ring topology interconnecting 3 node prototypes has been employed, as shown in Figure 7.8 (a), which includes the experimental set-up with the detailed structure of Node 1. Twelve 50-GHz-spaced DWDM channels are modulated using a 28 GBaud QPSK transmitter followed by polarization and odd/even channel decorrelation stages, as shown in Figure 7.8 (b). The generated signal is distributed to the 3 nodes as add traffic (Add 1, Add 2 and Add 3), where it is separated into two waveband signals, each containing 6λ × 112 Gb/s dual-polarization (DP) QPSK data.
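As a quick sanity check on the capacities involved (plain arithmetic from the figures above, assuming the quoted 112 Gb/s is the gross line rate per carrier):

```latex
\begin{align}
R_\lambda &= 28\,\text{GBd} \times 2\,\tfrac{\text{bit}}{\text{symbol}} \times 2\,\text{pol.} = 112\,\text{Gb/s}\\
R_\text{waveband} &= 6 \times R_\lambda = 672\,\text{Gb/s}\\
R_\text{total} &= 12 \times R_\lambda \approx 1.34\,\text{Tb/s}
\end{align}
```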

In combination with the corresponding optical in-band RF tone label, the 6λ × 112 Gb/s DP-QPSK data is gated by the FPGA-based node controller to emulate the tunable adding operations, as shown in Figure 7.8 (a) and (b).

Figure 7.8: (a) 3-node ring network and schematic of Node 1 in the experiment set-up. (b) 112 Gb/s DP-QPSK traffic generation and detection.

7.3.2 Dynamic operation

Figure 7.9: Spectra of the input, drop ports and output of Node 1.

The payload packets, with 2.5 µs duration and 100 ns guard time, are attached to an optical in-band RF tone label, one per waveband. A 3-bit label is used to address the 3 possible destination nodes and the multicast destinations.

The optical spectra of the input, the drop ports of each waveband, and the output of Node 1 are illustrated in Figure 7.9. The optical labels are filtered out by an FBG with a 3-dB bandwidth of 6 GHz. The controller, fed with the detected label bits, controls the SOA gates to drop, continue, or drop-and-continue the packets. The destinations of the packets to be added are checked against those in the continue arms for possible contention. Benefiting from the parallel labeling and processing, less than 20 ns switching time is observed for the add and drop operations. The traces recorded for the dynamic add/drop operation are shown in Figure 7.10. The packets are labeled with destinations Node 1-3, and B1-B3 stand for the multicast packets originating from Node 1-3.

Figure 7.10: Traces recorded for dynamic add/drop operation.

The dropped channel is detected by the coherent receiver and captured by a real-time 40 GSa/s oscilloscope for demodulation by offline DSP, as depicted in Figure 7.8 (b). From a practical point of view, the local oscillator (LO) should be implemented by a fast (nanoseconds) tunable laser controlled by the node controller. The total input optical power to each node is 1 dBm and the SOA gates are biased at 80 mA. The BER curves for one dropped channel at the three different nodes are illustrated in Figure 7.11 (a); the measured OSNR penalty was less than 0.8 dB. Figure 7.11 (b) reports the BER for all channels at the drop port of Node 1 with 16 dB received OSNR. Similar performance over the 12 wavelengths has been observed, and the small differences are mainly due to the residual unbalanced input power. It can be seen that the dispersion is well compensated by the DSP [149], and wavelength-independent operation with a limited OSNR penalty, caused by the ASE noise from the SOAs, has been achieved.
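The reported OSNR penalty is obtained by comparing the measured curve with the back-to-back reference at a fixed BER. The helper below shows this interpolation step on made-up placeholder curves (the arrays are illustrative, not the measured data):

```python
import numpy as np

def osnr_at_ber(osnr_db, ber, target_ber):
    """Interpolate a measured BER-vs-OSNR curve (log10 BER vs OSNR in dB)
    to find the OSNR required to reach target_ber."""
    logb = np.log10(ber)
    order = np.argsort(logb)
    return float(np.interp(np.log10(target_ber), logb[order],
                           np.asarray(osnr_db, dtype=float)[order]))

# Placeholder curves: back-to-back vs. after one node (not the measured data).
osnr     = [11, 12, 13, 14, 15, 16]
ber_b2b  = [2e-2, 8e-3, 3e-3, 9e-4, 2e-4, 5e-5]
ber_node = [3e-2, 1.2e-2, 5e-3, 1.5e-3, 4e-4, 1e-4]

target = 1e-3
penalty = osnr_at_ber(osnr, ber_node, target) - osnr_at_ber(osnr, ber_b2b, target)
print(f"OSNR penalty at BER {target:.0e}: {penalty:.2f} dB")
```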

Figure 7.11: (a) BER curves for dropped traffic at 3 nodes. (b) Wavelength dependency.

7.3.3 Scalability investigation

The capability of switching more wavebands is also investigated for the OLS add-drop node. Additional wavebands are inserted into the WDM traffic, and the BER curves of the switched traffic after one node when three (18λ) and four wavebands (24λ) are processed are illustrated in Figure 7.12. Compared with the back-to-back BER curve, a negligible penalty is found when scaling the number of waveband channels. This indicates that the OADM is capable of processing multiple wavebands with very limited performance degradation.

Figure 7.12: BER curves for 2, 3, and 4 wavebands after a single node.

In DCI metro networks, scalability to a large number of nodes is an important feature for the add-drop node.

Therefore, the capability of supporting multiple waveband channels crossing a large number of OLS add-drop nodes is experimentally assessed by employing the 25-km recirculating loop set-up shown in Figure 7.13.

Figure 7.13: 25 km looping set-up with the OLS add-drop node prototype.

First, two wavebands are sent into the loop with the input power controlled by an attenuator. The minimum input optical power guaranteeing a BER below the 7% hard-decision (HD) FEC limit is reported in Figure 7.14 for different numbers of loops. A large dynamic range (covered by the green area) of more than 7 dB is achieved for 9 loops (225 km), showing the tolerance to the power fluctuations experienced by the traffic.

Figure 7.14: Input power dynamic range for 2 wavebands.

Then the BER and OSNR for 2, 3 and 4 wavebands are tested, and the results are shown in Figure 7.15. Degradation caused by the amplifiers results in a linear (0.25 dB/node) OSNR decrease of the received drop signal. For 2 wavebands (12λ), 9-node (225 km) and 15-node (375 km) operation is obtained below the HD-FEC and soft-decision (SD) FEC limits, respectively. Scaling the number of wavebands comes at a small expense due to the lower OSNR caused by gain sharing. For 4-waveband (24λ) traffic, 8-loop and 11-loop operation is obtained below the HD-FEC and SD-FEC limits, which accounts for a Terabit/s-scale total capacity over 200 km and 275 km, respectively.

Figure 7.15: BER and OSNR for 2, 3, and 4 wavebands in the looping test.
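A first-order estimate of the reachable node count follows directly from the linear 0.25 dB/node OSNR slope quoted above; the launch OSNR and the OSNR required at the FEC threshold in the example below are hypothetical values used only for illustration.

```python
def max_nodes(osnr_launch_db, osnr_required_db, degradation_db_per_node=0.25):
    """Simple linear budget: how many OLS add-drop nodes can be crossed before
    the OSNR falls below the value needed at the FEC threshold. All inputs are
    assumptions except the ~0.25 dB/node slope reported above."""
    margin = osnr_launch_db - osnr_required_db
    return int(margin // degradation_db_per_node)

# Hypothetical numbers: 20 dB launch OSNR, 16 dB needed at the HD-FEC limit.
print(max_nodes(20.0, 16.0))   # -> 16 nodes with a 0.25 dB/node slope
```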

7.4 Summary

A novel optical add-drop node proposed for DCI metro networks has been investigated in this chapter. The optical label switching introduced in Chapter 2 has been employed in the OLS add-drop node as the control technique for traffic forwarding and switching. The optical in-band RF tone label not only carries the destination information, including the priority assignment, but also acts as an indicator of the power level of each WDM channel. 20 ns add/drop operation is enabled by the nanoseconds FPGA-based switching control. Part of the optical label power is monitored by the power equalization feedback circuit, which facilitates the automatic adjustment of the optical gain of the gate SOAs to equalize the channels. A fast (~100 ns) response has been achieved with off-the-shelf components, and further improvement could be made by using dedicated circuits. Prototype nodes have been implemented for the performance assessments by integrating discrete SOA gates with dedicated electronic drivers, the label generator and processor, and the FPGA controller.

A 40 km ring network consisting of four OLS add-drop nodes is studied first. The experimental results show dynamic add/drop operation for a packetized 50 Gb/s

NRZ-OOK payload. Employing a 16-packet buffer at each node, a low packet loss ratio and an average delivery time of 110 µs have been observed for a traffic load of 0.4. Benefiting from statistical multiplexing, the numerical investigation shows efficient bandwidth utilization with wavelength re-use: scaling to N (4, 8 and 16) nodes, N/2 wavelength channels are adequate for loss-free and low-latency delivery at loads up to 0.6.

The capability of handling high-capacity waveband signals has been experimentally assessed with 6λ × 112 Gb/s DP-QPSK traffic. Two wavebands are generated with 50-GHz-spaced DWDM channels, and the switching performance of the OLS add-drop node is investigated in a 3-node ring network. Dynamic switching with 20 ns operation time and < 0.8 dB OSNR penalty is reported. Scalability in terms of the number of wavebands and crossed nodes is studied in a 25-km looping set-up. A dynamic range larger than 7 dB has been obtained for a 2-waveband (12λ) signal crossing up to 9 nodes (225 km). 4-waveband (24λ) operation providing Terabit/s capacity across 11 nodes and 275 km is also achieved with limited performance degradation, which confirms the suitability of the OLS add-drop node for high-capacity and low-latency DCI metro networks.


Chapter 8

Photonic integrated fast optical switch

In this chapter, a photonic integrated 4 × 4 WDM fast optical switch is introduced and the assessment of its switching performance with different types of traffic is presented 6. The system-level validations of the fast optical switch node based on the developed prototypes have been reported in the previous chapters, and promising capabilities have been demonstrated for potential applications in high-capacity and low-latency data center networks as well as data center interconnect metro networks. Commercial discrete components have been used so far, whereas a practical implementation would require the integration of hundreds or more of those optical components, resulting in power-inefficient and bulky systems. The realization of photonic integrated circuits (PICs) can address this issue, with the promise of reduced footprint and power consumption. This would be especially beneficial for application scenarios like data center networks, resulting in diminished networking complexity, easier management and fewer concerns about cooling.

In view of this, a 4 × 4 WDM fast optical switch PIC has been designed and fabricated, exploiting the modular architecture featured by the fast optical switch node. More than 100 components, including SOAs, AWGs and couplers, are integrated on the same chip. The static characterization as well as the dynamic switching operation has been performed, and the switching performance when handling high data-rate and multi-level modulated traffic is experimentally investigated. A limited power penalty has been observed due to the ASE noise introduced by the SOAs. The loss compensation offered by the SOAs allows for a large dynamic range, indicating the potential scalability of the WDM optical switch PIC to higher data rates and larger port counts.

In the following, the structure and the design parameters of the fast optical switch PIC are explained in Section 8.1. The nanoseconds dynamic switching operation with 40 Gb/s WDM traffic is presented in Section 8.2. The capability of handling high data-rate and multi-level modulated traffic, showing a large input power dynamic range with limited power penalty, is discussed in Section 8.3.

6 Parts of this chapter are based on the results published in [157].

8.1 Fast optical switch PIC

The schematic of the photonic integrated fast optical switch is shown in Figure 8.1. The non-blocking fast optical switch has N inputs, each carrying M different WDM wavelength channels generated by the edge nodes. The modular structure enables parallel processing of the N WDM inputs by the respective optical modules. Each optical module consists of a 1:N splitter that broadcasts the WDM channels to N wavelength selective switches (WSSs). The outputs of the N WSSs are connected to the respective N output ports. Each WSS can select one or more wavelength channels and forward them to the output port according to the switching control signals. The WSS consists of two AWGs and M SOA-based optical gates: the first 1 × M AWG operates as a wavelength demultiplexer, turning the M SOAs on or off determines which wavelength channels are forwarded to the output and which are blocked, and the second M × 1 AWG operates as a wavelength multiplexer. The broadband operation of the SOAs enables operation with any C-band wavelength. Moreover, the amplification provided by the SOAs compensates the losses introduced by the two AWGs. It should be noted that the total number of gate SOAs is N × N × M, of which only a subset (< N × M) is turned on simultaneously.

Figure 8.1: Schematic of the WDM fast optical switch PIC.
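Under the modular structure just described, the gate that must be switched on for a given (input, output, wavelength) combination, and the resulting component budget, can be expressed as in the sketch below (0-based indices; the numbering is illustrative and does not reflect the physical chip layout):

```python
def soa_gate_index(inp: int, out: int, wavelength: int, n: int, m: int) -> int:
    """Flat index of the gate SOA connecting input 'inp' on 'wavelength'
    to output 'out' in the N x (N x M) gate array."""
    assert 0 <= inp < n and 0 <= out < n and 0 <= wavelength < m
    return (inp * n + out) * m + wavelength

def component_count(n: int, m: int) -> dict:
    """Rough component budget of the modular switch fabric described above."""
    return {
        "booster_soas": n,            # one per input module
        "gate_soas": n * n * m,       # M gates in each of the N x N WSSs
        "awgs": 2 * n * n,            # demux + mux AWG per WSS
        "splitters_1xN": n,           # broadcast stage of each module
    }

print(component_count(n=4, m=4))                       # 4x4 switch, 4 wavelengths
print(soa_gate_index(inp=0, out=2, wavelength=1, n=4, m=4))
```

For N = M = 4 this yields the 64 gate SOAs and 32 AWGs of the fabricated chip described next.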

Based on the schematic shown in Figure 8.1, a PIC integrating four optical modules, each with four WDM channels (without the wavelength combiner), has been designed, as shown in Figure 8.2. The chip has been realized in a multi-project wafer (MPW) run on the Jeppix platform with a limited cell size (6 mm × 4 mm). Each of the four identical modules processes one of the four WDM inputs. At the input of each module, an 800 μm booster SOA is employed to compensate the 6 dB loss of the 1:4 splitter and part of the AWG losses in the WSS. The passive 1:4 splitter is realized by cascading 1 × 2 multi-mode interference (MMI) couplers, with the outputs connected to four identical WSSs. The AWGs of the WSS are designed with a free spectral range (FSR) of 15 nm, tailored to fit the limited cell size offered by the MPW. The active quantum-well InGaAsP/InP SOA gates have a length of 350 μm. The input and output facets of the chip are anti-reflection coated. The chip includes a total of 112 elements (4 booster SOA pre-amplifiers with DC bias, 64 gate SOAs with RF bias, 32 AWGs, and MMI couplers). The light-shaded electrodes are wire-bonded to the neighboring PCBs to enable the control of the SOA gates. Lensed fibers have been employed to couple the light into and out of the chip. Spectral tuning of the AWGs is in principle possible by using heaters on top of the AWG waveguides (not implemented in this design). The parameters and static performance, including the operational bandwidth of the SOAs, the central wavelengths and passbands of the AWGs, and the crosstalk between the different channels, have been characterized in [141].

Figure 8.2: Schematic of the fabricated 4 × 4 fast optical switch PIC.

8.2 10 Gb/s, 20 Gb/s and 40 Gb/s NRZ-OOK traffic

8.2.1 Experimental set-up

To assess the performance of the 4 × 4 fast optical switch PIC, the experimental set-up shown in Figure 8.3 is employed. Four WDM optical channels with packetized NRZ-OOK (PRBS) payloads at data rates of 10 Gb/s, 20 Gb/s and 40 Gb/s are generated. The packets have a duration of 540 ns and a guard time of 60 ns. The four WDM channels are de-correlated, amplified and injected into the photonic integrated WDM optical switch PIC via a polarization controller. Module 1, one of the four identical modules, has been selected for the switching performance assessment. The total power launched into input Port 1 is 2 dBm (corresponding to -4 dBm per channel). The input SOA, acting as a booster amplifier, is continuously biased with 100 mA of current. The shorter SOAs in the WSS, acting as optical gates on the different channels, are controlled by an FPGA-based switch controller. The temperature of the PIC is maintained at 25 °C by a water cooler.

Figure 8.3: Experimental set-up for the performance assessment with NRZ-OOK traffic.

8.2.2 Dynamic switching

The dynamic switching operation of a single WSS (WSS 1 in Figure 8.2) is investigated first. The WDM packets arriving at WSS 1 in Module 1 are de-multiplexed and controlled individually by the gate SOAs. The switch controller turns on one or multiple gates to forward the packets, and the selected wavelength channels are multiplexed at the WSS output. The traces of the WDM input packets (shown in black) are illustrated in Figure 8.4. Each packet is labeled with the wavelength channel that needs to be switched to Output 1, while packets labeled with M indicate that multiple gates are enabled for multicasting operation. The switch controls (Ctrl 1-4) generated by the FPGA and applied to the 4 SOA gates of the WSS are also illustrated in Figure 8.4.

The control signals are synchronized with the packets, and the on state corresponds to a bias current of 40 mA. The packets of the four channels (CH 1-4) are switched within ~10 nanoseconds, and the outputs are presented in Figure 8.4. The traces indicate that the packets are properly switched according to the control signals, with a contrast ratio larger than 28 dB [141].

Figure 8.4: Traces for the 4 channels in WSS 1 destined to Output 1.

As a second system assessment, the dynamic switching operation of the four WSSs has been investigated. In this case, packets at wavelength Channel 1 (1525 nm) are switched to one of the 4 output ports by controlling the Channel 1 SOA gates of the 4 WSSs. The traces of the input packets, the control signals and the switched outputs are reported in Figure 8.5. The input packets are labeled with the destination output ports, and broadcasting ("B") to two or more ports is also enabled by turning on multiple SOA gates. It can be seen from Figure 8.5 that fast dynamic switching operation in the space, wavelength and time domains is fully supported by the WDM fast optical switch PIC.

Figure 8.5: Traces for Channel 1 in Module 1 destined to the four outputs.

8.2.3 BER performance

To quantify the performance of the fast optical switch PIC, the BER curves for each WDM channel at the different data rates are measured, and the results are shown in Figure 8.6 (a)-(c). The back-to-back (B2B) curves are also included for reference. The gate SOA of the channel under test is supplied with 60 mA driving current, and the output is amplified and sent to the BER tester (BERT). When a single wavelength channel is input to the PIC (blue curves), error-free operation with less than 0.5 dB power penalty has been measured for Channel 1 (CH 1) and Channel 3 (CH 3) at the different data rates, while for Channel 2 (CH 2) and Channel 4 (CH 4) the penalty was around 1 dB at 10 Gb/s and 20 Gb/s, and 2 dB at 40 Gb/s. The eye diagrams of the switched output are also reported and confirm that the signal degradation for CH 2 and CH 4 is mainly due to accumulated noise. When all four WDM input channels are fed into the switch, the BER results (red curves) indicate a slight performance degradation, with an extra penalty of around 0.5 dB for CH 1 and CH 3 and 1 dB for CH 2 and CH 4 compared with single-wavelength operation at the different data rates.

Figure 8.6: BER curves for single-channel and WDM-channel input at (a) 10 Gb/s; (b) 20 Gb/s; (c) 40 Gb/s.

The assessment results show that the WDM fast optical switch PIC can dynamically handle WDM data traffic within a few nanoseconds in the space and wavelength domains. Error-free operation has been measured for 10 Gb/s, 20 Gb/s and 40 Gb/s single as well as multiple WDM channels, with < 1.5 dB and < 3 dB penalty for CH 1/3 and CH 2/4, respectively. As reported in the characterization [141], CH 2 and CH 4 experience extra losses due to substantial optical power coupling between the channels, caused by the not fully resolved waveguides after the de-multiplexer AWG. The resulting lower input power into the gate SOAs leads to OSNR degradation, which is also confirmed by the BER curves and eye diagrams in Figure 8.6. The gain compensation provided by the SOAs and the resulting limited power penalty indicate the potential of scaling the PIC to higher data rates and port counts.

8.3 High data-rate and multi-level modulated traffic

Considering the application of the fast optical switch PIC in DCNs deploying advanced optical interconnect solutions, the capability of handling high data-rate and multi-level modulated traffic is a necessity. The potential impact on the signal integrity, and the OSNR degradation at lower input power, should therefore be well addressed.

8.3.1 Experimental set-up

To assess the performance of the fast optical switch PIC in switching high-capacity and multi-level modulated traffic, the experimental set-up shown in Figure 8.7 is employed. Four lasers are utilized to generate the WDM channels. Three types of traffic are generated and tested, namely 40 Gb/s NRZ-OOK, 20 Gb/s PAM4 and data-rate-adaptive DMT. The four WDM channels are de-correlated, amplified and polarization-controlled before reaching the fast optical switch PIC. Module 1, one of the four identical modules, has been selected for the switching performance assessment. The bias currents for the booster SOA and the gate SOAs are adjusted by an FPGA-based switch controller. The temperature of the chip is maintained at 25 °C by a water cooler.

Figure 8.7: Experimental set-up for the performance assessment with high data-rate and multi-level modulated traffic.
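As a reminder of how the 20 Gb/s PAM4 test traffic maps bits to symbols (2 bits per symbol, hence 10 GBaud), a minimal Gray-coded mapper is sketched below; pulse shaping, pre-emphasis and the actual arbitrary-waveform generation are omitted, and the normalized level set is an illustrative choice.

```python
import numpy as np

# Gray-coded PAM4: two bits per symbol, so 20 Gb/s corresponds to 10 GBaud.
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

def pam4_map(bits: np.ndarray) -> np.ndarray:
    """Map a bit stream (even length) to normalized PAM4 levels in {-3,-1,+1,+3}."""
    pairs = bits.reshape(-1, 2)
    return np.array([GRAY_PAM4[(int(b0), int(b1))] for b0, b1 in pairs], dtype=float)

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=32)
print(pam4_map(bits)[:8])
```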

8.3.2 Assessment results

The switching of the packetized 40 Gb/s NRZ-OOK (PRBS) traffic is analyzed first. The power of each channel launched into the WDM optical switch is -4 dBm, and the driving currents for the booster SOA and gate SOA are 100 mA and 60 mA, respectively. At the output, the switched traffic is amplified and sent to the BER tester (BERT). The BER curves as well as the eye diagrams for Channel 1 with single-channel and WDM input applied are shown in Figure 8.8 (a). The back-to-back (B2B) curve is also included for reference. Error-free operation with a 0.5 dB power penalty has been measured for the single-channel case. When four WDM channels are fed into the switch, the results indicate a slight performance degradation, with a penalty of around 1 dB mainly due to the noise introduced by the SOAs. The power penalty measured for different input optical powers is plotted in Figure 8.8 (b). The bias currents of the booster and gate SOAs are varied to equalize the output power of the chip. For single-channel input, a 14 dB dynamic range is achieved with less than 1.5 dB power penalty; WDM input costs around 1 dB extra penalty, and a smaller dynamic range of 10 dB is obtained in that case. The required bias currents for the booster SOA and gate SOA are also illustrated in Figure 8.8 (b): higher input power requires less driving current for both SOAs to achieve an equalized output power.

Figure 8.8: (a) BER curves and (b) dynamic range with adjusted SOA currents for 40 Gb/s traffic.

As promising interconnect solutions in DCNs, PAM4 and DMT can effectively boost the link capacity, and the corresponding switching performance has been analyzed for the fast optical switch PIC. The WDM channels are modulated with the PAM4 and DMT traffic generated by a 24 GSa/s arbitrary waveform generator. At the switch output, the traffic is received by a real-time 50 GSa/s digital phosphor oscilloscope (DPO) for offline DSP. The BER curves and the eye diagrams of the switched 20 Gb/s PAM4 traffic on Channel 1 are shown in Figure 8.9 (a). A power penalty of 0.4 dB and 1 dB has been

observed for the single-channel and WDM input, respectively. The optical power dynamic range guaranteeing the target BER is then studied. The power penalty when varying the input power, together with the corresponding bias currents of both SOAs for an equalized output, is depicted in Figure 8.9 (b). Both cases achieve > 10 dB dynamic range within 2 dB power penalty. As for the NRZ-OOK traffic, the lower input powers are limited by OSNR degradation, where the noise becomes dominant. A larger penalty is also found for higher input power due to the increased sensitivity to saturation, as confirmed by the eye diagrams in Figure 8.9 (b). The case without the output EDFA is also investigated for the single-channel input, and the dynamic range result is presented in Figure 8.9 (b): a similar trend with less power penalty is observed compared with the case with the EDFA, while the smaller dynamic range is mainly due to the inadequate output power level when low input power is applied.

Figure 8.9: (a) BER curves and (b) dynamic range with adjusted SOA currents for 20 Gb/s PAM4 traffic.

For the DMT traffic, we first evaluated the effect of the input power to the optical switch PIC on the achievable data rate. As illustrated in Figure 8.10 (a), 0 dBm is found to be the optimum (the absolute data rate is not optimized). An example of the optimal bit allocation after bit loading for the 10 GHz DMT signal with 256 sub-carriers is included in Figure 8.10 (b). Channel 1 with 0 dBm optical power is sent to the PIC, and the maximum achievable data rate with an average BER below the target value is reported in Figure 8.10 (c). A 1 dB power penalty at a data rate of 32 Gb/s is introduced by the WDM optical switch PIC compared with the B2B traffic.
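To illustrate the bit-loading idea behind the rate-adaptive DMT, the sketch below applies a simple gap-approximation loading to a hypothetical SNR profile; this is not necessarily the algorithm or the SNR data used in the experiment.

```python
import numpy as np

def bit_loading(snr_db, gamma_db=6.0, max_bits=6):
    """Gap-approximation loading: floor(log2(1 + SNR/Gamma)) bits per sub-carrier,
    capped at max_bits."""
    snr = 10 ** (np.asarray(snr_db, dtype=float) / 10)
    gamma = 10 ** (gamma_db / 10)
    return np.minimum(np.floor(np.log2(1 + snr / gamma)), max_bits).astype(int)

# Hypothetical low-pass SNR profile over 256 sub-carriers spanning 10 GHz.
n_sc, bandwidth_hz = 256, 10e9
snr_db = 28 - 18 * np.linspace(0, 1, n_sc)   # 28 dB near DC, 10 dB at the band edge
bits = bit_loading(snr_db)

frame_rate = bandwidth_hz / n_sc             # one QAM symbol per sub-carrier per DMT frame (cyclic prefix ignored)
print("bits per DMT frame:", int(bits.sum()))
print("approximate gross rate: %.1f Gb/s" % (bits.sum() * frame_rate / 1e9))
```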


More information

GUIDE. Optimal Network Designs with Cohesity

GUIDE. Optimal Network Designs with Cohesity Optimal Network Designs with Cohesity TABLE OF CONTENTS Introduction...3 Key Concepts...4 Five Common Configurations...5 3.1 Simple Topology...5 3.2 Standard Topology...6 3.3 Layered Topology...7 3.4 Cisco

More information

WP-PD Wirepas Mesh Overview

WP-PD Wirepas Mesh Overview WP-PD-123 - Wirepas Mesh Overview Product Description Version: v1.0a Wirepas Mesh is a de-centralized radio communications protocol for devices. The Wirepas Mesh protocol software can be used in any device,

More information

ECE 650 Systems Programming & Engineering. Spring 2018

ECE 650 Systems Programming & Engineering. Spring 2018 ECE 650 Systems Programming & Engineering Spring 2018 Networking Introduction Tyler Bletsch Duke University Slides are adapted from Brian Rogers (Duke) Computer Networking A background of important areas

More information

Low latency and efficient optical flow control for intra data center networks

Low latency and efficient optical flow control for intra data center networks Low latency and efficient optical flow control for intra data center networks Wang Miao, * Stefano Di Lucente, Jun Luo, Harm Dorren, and Nicola Calabretta COBRA Research Institute, Eindhoven University

More information

Huawei CloudFabric Solution Optimized for High-Availability/Hyperscale/HPC Environments

Huawei CloudFabric Solution Optimized for High-Availability/Hyperscale/HPC Environments Huawei CloudFabric Solution Optimized for High-Availability/Hyperscale/HPC Environments CloudFabric Solution Optimized for High-Availability/Hyperscale/HPC Environments Internet Finance HPC VPC Industry

More information

SUCCESSFUL STRATEGIES FOR NETWORK MODERNIZATION AND TRANSFORMATION

SUCCESSFUL STRATEGIES FOR NETWORK MODERNIZATION AND TRANSFORMATION SUCCESSFUL STRATEGIES FOR NETWORK MODERNIZATION AND TRANSFORMATION A Technology Evolution Perspective Both fixed line and mobile service providers are facing a growing variety of challenges related to

More information

Optical Burst Switching (OBS): The Dawn of A New Era in Optical Networking

Optical Burst Switching (OBS): The Dawn of A New Era in Optical Networking Optical Burst Switching (OBS): The Dawn of A New Era in Optical Networking Presented by Yang Chen (LANDER) Yang Chen (Lander) 1 Outline Historical Review Burst reservation Burst assembly OBS node Towards

More information

Arista AgilePorts INTRODUCTION

Arista AgilePorts INTRODUCTION ARISTA TECHNICAL BULLETIN AgilePorts over DWDM for long distance 40GbE INSIDE AGILEPORTS Arista AgilePorts allows four 10GbE SFP+ to be combined into a single 40GbE interface for easy migration to 40GbE

More information

The Arista Universal transceiver is the first of its kind 40G transceiver that aims at addressing several challenges faced by today s data centers.

The Arista Universal transceiver is the first of its kind 40G transceiver that aims at addressing several challenges faced by today s data centers. ARISTA WHITE PAPER QSFP-40G Universal Transceiver The Arista Universal transceiver is the first of its kind 40G transceiver that aims at addressing several challenges faced by today s data centers. Increased

More information

Extend Your Reach. with. Signature Core Fiber Optic Cabling System

Extend Your Reach. with. Signature Core Fiber Optic Cabling System Extend Your Reach with Signature Core Fiber Optic Cabling System What Signature Core System Can Do For You Saves capital expenditures Allows using multimode fiber in some applications that may have required

More information

Software-Defined Networking (SDN) Overview

Software-Defined Networking (SDN) Overview Reti di Telecomunicazione a.y. 2015-2016 Software-Defined Networking (SDN) Overview Ing. Luca Davoli Ph.D. Student Network Security (NetSec) Laboratory davoli@ce.unipr.it Luca Davoli davoli@ce.unipr.it

More information

Knowledge-Defined Network Orchestration in a Hybrid Optical/Electrical Datacenter Network

Knowledge-Defined Network Orchestration in a Hybrid Optical/Electrical Datacenter Network Knowledge-Defined Network Orchestration in a Hybrid Optical/Electrical Datacenter Network Wei Lu (Postdoctoral Researcher) On behalf of Prof. Zuqing Zhu University of Science and Technology of China, Hefei,

More information

Routing Domains in Data Centre Networks. Morteza Kheirkhah. Informatics Department University of Sussex. Multi-Service Networks July 2011

Routing Domains in Data Centre Networks. Morteza Kheirkhah. Informatics Department University of Sussex. Multi-Service Networks July 2011 Routing Domains in Data Centre Networks Morteza Kheirkhah Informatics Department University of Sussex Multi-Service Networks July 2011 What is a Data Centre? Large-scale Data Centres (DC) consist of tens

More information

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision At-A-Glance Unified Computing Realized Today, IT organizations assemble their data center environments from individual components.

More information

Network on Chip Architecture: An Overview

Network on Chip Architecture: An Overview Network on Chip Architecture: An Overview Md Shahriar Shamim & Naseef Mansoor 12/5/2014 1 Overview Introduction Multi core chip Challenges Network on Chip Architecture Regular Topology Irregular Topology

More information

Storage Networking Strategy for the Next Five Years

Storage Networking Strategy for the Next Five Years White Paper Storage Networking Strategy for the Next Five Years 2018 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public Information. Page 1 of 8 Top considerations for storage

More information

The CORD reference architecture addresses the needs of various communications access networks with a wide array of use cases including:

The CORD reference architecture addresses the needs of various communications access networks with a wide array of use cases including: Introduction Today s Mobile Network Operator (MNO) infrastructure is built with proprietary vertically integrated Network Elements (NEs), leading to inefficient utilization of network resources. Further,

More information

White Paper. OCP Enabled Switching. SDN Solutions Guide

White Paper. OCP Enabled Switching. SDN Solutions Guide White Paper OCP Enabled Switching SDN Solutions Guide NEC s ProgrammableFlow Architecture is designed to meet the unique needs of multi-tenant data center environments by delivering automation and virtualization

More information

Singlemode vs Multimode Optical Fibre

Singlemode vs Multimode Optical Fibre Singlemode vs Multimode Optical Fibre White paper White Paper Singlemode vs Multimode Optical Fibre v1.0 EN 1 Introduction Fibre optics, or optical fibre, refers to the medium and the technology associated

More information

An Industrial Employee Development Application Protocol Using Wireless Sensor Networks

An Industrial Employee Development Application Protocol Using Wireless Sensor Networks RESEARCH ARTICLE An Industrial Employee Development Application Protocol Using Wireless Sensor Networks 1 N.Roja Ramani, 2 A.Stenila 1,2 Asst.professor, Dept.of.Computer Application, Annai Vailankanni

More information

Cabling Solutions for Nexus 9000 Switches With Dual-Rate 40/100G BiDi Transceiver and Panduit Connectivity

Cabling Solutions for Nexus 9000 Switches With Dual-Rate 40/100G BiDi Transceiver and Panduit Connectivity Cabling Solutions for Nexus 9000 Switches With Dual-Rate 40/100G BiDi Transceiver and Panduit Connectivity Introduction Virtualization and cloud technologies have applied an incredible amount of pressure

More information

OPEN COMPUTE PLATFORMS POWER SOFTWARE-DRIVEN PACKET FLOW VISIBILITY, PART 2 EXECUTIVE SUMMARY. Key Takeaways

OPEN COMPUTE PLATFORMS POWER SOFTWARE-DRIVEN PACKET FLOW VISIBILITY, PART 2 EXECUTIVE SUMMARY. Key Takeaways OPEN COMPUTE PLATFORMS POWER SOFTWARE-DRIVEN PACKET FLOW VISIBILITY, PART 2 EXECUTIVE SUMMARY This is the second of two white papers that describe how the shift from monolithic, purpose-built, network

More information

Data Center Network Topologies II

Data Center Network Topologies II Data Center Network Topologies II Hakim Weatherspoon Associate Professor, Dept of Computer cience C 5413: High Performance ystems and Networking April 10, 2017 March 31, 2017 Agenda for semester Project

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

High-bandwidth CX4 optical connector

High-bandwidth CX4 optical connector High-bandwidth CX4 optical connector Dubravko I. Babić, Avner Badihi, Sylvie Rockman XLoom Communications, 11 Derech Hashalom, Tel-Aviv, Israel 67892 Abstract We report on the development of a 20-GBaud

More information

Brief Background in Fiber Optics

Brief Background in Fiber Optics The Future of Photonics in Upcoming Processors ECE 4750 Fall 08 Brief Background in Fiber Optics Light can travel down an optical fiber if it is completely confined Determined by Snells Law Various modes

More information

Substation. Communications. Power Utilities. Application Brochure. Typical users: Transmission & distribution power utilities

Substation. Communications. Power Utilities. Application Brochure. Typical users: Transmission & distribution power utilities Power Utilities Application Brochure Communications Typical users: Transmission & distribution power utilities For more than 30 years, RAD has worked closely with its worldwide energy utility customers

More information

Service Mesh and Microservices Networking

Service Mesh and Microservices Networking Service Mesh and Microservices Networking WHITEPAPER Service mesh and microservice networking As organizations adopt cloud infrastructure, there is a concurrent change in application architectures towards

More information

OPTICAL EXPRESS The Key to Facilitating Cost-Effective and Efficient Network Growth

OPTICAL EXPRESS The Key to Facilitating Cost-Effective and Efficient Network Growth WHITE PAPER OPTICAL EXPRESS The Key to Facilitating Cost-Effective and Efficient Network Growth Driven by a new generation of high-bandwidth consumer and business services and applications, the demand

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

Satellite-Based Cellular Backhaul in the Era of LTE

Satellite-Based Cellular Backhaul in the Era of LTE Satellite-Based Cellular Backhaul in the Era of LTE Introduction 3 Essential Technologies for 3G/LTE Backhauling over Satellite 6 Gilat s Solution SkyEdge II-c Capricorn 7 Why Ultra-fast TDMA is the Only

More information

Networking for a smarter data center: Getting it right

Networking for a smarter data center: Getting it right IBM Global Technology Services October 2011 Networking for a smarter data center: Getting it right Planning the network needed for a dynamic infrastructure 2 Networking for a smarter data center: Getting

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Oracle Exadata Statement of Direction NOVEMBER 2017

Oracle Exadata Statement of Direction NOVEMBER 2017 Oracle Exadata Statement of Direction NOVEMBER 2017 Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated

More information

Alcatel-Lucent 1850 TSS Product Family. Seamlessly migrate from SDH/SONET to packet

Alcatel-Lucent 1850 TSS Product Family. Seamlessly migrate from SDH/SONET to packet Alcatel-Lucent 1850 TSS Product Family Seamlessly migrate from SDH/SONET to packet The Alcatel-Lucent 1850 Transport Service Switch (TSS) products are a family of Packet-Optical Transport switches that

More information

Arista 7170 series: Q&A

Arista 7170 series: Q&A Arista 7170 series: Q&A Product Overview What are the 7170 series? The Arista 7170 Series are purpose built multifunctional programmable 100GbE systems built for the highest performance environments and

More information

ALCATEL-LUCENT ENTERPRISE DATA CENTER SWITCHING SOLUTION Automation for the next-generation data center

ALCATEL-LUCENT ENTERPRISE DATA CENTER SWITCHING SOLUTION Automation for the next-generation data center ALCATEL-LUCENT ENTERPRISE DATA CENTER SWITCHING SOLUTION Automation for the next-generation data center For more info contact Sol Distribution Ltd. A NEW NETWORK PARADIGM What do the following trends have

More information

Developing flexible WDM networks using wavelength tuneable components

Developing flexible WDM networks using wavelength tuneable components Developing flexible WDM networks using wavelength tuneable components A. Dantcha 1, L.P. Barry 1, J. Murphy 1, T. Mullane 2 and D. McDonald 2 (1) Research Institute for Network and Communications Engineering,

More information

MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT

MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT Item Type text; Proceedings Authors Grebe, David L. Publisher International Foundation for Telemetering Journal International Telemetering

More information

Working Analysis of TCP/IP with Optical Burst Switching Networks (OBS)

Working Analysis of TCP/IP with Optical Burst Switching Networks (OBS) Working Analysis of TCP/IP with Optical Burst Switching Networks (OBS) Malik Salahuddin Nasir, Muabshir Ashfaq and Hafiz Sabir Hussain CS&IT Department, Superior University, 17-km off Riwind Road, Lahore

More information

Hybrid On-chip Data Networks. Gilbert Hendry Keren Bergman. Lightwave Research Lab. Columbia University

Hybrid On-chip Data Networks. Gilbert Hendry Keren Bergman. Lightwave Research Lab. Columbia University Hybrid On-chip Data Networks Gilbert Hendry Keren Bergman Lightwave Research Lab Columbia University Chip-Scale Interconnection Networks Chip multi-processors create need for high performance interconnects

More information

Enhancing Bandwidth Utilization and QoS in Optical Burst Switched High-Speed Network

Enhancing Bandwidth Utilization and QoS in Optical Burst Switched High-Speed Network 91 Enhancing Bandwidth Utilization and QoS in Optical Burst Switched High-Speed Network Amit Kumar Garg and R S Kaler School of Electronics and Communication Eng, Shri Mata Vaishno Devi University (J&K),

More information