Ethernet for the ATLAS Second Level Trigger Franklin Saka


Ethernet for the ATLAS Second Level Trigger

by

Franklin Saka

Royal Holloway College, Physics Department
University of London

2001

Thesis submitted in accordance with the requirements of the University of London for the degree of Doctor of Philosophy

Abstract

In preparation for building the ATLAS second level trigger, various networks and protocols are being investigated. Advances in Ethernet LAN technology have seen the speed increase from 10 Mbit/s to 100 Mbit/s and 1 Gigabit/s, and there are organisations looking at taking Ethernet speeds even higher, to 10 Gigabit/s. The price of 100 Mbit/s Ethernet has fallen rapidly since its introduction. Gigabit Ethernet prices are following the same pattern as products are taken up by customers wishing to stay with the Ethernet technology but requiring higher speeds to run the latest applications. The price/performance, longevity and universality of Ethernet have made it an interesting technology for the ATLAS second level trigger network.

The aim of this work is to assess the technology in the context of the ATLAS trigger and data acquisition system. We investigate the technology and its implications. We assess the performance of contemporary, commodity, off-the-shelf Ethernet switches/networks and interconnects. The results of the performance analysis are used to build switch models so that large ATLAS-like networks can be simulated and studied. Finally, we look at the feasibility and prospects for Ethernet in the ATLAS second level trigger, based on current products and estimates of the state of the technology in 2005, when ATLAS is scheduled to come on line.

Acknowledgements

I would like to thank my supervisors, John Strong and Bob Dobinson, for the opportunity to carry out the work presented in this thesis, and for their guidance and advice. I would also like to thank the members of the ATLAS community, Marcel Boosten, Krzysztof Korcyl, Stefan Haas, David Thornley, Roger Heely, Marc Dobson, Brian Martin and other past and present members of Bob Dobinson's group at CERN with whom I was lucky enough to work.

I am also grateful to: PPARC for funding this PhD; my industrial sponsors SGS-Thomson, in particular those I worked with (Gajinder Panesar and Neil Richards), for their help and friendship; CERN and the ESPRIT projects ARCHES (project no ) and SWIFT.

I would like to express my appreciation to Antonia "Dura bueno paella" Martinez, who was there through the sleepless nights (Gracias por haber tenido paciencia), and to Celestino "Celestial" Casanova Canosa, we did it Tino! Thanks also to Stefano "Teti" Caruso, Gabriela Susana "Chiapas chica" Garcia, Teresa "belle potosina" Segovia, Micheal "you guys" Pragassen, Uma "Bala... umski" Shanker, Roy "jock strap" Gomez and all my other dear friends for making the journey more interesting.

Finally, to Ophelia, Sheila, Kelvin, Adil and the rest of my family, thank you for your continued encouragement and support. To David, Maxwell, Rachel and Natalie, I hope you will achieve an equivalent and more in the years to come. This one is dedicated to my mother Evelyn, who saw it all from the start. Cheers mum.


Contents

1 Introduction
   Physics background
   The ATLAS Trigger/DAQ system
   The level-2 trigger
   Thesis Aim
   Thesis Outline
   Context
   Contribution
2 Requirements for the ATLAS second level trigger
   General Requirements
3 A Review of the Ethernet technology
   Introduction
   History of Ethernet
   The Ethernet technology
   Relation to OSI reference model
   Frame format
   Broadcast and multicast
   The CSMA/CD protocol
   Full and Half duplex
   Flow control
   Current transmission rates
   Connecting multiple Ethernet segments
   Routers
   Repeaters and hubs
   Switches and bridges
   The Ethernet switch Standards
   The Bridge Standard
   Virtual LANs (VLANs)
   Quality of service (QoS)
   Trunking
   Higher layer switching
   Switch management
   Reasons for Ethernet
   Conclusion
4 Network interfacing performance issues
   Introduction
   The measurement setup
   The comms 1 measurement procedures
   TCP/IP protocol
   A brief introduction to TCP/IP
   Results with the default setup using Fast Ethernet
   Delayed acknowledgement disabled
   Nagle algorithm and delayed acknowledgement disabled
   A parameterised model of TCP/IP comms 1 communication
   Effects of the socket size on the end-to-end latency
   Results of CPU usage of comms 1 with TCP
   Raw Ethernet
   A parameterised model of the CPU load
   Conclusions for ATLAS
   Gigabit Ethernet compared with Fast Ethernet
   Effects of the processor speed
   TCP/IP and ATLAS
   Decision Latency
   Request-response rate and CPU load
   Conclusion for ATLAS
   MESH
   MESH comms 1 performance
   Scalability in MESH
   Conclusion
   Further work
5 Ethernet Network topologies and possible enhancements for ATLAS
   Introduction
   Scalable networks with standard Ethernet
   Constructing arbitrary network architectures with Ethernet
   The Spanning Tree Algorithm
   Learning and the Forwarding table
   Broadcast and Multicast for arbitrary networks
   Path Redundancy
   Outlook
   Conclusions
6 The Ethernet testbed measurement software and clock synchronisation
   Introduction
   Goals
   An example measurement
   Design decisions
   Testbed setup
   The Traffic Generator program
   The usage of MESH in the ETB software
   Synchronising PC clocks
   Method
   Factors affecting synchronisation accuracy
   Clock drift and skew
   Temperature dependency on the synchronisation
   Integrating clock synchronisation and measurements
   Conditions for best synchronisation
   Summary of clock accuracy
   Measurements procedure
   Configuration files
   The transmitter and receiver
   Considerations in using ETB
   Possible improvements
   Strengths and limitations of ETB
   Commercial testers
   Price Comparison
   Conclusions
7 Analysis of testbed measurements
   Introduction
   Contemporary Ethernet switch architectures
   Operating modes
   Switching Fabrics
   Buffering
   Modelling approach
   Switch modelling
   Introduction
   The parameterised model
   Principles of operation of the parameterised model
   Conclusion
   Characterising Ethernet switches and measuring model parameters
   End-to-End Latency (Comms 1)
   Basic streaming
   Testing the switching fabric architecture
   Testing broadcasts and multicast
   Assessing the sizes of the input and output buffers
   Testing quality of service (QoS) and VLAN features
   Multi-switch measurements
   Saturating Gigabit links
   Conclusions
8 Parameters for contemporary Ethernet switches
   Introduction
   Validation of the parameterised model
   Parameters for the Turboswitch
   Testing the parameterisation on the Intel 550T
   Conclusions
   Performance and parameters of contemporary Ethernet switches
   Switches tested
   Broadcast and Multicast
   Trunking on the Titan T
   Jumbo frames
   Switch management
   Conclusions
9 Conclusions
   Achievements
   Considerations in using Ethernet for the ATLAS LVL2 trigger/DAQ network
   Nodes
   Competing technologies
   Future work
   Summary and conclusions
   Outlook
A Glossary of networking terms
B MESH Overview
C The architecture of a contemporary Ethernet switch
   C.1 Introduction
   C.2 The CPU module
   C.3 The CAM and Logic module
   C.4 The Matrix Module
   C.5 The I/O modules
   C.6 The switch operation
   C.7 Frame ordering
   C.8 Address aging and packet lifetime
   C.9 Conclusions
D A full description of the parameters for modelling switches

List of Tables

3.1 Network diameter or maximum distances for three flavours of Ethernet on various media
A comparison of the MESH and TCP/IP overheads per byte and fixed overheads
A comparison of the MESH and TCP/IP fixed CPU overhead and fixed CPU overhead per ping-pong
The deviation in clocks for Fast and Gigabit Ethernet as a function of the warm-up time, in microseconds per minute
The list of commands for the configuration of the ETB nodes
An example synchronisation result as stored in the global clocks file for six nodes
The commands for measurement initialisation
An example of the output of an ETB transmitter
An example of an ETB receiver output. This shows that node 0 was transmitting frames of 250 bytes to node 1. The achieved throughput was MBytes/s and the average latency was 9782 µs
Model parameters for the Turboswitch 2000 Ethernet switch. The parameters obtained from the ping-pong measurement, from the vendors and from the streaming measurement are marked in the table (for the streaming measurement, the maximum bandwidth for 1500 Bytes is given)
Model parameters for the Intel 550T Ethernet switch. The parameters obtained from the ping-pong measurement and from the streaming measurement are marked in the table (for the streaming measurement, the maximum bandwidth for 1500 Bytes is given). NA implies not applicable
8.3 Model parameters for various Ethernet switches. The parameters obtained from the ping-pong measurement, from the vendors and from the streaming measurement are marked in the table (for the streaming measurement, the maximum bandwidth for 1500 Bytes is given). NA = not applicable

List of Figures

1.1 A schematic of the ATLAS detector
The three levels of the ATLAS trigger/DAQ
The proposed LVL2 architecture
The setup of the ATLAS LVL2 trigger network
The history of the Ethernet technology
An illustration of a segment or collision domain
Ethernet and how it fits into the OSI 7 layer model
The format of the original Ethernet frame
The format of the new Ethernet frame with support for VLANs and eight priority levels
The format of the full duplex Ethernet pause frame
An illustration of a hub
A network with two segments connected by a Bridge
The cost of Fast and Gigabit Ethernet NICs and switches as a function of time
The PC system architecture
An illustration of the protocols in relation to each other
The comms 1 setup
The model of the TCP/IP protocol
Comms 1 under TCP/IP. The default setup: CPU = Pentium 233 MHz MMX; OS = Linux
An illustration of the comms 1 exercise involving the exchange of one TCP segment (not to scale)
4.7 An illustration of the comms 1 exercise involving the exchange of two TCP segments (not to scale)
An illustration of the comms 1 exercise involving the exchange of three TCP segments (not to scale)
Comms 1 under TCP/IP: CPU = Pentium 200 MHz MMX: Nagle algorithm on: Delayed acknowledgement disabled: Socket size = 64 kbytes: OS = Linux
Measurement against parameterised model. Comms 1 under TCP/IP: CPU = Pentium 200 MHz MMX: Nagle algorithm disabled: Delayed ack disabled: Socket size = 64 kbytes: OS = Linux
The flow of the message in the comms 1 exercise
Comms 1 under TCP/IP for various socket sizes: Delayed ack off: Nagle algorithm disabled: CPU = Pentium 200 MHz MMX: Socket size = 64 kbytes: OS = Linux
Comms 1 under TCP/IP with CPU load measured: Delayed ack disabled: CPU = Pentium 200 MHz MMX: Nagle algorithm disabled: Socket size = 64 kbytes: OS = Linux
CPU usage from comms 1 under TCP/IP with CPU load measured: Delayed ack disabled: CPU = Pentium 200 MHz MMX: Nagle algorithm disabled: Socket size = 64 kbytes: OS = Linux
A model of the CPU idle and busy time during the comms 1 measurements
Comms 1 under TCP/IP and raw Ethernet sockets with CPU load measured: CPU = Pentium 200 MHz MMX: Nagle algorithm disabled: Delayed ack on: Socket size = 64 kbytes: OS = Linux
The magnification of Figure 4.16(b). The latency from comms 1 under TCP/IP and raw Ethernet sockets with CPU load measured: CPU = Pentium 200 MHz MMX: Nagle algorithm disabled: Delayed ack on: Socket size = 64 kbytes: OS = Linux
Comms 1 under TCP/IP and raw Ethernet sockets: CPU = Pentium 200 MHz MMX: Nagle algorithm disabled: Delayed ack on: Socket size = 64 kbytes: OS = Linux
4.19 Comms 1 under TCP/IP for Fast and Gigabit Ethernet: Delayed ack on: CPU usage measured: CPU = Pentium 400 MHz: Nagle algorithm disabled: Socket size = 64 kbytes: OS = Linux
CPU load for comms 1 under TCP/IP for Fast and Gigabit Ethernet: Delayed ack on: CPU usage measured: CPU = Pentium 400 MHz: Nagle algorithm disabled: Socket size = 64 kbytes: OS = Linux
The effect on the fixed latency overhead when changing the CPU speed
The modified comms 1 setup to allow the measurement of the request-response rate and the client CPU load
Request-response rate against CPU load for Fast and Gigabit Ethernet on a 400 MHz PC. OS = Linux
The measured request-response rate against CPU load for various processor speeds
Extrapolation of the minimum frame (Figure 4.24) to 100% CPU load
The relationship between the TCP/IP request-response rate and CPU speed at 100% load for minimum and maximum frame sizes
Comms 1 under MESH and TCP/IP for Fast and Gigabit Ethernet: CPU = Pentium 400 MHz: OS = Linux
CPU load for comms 1 under MESH and TCP/IP for Fast and Gigabit Ethernet: CPU = Pentium 400 MHz: OS = Linux
CPU load for comms 1 under MESH. Model vs. measurement for Fast and Gigabit Ethernet: CPU = Pentium 400 MHz: OS = Linux
Fast and Gigabit Ethernet CPU load for MESH and TCP/IP for the minimum and maximum frame lengths. CPU = Pentium 400 MHz: OS = Linux. T = TCP/IP, M = MESH, FE = Fast Ethernet, GE = Gigabit Ethernet, minf = minimum frame, maxf = maximum frame
The change in the maximum MESH CPU load for comms 1. Fast and Gigabit Ethernet. OS = Linux
A tree-like topology. Note that a node can be attached to any of the switches
Connecting the same type of Ethernet switches without being limited by a single link does not increase the number of ports
A link blocked due to a slow receiver
The Ethernet based ATLAS trigger/DAQ network
5.5 An example of one loop path in the Clos network, shown by the bold lines. Each square represents a switch
Broadcast as handled by a modified Clos network. In this simple network, only stations A and C are allowed to broadcast in order to avoid looping frames. The bold lines show the direction of the broadcast frame
A broadcast tree using VLANs in a Clos network. In this network, only switch ports belonging to VLAN b are allowed to forward broadcasts. The bold lines show the direction of the broadcast frame
The PCs used for the LVL2 testbed at CERN
Performance obtained from streaming 6 FE nodes to a single Gigabit node through the BATM Titan T4. The limits of the receiving Gigabit node are reached before the limits of the switch
The setup of the Ethernet measurement testbed
Unidirectional streaming for Fast and Gigabit Ethernet using MESH and UDP. CPU = 400 MHz; OS = Linux
How we synchronise clocks on PCs
A normalised histogram of half the round trip time through a switch
The mean value of the round trip time
The standard deviation of the round trip time
How the gradients of two monitor nodes deviate
The error in the predicted time for different warm-up times
The effect on the drift when the PC side panels are removed
The measurement technique
Standard deviation in gradient
Error in the predicted time over 5 minute intervals
Variation in the sleep time between ping-pongs
Error in the predicted time over 5 minutes for varying time between ping-pongs
The range of the number of points that can be used to make the best line fit
A flow diagram illustrating the synchronisation, measurement and traffic generation in ETB
The frame format of the ETB software
6.20 A comparison of the transmit and receive inter-packet time histograms when sending frames of 1500 bytes at 240 µs inter-packet time
A histogram of the end-to-end latency when sending frames of 1500 bytes at 240 µs inter-packet time
The typical architecture of an Ethernet switch
The crossbar switch architecture
The shared buffer switch architecture
The shared bus switch architecture
The interaction between modelling and measurement activity
The parameterised model: intra-module communication
The parameterised model: inter-module communication
An example plot of the comms 1 measurement. The PC overhead, i.e. the direct connection overhead, should be subtracted to leave the switch port-to-port latency
Port-to-port latency for various Gigabit Ethernet switches
The expected result from streaming
Results from unidirectional streaming through various Gigabit Ethernet switches
Typical plot of load against latency for systematic and random traffic. The latency here refers to the end-to-end latency from one PC into another
Relationship between the ping-pong, the basic streaming and streaming with the systematic traffic pattern
Typical plot of offered load against accepted load. If flow control works properly, we cannot offer more load than we can accept
Typical plot of offered load against lost frame rate. For switches where flow control works properly, we should observe no losses
The setup to discover the maximum throughput to and from the backplane
An example setup to test the priority, rate and latency distribution of broadcast frames compared to unicast frames
Investigating input and output buffer sizes
Fast Ethernet priority test on the BATM Titan T4. High and low priority nodes streaming to a single node
Testing VLANs on a switch. Nodes 1 and 2 are connected to ports on VLAN a, node 3 on VLAN b and node 4 on VLANs a and b
7.21 End-to-end latency through multiple Titan T4 Gigabit Ethernet ports
A setup to test trunking. Trunked links are used to connect two Ethernet switches
Looping back frames to saturate a Gigabit link
Example results comparing a loopback and a non-loopback measurement on the BATM Titan T4
The end-to-end latency for direct connection and through the Turboswitch
The throughput obtained for unidirectional streaming with two nodes through the Turboswitch
The minimum inter-packet time obtained for unidirectional streaming with two nodes through the Turboswitch
The Turboswitch 2000 results from the 3111 setup to discover access into and out of a module
Random traffic for the 3111 setup through the Turboswitch. Traffic is inter-module only
Histogram of latencies for various loads (as a percentage of the Fast Ethernet link rate) for the configuration with random traffic. Model against measurements
The results of the bidirectional streaming tests on the Intel 550T switch. This shows that up to four Fast Ethernet nodes can communicate at the full link rate
Investigating the buffer size in the Intel 550T switch
The performance of the Intel 550T Fast Ethernet switch with random traffic. Model against measurements
A picture of the BATM Titan T4
The Foundry BigIron 4000 switch
Port-to-port latency for broadcast packets. Obtained from comms 1
The frame rate obtained when streaming broadcast packets through the Titan T4
B.1 The transmit and receive cycles in MESH (Source: Boosten [10])
C.1 The architecture of the Turboswitch
C.2 The format of the control packet from the CAM/logic module
C.3 An illustration of two modules of the Turboswitch 2000 and their connection to the backplane. The shaded areas show where packets can queue in the switch when transferring from module 1 to module 2
C.4 A simplified flow diagram showing the operation of the Turboswitch


Chapter 1
Introduction

1.1 Physics background

Experiments with the Large Electron-Positron collider (LEP) have shown us that new physics, and answers to some of the most profound questions of our time, lie at energies around 1 TeV. The Large Hadron Collider (LHC) is an accelerator which brings protons or ions into head-on collisions at higher energies than ever achieved before. LHC experiments are being designed to look for theoretically predicted phenomena; however, they must also be flexible enough to be prepared for new physics. The LHC will be built astride the Franco-Swiss border west of Geneva.

ATLAS is one of four experiments at the LHC. Its concept was first presented in 1994 and it is expected to be operational from 2005 for a period of at least 20 years. One of the main goals of ATLAS is to understand the mechanism of electroweak symmetry breaking (the search for one or more Higgs bosons) and to search for new physics beyond the Standard Model. In addition, precision measurements will be performed for Standard Model processes (e.g. the masses of the W boson and of the top quark, and the proton structure) and for new particles (properties of the Higgs boson(s), properties of supersymmetric particles).

In keeping with CERN's cost-effective strategy of building on previous investments, the LHC is designed to use the 27-kilometre LEP tunnel and to be fed by existing particle sources and pre-accelerators. The LHC is a remarkably versatile accelerator. It can collide proton beams with energies around 7-on-7 TeV at beam crossing points of unsurpassed brightness, providing the experiments with high interaction rates. It can also collide beams of heavy ions such as lead with a total collision energy in excess of 1,250 TeV. Joint LHC/LEP operation, although originally envisaged, has since been dropped.

1.2 The ATLAS Trigger/DAQ system

The ATLAS detector (a schematic of which is shown in Figure 1.1) is expected to produce images of 1 to 2 MByte at a frequency of 40 MHz, thus a rate of 40 to 80 TeraBytes/s. However, not all collisions produce interesting physics and warrant further analysis. The trigger's task is to select the most interesting collisions, or events, for further analysis, but no more than the amount that can be transferred to permanent storage. The ATLAS detector's trigger and data acquisition system (Trigger/DAQ) has been organised

into three levels, as shown in Figure 1.2.

Figure 1.1: A schematic of the ATLAS detector.

Level-1 (LVL1) consists of purpose-built hardware. It acts on reduced-granularity data from a subset of the detectors. The beams of particles cross each other every 25 ns, i.e. at a frequency of 40 MHz. The LVL1 trigger identifies events containing interesting information. Information on these events, including the number of signatures, their type and their position in the detector, is gathered to form regions of interest (RoIs). The RoIs are passed to the next level at a reduced rate of 75 kHz (the system is being designed to support a maximum reduced rate of 100 kHz). As illustrated in Figure 1.2, LVL1 acts on the muon and calorimeter information but not on the inner-tracking information; the initial rate can be reduced adequately without it. The decision latency for the LVL1 trigger is 2 µs. During this time, all the detector data are stored in pipelined memories. If the event is accepted, all the data are transferred to the readout buffers (ROBs), where the data are stored during level-2 processing.

As mentioned above, the level-2 (LVL2) trigger receives images identified as interesting by LVL1 at a frequency of 75 kHz (maximum of 100 kHz). Further analysis of the collisions at LVL2 reduces the event frequency to 1 kHz for the next level. The analysis uses full-granularity, full-precision data from the inner-tracking, calorimeter and muon detectors.
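For orientation, the rates and data volumes quoted in this section can be collected in a short worked form. The numbers are those given in the text (the event filter figure is quoted later in this section); the arithmetic is only a cross-check.

```latex
% Raw data rate off the detector, as quoted above
R_{\mathrm{raw}} \approx 40\,\mathrm{MHz} \times (1\text{--}2)\,\mathrm{MByte}
                 \approx 40\text{--}80\,\mathrm{TByte/s}

% Successive rate reductions by the trigger levels (LVL1, LVL2, event filter)
40\,\mathrm{MHz} \xrightarrow{\mathrm{LVL1}} 75\text{--}100\,\mathrm{kHz}
                 \xrightarrow{\mathrm{LVL2}} \sim 1\,\mathrm{kHz}
                 \xrightarrow{\mathrm{EF}} \sim 100\,\mathrm{Hz}
```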

Figure 1.2: The three levels of the ATLAS trigger/DAQ. (Interaction rate ~1 GHz; bunch crossing rate 40 MHz; LVL1 output 75/100 kHz with ~2 µs latency; LVL2 output ~1 kHz with ~1-10 ms latency; event builder throughput ~1-10 GBytes/s; event filter latency ~1 s; data recording of order 100 MBytes/s.)

LVL2 uses data from the regions of the sub-detectors which, according to the RoI information, are expected to contain interesting data.

The level-3 (LVL3) trigger, also known as the Event Filter (EF), makes the final decision on whether to reject the event or store it for off-line analysis. Accepted events from LVL2 are forwarded to the LVL3 processors via the event builder. At this point, a full reconstruction of the event is possible, with a decision time of up to 1 s. The storage rate is up to 100 Hz, giving a throughput of up to 100 MBytes/s to tape. The full event data are used at this level.

1.3 The level-2 trigger

The proposed architecture chosen for study [2] is shown in Figure 1.3.

Figure 1.3: The proposed LVL2 architecture. (The RoI builder receives detector data from the level-1 trigger; the supervisor, the readout buffers ROB 1 to ROB n and the processors PROC 1 to PROC n are connected through the farm network. ROB = readout buffer, PROC = processor.)

An RoI builder receives RoI information fragments from the LVL1 processors. These RoI fragments are organised and formatted into a record for each event. The RoI builder then transfers the record to a selected supervisor processor. The supervisor processor allocates the event to a LVL2 processor and forwards the RoI record to it. The processor collects the event fragments from the ROBs, processes them and sends the decision to the supervisor. The supervisor receives the decision and decides whether to discard, process further or accept the event. The supervisor updates the trigger statistics and multicasts the decision to the ROBs.
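The message flow just described can be summarised in a small sketch. This is a minimal illustration in C under simplified assumptions: the data structures, names, number of ROBs per event and message contents are all illustrative and are not those of the ATLAS software, and the selection algorithm is a stub.

```c
/* A minimal, self-contained sketch of the LVL2 message flow described above
 * (RoI builder -> supervisor -> processor -> ROBs -> decision).
 * All names and sizes are illustrative, not taken from the ATLAS software. */
#include <stdio.h>

#define N_ROBS_PER_EVENT 4   /* illustrative; the text quotes ~75 on average */

typedef struct {
    unsigned event_id;
    int      rob_ids[N_ROBS_PER_EVENT];  /* ROBs holding RoI data for this event */
} RoIRecord;

typedef enum { REJECT = 0, ACCEPT = 1 } Decision;

/* Processor side: request the RoI fragments from each ROB and run the
 * selection algorithm (here just a stub that accepts every event). */
static Decision process_event(const RoIRecord *roi)
{
    for (int i = 0; i < N_ROBS_PER_EVENT; i++)
        printf("  processor: request fragment of event %u from ROB %d\n",
               roi->event_id, roi->rob_ids[i]);
    return ACCEPT;
}

/* Supervisor side: allocate the event to a processor, collect the decision
 * and multicast it back to the ROBs so they can clear or forward the data. */
static void supervise(const RoIRecord *roi)
{
    printf("supervisor: assign event %u to a LVL2 processor\n", roi->event_id);
    Decision d = process_event(roi);
    printf("supervisor: multicast decision %s for event %u to the ROBs\n",
           d == ACCEPT ? "ACCEPT" : "REJECT", roi->event_id);
}

int main(void)
{
    /* RoI builder: format the LVL1 fragments into one record per event. */
    RoIRecord roi = { .event_id = 1, .rob_ids = { 3, 17, 42, 101 } };
    supervise(&roi);
    return 0;
}
```

The sketch only shows the request-response pattern; in the real system these exchanges are messages over the farm network rather than function calls.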

The LVL2 trigger is estimated to require a processing power of MIPS [2]. The event filter has similar processing requirements. Efforts are made to include as much flexibility as possible in the trigger design to allow for upgrades and to cope with unpredicted demands.

An intense study of the LVL2 system has been undertaken [1][2]. A prerequisite was to build the LVL2 system from commodity off-the-shelf products. Thus a network of workstations (NOWs) approach was proposed to provide the processing requirements. The advantages of this approach are:

- no development costs and periods;
- inexpensive due to competing vendors;
- easy to obtain;
- widely supported;
- continuity and a long lifetime due to the installed base in industry.

1.4 Thesis Aim

This thesis deals specifically with the ATLAS LVL2 trigger. The focus is illustrated by the shaded region of Figure 1.3. We assess the suitability of the Ethernet technology as a solution for the ATLAS LVL2 trigger network. Therefore our concern is with the nodes and protocols connecting to the network, the network interface cards and the network itself.

1.5 Thesis Outline

Following this introduction, Chapter 2 summarises the requirements of the ATLAS detector and more specifically the LVL2 trigger system. These requirements are summarised in the current ATLAS HLT, DAQ and DCS Technical Proposal [1]. Chapter 3 is a brief look at the Ethernet technology and standards and the reasons why it is being considered for the ATLAS trigger/DAQ network. Chapter 4 is an examination of the host performance using Ethernet network interface cards (NICs) and various protocols. In Chapter 5, we examine the architecture of Ethernet switches and what would be the ideal configuration for a high performance parallel application like the ATLAS LVL2. In Chapter 6 we develop and analyse a flexible and cost effective tool to characterise the performance of Ethernet switches. In Chapter 7 we present the measurements performed with the

tool and the analysis that led to the development of the parameterised model of Ethernet switches. We describe the model parameters, the measurements required to obtain the parameters and other measurements to allow a more complete characterisation of contemporary Ethernet switches. In Chapter 8 we present a validation of the parameterised model and give the parameters which allow contemporary Ethernet switches from various manufacturers to be simulated and compared. Chapter 9 contains a summary of the conclusions presented throughout the thesis and a look at the future.

1.6 Context

Funding for this project was awarded through a Co-operative Awards in Science and Engineering (CASE) studentship from the Particle Physics and Astronomy Research Council (PPARC) in collaboration with SGS-Thomson Microelectronics. Partial funding also came from the EU project SWIFT. The actual work was carried out mostly at CERN and partly at SGS-Thomson. CERN's policy on industrial collaboration encouraged our involvement in the EU projects Macrame, SWIFT and ARCHES.

The work has been useful to CERN in understanding Ethernet switches and networks and in allowing models to be built for analysis of the ATLAS trigger/DAQ network. It has also been useful to our industrial collaborator, which supplied its switches, in helping prove the performance of its product by a third party. Some of the SWIFT project's objectives have been met by the work presented here. The work presented here has also paved the way for another project building an Ethernet protocol analyser and performance tester. The ideas presented here are being used and the bottlenecks revealed here are being overcome by other novel techniques.

1.7 Contribution

The author's original contributions are Chapters 4, 5, 6, and 8. Chapter 7 is a collaborative effort where the author provided the necessary information to allow the models to be constructed; thus the results from the modelling are not completely the author's work. Chapter 9 contains the conclusions from this work. The contributions made to the ATLAS project have been:

- Setting up the testbed for the ATLAS LVL2 framework software.
- Assessment of the Ethernet technology specifically for the ATLAS LVL2 trigger/DAQ network.
- Defining a methodology and writing the software for assessing the performance of an Ethernet switch with the ATLAS trigger/DAQ in mind.
- Assessment of protocol and NIC issues affecting network performance, in order to achieve the best performance for the ATLAS trigger/DAQ.
- Providing analysis of current Ethernet switch architectures to aid modelling of the ATLAS trigger/DAQ network.
- Providing input (network and host performance) for the modelling of the ATLAS trigger/DAQ network (architectures and methodology).
- Collaborating successfully with members of the ATLAS community and industrial partners.

The issues highlighted in this thesis will have to be further addressed by the ATLAS trigger/DAQ community. The next major milestone is the submission of the Technical Design Report, scheduled for June 2002.

Chapter 2
Requirements for the ATLAS second level trigger

2.1 General Requirements

The challenges of constructing an experiment like ATLAS are huge and complex, requiring a multidisciplinary effort. Given the time scale of ATLAS, many issues are still incomplete or uncertain in their detail. The aim of this section is to present the parts of the LVL2 requirements influencing the problems dealt with in this thesis.

At the start of the work presented here, a study by the ATLAS community called the Demonstrator Program [2] was nearing its end. Its results which directly influenced this work are:

- Increased confidence that affordable commercial networks would be able to handle the traffic in a single network - a total of several GBytes/s among about 1000 ports.
- Standard commercial processors (especially PCs) were favoured for the LVL2 processing, rather than VME-based systems, since they offer a better price/performance ratio.
- Sequential processing steps and sequential selection offer advantages such as reduced network bandwidth and processor load.
- Control messages should pass via the same network as the data.
- The LVL2 Supervisor should pass the full event control for each event to a single processor in the farm.

The findings of the Demonstrator Program were used in the next stage of the ATLAS programme, the so-called Pilot Project [1], which took place in the period from early 1998 to mid-2000. The principal aims were to produce a validated LVL2 architecture and to investigate likely technologies for its implementation. The work for the Pilot Project was divided into three main areas: functional components, testbeds and system design. The functional components covered optimised components for the supervisor, the ROB complex, networks and processors. Testbeds covered the development of the Reference or framework software, a prototype implementation of the complete LVL2 process, and the construction and use of moderately large application testbeds to use this software. Finally, system design covered modelling activities and an integration activity to consider the issues related to how the LVL2 system integrates with other subsystems and the requirements it has to meet.

The work presented in this thesis touches on all three areas, specifically the LVL2 nodes and network. The findings influenced the testbed setup and the modelling. There is an aim for some

degree of commonality within the detector. Common software and hardware components are encouraged, to guarantee maximum uniformity throughout ATLAS, and are adapted where necessary to the particular detector requirements.

The LVL2 trigger/DAQ application is required to run at a 75 kHz image processing rate, but be scalable to 100 kHz. The following is a list of the indicative performance requirements identified in the paper model [4]; a numerical cross-check of these figures is sketched below.

- At 100 kHz with an image size of 1-2 MBytes, the network throughput would be up to 200 GBytes/s. The use of the RoI guidance means that around 5% of the image will be analysed by the LVL2 processors. This brings the average network capacity to 5 to 10 GBytes/s. This will be mostly in the direction from the ROBs to the processors, due to the request-response nature of the traffic patterns.
- On average, an event is spread over 75 buffers. Each of these buffers holds on average 660 to 1320 bytes of the event. This gives an event size of around 50 to 100 kBytes.
- The total number of ROBs is around 1700, therefore the average ROB throughput will be of the order of 4.4 kHz × (660 to 1320) bytes, i.e. 2.9 to 5.9 MBytes/s. With 1700 ROBs and 75 ROBs/event, the rate per ROB must be 100 kHz × 75/1700, i.e. about 4.4 kHz. The maximum ROB rate is 12 kHz [4], corresponding to a maximum throughput of 7.9 to 15.8 MBytes/s.
- All processors must be able to access all ROBs and vice versa. Each ROB has the same probability of being accessed, therefore we want a uniform bandwidth across the network.
- There are a minimum of 550 processors [46] in the LVL2 network. The maximum rate per processor is therefore 100 kHz/550, i.e. about 0.18 kHz. This corresponds to a throughput of 9.0 to 18.0 MBytes/s.
- The LVL2 accept rate is 1 to 2 kHz. This implies a rate of 1 to 4 GBytes/s to LVL3. This means that the peak network throughput will be about 10 + 4 = 14 GBytes/s.

The trigger/DAQ uses a sequential selection strategy. Performing the LVL2 processing of the inner detector after the initial confirmation of the LVL1 trigger reduces the average latency compared to processing in parallel, even though the latency for some events increases. Furthermore, more complex algorithms, which can only be run at lower rates and require a sequential strategy, can be used for some types of events in LVL2. Some of these algorithms use RoIs coming from LVL2 processing.
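As referenced above, the following small C program re-derives the paper-model figures from the stated inputs. It is purely a numerical cross-check; the constants are the ones quoted in the list, and nothing else is assumed.

```c
/* A quick numerical check of the paper-model figures quoted above.
 * The inputs are the numbers given in the text; everything else follows
 * from simple arithmetic. */
#include <stdio.h>

int main(void)
{
    double lvl1_rate_hz   = 100e3;   /* maximum LVL1 accept rate            */
    double n_robs         = 1700.0;  /* total number of ROBs                */
    double robs_per_event = 75.0;    /* buffers touched per event (average) */
    double bytes_per_rob_lo = 660.0, bytes_per_rob_hi = 1320.0;
    double n_processors   = 550.0;   /* minimum size of the processor farm  */

    double rob_rate = lvl1_rate_hz * robs_per_event / n_robs;      /* ~4.4 kHz  */
    printf("average request rate per ROB : %.1f kHz\n", rob_rate / 1e3);
    printf("average ROB throughput       : %.1f to %.1f MBytes/s\n",
           rob_rate * bytes_per_rob_lo / 1e6, rob_rate * bytes_per_rob_hi / 1e6);

    double proc_rate = lvl1_rate_hz / n_processors;                /* ~0.18 kHz */
    double ev_lo = robs_per_event * bytes_per_rob_lo;              /* ~50 kB    */
    double ev_hi = robs_per_event * bytes_per_rob_hi;              /* ~100 kB   */
    printf("maximum rate per processor   : %.2f kHz\n", proc_rate / 1e3);
    printf("throughput per processor     : %.1f to %.1f MBytes/s\n",
           proc_rate * ev_lo / 1e6, proc_rate * ev_hi / 1e6);

    /* Peak network throughput: ~10 GBytes/s of RoI traffic plus up to
     * ~4 GBytes/s of accepted events towards the event builder.             */
    printf("peak network throughput      : ~%.0f GBytes/s\n", 10.0 + 4.0);
    return 0;
}
```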

The ROBs receive the accepted events of the LVL1 trigger from the front-end electronics. The ROBs are used as the data sources for the LVL2 processors and the event builder. The basic operation of a ROB is as follows:

- Data are received into the ROB from the detectors across readout links with a bandwidth of up to 160 MBytes/s and at an average rate of up to 100 kHz.
- Selected data are requested from the buffers by the LVL2 system at a maximum rate of about 14 kHz for any given buffer.
- Final LVL2 decisions are passed back to the ROB so that the memory occupied by rejected events can be cleared. To reduce message handling overheads, it is more efficient to pass the decisions back in groups of 20 or more. The use of multicast and broadcast in this case may also reduce the message handling overheads (a small sketch of such batching is given below).
- Data for accepted events are passed downstream for processing by the event filter.

The ATLAS LVL2 trigger is shown in Figure 2.1. The network for the ATLAS trigger/DAQ is required to be scalable, fault tolerant, upgradable and cost effective, and to have a long lifetime in terms of usability and supportability. The architecture should aim to use commodity items (processors, operating system (OS) and network hardware) wherever possible.

Scalability: Future requirements of the trigger may evolve to require more processors/computing power, more ROBs or simply more network throughput. The network must be scalable to provide these requirements.

Fault tolerance: This is an important issue for the ATLAS trigger. Faulty links and switches should be detected and the traffic rerouted until they have been repaired. Ideally this should be automatic and built into the network.

Reliability: Packets should not be lost. Contention must be dealt with in a manner which avoids packet loss. Unicast, broadcast and multicast are all very important to the performance of the LVL2 trigger.

Latency: The characteristics of the trigger latency need to be known and understood in order to choose more effectively the size of the buffers in the system.

Throughput: The required throughput must be supported.
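The decision-batching idea mentioned in the ROB list above can be illustrated with a minimal sketch. The group size, message layout and names are illustrative assumptions, not the ATLAS design; the point is only that many decisions travel in one message.

```c
/* A minimal sketch of batching LVL2 decisions before sending them back to
 * the ROBs, as suggested above. Group size and message layout are
 * illustrative assumptions, not the ATLAS design. */
#include <stdio.h>

#define GROUP_SIZE 20            /* pass decisions back in groups of >= 20 */

typedef struct {
    unsigned event_id;
    int      accept;             /* 1 = accept, 0 = reject                 */
} Decision;

static Decision pending[GROUP_SIZE];
static int      n_pending = 0;

/* Stand-in for a multicast of one message carrying many decisions. */
static void multicast_to_robs(const Decision *batch, int n)
{
    (void)batch;
    printf("multicast: 1 message, %d decisions\n", n);
}

/* Queue a decision; only send to the ROBs once a full group is ready. */
static void queue_decision(unsigned event_id, int accept)
{
    pending[n_pending++] = (Decision){ event_id, accept };
    if (n_pending == GROUP_SIZE) {
        multicast_to_robs(pending, n_pending);
        n_pending = 0;
    }
}

int main(void)
{
    for (unsigned ev = 0; ev < 100; ev++)
        queue_decision(ev, ev % 3 != 0);   /* 100 decisions -> 5 messages */
    return 0;
}
```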

Figure 2.1: The setup of the ATLAS LVL2 trigger network. (About 1700 buffers, each holding part of a distributed 1 MByte image, are connected through a single large switch to about 550 processors which analyse the data from the buffers.)


Chapter 3
A Review of the Ethernet technology

3.1 Introduction

There are a number of network technologies being looked at as possible solutions for the ATLAS LVL2 trigger network. At the start of this project, the three main technologies were SCI, ATM and Ethernet. We focus on the Ethernet technology. In this chapter, we review the Ethernet technology and standards.

3.2 History of Ethernet

Ethernet is a medium-independent local area network (LAN) technology. Its development started in 1972 at Xerox PARC by a team led by Dr. Robert M. Metcalfe. It was designed to support research on the office of the future. The Ethernet technology was based on a packet radio technology called Aloha, developed at the University of Hawaii. Originally called the Alto Aloha net, it was used to link Xerox Altos (one of the world's first personal workstations with a graphical interface) to one another, to servers and to printers. It ran at 2.94 Mbit/s. In May 1973, the word Ethernet was used in a memo to describe the project; this date is known as the birthday of Ethernet. The change in the name was meant to clarify that the system could run over various media, support any computer and also to highlight the significant improvements over the Aloha system.

Formal specifications for Ethernet were published in 1980 by a DEC, Intel and Xerox consortium that created the DIX standard. In 1985, Ethernet became an IEEE (Institute of Electrical and Electronics Engineers) standard known as IEEE 802.3. All Ethernet equipment since 1985 has been built according to the IEEE standard. Developments in technology have led to periodic updates of the IEEE standards. In the 1990s, the boom in data networking, the increase in popularity of the Internet and new applications requiring higher throughput led to the development of the 100 Mbit/s Fast Ethernet and the 1000 Mbit/s Gigabit Ethernet standards. Table 3.1 shows the three flavours of Ethernet currently in use today and the variety of media on which they can run.

The still increasing demand for bandwidth is leading to a new 10 Gbit/s 802.3ae standard developed by the 10 Gigabit Ethernet Alliance. The alliance was founded by the networking industry leaders (3Com, Cisco Systems, Extreme Networks, Intel, Nortel Networks, Sun Microsystems, and World Wide Packets) to develop the standard and to promote interoperability among 10 Gigabit Ethernet products.

Medium             Gigabit Ethernet   Fast Ethernet   10 Mbit/s Ethernet
Rate               1000 Mbit/s        100 Mbit/s      10 Mbit/s
CAT 5 UTP          100 m (min)        100 m           100 m
Coaxial cable      25 m               100 m           500 m
Multimode fibre                       412 m           2 km
Single mode fibre  3-5 km             20 km           25 km

Table 3.1: Network diameter or maximum distances for three flavours of Ethernet on various media.

The expected date for the release of the standard is 2002. The history of the Ethernet development is summarised in Figure 3.1.

Figure 3.1: The history of the Ethernet technology.

In order to distinguish between the different Ethernet technologies, in what follows we refer to 10 Mbit/s Ethernet as traditional Ethernet, 100 Mbit/s as Fast Ethernet and 1000 Mbit/s Ethernet as Gigabit Ethernet. The word Ethernet is used as a generic name for the above, applied to all the technologies with Ethernet as part of the name.

3.3 The Ethernet technology

Originally, all nodes attached to a traditional Ethernet were connected to a shared medium, as shown in Figure 3.2. Neither Ethernet nor the pure Aloha technology requires central switching. Transmitted data are readable by everyone. All nodes must be continuously listening on the medium and checking each packet to see if its destination corresponds to the node's address. Thus the intelligence is in the end nodes and not in the network.

Ethernet provides what is known as best-effort data delivery: there is no guarantee of reliable data delivery. This approach keeps the complexity and costs down. The Physical Layer is carefully

engineered to produce a system that normally delivers data very well. However, errors are still possible. It is up to the high-level protocol that is sending data over the network to make sure that the data are correctly received at the destination computer. High-level network protocols can do this by establishing a reliable data transport service, using sequence numbers and acknowledgement mechanisms in the packets that they send over Ethernet.

Figure 3.2: An illustration of a segment or collision domain.

3.3.1 Relation to OSI reference model

The International Telecommunications Union (ITU) and International Standards Organisation (ISO) Open Systems Interconnect (OSI) 7 layer reference model is a reference by which protocol standards can be taught and developed. Its seven layers are:

- Physical (Layer 1): This is the interface between the physical medium (fibre, cable) and the network device. It defines the transmission of data across the physical medium.
- Data Link (Layer 2): This layer is responsible for access to the Physical Layer and for error detection, correction and retransmission.
- Network (Layer 3): This layer provides routing of packets across the network. It is independent of the network technology used.
- Transport (Layer 4): This layer provides reliable transfer of data between end-points. It defines a connection oriented or connectionless connection. It hides the lower layer complexities from the upper layers.
- Session (Layer 5): This layer establishes and maintains a session or connection. It provides the control of communications between the application layers.
- Presentation (Layer 6): This layer ensures that the coding systems used by the applications are the same. It encodes and decodes binary data for transport and deals with the correct

formatting of data.
- Application (Layer 7): This layer is the program used to communicate.

Figure 3.3 shows how Ethernet relates to the OSI 7 layer reference model. The Data Link Layer is divided into two parts: the Media Access Control (MAC) layer and an optional MAC control layer.

Figure 3.3: Ethernet and how it fits into the OSI 7 layer model.

Computers attached to an Ethernet can send application data to one another using high-level protocol software such as NetBIOS, Novell's IPX, AppleTalk or the TCP/IP protocol suite used on the worldwide Internet. Ethernet and the higher level protocols are independent entities that cooperate to deliver data between computers.

3.3.2 Frame format

Figure 3.4 illustrates the Ethernet frame format. The first seven octets are known as the preamble. It is sent to initiate the transfer and also to inform other nodes on the shared medium that the medium or link is busy. Its value in hexadecimal is 55:55:55:55:55:55:55. Following the seven octets, a one-octet start of frame delimiter (SFD) is sent to announce the start of the frame. The value of the SFD in hexadecimal is a5. After the start of the frame, there is the destination address followed by the source address. The source and destination address fields are both six octets long. The type field of two octets is next. This signifies the type of frame (or higher layer protocol packet) being sent, or in some cases (where the value is less than 1500 decimal), the length of the frame. After the type field, there is a data field. This can be between 46 and 1500 octets. Data shorter than the minimum of 46 octets are padded with zeros. The higher layer protocol packets are carried in this

field of Ethernet frames. Finally, at the end of the frame, there is the frame check sequence field. This is a four-octet field providing a check of the integrity of the frame. There is also a minimum inter-frame gap, which corresponds to 12 octets. This gives a total length of 84 octets for the minimum frame and 1538 octets for the maximum.

Figure 3.4: The format of the original Ethernet frame. (Preamble 7 octets, start of frame delimiter 1 octet, destination address 6 octets, source address 6 octets, length/type 2 octets, data 46-1500 octets, frame check sequence 4 octets, inter-frame gap 12 octets.)

Figure 3.5: The format of the new Ethernet frame with support for VLANs and eight priority levels. (As Figure 3.4, but with the type field set to 0x8100 followed by a 2-octet Tag Control Information (TCI) field carrying a 3-bit user priority, a 1-bit Canonical Format Indicator (CFI) and a 12-bit VLAN identifier.)

Figure 3.5 shows the new Ethernet frame format. This is the same as the original Ethernet frame format of Figure 3.4, with the exception of a reduced minimum data size and an extra four octets composed of a two-octet Priority/VLAN field and a two-octet type field. The type field must be set to 8100 hexadecimal to signify this new format. The format has a 12-bit VLAN identifier (VID) field and a three-bit priority field. There is a one-bit field called the Canonical Format Indicator or CFI. It indicates whether MAC addresses present in the frame data field are in canonical format or not. In canonical format, the least significant bit of each octet of the standard hexadecimal representation of the address represents the least significant bit of the corresponding octet of the canonical format of the address. In non-canonical format, the most significant bit of each octet of the standard hexadecimal representation represents the least significant bit of the corresponding octet of the canonical format of the address. This is used to indicate, for instance, Token Ring encapsulation. The minimum frame length including the inter-packet gap stays at 84 bytes and the maximum increases to 1542 bytes.
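As a cross-check of the octet counts just given, the small C sketch below computes the minimum and maximum on-the-wire frame sizes, with and without the VLAN tag. The field sizes are the ones listed above; the reduction of the minimum payload to 42 octets for tagged frames follows from the statement that the minimum length stays at 84 octets.

```c
/* On-the-wire size of an Ethernet frame, following the field sizes given
 * above. This is a sketch for illustration; it only does the arithmetic. */
#include <stdio.h>

/* Fixed overheads in octets */
#define PREAMBLE 7
#define SFD      1
#define ADDRS   12   /* destination + source address            */
#define TYPE     2
#define FCS      4
#define IFG     12   /* minimum inter-frame gap                  */
#define VLAN_TAG 4   /* optional 802.1Q tag (type + TCI)         */

static int wire_size(int payload, int tagged)
{
    int min_payload = tagged ? 42 : 46;  /* the tag eats 4 octets of the minimum */
    if (payload < min_payload)
        payload = min_payload;           /* short frames are padded with zeros   */
    return PREAMBLE + SFD + ADDRS + (tagged ? VLAN_TAG : 0) + TYPE
           + payload + FCS + IFG;
}

int main(void)
{
    printf("untagged: min %d, max %d octets\n", wire_size(0, 0), wire_size(1500, 0));
    printf("tagged  : min %d, max %d octets\n", wire_size(0, 1), wire_size(1500, 1));
    return 0;   /* prints 84/1538 and 84/1542, matching the text */
}
```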

As each Ethernet frame is sent onto the shared medium, all Ethernet interfaces look at the 6-octet destination address. The interfaces compare the destination address of the frame with their own address. The Ethernet interface with the same address as the destination address in the frame will read in the entire frame and deliver it to the higher layer protocols. All other network interfaces will stop reading the frame when they discover that the destination address does not match their own address.

3.3.3 Broadcast and multicast

A multicast address allows a single Ethernet frame to be received by a group of nodes. Ethernet NICs can be set to respond to one or more multicast addresses. A node assigned a multicast address is said to have joined the multicast group corresponding to that address. A single packet sent to the multicast address assigned to that group will then be received by all nodes in that group. A multicast address has the first transmitted bit of the address field set to 1, and has the form x1:xx:xx:xx:xx:xx. The broadcast address, which is the 48-bit address of all ones (i.e. ff:ff:ff:ff:ff:ff in hexadecimal), is a special case of the multicast address. Setup of the NIC is not necessary for the broadcast: Ethernet interfaces that see a frame with this destination address will read the frame in and deliver it to the networking software on the computer. A multicast is targeted at a specific group of nodes, whereas a broadcast is targeted at every node.
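The address handling described above can be illustrated with a short C sketch. It assumes a simple 6-octet array representation of MAC addresses; a real NIC would also match the destination against its configured list of multicast addresses rather than accepting every group address.

```c
/* A small sketch of the address checks an Ethernet interface performs on
 * every received frame, as described above. Addresses are 6 octets. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static int is_multicast(const uint8_t mac[6])
{
    /* The first bit transmitted is the least significant bit of the first
     * octet; if it is 1 the address is a group (multicast) address.       */
    return mac[0] & 0x01;
}

static int is_broadcast(const uint8_t mac[6])
{
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
    return memcmp(mac, bcast, 6) == 0;   /* broadcast = all ones */
}

static int accept_frame(const uint8_t dst[6], const uint8_t own[6])
{
    /* Accept frames addressed to us, to the broadcast address, or to a
     * multicast group (a real NIC would check its multicast list here).   */
    return memcmp(dst, own, 6) == 0 || is_broadcast(dst) || is_multicast(dst);
}

int main(void)
{
    uint8_t own[6] = { 0x00, 0x10, 0x5a, 0x01, 0x02, 0x03 };  /* made-up address */
    uint8_t dst[6] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x01 };  /* a group address */
    printf("accept: %d (multicast: %d)\n", accept_frame(dst, own), is_multicast(dst));
    return 0;
}
```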

3.3.4 The CSMA/CD protocol

Nodes connected on a traditional Ethernet are connected on a shared medium. This is also known as a segment or a collision domain. Signals are transmitted serially, one bit at a time, and reach every attached node. In the pure Aloha protocol, anyone can transmit at any time: a node wanting to transmit simply does so. If another node is currently transmitting, a collision occurs. A collision is detected when a sender does not receive the signal that it sent out. If a collision is detected, the sender waits a random time, known as the backoff, before retransmitting. This leads to poor efficiency at heavy loads.

Ethernet improved on this by using the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol. To send data, a node first listens to the channel to determine if anyone is transmitting (carrier sense). When the channel is idle, any node may transmit (multiple access). A node transmits its data in the form of an Ethernet frame, or packet. If a collision is detected by the transmitting nodes (collision detection), they stop transmitting and wait a random time (backoff) before retransmitting. After each frame transmission, all nodes on the network wishing to transmit must contend equally to transmit the next packet. This ensures fair access to the network and that no single node can lock out another. Access to the shared medium is determined by the medium access control (MAC) embedded in the Ethernet network interface card (NIC) located in each node.

The backoff time increases exponentially after each collision. After 16 consecutive collisions for a given transmission attempt, the interface finally discards the Ethernet packet. This can happen if the Ethernet link is overloaded for a fairly long period of time, or is broken in some way.
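A sketch of this backoff behaviour is given below. The exponential growth and the 16-collision limit are stated above; the capping of the exponent at 10 follows the IEEE 802.3 truncated binary exponential backoff rules and is an assumption here, not something quoted in the text.

```c
/* A sketch of truncated binary exponential backoff. The text says the
 * backoff grows exponentially and the frame is dropped after 16
 * consecutive collisions; the cap of the exponent at 10 is taken from the
 * IEEE 802.3 rules and is an assumption in this sketch. */
#include <stdio.h>
#include <stdlib.h>

/* Return the number of slot times to wait after the n-th collision
 * (n = 1, 2, ...), or -1 if the frame should be discarded.          */
static int backoff_slots(int n)
{
    if (n > 16)
        return -1;                     /* give up: too many collisions */
    int k = n < 10 ? n : 10;           /* exponent is capped            */
    int range = 1 << k;                /* 2^k possible slot counts      */
    return rand() % range;             /* uniform in [0, 2^k - 1]       */
}

int main(void)
{
    for (int n = 1; n <= 17; n++) {
        int slots = backoff_slots(n);
        if (slots < 0)
            printf("collision %2d -> frame discarded\n", n);
        else
            printf("collision %2d -> wait %d slot times\n", n, slots);
    }
    return 0;
}
```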

Table 3.1 shows the network diameter or maximum distance over the various media. These distances are due to the round trip times of the minimum packet size. The round trip time is the time it takes for a signal to get from one end of the link to the other and back. If there are two nodes, A and B, at either end of the link, the worst case is that one node, B for example, starts to transmit just as the transmission signal from the other node (in this case node A) reaches it. This will cause a collision. In order for node A to detect the collision, it must still be transmitting when the signal from B gets back to it; otherwise the frame is assumed by A to have been correctly sent out. This criterion sets the maximum segment length for each medium in CSMA/CD mode.

3.3.5 Full and Half duplex

Half-duplex mode Ethernet is another name for the original Ethernet mode of operation, which uses the CSMA/CD media access protocol. Full-duplex Ethernet is based on switches and does not use CSMA/CD. In full-duplex mode, data can be received at the same time that data are sent. Since there is no way of detecting collisions in this mode, full-duplex operation requires that only a single node is connected to each collision domain. Thus full-duplex Ethernet links do not depend on the signal round trip times, but only on the attenuation of the signal in the medium.

3.3.6 Flow control

The IEEE 802.3x full duplex flow control mechanism works by sending what is known as a pause packet, as shown in Figure 3.6. The pause packet is a MAC control frame; that means it is restricted to the MAC level and is not passed up to the higher layers. The destination address field of the pause packet is set to the multicast address 01:80:C2:00:00:01, thus all NICs must be able to receive packets with this destination address. The type field of two octets is set to 8808 hexadecimal. The MAC opcode field, which comes after the type field, is set to 0001 hexadecimal. Following the opcode there is a two-octet control parameter. This contains an unsigned integer telling the receiving node how long to inhibit its transmission. The time is measured in pause quanta, where a quantum is 512 bit times. For Fast Ethernet this is 5.12 µs, and for Gigabit Ethernet 0.512 µs. After the control parameter, there are 42 octets transmitted as zeros to achieve the minimum Ethernet frame length. All other fields in the pause frame are set in the same way as in normal frames. Pause packets are only applicable to full duplex point-to-point links.

Figure 3.6: The format of the full duplex Ethernet pause frame.

There also exists a flow control technique known as backpressure for half duplex mode. Backpressure is asserted on a port by emitting a sequence of patterns of the form of the Ethernet frame preamble. This stops other nodes from sending frames. The disadvantage of backpressure is that, if enabled, all other nodes on the same segment cannot send frames between themselves or to other nodes on other segments.

3.3.7 Current transmission rates

The CSMA/CD medium access protocol and the format of the Ethernet frame are identical for all Ethernet media varieties, no matter at what speed they operate. However, the individual 10 Mbit/s and 100 Mbit/s media varieties each use different components, as indicated in Figure 3.3. The operation of 10 Mbit/s Ethernet is described in the IEEE 802.3 standard; at this speed, one bit time is 100 ns. The Fast Ethernet standard IEEE 802.3u is the standard for operating at a line speed of 100 Mbit/s; one bit time is 10 ns. The Gigabit Ethernet standard IEEE 802.3z supports operation at 1000 Mbit/s data rates; one bit time is 1 ns. Most deployed Gigabit Ethernet systems are running in full duplex mode. Some switch manufacturers do not even implement the half duplex option on their switches.
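Combining the bit times just quoted with the pause quantum of 512 bit times from the flow control subsection gives the pause durations directly; the short C sketch below does the conversion. The upper bound assumes only that the control parameter is the two-octet (16-bit) unsigned field described above.

```c
/* Converting the pause time carried in a pause frame into seconds:
 * one pause quantum is 512 bit times, and a bit time is the inverse of the
 * link rate. A sketch only; the printed values match the figures above. */
#include <stdio.h>

static double pause_seconds(unsigned quanta, double link_bit_rate)
{
    return quanta * 512.0 / link_bit_rate;
}

int main(void)
{
    /* One quantum: 5.12 us on Fast Ethernet, 0.512 us on Gigabit Ethernet */
    printf("Fast Ethernet   : %.2f us per quantum\n", pause_seconds(1, 100e6) * 1e6);
    printf("Gigabit Ethernet: %.3f us per quantum\n", pause_seconds(1, 1e9) * 1e6);

    /* The two-octet control parameter allows pauses of up to 65535 quanta */
    printf("max pause on GE : %.1f ms\n", pause_seconds(65535, 1e9) * 1e3);
    return 0;
}
```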

In Figure 3.3, various Physical Layer types are shown. Examples are 10BASE-T, 100BASE-TX and 1000BASE-SX. The first part of the notation implies the rate of the link. The BASE implies baseband, meaning only one signal on the link at once (time division multiplexing), as opposed to broadband, where multiple signals are on the wire at once (frequency division multiplexing). The last part describes the medium type. For 10 Mbit/s, T and F stand for twisted-pair and fibre optic. There also exists 5 for thick coaxial cable, indicating a maximum segment length of 500 metres, and 2 for thin coax, indicating 185 metre (rounded up) maximum length segments. For Fast Ethernet, there exists TX, implying twisted-pair segments, and FX, implying a fibre optic segment type. The TX and FX medium standards are collectively known as 100BASE-X. There also exists the T4 segment type, which is a twisted-pair segment type that uses four pairs of telephone-grade twisted-pair wire. The twisted-pair segment type is the most widely used today for making network connections to the desktop. Gigabit Ethernet has two Physical Layer types: SX, implying the fibre optic medium, and the recently developed T, which implies twisted-pair.

The TX and FX media standards used in Fast Ethernet are both adopted from physical media standards originally developed by the American National Standards Institute for the Fibre Distributed Data Interface (FDDI) LAN standard (ANSI standard X3T9.5). The Gigabit Ethernet fibre Physical Layer signalling borrows from the ANSI Fibre Channel standard. The availability of these proven standards reduced development time and also helps to drive down the cost of the components.

3.4 Connecting multiple Ethernet segments

There are a number of Ethernet devices available to connect together multiple Ethernet segments. These are routers, repeaters, hubs, bridges and switches.

3.4.1 Routers

Routers are Layer 3 devices that enable switching from one Layer 2 technology to another. Packets are routed according to their Layer 3 information. In order to forward a packet, a router searches its forwarding database for the Layer 3 destination address and the output port. The router changes the destination MAC address of the packet to the MAC address of the next network equipment in line to the destination. This could be another router, a switch or the destination node. Routers offer firewalls and support multiple paths between nodes. They do not automatically forward broadcasts and thus help create separate broadcast domains and reduce performance problems caused by a large broadcast rate. This allows complex but stable networks to be designed.

Repeaters and hubs

Ethernet repeaters were developed to provide longer segments, or collision domains. A repeater is a half duplex, signal amplifying and re-timing device. Strategically placed in the network, it cleans and strengthens the signal attenuated by travelling through the physical medium. Repeaters blindly regenerate all data from one of their ports to another. There is no decoding to worry about, therefore repeaters are very fast. All nodes attached to the repeater are on the same collision domain. Star-wired 10BASE-T hubs with twisted pair wiring were introduced later. A hub is simply a multiport repeater, used to provide multiple connection points for nodes. Hubs operate logically as a shared bus, as shown in Figure 3.7: the connections are on the same collision domain even though the media segments may be physically connected in a star pattern.

Figure 3.7: An illustration of a hub.

The disadvantage of repeaters and hubs is that they are wasteful of bandwidth, since everything is copied to all ports except the incoming port. Repeaters and hubs are OSI Layer 1 devices.

Switches and bridges

Ethernet bridges have over time evolved into switches or switching hubs. Bridges and switches are an improvement over the original shared medium model because they add intelligence in the form of a filtering mechanism which ensures that packets are forwarded only to the segments for which they are destined. Switches can also operate in full duplex mode, and they can send and receive multiple packets simultaneously. The round trip timing rules for each LAN stop at the switch port. This means a large number of individual Ethernet LAN segments can be connected together. Switches may also allow the linking of segments running at different speeds: data can be sent from a node running at 10 Mbit/s across the switch to another running at 1000 Mbit/s.
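The forwarding decision just mentioned rests on a learned table mapping MAC addresses to ports, which is described in more detail in Section 3.5. As a rough illustration of the mechanism only (the table layout and linear search below are purely illustrative and not taken from any product), the logic can be sketched as:

/* Sketch of the learning and filtering idea behind a bridge or switch:
 * remember which port each source MAC address was seen on, then forward a
 * frame only if its destination lies on a different port, flooding when the
 * destination is still unknown. Sizes and representation are illustrative. */
#include <string.h>

#define TABLE_SIZE 1024
#define UNKNOWN    (-1)

struct fdb_entry { unsigned char mac[6]; int port; int valid; };
static struct fdb_entry fdb[TABLE_SIZE];

static int fdb_lookup(const unsigned char mac[6])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        if (fdb[i].valid && memcmp(fdb[i].mac, mac, 6) == 0)
            return fdb[i].port;
    return UNKNOWN;
}

static void fdb_learn(const unsigned char mac[6], int port)
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        if (!fdb[i].valid || memcmp(fdb[i].mac, mac, 6) == 0) {
            memcpy(fdb[i].mac, mac, 6);
            fdb[i].port = port;
            fdb[i].valid = 1;
            return;
        }
}

/* Returns the port learned for dst, or UNKNOWN to flood; the caller filters
 * the frame when the returned port equals the port it arrived on. */
static int bridge_decide(const unsigned char dst[6],
                         const unsigned char src[6], int in_port)
{
    fdb_learn(src, in_port);
    return fdb_lookup(dst);
}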

Compared to routers, switches tend to be less expensive, faster and simpler to operate. However, routers allow multiple paths to exist between nodes and allow the connection of different technologies. Compared to hubs, switches are inherently slower per packet because of the filtering process, but this filtering enables more of the network bandwidth to be used for transferring useful data. They also tend to cost up to five times more than a hub with the same number of ports. The distinction between switches and routers is slowly disappearing as vendors increase the functionality of their switches, and devices referred to as routing switches are appearing on the market.

3.5 The Ethernet switch Standards

This section describes how Ethernet switches work and the standards they conform to. All Ethernet switches must adhere to the IEEE 802.1D bridge standard. Vendors may implement additional features, some of which are IEEE standards and others which are not. We discuss the bridge standard and some of the other advanced features of Ethernet switches below.

The Bridge Standard

A bridge is a transparent device used to connect multiple Ethernet segments (see Figure 3.8). Transparent means that the connected nodes are unaware of its existence. A bridge is also a Layer 2 device, meaning that it operates on Ethernet addresses. The Ethernet bridge standard, IEEE 802.1D, describes the essential part of the Ethernet switching mechanism. Each of the bridge ports runs in promiscuous mode, receiving every frame transmitted on each connected segment. A bridge limits the traffic on network segments. This is done by forwarding frames that need to reach a different segment and filtering those whose destination is on the same segment on which they arrived. The effect is to limit the amount of traffic on each segment and increase the network throughput. In Figure 3.8, if node A is sending a frame to node B, then the bridge filters the frame so that it doesn't appear on segment 2. However, if node A is sending to node D, then the frame is forwarded by the bridge to segment 2. Note that nodes on the same segment are required to operate at the same speed.

A bridge learns what nodes are connected to it. It has an address table which maps bridge ports to MAC addresses; of course, there can be more than one MAC address associated with a particular bridge port. When a node is first plugged into a bridge port, the bridge is unaware of its MAC address until it starts sending frames. When a frame is sent, the source address of the frame is examined by the bridge in order to learn the MAC address of the node connected to that port.

Figure 3.8: A network with two segments connected by a bridge.

Prior to this, all frames destined for that address are broadcast to all ports. Bridges use the spanning tree algorithm to detect and close loops which would otherwise cause packets to circulate around the network indefinitely. The spanning tree algorithm is discussed further in a later section. When bridges were first introduced, they tended to be software based; as a result, the speed at which they forwarded frames depended on the bridge CPU. Ethernet switches have evolved from bridges and therefore incorporate the bridge standards. Furthermore, the internal structure of Ethernet switches means that they are able to receive and forward multiple frames simultaneously.

Virtual LANs (VLANs)

As illustrated in Figure 3.8, bridges and switches do not limit broadcasts, so broadcasts are received by all nodes in the network; broadcast frames are limited only by the boundaries of the broadcast domain. For a large network, broadcasts can take up a significant amount of the useful bandwidth. Broadcasts can be stopped by adding routers, because routers do not forward broadcasts; however, routers add latency and have less bandwidth. Also, in a large network, for reasons of security or ease of management, network administrators may not want certain nodes exchanging data. Virtual LANs, or VLANs, may be used as a solution to both of these problems. VLANs (IEEE 802.1Q) are a way of providing smaller networks within a LAN by segmenting the network, such that traffic from

certain groups of nodes is limited to certain parts of the network. VLANs can be thought of as a way of providing multiple broadcast domains. The IEEE 802.1Q standard defines VLANs that operate over a single spanning tree. This means a VLAN is defined by a subset of the topology of the spanning tree upon which it operates. In a network with VLANs, only nodes of the same VLAN membership are allowed to communicate with unicast, multicast or broadcast traffic. However, nodes may belong to more than one VLAN, that is, VLANs can overlap. There are several ways in which VLAN membership can be defined: by the switch port, by the MAC address or by Layer 3 information such as the IP address.

To define a port based VLAN in a switch, each switch port is assigned a VLAN number or membership. When a packet arrives at the switch port, the VLAN of the packet is noted. If the destination port is in the same VLAN, the packet is forwarded; otherwise it is dropped. If the frame needs to be broadcast, it is broadcast to all ports in the same VLAN. Address based VLANs are defined by instructing the switch which MAC addresses to put into which VLANs; the switch then only forwards frames if the source and destination MAC addresses are in the same VLAN. Similarly, VLANs based on Layer 3 and higher layers require configuring which field the VLAN filtering is based on.

Frame based VLAN tagging allows VLANs to span multiple switches, since the VLAN information is encoded in the frame. The VLAN identification (VID) is 12 bits of the tag control information (see Figure 3.5) and allows 4093 private VLANs to be defined. There are three reserved VID values: 0, which implies a null VLAN; 1, which is the default VID; and FFF. A frame carrying tag control information is known as a tagged frame. A switch can be instructed to add a tag control information field to a frame when it enters the switch, such that when it is transmitted on the output port the VID can be used to identify which VLAN it belongs to. Conversely, a switch can strip the tag control information before sending the frame to the output port. This is to ensure that network equipment which does not understand the tag control information can accept the frame. VLANs provide improved manageability, security and increased performance by limiting unwanted traffic over the network.

Quality of service (QoS)

The IEEE 802.1p Quality of Service standard uses a three bit field (see Figure 3.5) to assign a priority to the frame; eight priorities can be assigned. The priority field is found inside the tag

control information field. Some vendors also implement a port based priority system whereby a switch port is assigned a priority; all packets from nodes attached to that port then have the same priority. The 802.1p standard does not specify the model for deciding which packet to send next; this is up to the vendor. The 802.1p standard is being merged with the 802.1D standard.

Trunking

Link aggregation (IEEE 802.3ad), also known as clustering or trunking, was standardised in 2000. It is a way of grouping links from multiple ports into a single aggregate link. The effective throughput of this aggregate link is the combined throughput of the independent links. In order to retain frame sequence integrity, a flow of packets between any two nodes (a conversation) flowing across the trunked link can only use a single link of the trunk. This means the effective throughput of any conversation is limited by the speed of one of the trunked links. Broadcasts on trunked links are handled like other frames, such that they are not sent multiple times to the same destination. Trunking offers load balancing, whereby a conversation can be moved from a congested link of a trunk to another link in the same trunk. This is also used to support link redundancy, where all conversations on a disabled link are rerouted to a different link within the same trunk.

Higher layer switching

Currently no standards exist, but vendors have recognised a market need and are introducing higher layer switching features in their switches. These features allow the switches to look deeper into the frame before making the switching decision. No consistent definitions exist and, as a result, implementations vary. These features are referred to as IP switching, Layer 3 switching and even Layer 4 switching. Some vendors claim to offer full routing protocols in their switches.

Switch management

There are no IEEE standards for managing switches. Most vendors provide software which uses the Simple Network Management Protocol (SNMP) to collect device level Ethernet information and to control the switch. SNMP uses Management Information Base (MIB) structures to record statistics such as collision count, packets transmitted or received, error rates and other device level information. Additional information is collected by Remote MONitoring (RMON) agents to aggregate the statistics for presentation via a network management application.
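To make the position of these fields concrete, the sketch below reads the 802.1p priority and the 802.1Q VID out of the tag control information of a tagged frame held in a byte buffer. The 0x8100 tag protocol identifier and the field widths are those defined by IEEE 802.1Q; the buffer handling itself is illustrative.

/* Sketch: reading the 802.1p priority and 802.1Q VLAN ID from a tagged
 * Ethernet frame held in a byte buffer. Offsets follow the tagged frame
 * layout; no validation beyond the TPID check is attempted. */
#define TPID_8021Q 0x8100   /* tag protocol identifier for tagged frames */

/* Returns 0 and fills prio/vid if the frame is 802.1Q tagged, -1 otherwise. */
int read_vlan_tag(const unsigned char *frame, int len, int *prio, int *vid)
{
    if (len < 18)                               /* 6 dst + 6 src + 4 tag + 2 type */
        return -1;
    int tpid = (frame[12] << 8) | frame[13];
    if (tpid != TPID_8021Q)
        return -1;
    int tci = (frame[14] << 8) | frame[15];     /* tag control information */
    *prio = (tci >> 13) & 0x7;                  /* 3-bit 802.1p priority   */
    *vid  = tci & 0xFFF;                        /* 12-bit VLAN identifier  */
    return 0;
}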

The management interface generally comes in two forms: a serial connection to a VT100 terminal, and management application software. In the second case, there is a trend towards web browser based management software; the clear advantage of this is the increased portability and location independence offered by the web.

3.6 Reasons for Ethernet

The reasons why we are considering Ethernet as a solution for the ATLAS LVL2 network are:

Price: Compared to other technologies being considered for the ATLAS LVL2 network, Ethernet is very price competitive, both in terms of initial outlay and cost of ownership. Historically, prices of Ethernet components have fallen rapidly when components conforming to a new standard are introduced (see Figure 3.9), and this trend is predicted to continue. Gigabit Ethernet's design drew heavily on the Physical Layer of the ANSI X3.230 Fibre Channel standard, which means that Fibre Channel PHY components can be used for Gigabit Ethernet, driving costs down further.

Volume: Ethernet has a huge installed base; 83% of installed network connections in 1996 were Ethernet [6]. It has become so ubiquitous that today personal computers are sold with an Ethernet NIC as standard. Ethernet continues to enjoy large sales volumes, adding to the price reductions.

Simplicity: Ethernet is relatively simple to install compared to the alternative technologies. It also offers easy migration to higher performance levels.

Management tools: Management and trouble-shooting tools are available. Ethernet switches also support hot swap, whereby nodes can be connected and disconnected without having to power off. This is a highly convenient feature, as adding and removing nodes from the network need not interrupt everyone else on the network.

Performance increase: Ethernet currently runs at three different speeds: 10 Mbit/s, 100 Mbit/s (Fast Ethernet) and 1000 Mbit/s (Gigabit Ethernet). 10 Gigabit/s Ethernet is currently under development and, furthermore, there is a move towards 40 Gigabit/s.

Reliability: Ethernet hubs and switches have become increasingly reliable.

Increased functionality: New features are being added to Ethernet to support new applications and data types (QoS, VLAN tagging, trunking; see Section 3.5).

Lifetime: The lifetime of the ATLAS equipment is greater than a decade. We have confidence in the longevity of Ethernet due to the installed base and the developments in the technology to meet the demands of new applications.

Figure 3.9: The cost of Fast and Gigabit Ethernet NICs and switches as a function of time.

Ethernet and PCs are a commodity approach to the ATLAS trigger/DAQ.

3.7 Conclusion

In this chapter, we have introduced the Ethernet technology and outlined the reasons why it is of interest to ATLAS. Traditional Ethernet provides a best effort service model; its intelligence is mostly in the nodes, making Ethernet simple. Switched Ethernet is evolving to support QoS, VLANs, multicast congestion control and web-based management as more intelligence is added to the network. The widespread popularity of Ethernet ensures that there is a large market for Ethernet equipment, which also helps keep the technology competitively priced.

For ATLAS, 100 Mbit/s Ethernet and higher speeds are of interest. Only switched Ethernet is of interest, due to the requirements imposed by ATLAS; the CSMA/CD protocol is therefore of no interest. The added flexibility offered by the evolving standards and by the emerging higher layer switching functionality, compared to simple switching hubs, is also of interest for the ATLAS trigger system. In the following chapters, we look at Fast and Gigabit Ethernet running in full duplex mode.


Chapter 4
Network interfacing: Performance issues

4.1 Introduction

In this chapter, we look at Ethernet network interfaces and the issues affecting their performance. It is important to understand the performance of the end nodes so that an assessment of the protocol overheads can be made and the end nodes can be characterised for the modelling of the ATLAS LVL2 trigger system.

Figure 4.1 shows a simplified representation of the PC system architecture. The CPU is attached to main memory via a memory bus. The PCI bus connects to the memory bus via a PCI bridge, and the network interface card (NIC) is connected to the PCI bus. On our systems, the memory bus is 64-bit running at 66 MHz and the PCI bus is 32-bit running at 33 MHz.

Figure 4.1: The PC system architecture.

We look at the performance of the TCP/IP protocol implementations under the Linux operating system (OS) and of MESH [11] [12] [13], a low overhead messaging and scheduling library written under the Linux OS, running on PCs, with the ATLAS LVL2 application in mind. An illustration of the layering of these communication interfaces is shown in Figure 4.2. In the Linux OS, processes run either at the user level or at the kernel level. User applications access the network via the kernel socket interfaces. The socket interfaces access the protocols at the required levels shown in the figure: TCP applications use the SOCK_STREAM interface, UDP applications use SOCK_DGRAM, IP applications use SOCK_RAW and raw Ethernet applications use the SOCK_PACKET interface. MESH is a user level process with its own driver; it bypasses the kernel to access the NIC hardware. MESH also has its own scheduler to schedule the running of MESH applications.
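For illustration, the sketch below opens sockets at these four access levels from user space. The raw Ethernet socket shown uses the AF_PACKET family of later Linux kernels; on the 2.0 series kernels the older SOCK_PACKET form of the call applies instead, so this is a sketch of the interfaces rather than the measurement code itself.

/* Minimal sketch: opening sockets at the four access levels of Figure 4.2.
 * Error handling is reduced to perror(); the raw socket requires root
 * privileges and its exact constants depend on the kernel generation. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>      /* IPPROTO_RAW */
#include <arpa/inet.h>       /* htons       */
#include <linux/if_ether.h>  /* ETH_P_ALL   */

int main(void)
{
    int tcp = socket(AF_INET, SOCK_STREAM, 0);               /* TCP (SOCK_STREAM)   */
    int udp = socket(AF_INET, SOCK_DGRAM,  0);                /* UDP (SOCK_DGRAM)    */
    int ip  = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);         /* IP (SOCK_RAW)       */
    int eth = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));  /* raw Ethernet frames */

    if (tcp < 0 || udp < 0 || ip < 0 || eth < 0)
        perror("socket");
    return 0;
}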

Figure 4.2: An illustration of the protocols in relation to each other.

We look in detail at the comms 1, or ping-pong, benchmark [16] because its traffic pattern resembles the ATLAS LVL2 request-response pattern, and hence we can draw some conclusions about the performance for ATLAS. Fast and Gigabit Ethernet results are presented. We concentrate on Linux because of its significantly better performance compared to Windows NT [23] and the free availability of the Linux source code to aid understanding. The ATLAS trigger/DAQ requires computation as well as communications. In these measurements, we measure the CPU loading during communications to give an idea of the CPU power left for running the LVL2 software and trigger algorithms.

4.2 The measurement setup

The setup for the measurements here consisted of two PCs directly connected via Fast or Gigabit Ethernet. We use 100Base-TX Fast Ethernet (copper cables with RJ45 connectors) and 1000Base-SX Gigabit Ethernet (multi-mode fibre optic cables and connectors). For network interface cards (NICs or adapters), we use the Intel EtherExpress Pro 100 [36] for the Fast Ethernet measurements and the Alteon ACENIC Gigabit Ethernet NICs [37] for Gigabit Ethernet. The two PCs were completely isolated from any other networks. All unnecessary processes (such as screen blanking and screen savers, or moving the mouse) were disabled to avoid generating

any extra CPU load or usage overheads and to maintain a steady background state during the measurements. The Linux OS was booted into single-user text mode for a minimal OS setup, to minimise the CPU overhead. We used kernels from both the 2.0 and 2.2 series. The PCs used ranged from 166 MHz to 600 MHz Pentium machines. The main memory size was 32 MBytes or above. In each of the measurements, we used pairs of PCs of the same type connected together and running the same operating system, to ensure a symmetric setup. We used IP version 4 and Ethernet frames without VLAN tags.

4.3 The comms 1 measurement procedures

Comms 1, or ping-pong, is a simple message exchange between a client and a server (see Figure 4.3). We distinguish between message and data: the message is the user information to be transmitted, whereas the data corresponds to the information encapsulated by the protocol. The client sets up a message and sends it in its entirety to the server. The server receives the complete message and sends it back to the client. The time for the send and receive (the round trip time) is measured on the client PC (we do not include the time to set up the message in the application, since we are interested in the communications only; for TCP, we do not include the connection setup time). Half of the round trip value is taken in order to obtain the elapsed time (or latency) in sending the message one way. It is this that we plot in our graphs. Knowing the message size and the elapsed time, the throughput can be calculated. This throughput represents the non-pipelined throughput, that is, there is only one packet going through the system at any time. Even in this setup, a single comms 1 measurement could include extra time due to operating system scheduling. Thus, in order to obtain the communications performance a typical application would receive, each measurement was repeated 1000 times and the average taken.

The CPU usage measurements are obtained by implementing a counting thread at the client. This counter is initially calibrated to find out how fast it can count without any other threads running. The communications thread is raised to a higher priority such that the counting thread only runs when the processor is not processing any communications, giving a count-per-second value less than the initial calibrated measurement. From this, we can deduce the percentage of the CPU time used in the communication.
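The client side of the ping-pong can be sketched as follows. The server address, port, message size and repetition count are illustrative; the timing simply brackets the send/receive loop with gettimeofday() and divides by the number of repetitions, as described above.

/* Sketch of the comms 1 (ping-pong) client: send a message, wait for the
 * echoed reply, and average the round-trip time over many repetitions.
 * The server address, port, MSG_SIZE and REPS are illustrative values. */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

#define MSG_SIZE 1460
#define REPS     1000

static void recv_all(int fd, char *buf, int len)
{
    int got = 0;
    while (got < len) {                          /* TCP is a stream: loop until */
        int n = read(fd, buf + got, len - got);  /* the full message arrives    */
        if (n <= 0) break;
        got += n;
    }
}

int main(void)
{
    char buf[MSG_SIZE];
    struct sockaddr_in srv;
    struct timeval t0, t1;
    int i, fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&srv, 0, sizeof srv);
    srv.sin_family      = AF_INET;
    srv.sin_port        = htons(5000);                 /* illustrative port   */
    srv.sin_addr.s_addr = inet_addr("192.168.0.2");    /* illustrative server */
    connect(fd, (struct sockaddr *)&srv, sizeof srv);

    gettimeofday(&t0, NULL);
    for (i = 0; i < REPS; i++) {
        write(fd, buf, MSG_SIZE);     /* ping */
        recv_all(fd, buf, MSG_SIZE);  /* pong */
    }
    gettimeofday(&t1, NULL);

    double rtt_us = ((t1.tv_sec - t0.tv_sec) * 1e6 +
                     (t1.tv_usec - t0.tv_usec)) / REPS;
    printf("average one-way latency: %.1f us\n", rtt_us / 2.0);
    return 0;
}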

Figure 4.3: The comms 1 setup.

4.4 TCP/IP protocol

A brief introduction to TCP/IP

The TCP/IP (Transmission Control Protocol/Internet Protocol) protocol suite is widely available and supported by all commercial operating systems. The Transmission Control Protocol, or TCP [14], is a reliable connection-oriented stream protocol. It guarantees delivery with packets in the correct sequence and also provides an error correction mechanism. TCP sits on top of IP, the Internet Protocol [15]. IP is a connectionless, unreliable packet protocol. It provides error detection and the addressing function of the Internet; however, IP does not guarantee delivery or provide flow control. The TCP/IP protocol suite also includes UDP, ICMP, ARP, RARP and IGMP [18] [19]. A TCP/IP protocol stack is an implementation of the protocol suite; here we look at the Linux implementation. We look specifically at TCP because it has all the features required by the ATLAS trigger/DAQ system, such as guaranteed delivery of data and flow control. Due to its pervasiveness, it is natural to ask if TCP/IP can support the ATLAS trigger/DAQ application.

TCP/IP grew out of the network research that began in the late 1960s and has been evolving ever since. It was designed to build an interconnection of networks providing universal communication services, and to interconnect different physical networks to form what appears to the user to be one large network. TCP/IP was designed before the advent of the OSI 7-layer model. Its design is based on four layers (see Figure 4.4):

- The network interface or data link layer. TCP/IP does not specify any protocol here. Example protocols that can be used are Ethernet, Token Ring, FDDI, ATM etc.
- The network or Internet layer. This layer handles the routing of packets. IP is a network layer protocol that provides a connectionless, unreliable service. The IP header

is usually its minimum of 20 bytes.
- The transport layer. This layer manages the transport of data. There are two transport layer protocols provided in the TCP/IP suite: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP is a connection oriented protocol which provides a reliable flow of data between hosts; it therefore requires a connection setup. The TCP header size is usually 20 bytes. UDP is a connectionless, unreliable protocol; applications using UDP have to provide their own flow control and packet loss detection and recovery mechanisms. The UDP header is eight bytes.
- The application layer. This layer is the program/software which uses TCP/IP for communication. The interface between the application and the transport layer is defined by sockets and port numbers. To the user application, the socket acts like a FIFO to which data are written and from which they are emptied out by the protocol. The port number is used to identify the user application. Common TCP/IP applications are Telnet (remote login), FTP (File Transfer Protocol) and SNMP (Simple Network Management Protocol). Our measuring application runs at this layer.

Figure 4.4: The model of the TCP/IP protocol.

In opening a TCP/IP socket for communications, there are various options which can be set. These options come under the heading of socket options. The socket size is one of these options.

It refers to the available buffer space for sending and receiving data from the peer node. The send and receive socket buffers can be set independently.

Sliding window and sequence number

To guarantee the delivery of packets, TCP uses the sliding window algorithm to effect flow control. Packets sent from a TCP node carry a window size in the header. The window size tells the peer TCP node how many bytes of data the originating node is prepared to receive. This system ensures that the peer node does not overload the buffers of the originating node. Every window's worth of data must be acknowledged to confirm delivery. There is also a sequence number in the TCP header to identify packet loss. The application can control the initial TCP window size by changing the socket size. The window size advertised by the TCP protocol to a peer will depend on the receive buffer available, since buffer space may be taken up by data still to be read by the application.

Maximum segment size and maximum transmission unit

TCP sends data in chunks known as segments. The maximum segment size (MSS) depends on the maximum transmission unit (MTU) of the underlying link layer protocol. For Ethernet, the MTU corresponds to the maximum amount of data that can be put into a frame, which is 1500 bytes. This means the maximum segment size for TCP/IP running on top of Ethernet is 1460 bytes, taking into account the TCP and IP headers.

Delayed acknowledgements

TCP uses an acknowledgement scheme to ensure that packets have been delivered. Acknowledgements are encoded into the TCP header. This allows an acknowledgement to be attached (piggybacked) to a user message heading in the opposite direction. If there is no user data heading in the opposite direction, a TCP header is sent with the acknowledgement encoded. To help avoid congestion caused by multiple small packets in the network, acknowledgements are deferred until the host TCP is ready to transmit data (so that the acknowledgement can be piggybacked) or a second segment (in a stream of full sized segments) has been received by the host TCP. When acknowledgements are deferred, a timeout of less than 500 ms [20] is used, after which the acknowledgement is sent. According to Stevens [18], the timer goes off relative to when the

kernel was booted, and most implementations of TCP/IP delay acknowledgements by up to 200 ms.

Nagle algorithm

Another congestion avoidance optimisation found in TCP/IP implementations is known as the Nagle algorithm, proposed by John Nagle in 1984 [17]. It is a way of reducing congestion in a network caused by sending many small packets. As data arrive from the user to TCP for transmission, the TCP layer inhibits the sending of new segments until all previously transmitted data have been acknowledged. While waiting for the acknowledgements to come, the user can send more data to TCP for transmission. When the acknowledgement finally arrives, the next segment to be sent could therefore be bigger, due to the additional sends by the user. No timer is employed with this algorithm; however, when the segment reaches the MSS, the data are sent even if the acknowledgement has not arrived.

The TCP/IP protocol is very complex. We have described above only the points which recur in this text. Readers wishing to know more, for example about how TCP recovers from packet loss, are referred to [18], [19], [14] and [15].

Results with the default setup using Fast Ethernet

The results shown in Figure 4.5 were obtained from measurements on two 233 MHz Pentium processors. Figure 4.5(a) shows the non-pipelined throughput (as described in Section 4.3) and Figure 4.5(b) the latency, both measured against the message size. These results were obtained with the default setup of the Linux OS, that is, without explicitly specifying any TCP options. This default setup has both the Nagle algorithm and the delayed acknowledgement enabled. Note that the plots in Figure 4.5 are from the same results: taking the reciprocal of the latency axis of Figure 4.5(b) and multiplying it by the message size gives the plot in Figure 4.5(a). Plotting the results in these two forms emphasises the features that we would like to discuss.

1. The first part of the graphs is for message sizes from zero to 1460 bytes. In Figure 4.5(a) we see that the throughput rises to a maximum of just over 6 MBytes/s for a data size of 1460 bytes. The form of Figure 4.5(b) in this range (zero to 1460) is linear and rising with the message size, although it is not visible due to the scale (see the next section for plots of this range).

Figure 4.5: Comms 1 under TCP/IP with the default setup. CPU = Pentium 233 MHz MMX; OS = Linux. (a) Throughput; (b) Latency.

2. The second part of the graphs is the message sizes which are multiples of 1460 bytes. At these points, the message fits into full sized TCP segments, and we see a low latency and high throughput compared to the other parts of the graphs.

3. The third part of the graphs is the region from 1461 to 2919 bytes message size. In this region the message requires two TCP segments to transmit. The half round trip latency is around 95 ms, corresponding to a throughput of near zero.

4. The fourth part of the graphs is the region where the message size is greater than 2920 bytes, not including the multiples of 1460 bytes. In this region, the messages require three or more TCP segments to transmit. The latency fluctuates within a band from 5 ms to 12 ms. The throughput rises because the latency is fixed within this band while the message size is increasing.

The exchange of messages between the client and server when the user message fits into a single TCP segment is illustrated in Figure 4.6. To begin with, the client sends a message. The server TCP receives the message and schedules an acknowledgement to be sent. Since the server's response is immediate, the acknowledgement is piggybacked onto the response message. At the client side, the message is received and, as in the case of the server, the acknowledgement is scheduled to be sent.

Figure 4.6: An illustration of the comms 1 exercise involving the exchange of single TCP segments (s = partial or full segment, ack = acknowledgement; not to scale).

However, the client repeats the process immediately, allowing the acknowledgement to be piggybacked. This continues for a total of 1000 times. As we see in the diagram, each time the acknowledgement is piggybacked, and thus we obtain the optimal communications performance.

The features shown in Figure 4.5 for data sizes greater than 1460 bytes are due to two effects within the TCP protocol: the delayed acknowledgement and the Nagle algorithm [15]. As mentioned above, the Nagle algorithm inhibits the sending of new segments until all previously transmitted data have been acknowledged, or until the size of the message to be sent reaches the MSS, in this case 1460 bytes. With respect to the delayed acknowledgement, we should keep in mind that acknowledgements are sent if they can be piggybacked onto user data.

As described above, the second part of the graphs in Figure 4.5 covers the regions where the message size is a multiple of 1460 bytes. At these points, the latencies are low and the throughputs high. The reason is that at these points the user messages are exactly multiples of the MSS. This means that they are not inhibited by the Nagle algorithm when transmitted, and the resulting acknowledgements they generate can piggyback on the response data in the case of the server and on the next request data in the case of the client.

In the third part of the graphs in Figure 4.5 the user message lies between 1461 bytes and 2919 bytes. The observed effect is explained in Figure 4.7. Here, the client sends the first full segment. Since the second segment is a partial segment (less than the MSS), the Nagle algorithm causes it to wait until the outstanding acknowledgement has been received.

Figure 4.7: An illustration of the comms 1 exercise involving the exchange of two TCP segments (p = partial segment, F = full segment, ack = acknowledgement; not to scale).

At the server side, a single segment is received and, since there are no segments being returned to the client to piggyback onto and a second segment is not received, the acknowledgement is delayed until the server's delayed acknowledgement timer fires. When this happens, the acknowledgement is sent and then the client sends the remaining partial segment. The server sends the first segment of the response with an acknowledgement piggybacked on it. Again, since the second segment is a partial segment, the Nagle algorithm causes it to be delayed until the outstanding acknowledgement has been received. As with the server, the delayed acknowledgement timer fires before the acknowledgement is sent. This series of events continues for a total of 1000 times, which is the number of times the measurement is performed.

In Figure 4.7 we illustrate a single round trip time. This contains two delayed acknowledgements, each of which fires at intervals of 100 ms: one due to the server and the other due to the client. Thus in Figure 4.5(b), the half round trip time plot for message sizes between 1461 bytes and 2919 bytes shows a latency near 100 ms.

The fourth part of the graphs in Figure 4.5 is the region where the message size is greater than 2920 bytes, not including the multiples of 1460 bytes. We believe that the observed features are also due to the combination of the delayed acknowledgements and the Nagle algorithm. The key point to remember here is that the delayed acknowledgement timer goes off relative to when the kernel was booted, and not relative to when a packet was received. With this in mind, Figure 4.8 shows three different scenarios for the measurement of the round trip time for a message spanning three segments.

Figure 4.8: An illustration of the comms 1 exercise involving the exchange of three TCP segments (p = partial segment, F = full segment, ack = acknowledgement; not to scale).

In the first case, the delayed acknowledgement timers do not fire for either the client or the server. The client sends two full segments, and the Nagle algorithm causes the last segment (which is a partial segment) to wait for an acknowledgement. When the two segments are received at the server, the acknowledgement packet is sent immediately. When the client receives the acknowledgement, the last segment is sent. A similar sequence of events occurs as the server sends the response message back to the client.

In the second case, the delayed acknowledgement timer fires either for the client or for the server. The case illustrated in Figure 4.8 shows the delayed acknowledgement timer firing at the server. Because the delayed acknowledgement timer goes off relative to when the kernel was booted, it can fire at the server when the first segment is received. This causes the acknowledgement to be sent out. Therefore, when the second full segment arrives, an acknowledgement is not sent out until the delayed acknowledgement timer fires again.

The third case is where both the client and the server delayed acknowledgement timers fire during the message exchange. Further work needs to be performed in order to better understand the behaviour of TCP when the message size is three segments or larger. Results by Rochez [26] for comms 1 under Windows NT are similar to those of Figure 4.5.

Conclusion for ATLAS

Most of the ATLAS LVL2 messages from ROBs to the processors will span only a single TCP segment, but the average fragment size from the SCT is 1600 bytes and from the calorimeters it is 1800 bytes [4]. In these cases, the observed behaviour with the default setup (the combination of the delayed acknowledgement and the Nagle algorithm) would have the effect of increasing the delays in the communications between the ROBs and processors. In the next section, we disable the delayed acknowledgement to see how the behaviour changes.

Delayed acknowledgement disabled

Figure 4.9 shows the measurement (on two 200 MHz Pentium processors) repeated, but with the delayed acknowledgement disabled; that is, acknowledgements are sent immediately when TCP segments are received. Note that the Nagle algorithm is still enabled. The features observed in Figure 4.5 when sending two or more TCP segments are no longer visible. The downward spikes in Figure 4.9(b) represent message sizes corresponding to whole numbers of TCP segments, and hence the minimum transit latency of the comms 1 measurement. The length of the spikes in microseconds is the extra time added to the packet latency as the client waits for the acknowledgements. Therefore the length of the spikes corresponds to the time it takes the acknowledgement packet to go from the server to the client. Since the TCP acknowledgement consists only of a TCP header, this should be approximately the time to send a minimum-length TCP segment, as can be read from Figure 4.9(b). The average length of the spikes is in fact larger, leaving an overhead of 25.9 μs. We expected the acknowledgement to take less than 100 μs, since the application is not involved. We are uncertain what this extra time is due to; a possibility is that sending a data-less acknowledgement requires more processing time than sending a piggybacked acknowledgement, thus delaying the sending of the packet which follows the acknowledgement.

Nagle algorithm and delayed acknowledgement disabled

With the delayed acknowledgement and Nagle algorithm off, the resulting throughput curve shows only features corresponding to the Ethernet frame boundaries, as shown in Figure 4.10. The figure also has a plot of a parameterised model of the communication. The model shows very good agreement with the measurements. The model is explained in the next section.
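The Nagle algorithm is disabled per socket with the standard TCP_NODELAY option, as sketched below. There is no equally standard socket option for the delayed acknowledgement (the TCP_QUICKACK option appeared only in later Linux kernels), so how it was disabled for these measurements is not shown here.

/* Minimal sketch: disabling the Nagle algorithm on an open TCP socket.
 * TCP_NODELAY is a standard option; disabling delayed acknowledgements has
 * no portable equivalent and is not shown. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_NODELAY */

int disable_nagle(int fd)
{
    int one = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    return 0;
}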

Figure 4.9: Comms 1 under TCP/IP. CPU = Pentium 200 MHz MMX; Nagle algorithm on; delayed acknowledgement disabled; socket size = 64 kbytes; OS = Linux. (a) Throughput; (b) Latency.

Conclusion for ATLAS

From these results, the best configuration of the end-nodes, in terms of communication, for ATLAS LVL2-like traffic is with both the Nagle algorithm and the delayed acknowledgement disabled. However, these results do not take the CPU load into account. Later in this chapter, we will look at the CPU performance.

A parameterised model of TCP/IP comms 1 communication

Values from the measurements

In Figure 4.10(b), there are four distinct features we model.

1. The offset from the latency axis. This tells us the fixed, or minimum, overhead in sending a TCP segment. Obtaining the actual value requires extrapolation from 1460 down to six message bytes, because of the minimum and maximum packet size restrictions of Ethernet. We denote the value obtained by b_1 in Equation 4.1 below.

2. The region from a message size of six to 1460 bytes, the single segment region. In this region, only a single TCP segment and Ethernet frame is sent. We denote the gradient obtained for this region (in μs/byte) by a_1.

Figure 4.10: Measurement against the parameterised model. Comms 1 under TCP/IP: CPU = Pentium 200 MHz MMX; Nagle algorithm disabled; delayed ack disabled; socket size = 64 kbytes; OS = Linux. (a) Throughput; (b) Latency.

Thus, in the region of 0 to 1460 bytes, the model has the form

    T_{1/2} = a_1 m + b_1,    (4.1)

where m is the message size in bytes and T_{1/2} is half the round trip time.

3. Every subsequent region of size 1460 bytes, starting from a message size of 1461 bytes, is a multi-segment region. In these regions multiple segments are sent, and thus advantage can be taken of the pipelining effect. We denote the gradient measured here (in μs/byte) by a_2.

4. The height between subsequent multi-segment regions described in item 3. This is the overhead TCP/IP suffers in sending an extra TCP segment, which we measure to be 55 μs. The link time for sending the minimum Ethernet frame is 6.72 μs (including the inter-packet time), which is only 12% of this; the rest of the time is therefore due to the node overheads (protocol, PCI bus, driver and NIC).

For the multi-segment regions, the model is of the form

    T_{1/2} = T'_{1/2} + a_2 (m - m'),    (4.2)

where T'_{1/2} is half the round trip time at the previous full segment size m'.

Figure 4.11: The flow of the message in the comms 1 exercise.

The model

We model the flow of the data (of the first 1460 bytes) for the ping-pong as shown in Figure 4.11. This shows a simplified message flow from application transmission to application reception. A summary of the flow is as follows. The message transfer begins with the application writing to the protocol (in this case TCP/IP). The protocol packs the data with the right headers and calls the NIC driver. Note that the Ethernet source address, destination address and type field are added by the protocol before the driver is called. The driver sends the packet to the NIC, which adds the Ethernet CRC and sends the packet on the link. The NIC generates an interrupt after the successful send. At the receiver side, the NIC reads the frame from the link, copies it into main memory via the PCI bus and notifies the driver. The driver runs and passes the packet to the protocol, which then passes the packet to the application after removing the protocol headers.

For our parameterised model, we define the following overheads as having a fixed component and a data size dependent component:

- The protocol overhead: the fixed overhead is defined as o_proto seconds, which accounts for the protocol setup time, and in addition the rate is defined as r_proto bytes/s, which accounts for any data copies. This makes the protocol overhead equal to o_proto + m/r_proto, where m is the data size.
- The PCI transfer: the fixed overhead is defined as o_pci seconds, which accounts for the arbitration and setup time. The rate is defined as r_pci bytes/s. We are using 32-bit 33 MHz PCI, thus the rate is 132 MBytes/s.

We also define the link rate as r_link; for Fast Ethernet this is 12.5 MBytes/s. We define the following as constants:

- The application overhead o_app seconds. This is a system call.

- The driver overhead o_drv seconds.
- The NIC overhead o_nic seconds.
- The receive interrupt o_int seconds.

At various points, the protocol payload changes due to the overheads introduced by the protocols. Thus we define the following:

- m bytes: the user message size at the application level.
- h_tcp bytes: the extra overhead incurred by the TCP/IP protocol. This is made up of the TCP and IP headers and equals 40 bytes.
- h_eth bytes: the overhead due to the Ethernet framing as the packet is transferred over the PCI bus. This is the Ethernet destination address, the source address and the Ethernet type field. The value is 14 bytes. We do not include the Ethernet CRC, which is added and removed by the NIC (see the link overhead h_link).
- h_link bytes: the extra overhead on the frame due to the link transfer. This is the preamble, the start of frame delimiter and the CRC field, and equals 12 bytes. The inter-packet gap is not added since we are considering single frame transfers.

The end to end latency, or half the round trip time, for the data can be written as

    T_{1/2} = o_app + o_proto + (m + h_tcp)/r_proto + o_drv + o_pci + (m + h_tcp + h_eth)/r_pci + o_nic
            + (m + h_tcp + h_eth + h_link)/r_link
            + o_nic + o_pci + (m + h_tcp + h_eth)/r_pci + o_int + o_drv + o_proto + (m + h_tcp)/r_proto + o_app,    (4.3)

which collects to

    T_{1/2} = 2(m + h_tcp)/r_proto + 2(m + h_tcp + h_eth)/r_pci + (m + h_tcp + h_eth + h_link)/r_link
            + 2(o_app + o_proto + o_drv + o_pci + o_nic) + o_int.    (4.4)

We re-arrange the equation in the form of Equation 4.1:

    T_{1/2} = m (2/r_proto + 2/r_pci + 1/r_link)
            + 2 h_tcp/r_proto + 2(h_tcp + h_eth)/r_pci + (h_tcp + h_eth + h_link)/r_link
            + 2(o_app + o_proto + o_drv + o_pci + o_nic) + o_int.    (4.5)

If we substitute the numbers above for our system into Equation 4.5, it becomes

    T_{1/2} = m (2/r_proto + 2/(132×10^6) + 1/(12.5×10^6))
            + 80/r_proto + 108/(132×10^6) + 66/(12.5×10^6)
            + 2(o_app + o_proto + o_drv + o_pci + o_nic) + o_int,    (4.6)

with times in seconds and rates in bytes/s.
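Before carrying the algebra further, the structure of Equation 4.5 can be written out as a short routine. The header sizes and the PCI and link rates below are those given above; r_proto and the individual values of o_proto, o_drv and o_pci are placeholders, since the measurements only constrain their combinations.

/* Sketch: evaluating the fixed and per-byte parts of Equation 4.5 for Fast
 * Ethernet. Header sizes and bus/link rates are those given in the text;
 * R_PROTO, O_PROTO, O_DRV and O_PCI are placeholder values. */
#include <stdio.h>

#define H_TCP   40.0        /* TCP + IP headers (bytes)            */
#define H_ETH   14.0        /* Ethernet dst + src + type (bytes)   */
#define H_LINK  12.0        /* preamble + SFD + CRC (bytes)        */
#define R_PCI   132e6       /* 32-bit 33 MHz PCI (bytes/s)         */
#define R_LINK  12.5e6      /* Fast Ethernet (bytes/s)             */
#define R_PROTO 60e6        /* placeholder protocol copy rate      */
#define O_APP   8e-6        /* system call (s), from Boosten [12]  */
#define O_INT   18e-6       /* receive interrupt (s)               */
#define O_NIC   10.5e-6     /* NIC overhead (s)                    */
#define O_PROTO 10e-6       /* placeholder (s)                     */
#define O_DRV   8e-6        /* placeholder (s)                     */
#define O_PCI   5e-6        /* placeholder (s)                     */

/* Half round-trip time (seconds) for an m-byte single-segment message. */
static double half_rtt(double m)
{
    double per_byte = 2.0 / R_PROTO + 2.0 / R_PCI + 1.0 / R_LINK;
    double headers  = 2.0 * H_TCP / R_PROTO
                    + 2.0 * (H_TCP + H_ETH) / R_PCI
                    + (H_TCP + H_ETH + H_LINK) / R_LINK;
    double fixed    = 2.0 * (O_APP + O_PROTO + O_DRV + O_PCI + O_NIC) + O_INT;
    return m * per_byte + headers + fixed;
}

int main(void)
{
    printf("model T_1/2 for 1460 bytes: %.1f us\n", half_rtt(1460) * 1e6);
    return 0;
}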

Collecting the numeric terms, this becomes

    T_{1/2} = m (2/r_proto + 0.0952×10^{-6})
            + 80/r_proto + 6.10×10^{-6}
            + 2(o_app + o_proto + o_drv + o_pci + o_nic) + o_int.    (4.7)

Comparing Equations 4.7 and 4.1, we get the following for the gradient:

    2/r_proto + 0.0952×10^{-6} s/byte = a_1.    (4.8)

This solves to give a value of r_proto well below the 528 MBytes/s that our 64-bit 66 MHz memory bus yields, which implies that the protocol performs multiple copies. Comparing Equations 4.7 and 4.1, we also get, for the constant term,

    80/r_proto + 6.10×10^{-6} + 2(o_app + o_proto + o_drv + o_pci + o_nic) + o_int = b_1.    (4.9)

Measurements done by Boosten [12] on a 200 MHz system reveal that the system call overhead o_app is 8 μs, the interrupt overhead o_int is 18 μs and the NIC send and receive overheads o_nic are 10.5 μs each. Substituting these values into Equation 4.9, we get

    80/r_proto + 2(o_proto + o_drv + o_pci) + 61.1×10^{-6} = b_1.    (4.10)

This solves to give

    o_proto + o_drv + o_pci = 23.1 μs.    (4.11)

An often suggested optimisation of TCP/IP is moving the protocol onto the NIC. From Equation 4.5 we see that moving the TCP/IP protocol onto the NIC saves the transfer of the protocol headers across the PCI bus, which corresponds to 0.8 μs of the total fixed overhead. We can also see that eliminating the data copying would reduce the data dependent overhead by 2/r_proto seconds per byte. This is not significant for Fast Ethernet, where the link time for the minimum frame size (42 data bytes) is 5.76 μs. For Gigabit Ethernet, where the link time for the minimum frame is only 0.576 μs, this would be a worthwhile reduction. The above argument does not take into account the fixed processing overhead of the TCP/IP protocol which would be moved from the host CPU to the NIC; from Equation 4.11 this could be a maximum of 23.1 μs (but it is likely to be much less).

Limitation of the model

The above parameterised model applies only to single TCP segment communication. In making this model, we have made a number of assumptions. We have assumed time symmetry in transmitting and receiving, that is, the elapsed time in each layer of Figure 4.11 is assumed to be the same on transmit as it is on receive.

Potentially, we could profile the various layers in Linux to deduce their actual times for both transmission and reception. Secondly, the performance of the PCI bus is not clear. From our experience, changing the chipset on which the measurements were run, while maintaining the same processor speed, had a significant effect. A report by Intel also shows that the PCI bus performance depends on how well the NIC is designed: for four NICs tested, the PCI efficiency (the amount of the PCI bus transfers which were actual user data compared to the total transfers) ranged from 10% to 45%. These are the two most significant sources of inaccuracy in the conclusions drawn from our model.

Effects of the socket size on the end-to-end latency

It is possible to set the send and receive socket buffers to different values. In Figures 4.12(a) and 4.12(b), both the send and receive buffers of the client and server machines were set to the same value.

Figure 4.12: Comms 1 under TCP/IP for various socket sizes. Delayed ack disabled; Nagle algorithm disabled; CPU = Pentium 200 MHz MMX; OS = Linux. (a) Throughput; (b) Latency.

Looking at Figure 4.12(a), the 4 kbyte socket size shows a large drop in throughput at a data size of 2048 bytes. The same can be seen for the 8 kbyte socket at a data size of 4096 bytes, and for the 16 kbyte socket at 8192 bytes.
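The socket buffer sizes compared in Figure 4.12 are set per socket with the standard SO_SNDBUF and SO_RCVBUF options; a minimal sketch is given below (the 64 kbyte value is just an example, and the kernel may round or cap the requested size).

/* Sketch: setting the send and receive socket buffer sizes before use. */
#include <stdio.h>
#include <sys/socket.h>

int set_socket_buffers(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) < 0) {
        perror("setsockopt(SO_SNDBUF/SO_RCVBUF)");
        return -1;
    }
    return 0;
}

/* e.g. set_socket_buffers(fd, 64 * 1024); */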

The socket size is related to the TCP window size. TCP uses the window size to tell the remote host how much buffer space it has available to receive data; this prevents the remote host from overflowing the buffers of the local host. From the results shown in Figure 4.12, this implementation of TCP sets the window size to half the socket size. The latency increases by 133 μs at a data size equivalent to half the socket size. This is equivalent to the time it takes to send an acknowledgement; the extra latency arises because transmitted data must be acknowledged before new data can be transmitted. In the latest Linux kernel (2.4.x), the socket size corresponds directly to the window size.

Conclusions for ATLAS

We have seen here that the bigger the socket size, the better the performance, since more data can be received before an acknowledgement has to be transmitted. In the case of ATLAS, where there are around 1700 connections per node, we cannot use an arbitrarily large socket size. The optimum is to tune the socket size to the product of the bandwidth and the round trip time, also known as the bandwidth-delay product. This gives the number of bytes that can be stored in the connection between the client and server. With this setting the link can be fully utilised in a one-way transmission.

Results of CPU usage of comms 1 with TCP

The measurements were repeated, but this time we used a low priority thread to measure the CPU load, as described in Section 4.3. The results for the latency and throughput are shown in Figure 4.13, and the plot of the CPU load in Figure 4.14. Note that for the single segment region, the measurements of Figure 4.13 have not changed compared with the corresponding measurements made without the load measuring thread. The CPU load for a single segment (message size of up to 1460 bytes) reaches a maximum of 60%.

Figure 4.15 shows a crude model of the CPU busy and idle times on the client and server during the ping-pong measurement (it does not take into account the interrupt due to sending). It shows that when one CPU is busy, the other is idle. Furthermore, there is an overlap when neither processor is busy. This is due to the extra time the message spends being sent from one node to the other. We label this time the minimum I/O time; it is due to the PCI bus, the NIC send and receive parts and the link. Therefore we expect the CPU usage to be always below 50% (where the processors are the same at either end). This is true if the interrupt due to sending is less than the minimum I/O time.

Figure 4.13: Comms 1 under TCP/IP with the CPU load measured. Delayed ack disabled; Nagle algorithm disabled; CPU = Pentium 200 MHz MMX; socket size = 64 kbytes; OS = Linux. (a) Throughput; (b) Latency.

This is the case here, since the interrupt due to sending is smaller than the NIC send and receive overhead alone, which is 21 μs. Figure 4.14 clearly shows that the maximum CPU load is around 60%. We attribute this to the extra work in sending and receiving the acknowledgement, which is sent immediately on receipt of a packet.

Comparing the CPU load measurements of Figure 4.14 with the latency and throughput measurements of Figure 4.13, we see that for multiple TCP segments the latency and the CPU load fluctuate randomly. We also see that, generally, the communications performance drops as the CPU load drops. From this, we can conclude that the OS is not switching fast enough between the CPU load measuring thread and the communications thread, thus giving more CPU time to the load measuring thread (note that in this measurement the server has no load measuring thread, so what we see here is due to the client's load measuring thread). We suspect that this behaviour is due to the number of packets sent and received by the client node. The scenario is as follows. When an outgoing ping message of a single segment is sent from the client to the server, the server generates an acknowledgement and sends it immediately; when the returning pong message is ready to be sent, it sends that also. When sending messages spanning two or more segments, the server sends an acknowledgement for each incoming segment. This effectively doubles the number of packets sent and received per second. To prove this, we reduce the number of packets per second by re-enabling the delayed acknowledgements (so that the acknowledgements are again piggybacked), while still keeping the CPU measuring thread enabled.

Figure 4.14: CPU usage from comms 1 under TCP/IP with the CPU load measured. Delayed ack disabled; Nagle algorithm disabled; CPU = Pentium 200 MHz MMX; socket size = 64 kbytes; OS = Linux.

Figure 4.15: A model of the CPU idle and busy time during the comms 1 measurements (minimum I/O time = link time + 2 × (PCI + NIC) time).

Figure 4.16: Comms 1 under TCP/IP and raw Ethernet sockets with the CPU load measured. CPU = Pentium 200 MHz MMX; Nagle algorithm disabled; delayed ack on; socket size = 64 kbytes; OS = Linux. (a) Throughput; (b) Latency.

Figure 4.17: The magnification of Figure 4.16(b): the latency from comms 1 under TCP/IP and raw Ethernet sockets with the CPU load measured. CPU = Pentium 200 MHz MMX; Nagle algorithm disabled; delayed ack on; socket size = 64 kbytes; OS = Linux.

Figure 4.18: Comms 1 under TCP/IP and raw Ethernet sockets. CPU = Pentium 200 MHz MMX; Nagle algorithm disabled; delayed ack on; socket size = 64 kbytes; OS = Linux. (a) Single and multiple segments; (b) Single segment only, with the parameterised models of the CPU load.

The results are shown in Figure 4.16. Also plotted in Figure 4.16 are the results of the comms 1 measurement using the raw Ethernet SOCK_PACKET interface, that is, bypassing the TCP/IP stack (see Figure 4.2). Comparing the TCP curve of Figure 4.16 with Figure 4.13, although there is still a lot of randomness, there is an improvement in the communications performance. This shows that reducing the packet rate helps the performance, since the OS scheduler switches at a lower rate between threads.

Conclusion for ATLAS

For the ATLAS trigger system, a computation is required on the processing nodes, therefore we cannot avoid the use of multiple threads. Reducing the maximum data size is not a solution, because data can come in from multiple sources and will cause the same effects. For ATLAS, the behaviour of this version of the Linux scheduler is not ideal. A statement on suitability for ATLAS cannot be made without considering how this behaviour changes with the CPU speed and relating it to the likely processor speed to be used in the LVL2 system. The effect of CPU speed on the performance of the TCP/IP communications is looked at later in this chapter. We can conclude that the delayed acknowledgement, in the absence of the Nagle algorithm, is not detrimental to the comms 1 performance. Furthermore, the delayed acknowledgement reduces the load on the CPU.

Raw Ethernet

Bypassing the TCP/IP protocol and using the raw Ethernet SOCK_PACKET interface gives us the curves labelled Raw Ethernet in Figure 4.16. With the SOCK_PACKET interface, the application must supply a pre-formatted Ethernet frame (with the source and destination addresses, the type field and the data; the CRC is added by the NIC) for transmission. Thus, for messages which span multiple Ethernet frames, the application must perform the packetisation. Figure 4.16 shows that the raw Ethernet curves have less randomness for messages spanning multiple Ethernet frames than TCP. This demonstrates that the performance loss due to switching between threads depends on how much processing time the protocol uses. It also demonstrates that the TCP/IP overhead is around 40 μs. This is seen more clearly in Figure 4.17.

Under these conditions, for TCP/IP, Equation 4.1 becomes

    T_e2e(m) = a_tcp m + b_tcp    (4.12)

and for raw Ethernet sockets it becomes

    T_e2e(m) = a_raw m + b_raw    (4.13)

where m is the message size in bytes, a is the fitted overhead per byte and b the fitted fixed overhead for each protocol. Figure 4.18 shows the CPU load obtained from the comms 1 measurement. In Figure 4.18(b), we also show the measurements against a parameterised model of the CPU load for messages limited to a single segment. The model is described below.

A parameterised model of the CPU load

Given the behaviour of this implementation of TCP/IP, we model the CPU load for single-segment messages. The CPU load is a measure of how hard the CPU works during the communications; it depends on the number of sends and receives (ping-pongs) performed each second. The number of ping-pongs per second is the reciprocal of the round-trip time, i.e. of twice the end-to-end latency T_e2e. We model the CPU load as a fixed value plus a value dependent on the number of ping-pongs per second:

    CPU load = C_0 + C_pp / (2 T_e2e)    (4.14)

where C_0 is the fixed load, independent of the number of ping-pongs per second (it represents the load of setting up the ping-pong measurement), T_e2e is the end-to-end latency and C_pp is the CPU load per ping-pong. The value of T_e2e comes from the ping-pong measurement. To obtain C_0 and C_pp we selected two message sizes and solved two simultaneous equations based on the measured latency and CPU load of Figure 4.17 and Figure 4.18(b). For TCP/IP, C_0 is 14.0%, and for raw Ethernet C_0 is also 14.0%; the fitted values of C_pp show that, compared to raw Ethernet, the TCP/IP protocol has an extra load of 49% per ping-pong.
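For reference, the fitting procedure amounts to solving the following pair of equations. Writing T_1, L_1 and T_2, L_2 for the measured end-to-end latency and CPU load at the two chosen message sizes (the labels are introduced here for clarity),

    L_1 = C_0 + C_pp / (2 T_1)
    L_2 = C_0 + C_pp / (2 T_2)

so that

    C_pp = (L_1 - L_2) / ( 1/(2 T_1) - 1/(2 T_2) ),    C_0 = L_1 - C_pp / (2 T_1).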

Conclusions for ATLAS

TCP/IP has a 49% larger overhead per ping-pong than raw Ethernet. However, raw Ethernet does not have adaptive congestion control, loss detection or recovery. For ATLAS, using raw Ethernet implies relying solely on the underlying Ethernet error detection and flow control mechanisms, or building some degree of error and loss detection and recovery protocol on top of raw Ethernet. In the latter case, the performance is not guaranteed to be better than TCP/IP, and the potential gain would have to be weighed against other costs such as the extra development and maintenance.

Gigabit Ethernet compared with Fast Ethernet

So far we have looked at the performance under Fast Ethernet. In this section we compare the TCP/IP comms 1 performance under Fast and Gigabit Ethernet. For the Gigabit Ethernet tests, we use the Alteon ACENIC. At Gigabit rates, frames can arrive into the NIC at up to roughly 1.5 million packets/s for minimum-size frames. If an interrupt were sent to the CPU for each arriving frame, the CPU would be constantly dealing with interrupts, leading to the situation seen earlier in Figure 4.13. Recall that the setup used to produce those plots had the Nagle algorithm and, more importantly, the delayed acknowledgement both disabled. This meant that for every segment received, a separate acknowledgement was sent. Therefore, for each segment of data sent by a host, four interrupts are generated: one at the sender for the transmission of the segment, two at the receiver (when the packet is received and when the acknowledgement is sent), and one at the sender when the acknowledgement is received. The load of dealing with these interrupts and scheduling a computation process is what led to the shape observed in Figure 4.13.

To help increase the communications performance by reducing the number of interrupts, most current-generation NICs have what is known as interrupt mitigation. In the Alteon ACENIC this is referred to as coalesce-interrupts. It allows the user to regulate how many frames to collect on the NIC before transmitting on the link or raising an interrupt and passing it to the CPU. For these tests, the coalesce-interrupts feature was set such that a single packet triggered a send or a receive. This avoids inaccurate round-trip time measurements.

Figure 4.19 shows the throughput and end-to-end latency plotted against message size for Gigabit and Fast Ethernet. The measurements were performed with the newer Linux kernel on 400 MHz processors, since no drivers are available for the ACENIC under Linux version 2.0.x. Also shown are the lines of best fit. Unlike the previous plots taken with the older kernel, there is a lot of fluctuation for both Fast and Gigabit Ethernet at small message sizes. We attribute this to the change of OS kernel version. Ignoring these fluctuations, the line of best fit for the Fast Ethernet (FE) data has an intercept at 80.2 µs, and the Gigabit Ethernet (GE) line has an intercept at 91.2 µs.

Therefore the equivalent of Equation 4.1 for Fast Ethernet is

    T_e2e(m) = a_FE m + 80.2 µs    (4.15)

The differences between Equation 4.1 and Equation 4.15 are due to the different CPU speeds and the kernel change. For Gigabit Ethernet the equation is

    T_e2e(m) = a_GE m + 91.2 µs    (4.16)

The difference between the gradients of the Fast and Gigabit Ethernet lines is attributed to the link rate, and the difference in the intercepts to the NICs with their different drivers. Comparing the Fast Ethernet TCP/IP curves obtained with the older kernel (Figure 4.17) and the newer one (Figure 4.19(b)), we see more fluctuation at small message sizes for the newer kernel than for the old. Between these two kernel versions, the changes which could account for these fluctuations are the NIC driver, the scheduler and the TCP/IP stack itself.

The CPU load is shown in Figure 4.20. The modelled Fast Ethernet reaches a CPU utilisation of 45% and the modelled Gigabit Ethernet reaches 40%. For the Gigabit Ethernet measurement, the maximum CPU load does not increase from a message size of 500 bytes down to zero bytes. In Figure 4.19(b) we also note that the latency does not decrease from a message size of 500 bytes down to zero bytes. This must be due to the interrupt mitigation (since a higher rate is achieved by the Fast Ethernet NIC) limiting the interrupt rate and hence the number of sends and receives per second. From Equation 4.14, the value of C_0 is 4.0% for Gigabit Ethernet and 5.7% for Fast Ethernet, with the corresponding per-ping-pong loads C_pp obtained from the same fits.

Conclusions for ATLAS

For applications with request-response-like communications and with message sizes in the range shown in Figures 4.19 and 4.20, the host send and receive latencies dominate the link latency. Therefore, in this range, there is no great advantage in using Gigabit Ethernet over Fast Ethernet when we consider that the current cost of a Gigabit Ethernet NIC is about five times that of a Fast Ethernet NIC. Tests on Gigabit Ethernet under Windows NT 4.0 showed an increase in the fixed latency overhead of at least 21% compared to Linux: we measured 160 µs on a 233 MHz NT PC compared with 132 µs on a 200 MHz Linux PC.
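As a back-of-the-envelope illustration of this point: a 1 kbyte message occupies the wire for 1024 x 8 bits / 100 Mbit/s, i.e. about 82 µs, on Fast Ethernet and about 8 µs on Gigabit Ethernet, whereas the fixed software overheads fitted above are 80.2 µs and 91.2 µs per message. The tenfold increase in link speed therefore removes only a modest fraction of the total request-response latency in this message-size range.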

Figure 4.19: Comms 1 under TCP/IP for Fast and Gigabit Ethernet: (a) throughput, (b) latency. Delayed ack on; CPU usage measured; CPU = Pentium 400 MHz; Nagle algorithm disabled; socket size = 64 kbytes; OS = Linux.
Figure 4.20: CPU load for comms 1 under TCP/IP for Fast and Gigabit Ethernet. Delayed ack on; CPU usage measured; CPU = Pentium 400 MHz; Nagle algorithm disabled; socket size = 64 kbytes; OS = Linux.

Figure 4.21: The effect on the comms 1 fixed latency overhead of changing the CPU speed, for Fast and Gigabit Ethernet and different Linux kernels.

Effects of the processor speed

Looking at Figure 4.15, we can see that increasing the CPU speed on both the client and server hosts will reduce the busy time. This will have one of two possible effects, depending on how the busy time compares to the idle time.

1. If the busy time does not decrease significantly compared with the idle time as the CPU speed increases, the observed CPU load will decrease while the end-to-end latency remains fairly constant. This is an indication that we are limited by the I/O.

2. If the busy time decreases significantly compared with the idle time, then the number of ping-pongs will increase. The effect is that the observed CPU load remains constant while the end-to-end latency decreases. This is an indication that we are limited by the software, which cannot reach the I/O limit.

The performance comparison of TCP/IP running on various processor speeds, three different Linux kernel versions, and Fast and Gigabit Ethernet is summarised in Figure 4.21. The plot shows the fixed latency overhead against CPU speed. We see firstly that the older version of the Linux kernel performs better than the newer one. This could be the effect of optimisations made in areas such as the scheduler of the newer kernel deteriorating the communications performance, or simply that the communications code (for example the NIC driver and the TCP/IP stack) is less well optimised in the newer kernel than in the old one. We also see that the performance of Gigabit Ethernet is consistently worse than that of Fast Ethernet. This must be because of higher overheads in the NIC and the driver. We can conclude that the protocol cannot reach the I/O limit. The difference between two PCs running at different speeds is not simply the clock speed of the machines.

The architecture of the chips changes: for example, the cache size and the number of pipeline stages in the processor may change. Furthermore, on the motherboard itself the PCI chipset may change, and we have seen during our tests that different chipsets have different performance. In light of this, and the limited number of points in Figure 4.21, it is difficult to conclude any more from the figure. We come back to the issue of CPU speed effects later in this chapter.

4.5 TCP/IP and ATLAS

Decision Latency

The required average decision time for the ATLAS LVL2 trigger/DAQ is 10 ms. If TCP/IP is to be used, the end-to-end latency for a 1 kbyte message on a 400 MHz processor follows from Equations 4.15 and 4.16 for Fast and Gigabit Ethernet respectively. If we assume the request size to be the minimum packet size (although it is likely to be more), then the request takes 82.2 µs and 91.2 µs for Fast Ethernet and Gigabit Ethernet respectively. Collecting 1 kbyte from 16 ROBs, even if done in parallel, will be dominated by the latencies in getting the responses, as they arrive at the destination serially. The time taken will be approximately

    collection time ≈ request latency + 16 x response latency    (4.17)

This gives 2.8 ms for Fast Ethernet and 2.0 ms for Gigabit Ethernet. With the requirement of an average LVL2 decision time of 10 ms, this leaves around 7 ms for any network latency and for running the LVL2 algorithm on the processor. Overlapping of event processing reduces this latency. However, it is also necessary to account for queueing and congestion in the network. For a full TRT scan, messages will be requested and received from 256 ROBs.

An unresolved issue is the scalability of TCP/IP. We do not know how TCP/IP performance suffers as the number of connections increases. Given our observations in Section 4.4.7, this implementation of TCP/IP does not scale well with a high frequency of packets per second; this is more due to the OS than to TCP/IP. The effect of the TCP acknowledgements with an increasing number of connections has also not been looked at. Thus it is not clear that TCP/IP will be able to meet the ATLAS LVL2 requirements.
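As a rough cross-check of the collection-time figures above, the sketch below evaluates Equation 4.17 for 16 ROBs. The per-response latencies used (about 170 µs on Fast Ethernet and 120 µs on Gigabit Ethernet for a 1 kbyte message) are approximate values read off the fitted models, not independent measurements.

    #include <stdio.h>

    /* Equation 4.17: requests are issued in parallel, but the responses
       arrive serially at the requesting node. */
    static double collection_time_us(int n_robs, double request_us, double response_us)
    {
        return request_us + n_robs * response_us;
    }

    int main(void)
    {
        printf("Fast Ethernet, 16 ROBs:    %.1f ms\n",
               collection_time_us(16, 82.2, 170.0) / 1000.0);   /* ~2.8 ms */
        printf("Gigabit Ethernet, 16 ROBs: %.1f ms\n",
               collection_time_us(16, 91.2, 120.0) / 1000.0);   /* ~2.0 ms */
        return 0;
    }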

Request-response rate and CPU load

Running the LVL2 algorithm requires CPU power. We have seen that up to 45% of the CPU power can be spent on communication. Here we expand on this to look at the request-response rate against the CPU usage.

Figure 4.22: The modified comms 1 setup, with a pause inserted between the server's receive and transmit, used to measure the request-response rate and the client CPU load.

We modify the comms 1 measurement by firstly fixing the size of the message. We also put a pause of varying length between the server's receive and transmit, as illustrated in Figure 4.22. The delay is implemented in the form of a tight loop, to enable us to control the pause with microsecond precision and ultimately to control the request-response rate (a sketch is given below). As before, the CPU load is measured at the client host.

Figure 4.23 shows the request-response rate against the client's CPU load for Fast and Gigabit Ethernet, using the minimum and maximum Ethernet frame lengths on 400 MHz processors. Notice that, in each case, the maximum request-response rate corresponds to the minimum pause at the server. The figure shows that for a given request-response rate the client's CPU load is almost the same for Fast and Gigabit Ethernet. This shows that there is little dependence of the CPU load on the link technology.

The work done in [28] has shown that, based purely on the network (that is, with no processing time accounted for), at least 550 processors are required to meet the average ATLAS LVL2 throughput at 75 kHz, otherwise the network becomes unstable. From [4], the combined request rate to the LVL2 ROBs is 6114 kHz. Using these results, the average LVL2 processor request-response rate is 6114 kHz / 550, about 11000 Hz. A worst-case LVL2 ROB request-response rate is also given in [4].
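The varying server-side pause described above can be realised as a calibrated busy-wait. A minimal sketch using gettimeofday is shown below; the thesis does not specify the actual implementation, so this is illustrative only.

    #include <sys/time.h>

    /* Busy-wait for approximately 'pause_us' microseconds. A tight loop is
       used rather than a sleep call so that the pause, and hence the
       request-response rate, can be controlled with microsecond precision. */
    static void pause_busy_wait(long pause_us)
    {
        struct timeval start, now;
        gettimeofday(&start, NULL);
        do {
            gettimeofday(&now, NULL);
        } while ((now.tv_sec - start.tv_sec) * 1000000L +
                 (now.tv_usec - start.tv_usec) < pause_us);
    }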

Figure 4.23: Request-response rate against CPU load for Fast and Gigabit Ethernet on 400 MHz PCs, for minimum (46-byte) and maximum (1460-byte) TCP payloads. OS = Linux.
Figure 4.24: The measured request-response rate against CPU load for various processor speeds (Fast Ethernet, TCP/IP, minimum frame size).

Figure 4.24 shows the maximum request-response rate (minimum Ethernet frame size) against the client's CPU load as measured on four different processor speeds. In each case, the client's and server's CPUs were of the same speed. The Fast Ethernet results are presented here. Figure 4.25 shows the extrapolation of this to 100% CPU usage. This shows that we can reach a request-response rate of 11 kHz to 12 kHz using a processor of around 300 MHz at 100% saturation.

Figure 4.25: Extrapolation of the minimum-frame data of Figure 4.24 to 100% CPU load.
Figure 4.26: The relationship between the TCP/IP request-response rate and the CPU speed at 100% load, for minimum and maximum frame sizes.

In Figure 4.26, we show the relationship between the CPU speed and the request-response rate for the minimum and the maximum frame sizes. A request-response rate of 12 kHz is reached for the maximum frame size at a processor speed of around 600 MHz at 100% CPU load.

Conclusion for ATLAS

The TCP/IP protocol was designed as a robust general-purpose protocol. It is in wide use today as the protocol of the Internet. It works especially well on the desktop, where the user can tolerate latencies of the order of milliseconds and above. There have been enhancements since its introduction to improve its general performance in a variety of situations, mainly for WAN applications. As we have seen here, a combination of these enhancements can prove disastrous in terms of network performance for an application with a traffic pattern similar to the ATLAS request-response model. If TCP/IP is to be considered for the ATLAS LVL2 trigger network, careful attention must be paid to the implementation details of the protocol version used. Specifically, the Nagle algorithm should be disabled and the delayed acknowledgement enabled, to reduce the CPU overhead caused by data-less acknowledgements.

The implementation of TCP/IP under Linux gives an average collection latency of 2.8 ms for Fast Ethernet and 2.0 ms for Gigabit Ethernet on a 400 MHz CPU for the ATLAS LVL2 system. However, this latency does not include the network latency. We have also seen that a fast packet rate can degrade the communication performance in the presence of a second thread. A solution could be the use of a real-time scheduling system where the delivery of incoming packets is bounded in time.

TCP poses other problems for ATLAS. TCP is a connection-oriented protocol and hence requires any two communicating nodes to have a connection. A connection on a node takes up resources such as buffers and CPU time to manage it. In the LVL2 system, each processor must connect to 1700 ROBs and vice versa. We have not studied the effect on the performance of a node of supporting 1700 connections. Alternatively, connections could be made whenever a message needs to be sent and torn down after the transfer. The problem with this scenario is that each TCP connection set-up takes three TCP segments and a disconnection takes four (since a TCP connection is full-duplex). This will increase the latency per message.

T/TCP, or TCP for Transactions [25], reduces the time for transmitting a message via TCP. Rather than setting up a connection, sending the data and then closing down the connection, this is all done

with three packets: one to initiate the connection, one for the actual message and a final packet to close the connection. Currently, the implementation of TCP/IP under Linux does not support T/TCP.

The performance of the TCP/IP stack depends on its implementation. The normal implementation is in the OS kernel, thus the performance of TCP/IP is tied to the OS performance. In order to truly assess the performance of the TCP/IP protocol, it would be necessary to abstract it from the kernel. The measurements carried out have shown that:

- for today's CPUs, the overhead per request-response is high;
- there is unpredictability in the latency due to the Linux OS;
- the scalability to many connections is a concern.

There are a number of ways in which the performance of TCP/IP may be improved:

- faster CPUs executing the communications code faster;
- SMP systems increasing the amount of processor time dedicated to communications;
- better implementations of the protocol stack and the operating system;
- intelligent NICs offloading some of the protocol processing from the processor.

4.6 MESH

We have shown above that there are issues with TCP/IP which have to be resolved for the ATLAS LVL2 system, where low-latency, high-throughput communication and scheduling are required. Boosten [10] has shown that on a 200 MHz Pentium a Linux system call incurs an overhead of 8 µs, with a larger overhead again for an interrupt. He also measured a context switch time of 12 µs (which includes a system call). These operations are expensive because they involve the CPU switching between user and kernel space. In addition, an interrupt requires the CPU registers to be saved and often requires the invocation of the OS scheduler. MESH was developed to overcome these communications and scheduling overheads. An overview of MESH is given in Appendix B.

MESH comms 1 performance

The comms 1 performance of MESH compared with TCP/IP is shown in Figures 4.27 and 4.28. The figures show both Fast and Gigabit Ethernet on 400 MHz processors. From the end-to-end latency plot shown in Figure 4.27(b), we see that the MESH lines are very stable at low message sizes compared to the TCP/IP plots. This implies that MESH performance does not suffer with

high packet rates. There are two reasons for this. Firstly, rather than using interrupts to detect the arrival of packets, MESH polls at 10 µs intervals (an illustrative sketch of this style of reception is given below). Secondly, MESH is a single process running in user space (see Figure 4.2); it is MESH's own lightweight user-level scheduler that switches between MESH threads, not the OS scheduler.

The line for MESH with Fast Ethernet in Figure 4.27(b) is described by Equation 4.18, and that for MESH with Gigabit Ethernet by Equation 4.19; both are of the same linear form as Equation 4.1. These results, together with those for TCP/IP, are summarised in Table 4.1. The overhead per byte corresponds to the gradient, and the fixed software overhead corresponds to the fixed overhead with the link overhead and the NIC send and receive overheads subtracted. The values of the NIC send and receive overheads are obtained from [10]: they are 10.5 µs for both Fast Ethernet send and receive, and 6.1 µs for Gigabit Ethernet send and 10.5 µs for Gigabit Ethernet receive. The link overhead is the link time for the minimum Ethernet packet, which is 5.76 µs for Fast Ethernet and 0.576 µs for Gigabit Ethernet. The fixed software overhead therefore includes the PCI overhead.

Table 4.1: A comparison of the MESH and TCP/IP overheads per byte (µs/byte) and fixed software overheads (µs) for Fast and Gigabit Ethernet.

Figure 4.28 shows the MESH CPU load and the modelled CPU load. The model is based on Equation 4.14. For Fast Ethernet, C_0 is 2.0% and for Gigabit Ethernet C_0 is 1.0%; the fitted values of C_pp, and how they compare with TCP/IP, are summarised in Table 4.2. From Equation 4.17, the average collection time for MESH is 1.8 ms for Fast Ethernet and 829 µs for Gigabit Ethernet.

Figure 4.30 shows the request-response rate against CPU load for MESH and TCP/IP, measured on the same 400 MHz processors. The MESH lines are labelled MFE for Fast Ethernet and MGE for Gigabit Ethernet. As before, we plot the minimum and maximum frame sizes.
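For illustration, an application-level analogue of the polling style of reception mentioned above is sketched below using a non-blocking socket. MESH itself polls the NIC directly from user space through its own driver, so this is only meant to show the structure of a polling loop, not the MESH API.

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Poll a non-blocking socket for newly arrived packets roughly every
       10 us, doing useful work in between, instead of taking an interrupt
       (or a blocking wake-up) per frame. */
    static void poll_loop(int sock)
    {
        char buf[1514];                       /* one maximum-size Ethernet frame */
        for (;;) {
            ssize_t n;
            while ((n = recv(sock, buf, sizeof(buf), MSG_DONTWAIT)) > 0) {
                /* hand the frame to the consuming thread here */
            }
            if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
                break;                        /* real error: leave the loop */
            usleep(10);                       /* ~10 us of other work or delay, then poll again */
        }
    }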

Table 4.2: A comparison of the MESH and TCP/IP fixed CPU overhead (C_0, %) and CPU overhead per ping-pong (C_pp, %s) for Fast and Gigabit Ethernet.

Figure 4.27: Comms 1 under MESH and TCP/IP for Fast and Gigabit Ethernet: (a) throughput, (b) latency. CPU = Pentium 400 MHz; OS = Linux.

Figure 4.28: CPU load for comms 1 under MESH and TCP/IP for Fast and Gigabit Ethernet. CPU = Pentium 400 MHz; OS = Linux.
Figure 4.29: CPU load for comms 1 under MESH, model vs. measurement, for Fast and Gigabit Ethernet. CPU = Pentium 400 MHz; OS = Linux.

We see from the MESH curves that for Gigabit Ethernet we are able to reach the 11-12 k requests-responses per second required of the ATLAS LVL2 trigger processors. For Fast Ethernet we are unable to reach this rate for the maximum frame size, due to the limitations of the link speed. We conclude that MESH has dramatically lower CPU utilisation than TCP/IP and is able to reach the performance required by the ATLAS LVL2 system at very low CPU utilisation (5% or less). From Figure 4.30, it can be seen that there is no message-size-dependent overhead for MESH, since the curves for the minimum and maximum frames overlap. This is because the only copy in the MESH communications happens between the NIC and main memory.

Scalability in MESH

In order to test the scalability of MESH with CPU speed, we looked at the fixed overhead, as we did for TCP/IP. For both Fast and Gigabit Ethernet, we noticed that the fixed overhead hardly changed with the CPU speed. This leads us to believe that with MESH we are approaching the limit of the NICs. We therefore looked at the maximum CPU load as a function of CPU speed; this is plotted in Figure 4.31. It shows that for both Fast and Gigabit Ethernet the maximum CPU load decreases as the CPU speed increases. This plot clearly requires more points before any other concrete conclusions can be drawn.

Figure 4.30: Fast and Gigabit Ethernet CPU load for MESH and TCP/IP for the minimum and maximum frame lengths. CPU = Pentium 400 MHz; OS = Linux. T = TCP/IP, M = MESH, FE = Fast Ethernet, GE = Gigabit Ethernet, minf = minimum frame, maxf = maximum frame.
Figure 4.31: The change in the maximum MESH CPU load for comms 1 with CPU speed, for Fast and Gigabit Ethernet. OS = Linux.

4.7 Conclusion

We have shown in this chapter the performance of the Linux implementations (in two kernel versions) of the TCP/IP stack. We have looked at ways to get the best performance for ATLAS with TCP/IP. We have produced models describing the performance of TCP/IP as a function of the message size for both Fast and Gigabit Ethernet; the models describe the non-pipelined throughput, the end-to-end latency and the CPU load. We concluded that this implementation of TCP/IP is inadequate for the ATLAS LVL2 system on today's processors.

MESH (MEssaging and ScHeduling system) has high I/O performance, obtained by using optimised drivers and scheduling. It has better performance than TCP/IP both in terms of end-to-end latency and CPU load. We have presented the MESH performance and compared it to TCP/IP. MESH, unlike TCP/IP, does not have guaranteed packet delivery, flow control or packet fragmentation; it uses the flow control provided by the lower-layer protocol.

We have also seen that the implementation of the protocol, and indeed the OS, play an important role. Linux uses an interrupt-based system for communication. The consequence is that the processor can be live-locked, that is, the system becomes unresponsive as it spends a considerable amount of time servicing interrupts caused by the incoming packets. An example reported by Poltrack [30] achieved a maximum throughput of 647 Mbit/s but at the cost of a CPU usage of 81.5%. As a result of this potential problem, Gigabit Ethernet NICs have interrupt mitigation to limit the number of interrupts generated per second. MESH is not affected by this problem: it uses a polling system, so the application programmer can decide how often to poll for newly arrived packets. Given the speed of Gigabit Ethernet and future networking technologies, integrating the NIC more tightly with the CPU/memory subsystem would remove the bottleneck and allow full link utilisation. There are signs that such systems are being developed.

Further work

MESH is able to deliver on the performance, but by itself it does not guarantee delivery of packets or provide flow control; it relies on the flow control of the underlying layers. Further work is required to make MESH more suitable for ATLAS.

We have looked in detail at one implementation of TCP/IP. The performance is tied very strongly to the operating system, the way incoming packets are detected and the scheduling between processes.

Further work needs to be done on TCP/IP performance on other operating systems before generalisations about its performance can be made.

Chapter 5
Ethernet Network topologies and possible enhancements for ATLAS

5.1 Introduction

Two factors affecting network performance are the network topology and size. The ATLAS trigger/DAQ system requires a network supporting over a thousand nodes. The current IEEE standards for Ethernet do not inhibit the building of large, scalable Ethernet networks; however, ensuring scalability means using higher link speeds, and the topologies are limited to a tree-like topology (see Figure 5.1). In the tree topology, network performance is limited by the performance of the root switch, and redundant links are not supported except in the form of Ethernet trunked links. In this chapter, we look at the strategies that can be used should the standard Ethernet topology prove inadequate for the ATLAS trigger/DAQ system. The discussions in this chapter are based on experiences with real Ethernet switches and the IEEE Ethernet standards. We look at the standard Ethernet topology, then identify the features of Ethernet switches inhibiting the construction of non-standard scalable Ethernet networks and present possible solutions.

5.2 Scalable networks with standard Ethernet

Using Ethernet equipment conforming to the standards, and with the standard configuration, what sort of scalable network can we build? By scalable, we mean that we are not limited by the throughput of any link as the network size increases. For example, given a collection of, say, eight-port switches, what sort of scalable networks can be built? In order to connect these switches in a scalable way, half the links will be dedicated to the end nodes and half to connections between switches (see Figure 5.2). Furthermore, the Ethernet standard limits us to the tree architecture, in which the scalability of the network depends on the performance of the root switch, represented by switch A in Figure 5.1. For an eight-port switch, if we are not to be limited by any link (avoiding higher link speeds and using trunking), then no matter how we connect multiples of these switches, we can only connect eight nodes, as illustrated in Figure 5.2. Thus the scalability depends on the speed of the fastest links.

A potential problem with Ethernet arises when nodes are connected via at least two switches with flow control enabled. It is possible for communication between two nodes to block a shared link used by multiple nodes. This is illustrated in Figure 5.3, where node b1 is unable to receive at the rate at which node a1 is sending.

Figure 5.1: A tree-like topology (switches A-I). Note that a node can be attached to any of the switches.
Figure 5.2: Connecting the same type of Ethernet switch without being limited by a single link does not increase the number of ports: (a) a single 8-port switch; (b) two 8-port switches connected in a scalable way using trunked links; (c) three 8-port switches connected in a scalable way using trunked links.

Figure 5.3: A link blocked due to a slow receiver (nodes a1-a5 on switch A, nodes b1-b5 on switch B).

This eventually leads to the buffers of both switch A and switch B being filled. Subsequently, packets are aged and thrown away, and the useful bandwidth of the link between switches A and B is dramatically reduced. This has a detrimental effect on all communications from other nodes on switch A to nodes on switch B. This kind of problem is normally solved by a higher-layer protocol such as TCP, where the receiving end advertises the maximum number of bytes it is prepared to receive. The flow-control strategy adopted by the vendor is important here. On one switch we tested, we were able to bring the system to a halt as described above. Two other switches we tested threw packets away if the destination node could not receive them fast enough, thereby avoiding the blocked-link situation. This is of even more concern for the event filter, where the traffic pattern is more like streaming than request-response.

An example of an ATLAS LVL2 trigger/DAQ network architecture based on current Ethernet technology is shown in Figure 5.4. It has a central Gigabit Ethernet switch of 224 ports. The processors and ROBs are connected to the central switch via other switches, which we term concentrating switches. There are five concentrating switches for the processors, each of which has 128 Fast Ethernet links, connecting around 550 processors in total; each also has 12 trunked Gigabit Ethernet links to the central switch. Connecting around 1700 ROBs to the central switch are 14 concentrating switches, each with eight trunked Gigabit Ethernet links to the central switch.
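As a back-of-the-envelope check on the processor side of this arrangement: each concentrating switch aggregates 128 Fast Ethernet links, i.e. at most 128 x 100 Mbit/s = 12.8 Gbit/s of downstream traffic, onto a trunk of 12 x 1 Gbit/s = 12 Gbit/s towards the central switch, so the uplink trunk is close to matching the worst-case demand of the attached links.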

Figure 5.4: The Ethernet-based ATLAS trigger/DAQ network: ~1700 Read Out Buffers (ROBs) and ~550 processors connected through concentrating switches (128 FE + 12 GE ports each on the processor side, 8 trunked GE uplinks on the ROB side) to a 224-port Gigabit Ethernet central switch.

5.3 Constructing arbitrary network architectures with Ethernet

We would like to build a network suited to the needs of the ATLAS trigger/DAQ system. In this section we identify the constraints on building arbitrary network topologies with off-the-shelf Ethernet switches and present solutions to these constraints.

5.3.1 The Spanning Tree Algorithm

Two of the main goals of the spanning tree algorithm (specified in the Bridge standard document IEEE 802.1D) are to automatically detect and shut down loops within the network, and to provide redundant paths which can be activated upon failure. Loops in Ethernet networks are undesirable because they can allow frames to keep going around the network. Figure 5.5 shows a three-stage Clos network made up of six switches, with an example of such a loop; trunks of two links are used to connect the switches. The bold lines in the figure identify a loop in the network. A broadcast frame can circulate the network endlessly via a loop such as the one indicated by the bold lines, because at each switch the frame is forwarded to all ports. Of course, in the Clos network shown, more loops can be identified which will forward the frame in the same way. As a result, a looping broadcast frame could effectively consume all available bandwidth. Furthermore, removing loops from the network ensures that there is only a single path between

any two nodes in the network. The effect is that frame sequence integrity is ensured, that is, frames are received in the correct order. Since the spanning tree algorithm can dynamically route around faulty links, loops are purposely built into Ethernet networks. In an Ethernet switch, trunked or aggregate links coexist with the spanning tree algorithm: trunked links are recognised as a single link and not as multiple paths between two nodes.

The spanning tree algorithm works by sending a hello packet to all ports at a fixed, user-specified interval. These packets are ignored by the nodes but are acknowledged by other switches and bridges in the network. The switches can thus organise themselves logically in a hierarchical order and disable links to avoid loops. The user-specified intervals are of the order of a few seconds; as a result, the spanning tree packets have no noticeable impact on the performance of the switch. If arbitrary topologies are to be possible, we must ensure that loops are permitted to exist in the network.

Modification 1: Multiple arbitrary topologies are possible with the spanning tree disabled.

We were able to disable the spanning tree algorithm in the switches we tested. In some switches, switching off the spanning tree is an option in the management software. In one case, however, we needed the assistance of the switch manufacturer because it required direct access to the switch software; the actual process was easy, as the spanning tree algorithm was implemented as a single module in the software. Potentially, loops are only a danger when frames are broadcast. Frames which are broadcast are those with broadcast and multicast addresses in the destination field and those with destination addresses which are not recognised by the switch. In the latter case, the user has no control unless static entries or very long ageing times are put into the forwarding table. We were unable to find a simple way to make the spanning tree act only on broadcast frames, so it would have to be switched off.

5.3.2 Learning and the Forwarding table

An Ethernet switch forwarding table, also known as the address table, content addressable memory (CAM) table or filtering database, holds the MAC addresses of the nodes connected to the switch and the switch port to which each node is connected. When the switch is first powered on, the CAM table is empty. The CAM is updated automatically by a process called Learning, documented in the Bridge standard (IEEE 802.1D). The learning process works by examining the source MAC address of each incoming frame

Figure 5.5: An example of one loop path in the Clos network, shown by the bold lines. Each square represents a switch.

and associating that source MAC address with the switch port on which the frame arrived. The CAM is then updated accordingly, and all future frames destined for that MAC address will be sent to the associated port. Unknown addresses are broadcast; that is, broadcasts happen in the network even if the hosts do not send broadcast frames. A port/MAC address association, or CAM table entry, is removed after a specified time has elapsed, called the Ageing Time (typically 300 s; the minimum value is 10 s and the maximum 1 000 000 s). This allows for the possibility of machines being removed from the network.

In order to have arbitrary topologies, the ability to switch off learning and ageing and to enter permanent entries into the CAM table is a desirable feature, because broadcasting of unknown MAC addresses will effectively be disabled. These facilities are provided by most Fast Ethernet switches. Permanent CAM table entries come under the heading of Static Entries in the Bridge standard document (IEEE 802.1D). This means complete control over the labelling of the network is possible. Static entries effectively disable learning. All the switches we tested support this.

Modification 2: Learning must be disabled and static entries put into the switch forwarding table.
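To make the learning, forwarding and ageing behaviour described above concrete, a minimal sketch is given below. It is purely illustrative; no particular switch implements its CAM this way, and the table size is an arbitrary choice.

    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define TABLE_SIZE  1024
    #define AGEING_TIME 300               /* seconds, the typical default */

    struct cam_entry {
        uint8_t mac[6];
        int     port;
        time_t  last_seen;
        int     valid;
    };

    static struct cam_entry cam[TABLE_SIZE];

    /* Learning: associate the frame's source MAC with the port it arrived on. */
    static void learn(const uint8_t *src_mac, int in_port)
    {
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (!cam[i].valid || memcmp(cam[i].mac, src_mac, 6) == 0) {
                memcpy(cam[i].mac, src_mac, 6);
                cam[i].port = in_port;
                cam[i].last_seen = time(NULL);
                cam[i].valid = 1;
                return;
            }
        }
    }

    /* Forwarding: look up the destination MAC; -1 means "unknown, flood to
       all ports". Entries older than the ageing time are treated as expired. */
    static int lookup(const uint8_t *dst_mac)
    {
        time_t now = time(NULL);
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (cam[i].valid && memcmp(cam[i].mac, dst_mac, 6) == 0) {
                if (now - cam[i].last_seen > AGEING_TIME)
                    return -1;            /* aged out */
                return cam[i].port;
            }
        }
        return -1;                        /* unknown address: broadcast */
    }

With static entries and learning disabled, the table would simply be filled once by the network manager and never aged or updated, which is exactly what Modification 2 requires.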

5.3.3 Broadcast and Multicast for arbitrary networks

Once the spanning tree algorithm has been disabled, there is no longer an automatic mechanism to shut off loops in the network. This means that if loops are present in the network, broadcast frames can loop round the network indefinitely, as described in Section 5.3.1. If the network is well labelled, i.e. the addresses of the attached nodes have been statically entered into the forwarding table (see Section 5.3.2) and the multicast groups have been set up, then loops in the network should not be a problem. If static entries were not put into the forwarding table, the forwarding tables would be continuously updated by the learning process, as the broadcast frames could arrive at the same switch on different ports. In Figure 5.5, if A sends a broadcast, then F, for instance, will receive the same broadcast on at least two separate ports.

In order to stop frames looping around the network indefinitely, we must construct a broadcast tree. That is, certain ports in the network should be stopped from sending broadcast frames. In this way, we can still send broadcasts which reach all nodes, but broadcast frames will not loop around the network, as certain switch ports are prevented from forwarding them. Figure 5.6 shows a broadcast tree for a simple three-stage Clos network. Only switches A and C have multiple broadcast ports; each node in the network can still receive the broadcast frame.

Modification 3: A broadcast tree must be constructed in order to stop broadcast frames looping around the network.

Figure 5.6: Broadcast as handled by a modified Clos network. In this simple network, only stations A and C are allowed to broadcast in order to avoid looping frames. The bold lines show the direction of the broadcast frame.

A broadcast tree with the Turboswitch 2000

We were able to create a broadcast tree in one of the switches we tested (the Netwiz Turboswitch 2000) using a proprietary subnetting feature. This feature allowed us to restrict broadcasts to a specified number of ports by defining those ports to be in the same subnet. Some ports were specified to be in more than one subnet, thus allowing broadcasts to be sent between subnets.

Figure 5.7: A broadcast tree using VLANs in a Clos network. In this network, only switch ports belonging to VLAN b are allowed to forward broadcasts. The bold lines show the direction of the broadcast frame.

Unicasts were not restricted by the subnetting. This can be used to form a broadcast tree as shown in Figure 5.7. We successfully set up a broadcast tree and tested that it worked.

Broadcast trees with VLANs

VLANs can be used to create a broadcast tree. In Section 3.5.2, we saw that one way in which VLANs work is by limiting the flow of traffic to groups of switch ports belonging to the same VLANs. We also know that ports can belong to multiple VLANs. Figure 5.7 shows how a broadcast tree using VLANs may look. We define two types of ports: those belonging to VLAN u (for unicast) and those belonging to VLANs u and b (for unicast and broadcast). Ports belonging only to VLAN u can send and receive unicast frames, and can receive but not forward broadcast packets out of the switch. Ports belonging to both VLAN u and VLAN b can send and receive both unicast and broadcast frames. Since all ports belong to the unicast VLAN, all the links in the network can be used to transfer unicast frames; only ports belonging to both VLANs u and b can be used to forward broadcast frames.

For this system to work, all the nodes connected to the network should be attached to switch ports set to both VLANs u and b. The nodes are also required to tag broadcast packets with VLAN b and unicast packets with VLAN u when transmitting. Broadcast packets tagged with VLAN u would still loop around the network, and unicast packets tagged with b would be limited to the links selected for broadcasts. It is easy to see how this method could be extended to provide multiple broadcast trees, or used as a different way of setting up multicast groups. We have not tested this method of setting up a broadcast tree, but there is nothing in the standards to prevent it from being done.

5.3.4 Path Redundancy

One of the advantages of network topologies such as the Clos is the multiple paths or routes available for a packet going from one point in the network to another. Ethernet networks allow only a single path between any two nodes in the network. As mentioned in Section 5.3.1, switching off the spanning tree means we no longer have use of the adaptive routing around faulty links. Furthermore, the use of static CAM entries (for our well-labelled network) to direct the path of frames means that there is always only a single route between any two nodes.

A way to obtain path redundancy is as follows. Multiple unicast addresses can be assigned to each NIC in the same way that multicast addresses are assigned to NICs. We have tried this on our two NICs, the Intel EtherExpress Pro 100 Fast Ethernet NIC and the Alteon ACENIC Gigabit Ethernet NIC; on both we were able to assign multiple unicast addresses and receive packets so addressed. A range of Ethernet addresses can be assigned to each node, and the switch forwarding tables can be set up such that for each address belonging to a node, a different path is taken through the network. (This method can be taken to an extreme by setting each Ethernet NIC into promiscuous mode, in which all packets arriving at the NIC are received and sent to the higher layers irrespective of the destination address on the packet.) A sender can be modified such that, when sending to a particular node, it uses the range of Ethernet addresses which correspond to that node. To ensure fair arbitration, the address selection could be done in a round-robin fashion, for instance. The disadvantage of this method is that, although multiple paths exist, there is still no way to automatically reroute packets when a path becomes disabled.
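A minimal sketch of the round-robin address selection just described is given below; the structure layout, the number of paths and the field names are illustrative assumptions.

    #include <stdint.h>

    /* Each destination node is reachable through several unicast MAC
       addresses, each programmed into the switch forwarding tables to follow
       a different path. Picking the next address in round-robin order spreads
       traffic over the available paths. */
    struct destination {
        uint8_t  mac[4][6];   /* e.g. four addresses, one per path (placeholder) */
        int      n_paths;
        unsigned next;
    };

    static const uint8_t *next_dst_mac(struct destination *d)
    {
        const uint8_t *mac = d->mac[d->next % d->n_paths];
        d->next++;
        return mac;
    }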

Trunking or link aggregation can coexist with the architecture described above to provide link redundancy and adaptive routing around faulty links; the use of trunking also means that the bandwidth of the links between switches is increased. An alternative is to develop higher-layer functionality in the nodes to detect and transmit around dead links by using a different destination address. A further disadvantage is the loss of frame sequence integrity. For the ATLAS trigger/DAQ system, if messages could be restricted to fit into one frame, this should not be a problem. If not, a field could be encoded into the type field of each frame, or inside the frame itself, which can then be used to restore frame sequence integrity.

Modification 4: Assigning multiple unicast addresses to a NIC can allow a greater choice of topologies in an Ethernet network. This can be done by setting up multiple unicast addresses as if they were multicast addresses on the NIC.

Modification 5: Multiple NICs can be plugged into a host. This has the same advantages as Modification 4, but with added redundancy in the hardware. It raises the cost of each node and implies an increased number of network ports. Multiple NICs in a single node is standard practice for connecting a single node to multiple networks, so the method will work. We also note that this has been tried in the Beowulf project.

5.4 Outlook

Since the beginning of this work, the Ethernet standards have been evolving. In this section, we mention briefly some of the recent, upcoming and other features being considered which should further increase the flexibility of Ethernet.

Extensions to IEEE 802.1D: In the latest extensions to the IEEE 802.1D standard, provisions have been put in place to allow a node to dynamically register and de-register from a multicast group (GMRP, GARP Multicast Registration Protocol) and a VLAN group (GVRP, GARP VLAN Registration Protocol) by use of a protocol called GARP (Generic Attribute Registration Protocol). This makes network configuration of these attributes easier and removes the need for manual intervention.

Multiple spanning trees per VLAN (IEEE 802.1s): The IEEE 802.1Q standard specifies explicitly that it does not exclude a future extension to include VLANs over multiple spanning trees. This would be a significant extension, since it would mean the ability to use multiple links between switches (without the use of trunking), greatly increasing the architectural

flexibility of Ethernet.

Faster spanning tree reconfiguration (IEEE 802.1w): In light of today's networking speeds, the spanning tree protocol reconfiguration time of many seconds can be rather slow. The aim of the IEEE 802.1w standard is to provide a spanning tree protocol that can reconfigure within 100 ms.

10 Gigabit Ethernet (IEEE 802.3ae): It has already been mentioned in Chapter 3 that development of 10 Gigabit per second Ethernet is well under way. Products are expected on the market shortly, and even higher Ethernet speeds are also being discussed.

5.5 Conclusions

The standards currently adhered to by Ethernet switches allow the building of large networks with a tree-like topology, but the ability to build networks of other topologies is attractive because we can build in redundancy and scalability. Looking at real Ethernet switches and the Ethernet standards, we have pointed out how arbitrary networks, such as the Clos, can be constructed from Ethernet switches. Our studies have shown that, in order for Ethernet switches to be used to build arbitrary networks, the following are required:

1. Provide permanent CAM tables and disable learning: The ability to set up permanent CAM/filtering table entries and disable learning is already provided for in the standards (IEEE 802.1D) and is therefore incorporated in all Ethernet switches.

2. Switching off the spanning tree: In Ethernet networks, it is not possible to have multiple paths or loops between any two nodes. The spanning tree algorithm is used to find and remove loops by disabling certain ports. To build arbitrary networks, loops must be allowed, therefore the spanning tree should be disabled. This can be done on most switches we have seen.

3. Constructing a broadcast tree: Once the spanning tree has been removed, multiple paths can exist in the network. A broadcast tree must be constructed to allow broadcasts to reach all nodes in the network and to avoid the prospect of broadcast frames looping around the network indefinitely. This is more difficult to do, since there is no provision for it in the Ethernet standards. We have shown two methods of doing this here.

As a consequence of switching off the spanning tree algorithm and having fixed routing tables, we can no longer take advantage of the redundant paths of a particular network topology. Frames

cannot be rerouted if a link goes down. To resolve this, trunking can be used to provide multiple links between switches; this provides redundant links and also increases the bandwidth between switches. Another way to obtain link redundancy and increased bandwidth between end points in the network is to assign a range of unicast Ethernet addresses to each node. Each NIC can be set to promiscuous mode, or the assigned addresses can be registered in the same way as multicast addresses. Multiple paths can then be programmed into the network for reaching the same destination by use of the extra addresses given to each node. This, however, does not give automatic re-routing around broken links.

Overall, the changes required to enable arbitrary networks to be built with commodity Ethernet switches are non-trivial and time-consuming, and are likely to require a unique approach for each switch. The ATLAS trigger/DAQ system has over a thousand nodes; manually entering the addresses of over a thousand nodes into the forwarding table of each switch in the system, and getting it correct, would be extremely tedious and time-consuming. For ATLAS, we would like to adhere as much as possible to the Ethernet standards. These standards are evolving, and what we have highlighted here are features advantageous to ATLAS and to high-performance parallel computing. New features such as trunking may mean we can stick to the Ethernet standards if a large enough root/central switch can be bought.


Chapter 6
The Ethernet testbed measurement software and clock synchronisation

6.1 Introduction

The architecture, performance and workings of Ethernet switches need to be understood in order to make informed decisions on the construction of the ATLAS LVL2 trigger network. Given the large market, it is clear that there will be variations in products from different vendors: vendors make trade-offs between the performance and the cost of their products. We must understand the performance and architectural differences and their implications for the ATLAS LVL2 trigger network. To understand and characterise Ethernet switches, we have to perform measurements under controlled conditions. In this chapter, we present the Ethernet testbed switch characterisation software (ETB) used to characterise Ethernet switches and networks. The results produced with ETB also serve as input to the modelling of the ATLAS second level trigger network. In order to build the model, we required a detailed characterisation of Ethernet switches and end nodes; assessment of the end-node performance with MESH and TCP/IP has been presented in Chapter 4. The basic idea of ETB is to characterise switches by generating and transmitting traffic streams through the switch, then examining the received streams. In Chapter 7 we present the approach we use in modelling Ethernet switches. This chapter covers the measurements required to characterise a switch or network.

6.2 Goals

With ETB, we want to measure the transmit and receive throughputs, the lost frame rate and the packet end-to-end latency, all as a function of the traffic load and type. Thus we need to be able to control the rate at which packets are transmitted. We also need to be able to distinguish the received streams when a node receives from more than one transmitter at the same time; this allows us to observe how different streams are affected by the network architecture and how this changes when priorities are used. In achieving these aims, we considered:

- The cost. Compared with the cost of buying a commercial tester, this method must be as cost-effective as possible (see Sections 6.9 and 6.10 for a comparative cost analysis).
- The availability of a large number of PCs at no extra cost. We had access to the LVL2 testbed PCs (see Figure 6.1) being used to test the ATLAS framework software; up to 32 machines were available to us.

- The available NICs: the Intel Fast Ethernet NIC [36] and the Alteon ACENIC Gigabit Ethernet NIC [37].
- The required accuracy of the switch model, 5 to 10%.
- The available OS, protocol and I/O software and our knowledge of their performance (TCP/IP, raw Ethernet and MESH).

Figure 6.1: The PCs used for the LVL2 testbed at CERN.

An example measurement

In Figure 6.2, we show the results of an example measurement with ETB. For this measurement, six Fast Ethernet nodes streamed fixed-size messages at fixed intervals (systematically) through a switch to a single Gigabit Ethernet node. The switch was the BATM Titan T4. Figure 6.2 shows the accepted throughput against the end-to-end latency for 46, 512, 1024 and 1500 byte messages. Because the traffic is systematic, the latency remains constant until a saturation point is reached, at which the latency rises sharply. In this case, the saturation point is due to the limitation of the receiving Gigabit Ethernet node and not of the switch. Using ETB with varying traffic patterns and node configurations, we can discover various details about the switch (see Section 7.6).

Figure 6.2: Performance obtained from streaming 6 FE nodes to a single Gigabit Ethernet node through the BATM Titan T4 (flow control on, systematic traffic, message sizes 46 to 1500 bytes). The limits of the receiving Gigabit node are reached before the limits of the switch.

The power of ETB comes from the ability to synchronise multiple PCs and use them as multiple traffic generators and consumers with varying traffic patterns. This enables us to test a multitude of traffic pattern and load scenarios on a single switch unit or on a network.

6.3 Design decisions

Testbed setup

The setup of the testbed is shown in Figure 6.3. This setup was decided upon based on the available hardware and software mentioned above, and it has a number of features. Each node surrounding the switch under test has two NICs, A and B, because this implementation of MESH cannot share a NIC with other protocols. NIC A (running at 10 Mbit/s) is used to connect the nodes to the local CERN network using the network file system (NFS), to allow a user to control the configuration, to start and stop measurements, and to collect the results from the nodes. NIC B (running at 100 or 1000 Mbit/s) is used for the testing. Only test traffic was allowed on NIC B, so that other traffic did not interfere with the measurement traffic. The advantage of this setup is that it gives the user remote access to all nodes from a control terminal connected to the CERN network, and NFS provides a convenient way to share data between the nodes in the testbed via the 10 Mbit/s connections.

During measurements, traffic on NIC A was kept to a minimum and the nodes were dedicated to running the measurements, so that maximum CPU time was given to the measurements.

Figure 6.3: The setup of the Ethernet measurement testbed. Each PC has two NICs: NIC A runs at 10 Mbit/s and connects to the control hub (TCP/IP), while NIC B runs at 100 Mbit/s or 1 Gbit/s and connects to the Fast Ethernet or Gigabit switch under test.

The Traffic Generator program
A traffic generator program was initially developed for the Macramé project [29] [31]. This program was taken and adapted to produce outputs suitable for ETB. The traffic generator is a stand-alone program. It generates binary files of traffic patterns, one for each transmitter in the system. A pattern file contains a list of packet descriptors, each of which has a destination node number, a source node number, a message size in bytes and an inter-packet time in microseconds. Via the input to the traffic generator program, the user is able to specify the data size, the destinations and the traffic patterns. The types of traffic pattern of interest are: Systematic, where the inter-packet time is constant; and Random, where the inter-packet time is exponentially distributed about a mean. In both cases, the destination address can be constant or uniformly-random distributed. The system is flexible enough to support other traffic patterns.
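To make the descriptor format concrete, the following C sketch writes a pattern file of the kind described above, with either systematic or exponentially distributed inter-packet times and uniformly random destinations. The structure fields, their widths and the binary layout are illustrative assumptions, not the actual Macramé/ETB file format.

    /* Sketch of a traffic-pattern descriptor and pattern generation. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    struct packet_descriptor {
        uint16_t dest_node;        /* destination node number                  */
        uint16_t src_node;         /* source node number                       */
        uint32_t size_bytes;       /* message size in bytes                    */
        uint32_t inter_packet_us;  /* gap to the next packet, in microseconds  */
    };

    /* mean_gap_us is used directly for systematic traffic; for random traffic
       it is the mean of an exponential distribution. n_nodes must be > 0.     */
    void generate_pattern(FILE *out, int n_packets, uint16_t src, uint16_t n_nodes,
                          uint32_t size_bytes, double mean_gap_us, int random_traffic)
    {
        for (int i = 0; i < n_packets; i++) {
            struct packet_descriptor d;
            d.src_node   = src;
            d.dest_node  = (uint16_t)(rand() % n_nodes);      /* uniform destinations */
            d.size_bytes = size_bytes;
            if (random_traffic) {
                double u = (rand() + 1.0) / (RAND_MAX + 2.0); /* uniform in (0,1)     */
                d.inter_packet_us = (uint32_t)(-mean_gap_us * log(u));
            } else {
                d.inter_packet_us = (uint32_t)mean_gap_us;    /* systematic traffic   */
            }
            fwrite(&d, sizeof d, 1, out);
        }
    }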

The usage of MESH in the ETB software
The MESH libraries [11] [12] [13] were developed for ATLAS to optimise the communication and the scheduling of the available network connections and processing power in a node. MESH was chosen as the platform for the ETB software rather than TCP/IP or raw Ethernet for a number of reasons. Firstly, TCP recovers from packet loss transparently, which makes any attempt to measure network packet loss difficult. Secondly, MESH has superior performance compared to TCP/IP (see Section 4.6) and UDP/IP (see below), which enables us to generate higher-rate traffic. Thirdly, raw Ethernet and TCP/IP use the OS scheduler, whereas MESH uses its own light-weight scheduler, which has been shown to provide better resolution when timing packet arrivals (see Section 4.6).

Streaming performance of MESH
As a demonstration of the superior performance of MESH, we performed a streaming measurement. The streaming measurement is aimed at finding the maximum rate at which messages can be sent out. The setup is the same as that illustrated in Figure 4.3. The client sets up a message of a fixed data size and streams the same message repeatedly, as fast as possible, to the server. The server continuously reads the messages sent by the client. The server records the time it started receiving the messages, the number of messages received and the time it stopped receiving. From this, the receive rate of the server can be calculated. The results are shown in Figure 6.4. The throughput obtained is different from that of Chapter 4 because here we take advantage of the pipelining effect, that is, the maximum throughput achievable when multiple packets are in flight at the same time. Figure 6.4(a) shows the achieved throughput against message size for UDP and MESH. Figure 6.4(b) shows the achieved frame rate against message size. We use UDP rather than TCP/IP because TCP/IP is a streaming protocol and hence multiple sends of small messages may get concatenated into a single big packet; for testing networks and switches, this is not a desired effect. We believe TCP cannot achieve higher throughput than UDP, since UDP is a simpler protocol with less overhead. For Fast Ethernet, we are able to reach the theoretical rate at 100 bytes for MESH and 250 bytes for UDP. For Gigabit Ethernet, we are not able to reach the theoretical throughput for either UDP or MESH. We reach a maximum throughput of around 71 MBytes/s for MESH and 45 MBytes/s for UDP. We believe that this limitation is due to the PCI bus and the receive part of the NIC driver. Our measurements have shown that we can send at a higher rate than we can receive. According

to Pike [30], the PCI bus request, bus grant and arbitration can reduce the packet transfer bandwidth by as much as 30% of the total bus bandwidth. This implies a maximum throughput of around 92 MBytes/s for a 33 MHz, 32-bit PCI bus (70% of the nominal 132 MBytes/s). For the curve representing streaming over Gigabit Ethernet, the odd shape for message sizes between 500 and 1000 bytes for both MESH and UDP can be attributed to the current version of the Alteon NIC firmware; the previous version gave a smoother shape. The results show that MESH performs much better than UDP for both Fast and Gigabit Ethernet.

Figure 6.4: Unidirectional streaming for Fast and Gigabit Ethernet using MESH and UDP, compared with the theoretical rates. (a) Throughput against message size; (b) frame rate against message size. CPU = 400 MHz; OS = Linux.

MESH ports
When using MESH, Ethernet frames are transmitted and received on MESH ports. These are the MESH end-point communication entities. A MESH port is unique to each node and multiple ports can be set up per node. Local ports belong to the local node; all other ports are remote. This is similar to the idea that a network address can be local or remote to any node and that each node can have multiple addresses. An Ethernet frame has the first four bytes of the user data area reserved for MESH port numbers: two bytes for the source port and two bytes for the destination port. The frame size is encoded in the type/length field of the Ethernet frame. In ETB, each node has two local ports: a port for measurements and a port for synchronisation. This allows the two different types of traffic to be distinguished. Furthermore, when measurements

are taking place, no other traffic is sent on the control interface. This helps in obtaining more accurate results for the switch/network under test. For the ETB software, we are more interested in performance than in minimising CPU usage. As such, we do not make a single PC serve as more than one traffic source/sink, in order to achieve maximum performance per node. A detailed evaluation of MESH, including the CPU loading, the limitations of the driver, NIC and PCI bus, as well as its use in the prototype LVL2 trigger, is presented in [11] [12] and [13].

6.4 Synchronising PC clocks

Method
A requirement in ETB was to be able to make unidirectional end-to-end latency measurements. On a single node, the local PC clock was accurate enough to do measurements locally. However, to do latency measurements across more than one node, we require a global clock or a system by which the clocks on the nodes can be synchronised. We are looking for an accuracy in the region of a few microseconds in synchronising the PC clocks. The Simple Network Time Protocol (SNTP) is a common way to synchronise clocks, but it only gives 1 to 50 ms granularity [32]. There are a number of other possible methods. An effective one would be to remove the crystal from the PCs and connect the PCs to a single crystal or clock generator. For this to work, all the PCs would have to be identical (same motherboard and CPU), whereas we would like to be able to use different PCs. Another possible method would be to build hardware which can be plugged into the PCs and used to distribute a global clock. This would require extra cabling to connect the PCs and some hardware effort. In our chosen method, we send Ethernet frames through the network/switch under test to synchronise the clocks. The idea is illustrated in Figure 6.5. One of the PCs' local clocks is used as the master or global clock; this PC is known as the global node. All other PCs' local clocks are monitor clocks; these PCs are known as monitor nodes. The global node selects a monitor node to synchronise with. It sends a packet to the monitor node, noting its start time $t_{gs}$. The monitor node returns the packet immediately, stamping it with its current local time $t_{mc}$. When the global node receives the returned packet, it notes its end time $t_{ge}$. The global node can calculate

$t_{gc} = t_{gs} + (t_{ge} - t_{gs})/2$, the global time at the midpoint of the round trip. In an ideal situation (perfectly synchronised clocks and a symmetric path), $t_{gc} = t_{mc}$. By repeating this many times, we build a table of $t_{gc}$ and $t_{mc}$ values which are used for a straight-line fit of the form:

$t_{gc} = \alpha + \beta\, t_{mc}$   (6.1)

where $\alpha$ is the offset or skew and $\beta$ is the gradient or drift. From Equation 6.1, all future and past values of the monitor's local clock can be converted to the global time.

Figure 6.5: How we synchronise clocks on PCs. The global node records $t_{gs}$ and $t_{ge}$; the monitor node stamps the returned packet with $t_{mc}$; $t_{gc}$ is the estimated global time of that stamp.

In practice, the PCs' clock values are 64 bits long. To avoid wrap-arounds during the calculations, the initial values of the clocks are taken and all subsequent values are offsets from these initial values. Therefore Equation 6.1 becomes:

$t_{gc} - t_{g0} = \alpha + \beta\,(t_{mc} - t_{m0})$   (6.2)

where $t_{g0}$ is the initial time of the global node and $t_{m0}$ is the initial time of the monitor node. We make two important assumptions here. The first is that the time taken for a single frame to traverse the network/switch between two ports is constant in the case where no other frames are present in the network/switch. This is a valid assumption since, between any two ports, frames will always take the same path and, in the absence of other frames, no queueing occurs to slow down the frame. Also, most switches do their switching in hardware and therefore have a fixed latency. Our second assumption is that the clocks have a linear relationship, that is, that the drift is constant. How true this assumption is, and the consequences for the synchronisation, are examined below.
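The straight-line fit of Equation 6.1 is an ordinary least-squares fit over the table of $(t_{mc}, t_{gc})$ pairs, with $t_{gc}$ estimated as the midpoint of each round trip. The C sketch below illustrates the calculation; the structure and function names are illustrative rather than the actual ETB code, and in practice only ping-pongs whose round-trip time is close to the minimum are used, as described in the next subsection.

    /* Sketch: derive the offset (alpha) and drift (beta) of a monitor clock with
       respect to the global clock from a set of ping-pong samples.              */
    #include <stddef.h>

    struct sync_sample {
        double t_gs;   /* global clock when the request left the global node */
        double t_ge;   /* global clock when the reply came back              */
        double t_mc;   /* monitor clock stamped into the reply               */
    };

    /* Least-squares fit of t_gc = alpha + beta * t_mc, with t_gc estimated as
       the midpoint of the round trip. Requires n >= 2 samples.                */
    void fit_clock(const struct sync_sample *s, size_t n,
                   double *alpha, double *beta)
    {
        double N = (double)n;
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (size_t i = 0; i < n; i++) {
            double x = s[i].t_mc;                                 /* monitor time  */
            double y = s[i].t_gs + 0.5 * (s[i].t_ge - s[i].t_gs); /* t_gc estimate */
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        *beta  = (N * sxy - sx * sy) / (N * sxx - sx * sx);
        *alpha = (sy - *beta * sx) / N;
    }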

Factors affecting synchronisation accuracy
There are several factors affecting the accuracy we can get from the clock synchronisation using the method outlined. They are as follows.
The system used for reading the PC's local clock is a MESH function call. This reads a special 64-bit register containing the number of ticks since the PC was turned on. The number of ticks is incremented every CPU clock cycle. For a 200 MHz Pentium II, a clock tick happens 200 million times every second, that is, one tick every five nanoseconds. The maximum value a 64-bit register can hold is $2^{64} - 1 \approx 1.8 \times 10^{19}$; at one tick every 5 ns this corresponds to about $9.2 \times 10^{10}$ seconds, or roughly 2900 years. We will not wrap around this counter during the lifetime of the tests. On our slowest PCs (200 MHz), doing one million calls to read this register takes 0.0753 seconds, implying 75.3 ns per call. Therefore about 15 clock cycles are needed to read the clock.
The specification of the PC crystal is 100 parts per million over a temperature range of -50 to 100 degrees Celsius. We do not expect to run the clocks at the extremities of the temperature range. Our main concern is how much the clocks drift with respect to each other.
The synchronisation procedure described above has other problems. Firstly, we cannot be sure that the process is as symmetric as illustrated in Figure 6.5. Secondly, the round trip time (RTT), the difference between $t_{gs}$ and $t_{ge}$, is variable because, although we are using the MESH environment, we can still be affected by the scheduling of the Linux operating system. Figure 6.6(a) shows the normalised histogram of half the RTT between a global node and two monitor nodes across a switch. The bin size is 1 µs. Thus Figure 6.6(a) represents the probability of half the RTT taking a given value. The distribution is similar for the two nodes. The measurements ran for about a minute; with a majority of the ping-pongs taking about 100 microseconds, there are therefore of the order of $6 \times 10^{5}$ entries. Plots of the same form are observed for directly connected nodes, but shifted along the latency axis by an amount (15 µs) corresponding to the switch store-and-forward delay for 100 bytes. This shows that the switch only adds a fixed latency during the synchronisation. From Figure 6.6(b), we note that most of the results lie in the range of 49 to 55 µs. However, a few are recorded with as much as 200 µs latency. These high latencies can be attributed to the OS. In order to combat this, we decided to accept only the RTT values which lie within 5% of the minimum. We repeated the measurement over a period of 7200 seconds, plotting the mean RTT and standard deviation after every minute of ping-pong; these are plotted in Figures 6.7 and 6.8. The mean changes by 0.7 µs and the standard deviation is always less than 0.25 µs.
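On the Pentium-class PCs used here, the 64-bit tick register mentioned above is the processor's time-stamp counter. A minimal sketch of reading it and converting ticks to microseconds is given below; the actual MESH function call is not reproduced in this chapter, so this is an illustrative x86/GCC version only.

    /* Illustrative read of the x86 time-stamp counter and conversion to
       microseconds for a known CPU frequency (e.g. 200 MHz -> cpu_mhz = 200). */
    #include <stdint.h>

    static inline uint64_t read_cycle_counter(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static inline double ticks_to_us(uint64_t ticks, double cpu_mhz)
    {
        return (double)ticks / cpu_mhz;   /* ticks per microsecond = cpu_mhz */
    }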

Figure 6.6: A normalised histogram of half the round trip time through a switch (global clock synchronisation through the Netwiz switch, same module, all ping-pongs, synchronising with nodes 1 and 2). (a) The probability of half the round trip time taking a given value; (b) magnification of (a).

Figure 6.7: The mean value of the round trip time (synchronising with 100-byte frames through the Netwiz switch, same slot).

Figure 6.8: The standard deviation of the round trip time (synchronising with 100-byte frames through the Netwiz switch, same slot).

Clock drift and skew
The drift of the monitor clock with respect to the global clock is given by the gradient, $\beta$, of the straight-line fit between the two clocks. It shows how much one clock varies in time with respect to another. Figure 6.9(a) shows the drift of one of the monitor clocks against the global clock over a period of 7200 seconds. As the drift is small, $\beta - 1$ has been plotted to highlight the difference. A similar graph is obtained for the second monitor node, showing a similar change in $\beta$ from 0 to 1500 seconds and thereafter a fairly constant value. The initial change is due primarily to the processors heating up during the synchronisation process (see the discussion of temperature dependence below). The skew or intercept ($\alpha$ from Equation 6.2) is an indication of how well synchronised the clocks started off. It does not show any dependence on the warm-up process, see Figure 6.9(b).

Figure 6.9: How the gradient of the monitor nodes deviates from 1 (synchronising with 100-byte frames through switch A, same slot). (a) The deviation of the gradient from 1 for node 1; (b) the variation in the intercept.

In order to find the effective error in the synchronisation for a given point in Figure 6.9, we calculate the predicted time at each point and subtract the real time. Figure 6.10 shows the error plotted at various times in the synchronisation process. Each curve represents the error in the predicted time as a function of the time after synchronisation, for a given warm-up time. The deviations are greatest for the smallest warm-up times. For a warm-up period greater than 1500 seconds, an error of ±2.5 µs can be obtained up to 400 seconds after synchronisation. The maximum deviation we found was 1.23 µs per minute for a warm-up time of 1500 seconds or greater. Thus, to stay within our goal of 5 µs accuracy, the measurements must not last more than 4 minutes after the synchronisation phase.
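For completeness, converting a monitor-node timestamp to global time with the fitted parameters of Equation 6.2 amounts to the following one-line calculation (the function and argument names are illustrative):

    /* t_m0 and t_g0 are the initial monitor and global times; alpha and beta are
       the intercept and gradient of the straight-line fit (Equation 6.2). The
       residual error grows with the time elapsed since the fit, which is why the
       measurement-duration limits quoted in the text apply. */
    double monitor_to_global(double t_m, double t_m0, double t_g0,
                             double alpha, double beta)
    {
        return t_g0 + alpha + beta * (t_m - t_m0);
    }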

Figure 6.10: The error in the predicted time for different warm-up times (107 s, 725 s, 1562 s, 2635 s, 3931 s and 5003 s).

Figure 6.11: The effect on the drift when the PC side panels are removed.

Temperature dependency of the synchronisation
We knew that temperature has a big effect on the measurements. In order to get some idea of the effect, we performed the synchronisation and, after 6000 seconds, removed the PC side panels of both the monitor and the global node while the synchronisation continued. The resulting effect on the drift is shown in Figure 6.11. This shows that the clock crystals are very sensitive to temperature changes. Up to this point, a complete synchronisation phase was completed for one node before being started on the next. This will not scale very well as the number of nodes in the system increases. To avoid this, the calibration should be done with all nodes concurrently, such that all the PCs are continuously working. Synchronising all nodes concurrently means the global node does a ping-pong with each of the PCs in turn, in a round-robin fashion. The result is that, apart from the global node, all PCs in the system do the same amount of work throughout the synchronisation process, thus maintaining a stable temperature and hence a stable drift. We also improved the synchronisation process by accepting the points which had the widest separation in time; this gives a greater accuracy when calculating the line of best fit. The accuracy of the synchronisation process is quantified below.
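The point selection just described, keeping only ping-pongs whose RTT lies within 5% of the minimum and then favouring points that are widely separated in time, can be sketched as follows. The function is illustrative; taking evenly spaced candidates is simply one way of obtaining a wide separation in time.

    #include <stddef.h>

    /* Select at most k fit points from n samples given their round-trip times
       (rtt, in time order): keep only samples within 5% of the minimum RTT and,
       of those, k evenly spread indices. 'chosen' must have room for n entries;
       k must be >= 2. Returns the number of indices written to 'chosen'.       */
    size_t select_fit_points(const double *rtt, size_t n, size_t k, size_t *chosen)
    {
        double rtt_min = rtt[0];
        for (size_t i = 1; i < n; i++)
            if (rtt[i] < rtt_min) rtt_min = rtt[i];

        size_t n_cand = 0;                        /* candidates within 5% of minimum */
        for (size_t i = 0; i < n; i++)
            if (rtt[i] <= 1.05 * rtt_min)
                chosen[n_cand++] = i;             /* 'chosen' reused as scratch      */

        if (n_cand <= k) return n_cand;
        for (size_t j = 0; j < k; j++)            /* widest spread: evenly spaced    */
            chosen[j] = chosen[j * (n_cand - 1) / (k - 1)];
        return k;
    }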

Integrating clock synchronisation and measurements
Our chosen method of integrating the synchronisation system and the measurements is illustrated in Figure 6.12, which shows a sketch of the clock drift over time. The synchronisation is started and the system warms to a stable state after 1500 seconds, when the first measurements can be made. The system returns to the synchronisation state after a measurement, and subsequent measurements can be made without the need to wait another 1500 seconds. There is always at least one synchronisation between measurements, so that changes in conditions between measurements are taken into account for the measurements that follow.

Figure 6.12: The measurement technique: the synchronisation starts, the first measurement is taken after the warm-up, and measurements and synchronisations then alternate up to the nth measurement.

Conditions for best synchronisation
We would like to know the conditions (values of the ETB variables) under which we can achieve the best synchronisation. These variables are the length of time spent doing ping-pongs (the synchronisation time), the number of ping-pongs per second and the number of points used to derive the straight-line fit.

Varying the synchronisation time
In this test, we varied the time to synchronise, while keeping the number of ping-pongs per second constant and the number of points (selected to derive the line of best fit) fixed at 20. The aim is to find the minimum time to synchronise. The results are shown in Figures 6.13 and 6.14. In both figures, we rejected the first 1500 seconds of synchronisation. Figure 6.13 shows the

standard deviation in the clock drift against the synchronisation time. Figure 6.14 shows the error in the predicted time over five-minute intervals; that is, the error is calculated by taking the synchronisation result, predicting the time 5 minutes into the future and then comparing the prediction with the actual time. The plots are of the form expected because, as the synchronisation time increases, the number of ping-pongs increases. This increases the chances of obtaining ping-pongs with the minimum RTT and also increases the spread between points. Both help in achieving a more accurate line of best fit. From the figures, a synchronisation time of ten seconds is the optimum.

Figure 6.13: Standard deviation of the gradient against the synchronisation time (two nodes).

Figure 6.14: Error in the predicted time over 5-minute intervals against the synchronisation time (two nodes).

Varying the number of ping-pongs per second
Fixing the synchronisation time at 10 seconds and keeping the number of points (selected to derive the line of best fit) at 20, we vary the number of ping-pongs per second by pausing, or sleeping, between ping-pongs. This is equivalent to increasing the number of nodes in the system. Figure 6.15 shows the standard deviation in the drift and Figure 6.16 shows the error in the predicted time. We see from the graphs that there is little influence from the sleep time until we reach a sleep time of 100,000 µs, when there is a clear rise in both the standard deviation of the drift and the error. This enables us to work out how many nodes we can have in the network before inaccuracies in the synchronisation start to appear. The formula for the maximum number of nodes possible in the system is thus:

maximum number of nodes = 100 000 µs / (maximum ping-pong RTT in the network for a 100-byte frame)   (6.3)

where 100 000 µs is the sleep time at which the synchronisation accuracy starts to degrade. For our switch, the longest ping-pong RTT for a 100-byte message is 180 µs, giving a maximum of about 555 nodes.

Figure 6.15: Standard deviation of the gradient for varying sleep time between ping-pongs (two nodes).

Figure 6.16: Error in the predicted time over 5 minutes for varying sleep time between ping-pongs (two nodes).

Varying the number of points to take for the line of best fit
With the synchronisation time at 10 seconds, we varied the number of points used to derive the line of best fit. The results are plotted against the standard deviation in the drift in Figure 6.17. If the number of points accepted is too small, then the calculated line of best fit is not accurate. If the number of points accepted is too large, then we accept points in the tail of Figure 6.6(a) and the calculated line of best fit is again not accurate. The acceptable number of points lies between about five and twenty.

Summary of clock accuracy
The accuracies of the synchronisations are summarised in Table 6.1 for Fast and Gigabit Ethernet. For Fast Ethernet, a maximum deviation of 1.23 µs per minute is achieved for a warm-up time greater than 25 minutes. For Gigabit Ethernet, the maximum deviation is 2.9 µs per minute under the same conditions. Thus, to stay within our required accuracy of 5 µs, the measurements must not last more than 103 seconds after the synchronisation phase.
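As a cross-check, the measurement-time limits quoted here and in the previous subsection follow directly from dividing the required accuracy by the measured drift rates:

$t_{\max}^{\mathrm{FE}} = \frac{5\ \mu\mathrm{s}}{1.23\ \mu\mathrm{s/min}} \approx 4.1\ \mathrm{min}, \qquad t_{\max}^{\mathrm{GE}} = \frac{5\ \mu\mathrm{s}}{2.9\ \mu\mathrm{s/min}} \approx 1.7\ \mathrm{min} \approx 103\ \mathrm{s}.$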

Figure 6.17: The range of the number of points that can be used to make the best line fit (standard deviation of the gradient against the number of points accepted).

Table 6.1: The deviation of the clocks for Fast and Gigabit Ethernet as a function of the warm-up time (2, 15, 25, 45, 65 and 85 minutes), in microseconds per minute.

6.5 Measurements procedure

Configuration files
There are three distinct phases in the ETB program: a synchronisation phase, a traffic-generating phase and a measurement phase, as illustrated in the flow diagram of Figure 6.18. Two configuration files supplied by the user are addresses and configuration. The addresses file contains a list of the Ethernet MAC addresses of the nodes in the testbed. The first node in the address list is used as the global clock, to which all other clock values are translated. The configuration file contains a list of commands which define the configuration of each node. The possible commands, their types and their defaults are explained in Table 6.2.

Table 6.2: The list of commands for the configuration of the ETB nodes (command, type, default, comment):
time spread (integer, 5000 µs) - the number of microseconds to histogram over.
bin size (integer, 1 µs) - the size of each bin in the histograms.
all latency record (on/off, off) - record the latency of each incoming packet and its source in a file called latency0x, where 0x denotes the destination node. Used mainly for debugging/analysis.
total pingpongs (integer) - the maximum number of ping-pongs to do before deriving the global clock. If time tospend pp is reached first, the actual number of ping-pongs done may be less. If set to zero, only throughput measurements are made.
time tospend pp (integer, 10 s) - the maximum time to spend doing ping-pongs before deriving the global clock. The actual time may be less if total pingpongs is reached first.
POINTS REQUIRED FOR BEST FIT (integer, 20) - the number of ping-pongs selected to calculate the global clock formula.
inter pingpong time (integer, 0 µs) - the time to pause between ping-pongs. Used mainly for debugging/analysis.
link negotiation (on/off, off) - autonegotiation.
intelduplex (full/half, full) - the Intel EtherExpress Pro 100 NIC flow control. Intel only.
alteonflowcontrol (on/off, on) - the Alteon ACENIC flow control. ACENIC only.
alteon macaddr (MAC address, disabled) - override the programmed MAC address. The format is six hex values separated by colons.
alteon rmaxbd (integer, 1) - the number of Ethernet frames to collect on the ACENIC before sending to the higher layers. ACENIC only.
alteon rct (integer, 0 s) - the maximum time to wait when receiving alteon rmaxbd Ethernet frames from the link before sending to the higher layers. ACENIC only.
alteon smaxbd (integer, 1) - the number of Ethernet frames to collect on the ACENIC before transmitting on the link. ACENIC only.
alteon sct (integer, 0 s) - the maximum time to wait when collecting alteon smaxbd frames before transmitting on the link. ACENIC only.
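For illustration, a configuration for a Gigabit node using the Alteon NIC might contain entries such as those below. The exact file syntax is not reproduced in this chapter, so the one-command-per-line layout and the particular values are assumptions.

    time spread 5000
    bin size 1
    all latency record off
    total pingpongs 50000
    time tospend pp 10
    POINTS REQUIRED FOR BEST FIT 20
    link negotiation off
    alteonflowcontrol on
    alteon rmaxbd 1
    alteon sct 0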

Figure 6.18: A flow diagram illustrating the synchronisation, measurement and traffic generation in ETB. The user supplies the addresses and configuration files and starts ETB; the program synchronises continuously, writing an entry to the global clocks file after each synchronisation phase, until a start flag is found. It then loads the measurement ini file and the global clocks, loads or generates the traffic pattern, waits briefly for a common start, transmits the traffic, updates the results every results period and writes the final results before returning to synchronisation. A stop flag ends the program.
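The flow of Figure 6.18 can also be summarised in code form. In the C sketch below all function names are placeholders rather than the actual ETB routines, and the frame header carried in the Ethernet data field (MESH ports, sequence number and timestamp, described later in this section) is shown for orientation.

    #include <stdint.h>

    /* The 20 bytes of ETB control information carried in each Ethernet frame. */
    struct etb_header {
        uint16_t mesh_dst_port;   /* 2 octets */
        uint16_t mesh_src_port;   /* 2 octets */
        uint64_t sequence;        /* 8 octets */
        uint64_t timestamp;       /* 8 octets, local clock ticks */
    };

    /* Placeholder prototypes: the real routines belong to ETB/MESH and are not
       reproduced in this chapter. */
    void start_receive_thread(void);
    void synchronise_clocks(void);
    int  start_flag_present(void);
    int  stop_flag_present(void);
    void load_measurement_ini_and_global_clocks(void);
    void zero_receive_statistics(void);
    void load_traffic_pattern(void);
    void wait_for_common_start_time(void);
    void transmit_traffic(void);
    void write_results(void);

    void etb_main(void)
    {
        start_receive_thread();                     /* receiver runs throughout      */
        for (;;) {
            synchronise_clocks();                   /* append to the global clocks file */
            if (stop_flag_present())
                break;
            if (!start_flag_present())
                continue;
            load_measurement_ini_and_global_clocks();
            zero_receive_statistics();
            load_traffic_pattern();
            wait_for_common_start_time();           /* start time sent by node 0     */
            transmit_traffic();                     /* each frame stamped with a
                                                       sequence number and timestamp */
            write_results();                        /* sn0x.tx / sn0x.rx files       */
        }
    }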

Once started, ETB synchronises continuously until the user supplies the start flag, which initiates the measurements. During the synchronisation process, a file called global clocks file is created; at the end of each synchronisation phase, an entry is added to this file. The maximum time to spend synchronising and the maximum number of ping-pongs to do before producing an entry can be specified by the user in the configuration file. An example of a single entry in the global clocks file, with six 200 MHz PCs in the testbed, is shown in Table 6.3. Node 0's local clock is used as the global clock.

Table 6.3: An example synchronisation result as stored in the global clocks file for six nodes. The columns are: node, intercept $\alpha$ (clock ticks), slope $\beta$, yinitial (clock ticks), xinitial (clock ticks), points used, exec time (seconds), mean (µs), std dev and sync no.

In Table 6.3, the first column (node) is the node number. The second column (intercept) is the intercept of the straight-line fit. The third column (slope) is the slope of the line. The fourth (yinitial) and fifth (xinitial) columns are the initial y and x values, that is, the initial global and local times. The sixth column (points used) is the number of points used in obtaining the best line fit. The seventh column (exec time) is the time after the start of the synchronisation process at which the results were obtained. The eighth (mean) and ninth (std dev) columns are the mean and standard deviation of the points used to produce the line of best fit. The tenth column (sync no) is the entry number since the start of the synchronisation process. In this example, there have been 134 entries since the start of the synchronisation. Once the start flag is given, the last entry in the global clocks file is copied into a file called global clocks; this is used for the measurements that follow. The measurements start with all nodes reading the global clocks file. Next, the user-supplied initialisation file measurement ini is read. This file is a list of five commands: max run time,

127 6.5 Measurements procedure 127 vlan, priority, cfi and extra string. The command and their arguments are explained the Table 6.4. Command Type Default Comments max run time Integer None The length of time to run the measurements for in seconds. The node name is normally set to all in this case. vlan Integer 0 The 8-bit VLAN identifier of the VLAN tag control information field. All packets leaving the node will have this VLAN value. priority Integer 0 The 3-bit user priority of the VLAN tag control information field. All packets leaving the node will have this priority cfi Integer 0 The 1-bit canonical format indicator (CFI) of the VLAN tag control information field. All packets leaving the node will have this CFI value extra string String extra string An extra sting printed with the results to help with the analysis. Table 6.4: The commands for measurement initialisation The transmitter and receiver MESH threads are used for implementing the transmitter and receiver. In making the measurements, we require a steady state. That is, we have to allow enough time such that all transmitting nodes in the system are sending at the requested rate and the target nodes are receiving. The steady state allows any erroneous measurements due to the asynchronous system startup and stopping to be discarded. During the first few seconds and the last few seconds of the measurement time, no measurements are taken. The asynchronous startups may be due to delays in accessing files via NFS (all node access the configuration files and traffic patterns on NFS. They also write their results in the same directory) and to a lesser extent, the use of PCs of differing speeds in the testbed. In performing the actual measurements of transmit and receive throughput, frame rate etc, the results are calculated every results period of three seconds and averaged over the whole measurement period. The transmitter The transmit thread is started after the global clocks have been read. Each node s transmit thread starts by reading the traffic patterns. The global clock node (node 0) sends to all nodes a time when they should all begin transmission. The packets are transmitted according to the traffic pattern file. If the end of the traffic pattern file is reached, then the sequence is started again from the top of the file. Each packet transmitted

128 128 Chapter 6. The Ethernet testbed measurement software and clock synchronisation has the 64-bit sequence number and timestamp entered into the data area of the packet. If we count the MESH control information overhead (source and destination port numbers) of four bytes, then there are 20-bytes of useful data in each packet. This is accounted for in the calculation of the results, but it does not affect the measurements for the minimum packet size since the minimum data field of an Ethernet frame is 46 bytes. Figure 6.19 shows the format of the ETB frame which is encapsulated in the data field of the Ethernet Frame. MESH destination port MESH source port Sequence number Timestamp 2 octets 2 octet 8 octets 8 octets octets Data Figure 6.19: The frame format of ETB software. The time stamped into each packet is the node s local clock time. The receiver reads the time stamp in the received packet and its own local time when the frame was received. It is able to convert both times to the global time using the information in global clocks and calculates the end-to-end latency. Every measurement period (three seconds), the results are calculated and at the end of the measurement, the average of the calculated results are saved to a file called sn0x.tx, where 0x corresponds to the node number of the transmitter. An example of the output of the transmitting thread is shown in Table 6.5. NoOf Frame Tx Node Throughput Frame Run Bytes Run Total Total Extra Nodes Size (MBytes/s) Rate (Bytes/run) Frames Bytes Frames String (Bytes) (frames/s) (frames/run) (Bytes) (frames) test run Table 6.5: An example of the output of an ETB transmitter. The meaning of the various fields in Table 6.5 are;» NoOfNodes: Number of node in the network. Obtained from the number of MAC addresses in the addresses file.» FrameSize: The size of the frames this node is transmitting. If multiple sizes are used, then this is the size of the last frame transmitted.» TxNode: The node number of the transmitter.» Throughput: This is calculated by looking at the number of bytes sent by the transmit thread divided by the measurement period.» FrameRate: The transmit frame rate is the number of frames per second transmitted.

129 6.5 Measurements procedure 129» RunBytes: The number of bytes sent in the measurement period.» RunFrames: The number of frames sent in the measurement period.» TotalBytes: The total number of bytes sent.» ExtraString: The extra string argument in the measurement ini file. At the end of the measurement period, the transmitting thread ends and the nodes return to synchronising until the next start flag. There is always at least one synchronisation process between each measurement cycle. The receiver The receive thread is started after the configuration file is read and before the synchronisation process starts. The received statistics are initialised to zero at the start of every measurement. When a packet comes into the node, the receive thread identifies which source port it came in on. Then the relevant variables and statistic are updated as follows: 1. The number of bytes and frames received in the current results period is updated. 2. The lost frame rate is checked. The sequence number in the frame should increment for each frame received from a particular port. 3. A histogram entry of lost packets is made per send node if there is a packet loss. The results are stored in the files named histogram clos from 0x to 0y where 0x is the transmitting node and 0y the receiving node. Subsequent bins in this histogram corresponds to an increasing number of consecutive losses. If there are no losses, no file is produced. The width of the histogram and its bin size is controlled by the time spread and bin size commands in configuration. 4. The number of received overflows is checked. In MESH, if a receiver is unable to accept packets fast enough from its port, then packets destined for its port are discarded so that other ports do not suffer as a result. Each port has a received overflow telling how many packets destined for that port have been thrown away. This tells us that ETB was unable to keep up with the receive rate. Thus we do not assign these losses to the device/network under test. 5. The time of arrival is noted as soon as the packet is received. The packet s end-to-end latency is calculated based on the source and destination node numbers and the global clocks file. 6. Three histogram files of the form histogram type from 0x to 0y are produced. Where 0x is the source node and 0y is the destination node and type is the type of histogram.

» histogram txip from 0x to 0y: a histogram of the inter-packet times as sent by the transmitter thread, that is, of when the transmitter thread scheduled the packets to be sent. This is achieved by looking at the difference between the timestamps of subsequent packets received from a particular source. It tells us the traffic pattern we actually sent, which can be compared with what was asked to be sent. Only timestamps between subsequent packets where no packet losses occurred are histogrammed, since lost packets would artificially increase the inter-packet times.
» histogram rxip from 0x to 0y: a histogram of the inter-packet times as seen at the receiver. This is achieved by noting the time between arrivals of the incoming packets. It can be compared with the transmitted inter-packet time to observe the effect caused by the switch/network between the nodes. An example of the received inter-packet time histogram, histogram rxip from 0x to 0y, compared to the transmit inter-packet time histogram, histogram txip from 0x to 0y, is shown in Figure 6.20. The conditions for this were two nodes directly connected, one sending frames of 1500 bytes at a fixed inter-packet time of 240 µs. The transmit inter-packet distribution is fairly narrow around the requested time of 240 µs. The receive inter-packet time has a main peak at the requested 240 µs and two smaller peaks, one 10 µs on each side of the main peak. The reason for the smaller peaks is the poll mechanism by which MESH detects the arrival of a packet. The time between polls is 10 µs, therefore if a packet arrives just after a poll, it will be detected 10 µs later. This in turn causes the inter-packet time between this packet and the next to be 10 µs less than it should be.
» histogram meas from 0x to 0y: the histogram of the end-to-end latency. It is produced by recording the end-to-end latency of each packet. An example of this histogram is shown in Figure 6.21. The conditions were as above. The main peak is at 150 µs. As above, the inter-poll time of 10 µs is the reason for the second peak at 160 µs.
The width and bin size for these histograms are controlled by the time spread and bin size commands in configuration.
7. The source node number, the destination node number, the message size and the latency are recorded per packet and stored in files named latency0x, if all latency record is enabled in the configuration file; 0x represents the receiving node number.
Once the relevant variables and statistics are updated, the received packet is discarded.

Figure 6.20: A comparison of the transmit and receive inter-packet time histograms when sending frames of 1500 bytes at 240 µs inter-packet time.

Figure 6.21: A histogram of the end-to-end latency when sending frames of 1500 bytes at 240 µs inter-packet time.

After every results period, the results are calculated. At the end of the measurement period, the calculated results are averaged and stored in files named sn0x.rx, where 0x is the node number. An example of the output of the receiving thread is shown in Table 6.6.

Table 6.6: An example of an ETB receiver output. The columns are: NoOfNodes, FrameSize (bytes), RxNode, TxNodeNr, Throughput (MBytes/s), FrameRate (frames/s), LostFrameRate (frames/s), AverageLatency (µs), TotLostFrames (frames), RxOverflows (frames), TotRecFrames (frames) and ExtraString. This example shows that node 0 was transmitting frames of 250 bytes to node 1; the average end-to-end latency was 9782 µs.

The meanings of the fields in the table are:
» NoOfNodes: the number of nodes in the network, obtained from the number of MAC addresses in the addresses file.
» FrameSize: the size of the frames being received.

132 132 Chapter 6. The Ethernet testbed measurement software and clock synchronisation» RxNode: The node number which is receiving.» TxNodeNr: The send node number.» Throughput:The receive throughput is the number of bytes received divided by the measurement period.» FrameRate: The number of frames per second received. This is calculated by looking at the number of frames received in the measurement period and dividing the number of frames by the time.» LostFrameRate: The number of frames lost divided by the measurement period.» AverageLatency: The average end-to-end latency of the frames received.» TotLostFrames: The total number of frames lost during all measurements.» RxOverflows: The number of frames lost due to insufficient buffer space in the software. This is a feature of the current implementation of MESH.» TotRecFrames: The total number of frames received.» ExtraString: The extra string argument in the measurement ini file. In Table 6.6, the receiving node was node 1 and only node 0 was transmitting. Broadcasts, multicasts and unknown MAC addresses There are a number of MAC addresses to which each node in the testbed will respond. The local MAC unicast address, the broadcast address, ff:ff:ff:ff:ff:ff, and a multicast address, 01:00:00:00:40:3d. There are also other addresses used for testing the switch/network s reaction to unknown Ethernet addresses. A MESH port is set up for each of these addresses. Thus each node can transmit to these ports or receive from them. When transmitting, the local port is always used as the sender. As a result, the receiving node can always identify the sender. At the receiver, no distinction is made in displaying the results as to whether the packet was sent to the local, broadcast, multicast or other port. This choice was made in order to make reading the results easier rather than overwhelming the user. 6.6 Considerations in using ETB We have been able to obtain quite accurate synchronisation of PC clocks. However, the OS can add arbitrary delays in the end-to-end packet latencies due to interrupts and scheduling points. To counter this, the nodes should be as lightly loaded as possible.

133 6.7 Possible improvements 133 To use ETB to do any measurements of switch performance, analysis of the node behaviour when directly connected is necessary. This allows the effects of the nodes to be factorised out from the switch or network. ETB produces the transmitted and receive inter-packet time histograms. When doing simulations, these histograms can be used as the distribution presented to the switch. The histograms take into account the effects of the OS. 6.7 Possible improvements To further improve the synchronisation process, the synchronisation just after the measurement could be combined with the synchronisation results before the measurement to obtain more accurate end-to-end latencies. Synchronisation using different packet sizes has not been done. We do not believe that this would make any difference to the results since a majority of the error is due to the OS scheduling of other processes. Support for TCP/IP could be added to ETB such that tests on Layer 3 switching could be performed. However, the extra processing would cause the performance of ETB to suffer. 6.8 Strengths and limitations of ETB Currently, the price for a PC (400MHz with 128 MBytes RAM, 8 Gigabyte hard disk and an Ethernet card) is $500. The price for an Intel EtherExpress Pro 100 is $80. Thus for an ETB Fast Ethernet port, the cost is $600. The price of the Alteon ACENIC [37] is $1500. Thus the cost of an ETB Gigabit port is $2000. This price is dominated by the cost of the NIC. A clone of the ACENIC, the Netgear GA620 [38] costs $500 and brings the cost of an ETB Gigabit port to $1000. For our tests, the cost was effectively zero since we had access to the PCs used for testing the ATLAS framework software equipped with the necessary Intel Fast Ethernet NICs and six PCs with the Alteon Gigabit NICs. A summary of the possible measurements that can be done with ETB are:» Throughput. Both send and receive throughput can be calculated simultaneously.» Latencies. Histograms of the transmit and receive inter-packet times and End-to-end Latencies can be produced to an accuracy of a few microseconds.» Packet loss. Measurements of the packet loss can be obtained.» Broadcast and multicast frames. We can send and receive broadcast and multicast frames.

» Point to point, point to multi-point, multi-point to point and multi-point to multi-point communications can be performed.
» Oversized packets (up to 4 kbytes for Fast Ethernet and 9 kbytes for Gigabit Ethernet) can be used.
There are a number of limitations, given that the tests are carried out using software in the PCs. They are:
» Saturating a Gigabit link is difficult due to the combination of the PCI bus, the PC memory, the software overhead and the Ethernet NIC. It requires tricks such as the loop-back test or using multiple nodes through a primary switch.
» There is no central global clock, so a way of synchronising the PC clocks has been developed to obtain one-way latencies through the switch [39].
» Our latency measurements include the time the frame spends in the NIC, but this can be factorised out by direct measurements.
» A steady state must be reached before measurements can be taken. Measurements on the initial ramp-up of traffic cannot be obtained.
» We are limited by the number of PCs and Ethernet NICs available for Gigabit and Fast Ethernet. This limits the number of ports we can test simultaneously.
» PCs of different specifications may have an influence on node behaviour.
» The maximum frame rate that can be generated falls short of the theoretical maximum of approximately 148 800 frames/s for Fast Ethernet and approximately 1 488 000 frames/s for Gigabit Ethernet (minimum-size frames).
We make one assumption about the switch under test: with only one user frame transmitted through the switch, the latency suffered by the frame between specific ports is constant. This is necessary to make the clock synchronisation work.

6.9 Commercial testers
There exist test houses, such as Mier, Tolly and the University of New Hampshire Inter-Operability Lab, who test commercial switches. The equipment used by these test houses tends to be specially built testers from companies such as Ixiacom and Netcom. These testers use ASICs to transmit and receive frames at full Gigabit Ethernet line speed.

Most of these testers are intended to support a range of technologies, not just Ethernet. Due to their architecture, they are capable of performing measurements on cross-technology switches. Capabilities which may be found on commercial testers include:
» Stress testing.
» Performance measurements: per-port wire-speed transmit and receive; real-time latency on a packet-by-packet basis; QoS measurement; results displayed in real time; user-definable preamble, addresses and payloads.
» Trouble shooting.
» Illegal frames.
» Tests for Ethernet, ATM, packet over SONET, Frame Relay and Token Ring.
» TCP as well as Ethernet modes.
Not all commercial testers offer all of the above capabilities. An example of these testers is Ixiacom's IXIA 1600. This has a 16-slot chassis which can host 64 Fast Ethernet ports or 32 Gigabit Ethernet ports; 256 chassis can be connected together with a clock accuracy of 40 nanoseconds. One of Netcom's products, the Smartbits 6000, is a six-slot chassis which can host 96 Fast Ethernet ports or 24 Gigabit Ethernet ports. Eight chassis can be connected together to simulate large networks.

6.10 Price Comparison
The capabilities of the commercial testers do not come cheap. For the IXIA 1600, the chassis costs of the order of $8,500, a Fast Ethernet module (four ports) is $8,500 and a Gigabit module (two ports) is $16,000. The 16-slot chassis thus provides a 64-port Fast Ethernet tester at $144,500 ($2,300 per port) or a 32-port Gigabit tester at $265,000 ($8,300 per port). The price of Netcom Systems' Smartbits 6000 tester is $18,200 for the chassis, $30,400 for each Fast Ethernet module (16 ports) and $24,300 for each Gigabit module (four ports). The price per port is thus $2,100 for Fast Ethernet and $7,000 for Gigabit Ethernet. Other Gigabit Ethernet testers include the following: Hewlett-Packard's LAN Internet Advisor, able to test one port in full duplex or two ports in half duplex, costs $50,000. Network Associates' Gigabit Sniffer can also

136 136 Chapter 6. The Ethernet testbed measurement software and clock synchronisation test one port full duplex or 2 ports in half duplex. This costs $38,000. Wandel and Goltermann Technologies sell their Domino Gigabit for $41,000. Two are required to test two ports full duplex. Our PC system is a factor four times less expensive for Fast Ethernet ports and a factor seven times less expensive for Gigabit Ethernet ports. As the PCs can be used for other purposes, it is feasible to borrow them and the hardware cost is simply the cost of the extra NIC, making our Fast Ethernet tester 25 times less expensive and our Gigabit Ethernet tester a factor 14 times less expensive than commercial systems Conclusions The aim of developing an Ethernet testbed (ETB) has been met and at a competitive price. ETB is enabling the investigation of commodity Ethernet switches. It uses a farm of PCs to test switches by sending messages through them and extracting the achieved throughputs, latency distributions, and probability of a message arriving. Such characteristics are required to examine the suitability of Ethernet for the ATLAS LVL2 trigger. To date, eight different Ethernet switches with up to 32 nodes have been tested with ETB. It has been calculated that the system will support up to 166 nodes before deterioration in the results is observed. This limit is due to the synchronisation technique used here. A higher limit (as well as more accurate latency measurements) could be achieved if a more accurate method of synchronisation such as a global clock could be implemented. The ETB is capable of streaming at the full Fast Ethernet link rate. This allows FE switches to be tested under demanding conditions. With Gigabit Ethernet, we can reach 71 MBytes/s unidirectionally out of a potential 125 MBytes/s. Bidirectional streaming proves to be a problem due to the arbitration mechanism of the PCI bus. One stream can cause the PCI bus to lock temporarily into transmitting or receiving causing unfair distribution in the link bandwidth between the transmit and receive threads on each node.

137 Chapter 7 Analysis of testbed measurements 137

7.1 Introduction
Construction of the full-size ATLAS trigger network for performance-testing purposes would be ideal, though impractical and expensive at this early stage. Modelling and simulation are necessary precursors in assessing the performance of the network using Ethernet technology. Modelling will increase confidence that the system will work as predicted before the system components are purchased. Modelling also provides us with a tool by which the system's bottlenecks can be identified and possible alternative networking strategies investigated. Networks consisting of a layered structure of smaller switch units must be studied, since it is unlikely that a single switch with over 2500 ports will be available. Thus, to assess the scalability and performance of such a structure, we evaluate single commodity Ethernet switch units. We model their behaviour with the aim of simulating the whole ATLAS trigger network as an array of switches. This work is the natural step after the Paper Model [4] and provides models of the ATLAS LVL2 system which are technology specific and can simulate the transient behaviour. In what follows, we present a brief description of the architecture of contemporary Ethernet switches, our modelling approach, a description of the switch modelling and a description of the measurement methodology used to characterise Ethernet switches and extract the information necessary for the models to be realised. The modelling is not the work of the author; however, the author was responsible for the understanding and configuration of the switches, performed numerous measurements and analyses, and took a high-profile role in the discussions which allowed the construction, calibration and verification of the models.

7.2 Contemporary Ethernet switch architectures
Figure 7.1 shows simplified representations of multi-port switches. The switch of Figure 7.1(a) has four ports and a switch fabric or backplane. The CPU attached to the switch is used to manage the switch: it runs the SNMP server to allow configuration of features such as VLANs, port priorities and port speeds. Switches which can be so configured are known as managed switches. Switches without CPUs have fixed configurations and are known as unmanaged switches. Most contemporary switches are hierarchical. They have a layered switching structure, as shown in Figure 7.1(b). The switching units can be cascaded to increase the switch port density. The cascading

requires a second level of switching. Switch manufacturers use this architecture to provide module-based switches, where a chassis holds the backplane and CPU units. Modules containing the switch ports can be purchased separately and plugged into the backplane. Customers can therefore plan their networks to allow for growth. These modular and hierarchical switches also allow switching between different speeds: 10, 100 and 1000 Mbit/s Ethernet.

Figure 7.1: The typical architecture of an Ethernet switch. (a) A simple switch architecture: a CPU, a switching fabric and four ports. (b) A cascaded switch architecture: a CPU, a backplane switch and two local switches, each serving four ports.

Operating modes
Switches can operate in two modes. The first is known as store and forward. This means that when a frame comes in on the input port, the whole frame is stored before being switched to its destination port. As a result of this store, the frame suffers a latency proportional to its size before being transmitted to the destination port. The advantages of store and forward are:
» It allows transfer between different media speeds, for example going from 100 Mbit/s to 1000 Mbit/s and vice versa.
» Buffering in the switch helps to improve network performance and is particularly important in dealing with transient congestion. With buffering, frames can be stored when the network is congested; without buffering, they are certainly dropped.
» The switch can discard corrupted frames rather than forwarding them to the destination port.
The second way in which a switch can operate is in cut-through mode. This mode switches a frame to its destination port as soon as the destination address is known, while still receiving from the input port. Thus the frame suffers minimal delay in going through the switch. The cut-through switching mode is less popular because it is not possible to switch between different Ethernet

speeds. It also allows corrupted frames to be forwarded. A mode called interim cut-through exists, whereby at least the first 512 bits are stored before switching; this avoids the forwarding of runt frames (frames which are smaller than the legal minimum Ethernet size). It is possible for a switch to operate in both cut-through and store-and-forward modes. Equally valid is a mode where the frame is buffered first if the destination port is blocked, and otherwise the operation is cut-through.

Switching Fabrics
Contemporary Ethernet switches have one, or a mixture, of several switching fabric architectures. These fabrics are typically the crossbar, the shared buffer and the shared bus. An example of the crossbar fabric is shown in Figure 7.2. In a crossbar fabric, each port can communicate with another port at the same time without affecting the performance of the other ports. Frames switched through a crossbar fabric have to pass through two buffers, the input and the output. If each link of the switching fabric runs at the same rate as the incoming port speed or higher, then the switch should be non-blocking. By non-blocking, we mean that, for all data sizes, pairs of nodes communicating through the switch can reach the full link rate with all ports of the switch active.
A shared buffer switch architecture is shown in Figure 7.3. Typically, the performance of this type of switch is limited by the speed of the shared buffer. An advantage of this type of switch is that frames pass through a single buffer in being forwarded to their destination, thus providing a lower latency through the fabric compared to the other methods. A problem with this architecture is that scalability depends on how fast the memory can be made to run: an n-port non-blocking switch requires the memory to run at 2n times the speed of a single port.
In a shared bus, as shown in Figure 7.4, the buffers are distributed to the ports. All ports communicate via the switching bus. It has the obvious advantage of having memories which can run at a slower speed than that of the shared buffer. The disadvantages are that a frame normally requires two store-and-forward operations from source port to destination port, that the performance depends on the speed of the bus, and that only one pair of ports can be communicating over the bus at any one time. However, if the bus runs at n times the rate of a single port, where n is the number of ports, then the switch should still be non-blocking. The shared buffer architecture tends to be more expensive than the bus-based architecture due to its faster memory requirement.
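To put the non-blocking conditions in concrete terms, for an $n$-port switch with per-port rate $R$ the aggregate bandwidth requirements are:

$B_{\mathrm{memory}} \ge 2\,n\,R \ \ \text{(shared buffer)}, \qquad B_{\mathrm{bus}} \ge n\,R \ \ \text{(shared bus)}.$

For example, an 8-port Fast Ethernet switch ($R$ = 100 Mbit/s) requires a shared memory sustaining at least 1.6 Gbit/s, or a shared bus of at least 800 Mbit/s.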

Figure 7.2: The crossbar switch architecture.

Figure 7.3: The shared buffer switch architecture.

Figure 7.4: The shared bus switch architecture.

Buffering

As we have seen, buffers may be shared between all ports or distributed. In general, the more buffers a frame has to go through to get from its input to its destination port, the greater the latency. For a store and forward operation, a shared buffer would usually add one store and forward latency to a frame while distributed buffers would normally add at least two. Buffers help to increase throughput and utilisation.

There are three types of buffering: input, output and central. The shared buffer switch architecture of Figure 7.3 is an example of central buffering. Input buffering allows for access to the switch fabric and allows head-of-line (HOL) blocking to be resolved. Output buffering matches the switch fabric link speed with the output port's line speed. Managing the buffer queues allows quality of service (QoS) and congestion control to be implemented. The architecture of a real switch is presented in Appendix C.

7.3 Modelling approach

The approach followed is illustrated in Figure 7.5. The first stage was to select a switch. The type of Ethernet switch selected was a hierarchical store and forward switch; the hierarchical structure simply means it is built in a cascaded, modular fashion with a chassis, as described in Section 7.2. This type of switch was chosen because its store and forward nature allows the cascading of switches of different speeds to form large networks, a prerequisite for the ATLAS LVL2 system, and because it happens to be the most popular design for contemporary Ethernet switches.

Next, we obtained as much information on the specification of the switch as possible and then constructed a detailed model. Unfortunately, specifications are not always accurate, and may be misleading, incomplete or simply unavailable, so measurements are also necessary to characterise the switch. Results from the detailed model are compared to the measurements in various configurations to ensure that the switch has been accurately modelled. If the model is not satisfactory, refinements are made until it is.

One cannot always obtain the depth of information about a switch needed to construct a detailed model. Constructing an accurate detailed model is also time consuming and, due to the resulting detail, slow to run. We therefore moved to a parameterised model. Detailed modelling of the switch was not repeated. Analysis of the detailed model revealed critical parameters.

These critical parameters were used to simplify the model of the switch and create a vendor-independent parameterised model. The modelling of other switches of the same class and type is then done by obtaining the parameters of that switch and substituting them into the model. Being a simplified model, one cannot expect an exact match between the model and the measured results; we aimed for an accuracy of between 5 and 10% of the measurement. The parameterised model of the switch can be used to model larger systems up to the full-scale ATLAS trigger/DAQ system, where models of other components can be added and the performance of the full system examined.

Figure 7.5: The interaction between modelling and measurement activity.

7.4 Switch modelling

Introduction

We based our detailed model on the Turboswitch 2000 from Netwiz; a description of it is given in Appendix C. A network simulator called OPNET [42] was selected as the modelling tool. OPNET is a discrete event simulation tool specifically for simulating networks. It has implementations of various link layer protocols, including Ethernet; these included nodes, links, MACs and switches. These implementations were generic and unrealistic, and latency was incurred only in the links; they modelled ideal systems. Even so, the environment was useful because it gave us a basic framework, allowing us to focus on modelling the parameterised Ethernet switch. At the time of writing, there is no support for the latest IEEE standards such as flow control, trunking and VLANs; it is possible that these will be added in the future. The level of detail provided by OPNET makes modelling large networks slow and time consuming. At a later stage, the model was ported to Ptolemy [43], a more general modelling tool. Ptolemy is faster but has fewer features. It is also the modelling tool adopted by other modelling efforts within ATLAS. The two-tier approach to modelling also provided a way to cross-check the models during development.

The parameterised model

There are three objectives for the parameterised switch modelling:

1. Produce a flexible model which can accommodate future changes and developments of the IEEE standards.

2. Produce a simplified model which executes faster than a model with many details.

3. Produce a model which can be easily modified to simulate switches from different vendors.

These objectives facilitate the modelling of larger networks with tens of switches and thousands of nodes. They also imply that we can have a tool to model devices from different vendors by simply altering key parameters.

A detailed model was constructed based on the description given in Appendix C. Measurements on the real switch were compared with the simulation results of the detailed model. Once we were satisfied that the detailed model sufficiently represented the real switch, we began parameterising the model. The aim was to find out what variables and characteristics defined the working and performance of the switch.

Figure 7.6: The parameterised model: intra-module communication. P1 = input buffer length (#frames); P2 = output buffer length (#frames); P5 = maximum intra-module throughput (MBytes/s); P8 = intra-module transfer bandwidth (MBytes/s); P10 = intra-module fixed overhead (not shown).

Figure 7.7: The parameterised model: inter-module communication. P1 = input buffer length (#frames); P2 = output buffer length (#frames); P3 = maximum throughput to the backplane (MBytes/s); P4 = maximum throughput from the backplane (MBytes/s); P6 = maximum backplane throughput (MBytes/s); P7 = inter-module transfer bandwidth (MBytes/s); P9 = inter-module fixed overhead (not shown).

The parameterised model is based on the modular structure shown in Figures 7.6 and 7.7. The performance-defining features of the switch were identified as the parameters listed below; a full description of these parameters is given in Appendix D.

1. Parameter P1: the length of the input buffer in the module, in frames.

2. Parameter P2: the length of the output buffer in the module, in frames.

3. Parameter P3: the maximum throughput for traffic passing from the module to the backplane in inter-module transfers, in MBytes/s.

4. Parameter P4: the maximum throughput for traffic from the backplane to the module in inter-module transfers, in MBytes/s.

5. Parameter P5: the maximum throughput for intra-module traffic, in MBytes/s.

6. Parameter P6: the maximum throughput of the backplane, in MBytes/s.

7. Parameter P7: the bandwidth required for a single frame transfer in inter-module communication, in MBytes/s.

8. Parameter P8: the bandwidth required for a single frame transfer in intra-module communication, in MBytes/s.

9. Parameter P9: the fixed overhead in frame latency introduced by the switch for an inter-module transfer, in microseconds.

10. Parameter P10: the fixed overhead in frame latency introduced by the switch for an intra-module transfer, in microseconds.

Principles of operation of the parameterised model

The operation of the parameterised model is based on calculations using parameters representing buffering and transfer resources in the switch. When a frame arrives at the switch, a check is made to see whether there are enough resources to buffer it, that is, whether the current count of frames in the input buffer is below the parameter P1. If the check is negative the frame is dropped; there is no flow control in the current implementation. Once the frame is buffered in the input buffer, the current count of buffered frames in the source module is increased and the routing decision is made. Depending on whether it is an intra- or inter-module transfer, the corresponding parameter, P10 or P9, is used to model the fixed overhead time for taking the routing decision. Currently there are four types of transfer: inter-module unicast, inter-module multicast, intra-module unicast and intra-module multicast (broadcast is implemented in the same way as multicast). The type of transfer defines which resources will be necessary to start the transfer.

In the case of unicasts, the resources for a single frame transfer from the input buffer of the source module to the output buffer of the destination module will be necessary; in the case of multicasts, resources for multiple transfers between and inside the modules will be necessary. The frame transfer is seen as a request to provide the bandwidth needed to commence the transfer: for inter-module transfers the requested bandwidth is represented by the parameter P7 and for intra-module transfers it is represented by the parameter P8. Frames currently being transferred occupy some part of the throughput represented by parameters P3, P4 and P6 for inter-module transfers and P5 for intra-module transfers. The time for which they occupy a resource is known as the occupancy time. Together with the evaluation of the transfer resources, another check is made to verify that there is enough buffering capacity in the output buffer of the destination module. If the available throughput is larger than or equal to the requested bandwidth and buffering is available, the frame transfer can start. Newly inserted frames reduce the available throughput by a fraction corresponding to the parameter P7 or P8, depending on whether they are inter- or intra-module transfers, and the current count of buffered frames in the output buffer is incremented.

Once the resources have been granted, the occupancy time is calculated as the frame size divided by P7 or P8; it is used to evaluate how much throughput is available at any point in time. If the throughput requested by a frame is greater than that currently available, the frame waits until the necessary resources become available (when another frame's transfer finishes). If there are several frames waiting for resources, it is up to the buffer manager to decide which frame will be transferred next; the buffer manager may implement different policies when taking this decision, such as choosing the frame that has waited longest or the highest-priority frame. When the frame arrives at the output buffer of the destination module it frees the allocated transfer and buffering resources in the input buffer of the source module. It is then up to the output buffer manager to decide which frame from the output buffer will be sent out next on the Ethernet; similarly to the input buffer manager, the output buffer manager can implement different policies when making its decision. When the frame finally leaves the switch via the MAC, the current count of buffered frames in the output buffer is decremented.

The allocation of resources for multicast and broadcast might be different from that for a single frame transfer. The policy for handling multicast and broadcast is strongly bound to the particular switch and we have not found any generalisation there. Currently the model creates a copy of the multicast (broadcast) frame for each remote module housing at least one destination port.
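The following Python sketch illustrates this resource accounting for a unicast inter-module transfer only. It is a simplified illustration rather than the model itself: the parameters are those of Section 7.4.2, the numerical values in the usage example are invented, and flow control, multicast copying, resource release and the buffer manager policies are all omitted.

    class Module:
        """Buffering state of one switch module (parameters P1 and P2)."""
        def __init__(self, p1, p2):
            self.p1 = p1              # input buffer length limit (frames)
            self.p2 = p2              # output buffer length limit (frames)
            self.input_frames = 0     # current count of frames in the input buffer
            self.output_frames = 0    # current count of frames in the output buffer

    class ParameterisedSwitch:
        """Resource accounting for a unicast inter-module transfer only."""
        def __init__(self, p3, p4, p6, p7, p9):
            self.p3 = p3              # max module-to-backplane throughput (MBytes/s)
            self.p4 = p4              # max backplane-to-module throughput (MBytes/s)
            self.p6 = p6              # max backplane throughput (MBytes/s)
            self.p7 = p7              # bandwidth reserved per inter-module frame (MBytes/s)
            self.p9 = p9              # fixed overhead of the routing decision (us)
            self.used_to_bp = 0.0     # throughput currently occupied towards the backplane
            self.used_from_bp = 0.0   # throughput currently occupied from the backplane
            self.used_bp = 0.0        # throughput currently occupied on the backplane

        def try_transfer(self, frame_bytes, src, dst):
            # Drop the frame if the source input buffer is full (no flow control).
            if src.input_frames >= src.p1:
                return 'dropped'
            src.input_frames += 1
            # The transfer starts only if the reserved bandwidth P7 fits into the
            # remaining throughput of every resource on the path and the
            # destination output buffer has room; otherwise the frame waits.
            if (self.used_to_bp + self.p7 > self.p3 or
                    self.used_from_bp + self.p7 > self.p4 or
                    self.used_bp + self.p7 > self.p6 or
                    dst.output_frames >= dst.p2):
                return 'queued'
            self.used_to_bp += self.p7
            self.used_from_bp += self.p7
            self.used_bp += self.p7
            dst.output_frames += 1
            # Fixed routing overhead plus the occupancy time (frame size / P7);
            # freeing the resources when the transfer completes is omitted here.
            occupancy_us = frame_bytes / self.p7   # bytes / (MBytes/s) = microseconds
            return self.p9 + occupancy_us

    src, dst = Module(p1=64, p2=64), Module(p1=64, p2=64)
    switch = ParameterisedSwitch(p3=45.0, p4=45.0, p6=600.0, p7=12.5, p9=10.0)
    print(switch.try_transfer(1500, src, dst))   # latency contribution in microseconds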

The performance of the parameterised model, compared to that of the real switch on which it is based, is given in Section 8.2.

7.5 Conclusion

By analysing the results of a set of communication measurements we are able to identify the likely internal structure of an Ethernet switch. With help from the vendor we constructed a detailed model of the switch. This helped us to identify the key parameters contributing to the frame latency and throughput when traversing the switch, and thus to develop a parameterised model. The parameterised model applies to the class of switches characterised as modular, where the switch is composed of modules communicating via a backplane, and of the store-and-forward type (with two stages of frame buffering: in the source and in the destination modules). Further work is being done on the parameterised model; features such as trunking, priorities and VLANs are being added. A validation of the parameterised model is presented in Section 8.2.

7.6 Characterising Ethernet switches and measuring model parameters

In this section, we present the measurement methodology used to assess the performance of commodity, off-the-shelf Ethernet switches and to extract the information needed to realise the models. The limitations of the ETB software (described in Chapter 6) were kept in mind in designing these measurements. For the measurements described below, measurements of directly connected nodes are also made to obtain the overheads introduced by the PC (PCI, NIC, operating system and measurement software) and the performance limits. These can then be factorised out of the measurements with the switches.

7.6.1 End-to-End Latency (Comms 1)

The comms 1 or ping-pong measurement procedure is as described in Section 4.3. It is made by sending a frame of a fixed size from one node to another and getting the receiving node to return the frame. The sending node can calculate the time it took to send and receive the message; half of this time is assumed to be the end-to-end latency. This is repeated for a range of message sizes to obtain a plot of message size against latency.

Examples of the expected results, once the PC overhead has been removed, are shown in Figure 7.8. There are two lines, a solid and a dotted line, both showing latency as a function of the message size: as the message size grows, it takes longer to store the frame before forwarding it.

Figure 7.8: An example plot of the comms 1 measurement, showing the zero-length latency, the minimum latency, a slope indicating a single store and forward and a slope indicating multiple store and forwards.

The PC overhead, i.e. the direct connection overhead, should be subtracted to leave the switch port-to-port latency. Since it is possible to have more than one level of switching in a switch, this should be repeated to discover whether different pairs of ports have different levels of switching between them. The solid and the dotted lines in Figure 7.8 reflect the single and multiple store and forward performance. The ping-pong measurement tells us the following.

» The end-to-end latency gives the switch port-to-port latency. It tells us if the switch is operating in cut-through or store and forward mode. If the switch is in store and forward mode, it tells us the number of levels of switching, that is whether there are one or more store and forwards. The number of store and forwards and the switch layout will show which combinations of ports switch locally (intra-module) and which switch via the backplane (inter-module).

» It also tells us the maximum throughput achievable (from the gradient of the message size versus latency plot) without taking advantage of the pipelining effect. For the parameterised model, the reciprocal of the gradient of the lines in Figure 7.8, in MBytes/s, gives the bandwidth reserved for switching a single frame. This corresponds to parameter P8 of Section 7.4.2 if it is intra-module switching and parameter P7 if it is inter-module switching.

The minimum message-size-dependent overhead for a store and forward switch should be 0.08 µs/byte (12.5 MBytes/s) for Fast Ethernet and 0.008 µs/byte (125 MBytes/s) for Gigabit Ethernet.

» We can obtain the fixed, message-size-independent overhead. This is the zero-length latency shown in Figure 7.8 and is interpreted as the processing overhead required to make the routing decision. It corresponds to parameter P10 of Section 7.4.2 for intra-module switching and parameter P9 for inter-module switching. It is also an indication of the minimum amount of memory a switch needs: for example, a switch of N ports requires at least N × minimum latency × link speed bytes of memory.

Examples of these measurements are given in Figure 7.9. This shows the switch port-to-port latency (the results of the direct connection have been subtracted) for four Gigabit Ethernet switches: the Cisco 4003, the Cisco 4912G, the Cisco 6509 and the Xylan OmniSwitch SR9. The plot shows that the Cisco switches operate in cut-through mode, since their gradients are less than that of a single store and forward at Gigabit rate (0.008 µs/byte). The Xylan OmniSwitch SR9 has a gradient of 0.025 µs/byte, which corresponds to a switch port-to-port rate of 40 MBytes/s; this suggests multiple store and forwards. The fixed overhead for the Cisco switches is 1 µs and for the Xylan OmniSwitch it is 8 µs. Further examples of these measurements are given in Figure 8.1 of Section 8.2.

Figure 7.9: Port-to-port latency for various Gigabit Ethernet switches.
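As a minimal illustration of how these numbers are extracted, the Python sketch below fits straight lines to latency versus message size for a direct connection and for a connection through the switch, subtracts the direct-connection baseline, and converts the remaining slope into the reserved bandwidth (P7 or P8) and the intercept into the fixed overhead (P9 or P10). The data points are invented for the example and are not the values plotted in Figure 7.9.

    # Hypothetical (message size in bytes, one-way latency in microseconds) samples.
    sizes         = [46, 500, 1000, 1500]
    direct_us     = [20.0, 56.3, 96.3, 136.3]     # invented direct-connection baseline
    via_switch_us = [29.0, 101.7, 181.7, 261.7]   # invented measurement through the switch

    def linear_fit(x, y):
        # Least-squares fit y = slope * x + intercept.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
                sum((xi - mx) ** 2 for xi in x)
        return slope, my - slope * mx

    slope_d, icept_d = linear_fit(sizes, direct_us)
    slope_s, icept_s = linear_fit(sizes, via_switch_us)

    per_byte_us       = slope_s - slope_d      # switch-only slope in us/byte
    fixed_overhead_us = icept_s - icept_d      # zero-length latency: P9 or P10
    reserved_mbytes_s = 1.0 / per_byte_us      # reserved bandwidth: P7 or P8

    print(per_byte_us, fixed_overhead_us, reserved_mbytes_s)

With these invented numbers the sketch returns a slope of about 0.08 µs/byte, i.e. roughly 12.5 MBytes/s reserved per frame, consistent with a single Fast Ethernet store and forward.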

7.6.2 Basic streaming

The basic streaming measurement is the same as that described earlier. It is aimed at finding out whether we are limited by the switch, the node or the link speed. Firstly, two nodes are directly connected. One node streams messages of a fixed size to the other as fast as possible; the other node reads the messages as fast as possible and records the receiving rate. This is repeated for varying message sizes. The expected received rate should look like Figure 7.10: the throughput should be a function of the message length, that is, the larger the message, the higher the throughput. If we reach the theoretical maximum then we are limited by the link; otherwise we are limited by the node. The measurement is then repeated with the two nodes sending through the switch. If we obtain the same results as for the direct connection, then we are not limited by the switch. A graph of message size against loss rate can be plotted if the switch is limiting.

Figure 7.10: The expected result from streaming.

Examples of these measurements are given in Figure 7.11(a). This shows the received throughput for directly connected PCs and for PCs connected through three different switches: the BATM Titan T4, the BigIron 4000 and the Alteon 180. For the direct connection, the structure between 500 and 1000 bytes is a feature of the NIC with flow control enabled. Figure 7.11(b) shows the corresponding loss rates. For the direct connection there were no losses. The Titan T4 lost the fewest frames, and lost frames only when it had not learned the address of the destination node; the behaviour of the other switches did not change whether the destination address was known or not.
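One way to estimate the theoretical maximum referred to above is to account for the fixed per-frame overhead on the wire. The Python sketch below assumes the usual 38 bytes of Ethernet framing overhead per frame (preamble and start-of-frame delimiter, MAC header, CRC and inter-frame gap) on top of the payload; it is an approximation and not a statement about any particular NIC or switch.

    def max_payload_throughput(payload_bytes, link_bit_rate):
        # Assumes 38 bytes of per-frame overhead on the wire:
        # preamble+SFD (8) + MAC header (14) + CRC (4) + inter-frame gap (12).
        wire_bytes = payload_bytes + 38
        frames_per_second = link_bit_rate / (wire_bytes * 8)
        return frames_per_second * payload_bytes / 1e6   # MBytes/s of payload

    for size in (46, 500, 1000, 1500):
        print(size, round(max_payload_throughput(size, 100e6), 2))   # Fast Ethernet
    # 1500-byte messages give about 12.2 MBytes/s, close to the 12.5 MBytes/s raw rate.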

Figure 7.11: Results from unidirectional streaming through various Gigabit Ethernet switches: (a) the received rate; (b) the loss rate.

Frame loss is clearly linked to the implementation of the IEEE 802.3x standard in the switches. The throughput measured at the receiver for the BigIron 4000 is equal to that of the direct connection. This suggests to us that the BigIron 4000 reacts to flow control frames received from the destination node, but does not send flow control frames to the source node; instead it discards the frames that it cannot send. The Alteon 180 shows signs that it does send flow control frames, slowing down the sender, but not enough to avoid lost frames. This is evident firstly in the fact that it loses fewer frames than the BigIron 4000. Secondly, in Figure 7.11(b), in the region of message sizes around 1000 bytes, the received rate is above the received rate for the direct connection, implying a lack of flow control packets. Finally, above 1000 bytes we reach a position where no losses are detected, because sufficient flow control packets are sent by the switch to avoid packet loss.

7.6.3 Testing the switching fabric architecture

The traffic types

To test the switching fabric, multiple streaming nodes are used. The nodes can be asked to send at a specified rate to any number of destination addresses. The time between packets can be set to be constant or random, and the destination address can also be chosen to be constant or random. We define two traffic types for our measurements: the systematic and the random traffic patterns.

» The systematic traffic pattern corresponds to a situation where every source node transmits

to a single, unique destination node. The sources transmit at a constant rate, that is, the time between subsequent frames is fixed. Since there is a single path through the switch for each stream of traffic, this type of traffic is free of contention and thus queues do not build up until saturation is reached. When saturation does occur, it may be in the nodes or in the switch; therefore the maximum rate for directly connected nodes must be established. If the switch is non-blocking, then the average latencies should be constant as the transmission load increases, up to the limit of the nodes or of the link speed.

» In the random traffic pattern (see Section 6.3.2), the inter-packet times are exponentially distributed about a mean, and each node can send to all the other nodes in a random manner. The random traffic pattern is used as a way of cross-checking how well the parameterised switch model agrees with the measurements on the real system.

In both cases the load is increased by decreasing the mean value of the distribution while keeping the frame size constant; a short sketch of how the two patterns can be generated is given below. For the purpose of discovering the switch architecture, only the systematic traffic pattern is of interest, since it allows us to see the limits of the switch performance sharply.

The intra-module and inter-module transfer rate

Testing the maximum intra-module transfer rate tells us whether all the nodes in a module can communicate between themselves at the full link rate. The setup consists of populating all the ports of a module on a switch and sending traffic of fixed message sizes in the systematic fashion described above, bidirectionally between pairs of nodes. The combined received throughput of the nodes is the intra-module transfer rate; for a non-blocking switch, all the nodes will be able to reach the full line rate. This corresponds to parameter P5 of Section 7.4.2.

The inter-module transfer rate is tested by selecting two modules on the switch and populating both with nodes. The systematic streaming pattern is used such that each of the traffic streams crosses the backplane, that is, each node sends to a node in a different module. If limits are found in the inter-module transfer rate, then the measurement described later in this section must be performed to determine the module access rates to the backplane. The access from the module to the backplane is parameter P3 and the access from the backplane to the module is parameter P4 of Section 7.4.2.
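As mentioned above, the two traffic patterns can be summarised by how the inter-packet times and destinations are drawn, as in the Python sketch below; the mean gap, the number of frames and the use of Python's random module are illustrative choices rather than a description of the actual traffic generators.

    import random

    def systematic_gaps(mean_gap_us, n):
        # Systematic pattern: constant inter-packet time; each source keeps a
        # single, fixed destination (chosen outside this function).
        return [mean_gap_us] * n

    def random_gaps(mean_gap_us, n):
        # Random pattern: inter-packet times exponentially distributed about the
        # same mean; each frame also goes to a randomly chosen destination.
        return [random.expovariate(1.0 / mean_gap_us) for _ in range(n)]

    # The load is raised by lowering the mean inter-packet time at a fixed frame size.
    print(sum(systematic_gaps(240.0, 1000)) / 1000)   # exactly 240 us
    print(sum(random_gaps(240.0, 1000)) / 1000)       # roughly 240 us on average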

A comparison of the systematic and the random traffic should look like Figure 7.12, which shows a plot of the load accepted by the switch against the end-to-end latency for a given message size. The latency T illustrated in Figure 7.12 should correspond to that obtained from the ping-pong measurement at that message size. For a non-blocking switch, the throughput indicated by point L should be the sum of the maximum throughputs achieved by the nodes for the chosen message size. This is illustrated in Figure 7.13, where the relationship between the ping-pong measurement and the streaming measurements is shown.

Figure 7.12: Typical plot of load against latency for systematic and random traffic. The latency here refers to the end-to-end latency from one PC into another.

The point L for the systematic case can be due to three things.

1. The limit of the PC. If the PC is not powerful enough to saturate the link, then what is observed is the effect of saturation in the PCs. The PCs may not be able to saturate the link due to a combination of the internal PCI bus, the NIC and the MESH software. The limit of the PCs can be obtained from the basic streaming tests for directly connected PCs.

2. The limit of the link. We know the link speed from the technology standard; taking the overheads into account, this speed can be calculated.

3. The limit of the switch. If we reach neither the limit of the PC nor that of the link, then the limit L corresponds to the switch limit.

The results can be re-plotted as shown in Figure 7.14, which shows the amount of traffic generated or offered by the nodes against the amount of traffic received or accepted through the switch. A straight line of gradient one implies that everything sent by the nodes is delivered by the switch; a horizontal part of the graph will be visible if frames are lost.

Figure 7.13: Relationship between the ping-pong, the basic streaming and streaming with the systematic traffic pattern.

Figure 7.14: Typical plot of offered load against accepted load. If flow control works properly, we cannot offer more load than we can accept.

Figure 7.15: Typical plot of offered load against lost frame rate. For switches where flow control works properly, we should observe no losses.

A plot of frame loss against the offered load from the nodes can be made as shown in Figure 7.15. This lets us see whether the switch loses frames at low and at high loads for a fixed message size.

Measuring module access to and from the backplane

In hierarchical switching, the capacity of the backplane switching fabric is important, but the capacity of the links connecting the modules to the backplane is also an issue. This limitation may be different depending on whether we are considering traffic from the backplane or traffic to the backplane. In order to assess the access to and from the backplane we use the setup shown in Figure 7.16, which shows a switch of n modules and m ports per module.

For access to the backplane, the idea is to saturate the links from a module to the backplane without saturating the links from the backplane to the module. The nodes on the same module (a1 to a3) transmit as fast as they can to nodes on different modules (b1 to b3). The number of transmitters on module 1 (a1 to a3) is chosen such that their combined transmission rates can saturate the access to the backplane; at the other end (b1 to b3), there must be enough nodes to absorb all the traffic being transmitted. The nodes communicate in pairs, that is, a1 sends to b1, a2 sends to b2, and so on. The combined received rate of nodes b1 to b3 is the maximum throughput to the backplane. This corresponds to parameter P3 of Section 7.4.2.

For access from the backplane, the idea is to saturate the links to a module from the backplane without saturating the links from the module to the backplane. The roles of the transmitters and receivers are reversed. This corresponds to parameter P4 of Section 7.4.2.

Figure 7.16: The setup to discover the maximum throughput to and from the backplane.

Due to the varying numbers of ports and modules per switch, it may not always be possible to perform this test exactly as described. For instance, for a switch with one port per module, access to and from the backplane will be the same. Examples of these measurements are given in Section 8.2.

The maximum backplane throughput

The maximum backplane throughput is the maximum rate that can be transmitted across the switch backplane. The value quoted by the vendor may not be achievable, due to the switch architecture. For example, for the Turboswitch 2000 (see Appendix C), the backplane has 128x128 links, each running at 40 Mbit/s, giving a total backplane bandwidth of 5.1 Gbit/s; in fact only 120 of the 128 links can be used for the transfer of user data, giving a potential maximum backplane utilisation of 4.8 Gbit/s.

This measurement aims to find the maximum achievable backplane throughput. To determine it, all ports of the switch are loaded and traffic is sent systematically between pairs of nodes such that the traffic streams pass through the backplane of the switch, that is, all the traffic is inter-module. The total received throughput corresponds to the maximum backplane throughput.

This may be limited by the access to and from the backplane (i.e. the backplane may be capable of more, but the architecture limits the accessible throughput) or by the capacity of the backplane itself (i.e. the backplane is the limit). For a non-blocking switch, all nodes will reach the line rate for both send and receive. The maximum achievable backplane throughput corresponds to parameter P6 of Section 7.4.2.

7.6.4 Testing broadcasts and multicast

Broadcast and multicast frames are required to appear on multiple ports of an Ethernet switch and may therefore be handled differently from unicast frames. We would like to know the following.

» Do broadcast and multicast frames suffer more latency than unicast frames?

» Do all nodes receive broadcasts?

» Are the rates or throughputs different from those of unicast frames?

» Are frame losses any different from unicast frames? Does flow control propagate through the switch for broadcast traffic, that is, does the internal flow control slow down the broadcasting node?

The same tests performed with unicast frames can be performed using broadcast and multicast frames to see whether the switch supports them without degradation in forwarding performance.

1. The first test is the ping-pong test. The modification here is that the client broadcasts its request and the server's response is also a broadcast. As before, this is done with and without the switch. This will tell us the switch port-to-port latency for broadcast frames.

2. If the broadcast ping-pong test is shown to have the same latencies as the unicast one, then this measurement can be used to find out how broadcast and unicast are prioritised against one another. The setup for this test is shown in Figure 7.17. It requires at least three nodes on the switch: one node acts as the broadcast node, another as the unicast node, and the third receives from both transmitting nodes. The average latencies, the number of frames received and the loss rates of the unicast and broadcast traffic can be examined at the receiver and compared. For high transmission rates, are broadcast frames dropped in preference to unicast frames or vice versa?

3. The next test is to use two nodes, one broadcasting as fast as possible and the other receiving as fast as possible, as in the basic unicast streaming case (Section 7.6.2). This will reveal the same things as the basic unicast streaming case, but for broadcast frames.

4. With multiple nodes connected to the switch and one node broadcasting at its maximum rate, we would like to see whether all nodes receive the broadcast.

5. If the basic streaming with broadcast shows no frame losses, then we can perform the following test to confirm that flow control is propagated for broadcast traffic. Using the same setup as above but with two nodes broadcasting at the full rate (such that saturation is reached), we can examine the receive rates to see whether any packets are lost.

Figure 7.17: An example setup to test the priority, rate and latency distribution of broadcast frames compared to unicast frames.

Examples of the broadcast measurements are given in Section 8.2.

7.6.5 Assessing the sizes of the input and output buffers

Measuring the input and output buffer sizes is difficult, and in general we have to rely on the vendor's information on the size of the buffers in their switch. If packet aging can be turned off, a way to assess the input and output buffer sizes of a switch is illustrated in Figure 7.18. The switch is programmed with static routes for the attached nodes A, B and C. Flow control must be enabled between node A and the switch such that no packets are lost on that link. Node A is blocked from receiving, such that the switch stores packets destined for it in the output buffer of port a. The measurement starts with node B sending to node A.

Since node A is blocked from receiving, the output buffers at port a and the input buffers at port b will fill up. Once they are full, frames will be lost between node B and switch port b. When node A is re-enabled to receive packets, it can examine the sequence numbers of the incoming packets to see whether they are sequential: the last number before the sequence breaks indicates the combined input and output buffer space available for storing packets.

In the second phase of the measurement, the same setup is repeated but with a third node, node C, connected to the switch. As before, flow control is enabled only between node A and the switch, and node A is blocked from receiving. Node B sends sequence-numbered frames to node A. Within a few seconds, the output buffers of port a and the input buffers of port b will be full and subsequent frames from node B will be dropped. Node C also starts transmitting sequence-numbered frames to node A; this causes the input buffer of port c to fill up, so frames from node C occupy only the input buffer of port c. When node A is re-enabled to receive, all the frames in the switch buffers from node B will be forwarded to node A, since they arrived in the switch first, and then the frames from node C will be forwarded to node A. Again, by analysing the sequence numbers of the frames received at node A from node C, the last number before the sequence breaks indicates the input buffer size of port c. Assuming the buffer sizes are shared equally between ports, and given that we know the combined input and output buffer size, we can calculate the output buffer size.

A potential problem with this method is that frames may reach the age limit and be discarded by the switch; therefore frame aging should be disabled in the switch, as mentioned above. This method applies specifically to the distributed memory switch architecture; for the shared memory architecture there is no distinction between ports and their buffers. The input and output buffering correspond to parameters P1 and P2 of Section 7.4.2.

7.6.6 Testing quality of service (QoS) and VLAN features

Quality of service and VLANs have been introduced into Ethernet with a new frame format which extends the standard Ethernet packet by four bytes (IEEE 802.1Q). With the new frame format, eight priority levels (three bits) and 4093 private VLANs (12 bits) are possible. A switch can also implement priorities and VLANs based on its ports and MAC addresses.
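To make the tag layout concrete, the Python sketch below packs and unpacks the 16-bit tag control information field implied by this description (three priority bits, one canonical format indicator bit and a 12-bit VLAN identifier); it is illustrative only and not tied to any particular switch.

    def pack_tci(priority, vlan_id, cfi=0):
        # 16-bit tag control information: 3 priority bits, 1 CFI bit, 12-bit VLAN ID.
        assert 0 <= priority <= 7 and 0 <= vlan_id <= 4095
        return (priority << 13) | (cfi << 12) | vlan_id

    def unpack_tci(tci):
        return (tci >> 13) & 0x7, (tci >> 12) & 0x1, tci & 0xFFF

    tci = pack_tci(priority=5, vlan_id=42)
    print(hex(tci), unpack_tci(tci))   # 0xa02a (5, 0, 42)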

Figure 7.18: Investigating input and output buffer sizes.

Prioritisation

Prioritisation is used to mark packets with a level of urgency such that high-urgency packets are serviced before low-urgency packets. The urgency or priority can be based on the frame's source or destination Ethernet address, the TOS field in an encapsulated IP packet, or the IP source or destination address. The new Ethernet frame format also has three bits reserved for assigning eight levels of priority. As mentioned in Section 3.5.3, the Ethernet standard does not specify the service rate for the different priorities; furthermore, switches may support as few as two priority levels. The number of priority classes is normally given by the vendor (the IEEE 802.1p standard gives the recommended way in which vendors should split priorities in their switches based on the number of available classes), however the service rate of each priority level is not always obvious.

The priority feature is tested in a similar way to the broadcast and multicast frames (see Section 7.6.4, item 2). For a two-priority system, one transmitter is configured to transmit high priority packets and the other low priority packets, but at the same rate. The latencies of the high and low priority packets are examined at the receiver for varying loads. The expected result is that for low loads the low and high priorities show the same latencies. For higher loads, where we begin to reach the limitations of the receiver, the link rate or the switch capacity, we expect to see the high priorities maintain a low end-to-end latency while the low priority latencies grow.

The corresponding throughput for the high priority should increase while the low priority throughput decreases.

An example of the priority results is shown in Figure 7.19. The measurement was performed on the BATM Titan T4 (via the Fast Ethernet ports), which has two levels of priority, high and low. Figure 7.19(a) shows the inter-packet time (and hence the offered load) plotted against the end-to-end latency; Figure 7.19(b) shows the inter-packet time against the accepted throughput for the same measurement. A packet size of 1500 bytes was used. The high and low priorities achieve the same average end-to-end latency until an inter-packet time of 248 µs. This corresponds to an offered rate of 6 MBytes/s from each of the sources, corresponding to saturation of the receiving node's link. At this point, a high priority packet must wait at most the time to transmit a single 1500-byte packet; this is the reason for the jump in the latency of the high priority traffic between 248 and 140 µs inter-packet time. Within this region, the high priority packets have a constant end-to-end latency, whereas the end-to-end latency of the low priority traffic increases as the high priority traffic takes up more bandwidth. Below an inter-packet time of 140 µs, the high priority traffic saturates and its latency grows above 100 ms. At this point the ratio of the throughput of the high priority to the low priority is 89% to 11%, a value confirmed by the vendor.

Figure 7.19: Fast Ethernet priority test on the BATM Titan T4, with high and low priority nodes streaming to a single node: (a) the end-to-end latency; (b) the throughput.

The same measurement and setup can be used to test more than two priority classes, with another transmitter for each additional priority class.
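The size of the jump in the high priority latency follows from the fact that frame transmission is not pre-empted: a high priority frame that arrives while a low priority frame is being sent must wait for it to finish. A small Python sketch of that worst-case wait, using the nominal Fast Ethernet rate:

    def max_blocking_time_us(frame_bytes=1500, link_mbytes_s=12.5):
        # Worst-case extra wait of a high priority frame behind one
        # maximum-size low priority frame already being transmitted.
        return frame_bytes / link_mbytes_s   # bytes / (MBytes/s) = microseconds

    print(max_blocking_time_us())   # 120 us for a 1500-byte frame on Fast Ethernet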

VLAN

The VLAN is a feature of Ethernet switches used to manage bandwidth more efficiently in networks. It does this by providing a way of segmenting networks such that certain types of traffic are limited to a certain part of the network. The support of VLANs can be tested by segmenting the network and observing whether unicast, broadcast and multicast frames can cross the VLAN boundaries. This requires a setup such as that illustrated in Figure 7.20. In this setup, nodes 1 and 2 are connected to ports on VLAN a, node 3 to a port on VLAN b, and node 4 to a port on VLANs a and b. Nodes 2 and 3 transmit broadcast, multicast and unicast frames to nodes 1 and 4, and the received frames are analysed on all nodes. Node 1 should see only frames from node 2; node 4 should see frames from nodes 2 and 3; nodes 2 and 3 should not receive any frames.

We would also like to test the ability of the switch to add and strip VLAN tags (this is necessary if the loop-back test of Section 7.6.8 is to be performed). This test requires only two of the nodes in the setup of Figure 7.20, for example nodes 3 and 4. The switch port connecting node 3 should be set to an untagged VLAN port and the switch port connecting node 4 to a tagged VLAN port. Node 3 can then send unicast frames in the classical format to node 4. Analysis of the received frames on node 4 should show that the frames have the new frame format, with a VLAN tag corresponding to the VLAN of the port to which node 3 is attached. Node 4 then sends packets with VLAN tag b to node 3. Analysis of the received traffic on node 3 should show the frames sent from node 4, but without their VLAN tags.

7.6.7 Multi-switch measurements

The next step is to look at nodes connected via multiple switches. These tests are to discover whether the switch latencies increase linearly and whether the maximum rates are degraded when switches are cascaded. Also of concern is the performance of implementations of the IEEE 802.3ad trunking standard.

Cascaded switches

To test cascaded switches, multiple switches are connected together and traffic is sent across them. The ping-pong measurements (Section 7.6.1) should be repeated to find the end-to-end latencies and to see how they compare with the single-switch measurements. The basic streaming test can be done unidirectionally and bidirectionally to see, firstly, whether the results agree with those of the measurements on a single switch.

Figure 7.20: Testing VLANs on a switch. Nodes 1 and 2 are connected to ports on VLAN a, node 3 to VLAN b and node 4 to VLANs a and b.

Secondly, to discover how well the switch-to-switch flow control works, the frame loss should be examined in cases where the switch is saturated. The per-link and per-module rates should also be examined. For these multi-switch tests, we will be looking at the maximum throughput that can be achieved, the frame loss and the end-to-end latency. The end-to-end latency at low transmission rates should be consistent with the ping-pong results.

An example of the end-to-end latency across multiple switches is shown in Figure 7.21. These results were obtained from the ping-pong measurements and have the results of the direct connection subtracted, to leave the latency of going through the switches. They show that the latency increases linearly (by the store and forward time) as the number of switches increases.

Figure 7.21: End-to-end latency through multiple Titan T4 Gigabit Ethernet ports (comms 1 measurement between two nodes through one, two and three switches).

Trunking

IEEE 802.3ad link aggregation, or trunking, is a recent standard which enables multiple links to be grouped into a single aggregate link (see Section 3.5.4). For a given pair of switches, we would like to know the following:

1. The maximum number of links that can be trunked per switch. The standard does not specify any limit on the number of ports that can be trunked; however, on some switches only a subset of ports can be trunked.

2. Does the trunked link work as expected? That is, are we able to obtain a bandwidth equivalent to the aggregate of the trunked links?

3. In the event of the failure of a link in a trunk, is the traffic re-routed to another link in the trunk, how long does it take, and how many packets are lost in the process?

4. Conversely, when a disabled link of a trunk is re-enabled, is traffic allocated to it again and how long does it take?

5. Does the load balancing work? What is the policy for using the links when a new conversation starts? How is the distribution handled when a new connection is introduced?

Item 1 of the above list is normally supplied by the switch vendor and can be observed in the switch configuration menu.

Item 2 can be tested as follows. The setup consists of two switches connected with trunked links and a fixed number of nodes on each switch, as shown in Figure 7.22, which shows two switches A and B with a number of nodes on each, connected together via trunked links. For this measurement, we require that the number of pairs of nodes communicating through the switches be greater than or equal to the number of links in the trunk, so that the trunked link can be saturated. Traffic is sent systematically at the maximum rate between the nodes on switch A and the nodes on switch B and the received rate is analysed. The achieved throughput between the switches should be a function of the number of links in the trunk.

To test the effect of a broken link (item 3), the same setup is used, with two nodes on switch A, two nodes on switch B and two links in the trunk. Traffic is sent unidirectionally and systematically from the nodes on switch A to the nodes on switch B. During the transmission, one of the links of the trunk is unplugged to simulate a broken link. If the traffic is re-routed to the working link, then the received rate on each of the nodes on switch B should change from a high steady rate to a reduced steady rate.

Figure 7.22: A setup to test trunking. Trunked links are used to connect two Ethernet switches.

The time between these two phases is the time taken by the switches to detect and route around the broken link. The number of packets lost can also be measured. To test item 4, the same setup is used, but this time the link is re-connected to simulate its re-enabling. The received rates of the nodes on switch B are examined to detect the change from a low steady rate to a higher steady rate; this tells us how long it takes for the switches to re-allocate traffic to a re-enabled link.

There are many ways in which the load balancing across the trunked links can be tested. An example is as follows. The setup should be similar to Figure 7.22, with three nodes on switch A, three nodes on switch B and two links in the trunk. Nodes a1 and a2 both transmit traffic in a systematic pattern to nodes b1 and b2 respectively at 100% of the link rate; this saturates the trunked links. Once this setup is running, node a3 attempts to send traffic at 100% to node b3. By reducing the rate of node a1 and of node a2 alternately, the received packet rate at node b3 is analysed in each case to determine whether load balancing is taking place, that is, whether the stream from node a3 to node b3 is able to take advantage of the maximum available link rate. If load balancing is taking place, we can measure how long it takes to occur by noting the time on node b3 when the transmit rate of node a1 or a2 is altered, and noting the time again when the receive rate on node b3 becomes stable.
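The "how long does it take" questions above can be answered offline from the receiver's rate history. A hedged Python sketch is shown below: given periodic receive-rate samples it estimates the time spent between the initial and final steady rates; the sample period, tolerance and example trace are all invented for illustration.

    def transition_time(samples, sample_period_s, tol=0.05):
        # samples: periodic receive-rate readings (MBytes/s) spanning the event.
        start, end = samples[0], samples[-1]
        # First sample that has left the initial steady rate.
        t_leave = next(i for i, r in enumerate(samples) if abs(r - start) > tol * start)
        # First subsequent sample that has reached the final steady rate.
        t_settle = next(i for i in range(t_leave, len(samples))
                        if abs(samples[i] - end) <= tol * max(end, 1e-9))
        return (t_settle - t_leave) * sample_period_s

    # Invented trace: receive rate sampled every 0.1 s while a trunk link fails.
    rates = [12.0, 12.0, 12.0, 9.1, 3.0, 6.2, 6.0, 6.0, 6.0]
    print(transition_time(rates, 0.1))   # 0.2 s for this made-up trace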

7.6.8 Saturating Gigabit links

As mentioned previously in Section 6.8, with our current approach it is difficult to saturate a Gigabit link with one PC. The full link rate is needed to test whether the switches are truly non-blocking. There are two ways we can do this.

Saturation using multiple switches

The first is to use multiple nodes streaming to a single Gigabit link on a switch port such that the aggregate throughput reaches 1 Gbit/s. This saturated link can be used as a source for testing another switch's Gigabit port. On the output port of the switch under test, there needs to be a third switch which can distribute the aggregate rate to multiple nodes. To do this, we must be sure that the first and third switches are able to sustain the required rates.

Saturation using switches with VLANs

Saturating a Gigabit link with VLANs requires the setup shown in Figure 7.23. Critical features of this setup are the way the VLANs are defined on the switches and how the switch ports are connected. The switches involved must support VLANs as described in Section 7.6.6 and must be able to pass the tag add-and-strip test of that section. The setup has two switches, A and B. Switch A has a number of input ports (four in this case) set to different VLANs, v1 to v4. Packets entering the input ports should be of the original Ethernet frame format, that is, without the VLAN tag. Switch A has a single output port set to VLAN vt; this port belongs to all the defined VLANs on switch A. It is also a tagged VLAN port, that is, frames forwarded from that port have the VLAN tag added. Switch A should always forward packets to the port marked vt, because all the other ports are in a different VLAN; for this reason, on switch A learning can be enabled, or the forwarding table can be statically set to indicate that node b1 is found on the port marked vt.

Switch B has a single input port marked vt and a number of output ports (four in this case). The port marked vt belongs to all the defined VLANs on switch B. Switch B forwards frames received on the port marked vt to all ports in the VLAN indicated by the tag information of the received packets. The output ports of switch B are all set to untagged ports, that is, frames have their 4-byte tags (the type and tag control information fields, see Section 3.3.2) removed before being forwarded. Learning should not be switched off on switch B, since that would imply setting up

static forwarding tables, which would cause the switch always to forward the packets to the same port, or to discard them if the input and output ports are in different VLANs. Loops in the network are made by connecting output ports of switch B to input ports of switch A in a similar way to that shown in Figure 7.23.

With this setup, frames of the original Ethernet frame format are sent from node a1 with the destination address of node b1. The frame is sent to the switch A port marked vt, where the VLAN tag (based on the VLAN of the port to which node a1 is connected) is added to the frame before it is forwarded to switch B. When switch B receives the frame, it does not know the port on which to find node b1, so it forwards the frame to the port in the same VLAN, the port marked v1; before the frame is forwarded, the tag control information fields are removed. The frame then reappears on the switch A port marked v2, and loops through the system to v3 and finally to v4. If a1 continuously streams data then, in the steady state, the throughput on the Gigabit link will be equal to (n + 1) times the rate at which a1 succeeds in sending, where n is the number of loops in the system. In the case of Figure 7.23, n is 3, therefore the rate through the Gigabit link will be 4 times the rate at which a1 sends.

Figure 7.23: Looping back frames to saturate a Gigabit link.

Having saturated the link, it can be used to send data through a third switch in order to test it. An example of the results from the loopback measurement is given in Figure 7.24. This was performed using two BATM Titan T4s with their Fast Ethernet ports. In this case a single loopback was used; as a result, half of the Fast Ethernet rate of 12 MBytes/s is the maximum achievable throughput (Figure 7.24(b)). In Figure 7.24(a), the gradient of the loopback plot is 0.32 µs/byte. This is twice the value obtained for the non-loopback case and corresponds to four Fast Ethernet store and forwards. The fixed overhead for the loopback case is 21.6 µs; this is also twice the value obtained for the non-loopback case and corresponds to four times the fixed overhead of a single Titan T4 switch.
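As a small aid to dimensioning such a setup, the Python sketch below computes the link throughput expected for a given number of loops and, conversely, the number of loops a rate-limited sender would need to saturate a Gigabit link; the 60 MBytes/s sender limit in the example is an assumed figure.

    import math

    def link_throughput(sender_mbytes_s, n_loops):
        # Each frame crosses the inter-switch link (n_loops + 1) times, so the
        # link carries (n_loops + 1) times the sender's rate in the steady state.
        return (n_loops + 1) * sender_mbytes_s

    def loops_needed(sender_mbytes_s, link_mbytes_s):
        # Smallest number of loops for the looped traffic to reach the link rate.
        return max(0, math.ceil(link_mbytes_s / sender_mbytes_s) - 1)

    print(link_throughput(60.0, 3))     # 240 MBytes/s with three loops
    print(loops_needed(60.0, 125.0))    # 2 loops suffice for a 125 MBytes/s Gigabit link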

Figure 7.24: Example results comparing a loopback and a non-loopback measurement on the BATM Titan T4: (a) the switching latency; (b) the throughput.

7.7 Conclusions

In this section we have described measurements aimed at characterising Ethernet switches, illustrated the type of results we are likely to see, and given interpretations of those results. We can discover the following.

» The architecture of the switch.

» The architecture of the switching fabric.

» The rate at which unicast, multicast and broadcast are handled.

» The respective priorities of unicast, multicast and broadcast.

» The loss rate.

» The input and output buffer sizes.

» The maximum inter-module and intra-module throughputs.

» The maximum usable backplane throughput.

» The maximum module throughput to and from the backplane.

» How well trunking, VLANs and priorities work.

We have identified these measurements based on our experience with real switches and our efforts in modelling. These measurements can tell us sufficient detail about the internals of a switch to allow us to model traffic passing through its ports. As the modelling work evolves, other measurements may need to be defined so that the relevant parameters can be identified and measured.


Chapter 8

Parameters for contemporary Ethernet switches

8.1 Introduction

In our investigation of Ethernet for the ATLAS LVL2 network, we follow two approaches. Firstly, we look at real Ethernet switches: their performance, their scalability and how well they work according to the standards. Secondly, models of the switches are being developed based on the results of the performance tests, such that large-scale models of a size comparable to the final ATLAS LVL2 network can be simulated and studied. As a result, a large body of measurements and analysis has been carried out and continues to be carried out. To date, we have performed measurements on the Netwiz Turboswitch 2000, the Intel 550T, the BATM Titan T4, the Foundry BigIron 4000, the Cisco Catalyst 6509, the Cisco Catalyst 4912G, the Cisco Catalyst 4003, the Xylan OmniSwitch SR9 and the ARCHES switch [49] developed at CERN as part of the ESPRIT project. In this chapter, we present a selected few of these measurements. In the first part of the chapter, we present a validation of the parameterised model, in which a comparison of the models and real switches is made. In the second part, we present some off-the-shelf Ethernet switches and their parameters for the purpose of modelling, and we identify issues of interest to ATLAS based on our experience.

8.2 Validation of the parameterised model

Parameters for the Turboswitch 2000

A detailed description of the architecture of the Netwiz Turboswitch 2000 is given in Appendix C. The switch we had access to was equipped with eight Fast Ethernet modules, each with four ports. The switch supports a proprietary VLAN implementation and flow control only in half-duplex mode, and therefore measurements of these features are not presented here. Management is via a graphical user interface running under the Microsoft Windows environment; the software was supplied by the vendor and uses the Simple Network Management Protocol (SNMP) communicating over TCP/IP (see Section 3.5.6).

Comms 1 measurements

Figure 8.1 shows the end-to-end latency obtained from the comms 1 exercise. The figure shows the result for two nodes: directly connected, through the same module of the switch, through different

We were unable to obtain sensible results for broadcasts through the same module because of excessive losses. In this switch, multicast and broadcast are handled in the same way. These results are summarised in terms of the switch parameters in Table 8.1.

Table 8.1: Model parameters for the Turboswitch 2000 Ethernet switch, for unicast and for broadcast/multicast traffic: P1, P2 [frames]; P3 to P8 [MBytes/s]; P9, P10 [µs]. The parameters obtained from the ping-pong measurement are marked with †, those obtained from the vendor with ‡, and those obtained from the streaming measurement with * (the maximum bandwidth for 1500 byte frames is given).

The parameters marked with † are those extracted from the ping-pong measurements: P7, P8, P9 and P10. The fixed latencies through the switch, P9 and P10, are obtained by extrapolating the lines of Figure 8.1 to a zero-length message and subtracting the value obtained for the direct connection. This is interpreted as the minimum time to make a switching decision. Parameters P7 and P8 are the throughput reserved for a single packet going through the switch, or the unpipelined throughput. These are obtained by taking the gradients of the lines of Figure 8.1 and subtracting the gradient of the direct connection. See Section for a full description of how parameters are obtained from the comms 1 measurement. The parameters marked with ‡ are those obtained from the switch vendor. The parameters marked with * are obtained from the streaming measurements and are described below. See Section for a full description of how parameters are obtained from the basic streaming measurement.

Basic streaming

Figure 8.2 shows the results of the basic streaming exercise. From this plot, we see that for the unicast case we are able to obtain the same throughput through the switch as for the direct connection. This implies that we are not limited by the switch. For broadcast, however, we are not able to achieve the same rate as for unicast. The maximum broadcast rate is 2.0 MBytes/s, a value confirmed by the vendor. Since there is no flow control, all packets sent above this rate are dropped.
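The extraction of the ping-pong parameters just described amounts to fitting straight lines to the latency-versus-message-size curves and comparing them with the direct connection. The sketch below illustrates this with NumPy on hypothetical data; it follows the P7–P10 convention of Table 8.1 but is not the analysis code actually used.

```python
import numpy as np

def pingpong_parameters(sizes, direct_latency, switch_latency):
    """Fit latency = intercept + slope * size (latencies in us, sizes in bytes)."""
    slope_d, intercept_d = np.polyfit(sizes, direct_latency, 1)
    slope_s, intercept_s = np.polyfit(sizes, switch_latency, 1)

    # P9/P10: fixed switching latency, the zero-length intercept difference (us)
    fixed_latency_us = intercept_s - intercept_d

    # P7/P8: unpipelined throughput, the inverse of the extra per-byte cost
    # (bytes per microsecond is numerically equal to MBytes/s)
    unpipelined_mbytes_s = 1.0 / (slope_s - slope_d)

    return fixed_latency_us, unpipelined_mbytes_s

# Hypothetical curves: 25 us switching overhead, 30 MBytes/s unpipelined throughput
sizes = np.array([0.0, 250.0, 500.0, 1000.0, 1500.0])
direct = 10.0 + sizes / 12.5                    # direct link at the 12.5 MBytes/s line rate
via_switch = 35.0 + sizes / 12.5 + sizes / 30.0
print(pingpong_parameters(sizes, direct, via_switch))   # ~(25.0, 30.0)
```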

Figure 8.1: The end-to-end latency for direct connection and through the Turboswitch 2000. (a) End-to-end latency as a function of message size; (b) end-to-end latency as a function of message size, from 0 to 100 bytes.

Figure 8.3 shows the resulting plot of the minimum inter-packet time for the streaming measurement. The gradient of the unicast line is equivalent to the Fast Ethernet full line rate of 12.5 MBytes/s. We are able to reach the full line rate for all four ports of a single module. This gave us the parameter P5 summarised in Table 8.1. The minimum inter-packet time for a zero-length packet is 3.2 µs, both for unidirectional streaming through the switch and for the directly connected nodes. Thus, for unidirectional traffic, the switch is able to support the full rate of the end-nodes. For broadcast, the minimum inter-packet time for a zero-length packet is 19.4 µs.

Backplane access

To investigate the module access to and from the backplane (that is, the maximum rate at which data can be sent into and received out of the backplane), we performed the measurement described in Section. We termed this the 3111 setup. It consisted of four switch modules: one module had three nodes and the others each had one node. The results obtained for going into and out of the module for 1500 byte and 500 byte frames are shown in Figure 8.4. The results of Figure 8.4 show the accepted throughput against the end-to-end latency. We note that the latency is constant until we reach the saturation point. At this point, the formation of queues in the switch causes the latencies to rise sharply. For a given message size, the saturation point is different for packets going into a module compared with packets going out of a module.
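The saturation point referred to above can be located directly from such data: the average latency stays essentially flat until queues start to form, so a simple threshold on the unloaded latency is enough to find the knee. The sketch below is a hypothetical illustration of this; the factor of two is arbitrary, and the data arrays would come from a 3111-style measurement.

```python
def saturation_point(offered_load, avg_latency, factor=2.0):
    """Return the first offered load (MBytes/s) whose average latency
    exceeds `factor` times the latency measured at the lowest load."""
    baseline = avg_latency[0]
    for load, latency in zip(offered_load, avg_latency):
        if latency > factor * baseline:
            return load
    return None  # no saturation seen in the measured range

# Hypothetical data: latency flat at ~120 us, queues build up above ~30 MBytes/s
loads = [5, 10, 15, 20, 25, 30, 32, 34]
latencies = [120, 121, 122, 121, 123, 180, 450, 2000]
print(saturation_point(loads, latencies))  # -> 32
```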

Figure 8.2: The throughput obtained for unidirectional streaming with two nodes through the Turboswitch 2000.
Figure 8.3: The minimum inter-packet time obtained for unidirectional streaming with two nodes through the Turboswitch 2000.

As a quick cross check, we note that the latency for a given message size matches the "different module" line shown in Figure 8.1(a). In Table 8.1, the values P3 and P4 represent the module access to and from the backplane. The value corresponding to 1500 byte frames is presented in the table.

Figure 8.4: The Turboswitch 2000 results from the 3111 setup, used to discover access into and out of a module.
Figure 8.5: Random traffic for the 3111 setup through the Turboswitch 2000. Traffic is inter-module only. Parameterised model against measurements.

Measurements and model compared

Figures 8.5 and 8.6 show the results of the measurements compared to the modelling, with the nodes transmitting at various loads with a random traffic distribution; that is, the destination for each packet sent from a node was randomly chosen and the inter-packet time was taken from an exponential distribution (see Section 7.6.3).
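This random traffic pattern can be reproduced in outline: each frame is given a uniformly chosen destination among the other nodes, and the gap to the next frame is drawn from an exponential distribution whose mean is fixed by the desired offered load. The following sketch is a simplified illustration of that scheme, not the traffic generator used on the testbed.

```python
import random

def random_traffic(src, nodes, frame_bytes, offered_mbytes_s, n_frames):
    """Yield (send_time_us, destination) pairs for one source node."""
    # bytes divided by MBytes/s gives the mean gap directly in microseconds
    mean_gap_us = frame_bytes / offered_mbytes_s
    others = [n for n in nodes if n != src]
    t = 0.0
    for _ in range(n_frames):
        t += random.expovariate(1.0 / mean_gap_us)  # exponential inter-packet time
        yield t, random.choice(others)              # uniformly random destination

# Example: node 0 of a six-node setup offering 8 MBytes/s of 1500 byte frames
for when, dst in random_traffic(0, range(6), 1500, 8.0, 5):
    print(f"t = {when:8.1f} us  ->  node {dst}")
```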

Figure 8.6: Histogram of latencies for various loads (as a percentage of the Fast Ethernet link rate) for the 3111 configuration with random traffic. Model against measurements.

The 3111 setup was used. Figure 8.5 shows the accepted traffic load against the average end-to-end latency for 1500 and 500 byte frames. This shows very good agreement between the model and the measurements. In Figure 8.6, histograms of the latencies for the same setup at various loads are shown. Each histogram is plotted as the normalised probability of finding a packet with a greater end-to-end latency. The load is represented as a percentage of the Ethernet link rate (all nodes were configured to transmit at the same rate). This shows that there is very good agreement between the parameterised model and the measured performance of the real switch.

Testing the parameterisation on the Intel 550T

In order to test the ability of the parameterised model to model other switches, we modelled the Intel 550T Ethernet switch. The Intel 550T is an eight port Fast Ethernet switch. It has two expansion slots which can each host a module of four ports, bringing the total number of ports to 16. The expansion slots can also host a stacking module which allows the connection of up to seven 550T switches together to form a 96 port switch. It can operate in both store-and-forward and cut-through modes. The switch tested was a single eight port unit. For our tests, the switch was set to store-and-forward mode. The literature supplied with the switch was unclear: a minimum latency of 11 µs is reported in the documentation and 7.5 µs in the description given on the web. The documents give 6.3 Gbit/s aggregate internal bandwidth, 2.1 Gbit/s backplane bandwidth, but 800 Mbit/s aggregate network bandwidth.

Figure 8.7: The results of the bidirectional streaming tests on the Intel 550T switch, using ports 1, 2, 5, 6 and 8. (a) The average throughput per node as a function of the number of nodes; (b) the total network throughput as a function of the number of nodes. This shows that up to four Fast Ethernet nodes can communicate at the full link rate.

Our request for clarification from the vendor went unanswered. Tests on the eight port switch showed that the zero message length latency was 5 µs in the store-and-forward setup. We also discovered that the switching fabric must be a shared bus or shared buffer since, independent of the number of ports used, we were limited to an accepted load of 51 MBytes/s, equivalent to just over four ports running at full rate. This is shown in Figure 8.7. In Figure 8.7, the setup initially consisted of two nodes. Each node in the system sent to another at the full rate and the total received rate was measured. The number of nodes in the system was then increased and the measurement repeated. Figure 8.7(a) shows the average throughput per node and Figure 8.7(b) shows the total throughput through the switch.

Tolly, a third-party network equipment test house, tested the Intel 550T [40]. It is difficult to extract information we can use to build our models from the Tolly report because of the configuration they used. Their chosen configuration was 56 ports, that is, seven switches connected by a matrix module, with flow control disabled. Their report does not give the throughput achieved with all nodes transmitting and receiving bidirectionally. They do, however, supply the throughput for unidirectional traffic. For unidirectional traffic, with 28 streams all going through the backplane, they achieved a maximum of 2.8 Gbit/s with no frame loss.
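The figures quoted here and in the next paragraph can be tied together with a little arithmetic, sketched below; the only assumption is the usual decimal reading of the Fast Ethernet line rate (100 Mbit/s = 12.5 MBytes/s).

```python
port_mbytes_s = 100 / 8.0                  # Fast Ethernet line rate per port: 12.5 MBytes/s
ceiling_mbytes_s = 51.1                    # measured ceiling of the 550T fabric

print(ceiling_mbytes_s / port_mbytes_s)    # ~4.1: just over four ports at full rate
ceiling_mbit_s = ceiling_mbytes_s * 8      # ~409 Mbit/s per switch
print(7 * ceiling_mbit_s / 1000)           # ~2.86 Gbit/s for seven switches,
                                           # consistent with Tolly's 2.8 Gbit/s
```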

In our own tests, we came to the conclusion that the maximum backplane speed per switch was 51.1 MBytes/s, or 408 Mbit/s bidirectionally. For seven switches, this would give the 2.8 Gbit/s measured by Tolly.

In modelling the 550T, we performed the necessary measurements to obtain the parameters we required. However, we were unable to obtain the buffer sizes from the manufacturer. To investigate the size of the buffers, the output buffer in the model was set to one frame and the input buffer was varied. Figure 8.8 shows how well the different configurations agreed with the measurements on the real system. The configuration was eight nodes sending packets of 1500 bytes to each other, where the destination address and the inter-packet time were chosen randomly. Flow control was turned off for this measurement.

Figure 8.8: Investigating the buffer size in the Intel 550T switch (eight nodes, random destination addresses, exponential inter-packet gaps, 1500 byte frames; measured against modelled with RR input FIFOs of 64, 8, 4 and 2 frames). For loads higher than 51.1 MBytes/s the switch loses frames in such a way that the accepted throughput at a higher offered load can be less than at a lower one. This is why the measurement line curves back to a lower accepted throughput as the latency grows.

As the results show, the model with an input buffer size of four frames best matches the measurements. Figure 8.9 shows the measurement repeated with a frame size of 500 bytes. Figure 8.9(a) shows the accepted throughput against the average latency and Figure 8.9(b) shows the offered throughput against the lost frame rate. The results show very good agreement with the measurements. The full list of parameters used in modelling the Intel 550T switch is shown in Table 8.2.

Table 8.2: Model parameters for the Intel 550T Ethernet switch. The parameters obtained from the ping-pong measurement are marked with †. The parameters obtained from the streaming measurement are marked with * (the maximum bandwidth for 1500 byte frames is given). NA means not applicable.

P1 [frames]       4
P2 [frames]       1
P3 [MBytes/s]*    NA
P4 [MBytes/s]*    NA
P5 [MBytes/s]*    51.1
P6 [MBytes/s]     NA
P7 [MBytes/s]†    NA
P8 [MBytes/s]†    12.5
P9 [µs]†          NA
P10 [µs]†         5.0

Figure 8.9: The performance of the Intel 550T Fast Ethernet switch with random traffic (eight nodes, random destination addresses, exponential inter-packet gaps, 500 byte frames, offered load up to 50 MBytes/s; modelled with a four-frame input buffer). Model against measurements. (a) Accepted throughput against average latency; (b) offered traffic against lost frame rate.
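The lost frame rate plotted in Figure 8.9(b) is, with flow control off and a fixed frame size, just the difference between offered and accepted throughput expressed in frames per second. A small illustrative calculation with hypothetical numbers is shown below.

```python
def lost_frame_rate(offered_mbytes_s, accepted_mbytes_s, frame_bytes):
    """Frames lost per second, given offered and accepted throughput in MBytes/s."""
    lost_mbytes_s = max(0.0, offered_mbytes_s - accepted_mbytes_s)
    return lost_mbytes_s * 1e6 / frame_bytes

# Offering 60 MBytes/s of 500 byte frames against a ~51.1 MBytes/s ceiling
# loses of the order of 18 000 frames per second.
print(lost_frame_rate(60.0, 51.1, 500))
```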

8.3 Conclusions

The parameterised model has been validated. The model reflects the behaviour of the real switch with an accuracy of five to ten percent at loads below saturation. Thanks to its simplicity, larger networks can be modelled without the dramatic increase in modelling time that was observed when using the detailed model. The applicability of the parameterised model to a wide range of switches with different internal and hierarchical architectures has been demonstrated in [34].

8.4 Performance and parameters of contemporary Ethernet switches

In this section, we present some off-the-shelf Ethernet switches, their parameters for the purpose of modelling, and we identify issues of interest to ATLAS based on our experiences. Not all results are presented for all switches, for a number of reasons. Firstly, not all the switches made available to us had a configuration that allowed the full set of measurements to be done; for example, some switches were provided with only a single module or with Gigabit Ethernet ports only. Secondly, at the time some switches were available, we did not have the necessary equipment to test them fully. Thirdly, some switches were only available for a limited period of time, insufficient for the full set of measurements to be performed.

Switches tested

In Table 8.3, we present modelling parameters for the BATM Titan T4 switch in both Fast and Gigabit Ethernet configurations, the BigIron 4000, the Alteon 180, the Cisco 6509, the Cisco 4912G, the Cisco 4003, the Xylan OmniSwitchSR9 and the ARCHES switch.

• The BATM Titan T4 has a hierarchical architecture. A picture is shown in Figure 8.10. It can host any combination of up to four Fast or Gigabit Ethernet modules. A Fast Ethernet module has eight ports and a Gigabit Ethernet module has a single port. We discovered that early models of these switches were blocking for both Fast and Gigabit Ethernet modules. After discussions with the vendor, it became clear that this was due to limitations in the memory speeds used in the switches. In Table 8.3, the blocking nature of the switch is shown by parameters P3 and P7 in the Gigabit Ethernet configuration. In order to support the full Gigabit rate, P3 and P7 should be 125 MBytes/s (1000 Mbit/s), but instead they are both lower.

Table 8.3: Model parameters for various Ethernet switches: the BATM Titan T4 (Fast Ethernet), the BATM Titan T4 (Gigabit Ethernet), the BigIron 4000, the Alteon 180, the Cisco 6509, the Cisco 4912G, the Cisco 4003, the Xylan OmniSwitchSR9 and the ARCHES switch. For each switch the table lists the number of ports per module, the number of modules in the chassis and the parameters P1, P2 [frames], P3 to P8 [MBytes/s] and P9, P10 [µs]. The parameters obtained from the ping-pong measurement are marked with †, those obtained from the vendors with ‡, and those obtained from the streaming measurement with * (the maximum bandwidth for 1500 byte frames is given). NA means not applicable.

Figure 8.10: A picture of the BATM Titan T4.
Figure 8.11: The Foundry BigIron 4000 switch.

We performed a series of tests connecting multiple Titan T4 switches together via Fast and Gigabit Ethernet links. These showed no surprises in terms of latencies and throughputs; that is, the latencies grew linearly with an increasing number of switches between sender and receiver, and the throughput was limited by the speed of the connecting link. Measurements of multiple Fast Ethernet nodes transmitting to a single Gigabit Ethernet node on the T4 have been looked at in Section. Packet loss and frame prioritisation on the T4 have been discussed in Sections and respectively. VLANs were proved to work on the T4 by the performance of the loopback test in Section. Trunking and broadcast issues on the T4 are discussed below.

• The BigIron 4000, see Figure 8.11, has a hierarchical architecture that can host up to four modules. The switch as tested had two Gigabit Ethernet modules, each with eight Gigabit Ethernet ports. The performance of the BigIron 4000 going through the same module and through different modules is very similar. The fixed overhead in the frame latency is the same for inter-module and intra-module transfers (parameters P9 and P10 in Table 8.3); however, the byte-dependent overhead is slightly different (parameters P7 and P8 in Table 8.3). The frame loss for the BigIron 4000 has been looked at in Section. This is a highly configurable switch, to the extent that the user can even configure the rate of broadcasts and multicasts. In our experience, modern switches are becoming more configurable. This is a good thing for ATLAS since it allows more flexibility.
