ANALYTICAL MODEL AND PERFORMANCE ANALYSIS OF A NETWORK INTERFACE CARD. Abstract


ANALYTICAL MODEL AND PERFORMANCE ANALYSIS OF A NETWORK INTERFACE CARD

Naveen Cherukuri 1, Gokul B. Kandiraju 2, Natarajan Gautam 3, and Anand Sivasubramaniam 4

Abstract

One of the key concerns for practitioners and academicians is that there are almost no platforms based on analytical models for testing the impact of various architectural and design modifications for intelligent Network Interface Cards (NICs). Simulations are typically time-consuming, especially for experimenting with different scenarios and what-if analysis. In this research, we study the performance of a NIC called Myrinet developed by Myricom. We develop an open queueing network model to predict its performance. We compare the analytical results with the simulations. There are very few analytical models because of the enormous complexity posed by the performance-analysis problem. In particular, the problem is a combination of: (a) multi-class queueing network with class switching, (b) polling system with limited service discipline, and (c) finite-capacity queues with blocking. The above three issues have been treated only in isolation in the literature. However, the problem becomes much harder when all three issues are simultaneously present. One of the key contributions of this paper is an analytical approximation of this complex system. From an analytical modeling standpoint, we observe that making simplifying assumptions to analyze nodes that are not bottlenecks does not impact performance greatly. The main findings of this research are the bottlenecks of the queueing network, utilizations of the various nodes and performance measures such as the expected delay. The model as well as the findings can be used to test the performance impact of various enhancements to the operation of NICs.

Keywords: Network Interface Cards, Performance Analysis, Queueing Model, Simulation

1 82 Devonshire Street, #V7B, Boston MA 02109, naveen_ch@hotmail.com
2 Dept. of Computer Science and Engineering, Penn State Univ., University Park, PA 16802, kandraj@cse.psu.edu
3 Dept. of Industrial Engr., 310 Leonhard Bldg, Penn State Univ., University Park, PA 16802, ngautam@psu.edu (Corresponding Author)
4 Dept. of Computer Science and Engineering, Penn State Univ., University Park, PA 16802, anand@cse.psu.edu

1. Introduction

Distributed applications require rapid and reliable exchange of information across a network to synchronize operations and/or to share data. The performance and scalability of these applications depend upon an efficient communication facility. In order to connect computers to a network for facilitating communication, a network interface card (NIC) is necessary. The NIC is a computer circuit board or card that is installed in a computer. Personal computers and workstations on a local area network (LAN) typically contain a NIC specifically designed for the LAN transmission technology, such as Ethernet or token ring. NICs are also used to interconnect clusters of computers or workstations such that the cluster can be used for high performance or massively parallel computations. Although clusters are slowly replacing large supercomputers due to their low cost, one of the biggest stumbling blocks for clusters to reach the performance of supercomputers is that their NICs are inefficient. To address the inefficiencies of the NICs, three developments have been considered:

(i) Using a processor on the NIC and thereby making the NIC more intelligent. These second-generation intelligent NICs (including Myrinet [8], Fore Systems SA-200 [18], Giganet's cLAN [19], etc.) outperform traditional NICs (conventional Ethernet NICs). Myrinet is one of the most popular second-generation NICs in use today.

(ii) Moving the network interface much closer to the application (called Virtual Interface Architecture (VIA) [1]). The VIA uses a virtual interface mechanism to transfer the most common messages directly between memory and the NIC. The result is a substantial reduction in processing overhead along the communication paths that are critical to performance.

(iii) Removing the operating system from the critical path of communication. User-Level Networking (ULN) provides the user with direct access to the NIC. The operating system is used only to set up the protected channels, which can be accessed later during communication without the costs of crossing protection boundaries.
In this paper we study such intelligent VIA NICs that use ULN. Due to the rapid growth of ULN as a high-performance cost-effective solution with clusters, industry has also taken note of ULN's potential, and attempted to standardize it in the form of a VIA specification [1]. This specification was released by a group of companies together called the Virtual Interface Architecture Consortium (which includes Microsoft, Compaq, Intel). Hardware and software [20, 21, 22] implementations of VIA have also been developed and VIA is receiving a lot of attention both from industry and from academia. Hence, research is under way to further improve the efficiency of the NICs. However, there is a need for analytical models for NICs that researchers could use to quickly test design alternatives. In this paper, analytical models are derived for the Myrinet [8] NIC, keeping in mind that similar models can be derived for other network interfaces as well.

The literature on performance modeling for intelligent NICs is fairly limited. In [23], the NIC throughput is computed using an average case analysis by modeling the entire system at a much more macro-granularity level than what is considered here. In fact, [23] does not consider any details of the Myrinet NIC. The Myrinet NIC and software system in [24] is modeled at a meso-granularity level (i.e., in between the macro-granularity level in [23] and the micro-granularity level we have considered here) using a system of 2 queues such that the processor polls between queues at the software and the hardware layers, thereby incurring time during the context switch. However, since the entire NIC is modeled as a single-server queue in [24], it is not clear how various designs inside the NIC can be evaluated. The motivation for considering a micro-granularity level for modeling the Myrinet NIC in this paper is to test various NIC designs and evaluate the performance improvement across the NIC. In particular, a multi-station and multi-class open queueing network model is considered to capture the multitude of operations and queues inside the NIC.
The complexity of the problem is due to the fact that it is a combination of (a) multi-class queueing network with class switching, (b) polling system with limited service discipline, and (c) finite-capacity queues with blocking. However, by identifying the bottleneck node and modeling it accurately, and using approximations for the rest of the nodes, we are able to obtain system performance.

Networks of queues have proven to be useful models to analyze the performance of complex systems such as computer systems, switches, routers and communications networks [2-5]. This method has contributed to significant design decisions for performance improvements in various computer and communication systems. Analytical models are fast in obtaining the results and cost effective in implementation when compared to a simulation model or an experiment. Another important advantage of analysis using a network of queues is the flexibility to choose a wide range of operating parameters and obtain the performance measures with little effort. Simulations are developed for benchmarking and the results are compared with the performance measures obtained in the analytical model.

The rest of this paper is structured as follows. Section 2 deals with some preliminaries including a description of the VIA NICs as well as some results from queueing networks. Section 3 provides a detailed explanation of the various modeling aspects and the simplifications used in obtaining the analytical model. In Section 4, two simulation studies performed using a commercial package ARENA [6] are presented to justify the approximations. In Section 5, performance of the analytical model is compared against simulation studies. This paper concludes in Section 6 with a summary of the study, contributions from this research, and recommendations for future work.

2. Preliminaries

We first present a description of the Myrinet NIC and then describe some known results from queueing networks. They constitute some of the preliminaries that are necessary in order to describe this research.

2.1 The Myrinet NIC

Figure 1 [7] shows a Myrinet NIC. Myrinet is popular for deploying clusters because it provides high hardware transmission rates. The several hardware features that it provides make VIA implementations more efficient.

[Figure 1: Network Interface Card. Components shown: CPU, Host DMA (HDMA), Net Send DMA (NSDMA), Net Recv DMA (NRDMA) and Memory (SRAM), with control and data paths to and from the network.]

The NIC contains a processor called LANai, a Direct Memory Access (DMA) engine (represented as HDMA) which is used to transfer the data between the host memory and card buffer (SRAM), a DMA engine (represented as NSDMA) to transfer the data from SRAM onto the network, and another DMA engine (represented as NRDMA) to transfer data onto the SRAM from the network. From a modeling point of view, sending can be translated to appending a message to a queue in the card buffer and receiving can be translated to removing a message from the card buffer. A NIC provides an electromechanical attachment of a computer to a network. Under program control, a NIC copies data from memory to the network medium (transmission) and from the medium to memory (reception), and implements a unique destination for messages traversing the network. The Myrinet NIC comprises various physical components. A description of these components along with the critical data movements is provided in detail below.

Doorbells: Send or receive notification to the NIC by application processes is done in VIA by a mechanism called doorbells. It is through a doorbell that the NIC knows that there is work placed in the work queue. There are two sets of doorbells, one each for send and receive. When an application wants to send or receive a message, it creates a header for it (called a descriptor), makes it accessible to the NIC, and then rings a doorbell.

Descriptors: A descriptor is a data structure recognized by the NIC that describes a data movement request. It is organized as a list of segments. A descriptor comprises a control segment followed by an optional address segment and an arbitrary number of data segments. The data segments describe a communication buffer gather or scatter list for a NIC data movement operation. Descriptors contain all the information needed to process a request, such as the type of transfer to make, the status of the transfer and the queue information.

Direct Memory Access: There are situations in which data must be moved very rapidly to or from a device.
Interrupt processing of each data transfer would be awkward and slow. With all of the bookkeeping involved in handling an interrupt, data would probably be lost. Direct Memory Access, or DMA, solves this problem. It is a method for direct communication from peripheral to memory with no programming involved. DMA reduces CPU overhead by providing a mechanism for data transfers that do not require monitoring by the CPU. The data is moved to memory via the bus, without program intervention.

LANai and SRAM: A Myrinet host interface consists of two major components: the LANai chip and its associated SRAM memory. The LANai is a processor chip that controls the data transfer between the host and the network. Besides controlling the data transfer, the LANai is also responsible for automatic network mapping and monitoring the network's status. SRAM is a precious resource that hosts many queues. The size of the onboard SRAM ranges from 512 KB to 4 MB. The LANai communicates with the host's device drivers or user-level libraries through work queues residing in the SRAM.

HDMA: This is one of the most important entities of a NIC. Once a doorbell is detected, LANai processes the descriptor and then the corresponding data buffer. Allocating space for too many descriptors can be a waste of precious NIC buffer SRAM. The descriptors are thus kept in the host memory, and HDMA transfers them to the card buffer. Then LANai examines the descriptor and, in the case of send, the HDMA transfers the corresponding data onto SRAM. In the case of receive, the direction of transfer is reversed, i.e., the data is transferred from SRAM into the host memory.

NSDMA: NSDMA (Network Send DMA engine) is a DMA engine in the Myrinet NIC that facilitates the data transfer from SRAM onto the network. If NSDMA is idle and there is data on SRAM queued up to be sent onto the network, LANai programs NSDMA to pick it up from the queue (FIFO basis). Then the data goes through the network bus to the peripheral destination in the network.
NRDMA: NRDMA (Network Receive DMA engine) is a DMA engine in the Myrinet NIC that facilitates the data transfer from the network onto SRAM. LANai programs NRDMA to pick up a packet from the network. The destination id of an incoming message is extracted using this DMA engine and the receive descriptors are checked for a match. If there is a matching descriptor, then the data transfer up to the host

can be initiated using HDMA (depending on the availability). Else, a receive descriptor needs to be brought down before the data can be transferred.

Myrinet Control Program (MCP): The MCP (Myrinet Control Program) is the program that runs on the LANai chip on the host interface board. It is the MCP's job to transfer messages between the host and the network. LANai initiates the following operations that need to be performed by MCP.

- Poll doorbell queues for an application's send/receive notification.
- Transfer the descriptor associated with the doorbell from the host memory down onto the SRAM using HDMA.
- Transfer the data associated with a send descriptor from host memory to SRAM using HDMA.
- Transfer the packet out onto the network using NSDMA.
- Pick up a packet from the network using NRDMA.
- Transfer data from SRAM to the host memory using HDMA.
- Transfer completion information (of send/receive) to host memory using HDMA.

Sequence of Operations: LANai goes through these operations cyclically: polling the doorbell queue, polling the descriptor queue on SRAM and polling the data queue. In addition, it programs NSDMA and NRDMA to send and receive the data to and from the network respectively. LANai polls the doorbell queue and makes the doorbells available for HDMA to obtain the corresponding descriptors. Polled doorbells wait in a queue at HDMA to get serviced on an FCFS basis. They are processed by HDMA and the corresponding descriptors are stored in the descriptor queue on SRAM. The descriptors in this queue are polled by LANai, which makes them available for HDMA to obtain the corresponding data. In the case of a send descriptor, LANai initiates the transfer of data from the host memory onto the data queue on SRAM using HDMA. In the case of a receive descriptor, LANai initiates the transfer of data (if any) from the network queue at NRDMA to the data queue on SRAM using NRDMA. LANai polls the data queue and, if the polled data is of type send, it checks whether NSDMA is busy. If not, it initiates the transfer of send data from the SRAM data queue to NSDMA.
If the polled data is of type receive, it initiates the transfer of data from the SRAM data queue to host memory using HDMA.

2.2 The Queueing Network Analyzer

[Figure 2: An example of a Multi-Station and Multi-Class Open Queueing Network. Labels shown: Stations 1 to 4 and Classes 1 to 3.]

We now recapitulate some of the results for the multi-station and multi-class open queueing network using the QNA approach by Whitt [9,10]. Figure 2 is an example of a multi-station and multi-class open queueing network with four stations and three classes of traffic. It is provided to give a pictorial display of a multi-station and multi-class open queueing network. The following description forms the problem setting for the algorithm.

(1) There are $N$ service stations (nodes) in the open queueing network. The outside world is denoted by node 0 and the others by $1, 2, \ldots, N$.
(2) There are $m_i$ servers at node $i$ ($1 \le m_i \le \infty$), $1 \le i \le N$.
(3) The network has multiple classes of traffic, and class switching is not allowed.
(4) Service times of class $r$ customers at node $i$ are independent and identically distributed (iid) with mean $1/\mu_{i,r}$ and squared coefficient of variation (SCOV) $C^2_{S_{i,r}}$.
(5) The service discipline is First Come First Served (FCFS).
(6) There is infinite waiting room at each node.
(7) Externally, customers of class $r$ arrive at node $i$ according to a general inter-arrival time distribution with mean $1/\lambda_{0i,r}$ and SCOV $C^2_{A_{i,r}}$.
(8) When a customer of class $r$ completes service at node $i$, the customer joins the queue at node $j$ ($j \in [0, N]$) with probability $p_{ij,r}$.
(9) Utilization of node $i$ is the ratio of the mean arrival rate at node $i$ to the maximum possible service rate at node $i$. Being a ratio, it has no units.

2.2.1 Notation

The notation that is given here follows [11] and will be utilized for the decomposition algorithm to be presented later in this section.

$R$ : Total number of classes.
$\lambda_{ij,r}$ : Mean arrival rate from node $i$ to node $j$ of class $r$.
$\lambda_{i,r}$ : Mean arrival rate to node $i$ of class $r$ (or mean departure rate from node $i$ of class $r$).
$p_{ij,r}$ : Fraction of traffic of class $r$ that exits node $i$ and joins node $j$.
$\lambda_i$ : Mean arrival rate to node $i$.
$\rho_{i,r}$ : Utilization of node $i$ due to customers of class $r$.
$\rho_i$ : Utilization of node $i$.
$\mu_i$ : Mean service rate of node $i$.

The next five symbols are used to denote the squared coefficients of variation (SCOV) of different parameters.

$C^2_{A_{i,r}}$ : SCOV of class $r$ inter-arrival times into node $i$.
$C^2_{A_i}$ : SCOV of inter-arrival times into node $i$.
$C^2_{D_i}$ : SCOV of inter-departure times from node $i$.
$C^2_{S_i}$ : SCOV of service time of node $i$.
$C^2_{ij,r}$ : SCOV of time between two customers of class $r$ going from node $i$ to node $j$.

2.2.2 Decomposition Algorithm

The network is broken down into individual nodes, and analysis is performed on each node $i$ as an independent GI/G/m queue with $m_i$ servers and with multiple classes. The required parameters are mean arrival and service rates as well as SCOV of the inter-arrival and service times. Obtaining these

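will be hard when multiple streams are merged (superposition) or when traffic flows through a node (flow) or when a single stream is forked into multiple streams (splitting). The algorithm supposes that just before entering a queue, superposition takes place, which results in one stream. Likewise, it assumes that there is only one stream that gets split into multiple streams [12, 13]. There are 3 basic steps in the decomposition algorithm.

Step 1: Mean arrival rates, utilizations and aggregate service rate parameters are calculated using the given data in the following way.

$$\lambda_{ij,r} = \lambda_{i,r}\, p_{ij,r} \qquad (1)$$

$$\lambda_{i,r} = \lambda_{0i,r} + \sum_{j=1}^{N} \lambda_{j,r}\, p_{ji,r} \qquad (2)$$

$$\lambda_i = \sum_{r=1}^{R} \lambda_{i,r} \qquad (3)$$

$$\rho_{i,r} = \frac{\lambda_{i,r}}{m_i\, \mu_{i,r}} \qquad (4)$$

$$\rho_i = \sum_{r=1}^{R} \rho_{i,r} \qquad (5)$$

The condition for stability is $\rho_i < 1$.

$$\mu_i = \frac{1}{\sum_{r=1}^{R} \dfrac{\lambda_{i,r}}{\lambda_i}\, \dfrac{1}{m_i\, \mu_{i,r}}} = \frac{\lambda_i}{\rho_i} \qquad (6)$$

$$C^2_{S_i} = -1 + \sum_{r=1}^{R} \frac{\lambda_{i,r}}{\lambda_i} \left( \frac{\mu_i}{m_i\, \mu_{i,r}} \right)^{2} \left( C^2_{S_{i,r}} + 1 \right) \qquad (7)$$

Step 2: The coefficient of variation of inter-arrival times at each node is calculated iteratively by initializing $C^2_{ij,r}$ and performing superposition, flow and splitting cyclically.

(i) Superposition:

$$C^2_{A_{i,r}} = \frac{1}{\lambda_{i,r}} \sum_{j=0}^{N} C^2_{ji,r}\, \lambda_{j,r}\, p_{ji,r} \qquad (8)$$

$$C^2_{A_i} = \frac{1}{\lambda_i} \sum_{r=1}^{R} C^2_{A_{i,r}}\, \lambda_{i,r} \qquad (9)$$

(ii) Flow:

$$C^2_{D_i} = 1 + \frac{\rho_i^2 \left( C^2_{S_i} - 1 \right)}{\sqrt{m_i}} + \left( 1 - \rho_i^2 \right) \left( C^2_{A_i} - 1 \right) \qquad (10)$$

(iii) Splitting:

$$C^2_{ij,r} = 1 + p_{ij,r} \left( C^2_{D_i} - 1 \right) \qquad (11)$$

The splitting formula is exact if the departure process is a renewal process. The expressions for flow and superposition are approximations.

Step 3: Treating each queue independently, performance measures are obtained as follows. Choose $\alpha_{m_i}$ such that

$$\alpha_{m_i} = \begin{cases} \dfrac{\rho_i^{m_i} + \rho_i}{2} & \text{if } \rho_i > 0.7 \\[2mm] \rho_i^{\frac{m_i + 1}{2}} & \text{if } \rho_i < 0.7 \end{cases}$$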

The mean waiting time for class $r$ customers in the queue (not including service time) at node $i$ is approximately

$$W_{iq} \approx \frac{\alpha_{m_i}}{\mu_i} \cdot \frac{1}{1 - \rho_i} \cdot \frac{C^2_{A_i} + C^2_{S_i}}{2 m_i} \qquad (12)$$

The mean waiting time for class $r$ customers at node $i$ (including service time) is given by

$$W_i = W_{iq} + \frac{1}{\mu_i} \qquad (13)$$

The mean queue length for class $r$ customers at node $i$ (without customers in service) is given by

$$L_{iq} = \lambda_i\, W_{iq} \qquad (14)$$

The mean number of customers at node $i$ (including the customers in service) is given by

$$L_i = \lambda_i\, W_i \qquad (15)$$

The performance measures presented in Section 3 are mean queue length ($L_{iq}$) and utilization ($\rho_i$) at each node. Note that both are numbers and do not have units.

3. Analytical Model

[Figure 3: Three-station and three-class open queueing network model for NIC. Stations shown: LANai, HDMA and NSDMA. Classes shown: Class 1 (doorbells), Class 2 (descriptors) and Class 3 (data).]

We model the performance of the NIC of a send station, where the data flow is from the station to the network. In essence, the hardware features of the NIC consisting of only NSDMA are modeled in an open queueing network. The analysis provided for the multi-station and multi-class open queueing network in Section 2.2 is used to obtain various performance measures. This simplified version of the NIC has send doorbells, send descriptors and send data as the three classes of traffic in the network. The servers in the network correspond to LANai, HDMA and NSDMA and henceforth will be referred to as nodes. Figure 3 shows a three-station and three-class open queueing network model for the performance analysis of the NIC.
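As an illustration of how the decomposition algorithm of Section 2.2.2 can be evaluated numerically, the following Python sketch implements Steps 1 to 3 (equations (1) to (15)) for a generic open network. It is a minimal sketch and not the authors' implementation: the function name `qna`, the argument layout, the fixed-point iteration used for equation (2), the fixed iteration count, and the choice to apply the second branch of the α rule when ρ equals 0.7 exactly are all assumptions made for this example. Nodes are indexed from 0 in the code, and external arrivals are passed separately instead of through a node 0.

```python
import math


def qna(lam0, P, mu, cs2, ca2_ext, m, iters=200):
    """Approximate per-node measures for an open multi-class network.

    lam0[r][i]    external arrival rate of class r at node i
    P[r][i][j]    probability that class r goes from node i to node j
    mu[r][i]      service rate of class r at node i (per server)
    cs2[r][i]     SCOV of class r service times at node i
    ca2_ext[r][i] SCOV of external class r inter-arrival times at node i
    m[i]          number of servers at node i
    Assumption: every node receives some traffic.
    """
    R, N = len(lam0), len(m)

    # Step 1, eq. (2): traffic equations solved by fixed-point iteration.
    lam = [row[:] for row in lam0]
    for _ in range(iters):
        lam = [[lam0[r][i] + sum(lam[r][j] * P[r][j][i] for j in range(N))
                for i in range(N)] for r in range(R)]

    lam_i, rho, mu_i, cs2_i = [], [], [], []
    for i in range(N):
        classes = [r for r in range(R) if lam[r][i] > 0]
        total = sum(lam[r][i] for r in classes)                     # eq. (3)
        util = sum(lam[r][i] / (m[i] * mu[r][i]) for r in classes)  # eqs. (4), (5)
        if util >= 1:
            raise ValueError(f"node {i} is unstable (rho = {util:.3f})")
        rate = total / util                                         # eq. (6)
        scov = -1 + sum((lam[r][i] / total)
                        * (rate / (m[i] * mu[r][i])) ** 2
                        * (cs2[r][i] + 1) for r in classes)         # eq. (7)
        lam_i.append(total)
        rho.append(util)
        mu_i.append(rate)
        cs2_i.append(scov)

    # Step 2: iterate superposition, flow and splitting, eqs. (8)-(11).
    c2 = [[[1.0] * N for _ in range(N)] for _ in range(R)]  # c2[r][i][j]
    ca2 = [1.0] * N
    for _ in range(iters):
        for i in range(N):
            weighted = 0.0
            for r in range(R):
                # External stream (the j = 0 term) plus internal streams.
                weighted += ca2_ext[r][i] * lam0[r][i]
                weighted += sum(c2[r][j][i] * lam[r][j] * P[r][j][i]
                                for j in range(N))
            ca2[i] = weighted / lam_i[i]                            # eqs. (8), (9)
        cd2 = [1 + rho[i] ** 2 * (cs2_i[i] - 1) / math.sqrt(m[i])
               + (1 - rho[i] ** 2) * (ca2[i] - 1) for i in range(N)]  # eq. (10)
        c2 = [[[1 + P[r][i][j] * (cd2[i] - 1) for j in range(N)]
               for i in range(N)] for r in range(R)]                # eq. (11)

    # Step 3: treat each node as an independent GI/G/m queue.
    results = []
    for i in range(N):
        if rho[i] > 0.7:
            alpha = (rho[i] ** m[i] + rho[i]) / 2
        else:
            alpha = rho[i] ** ((m[i] + 1) / 2)
        wq = (alpha / mu_i[i]) / (1 - rho[i]) * (ca2[i] + cs2_i[i]) / (2 * m[i])
        results.append({"rho": rho[i], "Wq": wq, "W": wq + 1 / mu_i[i],
                        "Lq": lam_i[i] * wq, "L": lam_i[i] * (wq + 1 / mu_i[i])})
    return results


# Hypothetical two-node tandem with one class and exponential service times.
for node, measures in enumerate(qna(lam0=[[0.5, 0.0]],
                                    P=[[[0.0, 1.0], [0.0, 0.0]]],
                                    mu=[[1.0, 2.0]],
                                    cs2=[[1.0, 1.0]],
                                    ca2_ext=[[1.0, 1.0]],
                                    m=[1, 1])):
    print(node, measures)
```

The numbers in the usage example are illustrative only and are not taken from the NIC model. For a single-server node with Poisson arrivals and exponential service, equation (12) reduces to the exact M/M/1 waiting time, which gives a simple sanity check on the sketch.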

3.1 Description and Analysis

The most critical assumption that is required for the analysis in Section 2.2 is that class switching is not allowed. Strictly speaking, there is class switching here in the form of Class 1 to Class 2 and Class 2 to Class 3. Doorbells are processed by HDMA to produce descriptors, and descriptors in turn are processed by HDMA to obtain data. Both processes involve class switching: doorbells to descriptors and descriptors to data. If the performance results can be confirmed with simulations, it is a great gain in terms of applicability of the analytical model. Deviations can be investigated, and one of the factors to which they can be attributed will be class switching. However, the other assumptions made for the queueing network analysis in Section 2.2 are valid, viz.

(1) Service times of class r customers are independent and identically distributed (iid).
(2) The service discipline is FCFS, which is the same for the NIC queues.
(3) There is infinite waiting room at each node.

The waiting room at each node can be translated to the memory space available for each type of message on the NIC, and in reality there is a limitation. One design consideration is that there are always sufficient doorbells to accommodate the data transfer, to make sure that practically no data is dropped. It is thus safe to assume that there is infinite waiting room at each node, which allows this generic design methodology to be applied to obtain the performance measures more quickly and conveniently than lengthy and costly simulations.

We now describe the nodes in detail. Node 1 (LANai) serves all the classes of traffic, namely doorbells, descriptors and data. The service time for each class of traffic is given in Table 2. All three classes are queued up in different queues in the NIC. LANai polls these queues in an order determined by the Myrinet Control Program (MCP). For the purpose of mathematical analysis, the physical location of the queue does not matter as long as the ordering of polling is consistent with the MCP.
A survey of the existing literature on polling models with deterministic service times [14-16] shows that an exact analysis would be extremely difficult for a polling model when the service times are deterministic. Moreover, it is an asymmetric polling system with feedback, infinite buffers and a finite switchover time, gated service and cyclic service order. All these conditions make the problem computationally intractable and would shift the focus away from the main aim of obtaining the performance measures of the NIC and validating them through extensive simulations. Instead, a different way of analytical modeling is adopted for node LANai, to which the existing methods of analysis can be applied.

The performance of LANai is mathematically approximated as follows. LANai is a node that serves a single queue, which contains three different classes of traffic. The traffic corresponding to different classes is thus pooled into a single queue, and the queue is served on a First Come First Served (FCFS) basis. Here lies the basic assumption of the modeling: for the purpose of obtaining the performance measures analytically under the existing parameters of operation, pooling the class traffic is immaterial. This assumption can be verified with simulations. With the rest of the design kept the same, the following two simulation models can be compared to obtain the effect of pooling the traffic of all classes into a single queue at node LANai.

(1) Simulation model 1, in which all the classes are queued up for LANai in the same queue.
(2) Simulation model 2, which has three separate queues for the three classes of traffic at node LANai.

The service times for different classes are given in Table 2.

Node 2 (HDMA) serves the traffic comprising Classes 1 and 2, namely, doorbells and descriptors. It models the operation of HDMA. Classes 1 and 2 arrive into an FCFS queue according to a general arrival process that is equivalent to the departure process from Node 1. HDMA service effectively transforms the class of messages.
For example, HDMA service of a doorbell assigns a descriptor and places it in the SRAM, which in this model is Node 1. It is modeled as changing a Class 1 message into a Class 2 message upon departure from Node 2. Thus, after service completion, Class 1 becomes Class

2 and goes into SRAM (Node 1). Likewise, Class 2 becomes Class 3 and resides in SRAM to be picked up by NSDMA. The service times are given in Table 2. The service time for Class 2 messages is more than double the service time for Class 1 messages. Servicing messages of Class 1 in hardware terms is equivalent to obtaining the descriptor contained in the doorbell. This descriptor provides the information about the data that is picked by the NIC and put on the network. The descriptor is processed by HDMA to get the memory location details, and the data is transferred onto the temporary memory to be picked up by NSDMA. This is the hardware-level explanation for the analytical class switching from 2 to 3.

Node 3 (NSDMA) has a queue that receives Class 3 messages, namely, data. This node models the operation of the NSDMA. Class 2 messages, after getting served at Node 2 (which models the operation of HDMA), are put on the SRAM of the NIC. In case there are multiple data, they are queued up in the SRAM, and LANai processes them on an FCFS basis. The hardware design of the Myrinet NIC does not allow queueing of messages at NSDMA, i.e., the node is effectively a G/D/1/1 queue with blocking. As part of the Myrinet Control Program (MCP), LANai checks the status of the NSDMA and the SRAM queue for any data available for transfer. The entire operation of putting the data on the network depends on the status of the NSDMA and the availability of the data to be transferred onto the network. In all of the scenarios explained below, data resides in the SRAM queue, and the data that is to be transferred onto the network is the first data in the queue.

Scenario 1: Data to be transferred onto the network is present in SRAM, and NSDMA is busy.
Result of MCP: There are no arrivals to NSDMA while a service is in progress, since LANai does not pass a message to it if the latter is still busy.

Scenario 2: Data to be transferred onto the network is present in SRAM, and NSDMA is idle.
Result of MCP: LANai passes a message to NSDMA by programming it to pick up the message from SRAM.
Scenario 3: Data to be transferred onto the network is not present in SRAM, and NSDMA is busy. Result of MCP: LANai does not pass any message and continues with the next operation under MCP.

Scenario 4: Data to be transferred onto the network is not present in SRAM, and NSDMA is idle. Result of MCP: Nothing happens in this scenario and LANai proceeds to the next operation under MCP.

Table 1 summarizes all of the scenarios and the outcome for each of them.

| Scenario | Availability of Data | Status of NSDMA | Programming of NSDMA by LANai |
|---|---|---|---|
| 1 | Available | Busy | No |
| 2 | Available | Idle | Yes |
| 3 | Not Available | Busy | No |
| 4 | Not Available | Idle | No |

Table 1 Four different scenarios for programming of NSDMA by LANai

These scenarios have to be captured analytically to be incorporated in our model. It is essentially a status checking process by LANai. The modeling of NSDMA thus depends on some approximations that are stated here. One of the important conditions that needs to be satisfied for any network to be modeled as a multi-station and multi-class open queueing network is the availability of infinite waiting space in the queues at each of its nodes. For analytical purposes, an attempt is made here to model NSDMA as a node that has infinite space. The network is designed such that the service time for messages of Class 3 at Node 1 is an average value that takes into consideration all four scenarios explained above. In reality, LANai can program NSDMA only in Scenario 2, and the time that it takes to program NSDMA to pick up the data from SRAM is 10 microseconds. A way of getting around this problem is to estimate the probability of the occurrence of Scenario 2, multiply it by the time that is required to program NSDMA (10 microseconds), and use the result as the service time for messages of Class 3 at Node 1. This provides an average service time for servicing the traffic corresponding to Class 3 (data) by LANai. Each time, LANai services data with an approximated service time, which reflects the probability of programming of NSDMA by LANai. The next paragraph proposes and explains in detail the approximation that is used to achieve the result stated above.

Let t be the service time for a data message at node NSDMA. Let λ be the arrival rate of data traffic at NSDMA. These two values are available from Table 2. From the model, λ is the arrival rate chosen for the doorbells. The service time t (52.6887 microseconds) can be obtained from Table 2. Therefore,

$$P(\text{NSDMA is busy}) = \lambda t, \qquad P(\text{NSDMA is idle}) = 1 - \lambda t.$$

When NSDMA is idle, a message in the data queue on the SRAM might or might not be present. Assuming equal probability,

$$p = P(\text{NSDMA is idle and a message is present in SRAM}) = 0.5\,(1 - \lambda t).$$

Note that this assumption is made only for the analytical model. It does not affect the simulations. Table 2 uses p in the calculation of the estimated service time of data messages at node LANai. The average service time for the data messages at LANai is approximately the product of the probability that NSDMA is idle and a message is present in SRAM and the time that it takes for LANai to program NSDMA. Therefore, the service time for messages of Class 3 at Node 1 is

$$10 \times 0.5\,(1 - \lambda t) \text{ microseconds},$$

and the service rate for messages of Class 3 at Node 1 is

$$\frac{1}{10 \times 0.5\,(1 - \lambda t)} \text{ per microsecond}.$$

This value is used to compute the aggregate service rate at Node 1. Mean arrival rates and aggregate service rates are available from Table 2 at all of the nodes for the multi-station and multi-class open queueing network algorithm to calculate the utilizations at each node. Each class has a deterministic service time in each queue, namely LANai, HDMA and NSDMA, in accordance with the following table.
| Class | Node 1 (LANai) | Node 2 (HDMA) | Node 3 (NSDMA) |
|---|---|---|---|
| 1 (doorbells) | 22 | 21 | N/A |
| 2 (descriptors) | 0.12 | 68.3154 | N/A |
| 3 (data) | 10p | N/A | 52.6887 |

Table 2 Service times in microseconds measured on a Myrinet NIC

In Table 2, if a node does not serve traffic of a class, the term N/A (Not Applicable) is used to indicate so. All the values are in microseconds. The MCP program running on the NIC cycles through all of the destinations, copies the message data to the NIC buffer, and sends the data in packets to all destinations in turn. In order to deal efficiently with variable-size fragments, packets do not correspond to fragments, but are fixed-size blocks containing segments of fragments. Fragments may span block boundaries, and a block can contain one or more segments, depending on fragment sizes. The receiving NIC in the cluster network reassembles these segments back into fragments and adds them to the message queue. To maintain synchronization, the sender NICs always emit packets, even when there is not enough data available for a particular destination, in which case the block contains empty segments. Thus, each packet is a fixed-size block containing segments of fragments. The values in Table 2 assume a block size of 4 Kbytes, and the corresponding transfer times at the DMA engines are computed [17].

Table 2 provides details about service times of various classes of traffic at each node. Arrival rates for descriptors and data are the same as the arrival rate for doorbells. Various numerical values are chosen for the arrival rate of doorbells. The squared coefficient of variation (SCOV) for the service times at each node for each class is initialized in conjunction with the type of the service. In the case of deterministic service times, as in the present analytical model for the NIC, all the values $C^2_{S_{i,r}}$ for i = 1, 2 and 3 and r = 1, 2 and 3 are initialized to zero. All these values are used in the decomposition algorithm in Section 2.2, and performance measures at each node are obtained.
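The Scenario 2 approximation and the service times of Table 2 can be sketched numerically as follows. This is a minimal illustration, not the decomposition algorithm of Section 2.2: it assumes that the utilization of a node is the common arrival rate λ multiplied by the sum of the per-class service times at that node, and all function and constant names are hypothetical.

```python
# Service times from Table 2, in microseconds.
T_DOORBELL_LANAI = 22.0
T_DESCRIPTOR_LANAI = 0.12
T_PROGRAM_NSDMA = 10.0     # time for LANai to program NSDMA (Scenario 2)
T_DOORBELL_HDMA = 21.0
T_DESCRIPTOR_HDMA = 68.3154
T_DATA_NSDMA = 52.6887     # t in the text
P_DATA_PRESENT = 0.5       # equal-probability assumption from the text


def scenario2_probability(lam):
    """p = 0.5 * (1 - lam * t): NSDMA idle and data present in SRAM."""
    return P_DATA_PRESENT * (1.0 - lam * T_DATA_NSDMA)


def class3_service_time_at_lanai(lam):
    """Approximate Class 3 service time at Node 1: 10 * p microseconds."""
    return T_PROGRAM_NSDMA * scenario2_probability(lam)


def utilizations(lam):
    """Return (LANai, HDMA, NSDMA) utilizations.

    Assumed form: every class arrives at rate lam, so the utilization of a
    node is lam times the sum of its per-class service times.
    """
    lanai = lam * (T_DOORBELL_LANAI + T_DESCRIPTOR_LANAI
                   + class3_service_time_at_lanai(lam))
    hdma = lam * (T_DOORBELL_HDMA + T_DESCRIPTOR_HDMA)
    nsdma = lam * T_DATA_NSDMA
    return lanai, hdma, nsdma


if __name__ == "__main__":
    for lam in (0.00273, 0.01100):
        print(lam, [round(u, 4) for u in utilizations(lam)])
```

Under these assumptions, the values for λ = 0.00273 and λ = 0.01100 agree with the analytical utilizations reported in Table 8 to within about 0.0002.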

3.2 Summary of the analytical model

Section 3.1 contained a detailed description of the analytical model proposed for obtaining the performance of the NIC. Polling by LANai, programming of NSDMA by LANai, and class switching from doorbells to descriptors and from descriptors to data make it impossible to apply the decomposition algorithm described in Section 2.2 directly. Several assumptions and simplifications are therefore made in arriving at the analytical model of the NIC. The effects of all of the simplifications are expected to be observable and need to be verified with extensive simulations. The most notable simplifications are:

(1) Mathematical approximation for polling by LANai: In a NIC, LANai polls three queues, namely, doorbells, descriptors, and data. In the analytical model, LANai is a node with a single queue with multiple-class traffic corresponding to doorbells, descriptors, and data. This approximation is required to be able to apply the decomposition algorithm described in Section 2.2.

(2) Mathematical approximation for programming of NSDMA by LANai: In a NIC, NSDMA is a node with zero waiting space, and the service of data traffic at NSDMA depends on the four scenarios outlined in Section 3.1. Analytically, NSDMA is modeled as a node with a single queue and infinite waiting space. This approximation is required to be able to apply the decomposition algorithm described in Section 2.2.

(3) Estimated probability for Scenario 2: An approximation is used in estimating the probability of Scenario 2, i.e., that there is data available in SRAM when NSDMA is idle. This approximation is needed to obtain an estimated service time for data traffic at node LANai and to model the NSDMA station as having a single queue with infinite waiting space. Analytical results will be compared with the simulation results; the analytical results use the estimated probability of 0.5, whereas the simulations do not.

(4) Class switching: There is class switching involved in the network. After getting served at node HDMA, doorbells are converted to descriptors. Similarly, after getting served at node HDMA, descriptors are converted to data.
The decomposition algorithm explained in Section 2.2 can be applied only when class switching is not present in the queueing network.

Section 3.3 discusses the numerical results to test the above simplifications and abstractions.

3.3 Numerical Results

In this section, a detailed description of numerical results using the proposed multi-station, multi-class open queueing network model is presented for the performance analysis of a VIA NIC. The results are computed for six different test cases. For each case, an arrival rate for the doorbells is chosen in such a way that the six cases comprise a range of traffic intensities. Once the arrival rate of doorbells is chosen, the arrival rate of descriptors and data will be the same as the arrival rate of doorbells because Classes 2 and 3 are essentially the transformed versions of Class 1 at Node 2. Let this arrival rate be λ. As explained in the previous section, an estimated probability of 0.5 is proposed for the probability of data being available when NSDMA is idle. Since the service times are deterministic, $C^2_{S_{i,r}}$ for i = 1, 2 and 3 and r = 1, 2 and 3 is zero.

Figure 4 shows the utilizations of all three nodes under various arrival rates. Note that the utilization values do not have units. The utilization of node i is the ratio of the mean arrival rate at node i to the maximum possible service rate at node i. Utilizations are computed using (6) in Section 2.2. Utilization results provide valuable information about the message loads that each node is subjected to and are very useful in finding the bottlenecks in the system, which forms a crucial part of queueing network analysis. Utilizations of the nodes increase with the increase in arrival rate. From Figure 4, at any given arrival rate, HDMA is the node with the highest utilization, and LANai is the node with the lowest utilization. The utilization of HDMA increases linearly with the arrival rate and will be 1 when the mean arrival rate at node HDMA is equal to the mean service rate of node HDMA. Compared to it, the utilizations of the other two nodes are considerably less. Thus, HDMA is the bottleneck node in the NIC. Hence, significant improvements in the performance of the NIC can be obtained by improving the performance of HDMA rather than by improving the performance of the other nodes. For the purpose of analytical approximations, it may be necessary to accurately model only the bottleneck node.

[Figure 4: Utilizations of the nodes using the analytical model. Utilization versus arrival rate for LANai, HDMA, and NSDMA.]

| Arrival Rate λ | Queue Length at LANai | Queue Length at HDMA | Queue Length at NSDMA |
|---|---|---|---|
| 0.00273 | 0.0059 | 0.0480 | 0.0133 |
| 0.00493 | 0.0191 | 0.1922 | 0.0378 |
| 0.00786 | 0.0486 | 0.8007 | 0.1006 |
| 0.00900 | 0.0642 | 1.5285 | 0.1384 |
| 0.01079 | 0.0940 | 11.2929 | 0.2250 |
| 0.01100 | 0.0980 | 24.1981 | 0.2383 |

Table 3 Mean queue lengths at the nodes using the analytical model

Based on the multi-station and multi-class open queueing network analysis in Section 2.2, Table 3 provides the mean queue lengths at the three nodes for different arrival rates. Mean queue lengths are computed using (14) in Section 2.2. Note that mean queue length is a number and does not have units. Mean queue lengths of the nodes increase with the increase in the arrival rate. From Table 3, the length of the queue for HDMA increases rapidly with the arrival rate and would approach infinity as the queue at HDMA becomes unstable. The percentage increase in the queue length of HDMA is very large compared to the other nodes since it is the bottleneck node in the network. This makes sense because HDMA is the node with the highest utilization in the network. The queue lengths at LANai and NSDMA are comparatively insignificant. Mean queue lengths depend upon the mean service times and the mean arrival rate at each node, and the assumptions summarized in Section 3.2 are not the reasons for the insignificance of the mean queue lengths of LANai and NSDMA.

4. Simulations

This section deals with the simulation models that are developed to validate the analytical model. The simulations are performed using a commercial-off-the-shelf simulation software package, ARENA [6]. It is crucial to note that two different simulation models are designed to mimic the performance of a NIC. Both simulation models are designed with zero waiting space at the NSDMA server (i.e., they permit blocking and LANai checks). The first simulation model depicts the architecture of a NIC in which all the messages are queued up in a single queue for LANai, and it serves the messages in the order of arrival (FCFS). This is identical to the scenario in the analytical model with respect to the operation of LANai. The main purpose is to verify the accuracy of the analytical model. The second simulation model depicts the architecture of a NIC in which LANai polls different queues corresponding to various components of the NIC. This is a software simulation of the exact NIC operation, so that the analytical model can use it as a benchmark for comparison. The performance measures (mean queue length and utilization at each node) for both simulation models are obtained and compared for accuracy. The assumption in the analytical model that pooling the messages into a single queue for LANai is an acceptable simplification of the actual Myrinet NIC is checked with the simulations. We first present the two simulation models (Sections 4.1 and 4.2) and then compare the models (Section 4.3).

4.1 First Simulation Model

The model has 3 nodes, one corresponding to each of LANai, HDMA and NSDMA. The three classes of traffic are doorbells, descriptors, and data. LANai is modeled as a node that serves all three types of messages pooled into a single queue with infinite waiting space. HDMA is modeled as a node that serves a queue that consists of two types of messages: doorbells and descriptors. The queue has infinite waiting space. The performance of NSDMA is modeled in exactly the same way as in the NIC. The NSDMA node processes messages of type data but does not serve them from a queue. If NSDMA is free and if there is a message of type data available, LANai programs NSDMA to pick up the data, and the service for that particular message is started at node NSDMA. A check for the availability of the resource NSDMA is placed in the simulation model to detect this scenario. If NSDMA is busy, the message will wait in the queue for LANai.
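The resource check described above is the same decision that Table 1 enumerates: LANai programs NSDMA only when data is available and NSDMA is idle. A minimal sketch of that predicate follows; the function name is hypothetical and is not taken from the MCP or from the ARENA model.

```python
def lanai_programs_nsdma(data_available: bool, nsdma_idle: bool) -> bool:
    """Return True only in Scenario 2 (data available and NSDMA idle)."""
    return data_available and nsdma_idle


# Rows of Table 1 as (data_available, nsdma_idle), Scenarios 1 to 4.
SCENARIOS = [
    (True, False),   # Scenario 1: available, busy
    (True, True),    # Scenario 2: available, idle
    (False, False),  # Scenario 3: not available, busy
    (False, True),   # Scenario 4: not available, idle
]

if __name__ == "__main__":
    for number, (data, idle) in enumerate(SCENARIOS, start=1):
        outcome = "Yes" if lanai_programs_nsdma(data, idle) else "No"
        print(f"Scenario {number}: program NSDMA? {outcome}")
```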
This first simulation model contains one mathematical approximation that is used in the analytical model, i.e., LANai serves messages from a single queue into which all three types of messages are pooled. Referring to Section 3.2 regarding the assumptions made in the analytical model, modeling the performance of NSDMA exactly as in the NIC removes the necessity of using the estimated probability 0.5.

4.2 Second Simulation Model

This simulation model captures the exact behavior of the NIC. Building on the first simulation model discussed in Section 4.1, it removes the mathematical approximation that LANai serves a single queue containing multiple-class traffic. Instead, it uses three different queues: one for each of doorbells, descriptors, and data. Node LANai now serves these queues in cyclic order, and within each queue the order of service is FCFS. The service times are the same as those given in Table 2. The second simulation model is the only model among the three that provides information about the lengths of the queues on SRAM corresponding to doorbells, descriptors, and data (see Table 4).

| Arrival Rate λ | Doorbell Queue Length | Descriptor Queue Length | Data Queue Length |
|---|---|---|---|
| 0.00273 | 0.0024 | 0.0022 | 0.0019 |
| 0.00493 | 0.0084 | 0.0076 | 0.0062 |
| 0.00786 | 0.0244 | 0.0216 | 0.0166 |
| 0.00900 | 0.0337 | 0.0293 | 0.0224 |
| 0.01079 | 0.0530 | 0.0444 | 0.0344 |
| 0.01100 | 0.0554 | 0.0464 | 0.0360 |

Table 4 Mean queue lengths at the nodes using the second simulation model
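The cyclic polling discipline of Section 4.2 can be illustrated with a small sketch. It is not the ARENA model: it ignores arrivals, service times, and the NSDMA status check, and it assumes that LANai serves at most one message per visit to a queue, which the text does not specify. The function name is hypothetical.

```python
from collections import deque


def cyclic_service_order(doorbells, descriptors, data):
    """Return the order in which a cyclic poller would serve waiting messages.

    Assumption: one message is served per visit to a non-empty queue, and
    messages within a queue are served in FCFS order.
    """
    queues = [deque(doorbells), deque(descriptors), deque(data)]
    order = []
    while any(queues):            # an empty deque is falsy
        for queue in queues:      # visit the queues in a fixed cyclic order
            if queue:
                order.append(queue.popleft())
    return order


if __name__ == "__main__":
    print(cyclic_service_order(["db1", "db2"], ["desc1"], ["data1", "data2", "data3"]))
```

By contrast, the first simulation model would serve the same messages strictly in their order of arrival from one pooled queue.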

4.3 Comparison of the Two Simulation Models

The difference between the two simulation models is the modeling of the performance of LANai. The first simulation model has LANai serving all types of messages pooled into a single queue on an FCFS basis. The second simulation model consists of three separate queues, one for each of the doorbells, descriptors and data. The two simulation models are compared in order to verify the assumption of pooling the different SRAM queues into a single queue. If the results show a match, modeling the NIC analytically using a multi-station and multi-class open queueing network is justified.

Table 5 compares the two simulation models with respect to the utilizations of the different nodes. The utilizations of HDMA and NSDMA are the same in both simulation models to within 0.0002. In the case of node LANai (the only node where the two simulation models differ), the utilizations are nearly the same, with the maximum error equal to 1.38% at the largest arrival rate among the six numerical values chosen. The mathematical approximation of node LANai had practically no effect on the utilization values of the nodes in the network. For obtaining the performance measures of the network, this is a very useful result: it supports the assumption that, for the given parameters of the Myrinet NIC, LANai serving a single queue in which all the types of messages are pooled is equivalent to LANai serving the different queues that are present in SRAM.
| Arrival Rate λ | Utilization of LANai in model 1 | Utilization of LANai in model 2 | Utilization of HDMA in model 1 | Utilization of HDMA in model 2 | Utilization of NSDMA in model 1 | Utilization of NSDMA in model 2 |
|---|---|---|---|---|---|---|
| 0.00273 | 0.0875 | 0.0875 | 0.2430 | 0.2430 | 0.1433 | 0.1433 |
| 0.00493 | 0.1588 | 0.1586 | 0.4393 | 0.4393 | 0.2591 | 0.2591 |
| 0.00786 | 0.2561 | 0.2551 | 0.7015 | 0.7015 | 0.4138 | 0.4138 |
| 0.00900 | 0.2954 | 0.2935 | 0.8032 | 0.8032 | 0.4738 | 0.4738 |
| 0.01079 | 0.3609 | 0.3564 | 0.9635 | 0.9637 | 0.5683 | 0.5683 |
| 0.01100 | 0.3690 | 0.3640 | 0.9822 | 0.9822 | 0.5794 | 0.5794 |

Table 5 Comparison of the two simulation models for the utilizations of nodes

| Arrival Rate λ | Length of HDMA Queue in model 1 | Length of HDMA Queue in model 2 |
|---|---|---|
| 0.00273 | 0.0464 | 0.0465 |
| 0.00493 | 0.1999 | 0.2002 |
| 0.00786 | 0.9433 | 0.9438 |
| 0.00900 | 1.8653 | 1.8653 |
| 0.01079 | 14.58 | 14.576 |
| 0.01100 | 30.506 | 30.499 |

Table 6 Comparison of the two simulation models for mean queue lengths of HDMA

| Arrival Rate λ | Queue Length at node LANai in model 1 | Sum of the lengths of the three individual queues which LANai polls in model 2 |
|---|---|---|
| 0.00273 | 0.0067 | 0.0064 |
| 0.00493 | 0.0238 | 0.0222 |
| 0.00786 | 0.0680 | 0.0626 |
| 0.00900 | 0.0925 | 0.0854 |
| 0.01079 | 0.1413 | 0.1317 |
| 0.01100 | 0.1478 | 0.1378 |

Table 7 Comparison of mean queue lengths at LANai in the two simulation models
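The last column of Table 7 is described as the sum of the three SRAM queue lengths that LANai polls in the second simulation model. As a small consistency check, the sketch below adds the three columns of Table 4 and compares the result with that column. The dictionary names are hypothetical, and the tolerance is an assumption that allows for the four-decimal rounding of the published values.

```python
# Table 4: (doorbell, descriptor, data) mean queue lengths per arrival rate.
TABLE4 = {
    0.00273: (0.0024, 0.0022, 0.0019),
    0.00493: (0.0084, 0.0076, 0.0062),
    0.00786: (0.0244, 0.0216, 0.0166),
    0.00900: (0.0337, 0.0293, 0.0224),
    0.01079: (0.0530, 0.0444, 0.0344),
    0.01100: (0.0554, 0.0464, 0.0360),
}

# Table 7, last column: summed queue length polled by LANai in model 2.
TABLE7_MODEL2 = {
    0.00273: 0.0064,
    0.00493: 0.0222,
    0.00786: 0.0626,
    0.00900: 0.0854,
    0.01079: 0.1317,
    0.01100: 0.1378,
}


def summed_queue_lengths(table4):
    """Sum the three per-class queue lengths for every arrival rate."""
    return {lam: sum(lengths) for lam, lengths in table4.items()}


if __name__ == "__main__":
    sums = summed_queue_lengths(TABLE4)
    for lam in TABLE4:
        diff = abs(sums[lam] - TABLE7_MODEL2[lam])
        print(f"{lam:.5f}: sum={sums[lam]:.4f} table7={TABLE7_MODEL2[lam]:.4f} diff={diff:.4f}")
```

With the published values, every row agrees to within about 0.0001, which is what rounding of the individual entries would allow.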

Tables 6 and 7 compare the mean queue lengths of nodes HDMA and LANai, respectively, in the two simulation models. Table 7 shows a maximum difference of 6.76%. Strictly speaking, these values cannot be compared because the second simulation model does not have a single LANai queue. They are presented here for completeness of the comparison of the two simulation models. The most notable result obtained by comparing the two simulation models is that, for the given parameters of the NIC, LANai can be mathematically abstracted as a node that serves, in FCFS manner, a single queue into which all of the different kinds of messages (doorbells, descriptors and data) are pooled.

5. Comparison of Analytical and Simulation Models

This section presents a comparison between analytical and simulation results. There are two types of simulation results available, and the second simulation model results are used to check the accuracy of the analytical model. The second simulation model is chosen because it represents the true behavior of the NIC, and in this section it will henceforth be referred to as the simulation model. The effects of all the assumptions that are made in the analytical model are observed. Table 8 shows the comparison between the analytical and the simulation models for the utilization of the different nodes.

| Arrival Rate λ | Analytical Utilization of LANai | Simulated Utilization of LANai | Analytical Utilization of HDMA | Simulated Utilization of HDMA | Analytical Utilization of NSDMA | Simulated Utilization of NSDMA |
|---|---|---|---|---|---|---|
| 0.00273 | 0.0721 | 0.0875 | 0.2438 | 0.2430 | 0.1438 | 0.1433 |
| 0.00493 | 0.1273 | 0.1586 | 0.4403 | 0.4393 | 0.2597 | 0.2591 |
| 0.00786 | 0.1969 | 0.2551 | 0.7020 | 0.7015 | 0.4141 | 0.4138 |
| 0.00900 | 0.2227 | 0.2935 | 0.8039 | 0.8032 | 0.4742 | 0.4738 |
| 0.01079 | 0.2620 | 0.3564 | 0.9637 | 0.9637 | 0.5685 | 0.5683 |
| 0.01100 | 0.2664 | 0.3640 | 0.9825 | 0.9822 | 0.5796 | 0.5794 |

Table 8 Comparison of utilizations of nodes in analytical and simulation models
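The maximum error percentages quoted for Table 8 in this section (0.35 for HDMA and NSDMA, 26.81 for LANai) can be recomputed from the table. The sketch below assumes that the error is measured relative to the simulated value; the names are hypothetical.

```python
# Table 8 columns, in order of increasing arrival rate.
ANALYTICAL = {
    "LANai": [0.0721, 0.1273, 0.1969, 0.2227, 0.2620, 0.2664],
    "HDMA": [0.2438, 0.4403, 0.7020, 0.8039, 0.9637, 0.9825],
    "NSDMA": [0.1438, 0.2597, 0.4141, 0.4742, 0.5685, 0.5796],
}
SIMULATED = {
    "LANai": [0.0875, 0.1586, 0.2551, 0.2935, 0.3564, 0.3640],
    "HDMA": [0.2430, 0.4393, 0.7015, 0.8032, 0.9637, 0.9822],
    "NSDMA": [0.1433, 0.2591, 0.4138, 0.4738, 0.5683, 0.5794],
}


def max_percent_error(analytical, simulated):
    """Largest |analytical - simulated| / simulated over all rows, in percent."""
    return max(abs(a - s) / s * 100.0 for a, s in zip(analytical, simulated))


if __name__ == "__main__":
    for node in ("LANai", "HDMA", "NSDMA"):
        error = max_percent_error(ANALYTICAL[node], SIMULATED[node])
        print(f"{node}: maximum error {error:.2f}%")
```

Under that assumption, the LANai maximum occurs at the largest arrival rate and the HDMA and NSDMA maximum at the smallest one.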
[Figure 5: Mean queue lengths at HDMA in analytical and simulation models. Queue length versus arrival rate for HDMA in the analytical model and HDMA in the simulation model.]

The utilization results match very closely for the HDMA and NSDMA nodes, with the maximum error percentage being 0.35. The deviation in the utilization values for the node LANai (maximum error percentage is 26.81) is due to the approximation involved in the estimated probability of 0.5 that is chosen for the probability of messages being present in the data queue when NSDMA is idle. The deviation in the utilization of node LANai is not a critical result for two reasons. Firstly, the LANai node is a mathematical abstraction for the performance of LANai, and secondly, HDMA is the bottleneck node in the network. Nonetheless, the approximation involved in the estimated probability value is bound to influence the mean queue length of HDMA, which is the other performance measure of interest in the present study. Figure 5 compares the mean queue lengths of HDMA in the two models for different arrival rates. The analytical length of the queue that the NSDMA serves cannot be compared with the simulation because, in the simulation, the NSDMA does not serve any queue. The queue length of node LANai can be compared by summing up the lengths of the queues that are polled by LANai in the simulation model (see Table 9).

| Arrival Rate λ | Analytical queue length of node LANai | Sum of lengths of three individual queues that LANai polls (simulation) |
|---|---|---|
| 0.00273 | 0.0059 | 0.0064 |
| 0.00493 | 0.0191 | 0.0222 |
| 0.00786 | 0.0486 | 0.0626 |
| 0.00900 | 0.0642 | 0.0854 |
| 0.01079 | 0.0940 | 0.1317 |
| 0.01100 | 0.0980 | 0.1378 |

Table 9 Mean queue lengths at LANai in analytical and simulation models

Summarizing the comparison between the analytical and simulation performance measures:

(1) In most cases, analytical performance measures are less than simulated performance measures.

(2) Utilization values of HDMA and NSDMA match in both models.

(3) Utilization values of LANai in the analytical model are lower than the values in the simulation model.

(4) The analytical model predicts the mean queue length of the bottleneck node in the network (HDMA) within an average error of 14% and a peak-utilization error of 20-25%, which are fairly good estimates.

At lower utilizations, the model predicts the mean queue length of HDMA with higher accuracy. The estimated probability for Scenario 2 explained in Section 3.1 affects the utilization value for LANai in the analytical model.
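The error figures in item (4) above can be checked against the published tables. The sketch below compares the analytical HDMA queue lengths of Table 3 with the second-simulation-model values of Table 6, assuming that the error is taken relative to the simulated value; the names are hypothetical.

```python
ARRIVAL_RATES = [0.00273, 0.00493, 0.00786, 0.00900, 0.01079, 0.01100]

# Mean queue length at HDMA: analytical (Table 3) and simulated (Table 6, model 2).
HDMA_ANALYTICAL = [0.0480, 0.1922, 0.8007, 1.5285, 11.2929, 24.1981]
HDMA_SIMULATED = [0.0465, 0.2002, 0.9438, 1.8653, 14.576, 30.499]


def percent_errors(analytical, simulated):
    """Per-row |analytical - simulated| / simulated, in percent."""
    return [abs(a - s) / s * 100.0 for a, s in zip(analytical, simulated)]


if __name__ == "__main__":
    errors = percent_errors(HDMA_ANALYTICAL, HDMA_SIMULATED)
    for lam, error in zip(ARRIVAL_RATES, errors):
        print(f"lambda={lam:.5f}: error {error:.1f}%")
    print(f"average error {sum(errors) / len(errors):.1f}%")
```

Under this assumption, the average comes out close to 14%, the two highest arrival rates fall in the 20-25% range, and the two lowest arrival rates stay below 5%.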
For all other performance values, the deviations are attributed to the various simplifying assumptions under which the analytical model is obtained: class switching, the mathematical abstraction for programming NSDMA, and the estimated probability for Scenario 2 in Section 3.1.

6. Concluding Remarks and Future Work

This section summarizes this research work and provides pointers to possible future developments in this area. Section 6.1 provides a short synopsis of the modeling, the simulation, and the comparison between them to check validity. Section 6.2 highlights the research accomplishments, and Section 6.3 presents concluding remarks with possible enhancements and other research opportunities that lie ahead.

6.1 Summary of the Research Work

A multi-station and multi-class queueing network model is applied to study the performance of the Myrinet NIC. The stations correspond to LANai and the DMA engines. The different messages in the system, which are doorbells, descriptors and data, are modeled as different classes of traffic in the queueing network. Various simplifying assumptions are made which are essential for applying the proposed analytical model. From the point of view of design, two mathematical abstractions (modeling LANai as a node, whereas in reality it is a processor which visits a certain number of queues in a predetermined order, and modeling the programming of NSDMA by LANai to pick up the available data messages on SRAM when it is idle) are needed to develop a mathematically tractable model. Important performance measures for the design of the NIC are obtained analytically. The bottleneck node of the system is HDMA.

Two simulation models are presented as part of the study. They differ in the design of node LANai. LANai in the first simulation model serves, on an FCFS basis, a single queue in which all three classes of messages are pooled. LANai in the second model cyclically polls three queues corresponding to doorbells, descriptors and data, which reside on SRAM. Results of the two simulation models are almost identical, which verifies the basic assumption behind the analytical model: pooling the three classes into a single queue does not affect the performance. The major difference between the analytical and the simulation models is the design of the programming of NSDMA by LANai. The analytical model uses an estimated probability for the case of NSDMA being idle and a data message being available on SRAM. NSDMA serves a queue with infinite waiting space in the analytical model. The simulations implement the programming of NSDMA by LANai in exactly the same way as in a NIC. NSDMA does not serve any queue, and a resource status check is used to determine whether NSDMA is busy or idle. The analytical model predicts the mean queue length of the HDMA reasonably well.

6.2 Contributions from this Research Work

The contributions from this research work are as follows.

(1) Applying existing queueing modeling principles to obtain the performance measures of a NIC. This provides an alternative to obtaining performance measures by testing or by simulations, which may be expensive and computationally time intensive.

(2) Ability to study various design alternatives for the components of the NIC quickly.
This ability is a very important feature of analytical modeling. It allows alternate designs to be evaluated without resorting to simulations or experiments that are expensive and time-consuming. For example, changing the order of service for LANai is not going to affect the performance. The analysis presented in Section 2.2 holds for any distribution and for any kind of discipline as long as it is work conserving.

(3) Providing a general framework for the analysis of a NIC that can be referred to in future work.

(4) Identifying the bottleneck in the system to be HDMA, the main direct memory access (DMA) engine for processing doorbells and descriptors.

(5) Reducing the computational effort, so that the performance measures are obtained approximately in a much easier and faster way. Once the required parameter values are put into the algorithm for a NIC, the analytical results are obtained in less than a second. Table 10 provides the time taken for the polling simulation model at various arrival rates.

| Arrival Rate λ | Time taken in minutes for simulation model |
|---|---|
| 0.00273 | 3.55 |
| 0.00493 | 7.55 |
| 0.00786 | 16.12 |
| 0.00900 | 20.93 |
| 0.01079 | 35.07 |
| 0.01100 | 37.58 |

Table 10 Time taken in minutes for the second simulation model to run