
Real-Time Message Transmission Over The Scalable Coherent Interface (SCI)

Lei Jiang, Sarit Mukherjee
Dept. of Computer Science & Engg., University of Nebraska-Lincoln, Lincoln, NE fljiang,

Tapas K. Nayak
Dept. of Computer Science & Engg., Indian Institute of Technology Kanpur, India

Abstract

The Scalable Coherent Interface (SCI) is a recently developed IEEE standard that defines a scalable high performance multiprocessor network. The SCI concept offers enormous potential improvement in both performance and life-cycle cost with regard to the future of multiprocessor computing. Unfortunately, the high potential offered by SCI, as it is currently specified, cannot be directly exploited for real-time systems. Suggestions have been made by various SCI working group members on how best to extend or modify SCI to support real-time applications (SCI/RT). However, because of limitations in each of the proposed candidate SCI/RT schemes, progress in developing a universally agreed upon SCI/RT standard has been slow. In this paper, we propose an efficient and low cost alternative SCI/RT scheme, called the job packing scheme. The scheme is based upon the solid theoretical foundation of generalized rate monotonic scheduling theory and bin-packing methodology. It is flexible and load sensitive, and thus can work efficiently in a dynamic environment like SCI. A detailed simulation platform is built for its performance evaluation and comparison. We have also built simulators for some of the candidate SCI/RT schemes and have shown the superiority of the job packing scheme over them. We then investigate the applicability of several popular real-time message scheduling schemes in the SCI environment. Their pros and cons are studied and evaluated through simulation, and compared with the job packing scheme.

This research is funded in part by the Air Force Office of Scientific Research (AFOSR) and the Research and Development Laboratory (RDL) under contract number F C-0063.

1 Introduction

The Scalable Coherent Interface (SCI) [19] defines a new way to design parallel computers with distributed multiple processors and memory chips. It has been standardized by the Institute of Electrical and Electronics Engineers (IEEE) as the document IEEE Std, henceforth referred to as the SCI Standard or SCI Protocol.

The SCI interface (also referred to as a SCI node) is the unit through which the compute and memory components communicate with other compute and memory components connected in a ring topology. Its logical queueing structure is identical to that of a buffer-insertion ring interface [1] (see figure 1). The node interface consists of two unidirectional links (input and output) which are used to connect nodes in a unidirectional ring. The bypass FIFO stores packets arriving from the upstream neighbor while the node is transmitting packets. This enables a node to concurrently (1) transmit packets, (2) process packets addressed to other nodes, and (3) accept packets addressed to itself.

[Figure 1: SCI interface (also referred to as SCI node). Components shown: Node Application Logic, Output FIFO, Input FIFO, Bypass FIFO, MUX, Address Decoder, Output Link, Input Link.]

The interfaces communicate by exchanging packets, which are finite sequences of symbols. A symbol is 16 bits wide and is the smallest information fragment transmitted between node interfaces. A link transmits one symbol at a time. Because SCI protocols are synchronous, special idle symbols are transmitted across the links in the absence of packets, and at least one idle symbol is always sent between consecutive packets. Although SCI uses idle symbols for a variety of purposes, they are of key importance in its flow control protocol [14], which prevents node starvation and fairly allocates bandwidth to all nodes on the ring.

Difficulties in Real-Time Support: As SCI was intended for time-shared (i.e., non real-time) applications, the SCI designers were concerned with optimization and efficient implementation to achieve low average response time, high average throughput, and fairness in bandwidth utilization. However, real-time systems require guarantees on when certain tasks complete. The notion of a deadline is used to measure the timeliness of task completion: if a task completes before its deadline, it is on time. The correctness of the system depends on meeting the deadlines of the tasks [18]. Therefore, guaranteed timing behavior (i.e., guaranteed latency) is the essential metric for real-time systems. If the system activities are schedulable, then all requests will be serviced. For this reason, fairness and guaranteed forward progress are seldom of concern in real-time systems.

Unfortunately, the SCI protocol, as it stands today, cannot be applied to real-time systems. This is because the SCI protocol ensures forward progress but not deterministic latency. Thus, the fundamental problem is to modify the SCI protocol from one which guarantees forward progress to a SCI/RT protocol which guarantees latency.

[Figure 2: SCI ring with and without a busy intermediate node between source and target. (a) Intermediate node is idle, packet gets forwarded to target. (b) Intermediate node is busy, packet waits in bypass buffer.]

The problem of obtaining guarantees on latency with a distributed network such as SCI is inherently complex (real-time solutions exist for similar ring protocols, e.g., slotted ring [13] and token rings such as FDDI [15]). The complexity arises mainly because of the buffer insertion feature of SCI. This is explained with the help of figure 2. Consider a SCI ringlet with three nodes as shown,

one intermediate node in between the source and the target. We are concerned with providing deterministic latency between the source and the target. The latency can be divided into two components: (1) the waiting time at the source, which is the local component, and (2) the ring transfer time (footnote 2) between the source and the target, which is the distributed component. The most critical component of latency in a distributed environment is the ring transfer time. The ring transfer time in SCI consists of three components, namely the transmission time of the packet by the source, the propagation delay from source to target, and the buffering delay (in the bypass FIFO) in the intermediate nodes. Both the transmission time and the propagation delay (footnote 3) are fixed for a given SCI network. The only variable component is the buffering delay at the intermediate nodes. As the following example shows, this component of delay makes real-time message delivery over a SCI ring inherently complex.

Consider figure 2(a). When the intermediate node is idle (i.e., not transmitting), the source to target transmission traces the path indicated by the dashed line and reaches the target after the transmission time and the propagation delay. This results in a deterministic ring transfer time. However, if the intermediate node is busy (i.e., transmitting), the source to target packet is buffered in the bypass FIFO of the intermediate node (see the two part dashed line in figure 2(b)). It is forwarded to the target after the packet transmission from the intermediate node is completed. Thus the ring transfer time between the source and the target becomes a function of the load at the intermediate node and, moreover, of the lengths of the packets that contend for concurrent transmission at the intermediate node. This results in the non-determinism in ring transfer time in SCI.

There are several other difficulties with the SCI protocol regarding real-time traffic support.
These difficulties include the FIFO queueing discipline, an insufficient number of priority bits, etc., and have been independently identified by other researchers as well [4, 5]. We do not elaborate on them here, since the solution to the non-deterministic ring transfer time encompasses solutions to them as well.

Footnote 2: The ring transfer time is the time between the transmission of the first bit of a packet from the source to the ring and the reception of the last bit of the packet by the target from the ring.

Footnote 3: Since the distance between the source and the target is fixed, and the address decoding time at an intermediate node is constant, the source to target propagation delay is fixed.

In this paper, we propose an efficient and low cost real-time extension of SCI, called the Job Packing Scheme, for real-time message transmission. The algorithm is based on generalized rate-monotonic scheduling theory (GRMS) [17]. However, unlike GRMS, our scheme is a distributed one

that is suitable for an environment like SCI. The job packing algorithm keeps minimal global information to perform the job admission task. The global information is updated by exchanging information between the neighboring nodes. This information update makes one complete round through the ring so that every node learns and collects the necessary information. Thus, global information is exchanged only when a job is accepted, not otherwise (job rejection is a local operation in our scheme). Moreover, the information kept per node is sufficient to make sure that a local decision about a job admission will be accepted by all the nodes in the ring with a very high probability.

We develop a simulation platform for the job packing and other SCI/RT candidate schemes. We choose the train protocol [16] and the 2-bit/8-bit priority protocol [5] as candidate SCI/RT schemes, and evaluate and compare their performance with the proposed scheme. In addition, we have built a general purpose, priority-based scheduling platform for SCI/RT. We use this platform to investigate several popular priority scheduling algorithms suitable for real-time message transmission over ring networks. The experiments show that in most cases, with a variety of workloads, the job packing algorithm outperforms the rest.

The rest of the paper is organized as follows: Section 2 introduces the existing SCI/RT proposals, including the Preemptive Priority Queue Protocol, the Train Protocol, and the 2-bit/8-bit Priority Protocol. Section 3 describes the proposed job packing algorithm in detail and analyzes the different approaches to job admission. Section 4 describes the simulation model and shows the numerical results and a comparison of different SCI/RT schemes. The paper is concluded in Section 5.

2 Real-time Extensions to SCI (SCI/RT)

When the SCI standard was awaiting approval, interest had already grown in using the SCI protocol in real-time environments.
This activity branched off into the SCI/RT working group (IEEE P1596.6) and work has progressed since then. The goal of the SCI/RT working group is to modify the existing SCI protocol for real-time purposes. Currently, several modifications to the SCI protocol have been proposed. The three major SCI/RT proposals are: the Preemptive Priority Queue Protocol [2], the Train Protocol [16], and the 2-bit/8-bit Protocol [5].

The Preemptive Priority Queue Protocol [2], by allowing preemption, can support rate-monotonically scheduled message transmission efficiently over the network. This proposal was not accepted because it is very expensive to implement, and it deviates significantly from the original SCI specifications.

The Train Protocol [16] is a token based scheme. It uses a special token, the LocalMotive, that circulates around the ring carrying priority information. A train is sent around the interconnect to determine which packets should be sent and which should be saved for later transmission to avoid interference with higher priority traffic. Because of its high maintenance overhead, the train protocol suffers from low link utilization.

The 2-bit/8-bit Priority Protocol [5] is a hybrid protocol scheme. The 2-bit priority protocol defines four priority levels. It is simpler to implement and appears to be sufficient for typical personal-computer and workstation applications, but it does not have sufficient priority levels to support more involved real-time applications. The 8-bit priority protocol implementation is more complex, but provides a relatively complete set of protocols for implementing hardware-based rate-monotonic scheduling [17] with limited priority levels per node.

3 The Job Packing Algorithm

In this section we outline our algorithm for real-time job scheduling in a SCI ring. The algorithm is based on generalized rate-monotonic scheduling theory (GRMS) [17]. However, unlike GRMS, our scheme is a distributed one that is suitable for an environment like SCI. The proposed algorithm performs two essential functions: (1) job admission and (2) job sequencing. The job admission algorithm, running at each node, decides whether a new job can be admitted, so that its messages can be delivered within the deadline without violating the deadlines of the previously accepted jobs. Once a job is accepted, the next step is to sequence the job in the node's transmission "calendar", i.e., to determine when the job should be transmitted with respect to the existing schedule of jobs. The intuition behind dividing the scheduling functionality into two parts is to achieve efficiency.
We perform the job admission locally, i.e., a node need not consult other nodes in order to admit a new job. This is achieved by keeping "sufficient" information per node about the global scheduling behavior. Note that the naive way of achieving this would be to replicate the schedule of all the nodes in each node. This information is huge and expensive to maintain. The job packing algorithm keeps minimal global information per node to perform job admission. The global information is updated in the job sequencing phase by exchanging information between the neighboring nodes. This information update makes one complete round through the ring to let every node know

and collect the necessary information. Thus, global information is exchanged only when a job is accepted, not otherwise. Moreover, the information kept per node is sufficient to guarantee that a local decision about a job admission will be accepted by all the nodes in the ring with a very high probability.

We assume that a job consists of a set of real-time messages. For ease of exposition, in the rest of the paper we assume that the size of the set is unity, and use the terms job and message interchangeably. A real-time message $M$ is defined as a five tuple, $M = \{P, D, C, S, T\}$, where $P$, $D$, and $C$ are, respectively, the period, deadline, and transmission time of the message transmitted by source $S$ to target $T$. The message is periodically generated at source $S$ every $P$ time units. It is ready for transmission at the beginning of the period, and should be received within the deadline. We assume that $D = P$ in the rest of the discussion and define the utilization of a job as $C/P$. Notice that a job defined this way succinctly models real-time process control messages such as messages generated periodically by a sensor, real-time animation, etc. Aperiodic messages such as interrupts can also be modeled in our framework by assuming that the message does not repeat (i.e., $P = \infty$).

In the rest of this section, we describe the job admission and sequencing schemes. In Section 4, we conduct a performance comparison of our scheme with the other proposed SCI/RT schemes and other popular real-time scheduling policies.

3.1 Job Admission

The job admission algorithm determines the schedulability of a job set. A job set is schedulable if there exists a schedule in which no deadline is missed. If a job set is schedulable, the job admission algorithm accepts it. Then the job sequencing algorithm schedules the jobs. It is often the case that the schedulability test in a distributed system is NP-hard in the strong sense [10].
We ensure that most of the admitted jobs meet their deadlines, although we cannot guarantee the schedulability of all the jobs for all possible arrival times. We use three different approaches for admission control. The simplest one uses the utilization bound test. It is simple and can easily be done on the local node in the distributed environment. A more precise way is to use the M/G/1 priority queue model. The priority queue model gives the average queuing delay for the messages. Using this information, we can determine the deadline miss probability of a newly arrived job. The third scheme applies GRMS theory in a distributed environment to compute the minimum completion time for the new job. If it can complete before

its deadline at all the nodes between the source and the destination nodes, the job can be accepted. We elaborate on these approaches in the following.

Node Utilization Bound

In this method, we check whether the effective transmission rate at a node is more than the aggregate job arrival rate. Each node in the SCI ring keeps track of its own utilization, and ensures that its cumulative job utilization is lower than 1. The cumulative job utilization for a node is the sum of the utilizations of all the jobs that originate from or pass through this node. When a job arrives at a node (originating or passing through), the node adds the job's utilization to its cumulative job utilization. If the new cumulative job utilization is greater than 1, the job is rejected. If the job is accepted, the node updates the node utilization and broadcasts the information using the idle symbols of SCI.

The utilization bound test only gives a necessary condition for schedulability. Since in a distributed environment the latest update cannot be seen by all the nodes immediately, admission decisions are made based on incomplete information. A local decision may therefore conflict with the current system status. In this case, although the job passes job admission, it may still be rejected in the job sequencing step. Figure 3 shows the job admission behavior with different utilization bounds. Simulation results show that with a sufficiently low utilization bound, the job reject ratio in the job sequencing phase is extremely low. The utilization bound check is the simplest way to perform job admission control in a distributed environment.
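The per-node bookkeeping behind the utilization bound test can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the class name `NodeAdmission`, the method `try_admit`, and the configurable `bound` parameter are hypothetical names introduced here. The text itself only specifies that a job's utilization C/P is added to the node's cumulative utilization and that the job is rejected when the total would exceed the bound (1 in the basic test).

```python
class NodeAdmission:
    """Hypothetical per-node state for the utilization bound test."""

    def __init__(self, bound: float = 1.0) -> None:
        # `bound` is an assumed tunable; the basic test in the text uses 1.
        self.bound = bound
        self.cumulative = 0.0  # sum of C/P over jobs originating here or passing through

    def try_admit(self, c: float, p: float) -> bool:
        """Admit a job with transmission time c and period p if the bound still holds."""
        utilization = c / p
        if self.cumulative + utilization > self.bound:
            return False  # rejection is purely local; no state changes
        self.cumulative += utilization
        return True


node = NodeAdmission()
first = node.try_admit(80, 400)    # utilization 0.20
second = node.try_admit(272, 320)  # utilization 0.85, would push the total past 1
third = node.try_admit(16, 32)     # utilization 0.50
```

Lowering `bound` below 1 trades link utilization for fewer rejections later in the sequencing step, which is the trade-off the simulation results above refer to.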
Although the utilization bound check reduces the link utilization available to real-time jobs, the remaining capacity can still be used by background tasks with lower priority, or by aperiodic jobs such as interrupts.

M/G/1 Priority Queue Model

The utilization bound scheme does not check the queuing delay along the transmission path or the blocking time caused by priority inversion. Here, we develop a schedulability model based on queuing theory. We approximate a node in the SCI ring as a priority M/G/1 queue. We assign each periodic job a priority according to its deadline and period. In our model, the node's transmission speed (service rate) is 1 symbol per time unit. Periodic job $i$ has $C_i$ symbols to transmit every $P_i$ time units. We check the delay along the transmission path. If the total delay on the path is larger than $D_i - C_i$, the job will not be able to

arrive at the destination node within the deadline, and therefore the job is rejected. The delay is made up of the link propagation delay over the ring and the queuing delay in the bypass buffer (while the job waits for higher priority job transmissions). Since the link propagation delay is fixed once the source and destination pair is fixed, we can ignore it by subtracting the link propagation delay from the deadline.

We use the priority M/G/1 queue to model the transmission buffer at a node in the SCI ring, and compute the average waiting time for the job. For each periodic job, we assign a priority level according to the earliest schedulable position, then find the average waiting time for the job along the transmission path. The average waiting time is a function of the current job sequence at the node. The average waiting time for the job with the $m$th priority level (the earliest schedulable position is $m$) can be written as [12]:

$$W_m = W_0(m) + \sum_{i=1}^{m} \frac{C_i}{P_i} W_i + \sum_{i=1}^{m-1} \frac{C_i}{P_i} W_m. \tag{1}$$

We can solve these equations recursively, starting with the highest priority class. If we define $\rho_i = C_i / P_i$, the equation above can be written as:

$$W_m = \frac{W_0(m)}{\left(1 - \sum_{i=1}^{m} \rho_i\right)\left(1 - \sum_{i=1}^{m-1} \rho_i\right)}. \tag{2}$$

The term $W_0(m)$ represents the delay due to the job currently in transmission. We know that an outside observer will see the system working on a class $i$ job a fraction $\rho_i$ of the time. Combining that with the residual lifetime of service for each class, we get:

$$W_0(m) = \sum_{i=1}^{N} \rho_i \frac{C_i}{2}. \tag{3}$$

Combining Equations (2) and (3), we get the average delay for a class $m$ job:

$$W_m = \frac{\sum_{i=1}^{N} \frac{C_i^2}{2 P_i}}{\left(1 - \sum_{i=1}^{m} \frac{C_i}{P_i}\right)\left(1 - \sum_{i=1}^{m-1} \frac{C_i}{P_i}\right)}. \tag{4}$$

Figure 3 shows that the queuing model can improve the node utilization and has a lower job reject ratio than the utilization bound scheme.

Minimum Completion Time

The queuing scheme only gives a statistical view of the system. To get a more precise view of the queuing delay, we can use the following algorithm [9] to check if a job can meet its deadline.
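To contrast the statistical view with the more precise one, the sketch below evaluates both for one small job set at a single node: the average waiting time of Equation (4), and the minimum completion time obtained by repeatedly replacing $t$ with the cumulative time demand $\sum_j C_j \lceil t / P_j \rceil$ until it stops changing. The function names `mg1_average_wait` and `min_completion_time`, the `(C, P)` tuple layout, and the example values are assumptions made for illustration; jobs are assumed to be listed from highest to lowest priority with $D = P$.

```python
import math


def mg1_average_wait(jobs, m):
    """Average waiting time W_m of Equation (4).

    jobs: list of (C, P) pairs ordered from highest to lowest priority.
    m:    1-based priority level of the job of interest.
    """
    residual = sum(c * c / (2.0 * p) for c, p in jobs)     # W_0: sum over all N classes
    rho_upto_m = sum(c / p for c, p in jobs[:m])           # sum of rho_i, i = 1..m
    rho_below_m = sum(c / p for c, p in jobs[:m - 1])      # sum of rho_i, i = 1..m-1
    if rho_upto_m >= 1.0:
        return math.inf                                    # queue is saturated
    return residual / ((1.0 - rho_upto_m) * (1.0 - rho_below_m))


def min_completion_time(jobs):
    """Smallest t with demand(t) == t for the lowest-priority job in `jobs`.

    Uses integer C and P so the ceiling is exact. Returns None when the
    iteration passes the deadline D_n = P_n of that job.
    """
    deadline = jobs[-1][1]
    t = sum(c for c, _ in jobs)                            # t_0 = sum of all C_j
    while t <= deadline:
        demand = sum(c * ((t + p - 1) // p) for c, p in jobs)
        if demand == t:
            return t
        t = demand
    return None


jobs = [(16, 100), (80, 400), (272, 1600)]   # illustrative (C, P) values
average_wait = mg1_average_wait(jobs, 3)
completion = min_completion_time(jobs)
```

The two numbers are not directly comparable: the first is an average waiting time, while the second is a worst-case completion time that includes the job's own transmission time.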

Consider any job $n$ with a period $P_n$, deadline $D_n = P_n$, and transmission time $C_n$. Let jobs 1 to $n-1$ have higher priorities than $n$. Note that in any time interval of length $t$, the number of times job $i$ arrives is $\lceil t / P_i \rceil$, and therefore the time demand of job $i$ in time $t$ is $C_i \lceil t / P_i \rceil$. So the total cumulative demand on transmission time by these $n$ jobs is

$$W_n(t) = C_1 \left\lceil \frac{t}{P_1} \right\rceil + \ldots + C_n \left\lceil \frac{t}{P_n} \right\rceil = \sum_{j=1}^{n} C_j \left\lceil \frac{t}{P_j} \right\rceil.$$

We can use the following algorithm to find the minimum completion time for job $n$ [7, 3].

Set $t_0 = \sum_{j=1}^{n} C_j$; $k = 0$;
repeat $k = k + 1$; $t_k = W_n(t_{k-1})$;
until ($W_n(t_k) = t_k$).

If a newly arrived job can complete before its deadline at all the nodes along its transmission path, then the job is accepted. Figure 3 shows the simulation result for the minimum completion time scheme. It has better performance than the utilization bound scheme, and performance similar to that of the queuing model.

[Figure 3: Performance of Job Admission Algorithm. Curves: Utilization Bound, M/G/1 Priority Queue, Minimum Completion Time.]

3.2 Job Sequencing Algorithm

Once a job passes the job admission controller, the job sequencer tries to put the job in the local job sequence. As mentioned before, each node keeps a job sequence, and the newly admitted job has to fit in the sequence without violating any timing constraints. The job sequence for node $s$ is defined

as $\tau_1, \tau_2, \ldots, \tau_i, \ldots, \tau_j, \ldots, \tau_n$, where $\tau_k$, $k = 1, 2, \ldots, n$, is a periodic job accepted by node $s$, and the priority of job $\tau_i$ is higher than that of $\tau_j$ if $i < j$. Note that we have omitted the node subscript since there is no confusion. The minimum delay between two successive transmissions of $\tau_i$ is a function of the job sequence [17], and we denote it by $f_i(\tau_1, \tau_2, \ldots, \tau_n)$. The function can be evaluated as

$$f(\tau_1 \tau_2 \ldots \tau_n; \tau_{n+1}) = \frac{\sum_{j=1}^{n} C_j}{1 - \sum_{j=1}^{n} \frac{C_j}{P_j}}.$$

In order to compute and update the sequence we introduce two state variables for every node, $L[s, j]$ and $M[s, j]$. $L[s, j]$ is the latest time by which the $j$th job has to be scheduled at node $s$ so that its deadline can be met. $M[s, j]$ is the earliest schedulable position of the $j$th job at node $s$ in a job sequence such that the jobs before this one can meet their deadlines. $L[s, j]$ and $M[s, j]$ work together as a transmission time window. The window gives the flexibility to select a suitable start time for each message. The message can reach the destination node within the deadline as long as it is sent within the transmission window. After admitting a new job, these two variables are updated in a distributed fashion as described in the following. We use $D[s, j]$ to denote the deadline of the $j$th job at node $s$.

$$D[s+1, j] - L[s, j] \ge D[s+1, M[s+1, j]] \;\Longrightarrow\; L[s, j] \le D[s+1, j] - D[s+1, M[s+1, j]]$$

$$M[s, j] = \min \left\{ k \;\middle|\; L[s, k] - f_j(\tau_1, \tau_2, \ldots, \tau_j, \tau_{k+1}) > 0, \ \forall\, i \le j-1 \right\} \tag{5}$$

Equation (5) is computed incrementally by all the nodes in turn, yielding the new values of $L[s, j]$ and $M[s, j]$. This fixes the job sequence. Each job in the sequence keeps the information $L[s, j]$ and $M[s, j]$. In order to check whether a newly arrived job can be fitted in position $j$ in the current job sequence, we need to compute $M[s, j]$ and $L[s, j]$. $M[s, j]$ can be computed directly from the definition:

$$M[s, j] = \min \left\{ k \;\middle|\; L[s, k] - f_j(\tau_1, \tau_2, \ldots, \tau_j, \tau_{k+1}) > 0, \ \forall\, i \le j-1 \right\}.$$

The job sequencing starts from the destination node, and propagates backwards to the source node. Consider an $n$ node SCI ring, $N_1, N_2, \ldots, N_n$. We define $i \oplus 1 = (i + 1) \bmod n$ and $i \ominus 1 = (i - 1) \bmod n$. The node $N_{i \oplus 1}$ is the downstream node of the node $N_i$, and the node $N_{i \ominus 1}$ is the upstream node of the node $N_i$.
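The neighbor indexing, (i + 1) MOD n for the downstream node and (i - 1) MOD n for the upstream node, and the backward direction of the sequencing pass can be sketched as follows. The function names are hypothetical, nodes are numbered 0 to n-1 here so that the modulo arithmetic applies directly (an assumption of this sketch), and the pass is assumed to visit the transmitting nodes only, from the target's upstream neighbor back to the source.

```python
def downstream(i: int, n: int) -> int:
    """Index of the node that receives from node i on an n-node ring."""
    return (i + 1) % n


def upstream(i: int, n: int) -> int:
    """Index of the node that sends to node i; Python's % keeps the result in 0..n-1."""
    return (i - 1) % n


def sequencing_order(source: int, target: int, n: int) -> list:
    """Nodes visited by the backward pass, assuming source != target."""
    order = []
    i = upstream(target, n)
    while True:
        order.append(i)
        if i == source:
            break
        i = upstream(i, n)
    return order


# A job from node 1 to node 0 on a 4-node ring travels 1 -> 2 -> 3 -> 0,
# so the backward pass visits nodes 3, 2, 1.
backward = sequencing_order(1, 0, 4)
```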

When a new job arrives at node $S$ with destination node $T$, the latest start time at node $T \ominus 1$ is $L_{T \ominus 1} = D - C$ (otherwise, the job cannot meet the deadline at node $T$). Suppose the queuing delay at node $T \ominus 1$ is $W_{T \ominus 1}$. At node $T \ominus 2$ the job has to start transmission before time $L_{T \ominus 1} - W_{T \ominus 1}$, so $L_{T \ominus 2} = L_{T \ominus 1} - W_{T \ominus 1}$. By continuing this computation back to the source node, we obtain all the $M$s and $L$s. We can use either the M/G/1 priority queue model or the minimum completion time to compute the queuing delay, depending on which admission algorithm is chosen.

Once the new job is fitted in, each node checks whether any timing constraint is violated. If there is a violation, the node tries to move the violated job to a higher priority level, and similar checks continue until no deadlines are violated. Refer to [6] for a detailed description of the algorithm and worked out examples.

3.3 Skip Over Algorithm

The job packing algorithm treats periodic jobs and aperiodic jobs in the same fashion. In case of overload, the system may drop a periodic job and an aperiodic job in the same way. A real-time system with a mixture of periodic and aperiodic jobs should treat these jobs differently. A periodic job usually has some loss tolerance. For example, in a JPEG movie, a lost frame can be recovered from the neighboring frames. On the other hand, an aperiodic job, such as an interrupt, is usually more critical. We call a job an Occasionally Skippable Job (OSJ) [8] if the job enters the system periodically, has fixed response time constraints (deadlines), and missing a few deadlines is acceptable provided most deadlines are met.

The basic idea behind the skip over algorithm [8] is as follows: when an aperiodic job arrives, the scheduler first tries to schedule it using the job packing algorithm. If it cannot be scheduled, the scheduler finds an OSJ from the list and skips that job for one cycle in order to fit in the newly arrived aperiodic job.
By using the skip over algorithm, we can better utilize the communication channel. The loss tolerance of an OSJ is specified by the acceptable error ratio (packet loss ratio) Er. Er can also be seen as the skip over factor. The skip over algorithm can only skip those periodic jobs whose packet loss ratios are below Er. When we select a job to skip, we can use several heuristics: skip any one; skip the job with the latest deadline; skip the job with the earliest deadline; skip the job with the longest transmission time; skip the job with the shortest transmission time that is still larger than that of the arrived aperiodic job. The choice of the latest deadline is based on the fact that skipping this job will have less effect on the other jobs. Skipping the earliest deadline job can weaken the time constraint so that the job packing

algorithm can insert more aperiodic jobs. The longest transmission time strategy may accept more aperiodic jobs in the same period, and the shortest transmission time strategy is also reasonable for reducing channel waste. All of them can achieve better link utilization while keeping the aperiodic job deadline miss ratio low. In Section 4, we show the simulation results for the different schemes.

4 Numerical Results and Comparison

In this section we present and discuss the results obtained from our experimentation with SCI/RT. We also compare the relative performance of various SCI/RT schemes. In order to study the performance of the different candidate real-time schemes for SCI and of our scheme, we have built a detailed simulation model for each of them. We started with a baseline simulation model developed at the University of Wisconsin [20]. It is a time-driven simulator that simulates only the base SCI packet transmission protocol (without the cache coherence protocol) in a single SCI ring with multiple nodes. It uses simplified buffer management, i.e., a single transmit and receive queue per SCI node. The simulator was extended to implement different plausible SCI/RT protocols.

We have also built a simulator to study any priority based scheduling discipline on the SCI/RT platform. We use this platform to study the Earliest Available First (EAF), Earliest Deadline First (EDF), Smallest Slack-time First (SSF), and Farthest Away First (FAF) algorithms [10, 11]. Note that all the protocols in this suite are variations of priority based scheduling. They differ in the way the priority of each job is computed and assigned. EAF defines the priority level according to a message's ready time: the earlier the message is ready, the higher the priority it gets. EDF assigns a job's priority according to its deadline. SSF treats the job with the smallest slack time (which, in our terminology, is $P - C$) as the most critical job, and assigns it the highest priority.
The FAF algorithm considers the distance from the source to the destination. A job that is far away from its destination is assigned a higher priority since it needs more time to pass through the longer path. We modified the base SCI protocol's queue from a simple FCFS queue to a priority queue, developed algorithm specific priority computation procedures, and plugged them into the corresponding scheme.

4.1 Workload Generation

In order to evaluate the performance of the different protocols and to compare them with the proposed job packing scheme, we created different sets of workloads: periodic, aperiodic, and

mixture workloads. The simulators were subjected to each set of workloads. Each workload consists of a set of jobs. A job can be either periodic or aperiodic. The periodic jobs are representative of sensor generated data, while the aperiodic jobs characterize one time operations like interrupt processing. Both are representative of real-time traffic [18].

Periodic Workload: Each job set contains 1000 periodic jobs, with an average job utilization of $\rho = \sum_i C_i / P_i$. The value of $\rho$ is varied over the job sets. Abiding by the standard SCI packet sizes, we make the computation time of a job ($C$) equivalent to 16 (command/address), 80 (16 bytes address + 64 bytes data), and 272 (16 bytes address plus data bytes) time units, where one time unit represents the time needed to transmit one symbol. In our job set, the computation time of each job is selected randomly from 16, 80, and 272. Once the value of $C$ is selected, we choose the value of $P$ such that the value of $C/P$ of each job falls randomly within $[\rho - r, \rho + r]$, where $r$ is a tunable parameter. Each job is assigned a source and a destination node randomly from all the nodes connected in the ring.

Aperiodic Workload: Aperiodic job sets are created from periodic job sets by assuming that a job does not repeat. We use $D = P$ to define the deadline of an aperiodic job. Each job in an aperiodic job set is assigned a randomly selected arrival time.

Mixture Workload: This set is created by adding randomly arriving aperiodic jobs to an existing periodic workload. For such a job set, we reserve a portion of the bandwidth for aperiodic jobs. Since each node can be treated as a single processor queuing system, the node utilization is limited by 1. We limit the cumulative utilization of periodic jobs by $\rho_p < 1$, so that $1 - \rho_p$ of the total bandwidth is always allocated to the aperiodic jobs. The service is work-conserving, i.e.,
a job can get the transmission channel if the channel is free. We assume that an aperiodic job has the same priority as the periodic job, and we study how the reserved bandwidth aect the deadline miss ratio in both periodic job class and aperiodic job class. 4.2 Comparative Study with Train and 2-bit Protocols We conducted several sets of experiments to study and evaluate the performance of the job packing algorithm with train and 2-bit protocols. We express the load on the ring in terms of cumulative job utilization of a job set. It is dened as P C i =P i (i.e., ) for all the jobs present in the set. Job reject ratio denes the fraction of jobs that were rejected by a particular protocol since their deadlines 13

cannot be met. The average node utilization is time-averaged over the simulation duration. The experiments are classified according to the workload used, and are described below.

[Figure 4: Performance of Job Packing, Train, and 2-Bit protocols for the periodic job set. Panels: Job Reject Ratio vs. Cumulative Job Utilization and Average Node Utilization vs. Cumulative Job Utilization, for random C/P of ±5% and ±35%. Curves: Job Packing, Train, 2Bit.]

Periodic Workload: Our first set of experiments used the periodic workload as input to the real-time SCI protocols; the results are plotted in Figure 4. Each pair of graphs shows the job reject ratio and the average node utilization as a function of cumulative job utilization. Different pairs of graphs show the simulation results for different degrees of randomness in the workload job set (i.e., $r$). The following observations can be made from the figures:

Job reject ratio is lowest with the job packing algorithm and highest with the train protocol, with the 2-bit protocol in between. This is because the job packing algorithm tries to accommodate as many jobs as possible through local (very low overhead) job admission and

global job sequencing (more overhead). It can move jobs around in the sequence so that more new jobs can get in, which results in a low job reject ratio. The train protocol, on the other hand, wastes a significant amount of ring resources in maintaining and circulating the train over the ring. A job has to wait at least one round trip before it gets permission (or rejection). This extra overhead forces the train protocol to reject more jobs. The 2-bit protocol, with its limited priority levels, cannot accept many jobs. Since its overhead is much lower than that of the train protocol, however, it performs better.

Average node utilization is highest with the job packing algorithm and lowest with the train protocol, with the 2-bit protocol in between. This behavior follows from the job reject ratio: the more jobs a protocol admits, the higher the node utilization it achieves.

As the randomness in the job set increases, the job packing algorithm performs even better (i.e., lower job reject ratio and higher node utilization). This is because randomness in the job parameters allows the job packing algorithm to make the packing tighter. In other words, during the job sequencing phase there is more flexibility in moving jobs around, which results in a higher job acceptance rate. This flexibility cannot be exploited by the train or 2-bit protocols.

[Figure 5: Performance of Job Packing, Train, and 2-Bit protocols for the aperiodic job set. Panels: Deadline Miss Ratio vs. Job Arrival Rate, for average job utilization 0.01 and 0.02. Curves: Job Packing, Train, 2Bit.]

Aperiodic Workload: The next set of experiments uses the aperiodic job sets as the workload. Note that cumulative job utilization does not make the same sense in this context as it does for

periodic jobs. Instead, we use the job arrival rate to define the intensity of the workload. The arrival rate is measured in units of symbol time (see footnote 4) to make it independent of link bandwidth. We assume that the jobs arrive according to a Poisson arrival process. The results obtained from the simulation are plotted in Figure 5. The deadline miss ratio is defined as the fraction of jobs that miss their deadlines, which is the most important metric for their scheduling. Observe from the figure that at a low aperiodic job arrival rate (i.e., low load) the train protocol has the lowest miss ratio and the job packing algorithm the highest, with the 2-bit protocol in between. At a higher job arrival rate (i.e., high load), however, exactly the reverse order is observed. The results can again be explained by the philosophy behind the design of each protocol. Since the job packing algorithm tries to pack jobs as compactly as possible, it does not perform very well at a low job arrival rate, where there is not much to pack (as the jobs do not repeat). Both the train and 2-bit protocols, in contrast, use the lightly loaded ring to send whatever job arrives as quickly as possible. As the load increases, however, the overhead of the train protocol and the insufficiency of the 2-bit protocol's priority levels become more prominent, and they fail to guarantee the deadlines. In this scenario the job packing algorithm works very well, since it is able to construct the sequence more appropriately. This load sensitivity of the job packing algorithm is a desirable feature for real-time job scheduling over SCI.

Mixture Workload: The last set of experiments uses the mixture job sets as the workload. In this experiment, we choose one of the periodic job sets from the previous experiments and add a set of aperiodic jobs to it. As in the aperiodic workload experiments, we use the aperiodic job arrival rate to define the intensity of the workload. The periodic job set is fixed during the simulation.
The results are plotted in Figure 6. The job miss ratio is defined as the fraction of jobs that miss their deadlines. In the mixture workload simulation, the job packing algorithm has the best performance for the aperiodic jobs, since it has more flexibility to fit an aperiodic job into the existing sequence of periodic jobs. The 2-bit and train protocols have difficulty accepting aperiodic jobs on top of the existing periodic workload. Since the 2-bit protocol has only 2 priority levels, it has the worst performance for the aperiodic jobs. Although the 2-bit and train protocols perform better for the periodic jobs, the job packing algorithm has better overall system tolerance under the mixture workload because it treats aperiodic and periodic jobs equally.

Footnote 4: One symbol time is defined as the time it takes to transmit one symbol over the ring.
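The mixture workload described in Section 4.1 and used above can be sketched as follows. This is a minimal illustration, not the authors' simulator: the function names, the dictionary layout, the uniform draw of C/P, the ring-wide (rather than per-node) utilization cap, and the parameters `rho` (mean per-job utilization), `r` (half-width of the C/P interval), and `rho_p` (periodic utilization limit) are all assumptions made for the example.

```python
import random

# Transmission times in symbol times for the three SCI packet sizes (Section 4.1).
SCI_PACKET_TIMES = (16, 80, 272)


def make_periodic_jobs(n_jobs, rho, r, n_nodes, rng):
    """Draw jobs whose C/P lies in [rho - r, rho + r] (uniform draw is an assumption)."""
    if not 0 < r < rho:
        raise ValueError("need 0 < r < rho so that every utilization is positive")
    jobs = []
    for _ in range(n_jobs):
        c = rng.choice(SCI_PACKET_TIMES)
        util = rng.uniform(rho - r, rho + r)
        src = rng.randrange(n_nodes)
        dst = rng.choice([n for n in range(n_nodes) if n != src])
        jobs.append({"C": c, "P": c / util, "src": src, "dst": dst})
    return jobs


def cap_periodic_utilization(jobs, rho_p):
    """Admit jobs in order while the cumulative utilization stays within rho_p."""
    kept, total = [], 0.0
    for job in jobs:
        u = job["C"] / job["P"]
        if total + u <= rho_p:
            kept.append(job)
            total += u
    return kept


def make_aperiodic_jobs(template_jobs, arrival_rate, rng):
    """One-shot jobs with Poisson arrivals and relative deadline D = P."""
    t, out = 0.0, []
    for job in template_jobs:
        t += rng.expovariate(arrival_rate)  # exponential inter-arrival gaps
        out.append({"C": job["C"], "D": job["P"], "arrival": t,
                    "src": job["src"], "dst": job["dst"]})
    return out


def make_mixture_workload(n_jobs, rho, r, rho_p, arrival_rate, n_nodes, rng):
    """Periodic set capped at rho_p plus an independent aperiodic stream."""
    periodic = cap_periodic_utilization(
        make_periodic_jobs(n_jobs, rho, r, n_nodes, rng), rho_p)
    aperiodic = make_aperiodic_jobs(
        make_periodic_jobs(n_jobs, rho, r, n_nodes, rng), arrival_rate, rng)
    return periodic, aperiodic
```

Because the arrival rate is expressed per symbol time, the generated arrival times do not depend on the link bandwidth, which matches the normalization used in the experiments.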

[Figure 6: Performance of Job Packing, Train, and 2-bit protocols for the mixture job set. Panels: Job Miss Ratio vs. Aperiodic Job Arrival Rate (link speed = 1), one panel per periodic job bandwidth limit (second panel: 0.5). Curves: aperiodic and periodic jobs under the Job Packing algorithm, the Train protocol, and the 2Bit protocol.]

4.3 Comparative Study with Popular Real-Time Schemes

In this section we present our experimental results on the performance of popular real-time message scheduling algorithms on an SCI ring. We used an experimental setup similar to that of the previous section, with the same classes of workload. Below we describe and analyze the results obtained for each category, and compare them with the job packing algorithm.

Periodic Workload: The workloads and the performance metrics used in this set of experiments are the same as those used before for the train and 2-bit protocols. We evaluate the EAF, EDF, SSF, and FAF schemes and compare them with the job packing algorithm. The results are plotted in Figure 7. The general conclusions drawn from these figures are the following:

Simple algorithms like EAF, which is a variation of the FCFS service discipline, do not work well in a real-time environment. FAF, which depends only on the distance to the destination and ignores the deadlines of the jobs, fails to capture the real-time requirements of the jobs.

Both EDF and SSF work well in a real-time environment, since both of them are sensitive to the deadline (and, for SSF, to the computation time).
However, these algorithms may fail to guarantee message deadlines at high load because they cannot adapt the job priorities to the load.
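The four schemes compared above differ only in how a priority is computed for each job in a node's priority queue, which can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the expansions of the acronyms (EAF as earliest arrival first, SSF as smallest slack first, FAF as farthest away first), the `Job` fields, the unidirectional hop count used as ring distance, and all function names are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: float   # arrival time, in symbol times
    deadline: float  # absolute deadline, in symbol times
    c: float         # transmission time, in symbol times
    src: int         # source node index on the ring
    dst: int         # destination node index on the ring


def ring_distance(src, dst, n_nodes):
    """Hop count from src to dst, assuming a unidirectional ring."""
    return (dst - src) % n_nodes


def priority_key(scheme, job, now, n_nodes):
    """Return a sort key; the job with the smallest key is served first."""
    if scheme == "EAF":   # assumed: earliest arrival first (FCFS-like)
        return job.arrival
    if scheme == "EDF":   # earliest deadline first
        return job.deadline
    if scheme == "SSF":   # assumed: smallest slack first
        return job.deadline - now - job.c
    if scheme == "FAF":   # assumed: farthest destination first
        return -ring_distance(job.src, job.dst, n_nodes)
    raise ValueError(f"unknown scheme: {scheme}")


def service_order(scheme, jobs, now=0.0, n_nodes=8):
    """Pop all jobs from a priority queue keyed by the chosen scheme."""
    heap = [(priority_key(scheme, job, now, n_nodes), i, job.name)
            for i, job in enumerate(jobs)]  # index i breaks ties deterministically
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


jobs = [
    Job("a", arrival=0, deadline=400, c=272, src=0, dst=1),
    Job("b", arrival=5, deadline=100, c=16, src=0, dst=5),
    Job("c", arrival=9, deadline=300, c=80, src=0, dst=7),
]
for scheme in ("EAF", "EDF", "SSF", "FAF"):
    print(scheme, service_order(scheme, jobs))
```

None of these keys depends on the current load, which is the limitation noted above: the ordering stays the same however congested the ring becomes.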

[Figure 7: Performance comparison with popular real-time schemes for the periodic job set. Panels: Job Reject Ratio vs. Cumulative Job Utilization and Average Node Utilization vs. Cumulative Job Utilization, for random C/P of ±5% and ±35%. Curves: Job Packing, EAF, EDF, SSF, FAF.]

The job packing algorithm works well in a real-time environment. Although EDF and SSF work better than job packing at low load, the roles reverse as the load and the degree of randomness in the workload increase. The job packing algorithm can exploit the flexibility to maneuver the job sequence and change job priorities dynamically by re-sequencing the jobs.

The average node utilization is a direct manifestation of the effect of the job reject ratio, and can be explained in a similar way.

Aperiodic Workload: We use the same aperiodic workload for this set of experiments. The results are plotted in Figure 8. A trend similar to the one observed in the previous set of experiments can be seen here as well. At low load, job packing does not perform as well as the others. At high load, owing to the load sensitivity of the job packing scheme, it performs much better than the

rest.

[Figure 8: Performance comparison with popular real-time schemes for the aperiodic job set. Panels: Deadline Miss Ratio vs. Job Arrival Rate, for average job utilization 0.01 and 0.02. Curves: Job Packing, EAF, EDF, SSF, FAF.]

Mixture Workload: In this experiment, we choose one of the periodic job sets from the previous experiments and add a set of aperiodic jobs to it. The results are plotted in Figure 9. In the mixture workload, all the algorithms treat periodic and aperiodic jobs equally. However, the job packing algorithm has the best performance for both periodic and aperiodic jobs. This is due to its ability to move the existing job sequence around to fit in newly arrived jobs.

4.4 Comparative Study with Skip-Over Algorithm

We conducted a set of experiments to study the performance of the job packing algorithm with the skip-over modification in a soft real-time environment. The workload is the mixture workload used in the last section. The bandwidth reserved for the aperiodic jobs is 20% of the total bandwidth. For the periodic jobs, the acceptable error ratio ($E_r$) is defined as the fraction of lost packets among all the packets. In our experiments, we assume $E_r = 0.05$. We investigate the following skip-over schemes: skip the earliest deadline job, skip the latest deadline job, skip the longest transmission time job, and skip the shortest transmission time job. Figure 10 shows the simulation results for the different skip-over schemes. The conclusions drawn from these figures are the following:

The skip-over algorithm reduces not only the aperiodic job deadline miss ratio but also the periodic job deadline miss ratio. This is because skipping the OSJ job relaxes the tight time constraints, so that the system can guarantee more jobs' deadline

requirements.

[Figure 9: Performance comparison with popular real-time schemes for the mixture workload. Four panels: Job Miss Ratio vs. Aperiodic Job Arrival Rate (link speed = 1), grouped by periodic job bandwidth limit (lower pair: 0.5). In each pair, one panel compares Job Packing with EAF and EDF and the other compares Job Packing with SSF and FAF, with separate curves for aperiodic and periodic jobs.]

The skip-earliest-deadline scheme performs better than the skip-latest-deadline scheme. Because the job packing algorithm schedules jobs in deadline order, skipping the latest deadline job has less effect on the current job sequence. Skipping the earliest deadline job, on the other hand, requires more scheduling information to be updated, but also frees more slack time for the new job. The skip-earliest-deadline scheme therefore performs better, at the cost of more overhead in updating the scheduling information.

The skip-longest-transmission-time scheme performs better than the skip-shortest-transmission-time scheme.
This is because the former leaves more slack time in which to schedule other aperiodic jobs. Skipping the longest transmission time job also relaxes the time constraints

for other jobs in the same period, and allows the job packing algorithm to guarantee more aperiodic jobs.

[Figure 10: Performance comparison of the skip-over modified job packing algorithm. Two panels: Job Deadline Miss Ratio vs. Aperiodic Job Arrival Rate. One panel compares the job packing algorithm with the skip-earliest-deadline and skip-latest-deadline schemes; the other compares it with the skip-longest-job and skip-shortest-job schemes. Each has separate curves for aperiodic and periodic jobs.]

4.5 Discussion

The experiments reveal that the train protocol suffers from high maintenance overhead, whereas the 2-bit protocol may fall short in providing sufficient priority levels. Simple protocols like EAF and FAF do not work well in a real-time environment. EDF and SSF perform well at low load, but their high-load performance is weaker because their job priority schemes are not load sensitive. An algorithm like job packing, which is load sensitive and able to prioritize real-time jobs dynamically, is well suited to the SCI/RT environment. With the skip-over algorithm applied, the job packing algorithm can also be used in a soft real-time environment, where it likewise performs very well. The job packing algorithm is thus a good candidate for both hard real-time and soft real-time applications. However, the current version of the proposed algorithm has moderately high overhead in the job sequencing phase. More work is needed to lower its complexity and make it more amenable to running online.

5 Concluding Remarks

In this paper we have discussed our research on real-time message transmission over the Scalable Coherent Interface.
The main thrust of the work was to study the performance of different SCI/RT

candidate schemes and the suitability of some of the popular real-time message delivery techniques when applied to the SCI paradigm. The study was carried out through extensive simulation of all these schemes. We observe that different schemes suffer from different limitations, and conclude that a flexible, load-sensitive scheme is well suited for SCI/RT. In this regard we have developed a new real-time message scheduling protocol over SCI, called the job packing algorithm. We have conducted a simulation study using real-time workloads, and have shown the benefit of the proposed scheme. Work is under way to reduce the computational overhead of the job packing algorithm and to specify the protocol in detail.

References

[1] B. W. Abeysundara and A. E. Kamal. High-Speed Local Area Networks and Their Performance: A Survey. ACM Computing Surveys, 23(2), June.
[2] Duane L. Anderson. A Proposal to the P (SCI/RT) Working Group for A Preemptive Priority Queue Protocol. Technical report, Edgewater Computer System, Inc.
[3] A. Burns. Scheduling Hard Real-Time Systems: A Review. Software Engineering Journal, May.
[4] D. B. Gustavson, B. E. Stewart, and D. L. Anderson. SCI/RT:D0.13. November.
[5] David James and David Gustavson. Draft Proposals for Real-Time Transactions on SCI. Technical report, Apple Computer and SCIzzL, Version
[6] Lei Jiang. Real-Time Message Transmission over The Scalable Coherent Interface (SCI). Master's Thesis, Department of Computer Science and Engineering, University of Nebraska-Lincoln, May.
[7] M. Joseph and P. Pandya. Finding Response Times in a Real-Time System. British Comput. Soc., Oct.
[8] Gilad Koren and Dennis Shasha. Skip-Over: Algorithms and Complexity for Overloaded Systems that Allow Skips. IEEE Real-Time Systems Symposium.
[9] J. P. Lehoczky, L. Sha, and Y. Ding. The Rate Monotonic Scheduling Algorithm: Exact Characterization and Average-Case Behavior. IEEE Real-Time Systems Symposium.
