
In: Proc. 1st Annual Conf. of the Advanced School for Computing and Imaging, May 1995, pp. 293-302

Predicting the performance of general task graphs with underlying queueing model

Henk Jonkers, Gerard L. Reijns
Delft University of Technology, Faculty of Electrical Engineering
P.O. Box 5031, 2600 GA Delft, The Netherlands

Abstract

The Glamis methodology provides a framework for performance modelling of parallel applications, based on a combination of task graph models and queueing network models. This paper presents a new, moderately cheap algorithm which enables the analysis of arbitrary task graph models of parallel programs with general precedence relationships, given a queueing model of the underlying machine. The algorithm generalises over an earlier algorithm. An extension to this algorithm is proposed to allow for the analysis of a certain type of mutual exclusion at the task level. The features of the methodology are illustrated by means of a case study, modelling a task graph executor running on a distributed-memory parallel machine. Measurements carried out on the actual machine, as well as simulations, are used to validate the predictions.

1 Introduction

For the effective design of parallel applications, fast but accurate performance predictions are required. Traditional performance modelling techniques, e.g. queueing networks, are not directly suited to model parallel applications, because they are not capable of expressing certain types of synchronisation. Other techniques, such as Petri nets and simulation, are often too computation-intensive. Our objective is to develop a methodology for performance modelling and prediction of general parallel systems. Our approach is based on a combination and extension of existing techniques, selecting the best features of these techniques in such a way that reasonably accurate predictions are obtained, while keeping the analytical cost within acceptable limits.
Our predictions will typically be used to provide feedback to the user, in order to support decisions with respect to optimal system parameters and program design. Although fairly efficient, the analysis will generally be too slow to be used within compilers to support compile-time optimisations. For this purpose, very efficient static methods, preferably yielding symbolic predictions, are required, for a first-order estimate of the performance of the different alternatives. An example of such a method, which includes the effect of contention, is serialisation analysis [5].

The Glamis methodology for performance modelling of parallel applications comprises a modelling formalism and analysis algorithms to predict the completion time of parallel programs, making use of queueing networks to model the influence of the underlying parallel machine. While previous papers introduced the methodology and described algorithms to analyse a subclass of parallel programs [9, 10], this paper describes a new algorithm for the analysis of programs with arbitrary synchronisation patterns, thus generalising over the previous algorithms. It will be shown that a similar approach can be followed to include the impact of coarse-grain, program-level mutual exclusion in the predictions.

A case study has been carried out, concerning a task graph executor running on a distributed-memory parallel machine, serving both to illustrate the concepts introduced in this paper and to validate the predictions. A model was made of this application, including the influence of communications, different mappings, and two different scheduling strategies. The predictions were compared to simulations and execution times measured on the actual parallel machine.

The remainder of this paper is organised as follows. The next section gives an overview of the main features of the Glamis methodology, and its relation to other approaches. Section 3 describes the representation of queueing models of a machine, while section 4 describes the program modelling formalism. Section 5 introduces the task graph analysis algorithm. In section 6 it is shown intuitively how this algorithm can be modified to include the analysis of coarse-grain mutual exclusion. Section 7 presents the case study and its results. Finally, in section 8 some conclusions are drawn.

2 Overview of the methodology

Several authors have adopted an approach combining queueing networks and task graphs to model parallel applications [11, 14]. This combination has some attractive features. Both formalisms occupy a favourable position on the trade-off between modelling power and analytical efficiency. Taken separately, they lack the expressiveness to reliably model parallel systems. Queueing networks are not capable of expressing condition synchronisations, while task graphs cannot express mutual exclusion. However, these shortcomings are complementary: when used together, both synchronisation types are covered.

Glamis differs from related approaches on a few important points. A major difference is the distribution of the task completion times. Most related approaches assume exponentially distributed completion times [7, 11, 14].
We assume a negligible variance in completion times. Our experiments, as well as a recent publication by Adve and Vernon [1], support this assumption, provided that certain requirements are met. The complexity of Mak's method [11] is polynomial, but its applicability is restricted to series-parallel (SP) graphs. The worst-case complexity of the analysis of Thomasian's models [14] is exponential.

Machine models in Glamis are closed queueing models. We restrict ourselves to the class of separable networks, in order to keep the analytical cost low. Separable networks can be analysed in polynomial time using algorithms such as mean value analysis (MVA), while the worst-case complexity of the analysis of general queueing networks is exponential. Approximate MVA, such as the Schweitzer algorithm [13], can be applied to further reduce the complexity, sacrificing some of the accuracy.

Program models consist of tasks, condition synchronisations between the tasks and the workload imposed by each task on the system resources. The workload is specified in terms of instruction counts. Instruction counts rather than direct visit counts to the resources are used, in order to keep the program models machine-independent. This separation is important in order to obtain reusable models. It also leads to more comprehensible models. However, a complete separation is not possible: the models will always have to share the same (logical) instruction set. A mapping will be defined for the derivation of visit counts from the instruction counts. The instruction set together with the mapping is the model counterpart of the programming interface.

Parallel systems and programs often display a high degree of symmetry. Examples are identical processors, memory banks or interconnection switches at the machine level, or identical tasks at the program level. Glamis exploits these symmetries through replicated model elements, aiming to reduce the analytical cost and to obtain scalable models. In queueing models, the analytical cost of certain symmetric subsystems is reduced significantly using aggregation.

3 Machine models

A model of a (virtual) parallel machine consists of a queueing model of the architecture, the instruction set of the machine and a mapping of the instruction set to a workload on the queueing model elements in terms of visit counts.

A queueing model is defined as a tuple of building blocks, each block representing one or more identical system resources. Because at the low level MVA is applied to solve the queueing model, only the total workload on each service centre needs to be known, i.e. the exact way in which the building blocks are connected to each other is irrelevant. As analysability is a prerequisite of Glamis models, the building blocks are chosen in such a way that queueing models constructed with these blocks are (approximately) separable.

A queueing model block consists of a letter denoting the type of service centre, a superscript denoting the number of service centres in the block (default 1), a subscript denoting the number of servers for each service centre (default 1) and an argument denoting the mean service time (or the total service demand on the centres for a visit to the block, in case of a block of several service centres).

Three types of service centres are distinguished. A delay centre or infinite server, denoted by an I, can have any service time distribution. A queueing centre with first come, first served (FCFS) scheduling and a deterministic service time, denoted by a D, can only be analysed approximately. All types of queueing centres which are allowed in separable queueing networks [2] are denoted by a Q. For FCFS centres this requires an exponentially distributed service time with a common mean for all classes.
For processor sharing (PS) or last come, first served centres with preemptive resume (LCFS-PR), any service time distribution is permitted.

The queueing model describes the physical resources of a (parallel) system. In order to complete the model of a virtual machine, the programming interface needs to be described. This interface is modelled as an instruction set and a mapping from the instructions to visit counts representing the workload on the queueing model elements. Formally, a machine model is described by the tuple

$\langle Q, S, Y, K, M, I, F \rangle$

$Q$ is the set of queueing model elements. The mean service time for every service centre is specified by the function $S : Q \to \mathbb{R}^+$. The function $Y : Q \to \{i, d, q\}$ defines the queue type. In case of a multiple-server centre, the function $K : Q \to \mathbb{N}$ specifies the multiplicity of the centre (for a single server the multiplicity is one). In case of a block of identical centres, the function $M : Q \to \mathbb{N}$ specifies the number of centres. The instruction set of the machine is denoted by $I$. The function $F : I \times Q \to \mathbb{R}^+$ maps instructions to visit counts.

It is often convenient to impose an order on the elements in the sets, resulting in a vector of queueing model elements $\vec{q} = (q_1, \ldots, q_{|Q|})$ and an instruction vector $\vec{i} = (i_1, \ldots, i_{|I|})$. Vector counterparts $\vec{s}$, $\vec{y}$, $\vec{k}$ and $\vec{m}$ of the functions $S$, $Y$, $K$ and $M$ can be defined, which are applied element-wise on their vector arguments. Mapping $F$ can be represented compactly as a $|I| \times |Q|$ matrix $[F]$.
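As an illustration only, the machine model tuple could be held in a small Python data structure along the following lines. The class names `Block` and `MachineModel`, the field names, and the example machine with a `cpu` and a `link` element are all hypothetical; they are not part of Glamis itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """One queueing model element q in Q (hypothetical representation)."""
    kind: str            # Y(q): "i" delay, "d" deterministic FCFS, "q" separable centre
    service_time: float  # S(q): mean service time
    servers: int = 1     # K(q): number of servers per centre
    centres: int = 1     # M(q): number of identical centres in the block

    def __post_init__(self):
        if self.kind not in ("i", "d", "q"):
            raise ValueError(f"unknown queue type: {self.kind!r}")
        if self.service_time < 0 or self.servers < 1 or self.centres < 1:
            raise ValueError("invalid block parameters")


@dataclass
class MachineModel:
    blocks: dict                                # Q together with S, Y, K, M
    visits: dict = field(default_factory=dict)  # F: instruction -> {block: visits}

    @property
    def instructions(self):
        """The instruction set I, taken here from the keys of F."""
        return set(self.visits)

    def visit_count(self, instruction, block):
        """F(instruction, block); pairs that are not listed count as zero."""
        if block not in self.blocks:
            raise KeyError(block)
        return self.visits[instruction].get(block, 0.0)


# Hypothetical machine: one processor and a block of four identical links.
machine = MachineModel(
    blocks={"cpu": Block("q", 1.0e-6), "link": Block("d", 1.0e-4, centres=4)},
    visits={"compute": {"cpu": 1.0}, "send": {"cpu": 0.5, "link": 3.0}},
)
```

Storing $F$ sparsely (only non-zero visit counts) is a design choice of this sketch; a dense $|I| \times |Q|$ matrix would serve equally well.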

4 Program models

A parallel program is specified in terms of a general task graph. A task graph is a directed acyclic graph (DAG), in which the nodes represent the tasks (i.e. units of computation in a parallel program) and the edges represent task precedence relationships (i.e. condition synchronisations). Tasks are mutually independent except for their precedence relationships and shared use of the same (hardware or software) resources. In order to keep task graphs of large programs manageable and to improve scalability, we will also use replicate tasks, i.e. a parallel section consisting of $k$ identical tasks (see figure 1). Tasks are identical if they share the same predecessors and successors, and impose the same workload on the system resources. Formally, a task graph is described by a tuple

$\langle T, N, P, I, C \rangle$

$T$ is the set of tasks. The function $N : T \to \mathbb{N}$ defines the multiplicity of every task in $T$. The successor function $P : T \to \wp(T)$ defines the precedence relationships between the tasks: $t_j \in P(t_i)$ if and only if $t_j$ is a direct successor of $t_i$. The instruction set used in the program is given by $I$. This is the only overlap with the machine model. The function $C : T \times I \to \mathbb{R}^+$ specifies the instruction counts, i.e. $C(t_i, i_j)$ is the average number of times every instance of task $t_i$ executes instruction $i_j$. Together with the function $F$ from the machine model this function determines the visit counts of the tasks on the different queueing model elements. The visit count $V(t, q)$ of a task $t \in T$ to queueing model element $q \in Q$ is given by

$V(t, q) = \sum_{x \in I} C(t, x)\, F(x, q)$

[Figure 1: Task with a multiplicity of k]

[Figure 2: Example task graph]

Similarly to the machine model case, it will often be convenient to use a task vector $\vec{t} = (t_1, \ldots, t_{|T|})$, and vector counterparts $\vec{n}$ and $\vec{p}$ of the functions $N$ and $P$. The result of the latter vector function is a vector of subsets of $T$.
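The visit-count formula above can be evaluated directly once $C$ and $F$ are stored as tables. The following Python sketch assumes that both are given as nested lists, with rows of `C` indexed by tasks and rows of `F` indexed by instructions; the function name `visit_counts` and all numbers are made up for illustration.

```python
def visit_counts(C, F):
    """Return V with V[t][q] = sum over x of C[t][x] * F[x][q].

    C has one row per task and one column per instruction;
    F has one row per instruction and one column per queueing model element.
    """
    n_instr = len(F)
    n_elems = len(F[0]) if F else 0
    for row in C:
        if len(row) != n_instr:
            raise ValueError("each row of C needs one entry per instruction")
    return [
        [sum(row[x] * F[x][q] for x in range(n_instr)) for q in range(n_elems)]
        for row in C
    ]


# Hypothetical data: two tasks, instructions (compute, send), elements (cpu, link).
C = [[100.0, 2.0],   # task 1: 100 compute instructions, 2 sends
     [50.0, 0.0]]    # task 2: 50 compute instructions, no sends
F = [[1.0, 0.0],     # compute: one visit to the cpu
     [0.5, 3.0]]     # send: half a visit to the cpu, three visits to the link
V = visit_counts(C, F)
```

With these made-up numbers, task 1 makes 101 visits to the cpu and 6 to the link, while task 2 makes 50 visits to the cpu only.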
In combination with an instruction vector $\vec{i}$, the function $C$ can be specified compactly as a $|T| \times |I|$ matrix $[C]$. Visit counts can then be derived by a simple matrix-matrix multiplication:

$[V] = [C]\,[F]$

5 Model analysis

With the first introduction of Glamis [9], an iterative algorithm was presented for the analysis of programs with an SPS-structure, i.e. programs consisting of a sequence of parallel sections, each section possibly containing different types (or classes) of tasks. In other words, the only condition synchronisations considered were barrier synchronisations. In a follow-up paper [10], an alternative algorithm was presented, in many cases improving over the first algorithm. This section generalises the latter algorithm, resulting in an algorithm allowing for the analysis of general task graph models as described by the formalism in the previous section.

Similar to previous algorithms, the analysis of a combined machine and program model can be divided into low-level analysis, capturing the influence of machine aspects modelled with queueing networks, and high-level analysis, yielding the overall program performance. The results of the low-level analysis are used as input to the high-level algorithm. Some aspects of low-level queueing model analysis specific for Glamis and the high-level task graph analysis algorithm are presented in the following subsections.

Queueing network analysis. In the high-level program analysis algorithm, which will be described in the next subsection, the response times are calculated using an MVA function mva. Any variant of multiple-class MVA (either exact [12] or approximate [3, 13]) can be used. The choice between an exact or an approximate solution solely depends on the required accuracy and the analytical cost that is still acceptable, thus once again illustrating the trade-off between efficiency and accuracy.

By means of aggregation [4], the analysis of queueing networks containing replicated service centres can be made more efficient. A block of identical service centres is replaced by a single flow-equivalent service centre which, because of the symmetry, has a service rate given by a closed-form expression. For a block of centres with an exponentially distributed service time, $Q^m(S)$, this rate, for population $n$, is given by [9, 15]:

$$\mu_m(n) = \frac{n}{(m + n - 1)\,S}$$

For a block of FCFS centres with a deterministic service time, $D^m(S)$, an exact flow-equivalent service rate can only be derived for small values of $m$ or $n$. For the general case, a good approximation is the following expression [10]:

$$\mu_m(n) = \frac{1}{S} \sum_{j=0}^{n-1} \left(\frac{m-1}{m}\right)^{j} \ \text{if } m \ge n, \qquad \mu_m(n) = \frac{1}{S} \sum_{j=0}^{m-1} \left(\frac{n-1}{n}\right)^{j} \ \text{if } m \le n$$

Because deterministic service times for FCFS centres violate the product-form requirements, an additional prediction error is introduced when incorporating this block in the total queueing network.
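The two flow-equivalent service rates can be evaluated with a few lines of Python. This is a minimal sketch of the expressions for $Q^m(S)$ and $D^m(S)$ as they read in this section; the function names are hypothetical and the arguments follow the symbols $m$, $n$ and $S$ used in the text.

```python
def rate_exponential(m, n, S):
    """Flow-equivalent rate of a block Q^m(S) at population n: n / ((m + n - 1) S)."""
    if m < 1 or n < 1 or S <= 0:
        raise ValueError("need m >= 1, n >= 1 and S > 0")
    return n / ((m + n - 1) * S)


def rate_deterministic(m, n, S):
    """Approximate flow-equivalent rate of a block D^m(S) at population n."""
    if m < 1 or n < 1 or S <= 0:
        raise ValueError("need m >= 1, n >= 1 and S > 0")
    if m >= n:
        terms, ratio = n, (m - 1) / m   # sum over j = 0 .. n-1 of ((m-1)/m)^j
    else:
        terms, ratio = m, (n - 1) / n   # sum over j = 0 .. m-1 of ((n-1)/n)^j
    return sum(ratio ** j for j in range(terms)) / S
```

For `m == n` both branches of the deterministic expression coincide, so the choice of branch at the boundary does not matter.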
The error introduced by deterministic FCFS blocks is, however, generally limited, and in most cases less than the error that results from the assumption of probabilistic service times, especially in the case of highly symmetric structures which often occur in parallel systems.

High-level task graph analysis algorithm. The high-level algorithm captures the effects of condition synchronisations at the task level, using the results obtained with MVA to include machine influences. For convenience, we introduce a predecessor function $E$, defined as:

$E : T \to \wp(T); \quad \forall u \in T\ \forall t \in T : u \in E(t) \Leftrightarrow t \in P(u)$

The algorithm, using the formal description of a task graph from section 4, is presented in figure 3. Set $A$, containing the active tasks, is initialised with all tasks without predecessors. Set $B$, containing the completed tasks, is initially empty. The response times for all active tasks are calculated using MVA, as described in the previous subsection. The number of job classes in the queueing network is equal to the number of active tasks. The number of jobs in a class is equal to the multiplicity of the corresponding task. The visit counts of the different tasks to the queueing model elements are derived using the functions $C$ and $F$, as indicated in section 4. The function mva returns the response times $R_a$ for all tasks $a \in A$ obtained with MVA, given the set of active tasks $A$. All tasks with a minimum response time (denoted by $R_{min}$) are removed from $A$ and added to $B$. The instruction counts of the tasks that remain active are scaled down by the fraction of work they completed in the meantime. Successors of the completed tasks are added to $A$, provided that all their predecessors are members of $B$. All steps are repeated until $A$ is empty. The total completion time is equal to the sum of the minimum response times in every iteration.
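The steps described above can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than part of the method: every task has multiplicity 1, the per-instruction counts $C(a, i)$ are collapsed into a single remaining demand per task, ties on $R_{min}$ are detected with a floating-point tolerance, and the MVA step is replaced by a pluggable function. The stand-in `delay_only` corresponds to a machine consisting of delay centres only, where there is no contention and the response time equals the remaining demand.

```python
import math


def delay_only(active, demand):
    """Stand-in for mva: without contention, R_a is the remaining demand of a."""
    return {a: demand[a] for a in active}


def analyse_task_graph(successors, demand, response_times=delay_only):
    """Return the predicted completion time R_tot of a task graph.

    successors     -- dict: task -> set of direct successors (function P)
    demand         -- dict: task -> workload (stand-in for instruction counts)
    response_times -- callable(active, demand) -> dict: task -> R_a
    """
    tasks = list(demand)
    preds = {t: set() for t in tasks}           # predecessor function E
    for u in tasks:
        for t in successors.get(u, ()):
            preds[t].add(u)

    remaining = dict(demand)
    active = {t for t in tasks if not preds[t]}  # set A
    done = set()                                 # set B
    total = 0.0                                  # R_tot
    while active:
        R = response_times(active, remaining)
        r_min = min(R[a] for a in active)
        total += r_min
        finished = {a for a in active if math.isclose(R[a], r_min)}
        done |= finished
        active -= finished
        for a in active:                         # work left after r_min time units
            remaining[a] *= 1.0 - r_min / R[a]
        for a in finished:                       # activate successors that are ready
            active |= {b for b in successors.get(a, ()) if preds[b] <= done}
    return total


# Hypothetical diamond-shaped graph: t1 -> {t2, t3} -> t4.
successors = {"t1": {"t2", "t3"}, "t2": {"t4"}, "t3": {"t4"}}
demand = {"t1": 0.3, "t2": 0.2, "t3": 0.4, "t4": 0.1}
completion = analyse_task_graph(successors, demand)
```

With the contention-free stand-in, the result reduces to the length of the critical path (t1, t3, t4 in this made-up graph). Plugging in a real multiple-class MVA routine for `response_times` is what introduces the machine influence.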

 1: $A := \{t \in T \mid E(t) = \emptyset\}$ ;
 2: $B := \emptyset$ ; $R_{tot} := 0.0$ ;
 3: while $A \neq \emptyset$ do
 4:   $\forall a \in A : R_a := \mathit{mva}(a, A)$ ;
 5:   $R_{min} := \min_{a \in A} R_a$ ; $R_{tot} := R_{tot} + R_{min}$ ;
 6:   $A_{min} := \{a \in A \mid R_a = R_{min}\}$ ;
 7:   $B := B \cup A_{min}$ ; $A := A \setminus A_{min}$ ;
 8:   $\forall a \in A\ \forall i \in I : C(a, i) := \left(1 - \frac{R_{min}}{R_a}\right) C(a, i)$ ;
 9:   $\forall a \in A_{min} : A := A \cup \{b \in P(a) \mid E(b) \subseteq B\}$ ;
10: endwhile
11: return $R_{tot}$ ;

Figure 3: Task graph analysis algorithm

6 Analysis of coarse-grain mutual exclusion

This section concerns the analysis of a certain type of coarse-grain (task-level) mutual exclusion, which cannot easily be expressed in a regular queueing network, e.g. tasks being executed within a critical code section. These synchronisations resemble the regular precedence relationships specified by a task graph. However, the order in which two mutually exclusive tasks are executed is not specified. Because of this fundamental difference between condition synchronisation and mutual exclusion, treating coarse-grain mutual exclusion in the same way as condition synchronisations yields less robust answers.

Still, this approach has some attractive features. Firstly, resource usage modelled by the queueing network is still possible within a task participating in such a mutual exclusion relationship. In this way, efficient analysis of a certain type of simultaneous resource possession is made possible, thus extending the scope of the original methodology. Other types of simultaneous resource possession, e.g. occurring in circuit-switched interconnection networks, cannot be solved with this method. These require different solutions, e.g. the method of surrogates [8] (which can directly be combined with our high-level task graph analysis algorithms, replacing the mva function). A second advantage of using this type of mutual exclusion, which inherently represents an FCFS scheduling policy, is that it allows for different service times for different tasks. This is not allowed for an FCFS server in a queueing model.
Finally, transient effects are automatically included in the predictions.

A function $U : T \to \wp(T)$ is defined in addition to the functions $P$ and $E$, yielding a set of tasks to be executed in mutual exclusion with the argument task. For simplicity, we will assume that all tasks have a multiplicity of 1. Before activating a task $t$, an additional check is necessary to make sure no task in $U(t)$ is active. When a task $t$ finishes, members of $U(t)$ are candidates for activation, in addition to the members of $P(t)$. Because several tasks to be executed in mutual exclusion might simultaneously become ready for execution, the activation of tasks must be carried out strictly sequentially (which is indicated by the use of a for-statement rather than the $\forall$-symbol in the code fragments given below), in order to make sure that only one of these tasks will actually become active. The algorithm from figure 3 only needs to be adapted at two places. The first line becomes

 1: for $t \in T$
      if $E(t) = \emptyset \wedge U(t) \cap A = \emptyset$ then $A := A \cup \{t\}$

Line 9 is replaced by

 9: for $a \in A_{min}$
      for $b \in P(a) \cup U(a)$
        if $E(b) \subseteq B \wedge U(b) \cap A = \emptyset$ then $A := A \cup \{b\}$

This approach to account for mutual exclusion can never replace queueing models, because of the completely different nature of this type of synchronisation. This approach assumes that the arbitration of contention is completely deterministic, which is normally not the case. Consequently, the algorithm yields one sample of the execution time distribution, which can be anywhere in the range of possible execution times. When interested in the mean execution time, a more complicated strategy is required, enumerating over all possibilities, similar to the way coarse-grain conditional statements are analysed [10]. However, when the number of decisions to be taken is large, this becomes infeasible. A final restriction is that the method is only applicable for an FCFS scheduling policy.

7 Case study: task graph executor

The case study concerns the performance prediction of a task graph executor using a farmer-worker strategy, running on a Parsytec GCel distributed-memory machine¹. A central processor (farmer) distributes tasks ready to be executed to the workers (the number of workers $W$ ranging from 1 to 4). Tasks are distributed asynchronously, i.e. the farmer only determines which worker will execute which tasks; the workers are responsible for the scheduling of tasks and, if necessary, task queueing. A star topology is adopted, therefore all communications take place between farmer and workers, and are nearest-neighbour.

Two scheduling disciplines for the workers are distinguished. In first-come first-served (FCFS) scheduling, at most one task is active on a processor at a time; other tasks scheduled on the same processor are queued until the task finishes. This is modelled with mutually exclusive tasks, as described in section 6. The workload within the context of the processor is simply modelled as a delay centre.
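The sequential activation rule of section 6, on which this FCFS model relies, can be sketched as a small Python helper. The function name `activate`, its argument layout, and the explicit guard against re-activating completed tasks are assumptions of this sketch; the order of `candidates` plays the role of the deterministic arbitration mentioned in section 6.

```python
def activate(candidates, active, done, preds, mutex):
    """Activate ready candidates one at a time, respecting mutual exclusion.

    candidates -- ordered iterable of tasks considered for activation
    active     -- set A of active tasks (updated in place and returned)
    done       -- set B of completed tasks
    preds      -- dict: task -> set of predecessors (function E)
    mutex      -- dict: task -> set of mutually exclusive tasks (function U)
    """
    for b in candidates:                 # strictly sequential, as in the for-loops
        if b in active or b in done:     # assumption: never re-activate a task
            continue
        if preds[b] <= done and not (mutex[b] & active):
            active.add(b)
    return active


# Hypothetical example: x and y have no predecessors but exclude each other.
preds = {"x": set(), "y": set()}
mutex = {"x": {"y"}, "y": {"x"}}

first_round = activate(["x", "y"], set(), set(), preds, mutex)

# After x finishes, the members of U(x) become candidates again.
second_round = activate(sorted(mutex["x"]), set(), {"x"}, preds, mutex)
```

In the first round only `x` becomes active, because `y` is checked after `x` has already been added; reversing the candidate order would activate `y` instead, which is the source of the sampling effect discussed in section 6.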
When using processor sharing (PS) scheduling, tasks running on the same worker are executed concurrently, in different threads. This is modelled as a workload on a single-server queueing centre.

The communication model is based on the model of the Parsytec GCel described by Van Gemund [6]. The key property of this model is that every communication link is modelled as a single queueing centre, representing the exclusive use of the link and the DMA devices of sender and receiver. Sending a message also imposes a small workload on the sending and receiving processors. Additional workload which occurs when two communications in opposite directions take place simultaneously (as a result of additional acknowledgement traffic) is ignored.

A queueing model of this system (for $W = 4$) is shown in figure 4. Every shaded box represents a node. The farmer node consists of one processor queue and $W$ queues for the incoming communication links (accounting for both sender and receiver DMA channels). Every worker node consists of a processor queue and a link queue for incoming communications. One visit to the communication link models the transmission of one 120-byte packet. The transmission time of a packet over a link is 108 µs.

Consider the task graph shown in figure 5. The computation time is 0.3 sec. for tasks 1 and 8, and 0.2 sec. for the other tasks. The communication between tasks is negligible.

¹ Kindly made available by the Interdisciplinary Center for Computer-based Complex systems research Amsterdam (IC³A).

[Figure 4: Queueing model]

[Figure 5: Example task graph]

This graph is executed on a varying number of processors, using different task to processor mappings. The difference between PS and FCFS scheduling is studied. The results are summarised in table 1 for various configurations.

[Table 1: Results of the example task graph. Columns: P, Map., and Meas., Sim., Pred. for FCFS and for PS; rows: mappings A, B, C and D]

In addition to measured execution times and analytical predictions, execution times obtained with model simulation are also included, in order to distinguish between modelling errors and analytical errors. In all cases, the predictions and simulation results are almost identical to the measured times. The small differences found can entirely be attributed to measurement inaccuracies, and some overhead not included in the models.

In mapping A, tasks 1 to 4 are mapped on processor 1, and tasks 5 to 8 on processor 2. Measurement, as well as simulation and prediction, show that for this mapping FCFS scheduling performs better than PS scheduling. Simulation shows that, due to non-deterministic arbitration of mutual exclusion, the actual completion time in case of FCFS can vary from 1.2 to 1.6 sec. However, due to the implementation of the executor, only 1.2 sec. is measured in practice, the same value that is predicted by our algorithm. Mapping B (tasks 1, 3, 5, 7 → proc. 1, tasks 2, 4, 6, 8 → proc. 2) is the optimal mapping for 2 processors. In case of mapping C (tasks 1, 4, 5 → proc. 1, tasks 2, 6, 8 → proc. 2, tasks 3, 7 → proc. 3), a similar situation as in mapping B occurs. Simulation shows that an execution time of either 0.7 or 0.8 sec. is possible for FCFS, while measurement and prediction only give 0.8 sec. PS scheduling performs slightly worse. Mapping D is optimal for 3 processors (tasks 1, 5, 7 → proc. 1, tasks 2, 4, 8 → proc. 2, tasks 3, 6 → proc. 3). For four processors and an optimal mapping, the lowest possible execution time of 0.5 sec., corresponding to the critical path, is obtained.
As a second experiment, the completion times of random task graphs with different communication behaviour are studied. Only PS scheduling is considered. The results are shown in tables 2 (average communication), 3 (only communication) and 4 (low communication). All times are in seconds. Relative deviations from the numbers in the previous column (in %) are shown in parentheses. In case of low communication, only task parameters are transferred from farmer to worker (64 data bytes). For average communication and only communication, a message of 20,000 bytes is sent to the worker for each task assigned to it, and a result message of 20,000 bytes is sent back to the farmer after task completion. It appears that the errors are highest in task graphs with only communication, which is explained by the simplifications in the communication model. Very accurate results are obtained for task graphs with relatively little communication. For most task graphs, an accuracy within 5% can be expected.

Table 2: Random task graphs with 5 and 50 tasks, average communication. Relative deviations (%): (2.2) (0.6) (1.0) (0.5) (2.5) (0.9) (1.4) (0.0) (0.2) (0.1) (1.0) (0.2) (0.9) (0.1) (1.6) (0.3)

Table 3: Task graphs with 75 and 200 tasks, only communication. Relative deviations (%): (2.1) (7.3) (3.6) (4.1) (8.0) (6.2) (7.0) (1.6) (1.4) (7.4) (3.5) (5.1) (4.1) (3.0) (1.5) (3.9)

Table 4: Task graphs with 35 and 60 tasks, low communication. Relative deviations (%): (0.0) (0.3) (0.0) (0.3) (0.3) (0.4) (0.8) (0.2) (0.0) (0.6) (0.6) (0.7) (0.3) (0.8) (0.2) (0.3)

8 Conclusions

This paper presents a new algorithm for the analysis of parallel programs with general static synchronisation patterns, given a queueing model to capture the influence of the underlying parallel machine, e.g. contention for hardware resources. This algorithm, which is part of the Glamis methodology for performance modelling of parallel applications, generalises over an earlier presented algorithm [10] applicable to only a subclass of task graphs. The algorithm is extended to include the analysis of programs with a certain type of mutual exclusion at the task level, although the results may be less accurate due to the non-deterministic character of mutual exclusion synchronisation. A measurement-based case study is presented, serving to illustrate the main features of the methodology and to validate the algorithm (including the extension).

The Glamis methodology aims at the accuracy required for feedback to the user.
Efficient numerical methods, yielding reasonably accurate performance predictions in polynomial time (e.g. MVA), are used for the analysis of the queueing model. The number of invocations of the queueing model analysis algorithm is in the worst case equal to the number of tasks, which means that the overall complexity of the analysis remains polynomial. Because of the probabilistic foundation of the methodology, non-deterministic features can easily be taken into account. Other key features of Glamis include modelling ease, flexibility and scalability of the models. The efficiency and scalability can be improved when exploiting the symmetrical structure of many parallel architectures and programs.

The case study shows that the most important characteristics of a realistic application, including the effects of communication and resource contention, can be captured in a relatively simple model. The performance predictions obtained with this model, either using simulation or our analytical methods, match the measured values well. The effects of different mappings and scheduling policies are correctly predicted.

References

[1] V.S. Adve and M.K. Vernon, "The influence of random delays on parallel execution times," in Proc. ACM SIGMETRICS Conf. on Measurement and Modelling of Computer Systems, May 1993, pp. 61-73.
[2] F. Baskett, K.M. Chandy, R.R. Muntz and F.G. Palacios, "Open, closed, and mixed networks of queues with different classes of customers," J. of the ACM, vol. 22, Apr. 1975, pp. 248-260.
[3] K.M. Chandy and D. Neuse, "Linearizer: A heuristic algorithm for queueing network models of computing systems," Comm. of the ACM, vol. 25, Feb. 1982, pp. 126-134.
[4] P.J. Courtois, Decomposability: Queueing and Computer System Applications. Academic Press,
[5] A.J.C. van Gemund, "Compiling performance models from parallel programs," in Proc. 8th ACM Int. Conf. on Supercomputing, Manchester, July 1994, pp. 303-312.
[6] A.J.C. van Gemund and G.L. Reijns, "Predicting Parallel System Performance with Pamela," in these proceedings.
[7] P. Heidelberger and K.S. Trivedi, "Analytic queueing models for programs with internal concurrency," IEEE Tr. on Computers, vol. 32, Jan. 1983, pp. 73-82.
[8] P.A. Jacobson and E.D. Lazowska, "Analyzing queueing networks with simultaneous resource possession," Comm. of the ACM, vol. 25, Feb. 1982, pp. 142-151.
[9] H. Jonkers, "Queueing models of parallel applications: The Glamis methodology," in Computer Performance Evaluation: Modelling Techniques and Tools (LNCS 794) (G. Haring and G. Kotsis, eds.), Springer-Verlag, May 1994, pp. 123-138.
[10] H. Jonkers, A.J.C. van Gemund and G.L. Reijns, "A probabilistic approach to parallel system performance modelling," in Proc. 28th Hawaii Int. Conf. on System Sciences, Vol. II, IEEE, Jan. 1995, pp. 412-421.
[11] V.W. Mak and S.F. Lundstrom, "Predicting performance of parallel computations," IEEE Tr. on Parallel and Distributed Systems, vol. 1, July 1990, pp. 257-270.
[12] M. Reiser and S.S. Lavenberg, "Mean value analysis of closed multichain queueing networks," J. of the ACM, vol. 27, Apr. 1980, pp. 313-322.
[13] P. Schweitzer, "Approximate analysis of multiclass closed networks of queues," in Proc. of Int. Conf. on Control and Optimization, Amsterdam,
[14] A. Thomasian and P.F. Bay, "Analytic queueing network models for parallel processing task systems," IEEE Tr. on Computers, vol. 35, Dec. 1986, pp. 1045-1054.
[15] J. Zahorjan et al., "Balanced job bound analysis of queueing networks," Comm. of the ACM, vol. 25, Feb. 1982, pp. 134-141.
