A Hierarchical Approach to Workload. M. Calzarossa 1, G. Haring 2, G. Kotsis 2, A. Merlo 1, D. Tessera 1


A Hierarchical Approach to Workload Characterization for Parallel Systems*

M. Calzarossa 1, G. Haring 2, G. Kotsis 2, A. Merlo 1, D. Tessera 1

1 Dipartimento di Informatica e Sistemistica, Università di Pavia, Via Abbiategrasso 209, I-27100 Pavia, Tel +39 382 505 369, Fax +39 382 505 373, e-mail: [mcc,merlo,tessera]@gilda.unipv.it

2 Institut für Angewandte Informatik und Informationssysteme, Universität Wien, Lenaugasse 2/8, A-1080 Vienna, Tel +43 1 408 63 66 10, Fax +43 1 408 04 50, e-mail: [haring,gabi]@ani.univie.ac.at

* This work was supported by the Austrian-Italian Cooperation 1994-96 Project Number 32, and by the 40% MURST Project

Abstract. Performance evaluation studies should be an integral part of the design and tuning of parallel applications, whose structure and behavior are the dominating factors. We propose a hierarchical approach to the systematic characterization of the workload of a parallel system, to be kept as modular and flexible as possible. The methodology is based on three different, but related, layers: the application, the algorithm, and the routine layer. For each of these layers, different characteristics representing functional, sequential, parallel, and quantitative descriptions have been identified. When architectural and mapping features are also taken into consideration, the hierarchical workload characterization can be used for any type of performance study.

1 Introduction

The main reason to use parallel systems is to get more performance, i.e., either to be able to solve larger problems or to solve given problems in a shorter time. Performance is thus the driving force behind the development of new applications for parallel systems. The performance of such systems depends on many more aspects than that of a uniprocessor system, due to the more complex behavior of their hardware and software components. Therefore, from the performance point of view, there are more challenges but also deeper pitfalls [CS93].

Performance evaluation studies should be an integral part of the design and tuning of applications, i.e., the workload, to reduce the development and performance debugging costs [CMM+94]. For nearly all these studies, but especially for performance predictions of applications under development, a monolithic approach is not appropriate. Since workload, architecture, and mapping are the features with the largest impact on the performance of parallel systems, various performance prediction frameworks have been developed in which these aspects can be specified and modified in a structured and flexible way, as independently as possible from each other. Examples are the PAPS tool set [WKH93, WH94], the PRM [Fer90, Fer92] and the Mitschele-Thiel [Her92, MT93] approaches. But even these approaches use a rather "flat" description of the workload, which increases the complexity of the representation and makes it harder to understand.

In this paper we propose a hierarchical approach to the systematic characterization of the workload of parallel systems. It manages the complexity of such a workload by following the same principles as used in software engineering. Hierarchical descriptions of workloads have already proven useful for representing workloads of interactive [Har86] and distributed [CHS88, RVH94] systems. Our hierarchical methodology is general enough to be easily adapted according to the objectives of the performance study. These objectives range from obtaining "performance figures" for an existing application (or algorithm), which is composed of smaller components with known behavior, to predicting the performance of an application under development before implementing and executing it. In all these cases, a hierarchical approach is natural and helps to manage the complexity of the workload characterization and to drive the corresponding performance studies.

The methodology we present here can be applied to both functional and data parallel programming paradigms. From the architecture point of view, we can deal with distributed and shared memory, as well as virtual shared memory systems. Our characterization approach, which represents a logical view of the workload, is system independent, in the sense that the architectural characteristics come in only when the final performance model is built. At that stage, either the communication or the memory access behavior will be modeled using explicit timings or hidden delays, according to the load and data distributions.

The paper is organized as follows.
In Section 2 we sketch the underlying methodology of the proposed hierarchical approach and define each layer in terms of its components. The characterization of each layer is discussed in Section 3. Section 4 concludes the paper with an outlook on future activities.

2 Methodology

Workload characterization studies are widely applied in the software design process and when a synthetic description of existing programs is required. Basic principles of software engineering, such as abstraction and hierarchy, are used to manage the complexity of the software engineering process or to better understand existing applications. In this section, the basic features of our hierarchical workload modeling methodology are pointed out.

This approach corresponds to the user's or developer's mental model (view) of the software system. Figure 1 (a) shows the mental view corresponding to the user's natural perception of a software system. Three possible levels of granularity are identified. Solving a specific "problem" requires the utilization of a few "methods", and these methods can be described by means of the "blocks of operations" they are built with.

[Figure 1: (a) User's mental viewpoint: Problem, Methods, Blocks of operations. (b) Solution domain viewpoint: Application, Algorithms, Routines. (c) Formal model viewpoint: $A = \{A_1, A_2, \ldots\}$, $A_i = \{r^i_1, r^i_2, \ldots\}$, $r^i_j = \{s^j_1, s^j_2, \ldots\}$.]

Fig. 1. Correspondences between user mental view, solution domain, and formal model

As can be seen from Figure 1 (b), there exists a conceptual correspondence between these software levels and potential layers of characterization of the load itself in a so-called solution domain. The solution domain preserves the logical subdivision of the mental view and translates it in terms of "application", "algorithms", and "routines". The workload description adopted in our methodology is based on these hierarchical layers. At each of them, a generic but more formal specification of the load is given as follows (see Figure 1 (c)).

Application Layer. At the very top layer the load is characterized by a coarse granularity, in that a whole application corresponds to the problem to be solved (e.g., biomolecular modeling, fluid dynamics). As solving a problem requires a set of methods to be applied, an application $A$ can be described in terms of the underlying algorithms $A_i$ ($i = 1, 2, \ldots$) in the solution domain. Formally:

$A = \{A_1, A_2, \ldots\}$

Algorithm Layer. At the intermediate layer of our methodology, the granularity is refined in order to characterize the components of an application. For numerical applications, algorithms solving systems of nonlinear equations are examples of the methods used at this layer. An algorithm $A_i$ can be described by means of the routines $r^i_j$ ($j = 1, 2, \ldots$). Precisely:

$A_i = \{r^i_1, r^i_2, \ldots\}$

Routine Layer. At the bottom layer, the specific routines used to implement an algorithm are considered. A routine $r^i_j$ is composed of code segments $s^j_k$ ($k = 1, 2, \ldots$) (e.g., subroutines, functions, loops). At the very finest granularity, a routine can even be a single statement. Formally, a routine is specified as:

$r^i_j = \{s^j_1, s^j_2, \ldots\}$

According to the application under study and/or the objectives of the analysis to be performed, the granularity identified for the components at each layer may vary. Sometimes a few methods can be considered as a single algorithm component because they are logically related. Furthermore, characterization layers can be omitted. A small problem, which might require only one solution method, can be characterized starting at the algorithm layer, hence omitting the application layer.

3 Characterization of the Layers

Having identified and defined three possible layers and the corresponding components, we now investigate the characteristics of each layer. Following the concept of an adaptive methodology, we keep the characterization modular and system independent. Such a characterization starts from a functional description of the load, where its basic components are identified. Their sequences are then analyzed, and the exploitation of potential structural parallelism is considered. The characterization ends with a quantitative description. It is possible to omit certain characteristics if they are irrelevant for the particular system under test or if they do not provide any information useful in the performance evaluation study. Table 1 summarizes the characterization of the three layers.

Layer         Functional          Sequential                   Parallel                       Quantitative
Application   set of algorithms   application behavior graph   structural parallelism graph   multiplicity of algorithms, volume of data
Algorithm     set of routines     algorithm behavior graph     structural parallelism graph   multiplicity of routines, volume of data
Routine       set of statements   routine behavior graph       structural parallelism graph   multiplicity of statements, volume of data

Table 1. Characterization at the three hierarchical layers

Functional Description. The identification of the basic components for each layer is the starting point for any further description. The statements at the routine layer, the routines at the algorithm layer, and finally the algorithms at the application layer are the basic building blocks. As already mentioned in the previous section, the granularity at each layer depends on the workload under study and on the objective of the analysis. For example, it is possible to lump two algorithms together into a single algorithm component $A_i$ at the application layer. This lumping implies that all the other characteristics can only be specified in terms of the $A_i$ component. Since the functional description introduces neither an order among the components nor their possible multiplicities, it is sufficient to identify the set of components.

Sequential Description. Once we have defined the set of components at each layer (algorithms, routines, statements), we can represent their sequences by the corresponding models. Directed graph models, called behavior graphs, are adopted. The components at each layer correspond to the nodes of such graphs. If there is a positive (i.e., non-zero) probability that component $i$ is followed by component $j$, then an arc is drawn from node $i$ to node $j$ and annotated with the associated probability. A behavior graph can also be interpreted as a state transition diagram, where the components correspond to the states, and the arcs denote the transitions annotated with their probabilities.

As an example we consider an application which consists of seven algorithms (represented by the nodes $A_1, A_2, \ldots, A_7$). As can be seen from Figure 2 (a), nodes $A_1$, $A_2$, $A_3$, $A_4$, and $A_5$ are in sequential order, while $A_5$ is followed either by algorithm $A_6$ (with probability 0.2) or by algorithm $A_7$ (with probability 0.8). From $A_6$ there is a loop back to algorithm $A_3$. Moving one layer down in our hierarchy means deriving the corresponding behavior graphs for the interactions of the routines within each algorithm. Each routine can in turn be represented in terms of the graph of its statements.
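
The behavior graph of the seven-algorithm example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the methodology itself: the names `BEHAVIOR_GRAPH`, `check_probabilities`, and `expected_visits` are hypothetical, $A_1$ is assumed to be the entry component, and $A_7$, which has no outgoing arc in the example, is assumed to be the final component. Reading the graph as a state transition diagram, the sketch derives how often each algorithm is executed on average during one run of the application.

```python
# Behavior graph of the example: each arc is annotated with its probability.
# Assumption: A7 has no outgoing arcs and is therefore the final component.
BEHAVIOR_GRAPH = {
    "A1": {"A2": 1.0},
    "A2": {"A3": 1.0},
    "A3": {"A4": 1.0},
    "A4": {"A5": 1.0},
    "A5": {"A6": 0.2, "A7": 0.8},
    "A6": {"A3": 1.0},  # loop back to A3
    "A7": {},
}


def check_probabilities(graph, tol=1e-9):
    """Outgoing probabilities of every non-final node must sum to 1."""
    for node, arcs in graph.items():
        if arcs and abs(sum(arcs.values()) - 1.0) > tol:
            raise ValueError(f"outgoing probabilities of {node} do not sum to 1")


def expected_visits(graph, start, tol=1e-12, max_iter=100_000):
    """Expected number of executions of each component in one run.

    Solves v[j] = [j == start] + sum_i v[i] * p(i, j) by fixed-point
    iteration, which converges when every path eventually reaches a
    final node.
    """
    visits = {node: 0.0 for node in graph}
    for _ in range(max_iter):
        new = {node: (1.0 if node == start else 0.0) for node in graph}
        for src, arcs in graph.items():
            for dst, prob in arcs.items():
                new[dst] += visits[src] * prob
        if max(abs(new[n] - visits[n]) for n in graph) < tol:
            return new
        visits = new
    raise RuntimeError("iteration did not converge")


check_probabilities(BEHAVIOR_GRAPH)
visits = expected_visits(BEHAVIOR_GRAPH, "A1")
for name in sorted(visits):
    print(name, round(visits[name], 4))
```

Because the loop through $A_6$ is taken with probability 0.2, the components $A_3$, $A_4$, and $A_5$ are each executed $1/(1 - 0.2) = 1.25$ times on average, whereas $A_1$, $A_2$, and $A_7$ are executed exactly once. Such visit counts are one possible way to obtain the multiplicities used in the quantitative description.
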
Parallel Description. Graph models have proven to be a convenient method for specifying the characteristics of parallel applications (e.g., synchronization or concurrency). We introduce Structural Parallelism Graphs (SPG), which allow a clear and concise characterization of the potential parallelism, i.e., the full parallelism that can be exploited within a workload component without any constraints with respect to the number of available processors. A Structural Parallelism Graph is a directed graph $SPG = (V, D, W)$, where $V$ is a set of vertices corresponding to the workload components and $D = (D_P \cup D_A)$ is a set of directed arcs (see footnote 3). $W$ is a set of weight distributions for outgoing arcs. If no distribution is specified, all the outgoing paths are taken with probability 1. If a distribution is given, the corresponding arc is taken with the associated probability.

3 An arc can represent the sending of an activation signal and/or an exchange of data.

Two types of arcs, precedence and activation arcs, are introduced in our methodology. A precedence arc, belonging to $D_P$, represents a precedence relation between two components, such that the second component is started after the first one has finished. The two components can never run in parallel, since they are strictly sequential. Graphically such an arc is depicted as a thin arrow (see Figure 2 (b)). An activation arc, belonging to $D_A$, expresses a relation between two components such that the second component is activated by the first one before its completion. Thus, the two components may run in parallel, but the second one cannot start before the first one. The graphical representation is a thick arrow (see Figure 2 (b)). A communication or synchronization activity between two components that have already been activated or started can be neglected, since it does not change the potential parallelism.

[Figure 2: (a) behavior graph of the algorithms $A_1, \ldots, A_7$; (b) the corresponding SPG. In both panels the arcs from $A_5$ to $A_6$ and $A_7$ are annotated with the probabilities 0.2 and 0.8.]

Fig. 2. Examples of a sequential (a) and a parallel (b) description

For example, the application behavior graph depicted in Figure 2 (a) has been transformed into the corresponding SPG (see Figure 2 (b)). By looking at the internal behavior of the algorithms $A_1$ and $A_2$ we find that they are completely independent of each other. To represent this in the SPG, we simply draw two nodes $A_1$ and $A_2$. The results produced by $A_1$ and $A_2$ are both needed by $A_3$, so we add the node $A_3$ and connect $A_1$ and $A_2$ to $A_3$ using precedence arcs. $A_4$ has to wait for data computed by algorithm $A_3$ at an early stage, i.e., $A_4$ does not have to wait for $A_3$ to finish. We therefore connect $A_3$ and $A_4$ with an activation arc. $A_5$ can start as soon as $A_3$ and $A_4$ have finished (precedence arcs), and depending on the output of $A_5$ either $A_6$ or $A_7$ follows (precedence arcs with associated weights). The loop back to $A_3$ does not have any influence on the potential parallelism and is therefore not represented.

Quantitative Description. All previous characteristics have specified the type of components and their possible interactions. To derive a performance model, we also need information about the internal characteristics of the components in terms of their multiplicity as well as the volume of data used and/or exchanged. Note that this description is still system independent: it does not consider the characteristics of the system. For example, we specify neither the duration of the components nor the architecture dependent communication or memory access patterns.

The proposed characterization is general enough to be customized according to the objective of the performance study. In the case of existing software, measurements provide the basic information for bridging the gap between the logical description of the workload and the system characteristics [RS94]. In such a case, a bottom-up application of the methodology is advisable. From measurements we can also infer the information to be used for performance prediction of applications under design. In this case, a top-down approach, which starts from a coarse grain representation of the workload, seems more appropriate.

4 Conclusions

A hierarchical methodology for the systematic characterization of the workload of parallel systems has been proposed.
Such a methodology is based on a layered approach, which takes into consideration various types of descriptions of the workload components. The characterization obtained by means of this methodology focuses on the load, while discarding specific information on the architectures and mapping strategies. In this sense, our characterization is system independent, modular, and flexible, and it supports independent specifications of workload, architectures, and mapping. The system dependent characteristics are to be considered at the final stage, where the relations and transformations between the various characterizations on the different layers also have to be analyzed in order to obtain the final performance model. Our future work will concentrate on the study of these interactions and on a possible integration of independent parts.

References

[CHS88] M. Calzarossa, G. Haring, and G. Serazzi. Workload Modelling for Computer Networks. In U. Kastens and F. J. Rammig, editors, Architektur und Betrieb von Rechnersystemen. 10. GI/ITG-Fachtagung, pages 324-339. Springer Verlag, Informatik Fachberichte Nr. 168, 1988.
[CMM+94] M. Calzarossa, L. Massari, A. Merlo, D. Tessera, and A. Malagoli. Performance Debugging of Parallel Programs. In Proc. AICA Conference, pages 541-556, Palermo, Italy, Sept. 1994.
[CS93] M. Calzarossa and G. Serazzi. Workload Characterization: a Survey. Proc. of the IEEE, 81(8):1136-1150, 1993.
[Fer90] A. Ferscha. The PRM-Net Model - An Integrated Performance Model for Parallel Systems. Technical report, Austrian Center for Parallel Computation, University of Vienna, 1990.
[Fer92] A. Ferscha. A Petri Net Approach for Performance Oriented Parallel Program Design. Journal of Parallel and Distributed Computing, 15(3):188-206, July 1992.
[Har86] G. Haring. Fundamental Principles of Hierarchical Workload Description. In G. Serazzi, editor, Workload Characterization of Computer Systems and Computer Networks, pages 101-110. North Holland, 1986.
[Her92] U. Herzog. Performance Evaluation as an Integral Part of System Design. In M. Becker et al., editors, Proceedings of the Transputers'92 Conference. IOS Press, 1992.
[MT93] A. Mitschele-Thiel. Automatic configuration and optimization of parallel transputer applications. In Proc. World Transputer Congress, Aachen, Germany. IOS Press, 1993.
[RS94] E. Rosti and G. Serazzi. Workload Characterization for Performance Engineering of Parallel Applications. In Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, pages 457-462. IEEE Computer Society Press, 1994.
[RVH94] S. V. Raghavan, D. Vasukiammaiyar, and G. Haring. Hierarchical approach to building generative network load models. Computer Networks and ISDN, 1994. (to appear).
[WH94] H. Wabnig and G. Haring. PAPS - The Parallel Program Performance Prediction Toolset. In Proc. of the 7th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, pages 284-304. Springer-Verlag, 1994.
[WKH93] H. Wabnig, G. Kotsis, and G. Haring. Performance Prediction of Parallel Programs. In G. Haring and G. Kotsis, editors, Proc. of the 7th GI/ITG Conference on Measurement, Modelling and Performance Evaluation of Computer Systems, pages 64-76. Springer Verlag, New York, 1993.

This article was processed using the LaTeX macro package with LLNCS style
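
As a supplementary illustration of the Parallel Description in Section 3, the Structural Parallelism Graph $SPG = (V, D, W)$ of the seven-algorithm example can be sketched in Python as follows. The class `SPG` and its methods are hypothetical names chosen for this sketch, and the overlap rule is our reading of the two arc definitions: a precedence arc from $x$ to $y$ means that $x$ has finished before $y$ starts, an activation arc only means that $x$ has started before $y$ starts, so $x$ is guaranteed to finish before $z$ starts whenever $z$ is reachable from a precedence successor of $x$. The sketch deliberately ignores that weighted alternatives such as $A_6$ and $A_7$ exclude each other.

```python
from dataclasses import dataclass, field


@dataclass
class SPG:
    """Structural Parallelism Graph SPG = (V, D, W) with D = D_P union D_A."""
    vertices: set = field(default_factory=set)    # V
    precedence: set = field(default_factory=set)  # D_P: v starts after u has finished
    activation: set = field(default_factory=set)  # D_A: u activates v before u completes
    weights: dict = field(default_factory=dict)   # W: optional probability per arc

    def add_arc(self, u, v, kind, weight=None):
        if kind not in ("precedence", "activation"):
            raise ValueError(f"unknown arc type: {kind}")
        self.vertices.update((u, v))
        target = self.precedence if kind == "precedence" else self.activation
        target.add((u, v))
        if weight is not None:
            self.weights[(u, v)] = weight

    def successors(self, u):
        return {v for (a, v) in self.precedence | self.activation if a == u}

    def reachable(self, u):
        """Vertices reachable from u (u included) over arcs of either type."""
        seen, stack = {u}, [u]
        while stack:
            for v in self.successors(stack.pop()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def finishes_before(self, x, z):
        """True if x must have finished before z can start."""
        return any(z in self.reachable(y) for (a, y) in self.precedence if a == x)

    def may_overlap(self, x, z):
        """True if the arcs do not force x and z to run one after the other."""
        return (x != z
                and not self.finishes_before(x, z)
                and not self.finishes_before(z, x))


# SPG of the example: A1 and A2 are independent, A3 activates A4 early.
spg = SPG()
for u, v in [("A1", "A3"), ("A2", "A3"), ("A3", "A5"), ("A4", "A5")]:
    spg.add_arc(u, v, "precedence")
spg.add_arc("A3", "A4", "activation")
spg.add_arc("A5", "A6", "precedence", weight=0.2)
spg.add_arc("A5", "A7", "precedence", weight=0.8)

print(spg.may_overlap("A1", "A2"))  # independent components
print(spg.may_overlap("A3", "A4"))  # connected by an activation arc
print(spg.may_overlap("A1", "A4"))  # A1 precedes A3, which activates A4
```

Keeping precedence and activation arcs in separate sets mirrors the definition $D = (D_P \cup D_A)$ and keeps the check for potential parallelism independent of any architecture or mapping, in line with the system independent character of the description.
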