Politehnica University of Bucharest European Organisation for Nuclear Research California Institute of Technology MONARC 2


MONARC 2 (Models of Networked Analysis at Regional Centers) - distributed systems simulation -

Iosif Charles Legrand, Ciprian Mihai Dobre, Corina Stratan

Contents

1. ABSTRACT
2. INTRODUCTION
3. DESIGN OVERVIEW
4. REGIONAL CENTER MODEL
5. THE SIMULATION ENGINE
    The engine package
    Simulation events
    Simulation tasks
    Managing the Tasks and the Events
        The Scheduler's Algorithm
        Priority Queues
        Matching the Events with Predicates
        The Thread Pool
THE JOB MODEL
    The Job Class
        Attributes for the Job Class
        Simulating the Execution of the Jobs
        The Implemented Types of Jobs
    Submitting Jobs: the Activities
    Job Scheduling
        Basic Job Scheduler
        Distributed Job Scheduler
    Estimating the execution time for the jobs
        The way jobs share the CPU
        Interrupts
DATA MODEL
NETWORK MODEL
TESTS AND EXAMPLES
    FTP test
        Determining the server usability
        The file size dependency
    Simple Distributed Scheduling Example
        Brief description
        The scheduling algorithm
        Simulation results
CONCLUSIONS
BIBLIOGRAPHY

1. ABSTRACT

The LHC (Large Hadron Collider) experiments at CERN have envisaged computing systems of unprecedented complexity, for which it is necessary to provide a realistic description and modeling of data access patterns. This necessity led to the development of a simulation tool named MONARC (MOdels of Networked Analysis at Regional Centers); the project is now at its second version, MONARC 2, involving a team with members from CERN, Caltech and Politehnica University of Bucharest. A process oriented approach for discrete event simulation is well suited to describe various activities running concurrently, as well as the stochastic arrival patterns specific to this type of simulation. Threaded objects or Active Objects can provide a natural way to map the specific behavior of distributed data processing into the simulation program. The simulation tool developed within the MONARC project is based on Java(TM) technology, which provides adequate tools for developing a flexible and distributed process oriented simulation. Proper graphics tools, and ways to analyze data interactively, are essential in any simulation project. This paper presents the design elements, features and current status of the MONARC simulation tool.

2. INTRODUCTION

Computer system users, administrators, and designers usually aim for the highest performance at the lowest cost. Modeling and simulating system design trade-offs is good preparation for design and engineering decisions in real world jobs. The aim of this project was to provide a realistic simulation of large distributed computing systems and to offer a flexible and dynamic environment to evaluate the performance of a range of possible data processing architectures. At the same time, the simulation framework makes it possible to evaluate different optimization procedures for data intensive applications.
The goals are to provide a realistic simulation of distributed computing systems, customized for specific physics data processing, and to offer a flexible and dynamic environment to evaluate the performance of a range of possible data processing architectures. This simulation framework is not intended to be a detailed simulator for basic components such as operating systems, database servers or routers. Instead, based on realistic mathematical models and on parameters measured on test bed systems for all the basic components, it aims to correctly describe the performance and limitations of large distributed systems with complex interactions.

For the second version of MONARC, the main goals we aimed to achieve were:

- to develop a more flexible and extensible application
- to create a general simulation framework, one that can also be used in domains other than physics experiments

- to improve the performance of the simulation engine and its ability to work with large numbers of threads
- to use a general data model and to simulate data replication
- to simulate distributed scheduling algorithms
- to model local area and wide area networks realistically
- to improve the graphical user interface
- to evaluate the simulation engine's performance on different platforms (including multiprocessors)

3. DESIGN OVERVIEW

An Object Oriented design, which allows an easy and direct mapping of the logical components into the simulation program and provides the interaction mechanism, offers the best solution for such a large scale system and also copes with systems which may scale and change dynamically. A process oriented approach for discrete event simulation is well suited to describe concurrently running programs and network traffic, as well as all the stochastic arrival patterns specific to this type of simulation. Threaded objects or "Active Objects" (having an execution thread, program counter, stack...) allow a natural way to map the specific behavior of distributed data processing into the simulation program.

A discrete event, process oriented simulation approach, developed in Java(TM), was used for this modeling project. Java technology provides adequate tools for developing a flexible and distributed process oriented simulation. Java has built-in multithread support for concurrent processing, which can be used for simulation purposes by providing a dedicated scheduling mechanism. Another advantage of using Java for the project is that, because 100% pure Java programs are compiled into machine-independent bytecodes, they run consistently on any Java platform.

Large system simulations require significant processing power. A multithreaded simulation engine makes it easy to deploy and run complex cases on multi-processor systems.
An important part of the project was to evaluate and optimize the performance of the simulation engine on multi-processor systems.

The simulation framework is used for modeling complex data processing programs, running on large scale distributed systems and exchanging very large amounts of data. It must describe concurrent network traffic correctly while remaining efficient from the simulation point of view. The simulation framework provides the mechanism to easily evaluate different strategies for data replication or job scheduling, and supports the development of algorithms for the grid middleware components.

A complex Graphical User Interface (GUI) to the simulation engine allows the user to change parameters dynamically, to load user-defined time response functions for different components and to monitor and analyze simulation results. It provides a powerful development tool for evaluating and designing large scale distributed processing systems.

The packages that map the components of our application are represented in the following diagram:

Fig. The structure of our application.

The engine is the core of the simulator, which manages the tasks and the events; the center package contains some of the base components of the regional centers (the CPUs, the farm, the job scheduler etc.). There are also packages that model networks and databases (monarc.network, monarc.datamodel) and a package which implements several statistical distributions (monarc.distribution) that we use to describe the behaviour of the system and of the users. The monarc.output package generates the graphical results of the simulation.

4. REGIONAL CENTER MODEL

The model we chose is based on regional centers: the system is composed of several interconnected regional centers. Each regional center has a farm of workstations (called CPU units), database servers and mass storage units, and one or more local area networks; there is also a scheduler for the submitted jobs and a waiting queue for the jobs that cannot be processed at the moment. Any regional center can dynamically instantiate a set of Users or Activity Objects, which are used to generate data processing jobs based on different scenarios. Inside a regional center, different job scheduling policies may be used to distribute the jobs to the processing nodes.


With this structure it is now possible to build a wide range of computing models, from the very centralized (with reconstruction and most analyses at CERN) to distributed systems with an almost arbitrary level of complication (CERN and multiple regional centers, each with a different hardware configuration and possibly different sets of replicated data). The components of a regional center are represented in the picture below:

Fig. Regional center model.

In the following table we explain the meanings of some of the terms we use in the simulation:

| COMPONENT | SHORT DESCRIPTION |
|---|---|
| Data Container | Contains Objects of a single type in a defined range. |
| Data Base | Contains sets of Containers. It is associated with an AMS Server. |
| Data Base Catalogue | Provides the mechanism to locate any Object and the AMS server which can retrieve this object. |
| AMS Server | Performs data access (W/R) in an OODB Model. Handles data storage on local disks or tape. |
| Disk Array | Associated with an AMS server as a storage unit for containers. |
| Mass Storage Unit | High capacity unit, but slow, connected in LAN. It interacts with AMS Servers. |
| CPU Node | Typical processing Node, having a defined processing power, memory and an I/O channel. Allows concurrent execution of multiple jobs. |
| I/O Link (Link Port) | Describes the quality of the I/O connection of each component on the LAN. Allows multiple simultaneous Transfers. It is the basic unit from which any simulated transfer can start. |
| LAN | Specifies how individual components are connected in the LAN. |
| WAN | Specifies the connectivity for the wide area network. An internet-like naming scheme is used. |
| Router | Specifies a router used to connect two or more WANs together in the simulation. |
| Protocol | Describes the way in which the transfer is simulated. |
| Farm | A set of CPU Nodes, AMS servers, Mass Storage Units and a Job Scheduler. |
| Job | Specifies typical tasks used in the simulation. |
| Active Job | Used by the simulation system to perform a user-defined job. It is dynamically allocated to a CPU Node when load constraints are satisfied. |
| Activity | A generic Object, used as a loadable module which defines a set of jobs and how they are submitted for execution. |

Fig. Simulation concepts.

5. THE SIMULATION ENGINE

5.1. The engine package

The programs running on distributed data processing systems are complex and data dependent, and they compete concurrently for shared resources. A natural way to simulate their behaviour, which was adopted in MONARC, is to use threaded objects ("active objects"), which have an execution thread, program counter, stack and mutual exclusion mechanism. With this approach we took advantage of Java's support for concurrent processing and used the advanced features that the language provides in this domain.

We created a base class for the active objects, called Task, which must be inherited by all the entities in the simulation that require a time dependent behaviour. Such entities are the running jobs, the database servers or the networks. The module that manages the active objects and provides a communication mechanism between them is called the simulation engine and is implemented in the engine package.

To allow communication and synchronization between active objects (instead of "active object" we sometimes use the term "task", because this is the name of the base class), we use simulation events. An event is initiated by a task and can be addressed to another task or even to the same task. For example, under certain conditions, a task can create an event which will terminate another task; or it can create a self-addressed event which will be produced when the job managed by this task finishes.

As the number of jobs that must be simulated may be huge, we implemented a dedicated structure that allows active objects to be recycled in order to improve the simulation efficiency (a pool of active objects).
Shared resources, like CPUs or I/O links, are represented as normal objects, but access to their update methods must be synchronized, since there are many running entities (tasks) that can try to execute these methods at the same time.

5.2. Simulation events

As mentioned before, events can be used for communication and synchronization between tasks: when a task must notify another task about something that happened or will happen in the future (e.g., a change of state in the simulation), it creates an event addressed to that task, with the appropriate time stamp. Each event has a time stamp (the moment when it is produced), a source task and a destination task (which are represented by their IDs). Events can be self-addressed, which means that a task can send an event to itself. In the Event class there is also an Object data field, called data, which can store auxiliary data associated with the event (e.g., when a task receives the event of starting a new job, the job itself is stored in the data object).

In order to define different types of events, the etype data field was introduced in the Event class; it is an integer that specifies the event type. We defined three constant values that can be assigned to etype, corresponding to three kinds of events:

- the null event (ENULL), which has most of the parameters unspecified; it can be used as a dummy event;

- the SEND event, used for communication between tasks;
- the HOLD_DONE event, which can be used to wake up a task that was sleeping. When a task must be inactive for a certain period of time, before going to sleep it creates a HOLD_DONE event addressed to itself, which will be produced at the moment when the task can wake up (note that by period of time we mean a period of simulation time, not of real time; that is why we don't use the Java sleep() function).

The SEND events can also have various meanings, so there is another data field to make the distinction between them. This data field, named tag, is also an integer and represents the operation that must be done when the event is received, or the change of state that caused the event. We also defined several constants that can be used as event tags, and all of them can be found in the listing below, along with all the data fields of the Event class:

    public class Event implements Cloneable {

        /** The event type */
        private int etype;

        /** The time when the event is produced */
        private double time;

        /** The id of the task that generated this event */
        private int ent_src;

        /** The id of the task that has to receive this event */
        private int ent_dest;

        /** Used to specify the operation that has to be done when this event is produced */
        private int tag;

        /** The event auxiliary data */
        private Object data;

        /** A value that is used for the disposal of this event */
        private boolean dispose;

        /////////////////////////////////////
        // The event types currently defined
        /////////////////////////////////////

        /** The null event */
        public static final int ENULL = 0;

        /** The event type used for communication between tasks */
        public static final int SEND = 1;

        /** The event type used to "wake up" a task that was "sleeping" */
        public static final int HOLD_DONE = 2;

        //////////////////////////////////////////
        // The defined types of tags for an event
        //////////////////////////////////////////

        /** When a task receives an event with this tag, it must finish its activity. */
        public static final int TAG_STOP_TASK = 0;

        /** An event with this tag is sent to an AJob object in order to allow it
            to start executing the job assigned to it. */
        public static final int TAG_START_JOB = 1;

        /** An event with this tag is sent to a farm when there are no more jobs for it to execute */
        public static final int TAG_NO_JOBS = 2;

        /** This kind of event is produced when the state of a CPU unit changes
            (a job starts or finishes the execution on that CPU) */
        public static final int TAG_CPU_CHANGED = 3;

        /** The event that informs upon the sending of a message */
        public static final int TAG_SEND_MESS = 4;

        /** The event that informs upon the receiving of a confirmation message */
        public static final int TAG_ARRIVE_MESS = 5;

        /** The event that asks for the retransmission of one message */
        public static final int TAG_RETR_MESS = 6;

        /** The event that asks to inform a router that a message arrived at it */
        public static final int TAG_INFORM_ROUTER = 7;

        /** The event used to inform a task that its message was delivered to the destination */
        public static final int TAG_CONFIRM_SEND = 8;

        /** The event that informs upon the creating of a new database */
        public static final int TAG_CREATE_DB = 9;

        /** The tag that is used to set a write confirmation event */
        public static final int TAG_CONFIRM_WRITE = 10;

        /** The event produced when a Read operation on a database server is performed */
        public static final int TAG_READ_DATA = 11;

        /** The event produced when a Write operation on a database server is performed */
        public static final int TAG_WRITE_DATA = 12;

        // The getters and setters for the fields above are omitted from this listing.
    }

Most of the methods of the Event class are for getting and setting the event parameters (the source, the destination, the time etc.).

5.3. Simulation tasks

Some of the entities of the simulation have, as mentioned before, a time dependent behaviour and need to be implemented as threaded objects ("active objects"). For this purpose, we created a base class, named Task, which implements the Runnable interface and must be extended by the threaded objects of the simulation. The diagram below represents all the classes that inherit from Task (the italic font is for abstract classes):

Fig. Simulation tasks.

By simulation tasks we mean instances of the classes from the diagram above; they can be tasks that handle jobs on execution (engine.ajob) or they can represent components of the simulated system (monarc.center.farm, network.lan etc.).

What they have in common is the fact that their behaviour is influenced by events. Regardless of their type, all the tasks of the simulation are managed by a scheduler (an engine.scheduler object), which starts and stops them according to an algorithm that we will discuss in the next section. When a new task is created, it is given a unique ID, so that it can be identified during the simulation.

In order to execute the tasks, we use a pool of worker threads, to eliminate the overhead caused by creating and destroying threads. When a new task enters the system, the scheduler takes from the pool a worker thread which will execute the run() method of the task. This method performs a few operations needed for initialization and synchronization and then calls the RUN() method; this last method is the one which actually does the task's work and must be overridden by the classes inheriting from Task.

At any moment, a task can be in one of five possible states: created, ready, running, waiting and finished. In the Task class we defined a data field, state, which stores the task's state, and five constants, which are the values that this data field can have (each one corresponds to a state). A new task is in the created state until the scheduler finds in the pool a worker thread that can execute it; then the task moves to the ready state. The scheduler will let all the ready tasks run (and set their state to running) after it finishes processing the events of the current simulation step. When a task must stop its execution (for example, if it has to wait for an event), it moves to the waiting state. The transitions between the last three states are done with the aid of a semaphore that each task maintains: when the task can start running, a V() operation is done on the semaphore, and when the task must block, a P() operation. The possible states of the tasks and the transitions between them are represented in the diagram below:

[Diagram: Created -> Ready (assigned to worker thread); Ready -> Running (semaphore.V()); Running -> Waiting (semaphore.P()); Waiting -> Ready (event happens or sleeping period is over); Running -> Finished.]

Fig. The possible states for a task.

Briefly, the most important operations that can be done with a task, using the methods of the class, are:

- blocking the task (for a given period of time or until an event happens)
- creating an event which has the task as the source and another task (or the same task) as the destination
- unblocking the task (putting it in the ready state)
- selecting an event (a future one or a past one that was postponed) which is addressed to the task and satisfies certain conditions
- getting and setting the parameters of the task (the name, the ID etc.)

5.4. Managing the Tasks and the Events

The Scheduler's Algorithm

The simulation tasks and events are coordinated by a Scheduler object, which can be thought of as the core of the whole system. To keep track of what happens to the other simulation objects, the scheduler maintains several data structures:

- a vector with all the tasks that are currently alive; in order to have faster access to the tasks, the scheduler also keeps two hashtables with the tasks (having the names of the tasks and, respectively, the IDs of the tasks as keys)
- a pool of worker threads that will execute the tasks, implemented as a Pool object

- a priority queue with the future events (events that haven't been processed yet)
- another priority queue with deferred events (these events happened in the past, but their destination tasks were not expecting them, i.e. the tasks were not in the waiting state at that moment; such events are moved to the deferred queue, and the destination tasks will eventually look for them there).

At every simulation step, the scheduler executes the following operations:

1. Look at each simulation task and:
   a. If the task is in the created state, assign it to a worker thread from the pool and change the task's state to ready
   b. If the task is in the ready state, restart its execution by making a V() on the semaphore
   c. If the task is in the finished state, remove it
2. Wait until all the tasks that were running block again or finish their execution
3. Process the events:
   a. Take from the future queue the event(s) with the minimum time stamp. The simulation time advances, becoming equal to that time stamp.
   b. For each event taken from the queue, look for the destination task. If it is waiting for an event (i.e., it is in the waiting state), deliver the event to the task. Otherwise, put the event into the deferred queue.

These steps are executed until there are no more live tasks and no more events in the queues.

Priority Queues

Since the scheduler takes, at each step, the events with the minimum time stamp, the data structure chosen to store the events was a priority queue. To allow the users to try different priority queue variants, we wrote an interface, EventQueue, that must be implemented by the priority queue class. Any class that implements this interface can be used as the data structure which stores the events; the only thing that needs to be done is to set the name of the class in the configuration file. We tried two types of priority queue implementations: a sorted vector and a more sophisticated variant, called calendar queue, which is somewhat similar to a hashtable.
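Step 3 of the algorithm, combined with a sorted-vector queue, can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not give the method names of EventQueue, so `EventQueueSketch`, `SimEvent`, `SortedVectorQueue` and `SchedulerStepSketch` are hypothetical names, and the set of waiting task IDs stands in for the real task states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical, reduced stand-in for the Event class (only the fields used here). */
class SimEvent {
    final double time;   // simulation time stamp
    final int dest;      // ID of the destination task
    SimEvent(double time, int dest) { this.time = time; this.dest = dest; }
}

/** Assumed shape of the queue interface; the real EventQueue methods are not listed in the text. */
interface EventQueueSketch {
    void add(SimEvent e);
    boolean isEmpty();
    /** Removes and returns every event that carries the minimum time stamp. */
    List<SimEvent> pollAllMin();
}

/** Sorted-vector variant: the list is kept ordered by time stamp on insertion. */
class SortedVectorQueue implements EventQueueSketch {
    private final List<SimEvent> events = new ArrayList<>();

    public void add(SimEvent e) {
        int i = events.size();
        // Walk back past later events; equal time stamps keep their insertion order.
        while (i > 0 && events.get(i - 1).time > e.time) i--;
        events.add(i, e);
    }

    public boolean isEmpty() { return events.isEmpty(); }

    public List<SimEvent> pollAllMin() {
        List<SimEvent> out = new ArrayList<>();
        if (events.isEmpty()) return out;
        double t = events.get(0).time;
        while (!events.isEmpty() && events.get(0).time == t) out.add(events.remove(0));
        return out;
    }
}

/** Event-processing step: advance the clock, then deliver or defer each event. */
class SchedulerStepSketch {
    double now = 0.0;
    final EventQueueSketch future = new SortedVectorQueue();
    final List<SimEvent> deferred = new ArrayList<>();
    final List<SimEvent> delivered = new ArrayList<>();

    /** waitingTaskIds: IDs of the tasks that are currently in the waiting state. */
    void processEvents(Set<Integer> waitingTaskIds) {
        List<SimEvent> batch = future.pollAllMin();
        if (batch.isEmpty()) return;
        now = batch.get(0).time;               // simulation time jumps to the minimum time stamp
        for (SimEvent e : batch) {
            if (waitingTaskIds.contains(e.dest)) delivered.add(e);
            else deferred.add(e);              // destination not waiting: keep for later lookup
        }
    }
}
```

Insertion into the sorted vector is linear in the number of queued events, while taking the minimum is cheap; a calendar queue trades this for bucketed access.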
So far, neither of the two calendar queue versions that we tested (SNOOPy and FELT) led to better performance, so for the moment we still use the sorted vector in our examples. In future MONARC releases, we plan to improve this part and add a better priority queue implementation.

Matching the Events with Predicates

During the simulation, there are situations when we need to select the events that satisfy a certain condition (usually, the events that have a certain tag). For this purpose, we created the Predicate interface, which has only one method, called match():

    public abstract boolean match(Event event);

The condition (predicate) that an event must satisfy is implemented as an instance of a class derived from Predicate; the event satisfies the condition if the match() method, called with that event as argument, returns true. We created two classes derived from Predicate: AnyPredicate, for matching every event (that is, the match() method always returns true), and TypePredicate, for matching events based on their tag. A TypePredicate object can hold up to three tags. An event matches the predicate if its tag is equal to one of the predicate's tags.

The Thread Pool

Every job submitted to the system needs a Java thread in order to be simulated. When the job finishes, instead of destroying the thread it is more efficient to keep it alive for the next jobs that will appear (because creating and destroying threads is quite expensive). So we have a set of worker threads which repeatedly execute the following steps, until the application stops executing:

1. Wait to receive a new job.
2. Execute the job.
3. Go to step 1.

This is the idea of a thread pool, which can be encountered in different types of applications (for example, in web servers, which communicate with the clients using multiple threads). The difference between our case and the usual thread pool is that we cannot limit the number of worker threads: if we did, some jobs might be blocked until a thread became available for them, and the simulation results would be wrong.
So, if a new job is submitted and all the worker threads are busy, we create a new worker thread for the job, so that it doesn't have to wait. One aspect that is subject to further improvement is that, when there are too many free worker threads, the pool's size should be lowered, so that we don't consume memory and other system resources. For the moment, our pool's size doesn't decrease even if the worker threads are not executing anything.

The classes that implement the thread pool are WorkerThread, Pool and an auxiliary class used for synchronization (Done).
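The behaviour described above (workers that loop waiting for assignments, and a pool that grows instead of making a job wait) can be sketched as follows. This is a simplified illustration, not the MONARC Pool class: `GrowingPool` and `PoolDemo` are hypothetical names, and the rule used here for starting a new worker (fewer idle workers than pending assignments) is an assumption chosen for this sketch. As in the text, the pool never shrinks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Sketch of a pool that never makes an assignment wait for a free worker. */
class GrowingPool {
    private final Deque<Runnable> assignments = new ArrayDeque<>();
    private int waitingWorkers = 0;   // workers currently blocked in getAssignment()
    private int totalWorkers = 0;

    /** Hands a task to the pool; starts a new worker if no idle one can take it. */
    synchronized void assign(Runnable task) {
        assignments.add(task);
        if (waitingWorkers < assignments.size()) {
            totalWorkers++;
            Thread t = new Thread(this::workerLoop);
            t.setDaemon(true);        // idle workers must not keep the JVM alive
            t.start();
        } else {
            notify();                 // wake one idle worker
        }
    }

    /** Called by workers; blocks until an assignment is available. */
    private synchronized Runnable getAssignment() throws InterruptedException {
        waitingWorkers++;
        try {
            while (assignments.isEmpty()) wait();
        } finally {
            waitingWorkers--;
        }
        return assignments.poll();
    }

    /** The worker loop: wait for a job, execute it, repeat. */
    private void workerLoop() {
        try {
            while (true) getAssignment().run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    synchronized int size() { return totalWorkers; }
}

class PoolDemo {
    /** Submits n tasks that all block until every one of them has started. */
    static boolean allStartTogether(int n) {
        GrowingPool pool = new GrowingPool();
        CountDownLatch started = new CountDownLatch(n);
        for (int i = 0; i < n; i++) {
            pool.assign(() -> {
                started.countDown();
                try {
                    started.await();  // would never return if the pool size were capped below n
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            return started.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

`PoolDemo.allStartTogether(n)` illustrates why a bounded pool is unsuitable here: with fewer than n workers, the n mutually dependent tasks could never all start.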

The main class is Pool, which holds a set of free worker threads and a set of assignments; the assignments are the tasks to be executed, and they implement the Runnable interface (so that the worker threads can execute their run() method). In our simulations, the assignments are usually Task objects, and the scheduler calls the assign() method of the pool to get a worker thread for the task. When they are ready to execute a new task, the worker threads call the getassignment() method of the pool. We discuss these two methods below.

In the assign() method, after adding the new task to the collection of assignments, the pool wakes up one of the workers (which were blocked in the getassignment() method, waiting to receive a task). Then it has to handle the case when there are not enough workers for the tasks that must be executed. This is done by checking the condition waitingworkers <= assignments.size(), which compares the number of waiting workers with the number of unassigned tasks. Since the methods involved (assign() and getassignment()) are synchronized, the state is consistent when this condition is checked: either a worker woke up and took the task (and both waitingworkers and assignments.size() decreased by 1), or no worker took the task (and waitingworkers and assignments.size() have not changed).

The worker thread which calls the getassignment() method blocks as long as there are no tasks to be executed, and it can be unblocked when the pool receives a new assignment. This is similar to the producer-consumer problem, with one producer (the pool, which puts the assignments in a "buffer") and many consumers (the worker threads, which take the assignments from the "buffer").

6. THE JOB MODEL

6.1. The Job Class

In order to describe the jobs that are executed by the system, we created the Job class, which is only a base class and defines a job that does nothing. To create jobs that actually perform some operations, the base class must be extended.
We wrote a few classes which extend Job and represent the most common types of actions that the system performs, but users can write their own classes if necessary.

Attributes for the Job Class

Some of the most important characteristics of a job, which are represented by data fields in the class, are:

- schedulepriority: the priority that the job has when it is scheduled in the regional center (it is a positive number, and greater numbers represent higher priorities)
- runningpriority: the priority that the job has when it runs on a CPU (the CPU power allocated to it depends on this priority)

- cpupower: the CPU power needed by the job, measured in SI95 units (this attribute applies only to certain types of jobs)
- memory: the memory needed by the job, measured in megabytes (this attribute applies only to certain types of jobs)
- processingtime: the time needed to process the job, if it is given the requested CPU power; if not, the time is recalculated, as we will show in section 6.4 (this attribute applies only to certain types of jobs)
- allocatedcpu: the CPU power that is actually allocated to the job
- ti, tf: the times when the job is submitted and finished (needed for statistics)
- nrdataunits: the number of data units that are processed by the job
- dataunitname: the type of the data units processed by the job
- CpuID: some jobs must be run on a specific CPU (e.g., a network transfer job that has a specific source and destination); this field is the ID of that CPU, or -1 if there is no such restriction
- nextjobs: the user can constrain some jobs to be processed after a certain job; this is done by putting them in the nextjobs vector (a vector of jobs that will be processed after this one finishes)

Simulating the Execution of the Jobs

A job can be handled by a CPU or by the farm of the regional center (if it is a global job, such as the creation of a database). A job of the first type is scheduled to run on a CPU (in one of the following sections we will describe the scheduling algorithms) and is assigned to an active job (an AJob object). Since the AJob class extends the Task class, the active job is a threaded object managed by the engine's scheduler; it is used to simulate the job's execution in time. The job scheduler of the regional center has a pool of free active jobs and, when it schedules a job, it takes an active job from the pool and assigns the job to it (this is somewhat similar to the thread pool of the engine).
The chosen active job is notified of the new job through an event with the tag TAG_START_JOB, which has the job itself as auxiliary event data. Then it executes the run() method of the job and, when it finishes, it starts waiting for an event again. The run() method of the job is the one that should be overridden by the classes extending Job, because it defines the operations that make up the job (for example, processing data, transferring a file through the network or writing into a database).

For some types of jobs (data processing jobs or maybe some other user-defined jobs) that are more CPU-intensive than I/O-intensive, we estimate the time needed for execution using the attributes cpupower, memory and processingtime. These attributes represent the CPU power, memory and processing time needed by the job and, for the data processing jobs, they depend on the type of data that the job will work with (in the configuration file, the user can set these parameters for each data type used in the simulation). To simulate the execution of these CPU-intensive jobs, we use the processoncpu() method, which dynamically re-estimates the time needed for execution each time the state of the CPU changes (i.e., a new job starts running on it or a job finishes running). Basically, for a data processing job, the run() method contains only a call to processoncpu(), because there is nothing else to be done than to simulate consuming memory and CPU power.

The Implemented Types of Jobs

We briefly describe below the types of jobs we implemented by extending Job.

The first type of job is the data processing job, implemented in the JobProcessData class. Its main purpose is to simulate the processing of some data, based on input parameters such as the needed CPU power and the processing time. A processing job might be given as input parameters the CPU power, memory and processing time needed, or it might be given the name of some data unit previously declared for the simulation (see the input file configuration), from which the input parameters will be taken.

Another basic job is the job that handles the sending or receiving of network messages. Its purpose is to simulate the network behavior for the running simulation. This job is also derived from the generic job description, so it also runs on a specific CPU unit or on the first available one. The difference is that this job needs neither memory nor CPU power to be executed, so jobs of this type will never be delayed by the job scheduler. The network transfer job handles the communication between the CPU unit and other entities of the simulation (these could be other CPU units, a database server, etc.). It must use the link port associated with the given CPU unit.
If the other side of the transfer is not a CPU unit, then it is that entity's role to handle the transmission. For example, the network transfer jobs that come from a database server are not scheduled by the job scheduler; instead, it is the job of the server to decide when to create such a job and to start running it. Those jobs are not run on active jobs; that role is taken by the entity itself.

As described in the network functionality section, every network transfer has a transport protocol allocated. Depending on this protocol, the network transfer job is either TCP based or UDP based, and the job constructor receives different types of parameters (maximum TCP window size, maximum UDP segment size, etc.). These jobs are divided into two categories: the job that handles the receiving of a message from the network and the job that handles the sending of a message into the network. While the first one is always a blocking job (it must run until the message gets delivered), the second receives as an input parameter whether it should block or not. A blocking send job runs until the message is delivered to the other side, while a non-blocking one just initiates the transmission of the message and then ends. How the message gets delivered is the transfer protocol's job to handle. These jobs just inform the link port of the CPU unit that they exist; all they have to do is take care of the message delivered to them by the link port.

Another job description is the type of job that handles the communication with a database. This is a composite type of job, since it is built upon the network transfer job in order to incorporate the methods for handling the message transfer between the CPU unit and the database entity. On top of that, it handles the basic language of dialogue with the database entity (see below the description of the database simulation), covering both the sending of database commands and the receiving of the response from the database entity.

There is also a special type of job that does not run on any CPU unit, but instead runs globally and handles the creation of a database on a database entity. This job is useful for simulating systems in which the databases on the database entities must exist before the simulation begins.

Submitting Jobs: the Activities

We will now describe the method used to create jobs and submit them to the regional centers. For this purpose, we have the Activity base class, which contains the mechanism for sending the jobs to a regional center and must be extended by the user (who has to create the actual jobs). Each regional center can have one or more activities which submit jobs to it. In the configuration file, the user must specify the class names of all the activities to be associated with the regional center. Then, when the simulation starts, these classes are dynamically loaded and instances of them are created. Each activity has a vector which contains objects that can be jobs to be submitted and/or Double objects which represent the time intervals between the jobs (because we might not want to submit the jobs one immediately after the other, but to have pauses between them).
The Activity class also extends Task, and its RUN() method takes the elements from the vector one by one:

- if the element is a job, it creates an event with the tag TAG_START_JOB and with the job as auxiliary data and sends it to the farm;
- if the element is a delay, it sleeps for the specified amount of time (using the simhold() method).

The RUN() method is shown in the following listing:

    public void RUN() {
        sem.p(); // wait until all the jobs are in the system
        if (farm == null) return;
        if (jobs == null) {
            schedule( farm.getid(), 0.0, Event.TAG_NO_JOBS );
            return; // an activity without any jobs
        }
        int size = jobs.size();
        for (int i = 0; i < size; i++) {
            Job job = null;
            Double time = null;
            try {
                // the head of the vector is a job: submit it to the farm
                job = (Job) jobs.get(0);
                schedule( farm.getid(), 0.0, Event.TAG_START_JOB, job );
            } catch (Exception e) {
                // the head of the vector is a delay: sleep for that long
                time = (Double) jobs.get(0);
                simhold( time.doubleValue() );
            }
            jobs.remove(0);
        }
        schedule( farm.getid(), 0.0, Event.TAG_NO_JOBS );
    }

When the Activity is done submitting the jobs, it sends another event to the farm, with the tag TAG_NO_JOBS, to announce that it has finished. What the user has to write is the pushjobs method, which creates the necessary jobs and adds them to the jobs vector. A job can be added with the method:

    public void addjob( Job job );

A time delay is created using the method:

    public void addtime( double time );

The calls to these methods must be made in the same order in which the user wants the jobs and time intervals to be simulated. The user guide contains some simple examples of Activity classes.

Job Scheduling

Each regional center has a job scheduler, which takes decisions about the new jobs: whether they can be executed now or should be put in a waiting queue, on which CPU to execute them, or even to which other regional center they should be moved. A job scheduler class implements the JobSchedulerI interface:

    public interface JobSchedulerI {
        /** The method called when a new job is submitted to the regional center */
        public void addjob( Job newjob );
        /** The method called when an ajob has finished processing a job */
        public void finishajob( AJob ajob );
    }

We wrote a basic job scheduler class, named JobScheduler, which can be extended in order to implement other scheduling algorithms. At least for the moment, a user who decides to write his or her own class for job scheduling has to extend the JobScheduler class (implementing the interface is not enough). The JobScheduler class allows the jobs to be executed only in the regional center where they have been submitted. In the future, we intend to allow job migration between centers, and so far we have written a class that implements a very simple distributed scheduling algorithm. We discuss below the basic job scheduler and the distributed one.

Basic Job Scheduler

The basic job scheduler, implemented in the JobScheduler class, can send the new jobs to be executed on CPU units from its regional center or put them in a waiting queue if no CPUs are available. Some of the important data fields of this class are:

- activejobs: a vector with the free active jobs (which is similar to a pool, as we explained above);
- jobqueue: contains the jobs that can't be executed yet because no CPU is available for them;
- activejobsrunning: the number of active jobs that are currently running for the regional center.

Besides these, there are many other data fields that we use internally for constructing the statistics (for example, the total number of jobs processed, the average waiting queue size, the time when the last job was submitted and so on).

When a new job is submitted, the addjob() method of the scheduler is called. Here, the scheduler first tries to find an available CPU unit to execute the job. The job might need to be executed on a specific CPU unit, as we mentioned before, and in this case the scheduler doesn't search any further: it knows exactly where to send it. The decision about the CPU unit is based on its load. By load we understand the total amount of memory used by the jobs that are already running on the CPU. A job can be executed on a CPU unit if the memory needed by the job, added to the current load of the CPU, doesn't exceed the amount of memory that the CPU has.
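The memory-based admission rule, combined with the choice of the least-loaded CPU unit, can be sketched as follows. This is a minimal illustration and not the MONARC code: the class names CpuUnitSketch and CpuSelectionSketch, their fields and the pickCpu method are assumptions made only for this example.

```java
import java.util.List;

/** Hypothetical stand-in for a CPU unit; names and fields are illustrative only. */
final class CpuUnitSketch {
    final String name;
    final double totalMemory; // memory available on the CPU unit
    final double load;        // memory used by the jobs already running on it

    CpuUnitSketch(String name, double totalMemory, double load) {
        this.name = name;
        this.totalMemory = totalMemory;
        this.load = load;
    }

    /** A job fits if its memory plus the current load does not exceed the total memory. */
    boolean canRun(double jobMemory) {
        return load + jobMemory <= totalMemory;
    }
}

final class CpuSelectionSketch {
    /** Returns the least-loaded CPU that can hold the job, or null if none can. */
    static CpuUnitSketch pickCpu(List<CpuUnitSketch> cpus, double jobMemory) {
        CpuUnitSketch best = null;
        for (CpuUnitSketch cpu : cpus) {
            if (cpu.canRun(jobMemory) && (best == null || cpu.load < best.load)) {
                best = cpu;
            }
        }
        return best; // null means the job would go to the waiting queue
    }
}
```

A null result corresponds to the case in which the job cannot start yet and has to wait for a CPU unit to free enough memory.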
The scheduler looks for the CPU unit with the minimum load that has enough memory to execute the new job (the search is done in the getcpuunit() method). If it doesn't find such a CPU, or if the job must be executed on a specific CPU and that CPU is too busy, the job is added to a waiting queue. The waiting queue is ordered by the job priorities (the schedulepriority data field from the Job class), so the first job extracted from the queue is the one with the highest priority. If a CPU was found, the scheduler also looks for an active job (AJob object) to which to assign the job (the free active jobs are in the activejobs vector). It sends an event with the tag TAG_START_JOB to the active job, and then the execution starts.

Distributed Job Scheduler

There are two possible solutions for scheduling the jobs on regional centers other than the one they were submitted to: we can use a centralized algorithm (the jobs are sent to a global scheduler, which manages the whole system and decides where they will run) or a distributed one (each local scheduler decides where it is better to send the job). We chose the second approach and extended the JobScheduler class with a new one, DistribScheduler, which implements distributed scheduling. Our distributed job scheduler has a very simple algorithm, and the test results showed that it is inefficient. It works as follows:

- if the load percentage of every CPU in the local regional center exceeds a certain value (given by the THRESHOLD_LOAD constant, which currently is 70%), the scheduler tries to send the job to another regional center;
- the center with the minimum average load is chosen to execute the job (for the moment, we don't include the network traffic in these calculations);
- if the center with the minimum load is a remote one, the job is sent there; otherwise, it is executed in the local center.

When a job is sent to another regional center, the importjob() method of the job scheduler from that regional center is called. If the scheduler receives a job via this method, it always executes it (it won't try to send it to another regional center, because the job could then move from one center to another forever).

Even though a job can't move in a circle between regional centers, there are still cases when many useless job transfers are made. For example, suppose we have three regional centers. Suppose that regional center 1 receives a great number of jobs and decides to send them to regional center 2. Regional center 2 then also becomes heavily loaded and starts sending jobs to regional center 3. Then the load of regional center 3 increases too, and it sends jobs to regional center 1.
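The decision rule of the distributed scheduler can be sketched as below. This is only an illustration under stated assumptions, not the DistribScheduler source: the class DistribDecisionSketch, the chooseCenter method and its parameters are hypothetical, loads are expressed as fractions between 0 and 1, and the map of average loads is assumed to contain an entry for the local center.

```java
import java.util.Map;

final class DistribDecisionSketch {
    /** The 70% threshold quoted in the text, expressed as a fraction. */
    static final double THRESHOLD_LOAD = 0.70;

    /**
     * Decides in which center a job should run.
     *
     * @param local           name of the local regional center
     * @param localCpuLoads   load fraction of each CPU in the local center
     * @param avgLoadByCenter average load fraction of every center, local one included
     * @param imported        true if the job arrived through importjob()
     */
    static String chooseCenter(String local, double[] localCpuLoads,
                               Map<String, Double> avgLoadByCenter, boolean imported) {
        if (imported) {
            return local; // imported jobs are always executed, never forwarded again
        }
        for (double load : localCpuLoads) {
            if (load <= THRESHOLD_LOAD) {
                return local; // at least one local CPU is below the threshold
            }
        }
        // Every local CPU is above the threshold: pick the center with minimum average load.
        String best = local;
        double bestLoad = avgLoadByCenter.get(local);
        for (Map.Entry<String, Double> e : avgLoadByCenter.entrySet()) {
            if (e.getValue() < bestLoad) {
                best = e.getKey();
                bestLoad = e.getValue();
            }
        }
        return best;
    }
}
```

Because the rule looks only at the current average loads and ignores transfer costs, several centers that are all close to the threshold can keep exporting jobs to one another, which is the behavior described in the three-center example.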
In this situation, we can observe the lack of efficiency of the scheduling algorithm: all the regional centers are approximately equally loaded and they both import and export jobs, which adds overhead due to the network traffic, while it would have been better if they had processed all the jobs locally. We wrote a test that simulates a situation like the one described above; it is described in more detail in the chapter dedicated to tests. For all the regional centers involved (there were 4 regional centers) we monitored the number of imported and exported jobs, and we observed that they were almost equal: every center imports jobs and exports about the same number of jobs, instead of executing its own jobs without importing or exporting anything. We also monitored the rate (the per-hour average) of imported, exported and submitted jobs. The graphs below show the rate of imported and exported jobs for two of the regional centers:

Fig. Job migration for the Cern regional center.

Fig. Job migration for the Caltech regional center.

Estimating the execution time for the jobs

As we discussed above, for the data processing jobs the user must provide an estimate of their execution time and the amount of CPU power (measured in SPECint95 units) that the jobs need in order to finish in the specified time. These values are stored in the processingtime and cpupower fields of the job object. When the jobs are running on a CPU, they are allocated a certain CPU power, which can be greater or smaller than the one provided by the user. Of course, if they are given more power, they will finish sooner, and if they are given less power, they will finish later. If the power allocated to the job is constant for the whole time the job is running, the execution time T can be calculated using the following formula:

    T = job.executiontime * job.cpupower / allocated_power

The way jobs share the CPU

A CPU can execute several jobs simultaneously, if the total memory that the jobs use does not exceed its own amount of memory. In this case, each job is allocated a fraction of the CPU's power, proportional to its running priority (remember that each job has one priority for running and another for scheduling). If we have n jobs (J1, J2, ..., Jn) with running priorities P1, P2, ..., Pn and they all run on the same CPU, then the power allocated to the job Ji is:

    allocated_power_i = total_cpu_power * Pi / (P1 + P2 + ... + Pn)

Interrupts

If the CPU power allocated to the job were constant during its execution, things would be very easy: we would calculate the allocated power, then the execution time, and we would be done simulating the job. But the real situation is not so simple: at any moment in the simulation, an activity may submit a new job, and this new job may be sent for execution on a CPU. Since the CPU then has more jobs to execute, it will offer less power to each one, which means that we must recalculate the allocated power for each job. A similar thing happens when a job finishes its execution: the others will get more power, and the exact amount must be recalculated. This is the reason we need threaded objects (tasks) for simulating the jobs: we don't know from the beginning how long it will take to execute them.
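The two formulas above, and the recalculation needed when the set of jobs on a CPU changes, can be illustrated with a short sketch. The class CpuShareSketch and its method names are hypothetical, and remainingTime assumes that the work done so far equals the allocated power multiplied by the elapsed time, which the text implies but does not state explicitly.

```java
final class CpuShareSketch {
    /** Power given to job i: total power times its share of the summed running priorities. */
    static double allocatedPower(double totalCpuPower, double[] priorities, int i) {
        double sum = 0.0;
        for (double p : priorities) {
            sum += p;
        }
        return totalCpuPower * priorities[i] / sum;
    }

    /** Execution time if the allocated power stays constant for the whole run. */
    static double executionTime(double processingTime, double cpuPower, double allocatedPower) {
        return processingTime * cpuPower / allocatedPower;
    }

    /**
     * Re-estimated remaining time after the job has run for `elapsed` time units
     * at oldPower and will continue at newPower (assumption: work = power x time).
     */
    static double remainingTime(double processingTime, double cpuPower,
                                double elapsed, double oldPower, double newPower) {
        double totalWork = processingTime * cpuPower;
        double workLeft = totalWork - elapsed * oldPower;
        return workLeft / newPower;
    }
}
```

For example, a job declared with a processing time of 10 at a CPU power of 50, running alone on a CPU of power 100, is first estimated to need 5 time units. If a second job with the same running priority arrives after 2 time units, the first job's power drops to 50 and its remaining time becomes 6, so it finishes at time 8 instead of 5.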
So, the threaded object (which is of type AJob) makes a first estimate of the time needed for processing the job, using the formulas above. Then it goes to sleep for that period of time and waits to be notified of changes in the CPU state. This is what we call the interrupt mechanism: when a new job is scheduled on a CPU, it interrupts the tasks that simulate the other jobs running on that CPU (i.e., it sends an event with the tag TAG_CPU_CHANGED to each one). The same event is also produced when a task finishes and leaves the CPU. The tasks, which were sleeping (i.e., waiting for events), receive the events and re-estimate the execution time, taking into account the new job that arrived (or no longer taking into account the job that finished). Then they go to sleep again, waiting for more events. When the sleep time expires, the job is finished. The graph below represents the interrupt mechanism for two jobs executing on the same CPU:


Fig. 6-3. Interrupts mechanism. The first task starts at time T1 and estimates that its finish time will be TF1. But at the moment T2 another task is scheduled on the same processor and the interrupt I1 is generated. So the first task recalculates the time needed for completion (the arrow shows that TF1 is moved along the time axis). The second task also calculates its execution time, TF2. Then, when the first task finishes (at the new TF1), another interrupt is generated and the second task must re-estimate its execution time, too: it will finish sooner than first expected, because it doesn't have to share the CPU with the first task anymore.

7. DATA MODEL

It is foreseen that all HEP experiments will use an Object Database Management System (ODBMS) to handle the large amounts of data in the LHC era. Our data model follows the Objectivity architecture and the basic object data design used in HEP. The model should provide a realistic mapping of an ODBMS and at the same time allow an efficient way to describe very large database systems with a huge number of objects. The model should provide transparent access to any data stored in the simulation. It also provides automatic storage management and an efficient way to handle a very large number of simulated objects. It emulates clustering factors for different types of access patterns, and it handles related objects in different databases.

In this section we describe the way the database simulation is implemented. For simulating the databases we implemented two main entities used to store data in the real world: the database server and the mass storage center. The database server uses a disk to store the data internally, while the mass storage center is a facility that stores data on tape drives.
The user can interact with both of these entities, but we have gone further by implementing an algorithm that dynamically moves data from the database server to a mass storage server in order to free space when needed. This is because the storage capacity of a database server is usually smaller than that of a mass storage unit. All the database entities have link ports associated with them, so the interaction between jobs and those entities is simulated through the network implementation, as is the case in the real world.

The data inside the simulation is kept in data containers. The container is the equivalent of a data unit stored in a real database. The container is defined by type, by name or by size. In this way the container can simulate a data item inside a database, a file inside a file system or any other model. The containers can be grouped together in databases.

The atomic unit object is the Data Container, which emulates a database file containing a set of objects of a certain type. In the simulation, data objects are assumed to be stored in such data container files in sequential order. In this way the number of objects used in the simulation to model a large number of real objects is dramatically reduced, and the searching algorithms are simple and fast. Random access patterns, necessary for realistic modeling of data access, are simulated by creating pseudo-random sequences of indices. Clustering factors for certain types of objects, when accessed from different programs, are simulated using practically the same scheme to generate a vector of intervals. A Database unit is a collection of containers and performs an efficient search by type and object index range.

The Database server simulation provides the client-server mechanism to access objects from a database. It implements response time functions based on data parameters (page size, object size, whether the access is from a new container, etc.) and hardware load (how many other requests are being processed at the same time). In this model it is also assumed that the Database servers control the data transfers from and to the mass storage system. Different policies for storage management may be used in the simulation. Database servers register with a database catalogue (Index), used by any client (user program) to address the proper server for each particular request. A schematic representation of how the data access model is implemented in the simulation program is presented in the following figure:

Fig. Data model representation.

This modelling scheme provides an efficient way to handle a very large number of objects and, at the same time, automatic storage management. It makes it possible to emulate different clustering schemes of the data for different types of data access patterns, as well as to simulate ordered data access when following the associations between the data objects, even if the objects reside in databases located on different database servers. All the containers, database entities and databases are managed by a database index defined globally within the project. The index is used to find the location of any data within the simulation. A special kind of language is used to implement the dialogue between the database entities and the activities. The language is basically formed of commands such as READ, WRITE and GET and their corresponding types of answers.

The algorithm used to implement the database functionality inside the project is built on the client-server model. The job that implements the algorithm is JobDatabase. The job is built over the network job functionality because the interaction between the user and the databases in the simulation is carried out over a network. The database job implements the methods used to get data out of the database entity, to write data on the database entity and to just read data from the database entity. A special role in the data model is played by the database index. Its purpose is to help the job find where the data it needs resides, or what the IP address of the database entity it wants is.

The algorithm that the database server uses is this:

1. If the job that the database server receives is carrying a WRITE_DATA message, then the data inside the message is of two possible types:

a) The message contains the command for the creation of a database. For this, the message also contains a string that represents the name of the database to be created.
As a response to this job, the database server creates the database, registers it with the database index and sends back a confirmation message. This is the simplest possible type of job.

b) The message contains the command for writing a number of containers on the database server. In this case the message also contains the container to be written and the number of containers that must be replicated on the database server based on that container. The database server then tries to write the container(s), allocating the necessary space for each container before the actual write. If the database server does not have the space needed to write a container, a moving algorithm is used. The algorithm is this: the database server finds the least recently used container that resides there (the container whose last use lies as far in the past as possible). Then it tries to find a mass storage unit that has enough space left to write the container on. The mass storage units are checked from the one closest to the database server to the farthest. If such a mass storage entity is found, the database server deletes the container from its space and sends a message containing that container to the mass storage facility. Then it blocks until the mass storage unit replies. The blocking is necessary because the database server cannot write any other message in this time. When the confirmation comes, the database server writes the original container for which the move was made (if possible) and then continues with the next request. In this way the database server contains only the most recently used containers, while the less accessed containers reside on tape drives.

2. If the job that the database server receives is of type READ_DATA, then the containers which belong to the database are compared against a container that is contained inside the message. This is the pattern-based database matching. The comparison can be done by the name, type or size of the container, or by the database that must contain the container. In the simulation, an argument is set to null to indicate that any container matches it. This ensures that a container can describe both files and data entities from the real world. The message also contains a number that says how many containers the user would like to retrieve based on the given pattern. If all the containers are found in the given database entity, a READ_CONF message is sent back to the job, along with the containers found, which are stored in a vector. Otherwise, if an error occurred during the transmission or not enough containers were found on the database entity, a message of type READ_ERROR is returned to the job that initiated the dialogue.

3. If the job that the database server receives is of type GET_DATA, then the command works much like READ_DATA, with the difference that, if enough containers are found, they are returned to the sender job and also permanently deleted from the database server.
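The pattern matching used by READ_DATA and GET_DATA, and the choice of the container to move to mass storage, can be sketched as follows. This is a simplified illustration, not the MONARC implementation: ContainerSketch, DbServerSketch and their methods are hypothetical names, matching is reduced to name and type, and a null return value stands in for the READ_ERROR or GET_ERROR reply.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified container holding only the fields needed here. */
final class ContainerSketch {
    final String name;
    final String type;
    final double lastUsed; // simulation time of the last access

    ContainerSketch(String name, String type, double lastUsed) {
        this.name = name;
        this.type = type;
        this.lastUsed = lastUsed;
    }

    /** A null field in the pattern matches any value. */
    boolean matches(ContainerSketch pattern) {
        return (pattern.name == null || pattern.name.equals(name))
            && (pattern.type == null || pattern.type.equals(type));
    }
}

final class DbServerSketch {
    final List<ContainerSketch> stored = new ArrayList<>();

    /**
     * Behaves like READ_DATA when remove is false and like GET_DATA when it is true.
     * Returns the matching containers, or null if fewer than count containers match.
     */
    List<ContainerSketch> fetch(ContainerSketch pattern, int count, boolean remove) {
        List<ContainerSketch> found = new ArrayList<>();
        for (ContainerSketch c : stored) {
            if (found.size() < count && c.matches(pattern)) {
                found.add(c);
            }
        }
        if (found.size() < count) {
            return null; // not enough containers: error reply
        }
        if (remove) {
            stored.removeAll(found); // GET_DATA also deletes what it returns
        }
        return found;
    }

    /** The container whose last use lies furthest in the past: the first to move to tape. */
    ContainerSketch leastRecentlyUsed() {
        ContainerSketch oldest = null;
        for (ContainerSketch c : stored) {
            if (oldest == null || c.lastUsed < oldest.lastUsed) {
                oldest = c;
            }
        }
        return oldest;
    }
}
```

A single fetch method with a remove flag is used here only to make the difference between the two commands explicit; the real server may well organize this differently.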
The time after which the response is sent back to the database job is computed so that it simulates the behavior of a real-world database server. In reality a database server takes a time to process the request and a time to serve the request; in the simulation these times are computed from the input parameters of the database server and the total amount of data and work that the database server is handling at that moment. For the mass storage entity the algorithm is approximately the same, but with different methods for computing the time the mass storage entity needs to serve the requests. The main difference is that the mass storage unit has a number of drive ports, on which it can serve multiple requests at the same time. In the simulation, as mentioned above, a container is defined by a set of input parameters, which are as follows:

    /** The size of the container */
    public double size = 0.0;
    /** The type of data stored in the container */
    public String containertype = "";
    /** The name of the container */
    public String containername = "";
    /** The effective data stored by the container */
    public Object data = null;
    /** The database that this container belongs to */
    public Database db = null;
    /** The number of times the container was accessed */
    public int accessed = 0;
    /** The time when the container was last used */
    public double lastused = 0.0;
    /** The lock on the container */
    protected boolean lock = false;
    /** The place where the container exists on */
    protected int place;

As seen, the container has a type and a name (much like a file unit), a physical size (in MB), the data that pertains to the container and the database to which the container belongs. The container also has parameters telling how many times it was accessed and at what time it was last accessed. It can also have a lock. Setting the lock to true makes the container read-only, meaning that a write operation will have no effect on it. A container can be in one of three possible locations:

    /** On disk */
    public static final int ondisk = 1;
    /** On tape */
    public static final int ontape = 2;
    /** In transfer */
    public static final int ontransfer = 0;

The database unit is very simple; all it contains is a name and a list of all the containers that belong to it:

    /** The name of the database */
    public String databasename;
    /** The containers of the database */
    private Hashtable containers = null;

The MessageData is the data that is moved between a database job and a database entity. As mentioned in the network description, any message carries data inside. In the case of the database simulation, this is the data that travels with the network message. The database entity or the database job decodes the message data, whose type is one of the following:

    public static final int READ_DATA = 0;
    public static final int GET_DATA = 1;
    public static final int WRITE_DATA = 2;
    public static final int READ_CONF = 3;
    public static final int READ_ERROR = 4;
    public static final int GET_CONF = 5;
    public static final int GET_ERROR = 6;
    public static final int WRITE_CONF = 7;
    public static final int WRITE_ERROR = 8;

The meaning of those messages is described above. The message data also has data inside: the pattern container against which the database entity does the matching or, if the message data is a response (for instance to the read and get operations), a vector containing all the containers found. The message data also has a flag telling whether the size of the containers to be written was already reserved on the database entity. This is useful if the job must write the data and ensure that no other jobs will compete for the same amount of space (first come, first served).

Another basic database unit is the database index. As mentioned above, all the database entities, databases and containers are known through the IP network address where they reside. All those entities have an address where they can be found. In the case of the database and the container, the address is that of the database entity that contains them. The database index is a global object (everywhere in the simulation a job accesses the same object) that was created to manage all those mappings. In this way a job knows precisely with which database entity to carry on the dialogue. The database index has the following structures:

    /** The mapping of dbentities vs. addresses */
    private Hashtable mapaddress = null;
    /** The mapping of dbentities vs. vector with their containers */
    private Hashtable mapcontainers = null;
    /** The mapping of dbentities vs. databases */
    private Hashtable mapdatabases = null;

Whenever a container is created, moved or deleted, or a database is created or deleted, the operation also affects this database index. The database index is the equivalent of the database metadata described in the theoretical part.

The most important part of the database simulation is the database entity. It is the basic unit that implements the task that simulates all the servers (database server, generic database server, mass storage facility). The database entity is a class created in order to hide the event handling inherited by every task. This class takes the network events and examines their content in order to decode the message. Then, based on the type of message, it calls the read, write or get method of the class built on top of it.
It is also the class that handles sending the reply message back to the sender. Of course, the classes built on top of it could handle the network messages themselves, but this class hides the low-level aspects of the task handling, so that the user only has to deal with the actual database simulation and not the infrastructure below it.

The database server simulates the behavior of a real database server. This unit is built on top of the database entity, so it does not itself deal with receiving the database messages. The database server has the following input parameters:

    /** The reading speed */
    public double readspeed = 0.0;
    /** The writing speed */
    public double writespeed = 0.0;
    /** The reading latency */
    public double readlatency = 0.0;
    /** The writing latency */
    public double writelatency = 0.0;
    /** The total disk size of the database server */
    protected double disksize = 0.0;

The database server also contains a list of all the databases and a list of all the containers that belong to it. It simulates the time needed to complete an access using the input parameters (readspeed, readlatency, etc.). We also implemented a generic database server, which is similar to the database server with the single exception that it does not deal with the recalibration process. Instead, it simply sends back WRITE_ERROR if there is not enough space to write the containers.

Another unit derived from the database entity is the mass storage entity. It does not deal with receiving the messages; it only deals with the actual functionality. The mass storage unit has the following input parameters:

    /** The total number of drives the unit has */
    private int nrdrivers;
    /** The used number of drives of the unit */
    private int useddrivers;
    /** The size of one tape for the mass storage unit */
    private double tapesize;
    /** The mount time of the mass storage unit */
    private double mounttime;

    /** The searching speed of the mass storage unit */
    private double searchspeed;

    /** The reading speed of the mass storage unit */
    private double readspeed;

    /** The writing speed of the mass storage unit */
    private double writespeed;

    /** The number of tapes a silo has */
    private int tapepersilo;

    /** The number of silos for the mass storage unit */
    private int nrsilo;

    /** The number of tape drives the mass storage unit has */
    protected int tapedrives;

    /** The total size of the mass storage unit */
    protected double totalsize;

As seen, the mass storage unit has a number of tape drives. Based on those, it can serve a number of requests in parallel. The methods for simulating the delays based on the given input parameters are as follows:

    /**
     * The method used to compute the reading delay of a given size of data.
     * @param size The size of the data to be read
     * @return The delay that the reading operation takes
     */
    protected double computereaddelay( double size ) {
        return (mounttime+tapesize*random.nextDouble()/searchspeed + size*readspeed);
    }

    /**
     * The method used to compute the writing delay of a given size of data.
     * @param size The size of the data to be written
     * @return The delay that the writing operation takes
     */
    protected double computewritedelay( double size ) {
        return (mounttime+size*writespeed);
    }
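The two delay formulas above can be tried out in isolation. The following is only a minimal sketch: the class name, the camel-case method names and all numeric values are hypothetical, and it assumes that the read and write speed parameters are expressed as time per unit of data, which is what the multiplication in the formulas suggests. The random tape position is passed in as an argument so that the result is reproducible.

```java
/** Hypothetical stand-alone sketch of the mass storage delay formulas. */
class MassStorageDelaySketch {
    // Illustrative values only; none of them come from a MONARC configuration.
    double mountTime = 10.0;     // time needed to mount a tape
    double tapeSize = 1000.0;    // size of one tape
    double searchSpeed = 100.0;  // amount of tape searched per unit of time
    double readSpeed = 0.01;     // assumption: time per unit of data read
    double writeSpeed = 0.02;    // assumption: time per unit of data written

    /**
     * Read delay: mount the tape, search to the data, then read it.
     * @param size the size of the data to be read
     * @param positionFraction where the data starts on the tape, in [0, 1);
     *        the original formula draws this value at random
     */
    double computeReadDelay(double size, double positionFraction) {
        return mountTime + tapeSize * positionFraction / searchSpeed + size * readSpeed;
    }

    /** Write delay: mount the tape, then write the data. */
    double computeWriteDelay(double size) {
        return mountTime + size * writeSpeed;
    }
}
```

With these illustrative values, reading 100 units of data that start in the middle of the tape takes 10 + 5 + 1 = 16 time units, and writing the same amount takes 10 + 2 = 12 time units.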

8. NETWORK MODEL

Accurate and efficient simulation of the networking part is also a major requirement for the MONARC simulation project. The simulation program should offer the possibility to simulate data traffic for different protocols on both LAN and WAN. This has to be done for very large amounts of data and without precise knowledge of the network topology (as in the case of long distance connections). It is practically impossible to simulate the networking part at packet level for such large amounts of data. User defined, time dependent functions are used to evaluate the effective bandwidth. The approach used to simulate the data traffic is again based on an interrupt scheme.

Fig. Network model.

When a message transfer starts between two end points in the network, the time to completion is calculated. This is done using the minimum speed value of all the components in between, which can be time dependent and related to the protocol used. The time to completion is used to generate a wait statement, which can be interrupted in the simulation. If a

new message is initiated during this time, an interrupt is generated for the LAN/WAN object. The speed of each transfer affected by the new one is recomputed, assuming that they run in parallel and share the bandwidth with weights depending on the protocol. With this new speed, the time to completion of all the affected messages is re-evaluated and inserted into the priority queue for future events. This approach requires an estimate of the data transfer speed for each component. For a long distance connection, an effective speed between two points has to be used. This value can be fully time dependent.

This approach provides an effective and accurate way to describe many large and small data transfers occurring in parallel on the same network. The model cannot describe speed variations during one transfer if no other transfer starts or finishes. This is a consequence of the fact that we have only discrete events in time. However, by using smaller packages for data transfer, or by artificially generating additional interrupts for the LAN/WAN objects, the time interval for which the network speed is considered constant can be reduced. As before, this model assumes that the data transfer between time events is done in a continuous way, using a certain part of the available bandwidth. The basic components involved in the network simulation are as follows:

Fig. The components of the network package.

As shown in the theoretical part, the network simulates the behavior of the TCP/IP network model. To do that, every layer is implemented by some modules in our project. The first layer deals with the components that make up the network. A network can be composed of link ports (the physical device that connects a computer to the network), lans (a medium that connects link ports together in order to provide communication between them), wans (a medium that connects lans together in order to

provide the necessary communication infrastructure between link ports situated in different parts of the simulation) and routers (a router connects wans together in order to provide communication between different regional centers). The details of each component are provided below.

The second layer deals with the unit that actually moves from one link port to another. This is the message. A message must contain a destination IP address and a source IP address. The message is moved by means of events. An event is passed between tasks and in turn carries a network message inside it. The network message also carries the data being transported. Other parameters, such as the message length, are used to compute the time the message needs to arrive. The message can also report parameters (such as the time it took to arrive at the destination or the bandwidth it occupies) to the output clients.

The third layer deals with the way the message moves through the network. This is implemented by the Protocol. Two kinds of transport protocols were implemented in the project: TCPProtocol and UDPProtocol. The protocol moves the message from one task on its route to the next until the destination is reached. The way in which the message is moved mirrors the way the given protocol works in the real world. For the TCP protocol, for instance, the message is first fragmented, then each part is given to the next task after a delay computed from the size of the fragment and the available bandwidth. Then, after a number of fragments, the protocol sends back an acknowledgement in order to simulate the windowing problem described in the theoretical chapter.

The fourth layer is represented by the applications that use the network communication, that is, the jobs that implement the sending and receiving of messages.
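The fragment-and-acknowledge behavior of the third layer can be illustrated with a small calculation. This is a minimal sketch and not the TCPProtocol code: the class and parameter names are hypothetical, the bandwidth is assumed constant for the whole transfer, and every window (including a final partial one) is assumed to cost one fixed acknowledgement delay.

```java
/** Hypothetical sketch of a windowed transfer over a single hop. */
class WindowedSendSketch {
    /**
     * Total time to move a message when it is cut into fragments and an
     * acknowledgement is awaited after every full window of fragments
     * (and after the last fragment).
     */
    static double transferTime(long messageBytes, int fragmentBytes, int window,
                               double bandwidthBytesPerSec, double ackDelaySec) {
        long remaining = messageBytes;
        int sentInWindow = 0;
        double time = 0.0;
        while (remaining > 0) {
            // The last fragment may be shorter than the others.
            long fragment = Math.min(fragmentBytes, remaining);
            // Delay of one fragment: its size divided by the available bandwidth.
            time += fragment / bandwidthBytesPerSec;
            remaining -= fragment;
            sentInWindow++;
            // After a full window, or at the end of the message, wait for the ack.
            if (sentInWindow == window || remaining == 0) {
                time += ackDelaySec;
                sentInWindow = 0;
            }
        }
        return time;
    }
}
```

For example, a 10000 byte message with 1500 byte fragments and a window of 3 is sent as 7 fragments and 3 acknowledgement rounds; at 1000 bytes per second and 0.5 s per acknowledgement this gives 10 + 1.5 = 11.5 s.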
In the following we describe in more detail the basic units presented above.

The link port is the entity that receives and sends messages. Every message is exchanged only between link ports. Every entity involved in the simulation has a network link port (interface) attached to it in order to take part in the network simulation. Every link port is uniquely identified by its IP address. So, if a job wants to send a message to a given address, there is only one link port that will receive the message. The simulation also provides a special kind of addressing: a link port can be described by the unit to which it is attached. In addition, to provide more dynamism, the addresses of the link ports are allowed not to be unique, in which case the message is sent to the matching link port closest to the sender. The link port's parameters are as follows:

    /** The address */
    public String address;

    /** The maximum speed (bandwidth) (MBps) */
    public double maxspeed = 0.0;

    /** The number of active connections */
    public int activeconnections;

    /** The vector with all the messages currently waiting to be sent */
    public Vector messages = null;

    /** The tasks that are connected to the link port */
    public Hashtable tasks = null;

    /** The number of tasks waiting for messages */
    public int taskswaiting = 0;

The link port is also described by its maximum available bandwidth, given in Mbps. This bandwidth is divided between the jobs that are using the interface for network activity at a given moment. The total bandwidth is not always used, however, because a message uses the minimum bandwidth allocated to it by all the entities on the route between the source link port and the destination link port. This means that even if a message is given a large enough bandwidth when it leaves the link port, if the lan that receives the message cannot allocate that bandwidth, the message will use whatever the lan can provide and inform the link port that it does not need all the bandwidth offered. The link port thus simulates the behavior of real network systems, in which a bottleneck in any component of the network produces a transmission delay that is visible even on the link port interfaces. Also, even while a message is using a given bandwidth, when a new message enters the link port on which it is transmitted, the bandwidth allocated to the messages is recomputed and new times to completion are derived. Another function of the link port is finding the route that the message will traverse to the destination link port. A message route is composed of lans, wans and routers.
If the destination link port is unreachable (no route exists between the two nodes), a message is returned to the user. The message can specify either the IP address of the destination link port or the entity containing the link port that will act as the message destination. If the destination is given by IP address, the link port requests the lan to which it is

attached to find the given destination. The lan first checks whether the link port is connected directly to it. If not, it asks the wan to which it is connected to find the link port with that address. The wan in turn asks the other lans connected to it and, if this doesn't help either, it asks the router (if any) to which it is connected to find the address. If the destination link port is given by the entity that contains it (by farm name, entity type and other information that completely characterizes a unit in the simulation), the link port directly accesses the destination link port, gets its address and again tries to find the route for the message. This is because all the network entities implement routing tables based on IP network addresses, so it is easier to find a route by IP address.

The lan and wan are entities used only to exchange events that carry message segments between link ports. Their main function is to organize the entities directly attached to them (link ports in a lan, lans in a wan, wans in a router) so as to provide the necessary routing mechanism, and to receive network events from the protocol and send responses back to it. They are also defined by an input parameter, the maximum available bandwidth.

The router is somewhat different. The router exchanges packages, so it introduces a transmission delay based on the traffic currently passing through it and on the distances (hops) between the wans and the router. The method used to implement this is:

    /**
     * The method used to get the latency introduced by the router.
     * @return The latency delay based on the number of packages currently handled by the router.
     */
    public synchronized double getlatency( WAN wan ) {
        report();
        packagesno++;
        report();
        if (wans != null) {
            for (int i=0; i<wans.size(); i++) {
                WAN w = (WAN)wans.get(i);
                if (w.equals(wan)) {
                    int distance = ((Integer)w.distancesToRouters.get(i)).intValue();
                    return (routelatency * packagesno * distance);
                }
            }
        }
        return 0.0;
    }

The network message is not an entity by itself. Its only purpose is to act as a container for the data delivered between network nodes. It also collects information about the duration of the transmission, holds information that needs to be shared by all the entities involved in the traffic, and records the protocol to be used for the transmission.

The protocol is implemented as an extendable module so that new communication protocols can be added easily. The protocol itself is an interface that must be implemented by each concrete protocol. The methods that need to be implemented are the following:

    /**
     * The method used to send a message.
     */
    public void send( Message message );

    /**
     * The method called when a message gets delivered (or a segment portion of
     * it, depending upon the protocol).
     */
    public void receivemessage();

    /**
     * The method used by some protocols to retransmit the last segment transmitted.
     */
    public void retransmitmessage();

    /**
     * The method used to handle the acknowledgement (used by reliable
     * protocols).
     */
    public void receiveconfirmation();
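To show how such an extendable module might be used, here is a minimal sketch of an implementation that has no use for retransmission or acknowledgements. All names are hypothetical stand-ins (SketchMessage, ProtocolSketch, DatagramProtocolSketch); the real interface works with the simulator's own Message class and its events, which are not reproduced here.

```java
/** Hypothetical stand-in for the simulator's message class. */
class SketchMessage {
    final String payload;
    SketchMessage(String payload) { this.payload = payload; }
}

/** The four operations listed above, with names chosen for this sketch. */
interface ProtocolSketch {
    void send(SketchMessage message);
    void receiveMessage();
    void retransmitMessage();
    void receiveConfirmation();
}

/** A datagram-style protocol: it sends and delivers, and nothing else. */
class DatagramProtocolSketch implements ProtocolSketch {
    int sent = 0;       // messages handed to the network
    int delivered = 0;  // messages reported as delivered

    @Override public void send(SketchMessage message) { sent++; }
    @Override public void receiveMessage() { delivered++; }
    // An unreliable protocol never retransmits and never waits for an
    // acknowledgement, so these two operations are intentionally empty.
    @Override public void retransmitMessage() { }
    @Override public void receiveConfirmation() { }
}
```

A reliable protocol would fill in the last two methods instead of leaving them empty; the rest of the simulation only sees the interface.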

Of course, some protocols (UDP, for instance) will not use all of these methods and will implement some of them as doing nothing. The protocol package also provides an abstract class that defines a method for computing the next task on the route, and the delay to reach it, based on the route precomputed by the link port for the message.

The TCPProtocol implements the TCP type of protocol. Its main feature is that it uses a windowing mechanism for the network transmission. The message is divided into small fragments called packages, and a number of those packages equal to the window size is sent to the network task that is the next hop. After that, an acknowledgement is or is not sent back through the same network entity. If the confirmation is received, the protocol proceeds to the next group of packages. If the task was a wan, then with a probability of 10% the acknowledgement is not sent back, simulating the faults of a real network. If the confirmation is not received within a given time (the idle time), the original packages are sent again. Through this mechanism the network simulation tries to come even closer to the behavior of a real network.

9. TESTS AND EXAMPLES

9.1. FTP test

With this test we try to determine the behavior of a simulated computing environment consisting of a central server and several workers. The worker nodes are connected to the server through a lan. Each worker has declared attributes such as the memory and the cpu power available to it. In the test, the server sends ftp messages to the workers, each message containing a number of data items (events). A worker receives the message and performs a processing that depends on the type of data and the number of events. This ftp exchange of data takes place between the server and all the workers, possibly more than once.
The type of data that we used has the following characteristics for one data item: the cpu power needed is 10.0 SI95, the memory needed is 1 MB, and the time to completion needed to fully process the data is 0.2 s, this time having a fixed distribution. The network has the following characteristics: the server has a link port with a maximum bandwidth of Mbps, the lan can accept data with a maximum bandwidth of Mbps and every worker has a link port with a maximum bandwidth of Mbps. Every worker has the following characteristics: an available processing power of 20.0 SI95 and an available memory of 15.0 MB. Because we need to test the network interruptions, we assumed that each new transfer starts 1 second after the previous one.

Determining the server usability

These first tests are meant to show how the behavior of the server changes as new workers are added to the system. For that, we only modify the number of workers, while all the other parameters stay the same. We assume that each file has a size of MB and that each file carries 10 data events, so a processing job needs 10 MB of memory to fully process all the data (see the data parameters above). The number of iterations used for the test is configurable; for this suite we use only 2 cycles. We want to see how the jobs behave while waiting for the cpu unit, so we put 2 processing jobs to run simultaneously on a worker. There is 1 second between the starts of two jobs, so the cycle goes like this: the server sends 1 message to a worker, then after 1 second it sends another message to the same worker. Say we have N workers. Then, because we defined two iterations, the server will send a new message to the same worker after 2N seconds, and after 1 more second another message to the same worker.

First let N be 1. The server sends a message to the worker at time 0.0. The bandwidth used by the server is the same as the maximum bandwidth of the worker (100.0 Mbps), because that is the rate at which the worker is willing to accept messages. The first message leaves at this speed and needs approximately 8 seconds to arrive at the worker. But after 1 second another message departs, so each message now has an available bandwidth of 50 Mbps. The arrival times of the messages are recomputed accordingly. Then, after 1 more second, a new message departs (this one belonging to the second iteration). And then the fourth.
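The way the bandwidth is divided again each time a message departs or arrives can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions, not the simulator's implementation: a single link of fixed capacity, equal shares for all active transfers (the real model weights the shares by protocol), and hypothetical class and method names. The message size of 800 Mbit used in the usage note is chosen only because a lone message then needs 8 s at 100 Mbps.

```java
/** Hypothetical sketch: transfers sharing one link, recomputed at every event. */
class SharedLinkSketch {
    /**
     * Returns the finish time of each transfer. Between two events (a start or
     * a completion) every active transfer progresses at capacity / activeCount.
     * Start times are assumed to be non-negative.
     */
    static double[] finishTimes(double capacity, double[] starts, double[] sizes) {
        int n = starts.length;
        double[] remaining = sizes.clone();
        double[] finish = new double[n];
        boolean[] started = new boolean[n];
        boolean[] done = new boolean[n];
        final double eps = 1e-9;
        double now = 0.0;
        int finished = 0;
        while (finished < n) {
            // Count the active transfers and find the next start, if any.
            int active = 0;
            double nextEvent = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                if (started[i] && !done[i]) active++;
                if (!started[i]) nextEvent = Math.min(nextEvent, starts[i]);
            }
            double share = active > 0 ? capacity / active : 0.0;
            // The next event may also be the earliest completion at the current share.
            for (int i = 0; i < n; i++) {
                if (started[i] && !done[i]) {
                    nextEvent = Math.min(nextEvent, now + remaining[i] / share);
                }
            }
            // Advance all active transfers up to the next event.
            double elapsed = nextEvent - now;
            for (int i = 0; i < n; i++) {
                if (started[i] && !done[i]) remaining[i] -= share * elapsed;
            }
            now = nextEvent;
            // Apply the event: new transfers start, finished ones are recorded.
            for (int i = 0; i < n; i++) {
                if (!started[i] && starts[i] <= now) {
                    started[i] = true;
                } else if (started[i] && !done[i] && remaining[i] <= eps) {
                    done[i] = true;
                    finish[i] = now;
                    finished++;
                }
            }
        }
        return finish;
    }
}
```

With a capacity of 100 Mbps and two 800 Mbit messages starting at 0 s and 1 s, the first message runs alone for 1 s, then both share the link at 50 Mbps; the first one finishes at 15 s and the second, which then has the link to itself, at 16 s.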
The message times are:

[Table: Job number, Arrival time]

The differences come from the fact that the first message ran for 1 second at 100 Mbps, unlike the rest, the second ran for 1 second at 50 Mbps, unlike messages 3 and 4, etc. The processing job needs 10 seconds in order to fully process all 10 events ( SI95 * 2 sec / 20 SI95 ). This means that at time , the first processing job starts and takes 10 seconds to process. The other processing job will start at moment , when

the first processing job will have already finished. So in this test no jobs are blocked waiting for another one to stop working on the cpu. These are the graphical representations:

Now let's assume that N is 10. In this case we have the following:

You can see that the lan starts to become a bottleneck. As the table above shows, the slower the messages go, the shorter the time interval between completed transfers becomes, so jobs now arrive one after another at intervals below 10 seconds and have to wait for the processing job already running to finish. Another aspect is the bandwidth. There are now moments when more than 10 transfers occur in the lan, and implicitly on the server's port, so the available bandwidth is used at full capacity.

And for N of 50, we have:

The problems seen before are even greater in this case.

9.1.3. The file size dependency

In this series of tests we keep the parameters shown before, only now we always have N=10 and we vary the file size. Let the file size be FS. The file size determines the arrival time of the messages: the smaller the file, the faster the messages reach the destination, and the bigger the file, the more time the messages take to reach it. Let's examine the output.

For the first test let FS be 10.0 MB. Then the output is:

For a FS of MB we have :

For a FS of MB we have :

9.2. Simple Distributed Scheduling Example

Brief description

The purpose of this example is to test the migration of jobs between regional centers. We have 4 regional centers (Cern, Caltech, Tokyo and Chicago) which process a

certain number of jobs every day, for a specified number of days. The number of jobs and the frequency of their arrivals depend on the period of the day: in the morning and in the evening we have fewer jobs than in the middle of the day. The regional centers are in different time zones, so they actually start processing jobs at different moments in the simulation. In the configuration files we chose the number of jobs and their duration so that, in the busy periods of the day, the regional centers can't process all the jobs and have to put some of them in waiting queues. If the scheduling algorithm allows job migration, the jobs that can't be processed immediately will be sent ("exported") to other regional centers.

The scheduling algorithm

For the moment we use a very simple and naive algorithm for distributed scheduling: if the load for all the CPUs in a regional center exceeds a certain limit, the center's job scheduler tries to send the newly arrived jobs to other regional centers. The jobs are sent to the regional center which has the minimum average load (in terms of used memory) at that moment. The problem with this algorithm is the following: suppose we have three regional centers. Suppose that regional center 1 receives a great number of jobs and decides to send them to another regional center, regional center 2. Regional center 2 then also becomes heavily loaded and starts sending jobs to another center, regional center 3. Then the load of regional center 3 increases too and it sends jobs to regional center 1.
In this situation, we can observe the inefficiency of the scheduling algorithm: all the regional centers are approximately equally loaded and they both import and export jobs, which adds overhead due to the network traffic, while it would have been better if they had processed all the jobs locally.

Simulation results

In the first test we used the following input data:
- for all the regional centers, we have the same number of CPUs (30), each one having 512 MB of memory and a processing power of 100 (SI95)
- the time intervals are defined as follows: (morning), (midday), (night)
- the processing time for the jobs is normally distributed, with an average of 3 hours; each CPU can execute only one job at a time (due to memory limitations)
- the number of jobs in the different time intervals: 50 jobs in the morning (with an average interarrival time of 2 min), 80 jobs at midday (with an average interarrival time of 1 min), 40 jobs at night (with an average interarrival time of 2 min)
- we simulated 5 days of activity

While running the simulation, we monitored the number of imported and exported jobs and tried to analyze the efficiency of the scheduling algorithm. The graphs below show the rates of the submitted jobs, imported jobs and exported jobs for two of the regional centers (measured in jobs per hour):

Fig. 1. Job rate statistics for the first test

Below are a few notes that can be useful for understanding and interpreting the graphs above:
- the simulated period is approximately three days (as the time axis shows) and each of the four graphs has some similar regions, corresponding to the working days
- the first center to start its activity is Cern; at the other centers you can see some delays, caused by the time zone differences (the order in which they start is: Cern, Chicago, Caltech, Tokyo)
- the rate of the submitted jobs (drawn with the red line) has the same variations in all the centers: in the morning the average time interval between jobs is 4 minutes, so we have about 15 jobs per hour, at midday we have 20 jobs per hour and at night we have 10 jobs per hour
- at the beginning of the simulation, the scheduling algorithm works fine: only the Cern regional center receives jobs and, when its load reaches a certain limit, it starts sending jobs to other centers, which are free (as you can see from the graph, the number of exported jobs for Cern, represented in blue, increases some time after the activity started)
- later, when all the regional centers are active, things go as follows: in the morning and at midday they are heavily loaded and export many jobs, and at night, when the number of local jobs is small, they import jobs from other centers

In this first test, the positive aspect was that when the centers were exporting jobs (in the daytime), the number of imported jobs was small, and when they were importing jobs (at night), the number of exported jobs was small.

We made a second test in which we changed the number of CPUs in all the regional centers from 30 to 20. Since there are now fewer resources, the size of the waiting queues is greater, and so is the load in the centers.
From the graph below, which represents the job statistics for Cern, you can see that, at night, the center both imports and exports many jobs (and the same happens with the other centers). In this situation we have the problem mentioned earlier: the centers uselessly pass jobs from one to the other instead of processing their own jobs.
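The export rule of the scheduling algorithm, and the reason it leads to this behavior, can be sketched in a few lines. This is an illustration under assumptions, not the Distributed Job Scheduler code: the class name, the method name and the load values are hypothetical, and the load of each center is reduced to a single number between 0 and 1.

```java
/** Hypothetical sketch of the naive export rule used for distributed scheduling. */
class ExportRuleSketch {
    /**
     * Returns the index of the center that will run a job arriving at `local`.
     * If the local load exceeds the limit, the job is exported to the other
     * center with the minimum load. Note that the rule does not check whether
     * that center is itself above the limit.
     */
    static int chooseCenter(double[] load, int local, double limit) {
        if (load[local] <= limit) {
            return local; // enough local capacity, process the job here
        }
        int best = -1;
        for (int i = 0; i < load.length; i++) {
            if (i == local) continue;
            if (best < 0 || load[i] < load[best]) best = i;
        }
        return best < 0 ? local : best; // a single center has nowhere to export to
    }
}
```

With loads of 0.95, 0.92 and 0.97 and a limit of 0.9, a job arriving at center 0 is exported to center 1, while a job arriving at center 1 is exported to center 0: both centers are overloaded, yet each sends its new jobs to the other. A rule that also compared the target's load with the limit would keep such jobs local.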

Fig. 2. Job rate statistics for the second test

This problem happens at night because in this period the waiting queues have the greatest size (they are full of jobs that were submitted during the daytime and have not been processed yet). This can be observed from the next graph, which shows the number of running jobs and waiting jobs for Cern (in both graphs, the arrows point to the regions that represent the night periods):

Fig. 2. Job statistics for the second test

This simple scheduling algorithm performs poorly, as our tests showed, so in our future work on this project we intend to implement more efficient algorithms. They should be treated with great attention, because of the complications that can appear here: large amounts of data to be processed in the Physics experiments, the lack of synchronization between the scheduling algorithms and the database replication strategies, and so on.

10. CONCLUSIONS

The project is still under development, and some new members recently joined the team. Yet a functional version of the program is available and can be downloaded from the home page of our project:

So far, we have implemented: the simulation engine and the basic components, the database and network packages, the graphical user interface and some tests and examples. As we discussed in a previous chapter, we made some performance tests on machines with one or more processors, obtaining better results on multiprocessors.
