
Processing TOVS Polar Pathfinder Data Using the Distributed Batch Controller

James Du (a), Kenneth Salem (b), Axel Schweiger (c), and Miron Livny (d)

(a) Department of Computer Science, University of Maryland, College Park, MD USA
(b) Department of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1 Canada
(c) Applied Physics Laboratory/Polar Science Center, University of Washington, 1013 NE 40th Street, Seattle, WA USA
(d) Computer Sciences Department, University of Wisconsin - Madison, 1210 W. Dayton St., Madison, WI USA

Other author information: K. Salem (correspondence): kmsalem@uwaterloo.ca; Telephone: x3485

ABSTRACT

The Distributed Batch Controller (DBC) supports scientific batch data processing. Batch jobs are distributed by the DBC over a collection of computing resources. Since these resources may be widely scattered, the DBC is well suited for collaborative research efforts whose resources are not centrally located. The DBC provides its users with centralized monitoring and control of distributed batch jobs. Version 1 of the DBC is currently being used by the TOVS Polar Pathfinder project to generate Arctic atmospheric temperature and humidity profiles. Profile-generating jobs are distributed and executed by the DBC on workstation clusters located at several sites across the United States. This paper describes the data processing requirements of the TOVS Polar Pathfinder project and how the DBC is being used to meet them. It also describes Version 2 of the DBC. DBC V2 is implemented in Java and uses a number of advanced Java features, such as threads and remote method invocation. It also incorporates a number of functional enhancements, including a flexible mechanism that supports interoperation of the DBC with a wider variety of execution resources, and an improved user interface.

Keywords: batch processing, TOVS Polar Pathfinder, DBC, Java

1. INTRODUCTION

The Distributed Batch Controller (DBC) is a tool for scientific data processing [1]. It executes batch data processing jobs in pools of workstations. While it can be used to control jobs in a single pool, it is intended for users who wish to combine more than one pool into a single computational engine. For example, a collaborative research team could use the DBC to combine the computational resources of individual team members. The workstations used to process DBC jobs remain autonomous and are not assumed to be dedicated to DBC-related work. Furthermore, they may be widely distributed. For example, the DBC system described later in this paper uses workstation pools located in Washington, Wisconsin, and Maryland. The DBC addresses computational issues that arise in this environment. For example, pools may join or leave the DBC system at any time, the processing power available at the pools may fluctuate with time, job input data must be shipped to remote pools and output data must be collected, and failures of various types must be detected and corrected.

Figure 1. Architecture of the DBC

In addition, the system provides a centralized administrative interface to allow its users to monitor and control distributed batch execution. The DBC is currently being used by the TOVS Polar Pathfinder project to generate Arctic atmospheric temperature and humidity profiles. This paper describes the project's data processing requirements and shows how they are supported by the DBC. It also describes DBC V2, a new Java-based implementation of the DBC architecture. DBC V2 incorporates a number of enhancements intended to make it more flexible and powerful than its predecessor.

2. DBC OVERVIEW

A single DBC job consists of a sequence of one or more job steps. A job step represents the execution of a program or a transfer of data from one location to another. Job steps may have formal parameters, which are typically used to represent input and output file names and run-time parameters for the job step program. A batch is a set of jobs. All of the jobs in a DBC batch are identical, except for their parameters. In the scientific domains to which the DBC is targeted, batches are often large and long-running.

Figure 1 illustrates the run-time architecture of a DBC system. A DBC system is used to process a single batch of jobs. Batches may be dynamic, in the sense that jobs can be added to or removed from a batch while it is being processed. Although a single DBC system always processes a single batch, several DBC systems could run concurrently and share the available resources. Computational power is provided to the DBC by one or more resource pools. Pools are autonomous and may join, leave, and rejoin the system at any time. A DBC agent controls jobs at each pool, and a single DBC master coordinates the agents. The master maintains information about the entire batch, assigns jobs to pools, and is informed when jobs are completed. If a pool leaves the system, the master may reassign its tasks to other pools. The master also supports user interfaces that provide a central point of control for a human administrator.

Each DBC pool includes one or more "execution resources" as well as some disk buffer space. Resources normally represent workstations, or clusters of workstations, that are capable of executing job steps. The DBC agent at each pool is responsible for arranging for the execution of the jobs assigned to it by the master.
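As a concrete, purely illustrative picture of this model, the sketch below represents a job as an ordered sequence of parameterized steps. The class and field names are hypothetical; they are not the DBC's actual data structures. The step names anticipate the TOVS Path-P job described in Section 3.

import java.util.Properties;
import java.util.Vector;

// Hypothetical representation of the DBC batch/job/step model.
class JobStep {
    String type;             // a data transfer step or a program execution step
    Properties parameters;   // formal parameters: file names, run-time arguments, ...

    JobStep(String type, Properties parameters) {
        this.type = type;
        this.parameters = parameters;
    }
}

class Job {
    Vector steps = new Vector();   // an ordered sequence of one or more job steps
    void addStep(JobStep s) { steps.addElement(s); }
}

class Batch {
    Vector jobs = new Vector();    // all jobs are identical except for their parameters
    void addJob(Job j) { jobs.addElement(j); }
}

A batch for the application described in Section 3 would contain one such job per orbital pass, each carrying its own HIRS and MSU file-name parameters.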

Executing a job involves moving its input data from the input repository to the pool's disk buffer, arranging for the execution of the job step programs on the pool's execution resources, and moving the resulting output data from the pool buffer to the output repository.

One type of execution resource supported by the DBC is a workstation cluster managed by the Condor resource management system [2,3]. Condor harnesses the computational power of unused workstations in the cluster. When a DBC agent executes a job using a Condor resource, the Condor system locates an idle workstation in its cluster (if one is available) and assigns the job to it. By interacting with Condor in this way, the DBC is able to turn the power of unused workstations in the cluster into a valuable computing resource.

Version 1 of the DBC is implemented in C and runs on a variety of UNIX platforms. Communication among DBC processes is implemented using the Parallel Virtual Machine (PVM) system [4]. DBC V2 is implemented in Java. We describe DBC V2 in more detail in Section 5.

3. PROCESSING POLAR PATHFINDER DATA

The DBC is being used as a data processing engine for the TOVS Polar Pathfinder (TOVS Path-P) project at the Polar Science Center (PSC) in the Applied Physics Laboratory at the University of Washington. This project, part of the NOAA/NASA Pathfinder program, is producing daily Arctic atmospheric temperature and humidity profiles from the High Resolution Infrared Sounder (HIRS) and the Microwave Sounding Unit (MSU) for an 18-year period. A modified version of the Improved Initialization Inversion (3I) retrieval algorithm [5,6] is used to determine geophysical products (temperature and humidity profiles) from satellite radiances. The algorithm performs the inversion of the radiative transfer equation starting from a "first guess" profile, which is selected from a library by minimization of brightness temperature differences. The final solution is found by correcting the first guess. To avoid forward radiative transfer calculations, this step uses precomputed Jacobian matrices that contain the partial derivatives of satellite brightness temperatures (TB) in different channels with respect to the temperature and humidity values at different levels in the atmosphere. The algorithm consists of five steps:

1. calibration of the input data (NOAA Level 1B),
2. interpolation of HIRS and MSU data to a common georeference system,
3. determination of the initial guess, cloud detection and clearing,
4. inversion of temperature and humidity profiles,
5. interpolation of algorithm results to standard levels.

Although the algorithm was designed to be fast enough for large-scale or global processing, the computational demands of creating an 18-year data set are still substantial. Profiles are being generated using data from approximately 100,000 orbital passes. Each execution of the profile retrieval program uses two Level 1B input files, one containing HIRS data and the other containing MSU data. The total size of these files is about 350 kilobytes compressed. The program generates about 1 megabyte (compressed) of output data. The time to execute the profile retrieval program ranges from about five minutes of CPU time per orbit on a dedicated Sun Sparc-20/60 to about twelve minutes per orbit on a Sparc-10. Assuming no I/O-related delays, sequential processing of this data set would occupy such workstations for approximately one to two years. Since the PSC team expects to reprocess the data two or three times to incorporate modifications in response to validation and user feedback, this application alone may represent three to five workstation-years of computing.
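The one-to-two-year figure quoted above follows directly from the per-orbit numbers; the following small Java calculation (illustrative only, using the rounded values from the text) reproduces it.

public class ProcessingEstimate {
    public static void main(String[] args) {
        int orbits = 100000;            // approximate number of orbital passes
        double fastMinutes = 5.0;       // CPU minutes per orbit on a Sparc-20/60
        double slowMinutes = 12.0;      // CPU minutes per orbit on a Sparc-10
        double minutesPerYear = 60.0 * 24.0 * 365.0;

        double fastYears = orbits * fastMinutes / minutesPerYear;   // roughly 1 year
        double slowYears = orbits * slowMinutes / minutesPerYear;   // roughly 2.3 years
        System.out.println("Sequential processing: " + fastYears + " to "
                + slowYears + " workstation-years");
        // Reprocessing the data set two or three times, as the PSC team expects,
        // scales this estimate up to the three to five workstation-years quoted above.
    }
}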
Each execution of the profile retrieval program is implemented as a single DBC job with one execution step and several data transfer steps. The data transfer steps bring the HIRS and MSU input files to the pool and send the output to the repository at the PSC. The execution step runs a simple csh script that uncompresses the input files, runs the profile retrieval program, and compresses the output. Alternatively, it would have been possible to treat each of the five steps of the algorithm, plus data compression and uncompression, as separate job steps. We elected to treat them as a single step to avoid the run-time overhead of scheduling many short steps.

Retrieval jobs are normally run in batches of 500 to 5000 jobs, and the DBC is typically shut down between batches. (This mode of operation is preferred by the PSC, and is not required by the DBC.) Initially, all of the jobs in a batch are submitted to the DBC using a program that specifies which input files are to be processed by each job. This program is written in C and uses the DBC's API to inform the system of the new jobs. (Jobs can also be submitted manually through the DBC console program, but the use of a program is preferable when large numbers of jobs are to be submitted.) The DBC console is then used to monitor and control the remainder of the batch execution. Pools can be added and removed using console commands, and progress statistics (number of executed jobs, failed jobs, etc.) can be displayed. Jobs can be retried or canceled. Detailed processing logs are kept in separate log and error files.

4. EXPERIENCE

To generate profiles, we have used a DBC system consisting of three widely-distributed workstation pools. The largest pool is at the University of Wisconsin Computer Science Department in Madison, Wisconsin, consisting of about 40 Sun/Solaris workstations and 30 Sun/SunOS 4.1.3 workstations. Its disk buffer has a 180 megabyte capacity. The second pool is at the Center of Excellence in Space Data and Information Systems (CESDIS), located at the NASA Goddard Space Flight Center in Maryland. It consists of 14 Sun/SunOS 4.1.3 workstations and a 160 megabyte disk buffer. The third pool consists of the PSC's own workstations, including two Sun/Solaris workstations, four HPUX workstations, and a 100 megabyte disk buffer. All three pools are shared by other users, so the number of workstations available for DBC jobs is normally much less than the total. A RAID storage device at the PSC in Seattle is used as both the input repository and the output repository for the system. The DBC master is also located at the PSC site. This DBC system has been used to process tens of thousands of retrieval jobs. The following are some observations concerning the performance and usability of the DBC system under this workload.

4.1. Performance

Figure 2. Progress of a Typical DBC Batch Run

The DBC system was able to make effective use of all three of its workstation pools. Figure 2 shows jobs completed as a function of time for one typical batch of 5000 jobs.

The figure shows the jobs completed by each pool, as well as the total. Because the pools at Wisconsin and the PSC each included two different types of machines, we have included two lines for each of those pools in the figure. Each line represents jobs completed by machines of a particular type in a particular pool. This batch took 86 hours to complete, for an overall throughput of 58 jobs per hour. The remote CESDIS and Wisconsin pools contributed substantially, processing 77% of all of the jobs. An analysis of job response times from this batch showed that workstation availability was the throughput bottleneck at all three pools in the system. In other words, the DBC system had no problem supplying sufficient jobs to utilize all of the idle workstations that Condor was able to give it from these pools. Data transfer, even to and from the remote pools, was not a performance-limiting factor. This was because the DBC was able to hide the data transfer latency by overlapping it with the processing time of other jobs. This is possible as long as the DBC has ample disk buffer space at each pool.

4.2. Failures and Recovery

Failures are a fact of life in a long-running distributed system like the DBC. Much of the implementation effort for DBC V1 went into failure detection and recovery. Nonetheless, failures still had an impact on the system's performance and ease of use.

Figure 3. DBC Throughput With and Without Pool Downtime

To examine the impact of pool failures, we analyzed data from a set of seven batches of jobs (about 15,000 jobs total). The first column of Figure 3 shows the average throughput for each pool, measured over all of the batches, and the second column shows the percentage of time that each pool was down while the DBC was running. The third column shows the throughput we would expect if pools never went down, assuming their average processing rate remained unchanged. If all pools had run without failure, the overall throughput of the DBC would have been 85.9 jobs per hour, more than twice the measured value of 38.6 jobs per hour.

Not all of the pool downtime was caused by failures. For example, during the batch execution shown in Figure 2, the Wisconsin pool was unavailable during the first thirty hours of the batch run because of a planned shutdown for installation and testing of a new version of the Condor software. Nonetheless, failures of DBC pools were not uncommon. We observed pool failures for all of the following reasons: operating system crashes (on DBC agent machines), local area network and network file server failures, Condor failures, broken network connections between the DBC agent and the master, and the DBC agents' own automatic shutdown mechanism, which is triggered by a burst of persistent job failures.

DBC agents shut themselves down if they detect a high rate of failures among local DBC jobs. This mechanism exists to prevent a problem pool from consuming large numbers of batch jobs. Jobs can fail for a variety of reasons. Some failures are due to application problems such as bad input data or bugs in the job step programs. Some are due to a timeout mechanism built into the DBC that is used to detect "hung" jobs. This mechanism can fail to distinguish between a job that is truly hung and one that is simply slow, perhaps because the underlying Condor system is heavily loaded.

Some failures are due to Internet problems that prevent the transfer of the job's input or output data to the pool. All of these failures tend to be bursty, and may result in a pool shutdown if a sufficiently large burst of failures occurs. When a pool fails for any reason other than the automatic shutdown mechanism, the master attempts several times to restart it before eventually giving up. If the restart is successful, jobs that were in progress at the time of the failure must be re-run (this is handled automatically), but downtime is normally minimal. Pools that succumb to the automatic shutdown mechanism are not restarted automatically, since jobs are likely to fail there again if the underlying problem is not corrected. When automatic restart fails or is not employed, it is up to the DBC user to restart the pool manually by issuing the appropriate command from the DBC console. Although our goal is fully automatic long-term operation, it is our experience that the DBC must still be monitored several times a day to determine whether manual intervention is required.

4.3. Metadata

If a system like the DBC is to be useful, it must provide its users not only with output from the batch jobs, but also with meta-data describing how that output was produced. These meta-data are important for several reasons:

Performance Monitoring and Analysis: The DBC user should be able to track how many jobs were processed by each pool and how long they took. As we have already noted, periodic run-time monitoring of these data is necessary to detect and correct some kinds of pool problems. Off-line meta-data analysis is also useful for identifying performance problems and for configuring the run-time system for future batches.

Interpretation of Results: Different jobs in a DBC batch may run on different types of machines, depending on the makeup of the pools in the run-time system. The type of machine on which a job runs may have an impact on the output data it generates. For example, floating point numeric calculations may differ slightly on different machines. To properly interpret the output of a batch job, a record of the environment in which each job was processed is essential.

Application Debugging: DBC application job step programs, like other programs, will not be free of errors. Such errors may manifest themselves more frequently in the heterogeneous execution environment provided by the DBC, since some environments may expose program bugs that others do not. When a job fails, some record of its state, e.g., a "core dump", should be available for subsequent post-mortem analysis. There must be some mechanism for a DBC user to collect the necessary state and environment information for failed jobs, which may be widely distributed.

Currently, the DBC makes several types of meta-data available to its users. Performance statistics are shipped from the pools and are available immediately through the DBC master. This allows the user to determine the status of any job or pool in the system, and the rate of job completion at the various pools. In addition, a variety of information about job execution environments and execution times is shipped to the master and logged. Tools are provided for subsequent analysis of the logs. All of the measurements presented in this paper were produced from analyses of DBC-generated logs. In DBC V1, debugging data, such as core files, are not shipped to the master because they are bulky and because they are not always needed. Instead, debugging data is saved in each pool's local disk buffer. The amount of data to save can be controlled by run-time parameters.
Options include saving everything (including bulky core files), saving just small files (typically stdout and stderr output), or saving nothing at all other than the output data (normal operation). Since the DBC assumes that it is allowed to use only a limited amount of disk space at each pool, the collection of debugging data can cause a gradual decline in a pool's performance. This is because the debugging data occupies buffer space that could otherwise have been used for job input and output data. This reduces the amount of data prefetching that can be performed, and may limit the maximum number of batch jobs that can execute concurrently in a pool.

5. DBC VERSION 2

Version 2 of the DBC is a complete re-implementation of the DBC architecture. DBC V2 supports a more flexible execution model than V1, which was restricted to one resource per pool and one execution step per job. It also provides an improved user interface and a standardized internal interface to external system components, such as Condor. This simplifies the task of extending the DBC to interact with new systems. Finally, DBC V2 is a lighter-weight implementation, in the sense that fewer DBC processes are required at each of the pools.

DBC V2 is implemented entirely in Java. Commonly cited advantages of Java include its software engineering features, which will hopefully lead to robust and maintainable code, and the promise of write-once-run-anywhere. Equally important, from our perspective, are several other features of the Java environment. First, Java provides machine-independent language support for threaded applications. Both the DBC master and the agents make heavy use of threads. In particular, Java threads eliminate the need for separate job monitor processes, which were required at each pool in DBC V1. Second, Java supports a distributed object model, in which one Java virtual machine can invoke methods of objects located in a remote virtual machine. In DBC V2, all communication between the master and the agents is accomplished using the Remote Method Invocation (RMI) mechanism. This has the effect of shielding the DBC from tasks such as parameter marshalling and unmarshalling. These tasks were performed explicitly by the DBC in Version 1, which relied on the PVM message passing facilities. Third, the Java environment provides standard interfaces to external components, such as operating systems. This hides, in principle if not always in practice, some of the heterogeneous nature of these components from the DBC.
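To illustrate the style of master-agent communication that RMI enables, here is a minimal sketch of a remote interface. The interface and method names are hypothetical and are not the actual DBC V2 interfaces; the point is that RMI-generated stubs take care of the argument marshalling that DBC V1 had to perform explicitly over PVM.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface that a DBC-like master might export to its agents.
interface BatchMaster extends Remote {
    // An agent at a pool asks the master for another job to run.
    JobDescription requestJob(String poolName) throws RemoteException;

    // An agent reports that a job has finished, successfully or not.
    void jobCompleted(String jobId, boolean success) throws RemoteException;
}

// Arguments and results of remote calls are marshalled automatically by RMI,
// provided they are serializable.
class JobDescription implements java.io.Serializable {
    String jobId;
    String[] stepTypes;               // e.g. gethirs, getmsu, execinversion, send_result
    java.util.Properties parameters;
}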
5.1. The Job Execution Model

DBC V2 supports a flexible job execution model. Under the model, a single logical job step may be executed differently depending on where it runs. This provides a simple way to take different types of computing resources and integrate them into a single DBC system. The job model requires that several things happen before the DBC runs jobs. First, job steps must be defined. Second, the resources available at each pool must be specified. Finally, the DBC must be told how to implement the various job steps using those resources. Figure 4 shows part of a DBC script that defines job steps, resources, and implementations for the profile retrieval application described in Section 3. Normally, such a script is run once, when the DBC system is being initialized. Once this is done, an arbitrary number of jobs can be created and submitted to the system for execution.

As was described in Section 2, each job consists of a sequence of job steps. The DBC provides two built-in generic job step types: program execution steps and data transfer steps. A DBC user customizes these generic types to define the job step types for a particular application. Customizing involves creating a unique name for the new step type and specifying default values for step parameters. For example, the four addjobsteptype commands in Figure 4 define four step types for the TOVS Path-P application:

1. gethirs - a data transfer step to get the HIRS input file
2. getmsu - a data transfer step to get the MSU input file
3. send_result - a data transfer step to move the resulting profile to the output repository
4. execinversion - a program execution step to execute the profile retrieval script using the HIRS and MSU input files

Job step types may be parameterized. In Figure 4, each step has a timeout parameter used by the system to detect step execution problems. Parameter values supplied with the job step type definition apply to all job steps of that type. It is possible to override such values for a particular job step, or for all steps of that type that are executed by a particular resource.

DBC resources can represent anything that is capable of executing DBC job steps. Each DBC pool must include one or more resources. When a DBC pool is first started, it is necessary to specify the execution resources that are available at that pool. The built-in resource types include Unix workstations and Condor-managed workstation clusters. In addition, there is an FTP resource, which represents the ability to transfer files to and from a pool.

First, define four types of job steps, two for transferring the HIRS and MSU input files, one for executing the inversion procedure, and one for transferring the resulting profile.

addjobsteptype gethirs XferStep direction[input] timeout[300]
addjobsteptype getmsu XferStep direction[input] timeout[100]
addjobsteptype send_result XferStep direction[output] timeout[600]
addjobsteptype execinversion ExecStep timeout[3600]

Next, create a DBC pool at foo.washington.edu. Initially, the pool has no resources and cannot process jobs.

startpool foo.washington.edu /home/dbc/buffer dbc

Now create a resource representing HPUX machines in the Condor cluster at foo.washington.edu.

addresource foo ConHPUX9 CondorResource runlimit[3]

Now create an implementation that will allow execinversion steps to run on the newly-created resource.

addimpl foo execinversion Condor5Exec ConHPUX9 \
    execpath[/home/dbc/iii/bin/pp_iii_run.hpux9.script] \
    cputype[hpux9] \
    requirements[(Arch == "HPPAR") && (OpSys == "HPUX9")] \
    execmode[vanilla] polling_freq[15]

Create another resource to represent the Solaris machines in the Condor cluster.

addresource foo ConSolaris CondorResource runlimit[4]

Allow execinversion steps to run on the Condor/Solaris machines.

addimpl foo execinversion Condor5Exec ConSolaris \
    execpath[/home/dbc/iii/bin/pp_iii_run.solaris.script] \
    cputype[solaris] \
    requirements[((Arch == "sun4m") || (Arch == "sun4c")) && (OpSys == "Solaris")] \
    execmode[vanilla] polling_freq[15]

Finally, create an FTP resource and implementations for the data transfer steps. The input and output data reside at bar.washington.edu.

addresource foo FTP FtpResource runlimit[4] remote_site[bar.washington.edu]
addimpl foo gethirs FtpXfer FTP
addimpl foo getmsu FtpXfer FTP
addimpl foo send_result FtpXfer FTP

Figure 4. Part of a DBC Script for the TOVS Path-P Application

Figure 4 includes three addresource commands, each of which defines a resource at the pool foo.washington.edu. (The pool, in turn, was defined using the startpool command.) Two of the resources represent Condor-managed workstation clusters, and the third is an FTP resource. Like job steps, resources may be parameterized. The runlimit parameter used in Figure 4 specifies an upper bound on the number of DBC job steps that may be running concurrently on a particular resource. This can be set to limit the load that the DBC can place on a particular machine or group of machines.

Because there are different types of resources, the exact procedure used to execute a particular job step will depend on the resource that executes it. For example, if the resource is a Condor cluster, the agent executes a job step by submitting the step's program to Condor for execution. To execute that same job step on a resource consisting of a single workstation, the agent would create a new process to run the step's program. The DBC uses "implementations" to encapsulate knowledge of how to execute a particular type of job step on a particular resource. The DBC includes built-in, parameterized implementations that map the built-in generic job step types to the built-in resource types. For example, the first addimpl command in Figure 4 defines an implementation that specifies how an execinversion job step is executed on the ConHPUX9 resource at the foo.washington.edu pool. An implementation parameter specifies the name of the csh script that performs the inversion. Additional parameters, some of which are passed through to the Condor system, are also specified.

By specifying the implementations that are to be used at each pool, the DBC user can control which job steps can be executed on which resources at each pool. In our example, implementations for execinversion steps are defined for both the ConHPUX9 and ConSolaris resources, so execinversion steps will be able to run on either resource. The DBC requires that at least one implementation be specified for each job step type at a pool before that pool will be allowed to process jobs. This ensures that it is possible to completely execute jobs at the pool.

The job model provides a layer of indirection that allows the DBC to exploit a heterogeneous collection of computing resources to process jobs. At the pool foo.washington.edu, the script used to perform an inversion procedure and the parameters passed to the underlying Condor system will depend on which resource is used to perform the inversion job step. This model separates the logical structure of the job from the details of how job steps are implemented at a particular pool. The model also simplifies the task of extending the DBC to support new types of resources. To support a new resource, it is necessary to create a new DBC resource type and new implementations to map job steps to the new resource type. For example, one could create a new resource type to represent workstations running the Windows NT operating system [7], plus a new implementation that maps program execution job steps to NT resources. Essentially, the new implementation would encapsulate the procedures necessary to run and monitor a program on an NT machine. The remainder of the DBC would remain unchanged.
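As a rough illustration of what such an extension might look like, the following Java sketch shows one possible shape for a resource abstraction and an NT-specific implementation. The interface and class names are hypothetical; they are not the DBC V2 classes, which are not shown in this paper.

// Hypothetical resource abstraction; not the actual DBC V2 interfaces.
interface Resource {
    String name();
    int runLimit();                       // analogous to the runlimit parameter
}

interface StepImplementation {
    // Start the given job step on the given resource and report success or failure.
    boolean execute(String stepType, java.util.Properties stepParams, Resource target)
        throws java.io.IOException;
}

// A sketch of an implementation that maps program execution steps to an NT machine,
// for example by launching the step's program through a remote shell service.
class NTExecImplementation implements StepImplementation {
    public boolean execute(String stepType, java.util.Properties stepParams, Resource target)
            throws java.io.IOException {
        String program = stepParams.getProperty("execpath");
        // How the program is actually started and monitored on the NT host is the
        // knowledge this class encapsulates; the rest of the system only sees
        // the StepImplementation interface.
        Process p = Runtime.getRuntime().exec(new String[] { "rsh", target.name(), program });
        try {
            return p.waitFor() == 0;
        } catch (InterruptedException e) {
            return false;
        }
    }
}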
5.2. User Interface

Two user interface programs are available for monitoring and controlling DBC systems. Both programs communicate with the DBC master through a standard DBC application programming interface.

The first program provides a text-based console that can be used to control and monitor a DBC system. Console commands allow jobs to be submitted, pools to be started and stopped, and error logging to be controlled. Pool status and job status can be checked. Console commands are also used to define job step types, pool resources, and implementations. These commands are normally read from a file when the DBC is first started.

The second program is DBCHttpd, which implements the HyperText Transfer Protocol. This program supports a graphical DBC monitor which can run in a Java-enabled web browser. A browser that connects to the DBCHttpd downloads a Java applet, which then retrieves current DBC status information from the DBCHttpd. This situation is illustrated in Figure 5. The applet displays a batch progress chart showing the number of jobs completed and aborted as a function of time. It can also display more detailed information about specific pools, such as the loads on the pools' resources. A snapshot of the monitor interface is shown in Figure 6.
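The applet's interaction with DBCHttpd is, in essence, periodic HTTP polling. The fragment below is a minimal sketch of that pattern in Java; the host, port, status URL, and response format are hypothetical, since the paper does not specify the request format that DBCHttpd actually uses.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Minimal polling loop: fetch a status page from an HTTP daemon and print it.
// The address and the path "/status" are placeholders, not the real DBCHttpd protocol.
public class StatusPoller {
    public static void main(String[] args) throws Exception {
        URL statusUrl = new URL("http://dbc-master.example.edu:8080/status");
        while (true) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(statusUrl.openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // a real monitor would update its charts here
            }
            in.close();
            Thread.sleep(60 * 1000);        // poll once a minute
        }
    }
}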

Figure 5. The DBC HTTP Daemon

Figure 6. The Graphical Monitor

6. CONCLUSIONS

The DBC system demonstrates that a widely-distributed collection of workstation pools can be used to process TOVS Polar Pathfinder data. Such a system may be particularly valuable to collaborative research teams, which may control a variety of widely-distributed computing resources. By combining the power of several pools, the time required to complete large batch processing tasks can be substantially reduced. Such a system is flexible, and can provide computing cycles even when some pools are unavailable. The heterogeneity and distribution of the system's computing resources are potential performance impediments which the DBC has been designed to overcome. More information about the DBC, including a user's guide and the DBC V2 code, can be found at the DBC's World Wide Web site.

ACKNOWLEDGEMENTS

We are grateful to CESDIS for providing access to its workstations in support of the DBC system described here. The Polar Pathfinder team has been patient with the system. Victor Wiewiorowski, at the University of Waterloo, implemented the graphical front-end for DBC V1. Work on the DBC has been supported by NASA through its Applied Information Systems Research Program.

REFERENCES

1. C. Chen, K. Salem, and M. Livny, "The DBC: Processing scientific data over the internet," in 16th International Conference on Distributed Computing Systems, pp. 673-679, May 1996.
2. A. Bricker, M. Litzkow, and M. Livny, "Condor technical summary," Tech. Rep. TR 1069, Department of Computer Science, University of Wisconsin, Oct. 1991.
3. M. Litzkow and M. Livny, "Experience with the Condor distributed batch system," in Proc. of the IEEE Workshop on Experimental Distributed Systems, pp. 97-101, Oct. 1990.
4. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing, The MIT Press, Cambridge, MA, 1994.
5. A. Chedin, N. A. Scott, C. Wahiche, and P. Moulinier, "The Improved Initialization Inversion method: A high resolution physical method for temperature retrievals from satellites of the TIROS-N series," J. Clim. Appl. Meteorol. 24, pp. 128-134, 1985.
6. J. A. Francis, "Improvements of TOVS retrievals over sea ice and applications to estimating Arctic energy fluxes," Journal of Geophysical Research 99(D5), pp. 10395-10408, 1994.
7. H. Custer, Inside Windows NT, Microsoft Press, 1993.


How to Break Software by James Whittaker

How to Break Software by James Whittaker How to Break Software by James Whittaker CS 470 Practical Guide to Testing Consider the system as a whole and their interactions File System, Operating System API Application Under Test UI Human invokes

More information

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University File System Internals Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics File system implementation File descriptor table, File table

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

CUMULVS: Collaborative Infrastructure for Developing. Abstract. by allowing them to dynamically attach to, view, and \steer" a running simulation.

CUMULVS: Collaborative Infrastructure for Developing. Abstract. by allowing them to dynamically attach to, view, and \steer a running simulation. CUMULVS: Collaborative Infrastructure for Developing Distributed Simulations James Arthur Kohl Philip M. Papadopoulos G. A. Geist, II y Abstract The CUMULVS software environment provides remote collaboration

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

1. Operating System Concepts

1. Operating System Concepts 1. Operating System Concepts 1.1 What is an operating system? Operating systems are an essential part of any computer system. An operating system (OS) is software, which acts as an intermediary between

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 1 Operating Systems Overview 2 Chapter 2 3 An operating System: The interface between hardware and the user From the user s perspective: OS is a program that controls the execution of application programs

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File-Systems, Silberschatz, Galvin and Gagne 2009 Chapter 11: Implementing File Systems File-System Structure File-System Implementation ti Directory Implementation Allocation

More information

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations Yuichi Tsujita Abstract A Windows PC cluster is focused for its high availabilities and fruitful

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Figure 1-1: Local Storage Status (cache).

Figure 1-1: Local Storage Status (cache). The cache is the local storage of the Nasuni Filer. When running the Nasuni Filer on a virtual platform, you can configure the size of the cache disk and the copy-on-write (COW) disk. On Nasuni hardware

More information

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988.

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988. editor, Proceedings of Fifth SIAM Conference on Parallel Processing, Philadelphia, 1991. SIAM. [3] A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam. A users' guide to PVM parallel

More information

Native Marshalling. Java Marshalling. Mb/s. kbytes

Native Marshalling. Java Marshalling. Mb/s. kbytes Design Issues for Ecient Implementation of MPI in Java Glenn Judd, Mark Clement, Quinn Snell Computer Science Department, Brigham Young University, Provo, USA Vladimir Getov 2 2 School of Computer Science,

More information

Interme diate DNS. Local browse r. Authorit ative ... DNS

Interme diate DNS. Local browse r. Authorit ative ... DNS WPI-CS-TR-00-12 July 2000 The Contribution of DNS Lookup Costs to Web Object Retrieval by Craig E. Wills Hao Shang Computer Science Technical Report Series WORCESTER POLYTECHNIC INSTITUTE Computer Science

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 03 (version February 11, 2008) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20.

More information

Reporting Interface. network link sensor machine machine machine. memory sensor. CPU sensor. 2.1 Sensory Subsystem. Sensor Data. Forecasting Subsystem

Reporting Interface. network link sensor machine machine machine. memory sensor. CPU sensor. 2.1 Sensory Subsystem. Sensor Data. Forecasting Subsystem Implementing a Performance Forecasting System for Metacomputing: The Network Weather Service (Extended Abstract) submitted to SC97 UCSD Technical Report TR-CS97-540 Rich Wolski Neil Spring Chris Peterson

More information

Tutorial 4: Condor. John Watt, National e-science Centre

Tutorial 4: Condor. John Watt, National e-science Centre Tutorial 4: Condor John Watt, National e-science Centre Tutorials Timetable Week Day/Time Topic Staff 3 Fri 11am Introduction to Globus J.W. 4 Fri 11am Globus Development J.W. 5 Fri 11am Globus Development

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

PROOF-Condor integration for ATLAS

PROOF-Condor integration for ATLAS PROOF-Condor integration for ATLAS G. Ganis,, J. Iwaszkiewicz, F. Rademakers CERN / PH-SFT M. Livny, B. Mellado, Neng Xu,, Sau Lan Wu University Of Wisconsin Condor Week, Madison, 29 Apr 2 May 2008 Outline

More information

StorSimple Storage Appliance Release Notes. Patch Release Software V2.0.1 ( )

StorSimple Storage Appliance Release Notes. Patch Release Software V2.0.1 ( ) StorSimple Storage Appliance Release Notes Patch Release Software V2.0.1 (2.0.2-84) May, 2012 Page 2 Table of Contents Welcome... 3 Issues Resolved in Version 2.0.1 (2.0.2-84)... 3 Release Notes for Version

More information

SCALABLE TRAJECTORY DESIGN WITH COTS SOFTWARE. x8534, x8505,

SCALABLE TRAJECTORY DESIGN WITH COTS SOFTWARE. x8534, x8505, SCALABLE TRAJECTORY DESIGN WITH COTS SOFTWARE Kenneth Kawahara (1) and Jonathan Lowe (2) (1) Analytical Graphics, Inc., 6404 Ivy Lane, Suite 810, Greenbelt, MD 20770, (240) 764 1500 x8534, kkawahara@agi.com

More information

Dynamic Process Management in an MPI Setting. William Gropp. Ewing Lusk. Abstract

Dynamic Process Management in an MPI Setting. William Gropp. Ewing Lusk.  Abstract Dynamic Process Management in an MPI Setting William Gropp Ewing Lusk Mathematics and Computer Science Division Argonne National Laboratory gropp@mcs.anl.gov lusk@mcs.anl.gov Abstract We propose extensions

More information

Name: Instructions. Problem 1 : Short answer. [48 points] CMU / Storage Systems 20 April 2011 Spring 2011 Exam 2

Name: Instructions. Problem 1 : Short answer. [48 points] CMU / Storage Systems 20 April 2011 Spring 2011 Exam 2 CMU 18-746/15-746 Storage Systems 20 April 2011 Spring 2011 Exam 2 Instructions Name: There are four (4) questions on the exam. You may find questions that could have several answers and require an explanation

More information

Blocking vs. Non-blocking Communication under. MPI on a Master-Worker Problem. Institut fur Physik. TU Chemnitz. D Chemnitz.

Blocking vs. Non-blocking Communication under. MPI on a Master-Worker Problem. Institut fur Physik. TU Chemnitz. D Chemnitz. Blocking vs. Non-blocking Communication under MPI on a Master-Worker Problem Andre Fachat, Karl Heinz Homann Institut fur Physik TU Chemnitz D-09107 Chemnitz Germany e-mail: fachat@physik.tu-chemnitz.de

More information

Chapter 4: Threads. Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads

Chapter 4: Threads. Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads Chapter 4: Threads Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads Chapter 4: Threads Objectives To introduce the notion of a

More information

Virtuozzo Containers

Virtuozzo Containers Parallels Virtuozzo Containers White Paper An Introduction to Operating System Virtualization and Parallels Containers www.parallels.com Table of Contents Introduction... 3 Hardware Virtualization... 3

More information

A Framework for Building Parallel ATPs. James Cook University. Automated Theorem Proving (ATP) systems attempt to prove proposed theorems from given

A Framework for Building Parallel ATPs. James Cook University. Automated Theorem Proving (ATP) systems attempt to prove proposed theorems from given A Framework for Building Parallel ATPs Geo Sutclie and Kalvinder Singh James Cook University 1 Introduction Automated Theorem Proving (ATP) systems attempt to prove proposed theorems from given sets of

More information

W H I T E P A P E R : T E C H N I C AL. Symantec High Availability Solution for Oracle Enterprise Manager Grid Control 11g and Cloud Control 12c

W H I T E P A P E R : T E C H N I C AL. Symantec High Availability Solution for Oracle Enterprise Manager Grid Control 11g and Cloud Control 12c W H I T E P A P E R : T E C H N I C AL Symantec High Availability Solution for Oracle Enterprise Manager Grid Control 11g and Cloud Control 12c Table of Contents Symantec s solution for ensuring high availability

More information

PROCESSES AND THREADS THREADING MODELS. CS124 Operating Systems Winter , Lecture 8

PROCESSES AND THREADS THREADING MODELS. CS124 Operating Systems Winter , Lecture 8 PROCESSES AND THREADS THREADING MODELS CS124 Operating Systems Winter 2016-2017, Lecture 8 2 Processes and Threads As previously described, processes have one sequential thread of execution Increasingly,

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information