A Design of an Experiment to Model Data Base


IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-2, NO. 2, JUNE 1976, p. 97

A Design of an Experiment to Model Data Base System Performance

SAKTI P. GHOSH, SENIOR MEMBER, IEEE, AND WILLIAM G. TUEL, JR., MEMBER, IEEE

Abstract-A statistical design of an experiment for modeling the performance (access time) of a data base (DB) system in a controlled environment is outlined. A three-factor pilot experiment with retrieval sequence, logical view (combined with access method), and target segment type as factors was designed and the results analyzed for a specific IMS batch DB. It was found that: 1) the variation of sequences of DB retrieval calls in the application program did not have any significant effect on the access time, whereas the other two factors did have a significant effect; 2) the variability in access time is not completely explained by these three factors; and 3) the distribution of the residual error is "nonnormal," with a large positive skew.

Index Terms-Data bases, data management systems, design of experiment, design technique, file organization, information management system, statistical.

I. INTRODUCTION

A DATA BASE SYSTEM is a software system whose main objective is to provide a convenient way to access and modify large amounts of stored data. It provides an environment in which a user can represent and view information of the real world with complex logical relations. A practical data base (DB) can contain millions of records; hence, the efficiency or performance of a DB system becomes very important to a user. To design a DB system intelligently it is necessary to know the access time needed to retrieve a particular segment of the DB. The access time varies widely across different DB designs, and a poor design can significantly impair the usefulness of the entire DB system.

To be specific, we shall concentrate on IMS/360, a DB system available from IBM. There are many factors that can affect the performance of DB calls in IMS.
Some of them are: the user's view of the DB (also referred to as the logical view), the relative position of the target segments in the logical view, the underlying access method used to support the logical view, the type of data language (DL) call statements executed, the size of the DB buffer pool, the statistical properties of the DB, the memory management algorithm, the data set device allocations, the operating system and its parameters, the hardware configuration, etc. Studying the performance of IMS is clearly a difficult task because of the large number and unknown nature of the factors that affect the access time of a particular call at a particular state of the computing system.

(Manuscript received November 5, 1974; revised November 13, [...]. The authors are with the IBM Research Laboratory, San Jose, CA.)

A survey of the difficulties involved in understanding the performance of computers in general has been given by Grenander and Tsao [1]. Little attention has been paid to the particular case of DB systems, e.g., IMS. Most of the work up to now has been confined to studying performance through indirect factors like page faults, page swaps, etc. Some of the references on such work are given in the Reference section. Tsao, Comeau, and Margolin [2] and Tsao and Margolin [3] have used statistical factorial experiments to study factors affecting page swaps in a time-shared environment. No attempt has been made up to now to study the factors which directly affect the performance of DB's in an isolated environment.

In this paper we describe a statistical approach to modeling the performance of IMS in a controlled, batch mode environment. Access time has been taken as the measure of performance. A geographical DB using the data from the Bureau of Census DIME file [10] was organized under IMS Version 2.3. IMS was run on an IBM S/370 Model 145 under OS VS2 release 1.6.
The data calls for IMS were embedded in a PL/I program, and the elapsed time to execute a given sequence of calls was taken as the access time. Section II gives a detailed description of the IMS environment. Section III describes the statistical design of a three-factor controlled factorial experiment. Section IV discusses the experimental results. The last section discusses the conclusions reached from the experiment.

II. THE IMS ENVIRONMENT

IMS views data as hierarchical structures both at the physical description level and at the logical description level. A physical DB description (DBD) can be defined for storing and retrieving data by one of the following access methods: the hierarchical sequential access method (HSAM), the hierarchical index sequential access method (HISAM), the hierarchical direct access method (HDAM), and the hierarchical index direct access method (HIDAM). A brief description of these access methods is given in the Appendix; however, readers without a basic knowledge of hierarchical data structures and access methods are urged to consult [4]-[6]. Since HSAM has limited direct access capability, our discussion considers only the latter three access methods.

The Census Bureau DIME file contains three types of entities: branches (which may be street sections or geographical features), blocks (enclosed areas), and nodes (intersections). Each branch is identified by its full name and, if it is a street, by the lowest address on the left side of the street. Associated with each branch are the blocks on its left and right sides and the nodes at each end of the branch. Blocks and nodes are identified by unique numbers, and the corresponding IMS segments contain data specific to an entire block or to a node.
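The three entity types described above can be pictured as a small data structure. The following is only a sketch in Python (not the PL/I or IMS segment layout used in the experiment); all class and field names are hypothetical, and treating "has a lowest left address" as the marker of a street is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    """An enclosed area, identified by a unique number."""
    number: int
    data: str = ""


@dataclass(frozen=True)
class Node:
    """An intersection, identified by a unique number."""
    number: int
    data: str = ""


@dataclass(frozen=True)
class Branch:
    """A street section or geographical feature with its neighbours."""
    name: str
    left_block: Block
    right_block: Block
    start_node: Node
    end_node: Node
    # Assumption: only streets carry a lowest left-side address.
    lowest_left_address: Optional[int] = None

    @property
    def is_street(self) -> bool:
        return self.lowest_left_address is not None


if __name__ == "__main__":
    # Block and node numbers below are placeholders, not DIME data.
    creek = Branch("GUADELUPE CREEK", Block(1), Block(2), Node(10), Node(11))
    print(creek.name, "street" if creek.is_street else "non-street")
```

Each branch holds references to two blocks and two nodes, which mirrors the pointer relationships that connect the three physical data bases.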

Fig. 1. Three physical DBD's: BRANCH DB (HIDAM), BLOCK DB (HISAM), and NODE DB (HDAM). The BLOCK segment (BLOCKKEY, BLOCKDATA) points to STBLOCK and NSBLOCK in BRANCHDB; the NODE segment (NODEKEY, NODEDATA) points to STNODE and NSNODE in BRANCHDB.

Fig. 2. Three logical views (LV1, LV2, LV3). (Note: In the figure, if an access method is not specified for a segment type, then the segment is retrieved using the access method associated with its parent. In the experiment, the segment types which are qualified by the SSA's of the GU calls are shaded. A detailed explanation of the qualifications of the GU calls is given later in the paper.)

From these three entities, three interrelated physical DBD's were constructed. An important factor affecting the access time is the access method underlying a physical DBD. In order to study the effect of the different access methods, each (physical) DBD was organized using a different access method, and possible logical relations were specified by means of pointers. They are shown diagrammatically in Fig. 1. An example is shown in the Appendix.

A logical view may be identical to or contained within one physical view, or may be constructed from multiple physical views. Pointers (or key values) are used to connect multiple physical views to form a complex logical view. Each logical view has associated with it all three physical DBD's (based on the three access methods), but the root segments of the different logical views have different access methods. Fig. 2 shows the three logical views that are constructed. The logical view used is one of the factors which affect performance. In the experiment, the logical view along with the access method supporting it was used as a factor; henceforth the term "logical view" should be interpreted as that combination.

Having established his logical view, a user issues one or more calls to retrieve, replace, insert, or delete an instance of a segment type in the IMS data base. These calls are referred to as DLI calls. There are nine DLI functions, which are as follows:

DLI Functions                  Codes
Get Unique                     GU
Get Next                       GN
Get Next within Parent         GNP
Get Hold Unique                GHU
Get Hold Next                  GHN
Get Hold Next within Parent    GHNP
Insert                         ISRT
Delete                         DLET
Replace                        REPL


The first six function codes are called DLI retrieve calls because they are used to fetch data from the IMS data base. The GU function is used for retrieving a unique segment from the data base satisfying certain qualifications on the segment type and key field. The GN function is used for retrieving the next sequential segment from the data base satisfying certain qualifications. The GNP function is used for retrieving the next sequential segment associated with a segment at a lower level of the hierarchical structure satisfying certain qualifications. The GHU, GHN, and GHNP functions are similar to the GU, GN, and GNP functions, respectively, except that they are used before the execution of ISRT, DLET, and REPL. The ISRT function is used to insert segments, the DLET function is used for deleting segments from the data base, and the REPL function is used for replacing the data of a segment.

The IMS organized DB is permanently stored on an auxiliary storage device; however, to facilitate retrieval there is a DB buffer pool in the main memory for temporary storage of data blocks. When a DLI function is issued, the DB buffer pool is first checked to see if the requested segment, or a pointer to the requested segment, is present in the DB buffer pool. If not, data blocks are fetched from auxiliary storage into the DB buffer pool using direct or symbolic pointers to locate the data. When the DB buffer pool is full, space is created by deleting data blocks. The DLI functions ISRT, DLET, and REPL modify segments in the buffer pool and leave them there. When the DB buffer pool is full or when the application program terminates, the modifications are written into the DB. Thus these functions are not completely executed during the DLI calls.

The argument of a DLI call may contain one or more segment search arguments (SSA's).
If an SSA contains the segment type and the key value of a segment to be referenced, it is called a qualified SSA. If it contains only the segment type, then it is called unqualified. The details of specification or execution of unqualified or partly qualified SSA's are discussed in [4]-[6].

One measure of performance of an IMS organization is the access time associated with executing the DLI functions. From the point of view of IMS, an application program can be characterized by a sequence of DLI functions, which is one factor that should be considered. Hence the performance of an IMS organization associated with an application program can be measured by the sum of the access times associated with each of the DLI functions in the application program. The access time of each DLI function depends on the type of DLI function, the SSA, the logical view, and also on other factors. In the next section we shall focus on some of the significant factors that influence the performance of the IMS DB outlined in this section.

III. STATISTICAL DESIGN OF AN EXPERIMENT

A statistical design of an experiment to study some of the important factors that affect the performance of the IMS DB described previously is outlined. Factors to be considered are chosen from among those that are controllable by the high-level user; then an experiment is designed to test whether these factors and their interactions are significant in comparison to the uncontrolled errors (i.e., the variability due to uncontrolled factors). As there are many (interacting) factors affecting the access time of IMS, it is appropriate to set up a factorial experiment to study performance.

Logical view plays an important role in IMS. The user views the DB in the format provided by the logical view, while IMS processes the DB according to the structure (using the supporting access method) defining that logical view. Hence it is important to find out what effects changes in the logical views have on access time.
We shall choose the logical view as one factor for the factorial experiment. In order to reduce the number of factors in the experiment it was decided to combine the access method and the logical view into one factor. This technique is widely used in statistical design of experiments and is called confounding (of access methods with logical views) (see [7]). The three levels of this factor are the three different logical views.

The three logical views are created from the three physical views shown in Fig. 1. The logical view LV1, shown in Fig. 2, is created from BRANCH DB, BLOCK DB, and NODE DB. It is structurally identical with the physical view of BRANCH DB. The logical view LV2 has as its root segment the BLOCK DB, and the descendent segments are created from the BRANCH DB. The logical view LV3 has as its root segment the NODE DB, and the descendent segments are created from the BRANCH DB. Both LV2 and LV3 have the same descendent segment types, as shown in Fig. 2. The different access methods associated with the different segments of the three logical views are shown in Fig. 2.

Application programs are also an important factor affecting the performance of IMS because they contain the DLI functions. We shall consider the application program as another factor for the factorial experiment. Characterization of application programs by DLI calls requires a little more insight into the retrieval process during the execution of a DLI call. In Section II we discussed the characterization of an application program using the sequence of DLI functions in it. Since the DLI functions ISRT, DLET, and REPL may not be completely executed during the program execution, their characterization is difficult to measure without building special probes and monitoring during actual execution of the process. In many practical DB environments, ISRT, DLET, and REPL constitute a small percentage of the DLI calls; hence they are omitted from this experiment.
The execution of a GHU is almost identical with that of a GU, the only exception being that the retrieved segment is held for executing a DLET or REPL function. Similarly, GHN is equivalent to GN, and GHNP is equivalent to GNP. Hence for application program retrieval characterization it is considered sufficient to examine sequences of the three DLI calls GU, GN, and GNP. With one minor exception, execution of a GU call does not depend on the position in the DB established by the previous DLI call, whereas the GN and GNP calls do utilize the relative position established by the previous DLI call. Hence, it is sufficient to characterize application programs by sequences of DLI calls which begin with GU. Since each sequence of DLI calls starts with GU and there are only three distinct calls, there are only three possible sets of DLI calls, namely GU-GN-GNP; GU-GN; and GU-GNP. Thus in our factorial experiment, the application program factor contains three levels. In order to allow for a little more variability in the access time between the sequences, we have chosen the three levels as

AP1: GU(S1) GN(S2) GNP(S3)
AP2: GU(S4) GN(S5) GN(S6) GN(S7)
AP3: GU(S8) GNP(S9) GNP(S10) GNP(S11)

where the Si are the SSA's of the DLI calls.

The third factor in the experiment is the set of segment types of the logical view referred to by the SSA. For the experiment described here, the SSA's for DLI calls of the same category qualify the same segment type but not necessarily the same value of the attribute. The attributes in the SSA's for GU are chosen as the key fields of the segment types, with values chosen at random over the domain of existing key fields. The attributes and their values in the SSA for GN or GNP calls are unqualified, i.e., only the segment type is qualified for these two categories of DLI calls. Two levels of this factor are considered as follows.

Level ST1: The SSA's of GU and GN calls specify the root segment type of the logical view, and the SSA of GNP calls specifies the last segment type of the hierarchical structure specified by the logical view. (The hierarchy is considered ordered top to bottom, left to right.)

Level ST2: The SSA's of GU and GN calls specify the next-to-last segment type of the hierarchical structure specified by the logical view, and the SSA of GNP calls specifies the last segment type. (Note: the key of the segment sought is specified, not the keys of a path to the segment.)

Typical combinations of the three factors are given in the Appendix. Other factors like the amount of storage available, the allocation of data sets on devices, DB buffer pool sizes, operating system, hardware configuration, and contents of the DB are held fixed.
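The three factors and their levels, as defined above, can be written down and crossed mechanically. The sketch below is in Python purely for illustration (the experiment itself used a PL/I driver); the constant and function names are hypothetical, and the SSA's are omitted so that only the call types of each application program level appear.

```python
import itertools
from collections import Counter

# Factor levels as named in the text.
LOGICAL_VIEWS = ("LV1", "LV2", "LV3")
SEGMENT_TYPES = ("ST1", "ST2")

# DLI call sequence of each application program level (SSA's omitted).
APPLICATION_PROGRAMS = {
    "AP1": ("GU", "GN", "GNP"),
    "AP2": ("GU", "GN", "GN", "GN"),
    "AP3": ("GU", "GNP", "GNP", "GNP"),
}


def level_combinations():
    """Return every (LV, AP, ST) combination of the full factorial design."""
    return list(
        itertools.product(LOGICAL_VIEWS, APPLICATION_PROGRAMS, SEGMENT_TYPES)
    )


def call_counts(ap_level):
    """Count how many calls of each type one application program issues."""
    return Counter(APPLICATION_PROGRAMS[ap_level])


if __name__ == "__main__":
    print(len(level_combinations()))  # 3 * 3 * 2 combinations
    for ap in APPLICATION_PROGRAMS:
        print(ap, dict(call_counts(ap)))
```

Every sequence starts with exactly one GU call, which is what lets a sequence be treated as a self-contained unit whose access time does not depend on the position left by an earlier sequence.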
The effect of the remaining factors, such as the order of choosing the level combinations and the actual key values used in the SSA's, is assumed to be averaged out by proper randomization during the actual conduct of the experiment.

As we are interested in studying the effects of the three factors and their interactions, a factorial design in which the first factor has three levels, the second factor three levels, and the third factor two levels is considered reasonable. Eighteen level combinations (3 X 3 X 2) are associated with this design. In the experiment the access time is measured for each of the 18 combinations of the levels of the three factors. The set of observations is repeated (replication) a number of times with different sets of random keys to obtain high reliability in the estimates of the parameters of the model.

Let $t_{ijkl}$ = the observed access time (in milliseconds) when the logical view is at level $i$, the application program is at level $j$, and the segment type factor is at level $k$, for the $l$th replication, where $i = 1, 2, 3$; $j = 1, 2, 3$; $k = 1, 2$; $l = 1, 2, \ldots, r$.

The linear model for the design and analysis of the experiment ([7], [8]) is assumed to be

$$t_{ijkl} = \mu + \mathit{tlv}_i + \mathit{tap}_j + \mathit{tst}_k + \mathit{tlvxap}_{ij} + \mathit{tlvxst}_{ik} + \mathit{tapxst}_{jk} + \mathit{tlvxapxst}_{ijk} + \mathit{tr}_l + \epsilon_{ijkl}$$

where $\mu$ = constant; $\mathit{tlv}_i$ = effect on the access time due to the $i$th level of logical view; $\mathit{tlvxap}_{ij}$ = effect on the access time due to the interaction between the $i$th level of logical view and the $j$th level of application program; $\mathit{tlvxapxst}_{ijk}$ = effect on the access time due to the interactions between the $i$th level of logical view, the $j$th level of application program, and the $k$th level of segment type; $\mathit{tr}_l$ = effect on the access time due to the $l$th replication, etc.; $\epsilon_{ijkl}$ = uncontrolled random error with expectation $E(\epsilon_{ijkl}) = 0$ and variance $E(\epsilon_{ijkl}^2) = \sigma^2$.
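To make the terms of the model concrete, here is a minimal sketch of how such effects can be estimated from balanced data. It relies on a standard fact about balanced factorial designs: when each set of effects is taken to sum to zero, the least-squares estimates reduce to differences of means. The function name and the data are hypothetical (the numbers are synthetic, not measurements from the experiment), and only the main effects and the LV x ST interaction are computed to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def estimate_effects(observations):
    """Estimate mu, main effects and the LV x ST interaction.

    observations: iterable of (lv, ap, st, time) tuples forming a
    balanced design (same number of observations in every cell).
    """
    obs = list(observations)
    mu = mean(row[3] for row in obs)

    def main_effect(index):
        # Mean of each level minus the grand mean.
        groups = defaultdict(list)
        for row in obs:
            groups[row[index]].append(row[3])
        return {level: mean(times) - mu for level, times in groups.items()}

    tlv, tap, tst = main_effect(0), main_effect(1), main_effect(2)

    # Interaction: cell mean minus grand mean minus both main effects.
    cells = defaultdict(list)
    for lv, _ap, st, time in obs:
        cells[(lv, st)].append(time)
    tlvxst = {
        (lv, st): mean(times) - mu - tlv[lv] - tst[st]
        for (lv, st), times in cells.items()
    }
    return mu, tlv, tap, tst, tlvxst


if __name__ == "__main__":
    # Synthetic, purely additive data: no AP effect, no interaction.
    lv_shift = {1: -10.0, 2: 0.0, 3: 10.0}
    st_shift = {1: -5.0, 2: 5.0}
    data = [
        (lv, ap, st, 100.0 + lv_shift[lv] + st_shift[st])
        for lv in (1, 2, 3) for ap in (1, 2, 3) for st in (1, 2)
    ]
    mu, tlv, tap, tst, tlvxst = estimate_effects(data)
    print(mu, tlv, tap, tst)
```

Because the estimated effects are deviations from the grand mean, they are not absolute access times; a residual for an observation is the observed time minus the sum of the fitted terms.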
This model assumes that the effects of the different factors and their interactions on the access time are additive. The effect due to replication is really not a factor and has no interactions with the other factors. For details on the subject see [7] and [8]. The above parameters (e.g., $\mathit{tlv}_i$, $\mathit{tr}_l$, etc.) are uniquely estimable when the following constraints are imposed:

$$\sum_i \mathit{tlv}_i = \sum_j \mathit{tap}_j = \sum_k \mathit{tst}_k = \sum_l \mathit{tr}_l = 0$$

$$\sum_i \mathit{tlvxap}_{ij} = \sum_k \mathit{tapxst}_{jk} = 0 \quad \text{for each } j$$

$$\sum_j \mathit{tlvxap}_{ij} = \sum_k \mathit{tlvxst}_{ik} = 0 \quad \text{for each } i$$

$$\sum_i \mathit{tlvxst}_{ik} = \sum_j \mathit{tapxst}_{jk} = 0 \quad \text{for each } k$$

$$\sum_i \mathit{tlvxapxst}_{ijk} = 0 \quad \text{for all } j, k$$

$$\sum_j \mathit{tlvxapxst}_{ijk} = 0 \quad \text{for all } i, k$$

$$\sum_k \mathit{tlvxapxst}_{ijk} = 0 \quad \text{for all } i, j.$$

When such constraints are imposed, the t's are no longer absolute access times but deviations around 0. These parameters are estimated by least squares methods, and the observations are analyzed by using analysis of variance techniques. These are well-known techniques and are given in [7] and [8]. Analysis of variance gives an insight into the amount of variability that is generated by the different factors and their interactions.

IV. EXPERIMENTAL RESULTS

The DB environment described in Section II was used to run the pilot experiment. DLI calls were generated by a PL/I program which also performed the randomization of levels and the selection of the keys at random. In order to provide controlled results, no other jobs were run concurrently. One hundred thirty-three replications of the 18 combinations of levels were executed, yielding 2394 observations of access time. Within each replication the 18 combinations were executed in a random sequence to reduce any bias that could arise due to the order in which the combinations were executed. The values of the attributes in the SSA of GU were chosen at random, i.e., uniformly over the actual key values represented for the segment type being retrieved. The access time for each combination of levels was measured in milliseconds of elapsed time.

The parameters of the linear model were estimated by the method of least squares, ignoring the effects of replications. The residual error for each observation was then calculated. Its frequency distribution is shown in Fig. 3.

Fig. 3. Distribution of residual errors.

It was found that the residual errors do not have a symmetric distribution. They have considerably less skewness on the negative side of the mode. Hence, Wilcoxon's signed rank test for means [11] was used to test the mean of the residual errors.
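For readers unfamiliar with the signed rank test, the following sketch shows one common form of it: the sum of the ranks of the positive values, standardized by its null mean and standard deviation (large-sample normal approximation). This is an assumption about the form of the statistic; the paper does not state which normalization was used. The function name is hypothetical, and no tie correction is applied to the variance.

```python
import math


def wilcoxon_signed_rank_z(values):
    """Standardized Wilcoxon signed rank statistic (normal approximation).

    Zero values are dropped; tied absolute values receive average ranks.
    No tie correction is applied to the variance.
    """
    nonzero = [v for v in values if v != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("need at least one nonzero value")

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal absolute values starting at position i.
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    w_plus = sum(r for r, v in zip(ranks, ordered) if v > 0)
    expected = n * (n + 1) / 4
    std_dev = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - expected) / std_dev


if __name__ == "__main__":
    # Made-up residuals: many small negatives, a few large positives.
    skewed = [-1.0] * 8 + [2.0, 6.0]
    print(round(sum(skewed), 6), wilcoxon_signed_rank_z(skewed))
```

The made-up sample sums to zero, yet its statistic is well away from zero: the signed rank statistic is centered only when the distribution is symmetric about zero, so the outcome on strongly skewed residuals needs to be read with that in mind.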
The value of the Wilcoxon statistic for the 2394 residual errors was [...]. Since the Wilcoxon statistic deviates so much from zero (beyond the 99-percent level of significance, which is 2.97), the conclusion is that the mean of the residual errors for the underlying model is not zero. This means that the model (without the replications as a factor) does not account for all the factors affecting the access time of the given IMS data base. It is possible that there are some uncontrolled effects which have not been averaged out or that the underlying process is not linear.

The variability of the data was then analyzed using the analysis of variance technique; the results are presented in Table I. Since the errors do not have a normal distribution, the ratios of the mean sums of squares do not have the usual statistical distribution (also known as the F distribution). But, because of the low skewness on the negative side of the errors, the distribution of the ratios will also reflect the same shape; hence any ratio which was significant under the F distribution is also significant under this situation. Thus the variations due to the main effects of ST and LV, and the interaction between ST and LV, are significantly larger than the variability due to the error at the usual 5-percent level of significance. The exact distribution of the ratio of two mean sums of squares when the denominator does not have a "normal" distribution is not known; thus the other ratios may or may not be insignificant. The ratios for the interaction between AP and LV, and also the triple-order interaction, are [...] and 0.822, respectively. It is likely that they are insignificant at the 5-percent level because the 5-percent level of significance for the F ratio for 2 and infinite degrees of freedom is [...].

TABLE I
RESULTS OF ANALYSIS OF VARIANCE

Source of Variation | Degrees of Freedom | Mean Sums of Squares (MSS) | Ratio of the MSS to Error MSS
Replication | [...] | [...] | [...]
Main effect due to logical views (LV) | [...] | [...] | [...]
Main effect due to application program (AP) | [...] | [...] | [...]
Main effect due to segment type of logical view (ST) | [...] | [...] | [...]
Interaction between LV and AP | [...] | [...] | [...]
Interaction between LV and ST | [...] | [...] | [...]
Interaction between AP and ST | [...] | [...] | [...]
Interaction between LV, AP, and ST | [...] | [...] | [...]
Error | [...] | [...] |

TABLE II
TESTING THE DIFFERENCE BETWEEN LEVELS OF AP

Name of the difference | Value of the difference | Normalized difference | Result of the test of significance at 5%
AP1 - AP2 | [...] | [...] | Insignificant
AP1 - AP3 | [...] | [...] | Insignificant
AP2 - AP3 | [...] | [...] | Insignificant

For testing the main effect of the level of application program, some detailed analysis was performed. As the mean square of the error was estimated with 2244 degrees of freedom, this estimate of the variance of the error was considered to be fairly accurate. Thus, the difference between the means of any two levels of the application program can be tested by the test for the difference between means. The normalized difference (i.e., the difference divided by its standard deviation) between means has the "normal" distribution under the Central Limit Theorem; hence it can be tested by the "normal" distribution. The results are given in Table II. The difference between any two levels of the application program was insignificant. Hence there is good reason to believe that the access time, in this experiment, does not depend on the sequence of the DLI calls. For making any reasonable conclusions regarding the interaction between AP and ST, further analysis is needed.
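The degrees-of-freedom bookkeeping behind the analysis of variance, and the normalized difference used to compare two application program levels, can be checked with a few lines of arithmetic. This is a sketch under the assumption of the standard decomposition for a replicated full factorial design; the function names are hypothetical, and the inputs to `normalized_difference` in the demonstration are placeholders rather than values from the experiment.

```python
import itertools
import math

LEVELS = {"LV": 3, "AP": 3, "ST": 2}
REPLICATIONS = 133


def degrees_of_freedom(levels, replications):
    """Return (number of observations, degrees of freedom per source)."""
    n_obs = replications * math.prod(levels.values())
    df = {"Replication": replications - 1}
    names = list(levels)
    # Main effects (size 1) and interactions (size 2 and 3).
    for size in range(1, len(names) + 1):
        for combo in itertools.combinations(names, size):
            df[" x ".join(combo)] = math.prod(levels[n] - 1 for n in combo)
    # Whatever is left of the total (n_obs - 1) belongs to the error.
    df["Error"] = (n_obs - 1) - sum(df.values())
    return n_obs, df


def normalized_difference(mean_a, mean_b, error_mean_square, n_per_level):
    """Difference of two level means divided by its standard deviation."""
    return (mean_a - mean_b) / math.sqrt(2 * error_mean_square / n_per_level)


if __name__ == "__main__":
    n_obs, df = degrees_of_freedom(LEVELS, REPLICATIONS)
    print(n_obs, df["Error"])  # 2394 2244
    # Each AP level is observed n_obs / 3 times; other inputs are placeholders.
    print(normalized_difference(10.0, 8.0, 798.0, n_obs // 3))
```

The error term comes out at 2244 degrees of freedom, matching the figure quoted in the text, and a normalized difference is then compared with the standard normal critical value (about 1.96 for a two-sided test at the 5-percent level).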
There is a good possibility that the access time effect of application programs can be characterized by a linear function of the type

$$\alpha \times (\text{number of GU calls}) + \beta \times (\text{number of GN calls}) + \gamma \times (\text{number of GNP calls})$$

but this has not been explored.

As the experiment requires sequential observation of the effects of the different level combinations, the state of the system may change between one replication and the next. In order to estimate this effect on the residual error, the factor AP and its interactions were deleted and replication was treated as an additional factor in the model. The residual errors were analyzed again under the following model for the access time:

$$t_{ikl} = \mu + \mathit{tlv}_i + \mathit{tst}_k + \mathit{tlvxst}_{ik} + \mathit{tr}_l + \epsilon_{ikl}$$

where $i = 1, 2, 3$; $k = 1, 2$; $l = 1, 2, \ldots, 399$. The restrictions on this model are the same as before. Under this model the parameters are estimated and the errors recalculated. The new error distribution is shown in Fig. 4.

Fig. 4. Distribution of residual errors with replication factor.

The Wilcoxon statistic now has the value [...], which is very close to the 1-percent level of significance. This implies that for this data ST, LV, and ST X LV are the important controllable factors that affect access time. Possibly the effect due to replication R also has some significant effect. Statistically, there is no reason to believe that the mean of the residual errors (after these factors are eliminated) differs from zero.

Using the original model again, the average access time for each combination of levels was calculated to test the difference between them. The critical difference calculated under the "normality" assumption ($t_{0.95}\,\hat{\sigma}\sqrt{2/133}$) was [...] ms. This was used to test the differences. Table III gives the results. All combinations which are covered by one continuous line are not significantly different. In the table the first coordinate of the combination refers to the logical view, the second to the retrieval sequence, and the third to the segment type.

TABLE III

Columns: Combination of Levels; Average access time (ms); Critical difference line. The combinations appear in the following order (average access times and grouping lines [...]):

(3,2,1), (3,3,1), (3,1,1), (2,2,1), (2,1,1), (2,3,1), (1,3,1), (1,1,1), (1,2,1), (2,3,2), (2,2,2), (2,1,2), (3,3,2), (3,2,2), (3,1,2), (1,1,2), (1,2,2), (1,3,2)

Critical difference = [...] ms.

V. CONCLUSIONS

In this paper we have outlined the design and analysis of a controlled experiment for studying the performance effects of several DB design parameters. A factorial experiment was used to capture the interactions between the parameters. For this particular environment 1) the logical view/access method, 2) the segment type retrieved, and 3) their interaction are significant components which affect response time, whereas the particular call sequence is not significant.


The residual error distribution is highly asymmetrical, with a positive skew. Hence standard hypothesis tests are not valid and new nonparametric methods are required. Further analysis is required to understand the reason behind such distributions.

APPENDIX

ACCESS METHODS

An index sequential organization (ISAM), showing the track indices, is given in the following diagram. The blocks are records and the numbers in them represent keys. The first row represents the track indices.

[Diagram of the ISAM organization; not recoverable from the source.]

ISAM has many good features, but there are some disadvantages associated with it. One of them is the handling of the overflow problem. Two access methods which combine ISAM with another access method, known as the ordinary sequential access method or OSAM, in a hierarchical manner are given here.

A pointer points to the location of a record in a particular track on a particular cylinder. A pointer is written symbolically [symbol lost in extraction] to indicate that it is pointing to the location of the zth record in the yth track on the xth cylinder.

One of the access methods which is based on an ISAM and an OSAM is HISAM. In HISAM the original records are organized as ISAM and the records which are added later (overflow records) are stored in an OSAM in the order that they arrive. The overflow records are chained by pointers to their exact location in the index sequential arrangement so that they can be retrieved quickly. Suppose the keys of the records on the third track of the second cylinder of the ISAM part of a HISAM organization are 212, 215, and 230. Initially, these records are stored on that track with their blank pointers as follows.

[Diagram of the track before insertion; not recoverable from the source.]

Now suppose a new record with key 219 is to be added to the HISAM organization. The new record is stored in the first available space in the OSAM. Suppose that location is the third record in the fifth track on the tenth cylinder. The new record is stored there. The pointer associated with the key 215 is now filled in to point to this location. Thus after the insertion the third track of the second cylinder appears as follows.

[Diagram of the track after insertion; not recoverable from the source.]

The pointer associated with the newly inserted record, in the location of the third record on the fifth track of the tenth cylinder, points to the next record in the index sequential organization, i.e., the key 230. The inserted record appears as follows.

[Diagram of the inserted record; not recoverable from the source.]

There are many other techniques for handling the overflow problem associated with index sequential organizations.

Another access method which uses an ISAM organization and an OSAM organization in a hierarchical manner is HIDAM. HIDAM is used to create an index from a nonkey or key field. The index file is organized as a HISAM organization. The records themselves are stored in an OSAM organization. In the HISAM organization a record is now called an index record. It consists of the value of the attribute on which the indexing has been performed and a pointer which points to the first record in the OSAM organization relevant to the value of the attribute. The records relevant to the same value of the index are chained together in the OSAM organization. The insertions and deletions of indices in the HISAM are performed in exactly the same manner as described before. Similarly, the records can be deleted and inserted in the OSAM, and then the pointers from the index file and/or the records in OSAM are changed accordingly.

When the records are stored in OSAM but are addressed by a key-to-address transformation, the access method is referred to as HDAM.

SAMPLE DATA BASE AND CALL SEQUENCES

A few records from the DIME file are shown in Fig. 5. Some of the pointers providing logical relationships are exhibited. Three sample call sequences are listed below. Each level of each factor is represented.

AP: level 1   ST: level 1   LV: level 1
GU   BRNAME (BRKEY = GUADELUPECREEK)
GN   BRNAME
GNP  NSNODE
returns data from node # [node number lost in extraction]

AP: level 2   ST: level 2   LV: level 3
access through NODE data base
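The AP, ST, and LV labels on the sample sequences index the levels of the three experimental factors. As a minimal sketch, assuming three AP levels, two ST levels, and three LV levels (counts inferred from the level numbers that appear in the sample sequences and from the appendix's mention of 15 further combinations), the full factorial set of level combinations could be enumerated as follows. The names `FACTOR_LEVELS`, `SAMPLES`, and `all_combinations` are illustrative and do not come from the paper.

```python
from itertools import product

# Assumed level counts per factor (inferred from the appendix, not quoted from it).
# AP: retrieval call sequence, ST: target segment type, LV: logical view.
FACTOR_LEVELS = {"AP": (1, 2, 3), "ST": (1, 2), "LV": (1, 2, 3)}

# The three sample call sequences of the appendix, as (AP, ST, LV) levels.
SAMPLES = [(1, 1, 1), (2, 2, 3), (3, 1, 2)]


def all_combinations(levels):
    """Return every (AP, ST, LV) combination of a full factorial design."""
    return list(product(levels["AP"], levels["ST"], levels["LV"]))


combos = all_combinations(FACTOR_LEVELS)
remaining = [c for c in combos if c not in SAMPLES]

# Check that the samples together cover each level of each factor.
for position, name in enumerate(("AP", "ST", "LV")):
    assert {s[position] for s in SAMPLES} == set(FACTOR_LEVELS[name])

print(len(combos))     # total number of level combinations
print(len(remaining))  # combinations not shown as samples
```

Under these assumed counts the design has 3 x 2 x 3 = 18 cells, which agrees with three listed samples plus 15 others.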

[Fig. 5. Sample data base records. Labels legible in the source include the segment names BRNAME, BRTYPE, ADDRESS, NSBLOCK, and NSNODE, key and pointer fields, and the BLOCK and NODE data bases.]

AP: level 2   ST: level 2   LV: level 3 (continued)
GU   BRTYPE(NS) (TYPEKEY = [key illegible in source])
GN   BRTYPE(NS)
GN   BRTYPE(NS)
GN   BRTYPE(NS)
returns 4 BRTYPE segments

AP: level 3   ST: level 1   LV: level 2
access through BLOCK data base
GU   BLOCK (BLOCKKEY = [key lost in extraction])
GNP  BRNAME(NS)
GNP  BRNAME(NS)
GNP  BRNAME(NS)
returns 3 BRNAME segments which occur within the given block.

The other 15 level combinations are constructed similarly.

ACKNOWLEDGMENT

The authors wish to thank M. C. Smyly, B. Krampetz, and D. B. Hildebrand of the IBM Research Laboratory, San Jose, CA, for their help during this work.

Sakti P. Ghosh (M'69-SM'74) received the B.S. degree (Honors) and the M.Sc. degree in statistics from Calcutta University, Calcutta, India, in 1955 and 1957, respectively. He received the Ph.D. degree in statistics from the University of California, Berkeley, in [year missing in source].
He joined the IBM Research Division at the Thomas J. Watson Research Center in [year missing in source]. His contribution to the STORM System led to the χ² and F probability computation subroutines in IBM Scientific Packages. He developed the Balanced Filing Schemes based on finite geometries and the consecutive retrieval properties for file organizations. He is currently a Research Staff Member with the Research Laboratory, IBM Corporation, San Jose, CA. He is working on data management techniques and performance. He has published more than 35 technical papers. He has held a teaching position at New York University and has served as Visiting Guest Lecturer at many universities in the United States and abroad.
Dr. Ghosh is a life member of the Calcutta Statistical Association. He is also a member of the ACM, American Statistical Association, and the Institute of Mathematical Statistics.

William G. Tuel, Jr. (S'63-M'65) received the B.S.E.E., M.S.E.E., and Ph.D. degrees from the Rensselaer Polytechnic Institute, Troy, NY, in 1962, 1964, and 1965, respectively. His graduate work was supported by a NASA fellowship.
He joined the Research Laboratory, IBM Corporation, San Jose, CA, in 1965 and was involved in the areas of automatic control theory and applications of computers to electrical power systems. From 1967 to 1968 he was assigned to the IBM Nordic Laboratory in Stockholm, Sweden, and studied the automation of machine tool monitoring. He received an Outstanding Contribution Award for this work. Currently, he is Manager of the Data Base System Characterization Project with the Systems Evaluation Department, Research Laboratory, IBM Corporation, San Jose, CA. This project involves the measurement and modeling of computer systems and workloads, particularly data base systems.
Dr. Tuel is a member of Tau Beta Pi, Eta Kappa Nu, Sigma Xi, and the Association for Computing Machinery.

Distributing a Data Base with Logical Associations on a Computer Network for Parallel Searching

SAKTI P. GHOSH, SENIOR MEMBER, IEEE

Abstract-The problem of distributing a data base (with logical associations between segment types) on a computer network such that multiple segment types satisfying a query can be retrieved in parallel from different nodes has been introduced. Properties of such distributions without redundancy and with redundancy have been discussed. Lower bounds on the number of nodes needed for such distributions have been given. Algorithms for constructing such distributions have also been given. Distributions of data bases for queries whose target segments form a combinatorial set have been studied in detail. Closed form expressions for redundancy have been obtained for such query sets.

Index Terms-Algorithm of data bases, computer network, data bases, distributing data bases, multiple segment queries, parallel search, redundancy.

Manuscript received June 15, 1974; revised August 6, [year missing in source]. The author is with the IBM Research Laboratory, San Jose, CA.

I. INTRODUCTION

CONNECTING COMPUTERS or multiple terminals to computers started with reservation systems for airlines in the fifties. Many sophisticated computer networks have been built since then by research groups. The most famous of them all is the computer network sponsored by the Advanced Research Projects Agency (ARPA). The ARPA network is a decentralized, heterogeneous network connecting about 32 (as of this time) computing sites throughout the United States. Much basic research has been done with data transmission, resource sharing, load leveling, data synchronization, etc. An excellent tutorial on the subject, along with some important references, has been provided by Merrill [3].

Most of the work on computer networks up to now has concentrated on the communication feasibility aspect of the network. Distributing data bases on computer networks is just beginning to surface. Merrill [3] does mention a few aspects of data distribution, viz., location of data/data descriptors, data access rationale, replication of data/data descriptors, partitioning of data, etc. These problems are being examined by computer systems experts, but very little basic research has been done towards understanding them. Casey [1] has investigated the problem of optimum conditions for creating multiple copies of a file. None of the research up to now has been directed at


DL/1. - Application programs are independent from the physical storage and access method. DL/1 OVERVIEW The historical approach to data processing was to have individual files dedicated to each application. This led to considerable data duplication, and therefore wasted space and additional