Database performance optimization


by DALIA MOTZKIN
Western Michigan University
Kalamazoo, Michigan

ABSTRACT

A generalized model for the optimization of relational databases has been developed and implemented. The model, an extension of previous works, is more general and more complete than former models. It consists of a set of algorithms and cost equations, and its output is an optimal set of indices for all fields of all files in the database. It also determines which fields should not be indexed. It allows the user to indicate indices to be evaluated, and it takes into consideration periodic reorganization, a variety of transaction types, and the multifield/multifile effects of some transaction types. It distinguishes between dense and nondense attributes, between primary and secondary fields, and between sorted and unsorted files. The optimal physical database configuration produced by the model is generated so that it can work within reasonable system constraints of time and space.


INTRODUCTION

The design of a physical database is concerned with the optimization of access time and space requirements and with the prediction of database performance. These problems have been approached from two sides. One aspect is query optimization, and the other is the selection of an optimal set of indices and reorganization points. This paper concentrates on the second aspect.

The selection of the indices of a database is an important part of physical database design. Since index performance varies, there is a need to select the most suitable index for each field of each file. Whereas appropriate indexing improves performance considerably, excessive indexing can result in major performance degradation as well as in significant increases in storage requirements. The performance of some types of indices deteriorates over time due to overflow situations and other problems. Reorganization is then required.

Problems of modeling, optimization, and prediction of database performance have been studied by many researchers, and interesting results have been published. 1-3, 4-8, 10-12, 16, 17, 19, 21, 23 Additional bibliography, related to earlier work, can be found in extensive surveys. 18, 20, 22 However, previous modeling and optimization techniques are not complete; they suffer from one or more of the following problems:

1. The work is file-oriented rather than database-oriented.
2. The list of evaluated indices is inadequate: In some models only a few indices can be evaluated, and the entire B-tree family and other indices are omitted. In some, a variety of updating techniques is not incorporated. Others compare too many indices, rendering the models slow and inefficient.
3. Periodic reorganization is not addressed: Some file organizations and index structures require periodic reorganization due to overflow and other problems. Some models do not take reorganization into consideration in the selection process.
4. System constraints are not taken into consideration: Many computer systems, especially microcomputers, have a limited amount of space. Other systems have time constraints due to heavy workload. These constraints do not play a role in many current models.
5. Important transaction types and the effects of these transactions are not included: Some models are oriented toward retrieval only. Others take into consideration maintenance operations such as insertions, deletions, and modifications, but do not include the multi-index/multifield effects of transactions. The following example illustrates a problem of this kind: field i (say, the salary field) of record R has to be modified. To find the record R, the system uses the value of field j (say, the social security number). An index on field j will obviously improve the search performance, but an index on field i will not contribute to the speed of the search. On the other hand, the index on field i will have to be modified, thus decreasing the performance of this transaction.
6. The interaction and distinction between primary and secondary fields are lacking: Some models are oriented toward primary fields only, others toward secondary fields. Some models evaluate indices for all fields but do not distinguish between primary and secondary fields.
7. There is no discrimination between dense and nondense field attributes.
8. Performance prediction is not provided: Some models present an optimal database configuration and/or reorganization points, but they do not provide the user with the time and space requirements of the database.

This work is an extension of previous work, integrating the performance issues 1 to 8 above. The databases considered are assumed to be relational. All files are assumed to be in first normal form (possibly they are also in any or all other normal forms). No reference is made to modeling network or hierarchical databases.
National Computer Conference, 1985

DESCRIPTION OF THE MODEL

The model is composed of four parts: input parameters, the algorithm, performance and cost equations, and the output. We will first describe the input so that the parameters affecting the model will be evident; we will then describe the output (i.e., the outcome of the model). This will be followed by a general description of the algorithm, the computations, and the assumptions made. The detailed formulas used can be found in Motzkin. 15

Input Parameters

The input to the system consists of four groups of parameters: system parameters, database parameters, user workload parameters, and index parameters. (For examples of parameters, see the section "Experimental Results" below.)

System parameters

These parameters are concerned with system constraints and costs. They include total number of available blocks, the blocking factor (number of characters per block), average access time, cost per block (per day), and cost per access. Note that the total time available per unit of time, such as a day or a month, is not an input parameter. The time required to execute the user workload is provided as part of the output. The user can modify the space allocation and achieve better access time. This procedure is described below in the section "An Overview of the Algorithm, Computations, and Assumptions."

Database parameters

These parameters provide database information such as the number of files, the number and names of fields in each file, and needed information on each field. The parameters include number of files; for each file, the name of the file, a flag indicating whether the file is sorted or unsorted, the name of the field on which the file is sorted, the name of the primary field, and the total number of records in the file; for each field, the name of the field, a flag indicating whether the field attribute is dense or nondense (this parameter is needed because different types of indices are suitable for dense attributes and nondense attributes), the number of distinct attribute values (this is meaningful only for nondense attributes), and the number of characters in the field.

User workload parameters

This group of parameters is concerned with various transactions such as retrievals (also referred to as searches), insertions, deletions, modifications (also referred to as updates), and frequency of reorganization. It is difficult to obtain the values for user workload parameters. The values may be estimated, or a program may count them over a period of time and come up with an average per unit of time such as a day or a month. This paper does not discuss methods by which user workload parameters may be obtained. It is assumed that such input is available for the model.
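As an illustration only, the system and database parameters listed above can be pictured as plain records. The sketch below is not taken from the paper's PASCAL implementation: every class name, field name, and sample value is hypothetical, and the two validation checks are assumptions added for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SystemParameters:
    """System constraints and costs (names are hypothetical)."""
    available_blocks: int          # total number of available blocks
    chars_per_block: int           # blocking factor
    avg_access_time_sec: float     # average access time
    cost_per_block_per_day: float  # cost per block (per day)
    cost_per_access: float         # cost per access


@dataclass
class FieldInfo:
    """Per-field database parameters."""
    name: str
    dense: bool                            # dense or nondense attribute
    num_chars: int                         # number of characters in the field
    distinct_values: Optional[int] = None  # meaningful only for nondense fields

    def __post_init__(self) -> None:
        # Assumption: a nondense field must state its number of distinct values.
        if not self.dense and self.distinct_values is None:
            raise ValueError(f"nondense field {self.name!r} needs distinct_values")


@dataclass
class FileInfo:
    """Per-file database parameters."""
    name: str
    is_sorted: bool
    primary_field: str
    total_records: int
    fields: List[FieldInfo]
    sort_field: Optional[str] = None  # only relevant for sorted files

    def __post_init__(self) -> None:
        # Assumption: the primary field must be one of the declared fields.
        if self.primary_field not in {f.name for f in self.fields}:
            raise ValueError(f"primary field {self.primary_field!r} not in file")

    def record_length(self) -> int:
        """Record length in characters: the sum of the field widths."""
        return sum(f.num_chars for f in self.fields)


# Hypothetical sample input: one sorted file with a dense and a nondense field.
employee = FileInfo(
    name="EMPLOYEE",
    is_sorted=True,
    sort_field="ssn",
    primary_field="ssn",
    total_records=10_000,
    fields=[
        FieldInfo("ssn", dense=True, num_chars=9),
        FieldInfo("dept", dense=False, num_chars=4, distinct_values=25),
    ],
)
print(employee.record_length())
```

Keeping the per-file and per-field records separate mirrors the way the paper groups its input, so workload counts could later be attached at the matching level.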
User workload parameters include only processes that can use indices; thus accesses that result from operations such as PROJECT are not included. The user workload parameters include total number of records inserted in each file per unit of time, total number of records deleted from each file per unit of time, total number of searches per field per unit of time, total number of updates per field per unit of time, and frequency of merge. We used the day as the unit of time.

Note that some transactions are specified per file, whereas others are specified per field. Insertions and deletions are measured per file because when a record is inserted or deleted, all indices have to be modified. Searches are measured per field, since database users usually request records with given field values. Updates (modifications) also affect individual fields; for example, if a salary field is changed, only the salary index is modified. It is assumed that the searches for records to be deleted or modified are done using the primary field.

The frequency-of-merge parameter indicates how often reorganization is done. Reorganization involves merging overflow areas with main areas, removing empty areas that might have been created as a result of deletions, regenerating some indices, and other related operations. This input parameter is required per file, since each file is reorganized along with the file indices.

Index parameters

The system provides a few of the more widely used indices as a default option. The user may add any additional indices to be used in the evaluation.

Default option: The default option is used when the user does not specify her/his choice of indices. For dense fields the system evaluates the B-tree index and the sequential index. The B-tree index is chosen as a representative of the B-tree family, which includes the B-tree, B+ tree, B* tree, and multilevel sequential index with block-splitting techniques used for updating.
These four directories have similar performance; therefore, one representative is selected. The formulas for the B-tree are taken from Horowitz and Sahni. 9 The B-tree family has a very efficient access time, but space utilization may be as low as 50%. Therefore, the other directory chosen as a default option is a simple one-level sequential directory. The sequential directory is not as fast as a B-tree, but it is more economical than a B-tree in space requirements. In some database environments, especially on small computers, the space constraints may be stronger than the time constraints; thus the slower but more economical sequential directory may be more suitable.

For nondense attributes an inverted file is selected as the default option. It was pointed out by Motzkin 14 and others that inverted files are superior to multilists in most situations. The uniform organization of inverted files 13 is assumed. The detailed formulas used can be found in Motzkin. 15

Additional user-selected indices to be evaluated: A user may wish to evaluate and compare indices other than the default ones. It is possible to enter additional indices and their characteristics. The system will incorporate all additional indices into the optimization process.

Output

The output includes total time required by the database operations described above per unit of time (per month in our implementation); total space required by the database; related cost of the database; and a list of files, fields, and selected indices, as well as fields for which not having an index was more cost effective. (For sample output see the section "Experimental Results.")

An Overview of the Algorithm, Computations, and Assumptions

For each field of each file the system first selects the best index (that with the lowest cost) out of all indices to be evaluated. In selecting the best index, the following cost considerations are included:

1. The cost of retrievals (searches) that use the index
2. The cost of index modifications due to insertions and deletions to the corresponding file
3. The cost of index modifications due to changes of corresponding field values in the corresponding file
4. The cost of space occupied by the index
5. The cost of index reorganization due to overflow and other deterioration factors

The searches for records to be deleted and modified are assumed to be done using the primary field. They are added to the cost of each index evaluated for each primary field of each file.

After the best index has been determined for a field, the cost of related processing of the field without an index is computed. Cost without an index will include the cost of direct search in the file for records associated with certain field values. Obviously the cost of direct search in the file will be significantly higher than the search that uses an index; however, there will be no cost of index space and index maintenance. Now, for each field, the cost without an index is compared with the cost with the best index. It is then determined whether the best index or no index will be selected for the field.

When indices (or no indices) are selected for all fields, the total database space is computed, including the space occupied by the files and the indices. If the total database space is greater than the available space, then the least useful index is removed. The process of removing the least useful index continues until enough indices have been removed to yield a total database space that is less than or equal to the available space (see Figure 1, outline of the algorithm).
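The per-field selection described above can be sketched as follows. This is a minimal illustration, not the paper's cost equations (those are in Motzkin 15): the five cost components are simply taken as given numbers, the names `IndexCost` and `select_for_field` are hypothetical, all sample costs are invented, and breaking a tie in favor of no index is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class IndexCost:
    """The five cost components considered for one candidate index."""
    search: float          # 1. retrievals that use the index
    insert_delete: float   # 2. index changes due to insertions and deletions
    update: float          # 3. index changes due to field value modifications
    space: float           # 4. space occupied by the index
    reorganization: float  # 5. periodic reorganization of the index

    def total(self) -> float:
        return (self.search + self.insert_delete + self.update
                + self.space + self.reorganization)


def select_for_field(
    candidates: Dict[str, IndexCost],
    cost_without_index: float,
) -> Tuple[Optional[str], float]:
    """Return (chosen index name or None, cost of that choice) for one field."""
    if not candidates:
        return None, cost_without_index
    # Step 1: pick the candidate index with the lowest total cost.
    best_name = min(candidates, key=lambda name: candidates[name].total())
    best_cost = candidates[best_name].total()
    # Step 2: compare the best index with having no index at all.
    # Assumption: on a tie, no index is preferred because it needs no space.
    if cost_without_index <= best_cost:
        return None, cost_without_index
    return best_name, best_cost


# Hypothetical costs for one dense field.
candidates = {
    "B-tree": IndexCost(10.0, 5.0, 2.0, 8.0, 1.0),
    "sequential": IndexCost(18.0, 4.0, 2.0, 3.0, 1.0),
}
print(select_for_field(candidates, cost_without_index=60.0))  # ('B-tree', 26.0)
print(select_for_field(candidates, cost_without_index=20.0))  # (None, 20.0)
```

The second call shows the case the model reports as a field "for which not having an index was more cost effective": a direct file search is cheaper than maintaining even the best index.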
The usefulness of an index is determined by the difference between the cost associated with the corresponding field if an index is not used for the field and the cost associated with the field when an index (the best) is provided. An index is considered less useful if it does not reduce the cost of the field considerably. An index is normally more useful when the corresponding field has more searches and fewer modifications, and if the index does not occupy a very large amount of space. (The exact formulas used in the cost equations can be found in Motzkin. 15)

At the end of the computations the user is presented with the total space, the total time of accesses computed from the output, and the related cost of the database. It is possible that while the space is acceptable, the time figure is too large. The database designer may then allow more space for the database and run the optimization program again. The additional space allocation will allow the database to use more indices and thus improve the time figure. The user may also try to put less weight on the space by reducing the cost of a block; this reduction may also increase the number of indices used. The frequency-of-merge parameters can also be changed. This iterative procedure may continue until an acceptable configuration is achieved or until there is no further improvement.

Outline of the Algorithm

An outline of the algorithm appears in Figure 1.

    FOR i = 1 TO number of files DO
        FIND the best index for the primary field p; denote it by IND(i,p)
        FIND whether it is "better" to have IND(i,p) or no index for field p of file i
        FOR j = 1 TO number of fields in file i DO
            IF j ≠ p THEN
                FIND the "best" index for field j; denote it by IND(i,j)
                FIND whether it is better to have IND(i,j) or no index for field j of file i
                STORE the information on IND(i,j)
            END IF
        END FOR
    END FOR
    Compute total database space (include space requirements for files and indices)
    IF total database space > total space available THEN
        FOR i = 1 TO number of files DO
            FOR j = 1 TO number of fields in file i DO
                USEFUL(i,j) = COST_OF_FIELD(i,j) without index - COST_OF_FIELD(i,j) with index
            END FOR
        END FOR
        SORT USEFUL(i,j); denote the sorted list USEFUL(k)
            (k = 1 for the USEFUL(i,j) with the smallest value and
             k = number of indices for the USEFUL(i,j) with the highest value)
        FOR k = 1 TO number of indices DO
            Remove IND(k) from database
            Database Space ← Database Space - Space of IND(k)
            IF Database Space ≤ Available Space THEN exit loop
        END FOR
    END IF
    Compute database cost and time
    Print output reports

Figure 1-Outline of the algorithm

A note on the complexity of the algorithm: The separability assumptions have been used. 4, 19 Thus the computations are performed on each field of each file separately. Denote the total number of fields over all files of the database by NF, and denote the number of indices to be evaluated by NI. Then the time required for the optimization process is T = NF × (NI + 1). Each additional iteration will take another T time.

EXPERIMENTAL RESULTS

The optimization and prediction model has been implemented by a PASCAL program. Four different simulation runs are provided (Figures 2-5). Field 1 is assumed to be the primary field in all files. The input parameter FILE TYPE with values U or S means unsorted or sorted file. Sorted files in this implementation are assumed to be sorted on the primary field. The FIELD TYPE parameter with values D or N means dense or nondense attributes. The time and cost figures are related to the accesses and maintenance parameters that were included in the input. (Processes that do not use indices, such as PROJECT operations, are not part of this model.) The input parameters, such as costs and user workload, are given per day. The output summary is computed per month. The rest is self-explanatory.
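The space adjustment that two of the simulation runs report (the second half of Figure 1) can be sketched as below. This is an illustrative reading of the outline, not the paper's PASCAL code: the names `SelectedIndex` and `fit_to_space` are hypothetical and all block counts and costs are invented. As in the outline, indices are dropped strictly in order of increasing usefulness, regardless of how much space each one frees.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SelectedIndex:
    """An index chosen for field `field` of file `file`."""
    file: int
    field: int
    space_blocks: int          # blocks occupied by the index
    cost_with_index: float     # cost of the field when the index is used
    cost_without_index: float  # cost of the field with no index

    @property
    def usefulness(self) -> float:
        # USEFUL(i,j) = cost without index - cost with index
        return self.cost_without_index - self.cost_with_index


def fit_to_space(
    file_blocks: int,
    selected: List[SelectedIndex],
    available_blocks: int,
) -> Tuple[List[SelectedIndex], List[SelectedIndex], int]:
    """Remove least useful indices until the database fits.

    Returns (kept indices, removed indices, final total space in blocks).
    """
    kept = sorted(selected, key=lambda s: s.usefulness)  # least useful first
    removed: List[SelectedIndex] = []
    total = file_blocks + sum(s.space_blocks for s in kept)
    while total > available_blocks and kept:
        victim = kept.pop(0)
        removed.append(victim)
        total -= victim.space_blocks
    return kept, removed, total


# Hypothetical database: files need 100 blocks, three indices were selected.
indices = [
    SelectedIndex(file=1, field=1, space_blocks=30,
                  cost_with_index=10.0, cost_without_index=60.0),  # usefulness 50
    SelectedIndex(file=1, field=2, space_blocks=20,
                  cost_with_index=15.0, cost_without_index=20.0),  # usefulness 5
    SelectedIndex(file=2, field=1, space_blocks=40,
                  cost_with_index=30.0, cost_without_index=50.0),  # usefulness 20
]
kept, removed, total = fit_to_space(100, indices, available_blocks=160)
print([(s.file, s.field) for s in removed], total)  # [(1, 2), (2, 1)] 130
```

In this made-up run, removing the least useful index alone still leaves 170 blocks, so the next one is removed as well; a designer who finds the resulting time figure too large would raise the available space and rerun, as the iterative procedure describes.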
CONCLUDING REMARKS

A model for prediction and optimization of the performance of relational databases has been developed. The model is concerned with selection of an optimal set of indices and reorganization points. It provides the total cost, time, and space associated with the selected indices for the given input parameters. It is a natural extension of previous work. It takes into consideration the effects of transactions on different fields and the total system's capacity and constraints, and it allows the user to evaluate indices that the user is interested in. The model distinguishes between primary and secondary fields, between dense and nondense attributes, and between sorted and unsorted files. The model has been implemented by a PASCAL program, and sample simulation runs are provided. It is more complete than previous work, and it is easy to use. The complexity of the algorithm is (number of fields) × (number of evaluated indices + 1).

ACKNOWLEDGMENTS

The author wishes to thank Mr. Chung-Liang Lin for converting the algorithms into a PASCAL program and generating the tests. The author also wishes to thank Dr. Donna Kaminski for her valuable comments and suggestions.

REFERENCES

1. Batory, D. S. "B+ Trees and Indexed Sequential Files: A Performance Comparison." ACM Proceedings of SIGMOD. New York: ACM, 1981.
2. Batory, D. S. "Optimal File Designs and Reorganization Points." ACM Transactions on Database Systems, 7 (1982).
3. Batory, D. S., and C. C. Gotlieb. "A Unifying Model of Physical Databases." ACM Transactions on Database Systems, 7 (1982).
4. Bonfatti, F., D. Maio, and P. Tiberio. "A Separability-based Method for Secondary Index Selection in Physical Database Design." In Methodology and Tools for Database Design. Amsterdam: North-Holland, 1983.
5. Carlis, J. V., S. T. March, and G. W. Dickson. "Physical Database Design, a DSS Approach." Information and Management, 6 (1983).
6. Chen, P. P., and S. B. Yao. "Design and Performance Tools for Database Systems." IEEE Proceedings of the International Conference on Very Large Data Bases. New York: IEEE, 1977.
7. Christodoulakis, S. "Estimating Record Selectivities." Information Systems, 8 (1983).
8. Hoffer, J. A. "An Integer Programming Formulation of Computer Database Design Problems." Information Science, 11 (1976).
9. Horowitz, E., and S. Sahni. Fundamentals of Data Structures. Rockville, Md.: Computer Science Press.
10. Lum, V. Y., and H. Ling. "An Optimization Problem on the Selection of Secondary Keys." Proceedings of ACM Annual Conference. New York: ACM, 1971.
11. March, S. T., and D. G. Severance. "The Determination of Efficient Record Segmentation and Blocking Factors for Shared Data Files." ACM Transactions on Database Systems, 2 (1977) 3.
12. Mendelson, H. "Analysis of Extendible Hashing." IEEE Transactions on Software Engineering, SE-8 (1982) 6.
13. Motzkin, D., K. Williams, and K. Chang. "Uniform Organization of Inverted Files." AFIPS, Proceedings of 1984 National Computer Conference (Vol. 53), 1984.
14. Motzkin, D. "The Use of Normal Multiplication Tables For Information Storage and Retrieval." Communication of the Association for Computing Machinery (CACM), Vol. 22, (1979) 3.
15. Motzkin, D. "Computer Assisted Optimization and Prediction of Database Performance." Western Michigan University, Computer Science Department, Report 84-1, September.
16. Nicolas, G. S. "A Generalized Database Access Path Model." AFIPS Proceedings of the National Computer Conference, 1981.
17. Schkolnick, M. "The Optimal Selection of Secondary Indices for Files." Information Systems, 1 (1975).
18. Schkolnick, M. "A Survey of Physical Database Design Methodology and Techniques." Proceedings of the Fourth International Conference on Very Large Databases. New York: IEEE, 1978.
19. Whang, K. W., G. Wiederhold, and D. Segalowics. "Separability-An Approach to Physical Database Design." Proceedings of the Seventh International Conference on Very Large Databases. New York: IEEE, 1982.
20. Yao, S. B., and A. G. Mertin. "Selection of File Organization Using an Analytic Model." Proceedings of the International Conference on Very Large Databases. New York: IEEE, 1975.
21. Yao, S. B., K. S. Das, and T. J. Theorey. "A Dynamic Database Reorganization Algorithm." ACM Transactions on Database Systems, 1.
22. Yao, S. B. "Modelling and Performance Evaluation of Physical Database Structures." ACM Proceedings of ACM National Conference. New York: ACM, 1976.
23. Yao, S. B. "An Attribute Based Model for Database Access Cost Analysis." ACM Transactions on Database Systems, 2 (1977).

[Figures 2-5: simulation run reports. Each report lists the system parameters, the database and user workload parameters per file and per field, the directories tested (1. SEQUENTIAL, 2. B TREE, 3. INVERTED, 4. MULTI-LEVEL SEQUENTIAL), the recommended directory for every field with its number of blocks, access time, and access cost per month, and a summary of blocks, space, cost, and time for the entire database. Only the recommended organizations are legible and are reproduced below.]

    FILE#  FILE-TYPE  FIELD#  FIELD-TYPE  ORGANIZATION
    1      U          1       D           M-LEVEL-INDEX
    1      U          2       N           INVERTED-FILE
    1      U          3       N           INVERTED-FILE
    2      S          1       D           M-LEVEL-INDEX
    2      S          2       N           INVERTED-FILE
    2      S          3       N           INVERTED-FILE
    3      S          1       D           M-LEVEL-INDEX
    3      S          2       N           SO-DENSE-NO-DIR
    3      S          3       N           SO-DENSE-NO-DIR
    3      S          4       N           SO-DENSE-NO-DIR

Figure 2-Simulation run: Database 1

    FILE#  FILE-TYPE  FIELD#  FIELD-TYPE  ORGANIZATION
    1      U          1       D           B-TREE
    1      U          2       N           INVERTED-FILE
    1      U          3       N           INVERTED-FILE
    2      S          1       D           B-TREE
    2      S          2       N           SO-DENSE-NO-DIR
    2      S          3       N           SO-DENSE-NO-DIR

Figure 3-Simulation run: Database 2

    FILE#  FILE-TYPE  FIELD#  FIELD-TYPE  ORGANIZATION
    1      U          1       D           M-LEVEL-INDEX
    1      U          2       D           M-LEVEL-INDEX
    1      U          3       N           RANDOM-NO-DIR
    2      S          1       D           M-LEVEL-INDEX
    2      S          2       D           M-LEVEL-INDEX
    2      S          3       N           INVERTED-FILE
    2      S          4       N           INVERTED-FILE

Figure 4-Simulation run: Database 3

DATABASE EXCEEDS AVAILABLE SPACE, THE FOLLOWING ADJUSTMENTS HAVE BEEN MADE. AFTER ADJUSTMENT, THE RECOMMENDED DIRECTORIES FOR ALL FIELDS OF ALL FILES IN DATABASE # 3 ARE:

    FILE#  FILE-TYPE  FIELD#  FIELD-TYPE  ORGANIZATION
    1      U          1       D           M-LEVEL-INDEX
    1      U          2       D           RANDOM-NO-DIR
    1      U          3       N           RANDOM-NO-DIR
    2      S          1       D           SO-DENSE-NO-DIR
    2      S          2       D           SO-DENSE-NO-DIR
    2      S          3       N           INVERTED-FILE
    2      S          4       N           INVERTED-FILE

Figure 4-(continued)

    FILE#  FILE-TYPE  FIELD#  FIELD-TYPE  ORGANIZATION
    1      U          1       D           M-LEVEL-INDEX
    1      U          2       N           INVERTED-FILE
    1      U          3       N           INVERTED-FILE
    2      S          1       D           M-LEVEL-INDEX
    2      S          2       N           INVERTED-FILE
    2      S          3       N           INVERTED-FILE
    2      S          4       N           INVERTED-FILE

Figure 5-Simulation run: Database 4

DATABASE EXCEEDS AVAILABLE SPACE, THE FOLLOWING ADJUSTMENTS HAVE BEEN MADE. AFTER ADJUSTMENT, THE RECOMMENDED DIRECTORIES FOR ALL FIELDS OF ALL FILES IN DATABASE # 4 ARE:

    FILE#  FILE-TYPE  FIELD#  FIELD-TYPE  ORGANIZATION
    1      U          1       D           RANDOM-NO-DIR
    1      U          2       N           RANDOM-NO-DIR
    1      U          3       N           RANDOM-NO-DIR
    2      S          1       D           M-LEVEL-INDEX
    2      S          2       N           INVERTED-FILE
    2      S          3       N           INVERTED-FILE
    2      S          4       N           INVERTED-FILE

Figure 5-(continued)
