Query Processing Strategies in Distributed Database


Kunal Jamsutkar, M.Tech, Department of Computer Engineering and Information Technology, V.J.T.I., Mumbai
Viki Patil, M.Tech, Department of Computer Engineering and Information Technology, V.J.T.I., Mumbai
Dr. B. B. Meshram, Professor, Department of Computer Engineering and Information Technology, V.J.T.I., Mumbai

ABSTRACT
Query optimization is an important part of a database management system. This paper surveys query optimization technology, drawing on a number of optimization algorithms commonly used for distributed queries. It aims to arrive at an optimal query processing plan for a given distributed query. Under this approach, query plans whose required data reside close to each other are considered more efficient, so the plans generated this way should result in efficient query processing.

Keywords: Database, query processing, distributed query strategy, system model, query processing cost, cost measures.

Introduction
In recent years, with the development of computer network and database technology, distributed databases have become more and more widely used. As applications expand, data queries grow more complex and efficiency requirements grow stricter, so query processing is a key issue in a distributed database system. In a distributed database environment, data is stored at different sites connected through a network. A distributed database management system (DDBMS) supports the creation and maintenance of a distributed database. The research literature proposes a wide variety of query optimization algorithms. Yu and Chang give a comprehensive overview of query optimization techniques for distributed database management systems [20]. However, this overview does not attempt to develop a model of query optimization that explains and presents the algorithms in a uniform way.
Such an understanding is needed if we want to change or extend existing algorithms to adapt them to new requirements. In this research we consider query processing algorithms for a distributed database system. Much research has been done on distributed query processing methods (see [2], [3]). Increased reliability and performance can also be attained with a distributed database. All database systems must be able to respond to requests for information from the user, i.e. process queries. How a DBMS processes queries and the methods it uses to optimize their performance are the topics covered in this paper. In certain sections, concepts are illustrated with an example. Since many optimization algorithms differ in their computational behavior while also reflecting aspects of the implementation environment, the purpose of this paper is to explain them through a few simple concepts. Finally, we summarize our findings and discuss future work.

General aspects of optimization
This section clarifies what we mean by the terms query, query processing, and query optimization. We then discuss the aspects of query optimization that can be found in all of the optimization algorithms described in the cited papers.

Definitions And Examples

A. What Is A Query?
A database query is an instruction to a DBMS to update or retrieve specific data on or from the physical storage medium. The actual updating and retrieval of data is performed through various low-level operations. Examples of such operations for a relational DBMS are relational algebra operations such as project, join, select, and Cartesian product.

B. The Query Processor
There are three phases that a query passes through during the DBMS processing of that query:
1. Parsing and translation
2. Optimization
3. Evaluation

Most queries submitted to a DBMS are in a high-level language such as SQL.
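The relational algebra operations named in section A (select, project, join) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the list-of-dictionaries representation of a relation, and the sample rows are all assumptions made for the example, not how any particular DBMS implements these operations.

```python
# Illustrative sketch only: a relation is modeled as a list of dicts.
# Function names and sample rows are hypothetical.

def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection: keep only the named attributes and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def join(left, right, attribute):
    """Equi-join on one shared attribute, written as a simple nested loop."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if l_row[attribute] == r_row[attribute]
    ]


vehicles = [
    {"vin": 1, "make": "Toyota", "owner_id": 10},
    {"vin": 2, "make": "Honda", "owner_id": 11},
    {"vin": 3, "make": "Toyota", "owner_id": 11},
]
owners = [
    {"owner_id": 10, "name": "Asha"},
    {"owner_id": 11, "name": "Ravi"},
]

toyotas = select(vehicles, lambda row: row["make"] == "Toyota")
makes = project(toyotas, ["make"])
owned = join(toyotas, owners, "owner_id")
```

A query processor composes such operations into a plan; the same result can usually be obtained from several different compositions, which is what gives the optimizer room to choose.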
During the parsing and translation stage, the human-readable form of the query is translated into forms usable by the DBMS. These can take the form of a relational algebra expression, a query tree, or a query graph.

Blue Ocean Research Journals 71

Consider the following SQL query:

    SELECT make FROM vehicles WHERE make = 'Toyota'

This can be translated into either of the following relational algebra expressions:

    σ make='Toyota' (π make (vehicles))
    π make (σ make='Toyota' (vehicles))

It can also be represented as a query graph:

Fig 2. Query Graph

After parsing and translation into a relational algebra expression, the query is transformed into a form, usually a query tree or graph, that can be handled by the optimization engine. The optimization engine then performs various analyses on the query data, generating a number of valid evaluation plans. From there, it determines the most appropriate evaluation plan to execute. After the evaluation plan has been selected, it is passed to the DBMS query-execution engine (also referred to as the runtime database processor), where the plan is executed and the results are returned.

B.1- Parsing and Translating the Query
The first step in processing a query submitted to a DBMS is to convert the query into a form usable by the query processing engine. High-level query languages such as SQL represent a query as a string, or sequence, of characters. Certain sequences of characters represent various types of tokens such as keywords, operators, operands, and literal strings. As in all languages, there are rules (syntax and grammar) that govern how the tokens can be combined into understandable (i.e. valid) statements. The primary job of the parser is to extract the tokens from the raw string of characters and translate them into the corresponding internal data elements (i.e. relational algebra operations and operands) and structures (i.e. query tree, query graph). The last job of the parser is to verify the validity and syntax of the original query string.

B.2- Optimizing the Query
In this stage, the query processor applies rules to the internal data structures of the query to transform these structures into equivalent but more efficient representations.
The rules can be based upon mathematical models of the relational algebra expression and tree (heuristics), upon cost estimates of different algorithms applied to operations, or upon the semantics within the query and the relations it involves. Selecting the proper rules to apply, when to apply them, and how they are applied is the function of the query optimization engine.

B.3- Evaluating the Query
The final step in processing a query is the evaluation phase. The best evaluation plan candidate generated by the optimization engine is selected and then executed. (Note that there can be multiple methods of executing a query. Besides processing a query in a simple sequential manner, some of a query's individual operations can be processed in parallel, either as independent processes or as interdependent pipelines of processes or threads. Regardless of the method chosen, the actual results should be the same.)

C. Query Processing
Query processing is defined as the activities involved in parsing, validating, optimizing and executing a query. The main aims of query processing are to transform a query written in a high-level language (e.g. SQL) into a correct and efficient execution strategy expressed in a low-level language (implementing relational algebra), and to find information in one or more databases and deliver it to the user quickly and efficiently.

    High-level user query -> Query processor -> Low-level data manipulation commands

Fig. 3 Flow of Query Processing

D. Query Optimization
Query optimization is defined as the activity of choosing an efficient execution strategy for processing a query. Query optimization is a part of query processing. The main aims of query optimization are to choose a transformation that minimizes resource usage, to reduce the total execution time of the query, and to reduce the response time of the query.

Distributed Query Processing Methodology:

Journal of Engineering, Computers & Applied Sciences (JEC&AS) ISSN No: 2319 5606

Distributed query processing contains four stages, which are as follows:
1. Query decomposition
2. Data localization
3. Global optimization
4. Local optimization

D.1- Query decomposition
In this stage a calculus query on global relations is the input and an algebraic query is the output. The stage is itself divided into four steps:
Normalization: manipulate query quantifiers and qualification.
Analysis: detect and reject incorrect queries. This is possible for only a subset of relational calculus.
Simplification: eliminate redundant predicates.
Restructuring: translate the calculus query into an algebraic query. More than one translation is possible, so transformation rules are used.

D.2- Data localization
In this stage an algebraic query on distributed relations is the input and a fragment query is the output. Fragment involvement is determined in this stage.

D.3- Global optimization
In this stage a fragment query is the input and an optimized fragment query is the output. Finding the best global schedule is done in this stage.

D.4- Local optimization
In this stage the best global execution schedule is the input and locally optimized queries are the output. It contains two sub-stages: selecting the best access path, and using centralized optimization techniques.

E. Distributed Query Optimization
Distributed query optimization is defined as finding an efficient execution strategy path in distributed networks. Query optimization is difficult in a distributed environment. There are three components of distributed query optimization: access method, join criteria, and transmission costs.

Access Method: The methods used to access data in a distributed environment, such as hashing and indexing.

Join Criteria: In a distributed database, data resides at different sites. Join criteria are used to combine the data from the different sites to obtain an optimized result.
Transmission Costs: If data from multiple sites must be joined to satisfy a single query, then the cost of transmitting the results from intermediate steps needs to be factored into the equation. At times, it may be more cost effective simply to ship entire tables across the network to enable processing to occur at a single site, thereby reducing overall transmission costs. This component of query optimization is an issue only in a distributed environment.

There are many distributed query optimization issues; some of them are types of optimizers, optimization granularity, network topologies, and optimization timing.

Fig. 4 Query processing methodology

3. Optimal Distribution Strategies for Simple Queries
This section covers query optimization algorithms that derive optimal distribution strategies for a class of distributed queries called simple queries.

Various algorithms are used for query optimization. Algorithm PARALLEL [3] derives a minimal response time distribution strategy for any given simple query. The Algorithm SERIAL [3] strategy consists of transmitting each relation in a serial order. In Algorithm GENERAL, minimization of response time and total time is done by three different versions of the algorithm, which are:
A. Response Time Version
B. Total Time Version
C. Handling Redundant Data Transmission

Algorithm-S is a static algorithm, as are PARALLEL, SERIAL, GENERAL, and D. In a static algorithm, the strategy is generated before any transmission or intersite joining takes place. Therefore, the algorithm must include some method for estimating the effect of a semijoin on the parameters.

Related Work
In GENERAL, the query is decomposed into single-joining-attribute subqueries. Candidate schedules are generated for each subquery separately. There is an integration step but no synchronization step. By contrast, Algorithm-S uses a more precise interpretation of attribute independence, which takes into account forced reductions in the projected size of nonjoining attributes with low value multiplicity (keys, for instance). Since reductions are not restricted to single attributes, the decomposition into subqueries is no longer desirable and is not done. The integration step which follows is very similar to that of GENERAL. The final SYNCHRONIZE step is used to detect beneficial semijoin delays which might have been missed because integrated schedules are generated for each relation separately. Modifying and extending GENERAL in this way yields different strategies with reduced costs. These substantial cost savings show up when using the response time minimization objective as well as the total time minimization objective. For most complex queries, Algorithm-S provides the same strategy whether the response-time or total-time version is used.
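The reason semijoins matter to these algorithms can be sketched with a small transmission-cost comparison. The sketch below is only an illustration under stated assumptions: the relations r1 and r2, their attributes, and the cost unit (number of attribute values sent over the network) are hypothetical, and the code is not the estimation method used by Algorithm-S or GENERAL.

```python
# Illustrative sketch only: relations, attribute names and the cost unit
# (count of attribute values transmitted) are assumptions for this example.

def project_values(relation, attribute):
    """Distinct values of the joining attribute; this is what gets shipped."""
    return {row[attribute] for row in relation}


def semijoin(relation, attribute, values):
    """Keep only the rows whose joining attribute appears in `values`."""
    return [row for row in relation if row[attribute] in values]


def transmitted_values(relation):
    """Hypothetical cost: total number of attribute values in the relation."""
    return sum(len(row) for row in relation)


# r1 is stored at Site 1 and r2 at Site 2; they join on attribute "a".
r1 = [{"a": 1, "b": "x"}, {"a": 3, "b": "y"}]
r2 = [{"a": i, "c": f"c{i}", "d": i * 10} for i in range(1, 11)]

# Strategy A: ship all of r2 to Site 1 and join there.
cost_ship_all = transmitted_values(r2)

# Strategy B: ship the joining column of r1 to Site 2, reduce r2 with a
# semijoin, then ship only the reduced r2 back to Site 1.
join_values = project_values(r1, "a")
reduced_r2 = semijoin(r2, "a", join_values)
cost_semijoin = len(join_values) + transmitted_values(reduced_r2)
```

With this sample data the semijoin strategy transmits far fewer values than shipping r2 whole. A static algorithm cannot run the semijoin before choosing a strategy, so it has to estimate the size of the reduced relation in advance, which is where assumptions such as attribute independence come in.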
A.1- AN OPTIMIZATION EXAMPLE
Assume that the COURSE table and the ENROLLMENT table exist at Site 1, and the STUDENT table exists at Site 2. The query is a three-table join that, as the options below indicate, looks for senior students enrolled in physics courses. If all of the tables existed at a single site, or if the DBMS supported distributed multi-site requests, the join could be submitted as a single request and left to the DBMS to optimize. However, if the DBMS cannot perform (or optimize) distributed multi-site requests, programmatic optimization must be performed. There are at least six different ways to go about optimizing this three-table join.

Option 1: Start with Site 1 and join COURSE and ENROLLMENT, selecting only physics courses. For each qualifying row, move it to Site 2 to be joined with STUDENT to see if any are seniors.
Option 2: Start with Site 1 and join COURSE and ENROLLMENT, selecting only physics courses, and move the entire result set to Site 2 to be joined with STUDENT, checking for senior students only.
Option 3: Start with Site 2 and select only seniors from STUDENT. For each of these, examine the join of COURSE and ENROLLMENT at Site 1 for physics classes.
Option 4: Start with Site 2 and select only seniors from STUDENT, and move the entire result set to Site 1 to be joined with COURSE and ENROLLMENT, checking for physics classes only.
Option 5: Move the COURSE and ENROLLMENT tables to Site 2 and proceed with a local three-table join.
Option 6: Move the STUDENT table to Site 1 and proceed with a local three-table join.

Which of these six options will perform the best? Unfortunately, the only correct answer is "It depends." The optimal choice will depend upon:
1. the size of the tables;
2. the size of the result sets, that is, the number of qualifying rows and their length in bytes; and
3. the efficiency of the network.

B. THE ROLE OF INDEXES
The utilization of indexes can dramatically reduce the execution time of various operations such as select and join. Let us review some of the types of index file structures and the roles they play in reducing execution time and overhead:

Dense Index: The data file is ordered by the search key and every search key value has a separate index record. This structure requires only a single seek to find the first occurrence of a set of contiguous records with the desired search value.

Sparse Index: The data file is ordered by the index search key and only some of the search key values have corresponding index records. Each index record's data-file pointer points to the first data-file record with the search key value. While this structure can be less efficient (in terms of number of disk accesses) than a dense index for finding the desired records, it requires less storage space and less overhead during insertion and deletion operations.

Primary Index: The data file is ordered by the attribute that is also the search key in the index file. Primary indices can be dense or sparse. This is also referred to as an Index-Sequential File [5]. For scanning through a relation's records in sequential order by a key value, this is one of the fastest and most efficient structures: locating a record has a cost of one seek, and the contiguous makeup of the records in sorted order minimizes the number of blocks that have to be read. However, after large numbers of insertions and deletions, the performance can degrade quite quickly, and the only way to restore the performance is to perform a reorganization.

Secondary Index: The data file is ordered by an attribute that is different from the search key in the index file. Secondary indices must be dense.

Multi-Level Index: An index structure consisting of two or more tiers of records, where an upper tier's records point to associated index records of the tier below. The bottom tier's index records contain the pointers to the data-file records. Multi-level indices can be used, for instance, to reduce the number of disk block reads needed during a binary search.
Clustering Index: A two-level index structure where the records in the first level contain the clustering field value in one field and a second field pointing to a block (of second-level records) in the second level. The records in the second level have one field that points to an actual data file record or to another second-level block.

B+-tree Index: A multi-level index with a balanced-tree structure. The cost of finding a search key value in a B+-tree is proportional to the height of the tree, which grows logarithmically with the number of search keys. While this, on average, is more than a single-level dense index that requires only one seek, the B+-tree structure has a distinct advantage in that it does not require reorganization: it is self-optimizing because the tree is kept balanced during insertions and deletions. Many mission-critical applications require high performance with near-100% uptime, which cannot be achieved with structures requiring reorganization. The leaves of the B+-tree are used to reorganize the data file.

C. New query optimization techniques in distributed database

C.1- Cost based query optimization
The objective of cost-based query optimization is to estimate the cost of different equivalent query expressions and choose the execution plan with the lowest cost. Cost-based query optimization mainly depends on two factors: the solution space and the cost function.
Solution space: this depends on the set of equivalent algebraic expressions.
Cost function: the cost function is the sum of I/O cost, CPU cost, and communication cost. It also depends on the particular distributed environment.
Cost-based query optimization in a distributed environment proceeds by taking these factors into account.

C.2- Heuristic based query optimization
The heuristic-based query optimization process involves the following steps:
1) Perform Selection operations as early as possible.
2) Combine a Cartesian product with a subsequent Selection whose predicate represents a join condition into a Join operation.
3) Use the associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.

4) Perform Projection operations as early as possible.
5) Eliminate duplicate computations.
It is mainly used to minimize the cost of selecting sites for multi-join operations.

Advantages of Distributed query optimization:
Distributed query optimization techniques provide exact results in a distributed environment. These techniques provide efficient performance in different distributed networks. On the internet, these techniques help to search for exact information and extract what is required.

D. Query Processing in Relational Database Systems
The conventional method of processing a query in a relational DBMS is to parse the SQL statement and produce a relational calculus-like logical representation of the query, and then to invoke the query optimizer, which generates a query plan. The query plan is fed into an execution engine that directly executes it, typically with little or no runtime decision-making (Figure 5). The query plan can be thought of as a tree of unary and binary relational algebra operators, where each operator is annotated with specific details about the algorithm to use (e.g., nested loops join versus hash join) and how to allocate resources (e.g., memory). In many cases the query plan also includes low-level physical operations like sorting and network shipping that do not affect the logical representation of the data. Certain query processors consider only restricted types of queries, rather than full SQL. A common example of this is select-project-join (SPJ) queries: an SPJ query essentially represents a single SQL SELECT-FROM-WHERE block with no aggregation or subqueries.

Example of an SPJ query:

    SELECT * FROM R,S,T,U WHERE R.s=S.a AND S.b=T.b AND T.c=U.c

Fig 5. Query Plan (user query, query optimizer, query plan, executor, query result)

E. Results for Related Algorithms
Based on the above, we summarize the complexities of all the algorithms.

Table 1 Complexity table

    Algorithm                      Complexity
    1. PARALLEL                    O(m^2)
    2. SERIAL                      O(m log2 m)
    3. GENERAL
       3.1 Procedure Total         O(σ m^2)
       3.2 Procedure Response
    4. Algorithm-S                 O(m log m)

Here m is the number of required relations in the query.

Conclusion
Algorithm-S is a straightforward modification and extension of Apers, Hevner, and Yao's algorithm GENERAL. In GENERAL, the attribute independence assumption is interpreted to mean that a semijoin has no effect on the projected size of nonjoining attributes. This is significant, since low response time costs and low total time costs are both desirable objectives, even though one may predominate in a given situation.

Most real-world data is not well structured. Today's databases typically contain much non-structured data such as text, images, video, and audio, often distributed across computer networks. Processing these kinds of data and optimizing queries on them requires these distributed query optimization techniques.

References
[1] R. Hevner and S. B. Yao, "Query Processing in Distributed Database Systems," IEEE Trans. Software Eng., vol. SE-5, May.
[2] William Perrizo, "A Method for Processing Distributed Database Queries," IEEE Trans. Software Eng., vol. SE-10, no. 4, July 1984.
[3] Peter M. G. Apers, Alan R. Hevner, and S. Bing Yao, "Optimization Algorithms for Distributed Queries," IEEE Trans., 1983.
[4] M. Tamer Ozsu (GTE Laboratories) and Patrick Valduriez (INRIA), "Distributed Database Systems: Where Are We Now?" IEEE.
[5] Sakti Pramanik and David Vineyard, "Optimizing Join Queries in Distributed Databases," IEEE.

[6] Arbee L. P. Chen and Victor O. K. Li, "Improvement Algorithms for Semijoin Query Processing Programs in Distributed Database Systems," IEEE.
[7] Avi Silberschatz, Hank Korth and S. Sudarshan, Database System Concepts, 4th Edition. McGraw-Hill.
[8] Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, Second Edition. Addison-Wesley Publishing Company.
[9] Donald Kossmann and Konrad Stocker, "Iterative Dynamic Programming: A New Class of Query Optimization Algorithms," ACM Transactions on Database Systems, vol. 25, no. 1, March 2000.
[10] Hsiao-Fei Liu, Ya-Hui Chang and Kun-Mao Chao, "An Optimal Algorithm for Querying Tree Structures and its Applications in Bioinformatics," ACM SIGMOD Record, vol. 33, no. 2, June.
[11] Thomas Schwentick, "XPath Query Containment," ACM SIGMOD Record, vol. 33, no. 1, March.
[12] Wesley W. Chu and Paul Hurley, "Optimal Query Processing for Distributed Database Systems," IEEE Trans. Computers, vol. C-31, no. 9, September.
[13] W. Cellary, Z. Krolikowski and T. Morzy, "Other Comments on Optimization Algorithms for Distributed Queries," IEEE Trans. on Software Engineering, vol. 14, no. 4, April.
[14] Paura S. M. Tsai and Arbee L. P. Chen, "Optimizing Queries with Foreign Function in a Distributed Environment," IEEE Trans. on Knowledge and Data Engineering, vol. 14, no. 4, July/August.
[15] Dave D. Straube and M. Tamer Ozsu, "Query Optimization and Execution Plan Generation in Object Oriented Data Management Systems," IEEE Trans. on Knowledge and Data Engineering, vol. 7, no. 2, April.
[16] Stefano Ceri and George Gottlob, "Translating SQL into Relational Algebra: Optimization, Semantics and Equivalence of SQL Queries," IEEE Trans. on Software Engineering, vol. SE-11, no. 4, April.
[17] P. A. Bernstein, N. Goodman, E. Wong, G. L. Reeve and J. Rothnie, "Query Processing in a System for Distributed Databases (SDD-1)," ACM Trans. Database Syst., vol. 6, Dec.
[18] S. Chaudhuri and K. Shim, "Query Optimization in Presence of Foreign Function," Proc. Intl. Conf. Very Large Data Bases.
[19] D. Chiu and Y. Ho, "A Methodology for Interpreting Tree Queries into Optimal Semi-join Expression," in Proc. ACM SIGMOD, May.
[20] C. Yu and C. Chang, "Distributed Query Processing," ACM Comput. Surveys, vol. 16, no. 4, Dec.
[21] Ming Syan Chen and Philip S. Yu, "Using Combination Joins and Semijoins Operations for Distributed Query Processing," IEEE Transactions on Knowledge and Data Engineering.
[22] Chihping Wang and Ming Syan Chen, "On the Complexity of Distributed Query Optimization," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 4, Aug.
[23] Ming Syan Chen and Philip S. Yu, "Using Join Operations as Reducer for Distributed Query Processing," IEEE Transactions on Knowledge and Data Engineering.
[24] Konrad Stocker, Donald Kossmann, Reinhard Braumandl and Alfons Kemper, "Integrating Semi-join Reducers into State-of-the-Art Query Processors," IEEE.
