SQL Query Optimization on Cross Nodes for Distributed System


2016 International Conference on Power, Energy Engineering and Management (PEEM 2016)

Feng ZHAO 1, Qiao SUN 1, Yan-bin JIAO 1 and Jia-song SUN 2,*
1 Beijing GuoDianTong Network Technology Co., Ltd, Beijing, China
2 E. E. Department, Tsinghua University, Beijing, China
*Corresponding author

Keywords: Query path optimization, Cross nodes, Cost of query, SQL, Distributed system.

Abstract. Query terms have expanded from simple query expansion along a few dimensions to complex multi-dimensional queries. Extracting, storing, and analyzing massive data with traditional database software is becoming more and more difficult, so a better query strategy is the key to optimizing database query operations. As an important means of analyzing the data held in a database, SQL queries play an important role in data analysis and processing: through them, users can quickly obtain the information they care about most. With the continuous development of the Internet industry, the scale of the data that databases must handle continues to grow. Such data is characterized by large volume, multiple types, fast processing speed, and low value density. This paper proposes a new global adaptive optimization method that aims to reduce the number of data I/O operations and to balance load when SQL queries are executed in a distributed parallel system. Its characteristics are that it builds a multi-factor decision fuzzy evaluation model for each sub-query path optimization decision, and that it defines an optimized global cost function for adaptive optimization of the global query path, so that the query satisfies the minimum total cost requirement. The experiments show that the global query method achieves faster query speed. At the same time, the total cost of the query is controllable through the definition of the global optimized cost function.
Introduction

The distributed database system is the organic combination of a computer network and a database system. Because a large amount of data is transmitted over the network, query processing and optimization becomes the key factor in improving the query performance of a distributed database. Query processing and optimization uses suitable algorithms to reduce the amount of information transferred, so as to improve the response time of the query and reduce the cost to the system.

The query optimizer is widely considered to be the most important part of a database system. The main aim of the optimizer is to take a user query and produce a detailed plan, called a Query Execution Plan (QEP), that tells the executor exactly how the query should be executed. The problem the optimizer faces is that, for a given user query, there exists a large space of equivalent QEPs, each with a corresponding execution cost. The plans are equivalent in the sense that they return the same result for the user query, but their costs may differ by orders of magnitude. In a centralized database system, an estimate of the number of I/Os performed is typically used as the cost metric for a plan. If the optimizer chooses a plan with a poor cost, the execution can take several days, while another plan may exist that performs the execution in seconds [1].

The retrieval of data from different sites in a network is known as distributed query processing [2]. The difference between query processing in a centralized database and in a distributed database is the potential for decomposing a query into sub-queries that can be processed in parallel, with their intermediate results sent in parallel to the computers that need them. Finding an efficient way of processing a query is important. If a query is processed inefficiently, it not only takes a long time before the end user gets an answer, but it may also degrade the performance of the whole system because of network congestion.
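The point that equivalent plans can differ widely in cost, and that a distributed setting adds network transfer to the I/O estimate, can be made concrete with a small sketch. This is a minimal illustration and not the paper's cost model: the Plan fields, the two weights, and the example figures are all assumptions chosen for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One candidate query execution plan with hypothetical cost estimates."""
    name: str
    io_pages: int        # estimated number of page I/Os
    bytes_shipped: int   # estimated bytes sent between sites


def plan_cost(plan, io_weight=1.0, net_weight_per_kb=0.1):
    """Weighted sum of estimated I/O and network transfer (weights are assumed)."""
    return io_weight * plan.io_pages + net_weight_per_kb * plan.bytes_shipped / 1024


def choose_plan(plans, io_weight=1.0, net_weight_per_kb=0.1):
    """Return the cheapest of several plans that produce the same result."""
    if not plans:
        raise ValueError("at least one candidate plan is required")
    return min(plans, key=lambda p: plan_cost(p, io_weight, net_weight_per_kb))


if __name__ == "__main__":
    candidates = [
        # Ship the whole remote table, then filter locally.
        Plan("ship_then_filter", io_pages=1000, bytes_shipped=50 * 1024 * 1024),
        # Filter at the remote site first, then ship only matching rows.
        Plan("filter_then_ship", io_pages=1200, bytes_shipped=1 * 1024 * 1024),
    ]
    print(choose_plan(candidates).name)
    # Ignoring the network (a purely centralized metric) changes the choice.
    print(choose_plan(candidates, net_weight_per_kb=0.0).name)
```

With the assumed weights, the plan that filters before shipping is selected, while the I/O-only metric prefers the other plan. This is why a cost function for a distributed system has to account for communication between nodes.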

There are two distinct types of nodes that implement the query processing functionality in a SQL Server Parallel Data Warehouse (PDW) appliance [3]:

1) Control Node. The control node manages the distribution of query execution across the compute nodes, accepts client connections to the PDW appliance, and manages client authentication. In addition to containing a SQL Server instance, the control node contains additional software to support the distributed architecture of the PDW. This includes the engine that coordinates the data warehousing functions specific to processing parallel queries, stores appliance-wide metadata and configuration data, and manages appliance and database authentication and authorization.

2) Compute Nodes. Each compute node hosts a single SQL Server instance. It also runs a Data Movement Service (DMS) process for communication and data transfer with the other nodes in the appliance. Each compute node stores a portion of the user data.

Big data applications often need to access datasets on different platforms that may even be cross-domain. For structured data, the time cost of data extraction and loading cannot meet the real-time requirement. The platforms involved are interconnected via LAN or the Internet, resulting in a distributed and heterogeneous network topology in which the data sources are dynamic, heterogeneous, and autonomous. The work in [4] implements a cross-platform query interface through which clients can directly execute online join queries between separate deployments of Banian, or between Banian and any other relational database (such as MySQL and Oracle). The cross-platform query interface contains three main components: the SQL interface, the cross-platform module, and the global table. The SQL interface provides a command shell for users and forwards query commands to the cross-platform module. If a command involves several datasets on different platforms, the cross-platform module queries the global table and obtains the Location information (a data structure). It then splits the command according to the tag name variable of Location and, acting as master, sends each sub-command to the corresponding slave platform and receives the result.

At present, the general query steps for a distributed system are as follows. First, the query is decomposed according to the user's query content, and the query path is initialized. Second, the local database is checked: if the data is available locally, the query is executed locally; if not, the global query processing module uses the query path to select a node to process the query, namely the database node with the minimum cost for querying the tables involved. Third, a connection to the selected node is established, and the query command is transmitted to that node for execution. Because the distributed database system operates in a network environment, this process must take into account both the communication cost between nodes and the distributed processing itself.

The current procedure of query decomposition, data localization, local optimization, and global optimization, given the communication cost and the actual query cost it incurs, still cannot satisfy users' requirements, and it does not yield globally optimal execution nodes in distributed systems. This paper investigates a new optimization method that aims to reduce the number of data I/O operations and to balance load when SQL queries are executed in a distributed parallel system. Its characteristics are that it builds a multi-factor decision fuzzy evaluation model for each sub-query path optimization decision, and that it defines an optimized global cost function for adaptive optimization of the global query path, so that the query satisfies the minimum total cost requirement. The experiments show that the global query method achieves faster query speed. At the same time, the total cost of the query is controllable through the definition of the global optimized cost function.

The rest of this paper is organized as follows.
Section 2 gives a general introduction to querying in distributed systems. Section 3 presents the new optimization method, which improves the query speed and the total cost of a query. Section 4 evaluates the performance of this approach. Section 5 is the conclusion.

Query for the Distributed System

Big data is currently a research focus in both academia and industry. To analyze massive amounts of data and obtain valuable information and knowledge, researchers have developed many excellent systems and technologies [5, 6, 7, 8]. Query processing and optimization uses suitable algorithms to reduce the amount of information transferred, so as to improve the response time of the query and reduce the system cost. Compared with traditional single-machine optimization, distributed query optimization offers better data reliability, faster query speed, and scalable storage capacity. Distributed query optimization generally includes query decomposition, data localization, local optimization, and global optimization, as follows:

1) Query decomposition. Query decomposition translates the query (such as an SQL statement) into a relational algebra expression over the global relations.

2) Data localization. Data localization transforms the query on global relations into a query on the appropriate fragments, so that processing is local, or as close to local as possible.

3) Global optimization. The input to global optimization is the slice (fragment) query, that is, the query expressed over fragments. The goal of query optimization is to find a near-optimal execution strategy. Global optimization finds the best ordering of the operations in the slice query, the one that minimizes the cost function. The output of the global optimization layer is an optimized relational algebra query over fragments.

4) Local optimization. Local query optimization is carried out by all the sites that hold fragments involved in the query. The sub-queries executed at each site are called local queries. Each is optimized by the DBMS at its site, using the optimization algorithms of a centralized database system.

Figure 1. Steps of overall merging implementation.

Table 1. Query Latency of GA and TF under 10 SQL commands.
Columns: Commands; Query Latency (s) for GA; Query Latency (s) for TF.

New Global Adaptive Optimization Approach

Query terms have been extended from simple combinatorial queries along a few dimensions to complex queries. With traditional database software, extracting, storing, and analyzing massive data in real time becomes more and more difficult, so choosing a better query strategy has become the focus of optimizing complex database query operations. This paper proposes a new adaptive global optimization approach that reduces the frequency of data I/O and balances load in a distributed parallel SQL query system. Its characteristics are that it builds a multi-factor decision fuzzy evaluation model for each sub-query path optimization decision, and that it defines a global optimization cost function for adaptive optimization of the global query path, so that the query meets the minimum total cost requirement. The implementation steps, shown in Fig. 1, are as follows:

1) Computing the global query total cost. Determine the minimum requirement for the total cost of the global query, where the total cost is defined as the weighted sum of the error and the response time.

2) Local optimization phase. In the local optimization phase, a multi-factor decision fuzzy evaluation model is constructed for each sub-query path optimization decision, and the multi-factor fuzzy evaluation is applied to the results of query decomposition and data localization. The results of the evaluation serve as the input to local optimization.

3) Global optimization phase. In the global optimization phase, the objective is to find the best operation sequence for the slice query and to minimize the cost function. With the global optimization cost function defined, an adaptive method based on a BP (back-propagation) neural network optimizes all query paths as a whole, so that the global total query cost meets the minimum requirement set in step 1.

The advantage of this method is that the global query is faster because the query path is optimized, and that the total cost of the query is controllable through the definition of the global optimization cost function.

Table 2. Query Latency of GA and TF under 20 SQL commands.
Columns: Commands; Query Latency (s) for GA; Query Latency (s) for TF.

Experiments and Performance

In this section, we evaluate the performance and scalability of our global adaptive optimization approach (GA) and compare the results with those of the traditional four steps (TF) of querying in a distributed system. First, on a cluster of 10 nodes, we run ten SQL commands on a 10 GB database to evaluate the query latency (Table 1). Then we run twenty SQL commands on a 100 GB database to compare and analyze the query results (Table 2).
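A latency comparison of this kind can be sketched as a small timing harness. The sketch below is an illustration under stated assumptions rather than the harness used in the experiments: run_query is a hypothetical zero-argument callable that submits one SQL command and waits for its result, and the clock is injectable only so that the averaging logic can be checked without a real cluster.

```python
import time
from statistics import mean


def mean_latency(run_query, repeats, clock=time.perf_counter):
    """Average wall-clock latency in seconds of run_query over `repeats` runs.

    run_query is a hypothetical callable that executes one SQL command.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = clock()
        run_query()
        samples.append(clock() - start)
    return mean(samples)


def speedup(baseline_latency, candidate_latency):
    """How many times faster the candidate is than the baseline."""
    if candidate_latency <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_latency / candidate_latency


if __name__ == "__main__":
    # Stand-in workload; a real run would submit the command to the cluster.
    latency = mean_latency(lambda: sum(range(100_000)), repeats=5)
    print(f"mean latency: {latency:.6f} s")
```

Calling mean_latency once per command for each method and passing the two averages to speedup gives the per-command ratio between the baseline and the optimized approach.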
During this evaluation, every SQL command is run 50 times and the average value is reported. For the 10 GB queries, GA is on average 2.34 times faster than TF in query latency, with the largest gain, 3.5 times, for command_6. For the 100 GB queries, GA is on average 2.38 times faster than TF, with the largest gain, 4.26 times, for command_1. From these experiments we conclude that the global adaptive optimization approach has a significant advantage over the traditional mechanism.

Summary

Query processing and optimization is the key factor in improving the query performance of a distributed database. This paper proposed a new adaptive global optimization approach that reduces the frequency of data I/O and balances load in a distributed parallel SQL query system. Its merit is that it optimizes each sub-slice query path and so reduces the computational burden of global optimization. Another advantage is that adaptive optimization of the query is realized through neural-network-based adaptive optimization of the global query path.

Acknowledgement

This research was financially supported by the Science and Technology Project of the State Grid Corporation of China (SGZJ0000BGJS) and the State Grid Information & Telecommunication Group Co., Ltd. (SGITG-KJ-JSKF[2015]0003).

References

[1] Robert Taylor. Query Optimization for Distributed Database Systems. Master's thesis, University of Oxford, August.
[2] P.M.G. Apers, A.R. Hevner, S.B. Yao. Optimization Algorithms for Distributed Queries. IEEE Transactions on Software Engineering, vol. 9, no. 1.
[3]
[4] Tao Xu, Dongsheng Wang, and Guodong Liu. Banian: A Cross-Platform Interactive Query System for Structured Big Data. Tsinghua Science and Technology, vol. 20, no. 1, pp. 62-71, February.
[5] S. Ghemawat, H. Gobioff, and S.T. Leung. The Google file system. ACM SIGOPS Operating Systems Review, vol. 37, no. 5.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. of ACM, vol. 51, no. 1.
[7] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Proceedings of IEEE Conference on Mass Storage Systems and Technologies (MSST), 2010.
[8] D. Borthakur, J. Gray, J.S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, et al. Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2011.
