ONLINE INDEXING FOR DATABASES USING QUERY WORKLOADS


International Journal of Computer Science and Communication Vol. 2, No. 2, July-December 2011, pp

Shanta Rangaswamy 1 and Shobha G. 2
1,2 Department of Computer Science and Engineering, R.V. College of Engineering, Bangalore, India.
1,2 {shantharangaswamy,

ABSTRACT

Query response in a DBMS that relies on sequential search or a static snapshot (static indexing) may degrade significantly when query patterns change or when the database grows. Online indexing addresses these issues with a parameterizable technique that recommends indexes based on the index types frequently used for the data sets and dynamically adjusts the indexes as the query workload changes. The two parameters we have considered are support (a prediction of the future sale of frequent products based on past transactions) and confidence (the probability with which a product moves with respect to another product or products in the frequent item set). Based on the values of the support and confidence parameters, the index is changed dynamically so that it reflects the attributes that are now referenced most often. A frequent item set mining algorithm is applied to select the frequent items from the transactions. An association rule mining algorithm is applied to the frequent item set to establish the relationships among products. The concept presented here could be applied to real-world applications involving high-dimensional databases, both for efficient retrieval of data and for predicting fast-moving products with the help of indexing, the support parameter and the confidence parameter.

Keywords: High Dimension Indexing, DBMS, Query.

1. INTRODUCTION AND CONCEPTS

A high-dimensional database poses a challenge with respect to efficient access.
Fast retrieval of data is especially useful in current scenarios, where computations need the requested information quickly in order to carry out specific tasks. However, users are usually interested in querying the data over a relatively small subset of the entire attribute set at a time. A potential solution might be to use lower dimensional indexes that accurately represent the user access patterns. An increasing number of database applications, such as business data warehouses and scientific data repositories, deal with high-dimensional data sets. As the number of dimensions/attributes and the overall size of data sets increase, it becomes essential to retrieve specific queried data efficiently in order to use the database effectively. Indexing support is needed to prune out significant portions of the data set that are not relevant to the queries.

Multidimensional indexing, dimensionality reduction, and Relational Database Management System (RDBMS) index selection tools could be applied to the problem. However, for high-dimensional data sets, each of these potential solutions has inherent problems. A multidimensional index over the data set could be developed so that any query can be answered directly by using only the index. However, the performance of multidimensional index structures is subject to Bellman's curse of dimensionality and degrades rapidly as the number of dimensions increases. In the worst cases, such an index would perform much worse than a sequential scan. Another possibility would be to build an index over each single dimension. The effectiveness of this approach is limited by the amount of search space that can be pruned by a single dimension. Another possible solution would be to use some dimensionality reduction technique, index the reduced-dimension data space, and transform the query in the same way that the data was transformed.
However, dimensionality reduction approaches are mostly based on data statistics and perform poorly, especially when the data is not highly correlated. They also introduce a significant overhead in the processing of queries. Yet another possible solution is to apply feature selection to keep the most important attributes of the data according to some criteria and index the reduced dimensionality space. However, traditional feature selection techniques are based on selecting attributes that yield the best classification capabilities. Therefore, they also select attributes based on data statistics to support classification accuracy rather than focusing on query performance and workload in a database domain. In addition, the selected features may offer little or no data pruning capability for the given query attributes.

Many commercial RDBMSs have included index recommendation systems to identify indexes that will work well for a given workload. These tools are optimized

for the domains for which these systems are primarily employed and the indexes that the systems provide. They are targeted towards lower dimensional transactional databases and do not produce results that are optimized for single high-dimensional databases.

One approach can be based on the observation that in many high-dimensional database applications, only a small subset of the overall data dimensions is popular for a majority of queries, and recurring patterns of queried dimensions occur. For example, Large Hadron Collider (LHC) experiments are expected to generate data with up to 500 attributes at the rate of per second [3]. However, the search criterion is expected to consist of parameters. Another example is High-Energy Physics (HEP) experiments, where subatomic particles are accelerated to nearly the speed of light, forcing their collision. Each such collision generates on the order of 1-10 Mbytes of raw data, which corresponds to 300 terabytes of data per year consisting of million objects. The queries are predominantly range queries and mostly involve around five dimensions out of a total of 200 [4].

The high-dimensional database indexing problem can be addressed by selecting a set of lower dimensional indexes based on the joint consideration of query patterns and data statistics. The approach is analogous to dimensionality reduction or feature selection, with the novelty that the reduction is specifically designed to reduce query response times rather than to maintain data energy, as is the case for traditional approaches. The reduction technique might consider both data and access patterns and result in multiple, potentially overlapping sets of dimensions rather than a single set. The new set of low-dimensional indexes might be designed to address a large portion of expected queries and to allow effective pruning of the data space to answer those queries.
Query pattern evolution over time presents another challenging problem. Researchers have proposed workload-based index recommendation techniques. Their long-term effectiveness depends on the stability of the query workload. However, query access patterns may change over time, becoming completely dissimilar from the patterns on which the index set was originally determined. There are many common reasons why query patterns change. A pattern change could be the result of periodic time variation (for example, different database uses at different times of the month or day), a change in the focus of user knowledge discovery (for example, a researcher's discovery spawns new query patterns), a change in the popularity of a search attribute (for example, current events cause an increase in queries for certain search attributes), or simply the random variation of query attributes. When the current query patterns are substantially different from the query patterns used to recommend the database indexes, the system performance will degrade drastically, since incoming queries do not benefit from the existing indexes.

Initial index selection occurs by traversing the query workload representation and determining which frequently occurring attribute set results in the greatest benefit over the entire query set. This process is iterated until an indexing constraint is met or no further improvement is achieved by adding indexes. In order to facilitate online index selection, a control feedback system is proposed with two loops: a fine-grained control loop and a coarse control loop. As new queries arrive, the ratio of the potential performance to the actual performance of the system in terms of cost might be monitored, and based on the parameters set for the control feedback loops, major or minor changes to the recommended index set can be made.
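The decision logic of the two control loops can be illustrated with a minimal sketch. It assumes that performance is measured in page accesses, that each loop fires when the potential-to-actual ratio falls below its own threshold, and that the major-change threshold is the lower of the two. The names `Thresholds`, `performance_ratio` and `decide`, and the threshold values 0.8 and 0.4, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    minor: float  # fine-grained loop fires below this ratio (hypothetical value)
    major: float  # coarse loop fires below this ratio (assumed <= minor)


def performance_ratio(potential_pages: int, actual_pages: int) -> float:
    """Ratio of potential to actual cost, both counted in page accesses."""
    if actual_pages <= 0:
        return 1.0  # nothing was read, so there is nothing to improve
    return potential_pages / actual_pages


def decide(ratio: float, t: Thresholds) -> str:
    """Choose an action for the most recent window of queries."""
    if ratio < t.major:
        return "major_change"  # coarse loop: rerun static index selection
    if ratio < t.minor:
        return "minor_change"  # fine-grained loop: inexpensive index change
    return "no_change"


thresholds = Thresholds(minor=0.8, major=0.4)
print(decide(performance_ratio(90, 100), thresholds))  # no_change
print(decide(performance_ratio(60, 100), thresholds))  # minor_change
print(decide(performance_ratio(20, 100), thresholds))  # major_change
```

Checking the coarse loop first means a very poor ratio always leads to a full reselection rather than a minor adjustment; raising either threshold makes the corresponding change more frequent.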
The main idea behind the development of this project was that an efficient and fast way was needed to retrieve data (frequently referenced items) from high-dimensional databases. Earlier systems had a drawback with respect to searching because they were based on sequential search or static indexing. This was time consuming, and an altogether new approach was needed to remove the drawbacks of the existing system. Moreover, it was also necessary to inform the user about the fast-moving products in the market based on previous transactions.

The online index selection is motivated by the fact that query patterns can change over time. By monitoring the query workload and detecting when there is a change in the query pattern that generated the existing set of indexes, we are able to maintain good performance as query patterns evolve. In our approach, we use control feedback to monitor the performance of the current set of indexes for incoming queries and determine when adjustments should be made to the index set. In a typical control feedback system, the output of a system is monitored, and based on some function involving the input and output, the input to the system is readjusted through a control feedback loop. Our situation is analogous to, but more complex than, the typical electrical circuit control feedback system in several ways:

1. Our system input is a set of indexes and a set of incoming queries rather than a simple input such as an electrical signal.

2. The system output must be some parameter that we can measure and use to make decisions about changing the input. Query performance is the obvious parameter to monitor. However, because lower query performance could be related to aspects other than the index set, our decision-making control function must necessarily be more complex than that of a basic control system.

3. We do not have a predictable function to relate system input and output because of the nondeterminism associated with new incoming queries. For example, we may have a set of attributes that appears in queries frequently enough that our system indicates that it is beneficial to create an index over those attributes, but there is no guarantee that those attributes will ever be queried again.

Control feedback systems can fail to be effective with respect to response time. The control system can be too slow to respond to changes, or it can respond too quickly. If the system is too slow, then it fails to cause the output to change based on input changes in a timely manner. If it responds too quickly, then the output overshoots the target and oscillates around the desired output before reaching it. Both situations are undesirable and should be designed out of the system.

Fig. 1 represents our implementation of dynamic index selection. Our system input is a set of indexes and a set of incoming queries. Our system simulates and estimates costs for the execution of incoming queries. System output is the ratio of the potential system performance to the actual system performance in terms of database page accesses to answer the most recent queries. We implement two control feedback loops. One is for fine-grained control and is used to recommend minor, inexpensive changes to the index set. The other loop is for coarse control and is used to avoid very poor system performance by recommending major index set changes. Each control feedback loop has decision logic associated with it.

Figure 1: Dynamic Index Analysis Framework

1.1 Fine-Grained Control Loop

The fine-grained control loop is used to recommend low-cost minor changes in the index set. This loop is entered when the ratio of the hypothetical performance to the actual performance is below some input minor-change threshold.
Then, the indexes are changed from I (the current set of attribute sets used as indexes) to I_new (the hypothetical set of attribute sets used as indexes), and appropriate changes are made to update the system data structures. Increasing the input minor-change threshold causes the frequency of minor changes to also increase.

1.2 Coarse Control Loop

The coarse control loop is used to recommend changes that are more costly but have a greater impact on the future performance of the index set. This loop is entered when the ratio of the hypothetical performance to the actual performance is below some input major-change threshold. Then, static index selection is performed over the last w (window size) queries, abstract representations are recomputed, and a new set of suggested indexes I_new is generated. Appropriate changes are made to update the system data structures to the new situation. Increasing the input major-change threshold increases the frequency of major changes.

1.3 Challenges in High-dimensional Databases

Dimensionality is an issue that can arise in every scientific field. Generally speaking, the difficulty lies in how to visualize a high-dimensional function or data set. This area has become increasingly important due to the advent of computer and graphics technology.

1.3.1 Curse of Dimensionality

Curse of dimensionality is a term coined by Richard Bellman (1961) for the problem caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. It is a significant obstacle in high-dimensional data analysis: a local neighbourhood in higher dimensions is no longer local or, to put it another way, sparsity increases exponentially for a fixed number of data points. This is illustrated below: 64 data points are simulated from a uniform (0, 1) distribution. In one dimension, all the data points are clustered together.
However, in two dimensions, the data become much sparser, and this is even more obvious in three dimensions. So, to achieve the same accuracy, much larger data sets are needed even when the dimension is moderate, and such large data sets are not available in practical situations [1].
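The growth in sparsity can be made concrete with a small calculation. The sketch below assumes points drawn uniformly from the unit hypercube and computes the expected number of the 64 points that fall into one cell of side 0.5; the function name `expected_points_in_cell` and the cell side are illustrative choices, not part of the paper.

```python
def expected_points_in_cell(n_points: int, side: float, dims: int) -> float:
    """Expected number of uniform points inside a hypercube cell of the given side.

    For points uniform on the unit hypercube, a cell of side `side` has
    volume side**dims, so it holds n_points * side**dims points on average.
    """
    return n_points * side ** dims


for d in (1, 2, 3, 10):
    print(d, expected_points_in_cell(64, 0.5, d))
# 1 32.0
# 2 16.0
# 3 8.0
# 10 0.0625
```

With the number of points fixed, each added dimension halves the expected occupancy of the cell, which is the exponential sparsity described above.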

Figure 2: Data Clustering

1.3.2 Need for Index

In database design, an index is defined as a list of keys (or keywords), each of which identifies a unique record.

Indices make it faster to find specific records and to sort records by the index field, that is, the field used to identify each record. There are three options available with respect to searching. The first is sequential search, which scans each record in turn. The second is static indexing, in which a fixed number of indexes is given to each of the products, which are then searched with respect to the given index. The third option is dynamic indexing, details of which are explained in Section 2 of the paper.

The rest of this paper is organized as follows: Section 2 presents the related work in this area and our proposed index selection and control feedback framework. Section 3 presents the summary, conclusion and potential further enhancements to this work.

2. RELATED WORK

We have developed a flexible index selection framework to achieve dynamic index selection for high-dimensional data with the help of two parameters, namely the support parameter and the confidence parameter. A control feedback technique is introduced for measuring the performance. Because a database can benefit from an index change when the query pattern changes over time, online index selection is designed with that motivation. The support and confidence parameters for different products are shown in figures later, giving the user an idea of the fast-moving products and of the frequently requested items.

2.1 Index Selection

The index selection problem has been identified as a variation of the Knapsack problem, and several papers proposed designs for index recommendations based on optimization rules. These earlier designs could not take advantage of the query optimizers of modern database systems.
Currently, almost every commercial RDBMS provides users with an index recommendation tool that is based on a query workload and uses the query optimizer to obtain cost estimates. A query workload is a set of SQL data manipulation statements. The query workload should be a good representative of the types of queries that an application supports. Microsoft SQL Server's AutoAdmin tool selects a set of indexes for use with a specific data set, given a query workload. In the AutoAdmin algorithm, an iterative process is used to find an optimal configuration. First, one-dimensional candidate indexes are chosen. Then, a candidate index selection step evaluates the queries in a given query workload and eliminates from consideration those candidate indexes that would not provide a useful benefit. The remaining candidate indexes are evaluated in terms of the estimated performance improvement and index cost. The process is iterated for increasingly wider multicolumn indexes until a maximum index width threshold is reached or an iteration does not yield any improvement in performance over the last iteration. Costs are estimated using the query optimizer, which is limited to considering those physical designs offered by the DBMS [2].

2.2 Introduction to Dynamic Indexing

Thus far, it has been assumed that the document collection is static. This is fine for collections that change infrequently or never (e.g., the Bible or Shakespeare). But most collections are modified frequently, with documents being added, deleted and updated. This means that new terms need to be added to the database, and existing terms have to be updated [2]. The simplest way to achieve this is to periodically reconstruct the index from scratch. This is a good solution if the number of changes over time is small and a delay in making new documents searchable is acceptable, and if enough resources are available to construct a new index while the old one is still available for querying.
In many high-dimensional database applications, only a small subset of the overall data dimensions is popular for a majority of queries, and recurring patterns of queried dimensions occur. The high-dimensional database indexing problem could be addressed by selecting a set of lower dimensional indexes based on the joint consideration of query patterns and data statistics. The new set of low-dimensional indexes is designed to address a large portion of expected queries and allows effective pruning of the data space to answer those queries. The challenging problem here is that query access patterns may change over time, becoming completely dissimilar from the patterns on which the index set was originally determined. When the current


query patterns are substantially different from the query patterns used to recommend the database indexes, the system performance will degrade drastically, since incoming queries do not benefit from the existing indexes. To make this approach practical in the presence of a query pattern change, the index set should evolve with the query patterns. For this reason, a dynamic mechanism is implemented to detect when the access patterns have changed enough that the introduction of a new index, the replacement of an existing index, or the construction of an entirely new index set is beneficial.

Online Index Selection

The index selection technique uses the query workload and the data set to generate an abstract representation of the query workload by mining patterns in the workload. This abstraction consists of a set of attribute sets that occur frequently over the entire query set. The algorithms used are frequent item set mining and association rule mining.

Frequent Item Set Mining Module

This module groups frequently occurring items into a set, considering the items that occur frequently in the set of transactions under consideration. Its output is used as the input for association rule mining. The module takes the transaction list as input. When the number of transactions is a multiple of five, it groups the items that have occurred frequently in the transactions and stores them in the database. It considers only the items present in the transactions made by the user. In the frequent item set algorithm, the frequent item set of products is thus generated from the transactions made by the user.
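The grouping step of the frequent item set module can be sketched as follows. The paper does not say which mining algorithm is used, so this sketch swaps in plain brute-force enumeration of item combinations, which is adequate only for small batches such as the five-transaction batches mentioned above. The function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the sample products are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for item sets with support >= min_support.

    Support here is the fraction of transactions containing the item set.
    Brute-force counting: fine for small batches, not for large data.
    """
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# One batch of five transactions (hypothetical products).
transactions = [
    {"pen", "ink"},
    {"pen", "ink", "paper"},
    {"pen", "paper"},
    {"ink"},
    {"pen", "ink"},
]
result = frequent_itemsets(transactions, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), sup)
```

For this batch the frequent item sets are {ink}, {pen} and {ink, pen}; an Apriori-style algorithm would find the same sets while pruning candidates whose subsets are already infrequent.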
The greatest benefit of generating the frequent item set over the entire queried set with the frequent item set mining algorithm lies in the allotment of indexes [3].

Association Rule Mining

By applying association rule mining, the relationship between the records is established. This module finds the relationships between the items in the frequent item set and forwards them to the next module, which calculates support and confidence. It takes as input the frequent item set, which is generated after every five transactions, and is therefore called once for every five transactions. Association rule mining is used to determine the relationship between the records, find the support and confidence parameters, and hence determine the pattern of frequently occurring items [4].

Support and Confidence Module

This module forms the core part of the system and calculates the support and confidence parameters for each item set. It is mainly responsible for informing the administrator, through these parameters, about the products that are likely to move fast in the future and about the frequently requested items.

The support parameter gives the assurance with which a product can move in the future, considering the past transaction list. The formula to calculate support is defined as:

$$\text{Support} = \frac{(X \cup Y).\text{count}}{n}$$

The confidence parameter gives the probability with which frequent products can move in relation to one another. The formula to calculate confidence is defined as:

$$\text{Confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$

Here $(X \cup Y).\text{count}$ is the number of transactions that contain every item of both X and Y, $X.\text{count}$ is the number of transactions that contain X, and $n$ is the total number of transactions.

The snapshots below show the login page, different shopping pages, and the support and confidence analysis based on the customers' online browsing of the products.

Figure 3: Login Page
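As a worked illustration of the support and confidence formulas, the following sketch evaluates both for a single rule X -> Y over a small transaction list. The helper names `support_count`, `support` and `confidence`, and the sample products, are hypothetical; returning 0.0 when X never occurs is a convention chosen for this sketch, not something the paper specifies.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, x, y):
    """Support of the rule X -> Y: (X u Y).count / n."""
    return support_count(transactions, set(x) | set(y)) / len(transactions)


def confidence(transactions, x, y):
    """Confidence of the rule X -> Y: (X u Y).count / X.count."""
    x_count = support_count(transactions, x)
    if x_count == 0:
        return 0.0  # assumed convention: the rule is undefined if X never occurs
    return support_count(transactions, set(x) | set(y)) / x_count


transactions = [
    {"pen", "ink"},
    {"pen", "ink", "paper"},
    {"pen", "paper"},
    {"ink"},
    {"pen", "ink"},
]
print(support(transactions, {"pen"}, {"ink"}))     # 0.6
print(confidence(transactions, {"pen"}, {"ink"}))  # 0.75
```

Here pen and ink appear together in 3 of the 5 transactions, so the support is 3/5, and pen appears in 4 transactions, so the confidence of pen -> ink is 3/4.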


Figure 4: Shopping Page 1

Figure 5: Shopping Page 2

Figure 6: Support and Confidence

Figure 7: Search Table

3. CONCLUSION AND SUMMARY

Because static indexing and sequential search have many shortcomings, a new method called online indexing (indexes are assigned dynamically to the products present in the frequent item set) was proposed to remove the shortcomings of the current environment. The major problem associated with huge databases is indexing and retrieving the frequently occurring products as quickly as possible, so as to reduce search time and increase performance; this is done with the help of online indexing and the other parameters. The concept implemented here gives a general outline of the entire process of providing online indexing. To realize the concept, two parameters called support and confidence (from data mining) were used to identify the fast-moving and frequent products from the transactions made by users in the past. The support parameter gives the assurance with which a product can move in the future, and the confidence parameter gives the probability with which frequent products can move in relation to one another. The frequent item set mining algorithm is implemented to separate the frequent items efficiently from the transactions made. The association rule mining algorithm is applied to establish relationships among the products in the frequent item set. To conclude, the higher the support and confidence values for different products, the higher the assurance or probability that they will move well in the future. With the help of these parameters, one may predict the sale of fast-moving products in the future. The implementation presented here may be carried out for high-dimensional databases.
The concept could be applied to high-dimensional databases in real-world applications, and a provision for calculating the support and confidence for more than three products could be added. Also, the admin could be provided with the option of viewing the support and confidence values and the requested items in a list.

REFERENCES

[1] Li Wang, Department of Statistics and Probability, Michigan State University, High Dimensional Data Analysis, pp

[2] Stephane Azefack, Kamel Aouiche and Jerome Darmont, Dynamic Index Selection in Data Warehouses, Proc. 4th Int'l Conf. Innovations in Information Technology (Innovations '07), 2007, pp

[3] Bart Goethals, Frequent Set Mining, Data Mining and Knowledge Discovery Handbook, ISBN: , pp

[4] Sotiris Kotsiantis, Dimitris Kanellopoulos, Dept of Mathematics, University of Patras, Greece, Association Rules Mining: A Recent Overview, 2006, pp

[5] G. Jayalakshmi, Dr. K. Nageswara Rao, Mining Association Rules for Large Transactions using New Support and Confidence Measures, Journal of Theoretical and Applied Information Technology, 7, No. 2, 2009, pp

[6] Bert Bates, Kathy Sierra, Head First Java, O'Reilly Media, ISBN:
