An Efficient Interval Query Algorithm Based on Inverted List in Cloud Environment *

Proceeding of the IEEE International Conference on Information and Automation, Shenyang, China, June 2012

Zhiqiong Wang, Ke Gong, Shikai Jin, Wenjun Li and Zixi Liu
Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, P.R. China
wangzq@bmie.neu.edu.cn

Abstract - Interval overlap query plays an increasingly significant role in genomics research and the development of biomedicine. However, traditional query approaches that run on a single computer offer only limited query speed. An algorithm based on cloud computing technology, named CNCList+, has been proposed to increase the query speed. Nevertheless, CNCList+ must scan the data of the subgroups in order during every query, which limits the speedup it achieves. Considering the significant role of the inverted list in data indexing, this paper combines the concept of the inverted list with cloud computing to form an efficient query algorithm named IQIL that further increases the query speed. Detailed comparison experiments between IQIL and CNCList+ show that IQIL achieves a higher query speed, demonstrating its ability to address the limited query speed of interval overlap query.

Index Terms - Cloud Computing; Inverted List; Interval Overlap Query; Performance.

I. INTRODUCTION

Interval overlap query has played a significant role in the development of modern genomics, and it has proved to be a fundamental and essential tool in theoretical research and practical applications of biomedicine.
One application of interval overlap query is to compare the degree of overlap between a sequence under test and the target sequence interval of a specific disease. From this comparison, the potential risk that people or animals carrying such a sequence may develop the disease can be predicted, which greatly helps early diagnosis and clinical treatment. Interval overlap query is therefore of great importance in modern biomedicine. Nevertheless, many problems hinder its effective use. Among them, the limited query speed of traditional query methods is the most prominent. An efficient interval query algorithm for genome alignment and interval databases in a cloud environment, named CNCList+, was created [1]; it accelerates queries dramatically by exploiting the capability of cloud computing to handle large amounts of data. However, CNCList+ must scan the data of the subgroups in order during every query, which limits the speedup. Considering the great importance of the inverted index in data indexing, we combine the inverted index method with cloud computing to further speed up interval overlap query.

The major contributions of this paper can be summarized as follows:
1) An efficient interval overlap query algorithm based on inverted list (IQIL) is proposed, which combines the concept of the inverted list with the capacity of cloud computing to handle massive data volumes.
2) Experiments are carried out to demonstrate that IQIL achieves a higher query speed than CNCList+.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 describes the general idea of CNCList+. Section 4 introduces the details and process of IQIL.
Comparison experiments are presented and the experimental results are analyzed in Section 5. Finally, we conclude this paper in Section 6.

* The paper was supported by Liaoning Provincial Natural Science Foundation of China (No ) and Overseas Distinguished Foreign Expert Project of Universities directly under the Ministry of Education (No. MS2011DBDX021).

II. RELATED WORK

Since interval overlap query has become increasingly important in biomedicine, many approaches to this problem have been attempted. The human genome browser (HGB) at UCSC [2] provides access to the sequence and annotations of the human genome. GALA [3], a database of genomic DNA sequence alignments and annotations, was developed to execute complex queries across multiple forms of information or multiple genes simultaneously. Segment R-tree [4] was designed as an indexing technique for interval data in multiple dimensions. In addition, the conventional multi-column B-Tree has also been used, although it is only suitable for small databases. Besides these conventional methods, some techniques from the related field of spatio-temporal indexing have emerged. The Relational Interval Tree [5] was proposed for any relational or object-relational table containing intervals. Based on the Relational Interval Tree, a new join algorithm [6] was created for interval data. MV3R-Tree [7] was presented as a structure that utilizes the concepts of multi-version B-trees and 3D R-trees. An interval overlap query algorithm named NCList [8] was created to accelerate interval query of genome alignment and interval databases. However, all these methods are executed on a single computer, which limits the query speed.

An efficient query algorithm based on cloud computing named CNCList and its advanced version CNCList+ were created. The idea of CNCList is that all the sequences are assigned to several big groups under the rule that all the sequences inside one big group must be in ascending order. The query is then carried out by distributing the task of querying the sequences inside each big group across several ordinary computers. In this way, all the query processes are executed simultaneously, which improves the query speed dramatically. Additionally, two optimization strategies, subgroup formation and the boundary interval filter, are applied to CNCList, forming an advanced query algorithm named CNCList+ that further enhances the query speed.

Besides these query algorithms, many techniques related to the inverted list have arisen because of its significant status in database query. A new ranking paradigm for relational databases called Structured Value Ranking, supported by a new family of inverted list indices and associated query algorithms, was designed [9]. An Apriori algorithm was presented for mining frequent patterns based on inverted list [10]. In addition, a combination-tree algorithm was created for mining frequent patterns based on inverted list [11]. By studying efficient query processing in distributed web search engines with global index organization, an optimized inverted list assignment for distributed search engine architectures was proposed [12]. Furthermore, a character-based indexing algorithm that stores all locations of the target text in the inverted list in bit form was introduced [13].

III. CNCLIST+

At the beginning, all the sequences are assigned to several original big groups such that all the sequences inside the same big group are in ascending order, which also means that there is no containment relationship within any big group. Then two optimization strategies, subgroup formation and the boundary interval filter, are introduced to further increase the query speed. Subgroup formation divides every original big group into several subgroups according to two rules, the maximum efficiency length rule and the adjacent gap rule, which optimizes the query process. The boundary interval filter marks every big group and subgroup as an interval, which is checked for inclusion against the target query interval. If a big group or subgroup is completely contained in the target query interval, all the sequences within it are result sequences. Conversely, if a big group or subgroup lies entirely outside the target query interval, all the sequences inside it are discarded. If the relationship is neither full inclusion nor complete exclusion, which means the two intersect, the query process is executed on the intersecting big groups or subgroups only. Owing to these optimizations and to cloud computing technology, CNCList+ performs better than other traditional algorithms on final query time. However, since CNCList+ needs to scan the data of the subgroups in order during every query, the speedup it achieves is limited. If the ability of cloud computing to process massive data volumes can be put to better use, the query speed can be further increased.
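The three cases of the boundary interval filter can be illustrated with a short Python sketch. This is only an illustration of the rule as described above, not the CNCList+ implementation: the closed-interval convention, the function names, and the sample coordinates are assumptions made for this example, and the distribution of groups across machines is omitted.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (start, end); an assumption of this sketch


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def boundary(group: List[Interval]) -> Interval:
    """Smallest interval that covers every sequence in the group."""
    return (min(s for s, _ in group), max(e for _, e in group))


def filter_groups(groups: List[List[Interval]], query: Interval) -> List[Interval]:
    """Apply the three-way boundary check to each group (hypothetical helper)."""
    results: List[Interval] = []
    for group in groups:
        lo, hi = boundary(group)
        if query[0] <= lo and hi <= query[1]:
            # Group fully contained in the query: every sequence is a result.
            results.extend(group)
        elif not overlaps((lo, hi), query):
            # Group entirely outside the query: discard it without scanning.
            continue
        else:
            # Partial intersection: only now scan the sequences one by one.
            results.extend(iv for iv in group if overlaps(iv, query))
    return results


# Hypothetical data: three groups, each sorted in ascending order.
groups = [[(1, 3), (2, 4)], [(10, 12), (11, 15)], [(5, 8), (17, 20)]]
hits = filter_groups(groups, (6, 16))
```

In this example the first group is discarded, the second is accepted as a whole, and only the third is scanned sequence by sequence.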
Additionally, since the inverted list is of great importance in the database field, we combine cloud computing with the inverted list to address the problem of limited query speed, forming a new interval overlap query algorithm based on inverted list (IQIL).

IV. INTERVAL QUERY BASED ON INVERTED LIST

A. Sub-intervals Formation

All the sequences are cut into several fragments during sub-intervals formation. First, scanning begins at the sequence whose start is leftmost. Second, when the start or the end of any sequence is encountered, the interval from the start of the leftmost sequence to that point is cut off. Scanning then continues from that point until another start or end of a sequence is encountered, and, in the same way, the interval between these two points is cut off. Sub-intervals formation continues until the rightmost end of a sequence has been scanned.

As shown in Figure 1, we mark the sequences as S1, S2, S3, S4 and S5 in ascending order of their starts. Scanning begins with S1, since its start is leftmost, and proceeds from left to right until the first start or end of a sequence is encountered, which is the start of S2. According to our sub-intervals formation rule, the interval between the start of S1 and the start of S2 is cut off, forming a sub-interval marked as A, as shown in Figure 2. Scanning then continues from the start of S2 until another start or end of a sequence is encountered, which is the start of S3. By the same rule, the segment from the start of S2 to the start of S3 is cut off, forming a new sub-interval marked as B in Figure 2. The process continues until the rightmost end of a sequence, which is the end of S5 in our example, is scanned. The result, shown in Figure 2, is nine sub-intervals marked as A, B, C, ..., I.

Fig. 1 There are five sequences marked as S1, S2, S3, S4 and S5 in the process of sub-intervals formation.

Fig. 2 The first interval forms between the start of S1 and the first encountered start or end of a sequence, which is the start of S2. Then another interval forms from the start of S2 to the second start or end of a sequence, which is the start of S3. The sub-interval formation continues until all the sequences are scanned.

B. Intersection Checking Between Sub-intervals and Target Query Interval

Fig. 3 All the sub-intervals will be stored after the sub-interval formation. Then the intersection relationship between the target query interval T and all the sub-intervals will be checked; as a result, the sub-intervals C, D and E intersect with target query interval T. These qualifying sub-intervals are the source from which the result sequences, S1, S2, S3 and S4, are traced back.

Since all the sequences have been cut into sub-intervals, all the sub-intervals are stored. To find the result sequences that overlap the target query interval, we only need to check whether each sub-interval intersects the target query interval.
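Sub-intervals formation and the intersection check can be sketched in Python as follows. This is a minimal single-machine illustration, not the authors' implementation: the half-open interval convention, the function names, and the coordinates of S1 to S5 are assumptions (the paper gives the figures but no coordinates), chosen so that nine sub-intervals are formed and the query T intersects the third, fourth and fifth of them.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumption of this sketch


def form_sub_intervals(sequences: Dict[str, Interval]) -> List[Interval]:
    """Cut the scanned range at every sequence start and end, left to right."""
    points = sorted({p for iv in sequences.values() for p in iv})
    return list(zip(points, points[1:]))


def build_inverted_table(
    sequences: Dict[str, Interval], sub_intervals: List[Interval]
) -> Dict[Interval, List[str]]:
    """Map each sub-interval to the names of the sequences that contain it."""
    table: Dict[Interval, List[str]] = {}
    for lo, hi in sub_intervals:
        owners = [name for name, (s, e) in sequences.items() if s <= lo and hi <= e]
        if owners:  # skip gaps that no sequence covers
            table[(lo, hi)] = owners
    return table


def query(table: Dict[Interval, List[str]], target: Interval) -> List[str]:
    """Trace every sub-interval that intersects target back to its sequences."""
    t_lo, t_hi = target
    hits = set()
    for (lo, hi), owners in table.items():
        if lo < t_hi and t_lo < hi:  # half-open intersection test
            hits.update(owners)
    return sorted(hits)


# Hypothetical coordinates standing in for S1..S5 of Figure 1.
sequences = {"S1": (0, 40), "S2": (10, 50), "S3": (20, 60), "S4": (30, 70), "S5": (65, 90)}
subs = form_sub_intervals(sequences)          # nine sub-intervals, A..I
table = build_inverted_table(sequences, subs)
print(query(table, (25, 45)))                 # ['S1', 'S2', 'S3', 'S4']
```

Because each sub-interval is bounded by consecutive cut points, every sequence either contains a sub-interval completely or does not touch its interior, which is what allows the table lookup to replace a scan of the sequences themselves.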
If there is an intersection, the sequences that contain that sub-interval are result sequences.

As shown in Figure 3, all the sequences from S1 to S5 have been cut into nine sub-intervals, A, B, C, ..., I, which are then stored. The goal in our example is to find the sequences that overlap the target query interval T, so we check each sub-interval for intersection with T. In Figure 3, sub-intervals C, D and E intersect with T and are therefore the qualifying sub-intervals. The result sequences are traced back from the qualifying sub-intervals, because every result sequence must contain at least one of them. The result sequences are S1, S2, S3 and S4, as shown in Figure 3.

C. Process of IQIL

As shown in Algorithm I in the Appendix, all the sequences are first cut into several sub-intervals based on the rules above, and every unit-length segment of the same sub-interval is marked with the same label (see Function 1). Second, the intersection relationship between the target query interval and the sub-intervals is checked, and all the qualifying sub-intervals are obtained through the labels (see Function 2). Using the inverted table that has been formed, the map function traces the qualifying sub-intervals back to the qualifying sequences and passes the data to the reduce function (see Function 3). Finally, the reduce function produces all the result sequences from the data received from the map function (see Function 4).

V. EXPERIMENTAL EVALUATION

A. Setup

In our experiments, we use Ubuntu 10.04, linux generic as the operating system. The environment is Hadoop 0.21, the RAM is 2.0 GB, and the switch is a net-core NSD1016D (16-port Fast Ethernet switch, 10M/100M). The CPUs and remaining disk space of the machines are listed in Table I.
TABLE I
Machine Configuration

CPU                        Remaining Disk Space
Intel core.2.duo E GHz     26.3GB
Intel core.2.duo E GHz     261.1GB
Intel core.2.duo E GHz     27.4GB
Intel core.2.duo E GHz     23.9GB
Intel core.2.duo E GHz     27.9GB
Intel core.2.duo E GHz     27.3GB
Intel core.2.duo E GHz     27.0GB
Intel core.2.duo E GHz     27.7GB

B. Different Number of Machines

Since the capability of cloud computing comes from the simultaneous operation of many ordinary computers that share computational resources and constitute a large distributed cluster system, the number of machines is a key factor in the final query time. The following experiment tests the relationship between the final query time and the number of machines. In this experiment, the variable is the number of machines and the amount of test data is set to 10GB.

Result analysis: 1) Figure 4 shows that for both CNCList+ and IQIL, the more machines are used, the shorter the query time; 2) for the same number of machines, the query time of IQIL is considerably shorter than that of CNCList+. Thus, IQIL demonstrates a higher query speed than CNCList+ for the same number of machines.

Fig. 4 Relationship between number of machines and query time.

C. Different Amount of Data

With the platform configuration and the number of machines fixed, we investigate whether the amount of data affects the final query time. The final experiment therefore tests the influence of the amount of data on the final query time. In this experiment, the variable is the amount of data and the number of machines is set to eight. The rest of the configuration is kept the same to guarantee accuracy. The result is shown in Figure 5.

Result analysis: according to the trend of the experimental curves, we can see that: 1) the amount of data has no remarkable effect on the query time of IQIL, and in some cases the query time for a larger amount of data is even shorter than that for a smaller amount. For example, 8GB of data has a query time of 35 seconds, whereas 10GB of data needs only 28 seconds. The reason is that the query process checks for the intersecting sub-intervals and traces back to all the result sequences containing these sub-intervals. The amount of data represents the number of sequences. Since tracing back to the result sequences that contain intersecting sub-intervals consumes very little time, data size does not significantly contribute to the query time. The number of sub-intervals, however, has a major impact, because the time spent checking for intersecting sub-intervals increases with the total number of sub-intervals, so a data set takes longer to query whenever it yields more sub-intervals, regardless of its size. This can explain why the query time for 10GB is shorter than that for 8GB in Figure 5; 2) for CNCList+, the larger the amount of data, the longer the final query time; 3) for the same amount of data, the query time of IQIL is distinctly shorter than that of CNCList+.

Fig. 5 Relationship between the amount of data and query time.

D. Summary

The above experiments show that IQIL achieves a higher query speed than CNCList+ both for the same number of machines and for the same amount of data. In addition, the query time can be shortened by increasing the number of machines. Thus, for researchers focusing on genomics and biomedicine, the IQIL algorithm based on cloud computing is an attractive choice for improving interval query efficiency and reducing the query time when interval queries over massive data volumes need to be executed.

VI. CONCLUSION

Overlap interval query has become an increasingly crucial tool for data mining and clinical diagnosis in the biomedical field. Many biomedical doctors and scientists have studied it, and many approaches have been applied, but they are all based on a single computer. An efficient algorithm based on cloud computing named CNCList+ has been proposed and increases the query speed dramatically. However, since the data of the subgroups need to be scanned in order during every query of CNCList+, the speedup it achieves is limited.
Considering the great importance of the inverted list in data indexing, the inverted index method and cloud computing are combined, forming a new efficient interval query algorithm named IQIL that further speeds up interval overlap query. Experiments demonstrate that IQIL achieves a higher query speed than CNCList+. Consequently, the IQIL algorithm can help biomedical researchers with data mining and clinical diagnosis, and make interval overlap query a more practical tool in biomedicine.

REFERENCES

[1] Z. Wang, K. Gong, S. Jin, W. Li and Z. Liu, "Efficient Interval Query of Genome Alignment and Interval Databases in Cloud Environment," ICCIP 2012, in press.
[2] W.J. Kent, C.W. Sugnet, T.S. Furey, et al., "The human genome browser at UCSC," Genome Res., 12 (2002).
[3] B. Giardine, L. Elnitski, C. Riemer, et al., "GALA, a database for genomic sequence alignments and annotations," Genome Res., 13 (2003).
[4] C.P. Kolovson and M. Stonebraker, "Segment indexes: dynamic indexing techniques for multi-dimensional interval data," in SIGMOD Conference (1991).
[5] H.P. Kriegel, M. Pötke and T. Seidl, "Managing intervals efficiently in object-relational databases," in Proc. 26th International Conference on VLDB, Cairo (2000).
[6] J. Enderle, M. Hampel and T. Seidl, "Joining interval data in relational databases," in Proc. ACM SIGMOD Conference on Management of Data, Paris (2004).
[7] Y. Tao and D. Papadias, "MV3R-tree: a spatio-temporal access method for timestamp and interval queries," in Proc. 27th VLDB Conference, Roma (2001).
[8] A.V. Alekseyenko and C.J. Lee, "Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases," Bioinformatics 23(11) (2007).
[9] L. Guo, J. Shanmugasundaram, K. Beyer and E. Shekita, "Efficient Inverted Lists and Query Algorithms for Structured Value Ranking in Update Intensive Relational Databases," in Proc. of the International Conference on Data Engineering (2005).
[10] Y. Liu and Y. Hu, "Mining Frequent Patterns Based on Inverted List," in Proc. of the International Conference on Machine Learning and Cybernetics (2006).
[11] Y. Liu and Y. Hu, "Combination Tree for Mining Frequent Patterns Based on Inverted List," in Proc. of the International Conference on Computational Intelligence and Security (2006).
[12] J. Zhang and T. Suel, "Optimized Inverted List Assignment in Distributed Search Engine Architectures," in Proc. of the IEEE International Symposium on Parallel and Distributed Processing.
[13] C. Khancome and V. Boonjing, "Character-Based Indexing Using Inverted Lists," in Proc. of the International Conference on Computer Technology and Development.

APPENDIX

Algorithm I

//Function 1
Encode_every_intervals (allintervals):
    int i=0;
    FOR_each (interval m in allintervals)
        i++;
        FOR_each (int element in m)
            Mark (element, i);
        END
    END

//Function 2
Locate_queried_interval (queried_interval):
    intervalset=intervals_between(queried_interval.begin.mark, queried_interval.end.mark);
    return intervalset;

//Function 3
Map:
    geneinterval=inverted_table (one interval of intervalset);
    pass_to_reduce (geneinterval);

//Function 4
Reduce:
    pass_to_resultfile (data_received_from_map);

//Function main
Main:
    encode_every_intervals (allintervals);
    intervalset=locate_queried_interval(queried_interval);
    setup_mapreduce (map, reduce);
    mapreduce_beginwork ();
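The label-based lookup of Algorithm I can be approximated by the following Python sketch, which mirrors Functions 1 to 4 in a single process. It is an illustration under stated assumptions rather than the authors' Hadoop code: map and reduce are plain functions instead of MapReduce jobs, intervals are half-open with integer unit positions, the sub-intervals are contiguous so that labels are consecutive, the query lies inside the marked range, and the sample sub-intervals and inverted table are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumption of this sketch


def encode_every_interval(sub_intervals: List[Interval]) -> Dict[int, int]:
    """Function 1: mark every unit-length position with its sub-interval label."""
    marks: Dict[int, int] = {}
    for label, (lo, hi) in enumerate(sub_intervals, start=1):
        for position in range(lo, hi):
            marks[position] = label
    return marks


def locate_queried_interval(marks: Dict[int, int], queried: Interval) -> List[int]:
    """Function 2: labels between the marks of the query's first and last position."""
    first, last = marks[queried[0]], marks[queried[1] - 1]
    return list(range(first, last + 1))


def map_fn(label: int, inverted_table: Dict[int, List[str]]) -> List[Tuple[str, int]]:
    """Function 3: emit (sequence, label) pairs for one qualifying sub-interval."""
    return [(name, label) for name in inverted_table.get(label, [])]


def reduce_fn(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Function 4: group the emitted pairs by sequence name."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for name, label in pairs:
        grouped[name].append(label)
    return dict(grouped)


def run_query(sub_intervals, inverted_table, queried) -> Dict[str, List[int]]:
    """Main: encode, locate, then run map and reduce sequentially."""
    marks = encode_every_interval(sub_intervals)
    labels = locate_queried_interval(marks, queried)
    pairs = [pair for label in labels for pair in map_fn(label, inverted_table)]
    return reduce_fn(pairs)


# Hypothetical sub-intervals (labels 1..9) and inverted table for five sequences.
sub_intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
                 (50, 60), (60, 65), (65, 70), (70, 90)]
inverted_table = {1: ["S1"], 2: ["S1", "S2"], 3: ["S1", "S2", "S3"],
                  4: ["S1", "S2", "S3", "S4"], 5: ["S2", "S3", "S4"],
                  6: ["S3", "S4"], 7: ["S4"], 8: ["S4", "S5"], 9: ["S5"]}
result = run_query(sub_intervals, inverted_table, (25, 45))
print(sorted(result))  # ['S1', 'S2', 'S3', 'S4']
```

Under these assumptions, marking each position once means that locating the qualifying sub-intervals takes two dictionary lookups per query, rather than a comparison against every sub-interval.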
