Efficient Top-K Problem Solvings for More Users in Tree-Oriented Data Structures


Matúš Ondreička
Faculty of Mathematics and Physics, Department of Software Engineering
Charles University in Prague, Czech Republic

Jaroslav Pokorný
Faculty of Mathematics and Physics, Department of Software Engineering
Charles University in Prague, Czech Republic

Abstract

This paper focuses on efficiently searching for the best K objects over several attributes according to user preferences. User preferences are modelled locally with fuzzy functions and globally with an aggregation function. To support local preferences, we use a B+-tree for sorting objects according to a fuzzy function. We deal with the usage of TA-algorithm, which uses B+-trees, and MD-algorithm, which is based on the multidimensional B-tree. In this paper we develop a new algorithm, MXT-algorithm, which is based on the integration of MD-algorithm with several instances of TA-algorithm. We also develop a new tree-oriented data structure based on B+-trees, the multidimensional B-tree with lists, in which MXT-algorithm can effectively find the best K objects according to user preferences. Finally, we show that, according to the type of the object attribute domains, it is possible to choose the best data structure for object storage and also the top-k algorithm for efficient top-k problem solving.

1 Introduction

Nowadays, users of various systems are trying to find various objects, such as flats, cars, holiday stays, etc. These objects have several attributes. Each user looks for objects with different values of these attributes [12][14]. In general, a user looks for the few objects that are most convenient for him/her. Sometimes a user looks for only one best object; for example, he/she can buy only one flat. In this paper we assume a set of objects of the same type, which is stored in one data structure. This data structure is common to all users.
It is possible to find the best K objects from the set of objects X with several attributes only if it is possible to decide which objects are better or worse for a user. For this purpose we can use a ranking function. Moreover, every user ranks objects according to his/her own preferences.

1.1 Related work

The problem of searching for the best K objects according to the values of different attributes at the same time is known as the top-k problem [12][13]. In the last few years, research on top-k problem solving has been in progress in various domains such as relational databases [1], XML [10], multimedia search [12], the Web [14], and distributed systems [11]. In this paper we focus on the family of Fagin's algorithms [13], which has been widely studied for efficiently computing top-k queries. These algorithms assume that the set of objects is stored in lists and that the ranking functions are monotone. We say that a function f is monotone if $f(x_1, \ldots, x_m) \le f(x'_1, \ldots, x'_m)$ whenever $x_i \le x'_i$ for every i.

However, ranking functions are not necessarily monotone. Top-k problem solving with arbitrary ranking functions was presented in [6] and [7]. These two approaches use an analytic expression of a ranking function and tree-oriented data structures. The OPT* algorithm [7] indexes all attributes by B+-trees, and the authors of [6] use indexing by B+-trees, too.

In this paper user preferences are modelled locally with fuzzy functions and globally with an aggregation function [3][8], i.e. we are using arbitrary ranking functions. Moreover, in the context of local preferences, we focus on nominal attributes and ordinal attributes. To apply local preferences, we describe the usage of a B+-tree for sorting objects according to a fuzzy function [2][3][5]. We focus on searching for the best K objects without accessing all the objects. Therefore we deal with methods and data structures for effective top-k problem solving via Fagin's TA-algorithm [13] and also MD-algorithm [2], which is based on the multidimensional B-tree (MDB-tree) [15].

1.2 Main contribution

In this paper TA-algorithm and MD-algorithm use data structures based on B+-trees. These tree-oriented data structures are independent of user preferences. Moreover, it is possible to update these data structures easily and quickly. We developed a new top-k algorithm, MXT-algorithm, and a new data structure based on B+-trees (MDB-tree with lists), in which MXT-algorithm can effectively find the best K objects according to user preferences without accessing all the objects. We present a comparison of MXT-algorithm with the results of TA-algorithm and MD-algorithm, and we show that MXT-algorithm achieves the best results in some cases. Moreover, we show that, according to the type of the object attributes, it is possible to choose the best data structure for object storage and also the top-k algorithm for efficient top-k problem solving.

1.3 Paper organization

The paper is organized as follows. Section 2 describes the top-k problem and user preferences. Section 3 is devoted to explaining the principles of TA-algorithm and MD-algorithm. Section 4 describes the application of local user preferences in these algorithms. In Section 5, we describe our new MXT-algorithm and the new data structure that MXT-algorithm uses. Section 6 presents the results of the tests of TA-algorithm, MD-algorithm, and MXT-algorithm, and their comparison for various data sets and various user preferences. Finally, Section 7 concludes and provides some suggestions for future research.

2 Top-K problem

The top-k problem is the problem of searching for the best K objects. In this article, we suppose a set of objects X with m attributes $A_1, \ldots, A_m$. Every object $x \in X$ has m values $a_1^x, \ldots, a_m^x$ of these attributes.

2.1 Rating function

It is most suitable to use a rating function (ranking function), which assigns a rating to each object $x \in X$. In this paper, we suppose a function R with m variables $a_1, \ldots, a_m$ specified by the scheme $R(a_1, \ldots, a_m) : [0,1]^m \to [0,1]$.
We denote the rating of an object $x \in X$ as a function of one variable, $R(x) = R(a_1^x, \ldots, a_m^x)$. R(x) maps every object $x \in X$, according to its m attribute values, into the interval [0,1]. For the worst object $x \in X$, R(x) = 0 holds, and for the best one, R(x) = 1 holds. According to R it is possible to sort the objects from X in descending order and determine the best K objects. In this work we suppose that if several objects have the same rating as the K-th best object, one of them is chosen at random.

2.2 User preferences

In this paper, we consider a solution of the top-k problem for several users with various user preferences. Every user chooses his/her user preferences, which determine the suitability of the object $x \in X$ in dependence on its m attribute values. In this work, we differentiate between local preferences and global preferences.

2.2.1 Local preferences

Local preferences reflect how the object is preferred according to only one attribute. We express the local preference for the i-th attribute $A_i$ as a fuzzy function $f_i$. The fuzzy function $f_i$ is understood as a mapping $f_i : dom(A_i) \to [0,1]$, which maps every value of the actual domain of attribute $A_i$ into the interval [0,1]. Local preferences of user U for the attributes $A_1, \ldots, A_m$ are represented by user fuzzy functions denoted as $f_1^U(x), \ldots, f_m^U(x)$, respectively. A user fuzzy function $f_i^U(x) : a_i^x \to [0,1]$, where i = 1,...,m, then maps every object $x \in X$, according to the value of its i-th attribute $a_i^x$, into the interval [0,1]. In general, we differentiate two possible attribute types.

Nominal attributes. A nominal attribute has a finite range of possible values, usually strings. For a nominal attribute, the user has to set a rating for each attribute value. For example, the brand or kind of a product is a nominal attribute.

Ordinal attributes. An ordinal attribute has some natural value ordering other than lexical ordering. Typical examples are integer numbers. The domain of an ordinal attribute is a subset of a continuous interval.
In this case, it is possible to use a continuous function as the user fuzzy function. For example, a price is usually an ordinal attribute.

2.2.2 Global preferences

Global preferences express how the user U prefers objects from X according to all attributes $A_1, \ldots, A_m$. We introduced local preferences, where user U rates every single attribute $A_i$ by a fuzzy function $f_i^U$. The global preference of user U then defines the mutual relations between the attributes $A_1, \ldots, A_m$. We consider an aggregation function @ with m variables $p_1, \ldots, p_m$, specified as $@(p_1, \ldots, p_m) : [0,1]^m \to [0,1]$. For the user U with his/her user fuzzy functions $f_1^U, \ldots, f_m^U$, a user rating function $R^U$ originates by means of the substitution $p_i = f_i^U(x)$. Then $R^U(x) = @(f_1^U(x), \ldots, f_m^U(x))$ for every $x \in X$. With $R^U(x)$ it is possible to evaluate the global rating of each object $x \in X$ and to find the top-k objects for user U.

With the aggregation function, a user U can define the mutual relations of the attributes. In practical applications, the user's influence on the aggregation function can be implemented by a weighted average, where the weights $w_1, \ldots, w_m$ of the single attributes $A_1, \ldots, A_m$ determine how much the user prefers each attribute, i.e.

$R^U(x) = \frac{w_1 f_1^U(x) + \cdots + w_m f_m^U(x)}{w_1 + \cdots + w_m}$.

When the user does not care about the i-th attribute $A_i$, he/she can set $w_i = 0$ in the aggregation function.

3 Top-K algorithms

We call algorithms that solve the top-k problem top-k algorithms. The easiest way to find the best K objects is to read all objects $x \in X$ and to calculate the rating of every object x. Then the K objects with the highest rating are chosen. In this case, all objects $x \in X$ have to be accessed. In this section, we show two top-k algorithms, Fagin's TA-algorithm and MD-algorithm, which solve the top-k problem without searching all the objects. These algorithms can find the best K objects according to the aggregation function @.

3.1 Fagin's TA-algorithm

Fagin et al. describe in [13] the top-k algorithm TA (threshold algorithm). This algorithm assumes that the objects are stored in m lists $L_1, \ldots, L_m$. Each i-th list $L_i$ contains pairs $(x, a_i^x)$ for all objects $x \in X$ and is sorted in descending order according to the values of the i-th attribute. The aggregation function @ must be monotone with respect to the ordering in the lists, e.g. a weighted average. TA-algorithm searches the lists sequentially and obtains pairs $(x, a_i^x)$.
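The setting just described (sorted lists, sequential access, direct access, and a monotone aggregation function such as the weighted average of Section 2.2.2) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' Java implementation: the names `weighted_average` and `threshold_algorithm` are hypothetical, a plain dictionary stands in for direct access, and the stop test (threshold not greater than the rating of the current K-th best object, as explained in the rest of this section) is evaluated once per round of sequential accesses rather than after every pair.

```python
import heapq


def weighted_average(weights):
    """Build the aggregation function @ as a weighted average (Section 2.2.2)."""
    total = sum(weights)

    def aggregate(values):
        return sum(w * v for w, v in zip(weights, values)) / total

    return aggregate


def threshold_algorithm(objects, aggregate, k):
    """Sketch of TA: objects maps an id to a tuple of m values in [0, 1]."""
    m = len(next(iter(objects.values())))
    # One list per attribute, sorted in descending order of that attribute.
    lists = [
        sorted(((x, v[i]) for x, v in objects.items()),
               key=lambda pair: pair[1], reverse=True)
        for i in range(m)
    ]
    last = [1.0] * m   # last seen value in each list (1.0 = top of [0, 1])
    top = []           # min-heap of (rating, id); top[0][0] plays the role of M_K
    seen = set()
    rounds = 0
    for depth in range(len(objects)):
        rounds += 1
        for i in range(m):
            x, value = lists[i][depth]        # sequential access
            last[i] = value
            if x not in seen:
                seen.add(x)
                rating = aggregate(objects[x])  # direct access to missing values
                if len(top) < k:
                    heapq.heappush(top, (rating, x))
                elif rating > top[0][0]:
                    heapq.heapreplace(top, (rating, x))
        # Threshold: best rating any not yet seen object could still reach.
        if len(top) == k and aggregate(last) <= top[0][0]:
            break
    return [x for _, x in sorted(top, reverse=True)], rounds
```

With six hypothetical objects and equal weights, for example, the sketch can return the best two objects after reading only the first two entries of each list.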
For every object x that is detected for the first time in an obtained pair $(x, a_i^x)$, TA-algorithm obtains the missing attribute values of x by direct access to the other lists and calculates the rating of object x, which we denote @(x). TA-algorithm uses the temporary list $T_K$, in which it keeps the current best K objects ordered according to @(x). The rating of the K-th best object in $T_K$ is denoted $M_K$. TA-algorithm uses a threshold $Th = @(a_1^{last}, \ldots, a_m^{last})$, where $a_1^{last}, \ldots, a_m^{last}$ are the last seen values of the attributes in the lists $L_1, \ldots, L_m$ in the sequential access. When $Th \le M_K$, TA-algorithm is able to stop and return $T_K$, which contains the best K objects according to @. TA-algorithm can finish before it comes to the end of all the lists [13]. It means that not all the objects need to be accessed.

Figure 1. Set of six objects with values of three attributes stored in sorted lists.

The following pseudo-code describes TA-algorithm. The procedure getnextpair obtains the next pair $(x, a_i^x)$ from one of the lists $L_1, \ldots, L_m$ sequentially [13][9].

    Input: Lists L_1,...,L_m, int K;
    Output: List T_K;
    var List T_K;
    begin
      while(|T_K| < K or Th > M_K) do
        (x, a_i^x) = getnextpair(L_1,...,L_m);
        a_i^last = a_i^x;
        Th = @(a_1^last,...,a_m^last);
        if(x ∉ T_K) then
          get the missing attribute values of the object x;
          if(|T_K| < K) then
            insert x to the list T_K on the right place according to @(x);
          else if(@(x) > M_K) then
          begin
            delete K-th object from the list T_K;
            insert x to the list T_K on the right place according to @(x);
          end;
      endwhile;
      return T_K;
    end.

Example 1. Figure 1 contains six objects with values of three attributes stored in sorted lists $L_1, L_2, L_3$. If TA-algorithm is searching for the best three objects according to the aggregation function $@(x) = a_1^x + a_2^x + a_3^x$, then TA-algorithm gets only three pairs $(x, a_i^x)$ from each of the lists. At this moment $Th = @(a_1^{last}, \ldots, a_m^{last}) = 1.8$, $T_K$ includes three objects with $@(x_1) = 2.4$, $@(x_3) = 2.2$, $@(x_4) = 2.0$, respectively, and $M_K = @(x_4) = 2.0$ holds.
Then $Th \le M_K$ holds and TA-algorithm is able to stop; it need not read the remaining entries of the lists.

3.2 MD-algorithm based on MDB-tree

Now we describe MD-algorithm [2], which efficiently solves the top-k problem using the multidimensional B-tree (MDB-tree) [15]. MDB-tree allows indexing the set of objects X by attributes $A_1, \ldots, A_m$, m > 1, in one data structure. In this case, MDB-tree has m levels and the values of one attribute are stored in one level. We use a variant of MDB-tree whose nodes are B+-trees. The i-th level of MDB-tree is composed of B+-trees containing key values from $dom(A_i)$.

Figure 2. Set of eleven objects with values of three attributes stored in MDB-tree.

For the explanation of MD-algorithm, we introduce the pointer of a key, the identifier of a B+-tree and the best rating of a B+-tree [2].

The pointer of the key $k_i$ in a B+-tree in the i-th level of MDB-tree is denoted by $\rho(k_i)$. If i < m, then $\rho(k_i)$ refers to a B+-tree in the (i+1)-th level of MDB-tree. If i = m, i.e. the B+-tree is in the last level of MDB-tree, then $\rho(k_i)$ refers to an object array, where objects with the same values of all the m attributes are stored.

For explicit identification of a B+-tree in MDB-tree, we use a sequence of keys called the tree identifier. The tree identifier of a B+-tree in the i-th level is $(k_1, \ldots, k_{i-1})$. The B+-tree in the first level of MDB-tree has the tree identifier ( ). In Figure 2, $(k_1, k_2) = (1.0, 0.0)$ is the identifier of the B+-tree at the third level, which contains keys 0.0, 0.7 and refers to objects $x_F$, $x_G$, $x_H$.

In MDB-tree we use a best rating B(S) of a B+-tree S. For every B+-tree S in MDB-tree there is a uniquely defined subset $X^S$ of X, which we call the set of available objects from S. For example, in Figure 2, the set $X^S$ of S with identifier (0.0) contains objects $x_A$, $x_B$, $x_C$, $x_D$. By the best rating B(S) of a B+-tree S with identifier $(k_1, \ldots, k_{i-1})$ in the i-th level of MDB-tree we understand the maximal possible rating of a not yet known object x from the set of available objects from S. Analogously to TA-algorithm, we assume that the aggregation function @ is nondecreasing in all its variables. Then B(S) is calculated as $@(a_1^x, \ldots, a_m^x)$, where the first i-1 attribute values of the object x are $k_1, \ldots, k_{i-1}$ and the values of the other attributes are 1 (the maximum of the interval [0,1]), i.e. $B(S) = @(k_1, \ldots, k_{i-1}, 1, \ldots, 1)$.

MD-algorithm is based on a recursive procedure findtopk, which searches MDB-tree in depth-first search. Analogously to TA-algorithm, MD-algorithm uses the temporary list $T_K$, in which it keeps the current best K objects ordered according to @(x). The rating of the K-th best object in $T_K$ is denoted $M_K$. MD-algorithm can find the best K objects in MDB-tree with the recursive procedure findtopk according to a monotone aggregation function @ and without getting all the objects. The next statement specifies this fact more precisely.

Statement 1. [2] Let the key $k_i$ from the B+-tree S with the identifier $(k_1, \ldots, k_{i-1})$ be the key obtained, one by one in descending order, by a run of procedure findtopk. The pointer $\rho(k_i)$ refers to a B+-tree P in the next level or to an object array P, the best rating of P is $B(P) = @(k_1, \ldots, k_i, 1, \ldots, 1)$, and the aggregation function @ is monotone. If $B(P) \le M_K$ holds, no object $x \in X^P$ can get into $T_K$. Moreover, it is not necessary to obtain the next key $k_i^{next}$ from the B+-tree S, which refers to $P^{next}$, because $k_i^{next} \le k_i$, which means that $B(P^{next}) \le B(P) \le M_K$. The procedure findtopk can stop in the B+-tree S, because no further object $x \in X^S$ can get into $T_K$.

The following pseudo-code describes MD-algorithm.
    Input: MDBtree MDB-tree, int K;
    Output: List T_K;
    var List T_K;
    begin
      findtopk(MDB-tree, ( ), K);
      return T_K;
    end.

    procedure findtopk(MDBtree MDB-tree, TreeId (k_1,...,k_{i-1}), int K);
      while(exists next key in B+-tree (k_1,...,k_{i-1})) do
        k_i = getnextkey(MDB-tree, (k_1,...,k_{i-1}));
        {ρ(k_i) refers to B+-tree P or to object array P}
        if(|T_K| = K and B(P) ≤ M_K) then return; {Statement 1.}
        if(P is B+-tree) then
          findtopk(MDB-tree, (k_1,...,k_i), K);
        if(P is the object array) then
          while(there is the next object x in P) do
            if(|T_K| < K) then
              insert object x to T_K according to @(x);
            else if(@(x) > M_K) then
            begin
              delete K-th object from the list T_K;
              insert object x to the list T_K according to @(x);
            end;
          endwhile;
      endwhile;
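The depth-first search with pruning by the best rating B(P) can also be sketched in Python. The sketch below is illustrative only and rests on several assumptions that are not part of the paper: nested dictionaries stand in for the B+-trees of MDB-tree (keys are sorted at query time instead of being read from a leaf level), the function names `build_mdb_tree` and `md_top_k` are hypothetical, and the arithmetic average is used as a monotone aggregation function.

```python
import heapq


def average(values):
    """Arithmetic average, used here as a monotone aggregation function @."""
    return sum(values) / len(values)


def build_mdb_tree(objects):
    """Nested dicts stand in for B+-trees; the last level maps a key to an object array."""
    root = {}
    for x, values in objects.items():
        node = root
        for v in values[:-1]:
            node = node.setdefault(v, {})
        node.setdefault(values[-1], []).append(x)
    return root


def md_top_k(tree, m, aggregate, k):
    """Sketch of MD-algorithm: returns the best k ids and the ids actually visited."""
    top = []       # min-heap of (rating, id); top[0][0] plays the role of M_K
    visited = []

    def offer(rating, x):
        if len(top) < k:
            heapq.heappush(top, (rating, x))
        elif rating > top[0][0]:
            heapq.heapreplace(top, (rating, x))

    def find_top_k(node, prefix):
        for key in sorted(node, reverse=True):   # keys in descending order
            values = prefix + (key,)
            # Best rating B(P): known keys, remaining attributes set to 1.
            best = aggregate(values + (1.0,) * (m - len(values)))
            if len(top) == k and best <= top[0][0]:
                return                            # Statement 1: stop in this tree
            child = node[key]
            if isinstance(child, dict):
                find_top_k(child, values)
            else:
                for x in child:                   # object array
                    visited.append(x)
                    offer(aggregate(values), x)

    find_top_k(tree, ())
    return [x for _, x in sorted(top, reverse=True)], visited
```

On a small hypothetical data set the sketch visits only the objects under the largest first-level keys; subtrees whose best rating cannot exceed the rating of the current K-th best object are skipped.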

    procedure getnextkey(MDBtree MDB-tree, TreeId (k_1,...,k_{i-1}));
      choose the next key k_i with the next highest value of A_i
        in the B+-tree of MDB-tree with identifier (k_1,...,k_{i-1});
      return k_i;

Figure 3. MD-algorithm is searching for the best object in MDB-tree according to the aggregation function.

Example 2. In Figure 3, fourteen objects with values of three attributes are stored in MDB-tree with three levels. MD-algorithm is searching for the best object according to the aggregation function $@(x) = a_1^x + a_2^x$. MD-algorithm starts in B+-tree ( ) and obtains key 1.0, which refers to B+-tree (1.0). There MD-algorithm obtains key 0.5, which refers to an object array, where it obtains object W with @(W) = 1.5. MD-algorithm inserts object W into the temporary list $T_K$, because $T_K$ is empty. Then MD-algorithm obtains the next key 0.4 in B+-tree (1.0). This key refers to an object array whose best rating is smaller than the rating of the K-th best object in $T_K$. MD-algorithm can stop in B+-tree (1.0) and continues in B+-tree ( ). It obtains the next key 0.8, which refers to B+-tree (0.8). In this B+-tree MD-algorithm obtains key 0.8, which refers to an object array, where object M with @(M) = 1.6 is obtained. MD-algorithm deletes object W from the list $T_K$ and inserts object M into the list $T_K$. Then MD-algorithm obtains the next key 0.0 in B+-tree (0.8). This key refers to an object array whose best rating is smaller than the rating of the K-th best object in $T_K$. MD-algorithm can stop in B+-tree (0.8) and continues in B+-tree ( ) in the first level of MDB-tree. MD-algorithm obtains the next key 0.6, which refers to B+-tree (0.6). The best rating of this B+-tree is 1.6. This is not greater than $M_K = 1.6$, so MD-algorithm can stop. The best object according to the aggregation function is M. It means that MD-algorithm does not search all the objects in MDB-tree.

4 Application of local preferences

This section discusses the support of local preferences (see Section 2.2.1) in TA-algorithm and MD-algorithm. To apply local user preferences, we use a B+-tree. Moreover, this structure is common to all users and independent of users' preferences [2].

4.1 Usage of B+-tree

In a B+-tree the keys are sorted in ascending order. Since the leaf nodes of the B+-tree are linked in both directions, it is possible to traverse the B+-tree through the leaf level and to get all the keys. Therefore, it is possible to obtain objects from the B+-tree in descending order according to the course of the user fuzzy function $f^U$ [2][3][5]. When the user fuzzy function $f^U$ is monotone on its domain, the following holds.

Let $f^U$ be nondecreasing. We have to traverse the leaf level of the B+-tree from right to left. It is possible to get the pairs $(x, f^U(x))$ in descending order according to the user preference $f^U$, because $a^x \le a^y \Rightarrow f^U(x) \le f^U(y)$ holds.

Let $f^U$ be nonincreasing. We have to traverse the leaf level of the B+-tree from left to right. It is possible to get the pairs $(x, f^U(x))$ in descending order according to the user preference $f^U$, because $a^x \le a^y \Rightarrow f^U(x) \ge f^U(y)$ holds.

In general, a user fuzzy function $f^U$ might not be monotone on its domain. In this case, the domain can be divided into continuous intervals such that $f^U$ is monotone on each of them. The leaf level of the B+-tree is then divided into parts according to these intervals. From these parts, objects can be obtained concurrently in descending order of $f^U$, in the same way as for nondecreasing and nonincreasing fuzzy functions [2].

Example 3. Figure 4 shows a fuzzy function $f^U$ and a B+-tree. The domain of $f^U$ is divided into monotone intervals $w_1, \ldots, w_5$. Objects are obtained from the B+-tree concurrently according to these intervals. Finally, we get objects $x_K$, $x_M$, $x_N$, $x_G$, $x_H$, $x_F$, $x_T$, $x_S$, $x_E$, $x_C$, $x_Q$, $x_U$, $x_R$, $x_D$, $x_Y$ in descending order according to $f^U$.
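The concurrent traversal of the monotone parts of the leaf level can be sketched as a merge of several streams. The following Python sketch is a simplified illustration, not the authors' implementation: a sorted Python list stands in for the leaf level of the B+-tree, the interval boundaries are passed in explicitly as the hypothetical parameter `breakpoints`, and `heapq.merge` plays the role of obtaining keys from the parts concurrently.

```python
import bisect
import heapq


def descending_by_preference(keys, f, breakpoints):
    """Yield keys in descending order of f.

    keys        -- attribute values in ascending order (the leaf level)
    f           -- user fuzzy function, monotone between consecutive breakpoints
    breakpoints -- ascending values that split the domain into monotone intervals
    """
    cuts = [0] + [bisect.bisect_left(keys, b) for b in breakpoints] + [len(keys)]
    streams = []
    for lo, hi in zip(cuts, cuts[1:]):
        part = keys[lo:hi]
        if not part:
            continue
        if f(part[0]) <= f(part[-1]):
            part = part[::-1]      # nondecreasing part: traverse right to left
        # otherwise nonincreasing part: traverse left to right
        streams.append((f(key), key) for key in part)
    # Every stream is already descending in f, so a k-way merge suffices.
    for _, key in heapq.merge(*streams, key=lambda pair: pair[0], reverse=True):
        yield key
```

Because the result is produced lazily, a top-k algorithm can ask for the next best key without evaluating the fuzzy function on the whole domain.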
4.2 Application in TA-algorithm

The original TA-algorithm (see Section 3.1) offers the possibility to rate objects with the aggregation function @ and to find the best K objects for the user U only according to his/her global preference. To support local preferences, every i-th list $L_i$ must contain pairs $(x, f_i^U(x))$ in descending order according to the user fuzzy function $f_i^U(x)$.

Therefore, TA-algorithm uses as the lists $L_1, \ldots, L_m$ a set of m B+-trees $B_1, \ldots, B_m$. In the B+-tree $B_i$, all objects are indexed by the values of the i-th attribute $A_i$. TA-algorithm can search the B+-trees $B_1, \ldots, B_m$ sequentially. Pairs $(x, f_i^U(x))$ can be obtained one by one from the B+-tree $B_i$. TA-algorithm also uses direct access to the lists $L_1, \ldots, L_m$, where for an object x it is necessary to obtain its unknown value $a_i^x$ from $L_i$. Because a B+-tree is not able to perform this operation, direct access can be realized, for example, by an associative array.

Figure 4. An example of objects obtained from the B+-tree concurrently according to a fuzzy function.

4.3 Application in MD-algorithm

Because MDB-tree is composed of B+-trees, it is possible to apply the local user preferences directly in MD-algorithm when obtaining keys from every B+-tree. The following procedure getnextkey changes MD-algorithm accordingly.

    procedure getnextkey(MDBtree MDB-tree, TreeId (k_1,...,k_{i-1}), FuzzyFunction f_i^U);
      choose the next key k_i with the next highest value of f_i^U(k_i)
        in the B+-tree of MDB-tree with identifier (k_1,...,k_{i-1});
      return k_i;

5 MXT-algorithm

In this section, we describe a new top-k algorithm, which is based on the integration of MD-algorithm and Fagin's TA-algorithm. This new algorithm can also find the best K objects according to the aggregation function @ without searching all the objects.

5.1 Usage of TA- and MD-algorithm

In general, during the computation of TA-algorithm and MD-algorithm the number of accessed objects is smaller than the number of all objects. The number of accessed objects depends on several factors. MD-algorithm has the best results when the objects stored in MDB-tree have a uniform distribution [2]. When the attributes of the objects have actual domains of different sizes, the order of the attributes in the levels of MDB-tree is very important for the efficiency of MD-algorithm.
For MD-algorithm it is better to build MDB-tree with attributes with small-sized domains in its higher levels and attributes with big-sized domains in its lower levels. When most of the attributes have big-sized actual domains, MD-algorithm is not a suitable solution of the top-k problem. In this case, TA-algorithm is more suitable.

Example 4. We had data about flats for rent in Prague at our disposal. These flats have four attributes important for users: District, Type, Area, and Price. These attributes have the following domain sizes: |dom(District)| = 10, |dom(Type)| = 10, |dom(Area)| = 229, |dom(Price)| = 411. When a user prefers the attributes District and Type, it is better to store the flats in MDB-tree and to use MD-algorithm. On the other hand, when a user prefers the attributes Area and Price, it is better to use TA-algorithm and to store the flats in Fagin's lists.

In general, an attribute with a small-sized domain is nominal and an attribute with a big-sized domain is ordinal (see Section 2). This is also valid in Example 4. The attributes District and Type are nominal attributes; Area and Price are ordinal attributes.

5.2 Integration of TA- and MD-algorithm

For a set of objects with several nominal attributes and several ordinal attributes, we developed a new top-k algorithm, MXT-algorithm, which is based on the integration of MD-algorithm and Fagin's TA-algorithm. MXT-algorithm uses a new data structure, MDB-tree with lists, which is composed of MDB-tree and Fagin's sorted lists.
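The rule of thumb from Section 5.1 (small domains in the upper tree levels, big domains in sorted lists) can be turned into a small planning helper. The sketch below is only one possible reading: the function name `plan_layout` and the cut-off `max_tree_domain` are assumptions introduced here for illustration, and the paper itself does not prescribe a numeric threshold or take user weights into account at this point.

```python
def plan_layout(domain_sizes, max_tree_domain=50):
    """Split attributes into MDB-tree levels and sorted lists by domain size.

    domain_sizes    -- mapping attribute name -> size of its actual domain
    max_tree_domain -- hypothetical cut-off between "small" and "big" domains
    """
    # Smaller domains come first, i.e. they go to the higher tree levels.
    by_size = sorted(domain_sizes.items(), key=lambda item: item[1])
    tree_levels = [name for name, size in by_size if size <= max_tree_domain]
    list_attributes = [name for name, size in by_size if size > max_tree_domain]
    if not list_attributes:
        algorithm = "MD"     # everything fits into MDB-tree
    elif not tree_levels:
        algorithm = "TA"     # only sorted lists are used
    else:
        algorithm = "MXT"    # MDB-tree with lists
    return tree_levels, list_attributes, algorithm


# Domain sizes taken from Example 4.
flats = {"District": 10, "Type": 10, "Area": 229, "Price": 411}
print(plan_layout(flats))
```

For the domain sizes of Example 4 this places District and Type in the tree levels and Area and Price in the lists.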

Figure 5. MDB-tree with lists, in which a set of objects with two nominal attributes and two ordinal attributes is stored.

We suppose a set of objects X with m attributes $A_1, \ldots, A_m$. Attributes $A_1, \ldots, A_n$ are nominal attributes and $A_{n+1}, \ldots, A_m$ are ordinal attributes. Attributes $A_1, \ldots, A_n$ are stored in MDB-tree with n levels. Instead of the following m-n levels of MDB-tree, groups of m-n Fagin's sorted lists are used. These lists contain pairs $(x, a_i^x)$ with the values of attributes $A_{n+1}, \ldots, A_m$. MDB-tree with lists is shown in Figure 5. Two nominal attributes are stored as MDB-tree and two ordinal attributes are stored as groups of Fagin's lists.

MXT-algorithm also uses the temporary list $T_K$, in which it keeps the current best K objects ordered according to @(x). The rating of the K-th best object in $T_K$ is denoted $M_K$. MXT-algorithm is developed on the basis of MD-algorithm. The values of the first n attributes $A_1, \ldots, A_n$ are searched in the same way as during the computation of MD-algorithm. Analogously to MD-algorithm, Statement 1 holds for MXT-algorithm (see Section 3.2).

In every B+-tree in the n-th level of MDB-tree, there are keys with pointers that refer to groups of m-n Fagin's sorted lists. In each of these groups a new instance of TA-algorithm is run. Each instance of TA-algorithm uses a local threshold $Th_{local}$. It is not necessary to obtain the best K objects from each group of Fagin's lists. It is sufficient to compare $Th_{local}$ with $M_K$ of the temporary list $T_K$, because MXT-algorithm is not searching for the best K objects in one group of m-n Fagin's sorted lists, but for the best K objects throughout the whole data structure. Analogously to TA-algorithm, when $Th_{local} \le M_K$ holds in an instance of TA-algorithm, this instance is able to stop. Then the computation of MXT-algorithm continues as in MD-algorithm.
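The combination described above can be sketched in Python as a depth-first search over the first n attributes that starts a local TA run in each group of lists. This is a minimal sketch under explicit assumptions, not the authors' Java implementation: nested dictionaries stand in for MDB-tree, each group is kept as a dictionary and sorted on the fly (the paper stores the lists already sorted), the names `build_mdb_tree_with_lists` and `mxt_top_k` are hypothetical, 1 <= n < m is required, and the local threshold is assumed to combine the keys on the current path with the last seen list values.

```python
import heapq


def average(values):
    """Arithmetic average, used here as a monotone aggregation function @."""
    return sum(values) / len(values)


def build_mdb_tree_with_lists(objects, n):
    """First n attributes form nested dicts; level n points to a group {id: tail}."""
    root = {}
    for x, values in objects.items():
        node = root
        for v in values[:n - 1]:
            node = node.setdefault(v, {})
        node.setdefault(values[n - 1], {})[x] = tuple(values[n:])
    return root


def mxt_top_k(tree, n, m, aggregate, k):
    top = []            # min-heap of (rating, id); top[0][0] plays the role of M_K
    accessed = set()

    def offer(rating, x):
        if len(top) < k:
            heapq.heappush(top, (rating, x))
        elif rating > top[0][0]:
            heapq.heapreplace(top, (rating, x))

    def run_ta(group, prefix):
        # Sorted lists of the group (sorted here on the fly for brevity).
        lists = [sorted(group.items(), key=lambda p, j=j: p[1][j], reverse=True)
                 for j in range(m - n)]
        last = [1.0] * (m - n)
        seen = set()
        for depth in range(len(group)):
            for j, lst in enumerate(lists):
                x, tail = lst[depth]
                last[j] = tail[j]
                if x not in seen:
                    seen.add(x)
                    accessed.add(x)
                    offer(aggregate(prefix + tail), x)
            # Local threshold is compared with the global M_K, not a local one.
            if len(top) == k and aggregate(prefix + tuple(last)) <= top[0][0]:
                return

    def find_top_k(node, prefix):
        for key in sorted(node, reverse=True):
            values = prefix + (key,)
            best = aggregate(values + (1.0,) * (m - len(values)))
            if len(top) == k and best <= top[0][0]:
                return                      # Statement 1
            if len(values) < n:
                find_top_k(node[key], values)
            else:
                run_ta(node[key], values)

    find_top_k(tree, ())
    return [x for _, x in sorted(top, reverse=True)], accessed
```

The second return value makes it easy to check which objects were never touched, which is the effect illustrated under the dotted line in Figure 5.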
The efficiency of MXT-algorithm is based on the idea that, during its computation, it is not necessary to obtain the best K objects from each group of Fagin's lists, but only the objects with $@(x) > M_K$. In Figure 5, MXT-algorithm is searching for the best few objects. The part of the data structure that MXT-algorithm does not access during its computation is illustrated under the dotted line.

The following pseudo-code describes MXT-algorithm. Procedure getnextkey is the same as in MD-algorithm (see Section 3.2). Procedure getnextpair obtains the next pair $(x, a_i^x)$ from one of the lists $L_1, \ldots, L_{m-n}$ sequentially, as in TA-algorithm (see Section 3.1).

    Input: MDBtree MDB-tree, int K;
    Output: List T_K;
    var List T_K; {temporary list of objects}
    begin
      findtopk(MDB-tree, ( ), K);
      return T_K;
    end.

    procedure findtopk(MDBtree MDB-tree, TreeId (k_1,...,k_{i-1}), int K);
      while(exists next key in B+-tree (k_1,...,k_{i-1})) do
        k_i = getnextkey(MDB-tree, (k_1,...,k_{i-1}));
        {ρ(k_i) refers to B+-tree P or to group of lists P}
        if(|T_K| = K and B(P) ≤ M_K) then return; {Statement 1.}
        if(P is B+-tree) then
          findtopk(MDB-tree, (k_1,...,k_i), K);
        if(P is group of lists) then
          while(|T_K| < K or Th_local > M_K) do
            (x, a_i^x) = getnextpair(L_1,...,L_{m-n});
            a_i^last = a_i^x;
            Th_local = @(a_1^last,...,a_m^last);
            if(x ∉ T_K) then
              get the missing attribute values of the x;
              if(|T_K| < K) then
                insert object x to the list T_K on the right place;
              else if(@(x) > M_K) then
              begin
                delete K-th object from the list T_K;
                insert object x to list T_K;
              end;
          endwhile;
      endwhile;

5.3 Application of Local Preferences

Analogously to MD-algorithm, it is possible to apply the local user preferences for attributes $A_1, \ldots, A_n$. Procedure getnextkey changes MXT-algorithm in the same way as MD-algorithm (see Section 4.3).

For attributes $A_{n+1}, \ldots, A_m$, it is also possible to apply the local user preferences. Analogously to TA-algorithm, we use a group of m-n B+-trees $B_1, \ldots, B_{m-n}$ as the group of m-n Fagin's lists $L_1, \ldots, L_{m-n}$ (see Section 4.2).

8 Moreover, after we were making these modifications, we obtain tree-oriented data structure, which is composed only of B + -trees. In other words, it is an MDB-tree with n levels, where leaf nodes of B + -trees in n-th level refer to a group of m n B + -trees. 6 Experiments We implemented and tested the described top-k algorithms. The implementation of TA-algorithm, MD-algorithm and MXT-algorithm have been developed in Java with tree-oriented data structures created in memory. Important for us was the number of accesses into these data structures during calculation of top-k algorithms. We tested the top-k algorithms. During the tests, we used user fuzzy functions with linear course as user local preferences and the arithmetic average as user global preference. Objects from X with their m attributes values were stored in data structures considered. Obtaining one attribute value of one object we conceive as one access into data structures. We can simulate access to external memories in this way. 6.1 Distribution of Attribute Values At first, we tested two sets of objects with 5 attributes with normal and uniform distribution of attribute values. We used TA-algorithm, MD-algorithm and three variants of MXT-algorithm, i.e. MXT 3, MXT 2 and MXT 1. For example, MXT 3 uses first 3 nominal attributes, which are stored as MDB-tree with 3 levels, and other 2 attributes are stored as groups of 2 Fagin s sorted lists. Figure 6 and Figure 7 show results of this test. The best results have been achieved with MXT 3 and MD-algorithm for the set of objects with the uniform distribution of the attributes values. The test for sets of objects with normal distribution of attribute values has shown Figure 7. Uniform distribution of attributes values. that the new MXT-algorithm can in some cases also achieve worst results. 6.2 Flats for Rent We tested the sets of flats for rent in Prague (see Section 5.1, Example 4). 
There were two nominal attributes with a small domain size and two ordinal attributes with a big domain size. We used TA-algorithm, MD-algorithm and the most suitable variant of MXT-algorithm. Figure 8 shows results of this test. Figure 8. Finding the best flats in Prague. The best result has been achieved with MXT-algorithm. This test shown that MXT-algorithm is most efficient solution of top-k problem in this case. In practice, most objects, which are searched by users, have more nominal attributes and more ordinal attributes. In this case, it is suitable to use MXT-algorithm. Figure 6. Normal distribution of attributes values. 6.3 Various user preferences In general, it is problematic to test efficiency of top-k algorithms in dependence on user preferences. For various

9 settings of the user preferences and for various distributions of attribute values, top-k algorithms achieve different results. In this subsection we focus on global user preference expressed by weighted average (see Section 2.2.2). We used a set of objects with two nominal attributes and three ordinal attributes. The distribution of attribute values was uniform. Figure 9 shows results of the test, where the weight for each attribute was the same. In this test, the worst results have been achieved with TA-algorithm. For choosing the best objects, MXT-algorithm and MD-algorithm needed more than 10 times less accesses than the TAalgorithm. MXT-algorithm and MD-algorithm achieved nearly the same results. This shows that using MXT-algorithm and MDalgorithm is the most efficient solution for set of objects with more nominal attributes, which are preferred by users. Figure 10 shows results of the test, where weights of nominal attributes were equal to 0. In this test, the result of searching the best K objects is independent on values of nominal attributes. TA-algorithm is not disadvantaged and achieves good results. Finally, Figure 11 shows the test, where MXT-algorithm achieves the best results in number of accesses. There the weight of the first ordinal attribute was equal to 0. MDalgorithm achieved worse results than MXT-algorithm, because of the attributes order in levels of MDB-tree. The weight of attribute just in third level was 0 and MD-algorithm was often searching B + -trees in this level without rising of best rating B(S) (see Section 3.2). These and other tests, which we accomplished, have shown that MXT-algorithm is efficient solution of top-k problem in some cases. On the other hand, TA-algorithm and MD-algorithm achieved the best results in some different cases. Figure 9. Weights of all the attributes were the same. MXT-algorithm and MD-algorithm achieved nearly the same results. Figure 10. Weights of nominal attributes were equal to 0. Figure 11. 
Weight of the first ordinal attribute was 0.

7 Conclusion

We developed a new algorithm, MXT-algorithm, which can efficiently find the best K objects according to user preferences without accessing all the objects. We implemented the top-k algorithms TA-algorithm, MD-algorithm and MXT-algorithm with support for user preferences. MXT-algorithm is based on the integration of TA-algorithm and MD-algorithm. The results of MXT-algorithm have shown that it is comparable with the other top-k algorithms.

According to the types of object attributes, it is possible to store the objects in an MDB-tree, in Fagin's lists or in an MDB-tree with lists. Each of the implemented top-k algorithms searches in a different data structure. According to the properties of the set of objects, we can decide in which of the data structures the objects should be stored so that they can be searched by the most efficient top-k algorithm. The process of choosing the best data structure can be automated by analyzing the attribute domains. In this sense, information about the types and sizes of the attribute domains is important.

MXT-algorithm is the most efficient solution of the top-k problem in some cases. It is especially efficient for a set of objects that has several nominal attributes with small-sized domains and several ordinal attributes with big-sized domains. Moreover, MXT-algorithm can find the best K objects in each of these tree-oriented data structures. In this sense, TA-algorithm and MD-algorithm are extreme cases of MXT-algorithm. Because of the construction of MXT-algorithm, it may also be interesting to let some of the instances of TA-algorithm compute continuously, or to develop MXT-algorithm with the use of parallel computing.

In this work, we used a model of preferences based on local and global user preferences. In future work, we can use user preferences based on different models. For example, when a dependence exists between the values of several attributes, a user has to set his/her preference for these attributes together. In this case, we would have to develop some modifications of the top-k algorithms. This can be a new direction of future research.

Another motivation for future research is to find applications of the developed algorithms in different contexts. For example, in some cases it is necessary to find the K objects most similar to a given query object. Similarities between objects are most often computed as aggregated similarities of their attribute values. In [4], multi-dimensional indexing of non-metric spaces is described together with a top-k algorithm that performs much better than the family of Fagin's algorithms. Some attribute values can be stored on remote servers. In this case, some of the attribute values from web-accessible external sources might not be available at the same time. In [14], the authors study how to process top-k queries efficiently in this setting. Implementation of our top-k algorithms in an environment with multiple information sources could be a next direction of our research.
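The conclusion's remark that the choice of data structure can be automated by analyzing attribute domains could be sketched roughly as follows. This is only an illustrative heuristic, not the authors' procedure: the `Attribute` class, the `choose_structure` function, the attribute names, the domain sizes and the cut-off `SMALL_DOMAIN_LIMIT` are all assumptions introduced here, since the paper gives no numeric threshold. The sketch also ignores user weights, which Section 6.3 shows can change which algorithm performs best.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cut-off between "small" and "big" domains.
# The paper does not state a numeric threshold; this value is an assumption.
SMALL_DOMAIN_LIMIT = 20


@dataclass(frozen=True)
class Attribute:
    name: str
    kind: str         # "nominal" or "ordinal"
    domain_size: int  # number of distinct values in the domain


def is_tree_level(attr: Attribute, small_limit: int) -> bool:
    """Nominal attributes with small domains are candidates for MDB-tree levels."""
    return attr.kind == "nominal" and attr.domain_size <= small_limit


def choose_structure(attributes: List[Attribute],
                     small_limit: int = SMALL_DOMAIN_LIMIT) -> str:
    """Suggest a storage structure and its top-k algorithm (illustrative only)."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    tree_levels = [a for a in attributes if is_tree_level(a, small_limit)]
    list_attrs = [a for a in attributes if not is_tree_level(a, small_limit)]
    if tree_levels and list_attrs:
        # Mixed case: small nominal domains plus big ordinal domains.
        return "MDB-tree with lists (MXT-algorithm)"
    if tree_levels:
        # Extreme case: every attribute fits an MDB-tree level.
        return "MDB-tree (MD-algorithm)"
    # Extreme case: every attribute is kept in a sorted list.
    return "Fagin's lists (TA-algorithm)"


# Hypothetical flat attributes, loosely modelled on the test in Figure 8.
flat_attributes = [
    Attribute("district", "nominal", 10),
    Attribute("flat_type", "nominal", 6),
    Attribute("price", "ordinal", 5000),
    Attribute("area", "ordinal", 300),
]
print(choose_structure(flat_attributes))  # MDB-tree with lists (MXT-algorithm)
```

The two single-structure branches reflect the statement that TA-algorithm and MD-algorithm are extreme cases of MXT-algorithm. A more realistic selector would probably also take the expected user weights and the distribution of attribute values into account.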
8 Acknowledgments

This research is supported by the Grant Agency of Charles University (GAUK), grant number 9209 (204-10/259011), Charles University in Prague, Czech Republic.

References

[1] Ilyas, I. F., Beskales, G., and Soliman, M. A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40, 4 (Oct. 2008)
[2] Ondreička, M., Pokorný, J.: Extending Fagin's algorithm for more users based on multidimensional B-tree. In: Proc. of ADBIS 2008, P. Atzeni, A. Caplinskas, and H. Jaakkola (Eds.), LNCS 5207, Springer-Verlag Berlin Heidelberg, 2008, pp
[3] Gurský, P., Vaneková, V., Pribolová, J.: Fuzzy User Preference Model for Top-k Search. In: Proceedings of IEEE World Congress on Computational Intelligence (WCCI), Hong Kong, FS0377
[4] Deshpande, P. M., P, D., and Kummamuru, K.: Efficient online top-k retrieval with arbitrary similarity measures. In: Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology (Nantes, France, March 25-29, 2008). EDBT '08, vol ACM, New York, NY, pp
[5] Eckhardt, A., Pokorný, J., Vojtáš, P.: A system recommending top-k objects for multiple users preference. In: Proc. of 2007 IEEE International Conference on Fuzzy Systems, July 24-26, 2007, London, England, pp
[6] Xin, D., Han, J., and Chang, K. C.: Progressive and selective merge: computing top-k with ad-hoc ranking functions. In: Proc. of the 2007 ACM SIGMOD International Conference on Management of Data (Beijing, China, June 11-14, 2007). SIGMOD '07. ACM, New York, NY, pp
[7] Zhang, Z., Hwang, S., Chang, K. C., Wang, M., Lang, C. A., and Chang, Y.: Boolean + ranking: querying a database by k-constrained optimization. In: Proc. ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, June 27-29, 2006, pp
[8] Vojtáš, P.: Fuzzy logic aggregation for semantic web search for the best (top-k) answer. Capturing Intelligence, Chapter 17, Volume 1, 2006, pp
[9] Gurský, P., Lencses, R., Vojtáš, P.: Algorithms for user dependent integration of ranked distributed information. In: Proceedings of TED Conference on e-government (TCGOV 2005), pp
[10] Marian, A., Amer-Yahia, S., Koudas, N., Srivastava, D.: Adaptive Processing of Top-k Queries in XML. In: Proc. of the 21st International Conference on Data Engineering, April 05-08, 2005, ICDE. IEEE Computer Society, Washington, DC, pp
[11] Michel, S., Triantafillou, P., and Weikum, G.: KLEE: a framework for distributed top-k query algorithms. In: Proc. of the 31st International Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 02, 2005). Very Large Data Bases. VLDB Endowment, pp
[12] Chaudhuri, S., Gravano, L., Marian, M.: Optimizing Top-k Selection Queries over Multimedia Repositories. IEEE Trans. on Knowledge and Data Engineering, August 2004 (Vol. 16, No. 8), pp
[13] Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66, 2003, pp
[14] Bruno, N., Gravano, L., Marian, A.: Evaluating top-k queries over web-accessible databases. In: Proc. of ICDE, 2002, pp
[15] Scheuerman, P., Ouksel, M.: Multidimensional B-trees for associative searching in database systems. Information Systems, Vol. 34, No. 2, 1982, pp
