PAPER SRT-Rank: Ranking Keyword Query Results in Relational Databases Using the Strongly Related Tree


In-Joong KIM, Student Member, Kyu-Young WHANG a), and Hyuk-Yoon KWON, Nonmembers

SUMMARY A top-k keyword query in relational databases returns k trees of tuples where the tuples containing the query keywords are connected via primary key-foreign key relationships, in the order of relevance to the query. Existing works are classified into two categories: 1) the schema-based approach and 2) the schema-free approach. We focus on the former, which utilizes database schema information for more effective ranking of the query results. Ranking measures used in existing works can be classified into two categories: 1) the size of the tree (i.e., the syntactic score) and 2) ranking measures, such as TF-IDF, borrowed from the information retrieval field. However, these measures do not take into account semantic relevancy among the relations containing the tuples in the query results. In this paper, we propose a new ranking method that ranks the query results by utilizing semantic relevancy among the relations containing the tuples at the schema level. First, we propose a structure of semantically strongly related relations, which we call the strongly related tree (SRT). An SRT is a tree that maximally connects relations based on the lossless join property. Next, we propose a new ranking method, SRT-Rank, that ranks the query results by a new scoring function augmenting existing ones with the concept of the SRT. SRT-Rank is the first research effort that applies semantic relevancy among relations to ranking the results of keyword queries. To show the effectiveness of SRT-Rank, we perform experiments on synthetic and real datasets by augmenting the representative existing methods with SRT-Rank.
Experimental results show that, compared with existing methods, SRT-Rank improves performance in terms of four quality measures (the mean normalized discounted cumulative gain (nDCG), the number of queries whose top-1 result is relevant to the query, the mean reciprocal rank, and the mean average precision) by up to 46.9%, 160.0%, 61.7%, and 63.8%, respectively. In addition, we show that the query performance of SRT-Rank is comparable to or better than those of existing methods.

key words: keyword query, relational database, semantic relevancy, lossless join, strongly related tree

1. Introduction

As the applications for storing text data in relational databases have become prevalent, keyword queries in relational databases have become an important issue [3], [10], [11], [21], [24]. A keyword query returns a tree of tuples where each leaf node of the tree is mapped to a tuple containing the query keywords, and each pair of adjacent tuples in the tree is connected via a primary key-foreign key relationship. Such a tree is called the Joined Tuple Tree (simply, JTT) [11]. Keyword queries in relational databases have the advantage that users can find their desired information easily without any prior knowledge about database schemas or structured query languages.

Figure 1 shows the Entity-Relationship (ER) diagram of the TPC-H database [18]. The ER diagram represents relationships among customers, suppliers, and parts, where parts are ordered by a customer and supplied by a supplier, who belongs to a nation and a region. Figure 2 shows a sample dataset of the TPC-H database. Here, a directed edge between tuples indicates a reference from one tuple to the other via a primary key-foreign key relationship. A keyword query to find the information that a customer whose name is "Washington" orders parts from a supplier

Fig. 1 The ER diagram of the TPC-H database.

Manuscript received February 7, 2014. Manuscript revised April 18, 2014.
The authors are with the Department of Computer Science, KAIST, Korea.
a) E-mail: kywhang@mozart.kaist.ac.kr
DOI: 10.1587/transinf.2014EDP7040

Fig. 2 A sample dataset of the TPC-H database.

Copyright © 2014 The Institute of Electronics, Information and Communication Engineers

KIM et al.: SRT-RANK: RANKING KEYWORD QUERY RESULTS IN RELATIONAL DATABASES USING THE STRONGLY RELATED TREE

Fig. 3 The results of a keyword query {"Washington", "Smith"}.

whose name is "Smith" is represented as {"Washington", "Smith"}. Figures 3 (a) and (b) show the results of the query.

In this paper, we propose a new method to improve the ranking quality of the top-k keyword query results in relational databases. A top-k keyword query returns k JTTs in the order of relevance to the query. Relevance to a query is calculated using a predefined scoring function. Therefore, how to design the scoring function to reflect the user's intention as closely as possible is the key to improving the ranking quality of JTTs. Many research efforts have been made to improve the ranking quality of JTTs [1], [2], [6], [7], [10], [11], [16], [25]. Existing works can be classified into two categories depending on whether the schema information is utilized or not [24]: 1) the schema-based approach [1], [6], [7], [10], [11] and 2) the schema-free approach [2], [16], [25]. The former first creates SQL queries that connect the relations, to which the tuples containing the query keywords belong, using natural join operations; then, it retrieves the JTTs by evaluating each SQL query. The latter first translates relational data to graph data by mapping each tuple in a relation to a node and each reference between tuples to an edge in the graph; then, it retrieves the JTTs by finding subtrees or subgraphs that connect the tuples containing the query keywords from the graph data. Since the schema-based approach does not have the overhead of translating relational data to graph data, it has an advantage in handling a large amount of data. On the other hand, the schema-free approach has the advantage of differentiating the importance of the JTTs at the instance level. In this paper, we use the schema-based approach to handle a large amount of data.
The scoring methods of existing schema-based approaches can be classified into two categories: 1) the syntactic scoring methods [1], [6] and 2) the IR-combined scoring methods [7], [10], [11]. The intuition of the syntactic scoring methods is that the more joins are needed to connect the tuples containing the query keywords, the less closely associated those tuples are [6]. The syntactic scoring methods use the syntactic score, which promotes the ranks of the JTTs where the tuples containing the query keywords are most closely adjacent to each other. That is, the representative syntactic score is calculated as the inverse of the JTT size [1], [6]. Here, the JTT size is defined as the number of relations where each relation includes a tuple in the JTT [11].

The intuition of the IR-combined scoring methods is that text attributes in relational databases can contain rich contents that can be useful for scoring JTTs, while those text contents are not considered in the syntactic scoring methods [7]. The IR-combined scoring methods use a combination of the syntactic score and the IR score that has been used in information retrieval. The representative example of the IR score is TF*IDF, where TF represents the term frequency for a given keyword in a tuple, DF represents the document frequency for a given keyword, and IDF represents the inverse of DF [13].

However, none of the existing works considers semantic relevancy among the relations containing the tuples in the JTT. In Sect. 3, we observe that semantic relevancy among relations must be considered as an important factor for improving the ranking quality of JTTs. In this paper, we first propose a structure of relations that are semantically strongly related, which we call the strongly related tree (SRT). An SRT is a tree of relations that maximally connects the relations through the lossless join property. Next, we propose an SRT score that reflects the concept of the SRT in the scoring function.
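The syntactic and IR-combined scoring functions just described can be sketched as follows. This is a minimal illustration using the simplified formulas of the paper's examples; representing a JTT by its size plus the text of its tuples, and supplying a precomputed idf table, are our own assumptions:

```python
def syntactic_score(jtt_size):
    """Syntactic score of a JTT: the inverse of its size (DISCOVER-style)."""
    return 1.0 / jtt_size

def ir_score(jtt_tuple_texts, keywords, idf):
    """Simplified TF*IDF IR score: sum over tuples of tf(k) * idf(k)."""
    score = 0.0
    for text in jtt_tuple_texts:
        words = text.lower().split()
        for k in keywords:
            score += words.count(k.lower()) * idf.get(k, 0.0)  # idf = 0 when DF = 0
    return score

def ir_combined_score(jtt_tuple_texts, jtt_size, keywords, idf):
    """IR-combined score = syntactic score * IR score."""
    return syntactic_score(jtt_size) * ir_score(jtt_tuple_texts, keywords, idf)

# Example 1 shape: T1 (size 3) and T2 (size 5) each contain one tuple with
# "Washington" and one with "Smith"; idf is 1 for both keywords.
idf = {"Washington": 1.0, "Smith": 1.0}
t1 = ["Washington", "n1-text", "Smith"]          # c1 - n1 - s1
t2 = ["Washington", "o1", "l1", "ps1", "Smith"]  # c1 - o1 - l1 - ps1 - s1
print(round(syntactic_score(3), 3))                                      # 0.333
print(round(ir_combined_score(t1, 3, ["Washington", "Smith"], idf), 3))  # 0.667
print(round(ir_combined_score(t2, 5, ["Washington", "Smith"], idf), 3))  # 0.4
```

With this data, T1 scores 0.333 syntactically and 0.667 IR-combined while T2 scores 0.2 and 0.4, reproducing Table 1.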
The SRT score is the first ranking method that applies semantic relevancy among relations to ranking the results of keyword queries. Semantic relevancy among relations has been proposed to unambiguously interpret selection queries in the universal relational model [12]. However, it has never been used to rank keyword query results effectively. Last, we propose a new ranking method, SRT-Rank, that ranks the JTTs by using a new scoring function augmenting the scoring functions of existing methods with the SRT score.

To show the effectiveness of SRT-Rank, we perform experiments on synthetic and real datasets by applying SRT-Rank to the representative existing methods. Experimental results show that SRT-Rank improves the ranking quality in terms of four quality measures (the mean normalized discounted cumulative gain (nDCG), the number of queries whose top-1 result is relevant to the query, the mean reciprocal rank, and the mean average precision) by up to 46.9%, 160.0%, 61.7%, and 63.8%, respectively, compared with existing methods. We also show that the query performance of SRT-Rank is comparable to or better than those of existing methods. Specifically, we show that SRT-Rank improves the query performance by up to 1.78 times.

The rest of this paper is organized as follows. Section 2 formally defines the top-k keyword query in relational databases. Section 3 presents the motivation of this paper. Section 4 explains the semantic unit based on the lossless join property. Section 5 proposes the SRT, which is a structure of relations that are semantically strongly related, and SRT-Rank, which utilizes the concept of the SRT for ranking the JTTs. Section 6 presents the experimental results. Section 7 describes related work. Section 8 summarizes and concludes the paper.

2. Problem Definition

We consider a relational database schema R consisting of a set of n relations {R_1, R_2, ..., R_n}. We denote the primary key-foreign key relationship where a foreign key of R_i references the primary key of R_j as R_i → R_j. A schema graph SG is a directed graph that maps each relation to a node and each primary key-foreign key relationship between relations to a directed edge. We denote the set of functional dependencies that are defined on R as F.

A top-k keyword query Q in relational databases is defined as a pair (W, k) where W is a set of keywords {w_1, w_2, ..., w_l} and k is the desired number of results. The results of Q are the k JTTs with the highest (or the lowest) scores as calculated by a scoring function. Without loss of generality, in the rest of the paper, we assume that we obtain the JTTs with the highest scores as the results. In this paper, we use AND semantics, i.e., only trees that contain all query keywords can be JTTs. We denote the size of a JTT T as size(T).

We call a subset of tuples in a relation in the relational database a tuple set [7]; the set of tuples containing the keywords of a query Q in a relation R_i the non-free tuple set [7] of R_i, denoted by R_i^Q; and the whole set of tuples in R_i the free tuple set [7], denoted by R_i^{}. The Candidate Network (simply, CN) [7] of a query Q is a tree connecting free tuple sets and non-free tuple sets for Q through join operators. From now on, to represent a join operator connecting two relations in a CN, we use the primary key-foreign key relationship between the two relations. We use CNs to generate JTTs as we explain in Sect. 3; we denote a CN generating a JTT T as CN(T) [11].

3. Motivation

In this section, we review existing methods using the schema-based approach for top-k keyword queries in relational databases and discuss their limitations. Existing methods use a relation as a unit to rank JTTs.
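The tuple-set notions defined in Section 2 are easy to make concrete. A minimal sketch (the relation contents and helper names are our own, loosely following the paper's TPC-H sample):

```python
def non_free_tuple_set(relation_tuples, keywords):
    """R_i^Q: the tuples of R_i whose text attributes contain a query keyword.
    (R_i^{} , the free tuple set, is simply relation_tuples itself.)"""
    result = []
    for t in relation_tuples:
        words = " ".join(str(v).lower() for v in t.values()).split()
        if any(k.lower() in words for k in keywords):
            result.append(t)
    return result

# Hypothetical fragment of the TPC-H sample dataset
db = {
    "Customer": [{"custkey": "c1", "name": "Washington"}],
    "Supplier": [{"suppkey": "s1", "name": "Smith"}],
    "Nation":   [{"nationkey": "n1", "name": "Korea"}],
}
keywords = ["Washington", "Smith"]
nfts = {r: non_free_tuple_set(ts, keywords) for r, ts in db.items()}
# e.g. nfts["Customer"] contains only c1; nfts["Nation"] is empty
```

For the query {"Washington", "Smith"}, this yields Customer^Q = {c1} and Supplier^Q = {s1}, with the other non-free tuple sets empty, matching Example 1 below.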
Specifically, the syntactic scoring method simply uses the number of relations in the JTTs. DISCOVER [6], a representative syntactic scoring method, uses Syntactic-score(T, Q) = 1/size(T) as the scoring function, where T is a JTT, so that the smaller size(T) is, the more relevant T is to the query. The IR-combined scoring method also uses the syntactic score as a base score. The IR-combined scoring methods such as EFFICIENT [7], EFFECTIVE [10], and SPARK [11] use IR-combined-score(T, Q) = Syntactic-score(T, Q) * IR-score(T, Q) as the scoring function, taking into account not only the syntactic score but also the IR score.

However, neither kind of method takes into account semantic relevancy among relations. As a result, in some cases, none of the existing methods can rank the JTTs effectively. Examples 1 and 2 demonstrate these cases. Example 1 shows the limitation of using the syntactic score; Example 2, that of using the IR score. In these examples, for simplicity, we use TF*IDF [13] as the IR score. We also assume that the IR score of T, denoted by IR-score(T, Q), is calculated by Σ_{t ∈ T} IR-score(t, Q), where t is a tuple in T. Here, if DF is zero, we regard IDF as zero. We denote TF and IDF for a keyword k by tf_k and idf_k, respectively.

Table 1 The ranking of the results for Q: ({"Washington", "Smith"}, 2) in existing methods.

  Scoring method | T_1: score (rank) | T_2: score (rank)
  Syntactic      | 0.333 (top-1)     | 0.2 (top-2)
  IR-combined    | 0.667 (top-1)     | 0.4 (top-2)

Example 1. Let us suppose that a user gives a top-k keyword query Q: ({"Washington", "Smith"}, 2) to find the information that a customer whose name is "Washington" orders parts from a supplier whose name is "Smith" from the TPC-H database in Fig. 2. Since a tuple c_1 in the Customer relation contains "Washington" and s_1 in the Supplier relation contains "Smith", the non-free tuple sets of Q are Customer^Q = {c_1} and Supplier^Q = {s_1}.
The free tuple sets are Region^{}, Nation^{}, Customer^{}, Supplier^{}, Orders^{}, LineItem^{}, Partsupp^{}, and Part^{}. We obtain CNs for Q by joining free tuple sets and non-free tuple sets where each leaf node of a CN is mapped to a non-free tuple set. The CNs so obtained are cn_1: Customer^Q → Nation^{} ← Supplier^Q and cn_2: Customer^Q ← Orders^{} ← LineItem^{} → Partsupp^{} → Supplier^Q. The results for Q are T_1: c_1 → n_1 ← s_1, which is obtained by evaluating cn_1, and T_2: c_1 ← o_1 ← l_1 → ps_1 → s_1, which is obtained by evaluating cn_2. As shown in Table 1, Syntactic-score(T_1, Q) = 1/3 = 0.333 and Syntactic-score(T_2, Q) = 1/5 = 0.2 since size(T_1) = 3 and size(T_2) = 5.

There are two approaches to define a document and calculate DF for a query keyword. The first approach defines a single text attribute value as a document and calculates DF for a query keyword as the number of tuples that have the given keyword in a text attribute of a relation [7], [10]. In this approach, we calculate the IR score of a JTT for a query keyword by summing up the IR scores of each text attribute value in the JTT. The second approach defines a JTT as a document and calculates DF for a query keyword as the number of JTTs with the same schema that contain the given keyword [11]. Here, for a JTT T, the schema of T is a tree where a node represents a relation containing a tuple in T, and an edge represents a join between two relations containing two adjacent tuples in T. The second approach is more effective than the first approach. However, calculating DF in the second approach is much more complex than in the first approach. Thus, explaining the examples using the second approach may dilute the focus of the paper. Hence, for ease of explanation, we have chosen to use the first approach in the examples. However, we have used both approaches in the experiments, i.e., the first approach for EFFICIENT [7] and the second approach for SPARK [11].
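The first (attribute-level) DF definition is straightforward to sketch. Here IDF is the simplified inverse-of-DF form used in the examples (zero when DF is zero), and all names are our own:

```python
def df_attribute_level(texts, keyword):
    """Approach 1: DF of a keyword = number of text attribute values
    (one per tuple) that contain the keyword."""
    return sum(1 for text in texts if keyword.lower() in text.lower().split())

def idf_attribute_level(texts, keyword):
    """Simplified IDF: the inverse of DF, taken as 0 when DF is 0."""
    df = df_attribute_level(texts, keyword)
    return 0.0 if df == 0 else 1.0 / df

# Hypothetical text column of a Customer relation
names = ["Washington", "Adams", "Washington Jr"]
print(df_attribute_level(names, "Washington"))  # 2
```

The second (JTT-level) definition would instead count JTTs sharing a schema, which requires materializing query results and is omitted here, mirroring the paper's choice to explain only the first approach.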
Besides cn_1 and cn_2, there are more CNs for Q: cn_3: Customer^Q, cn_4: Supplier^Q, cn_5: Customer^Q → Nation^{} ← Customer^Q, cn_6: Supplier^Q → Nation^{} ← Supplier^Q, cn_7: Customer^Q → Nation^{} → Region^{} ← Nation^{} ← Supplier^Q, cn_8: Customer^Q → Nation^{} → Region^{} ← Nation^{} ← Customer^Q, and cn_9: Supplier^Q → Nation^{} → Region^{} ← Nation^{} ← Supplier^Q. However, these CNs do not generate JTTs and thus are not considered here.

Table 2 The ranking of the results for Q: ({"Gladiator", "Russell", "Crowe"}, 2) in existing methods.

  Scoring method | T_1 (m_1 ← c_1 → p_1): score (rank) | T_2 (mi_1 → m_1 ← c_1 → p_1): score (rank)
  Syntactic      | 0.333 (top-1)                       | 0.25 (top-2)
  IR-combined    | 1 (top-2)                           | 1.25 (top-1)

Fig. 4 The ER diagram and a sample dataset of the IMDB database.

The IR-combined scoring method calculates the scores of the JTTs as follows. IR-score(T_1, Q) is computed as follows: IR-score(c_1, Q) = tf_Washington * idf_Washington + tf_Smith * idf_Smith = 1*1 + 0*0 = 1; IR-score(s_1, Q) = tf_Washington * idf_Washington + tf_Smith * idf_Smith = 0*0 + 1*1 = 1. Therefore, IR-score(T_1, Q) = IR-score(c_1, Q) + IR-score(s_1, Q) = 1 + 1 = 2. Likewise, IR-score(T_2, Q) = IR-score(c_1, Q) + IR-score(s_1, Q) = 1 + 1 = 2. Consequently, as shown in Table 1, IR-combined-score(T_1, Q) = 2 × 1/3 = 0.667; IR-combined-score(T_2, Q) = 2 × 1/5 = 0.4.

In Example 1, we observe that T_1 is ranked higher than T_2 for both syntactic scoring and IR-combined scoring. However, we note that this result is less likely to be the user's intention. The user who gives the keyword query {"Washington", "Smith"} is likely to prefer T_2 to T_1. T_2 represents the information that a customer whose name is "Washington" orders parts from a supplier whose name is "Smith"; T_1 represents the information that a customer whose name is "Washington" belongs to the same nation as the one to which a supplier whose name is "Smith" belongs. Here, we observe that the tuples in T_2 are more strongly related in terms of semantics than those in T_1, even though size(T_2) is larger than size(T_1). However, existing methods rank T_1 higher than T_2 since they do not consider semantic relevancy among relations.

Example 2. Let us consider the IMDB database [8]; its ER diagram is shown in Fig. 4 (a).
(For ease of explanation, we use simplified scoring functions in Example 1. The scoring functions used in existing works are different but produce the same ranking results as in Table 1. Specifically, according to Eq. (2) of EFFICIENT [7], the scores of T_1 and T_2 are 1.0730 and 0.6438, respectively; according to Eq. (5.2) of EFFECTIVE [10], the scores are 0.9742 and 0.8814, respectively; according to Eq. (1), Eq. (2), and Eq. (3) of SPARK [11], the scores are 1.4038 and 0.8584, respectively.)

This database represents information on movies and actors that appear in the movies. Figure 4 (b) shows a sample dataset of the IMDB database. In this dataset, MovieInfo consists of representative lines in the movie. Let us suppose that a user gives a top-k keyword query Q: ({"Gladiator", "Russell", "Crowe"}, 2) to retrieve the movie "Gladiator" starring "Russell Crowe". Since a tuple p_1 in the Person relation contains "Russell" and "Crowe", and both m_1 in the Movie relation and mi_1 in the MovieInfo relation contain "Gladiator", the non-free tuple sets of Q are Person^Q = {p_1}, Movie^Q = {m_1}, and MovieInfo^Q = {mi_1}. The free tuple sets are Person^{}, Movie^{}, MovieInfo^{}, Cast^{}, Character^{}, and Role^{}. The CNs obtained are cn_1: Movie^Q ← Cast^{} → Person^Q and cn_2: MovieInfo^Q → Movie^Q ← Cast^{} → Person^Q. The results for Q are T_1: m_1 ← c_1 → p_1 from cn_1 and T_2: mi_1 → m_1 ← c_1 → p_1 from cn_2. As shown in Table 2, Syntactic-score(T_1, Q) = 1/3 = 0.333 and Syntactic-score(T_2, Q) = 1/4 = 0.25 since size(T_1) = 3 and size(T_2) = 4. IR-score(T_1, Q) is computed as follows: IR-score(m_1, Q) = tf_Gladiator * idf_Gladiator + tf_Russell * idf_Russell + tf_Crowe * idf_Crowe = 1*1 + 0*0 + 0*0 = 1; IR-score(p_1, Q) = tf_Gladiator * idf_Gladiator + tf_Russell * idf_Russell + tf_Crowe * idf_Crowe = 0*0 + 1*1 + 1*1 = 2. Therefore, IR-score(T_1, Q) = IR-score(m_1, Q) + IR-score(p_1, Q) = 1 + 2 = 3.
Likewise, IR-score(T_2, Q) = IR-score(mi_1, Q) + IR-score(m_1, Q) + IR-score(p_1, Q) = 2 + 1 + 2 = 5 since IR-score(mi_1, Q) = tf_Gladiator * idf_Gladiator + tf_Russell * idf_Russell + tf_Crowe * idf_Crowe = 2*1 + 0*0 + 0*0 = 2. Consequently, as shown in Table 2, IR-combined-score(T_1, Q) = 3 × 1/3 = 1; IR-combined-score(T_2, Q) = 5 × 1/4 = 1.25.

In Example 2, we observe that T_1 is ranked higher than T_2 for syntactic scoring, but T_2 is ranked higher than T_1 for IR-combined scoring. We note that the result of the IR-combined scoring method is less likely to be the user's intention. The user who gives the keyword query {"Gladiator", "Russell", "Crowe"} is likely to prefer T_1 to T_2. T_1 represents the movie "Gladiator" starring "Russell Crowe"; T_2 represents the lines that refer to "Gladiator" in the movie "Gladiator" starring "Russell Crowe". The IR-combined scoring method ranks T_2 higher than T_1 since it considers only the IR score without considering semantic relevancy among relations.

As shown in Examples 1 and 2, there are some cases where existing methods do not rank the JTTs effectively. To rank the JTTs more effectively, we need to take into account semantic relevancy among relations in addition to the existing ranking measures. In Sect. 4, we introduce the concept of the semantic unit that defines the relevancy among relations in terms of functional dependency. In Sect. 5, we

propose a new semantic unit and a new scoring method using the semantic unit.

4. Semantic Unit

In this section, we first define a semantic unit based on the lossless join property. Then, we define the semantic relevancy among relations using the semantic unit.

The database schema designer decomposes a relation into a set of multiple relations to minimize redundancies through the normalization process [19]. The lossless join property must be preserved during normalization since it guarantees that an original relation can be correctly restored by joining the decomposed relations [19]. Thus, we can consider the lossless join property to represent a set of relations that are semantically related. If the set of relations R_i and R_j has the lossless join property, it is indicated 1) by functional dependencies or 2) by multivalued dependencies. The former requires ((R_i ∩ R_j) → (R_i - R_j)) ∈ F+ or ((R_i ∩ R_j) → (R_j - R_i)) ∈ F+; the latter, ((R_i ∩ R_j) ↠ (R_i - R_j)) ∈ F+ or ((R_i ∩ R_j) ↠ (R_j - R_i)) ∈ F+ [19]. The lossless join property indicated by functional dependencies is a stronger property than the one indicated by multivalued dependencies. That is, if a functional dependency B_i → A_i holds, so does the multivalued dependency B_i ↠ A_i. However, even if a multivalued dependency B_i ↠ A_i holds, the functional dependency B_i → A_i might not hold. To distinguish the lossless join property indicated only by multivalued dependencies from that indicated by both functional dependencies and multivalued dependencies, we denote the former as LJPbyMVDs and the latter as LJPbyFDs. Figures 5 and 6 show sets of relations that have LJPbyMVDs and LJPbyFDs, respectively. In these figures, the dotted single-headed arrow represents a functional dependency; the dotted double-headed arrow, a multivalued dependency.
Here, we classify the set of attributes into three groups according to their characteristics: 1) {A_1, ..., A_n} is the set of attributes that are determined by other attributes, 2) {B_1, ..., B_m} is the set of attributes that determine other attributes, and 3) {C_1, ..., C_k} is the set of the remaining attributes. Figure 5 shows a set of relations that has LJPbyMVDs. The set of relations R_i and R_j has LJPbyMVDs since a) ((R_i ∩ R_j) ↠ (R_i - R_j)) ∈ F+ holds and b) neither ((R_i ∩ R_j) → (R_i - R_j)) ∈ F+ nor ((R_i ∩ R_j) → (R_j - R_i)) ∈ F+ holds. Figure 6 shows a set of relations that has LJPbyFDs. Here, ((R_i ∩ R_j) → (R_i - R_j)) ∈ F+ holds.

We claim that LJPbyFDs is a more desirable property than LJPbyMVDs in representing a set of relations that are semantically related. If a set of relations has LJPbyMVDs, the original relation includes attributes that are independent of each other [19]. That is, if α ↠ β holds on R, for a given value of α, there exists a Cartesian product of β and R - β - α. For example, in Fig. 5, for a given value of {B_1, ..., B_m}, there exists a Cartesian product of {A_1, ..., A_n} and {C_1, ..., C_k}, which means that {A_1, ..., A_n} and {C_1, ..., C_k} are independent of each other. Thus, those attributes must be separated into different relations, and we consider the relations R_i and R_j as independent relations. Here, as shown in Fig. 5 (c), if a functional dependency B_1 → {B_2, ..., B_m} holds, we can further normalize {R_i, R_j} into the 4th normal form [19], which results in {R_i1, R_k, R_j1} where R_i1 = {A_1, ..., A_n, B_1}, R_k = {B_1, ..., B_m}, and R_j1 = {B_1, C_1, ..., C_k}.

(F+ is the closure of F, which is the set of all functional and multivalued dependencies that can be inferred from F [19].)

Fig. 5 A set of relations that has LJPbyMVDs.

Fig. 6 A set of relations that has LJPbyFDs.
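The LJPbyFDs condition can be tested with a standard attribute-closure computation: {R_i, R_j} has the lossless join property by FDs when the closure of R_i ∩ R_j contains R_i - R_j or R_j - R_i. A sketch (representing FDs as (lhs, rhs) attribute tuples is our own choice):

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def has_ljp_by_fds(ri, rj, fds):
    """True iff (Ri ∩ Rj) -> (Ri - Rj) or (Ri ∩ Rj) -> (Rj - Ri) is in F+."""
    c = closure(set(ri) & set(rj), fds)
    return set(ri) - set(rj) <= c or set(rj) - set(ri) <= c

# A Fig. 6-shaped case: Ri = {A, B}, Rj = {B, C}, with the FD B -> A
print(has_ljp_by_fds({"A", "B"}, {"B", "C"}, [(("B",), ("A",))]))  # True
```

Without the FD B → A the same decomposition fails the test, which is exactly the LJPbyMVDs-versus-LJPbyFDs distinction drawn above.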
If a set of relations has LJPbyFDs, we have a relation in the set, a root relation, where the entire set of attributes in the root relation functionally (i.e., uniquely) determines the entire set of attributes of the other relations in the set.
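When the functional dependencies between relations come from primary key-foreign key references, the root relation can be recognized structurally: as Lemma 2 below shows, a set of relations connected into a rooted directed tree via PK-FK relationships has LJPbyFDs. A sketch of that structural check, assuming edges are given as (referencing, referenced) pairs:

```python
def is_rooted_directed_tree(relations, fk_edges):
    """Check whether PK-FK edges (referencing -> referenced) connect
    `relations` into a rooted directed tree: exactly n-1 edges, a single
    root with no incoming reference, and every relation reachable from it."""
    relations = set(relations)
    edges = [(a, b) for a, b in fk_edges if a in relations and b in relations]
    if len(edges) != len(relations) - 1:
        return False
    indeg = {r: 0 for r in relations}
    for _, b in edges:
        indeg[b] += 1
    roots = [r for r in relations if indeg[r] == 0]
    if len(roots) != 1:
        return False
    out = {r: [] for r in relations}
    for a, b in edges:
        out[a].append(b)
    seen, stack = {roots[0]}, [roots[0]]
    while stack:  # depth-first reachability along reference direction
        for nxt in out[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == relations

# The relations of cn_2 in Example 1: LineItem references Orders and Partsupp;
# Orders references Customer; Partsupp references Supplier.
rels = ["LineItem", "Orders", "Partsupp", "Customer", "Supplier"]
edges = [("LineItem", "Orders"), ("LineItem", "Partsupp"),
         ("Orders", "Customer"), ("Partsupp", "Supplier")]
print(is_rooted_directed_tree(rels, edges))  # True; root relation: LineItem
```

Here LineItem plays the role of the root relation: following references, it functionally determines every other relation in the set.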

KIM et al.: SRT-RANK: RANKING KEYWORD QUERY RESULTS IN RELATIONAL DATABASES USING THE STRONGLY RELATED TREE 2403

The relation Rj in Fig. 6 is the root relation. The entire set of attributes {B1, …, Bm, C1, …, Ck} of Rj functionally determines the entire set of attributes {A1, …, An, B1, …, Bm} of the relation Ri. Therefore, we can conceptually consider all the attributes in the original relation Rij in Fig. 6 as one relation. Thus, we have the following definition of the semantic unit.

Definition 1. A set of relations Rsem is defined as a semantic unit if Rsem has LJPbyFDs with respect to the set of functional dependencies F that consists of the functional dependencies among the attributes in each relation of Rsem and the ones among the relations in Rsem.

We define the semantic relevancy among relations as in Definition 2. Here, we regard that there is an undirected edge between two semantic units if they contain the same relation in common.

Definition 2. The semantic relevancy among relations is how semantically strongly related the relations are; it is inversely proportional to the minimum number of semantic units that collectively contain the relations and are connected.

Let us consider a set of relations Rset. If all relations in Rset are contained in one semantic unit, we can infer that the relations have been decomposed from one original relation since the relations in a semantic unit have LJPbyFDs (a tuple in one relation uniquely determines tuples in the other relations). Thus, we can conclude that the semantic relevancy among the relations in Rset is strong. On the other hand, if the relations in Rset are not all contained in one semantic unit, we cannot infer that the relations have been decomposed from one original relation. Thus, we can conclude that the semantic relevancy among the relations in Rset is not strong. A primary key-foreign key relationship implies a functional dependency between relations, as shown in Lemma 1.
For the functional dependencies among relations mentioned in Definition 1, we assume that all such dependencies are represented (i.e., implied) by the primary key-foreign key relationships.

Lemma 1. If a foreign key Y in a relation Rj references a primary key X in Ri, the functional dependency Rj → Ri holds.

Proof: Suppose t1, t2 ∈ Rj where t1[Y] = t2[Y], and t3, t4 ∈ Ri where t3 (t4) is referenced by t1 (t2). 1) Due to the referential integrity constraint, we get t3[X] = t4[X]. 2) Since t3[X] = t4[X] and X is the primary key of Ri, t3[Ri] = t4[Ri]. By 1) and 2), the functional dependency Y → Ri holds. Furthermore, since Y ⊆ Rj, the functional dependency Rj → Ri holds.

We now discuss how to check whether a set of relations has LJPbyFDs by using the primary key-foreign key relationships. In general, we need expensive operations to calculate Ri ∩ Rj and to check whether ((Ri ∩ Rj) → (Ri − Rj)) ∈ F+ or ((Ri ∩ Rj) → (Rj − Ri)) ∈ F+ holds. Since the primary key-foreign key relationships specified in the schema represent the functional dependencies between relations, we use them to find sets of relations that have LJPbyFDs. In Lemma 2, we prove that LJPbyFDs inferred from the rooted directed tree connected by the primary key-foreign key relationships is correct. In the tree, a relation maps to a node, and a primary key-foreign key relationship between relations maps to a directed edge.

Lemma 2. The set of relations {R1, R2, …, Rn} has LJPbyFDs if we can create a rooted directed tree by connecting the relations via the primary key-foreign key relationships.

Proof: If Ri has a foreign key referencing the primary key of Rj (i.e., there is a primary key-foreign key relationship from Rj to Ri), Ri ∩ Rj is a superset of the primary key of Rj. Consequently, the condition ((Ri ∩ Rj) → (Rj − Ri)) ∈ F+ holds. Hence, {Ri, Rj} has LJPbyFDs. We construct a rooted directed tree DT by connecting the relations in {R1, R2, …, Rn}.
Let Rroot be the root node of DT, Rrc a child node of Rroot, and Rrcc a child node of Rrc. Since there is a primary key-foreign key relationship from Rrc to Rroot, {Rroot, Rrc} has LJPbyFDs. Similarly, since there is a primary key-foreign key relationship from Rrcc to R, where R = Rroot ⋈ Rrc, {R, Rrcc} has LJPbyFDs. Since both {R, Rrcc} and {Rroot, Rrc} have LJPbyFDs, {Rroot, Rrc, Rrcc} has LJPbyFDs [19]. Therefore, if we add each descendant node of Rroot to the set {Rroot} one by one, the set has LJPbyFDs. Hence, the whole set of relations in DT has LJPbyFDs. (A rooted directed tree is a tree whose root node has a directed path to each node in the tree [2].)

For a set of three relations {Ri, Rj, Rk}, let us assume that there is no cycle when we connect the relations via primary key-foreign key relationships. Then, every combination of the relations that can be connected via primary key-foreign key relationships is as follows, where A → B denotes a directed edge from A to B (i.e., A has a foreign key referencing the primary key of B): (1) Ri ← Rj → Rk, (2) Ri → Rj → Rk, (3) Ri ← Rj ← Rk, and (4) Ri → Rj ← Rk. (1), (2), and (3) are semantic units; (4) is not. Examples 3, 4, and 5 show (1), (2), and (4), respectively. (3) is identical to (2) when Ri and Rk are exchanged.

Example 3. Let us consider the relations Supplier, PartSupp, and Part in the TPC-H database. Figure 7(a) represents these relations and the relationships among them. Figure 7(b) shows a relation SPaP that is a join of these relations. Since we can create a rooted directed tree, the set of relations has LJPbyFDs by Lemma 2. Therefore, the set {Supplier, PartSupp, Part} is a semantic unit. We note that, for each tuple in PartSupp, we have one integrated tuple in SPaP. That is, each tuple in the relation SPaP represents the information that a Supplier supplies a Part.

Fig. 7  A semantic unit {Supplier, PartSupp, Part} in the TPC-H database.

Example 4. Let us consider the relations Customer, Nation, and Region in the TPC-H database. Figure 8(a) represents these relations and the relationships among them. Figure 8(b) shows a relation CNR that is a join of these relations. Since we can create a rooted directed tree, the set of relations has LJPbyFDs by Lemma 2. Therefore, the set {Customer, Nation, Region} is a semantic unit. We note that, for each tuple in Customer, we have one integrated tuple in CNR. That is, each tuple in the relation CNR represents the information on a customer together with the nation and the region to which the customer belongs.

Fig. 8  A semantic unit {Customer, Nation, Region} in the TPC-H database.

Example 5. Let us consider the relations Customer, Nation, and Supplier in the TPC-H database. Figure 9(a) represents these relations and the relationships among them. Figure 9(b) shows the relation CNS that is a join of these relations. Since we cannot create a rooted directed tree, {Customer, Nation, Supplier} is not a semantic unit. If we assume that the only valid relations of CNS are those that are joins of these three relations, the set has LJPbyMVDs. In CNS, the multivalued dependencies {nationkey, n_name} →→ {custkey, c_name} and {nationkey, n_name} →→ {suppkey, s_name} hold, which means that {custkey, c_name} and {suppkey, s_name} are independent of each other. Therefore, these attributes must be separated into different relations. Consequently, we consider the relations Customer ⋈ Nation and Supplier ⋈ Nation as independent relations. Here, Customer corresponds to Ri1 in Fig. 5(c), Nation to Rk, and Supplier to Rj1.

Fig. 9  A set of relations {Customer, Nation, Supplier} in the TPC-H database.

5. SRT-Rank: Ranking Keyword Query Results Using the Strongly Related Tree

In this section, we propose SRT-Rank, a new method for improving the ranking quality of top-k JTTs in relational databases. In Sect. 5.1, we define a new structure that employs the notion of the semantic unit, which we call the strongly related tree (SRT). In Sect. 5.2, we first define the SRT score, by which an SRT contributes to the scoring function. Then, we propose SRT-Rank, which ranks the JTTs based on the SRT score. In Sect. 5.3, we propose an efficient method for finding the SRTs. In Sect. 5.4, we present a top-k keyword query processing algorithm that extends existing ones so as to use SRT-Rank. In Sect. 5.5, we compare the SRT with the maximal object, which has been proposed for the universal relation based on the lossless join property.

5.1 Strongly Related Tree (SRT)

In this section, we propose a new semantic unit defined on the CN rather than on the schema graph. The reason why we define a semantic unit on the CN is to take into account not only the semantic relevancy among different relations but also the relevancy among occurrences of the same relation with different roles. If a relation serves multiple roles in the schema graph, the relation can be involved in a CN more than once with different roles. The CN differentiates occurrences of the same relation with different roles by representing them as separate relations, while the schema graph does not. Example 6 shows an example of a CN that contains the same relation with different roles.

Example 6. Let us consider a top-k keyword query Q: ({"USA", "Cruise", "England", "McKellen"}, 2) on the sample dataset of the TPC-H database shown in Fig. 2. A CN for Q is cn1: Nation^Q ⋈ Customer^Q ⋈ Orders^{} ⋈ LineItem^{} ⋈ Partsupp^{} ⋈ Supplier^Q ⋈ Nation^Q. Here, the Nation relation is involved in cn1 twice.
The first Nation relation represents the set of tuples for the nations that customers belong to; the second Nation, the set of tuples for the nations that suppliers belong to. In Definition 3, we define the strongly related tree as a semantic unit.

Definition 3. A subtree ST of a CN is defined as a strongly related tree (SRT) if ST satisfies the following conditions: 1) (losslessness) {Ri | Ri ∈ ST} has LJPbyFDs. (While a node in the CN represents a relation instance, the lossless join property is a property of the relation schema. Thus, we deal with the lossless join property by using the schema corresponding to the CN. For simplicity, we refer to the schema corresponding to the CN as the CN if there is no ambiguity.)

2) (maximality) There is no relation Rj such that Rj ∉ ST, Rj is adjacent to a node in ST, and {Ri | Ri ∈ ST} ∪ {Rj} has LJPbyFDs.

As in Definition 3, an SRT is a subtree of a CN where the set of relations in the subtree has LJPbyFDs. An SRT is maximal in the sense that it cannot include more relations without violating LJPbyFDs. We use an SRT as a unit for ranking JTTs so as to utilize the semantic relevancy among relations. While existing works employ a syntactic score based on the number of relations in CN(T) for a JTT T, we employ a semantic score based on the number of SRTs in CN(T). This way, we can make up for the limitation of the syntactic score with the semantic relevancy among relations. In other words, if we rank JTTs using the SRT, we can rank higher those JTTs from a CN whose relations have strong semantic relevancy even if the CN has a larger number of relations and thus produces a lower syntactic score.

Table 3  The scores of the results for Q: ({"Washington", "Smith"}, 2).
  Scoring method | T1 score (SRT-score, other score) (rank) | T2 score (SRT-score, other score) (rank)
  Syntactic      | (0.5, 0.333) (top-2)                     | (1, 0.2) (top-1)
  IR-combined    | (0.5, 0.667) (top-2)                     | (1, 0.4) (top-1)

Fig. 10  SRTs in cn1 and cn2 for the query Q: ({"Washington", "Smith"}, 2).

Example 7. Let us consider again the two CNs cn1: Customer^Q ⋈ Nation^{} ⋈ Supplier^Q and cn2: Customer^Q ⋈ Orders^{} ⋈ LineItem^{} ⋈ Partsupp^{} ⋈ Supplier^Q in Example 1. Figures 10(a) and (b) show the SRTs in the two CNs, respectively. We cannot represent the whole set of relations in cn1 as one rooted directed tree. Instead, we can represent it as two rooted directed trees, as shown in Fig. 10(a). Hence, cn1 consists of two SRTs. Meanwhile, we can represent the whole set of relations in cn2 as one rooted directed tree, as shown in Fig. 10(b). Hence, cn2 consists of one SRT.
Since the number of SRTs in cn2 is smaller than that in cn1, the relations in cn2 have stronger semantic relevancy than those in cn1. Thus, using the SRT, we can rank JTTs from cn2 higher than those from cn1 even though the number of relations in cn2 is larger than that in cn1.

5.2 Scoring Function Based on the SRT

We define the SRT score of CN(T) of a JTT T as in Eq. (1) so that the smaller the number of SRTs in CN(T) is, the more relevant T is to the query Q. We denote the number of SRTs in a CN cn as NO_SRT(cn) and the set of JTTs for Q as JTTset(Q).

  SRT-score(T, Q) = 1 / NO_SRT(CN(T))    (1)

Table 4  The scores of the results for Q: ({"Gladiator", "Russell", "Crowe"}, 2).
  Scoring method | T1 score (SRT-score, other score) (rank) | T2 score (SRT-score, other score) (rank)
  Syntactic      | (1, 0.333) (top-1)                       | (0.5, 0.25) (top-2)
  IR-combined    | (1, 1) (top-1)                           | (0.5, 1.25) (top-2)

Now, we present a new ranking method, SRT-Rank, that takes advantage of the SRT score. SRT-Rank ranks the JTTs in the lexicographic order [17] so as to prioritize the SRT score over the other scores. The lexicographic order allows us to first rank the JTTs based on the SRT score and then to rank the JTTs with the same SRT score based on the existing scoring functions.

Example 8. Let us consider again the query Q: ({"Washington", "Smith"}, 2) in Example 1. For the JTTs T1 and T2, NO_SRT(CN(T1)) = 2 and NO_SRT(CN(T2)) = 1. Thus, SRT-score(T1, Q) = 1/2 and SRT-score(T2, Q) = 1/1, as shown in Table 3. Thus, SRT-Rank ranks T2 higher than T1 in both cases. This result better fits the user's intention.

Example 9. Let us consider again the query Q: ({"Gladiator", "Russell", "Crowe"}, 2) in Example 2. For the JTTs T1 and T2, NO_SRT(CN(T1)) = 1 and NO_SRT(CN(T2)) = 2. Thus, SRT-score(T1, Q) = 1/1 and SRT-score(T2, Q) = 1/2, as shown in Table 4. Thus, SRT-Rank ranks T1 higher than T2 in both cases. This result better fits the user's intention.

5.3 Finding the SRTs

We now discuss how to calculate NO_SRT(cn) for a CN cn efficiently.
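Before describing that calculation, the scoring and ranking above can be made concrete in a short sketch. It encodes a CN as a directed graph whose edge (u, v) means that relation u holds a foreign key referencing relation v, counts SRTs via the independent-node idea developed in this subsection, and ranks JTTs lexicographically. The encoding and all names are our illustrative assumptions, not the paper's code:

```python
def count_srts(nodes, edges):
    """NO_SRT(cn): the number of 'independent nodes', i.e.,
    nodes of cn without incoming edges."""
    has_incoming = {v for (_, v) in edges}
    return sum(1 for n in nodes if n not in has_incoming)

def srt_score(nodes, edges):
    # Eq. (1): SRT-score(T, Q) = 1 / NO_SRT(CN(T))
    return 1.0 / count_srts(nodes, edges)

def rank_jtts(jtts):
    """Lexicographic order: compare by SRT score first, then by the
    existing (e.g., IR-combined) score; higher is better."""
    return sorted(jtts, key=lambda j: (j["srt"], j["other"]), reverse=True)

# cn1: Customer^Q - Nation^{} - Supplier^Q (both ends reference Nation)
cn1 = (["Customer", "Nation", "Supplier"],
       [("Customer", "Nation"), ("Supplier", "Nation")])
# cn2: Customer^Q - Orders - LineItem - Partsupp - Supplier^Q
cn2 = (["Customer", "Orders", "LineItem", "Partsupp", "Supplier"],
       [("Orders", "Customer"), ("LineItem", "Orders"),
        ("LineItem", "Partsupp"), ("Partsupp", "Supplier")])

print(count_srts(*cn1), count_srts(*cn2))  # 2 1
jtts = [{"name": "T1", "srt": srt_score(*cn1), "other": 0.667},
        {"name": "T2", "srt": srt_score(*cn2), "other": 0.4}]
print([j["name"] for j in rank_jtts(jtts)])  # ['T2', 'T1']
```

This reproduces Example 8: T2, whose CN forms a single rooted directed tree, is ranked first even though its CN contains more relations and has a lower existing score.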
First, let us consider a naive method that finds all SRTs in cn and counts them. The method finds an SRT in cn as follows. For a subtree ST and a node Ra, where ST initially consists of a single node in cn and Ra is adjacent to ST, the method adds Ra to ST if {Ri | Ri ∈ ST} ∪ {Ra} has LJPbyFDs. The method repeats this process until there is no more Ra to add. Then, the resulting ST is an SRT. When n is the number of relations in cn, the method needs to check whether {Ri | Ri ∈ ST} ∪ {Ra} has the lossless join property at most n − 1 times. The time complexity of checking whether {Ri | Ri ∈ ST} ∪ {Ra} has LJPbyFDs is O(n). (To check this, we need to examine, for each Ri ∈ ST, the direction of the primary key-foreign key relationship between Ri and Ra; we need to do this at most n − 1 times, so the time complexity is O(n).) Hence, the time complexity of the naive method to find one SRT is O(n) × O(n) = O(n^2). We can find all SRTs in cn by repeating this

process until every node in cn is included in at least one SRT. Since the number of SRTs in cn is no more than n, we need to repeat the process of finding an SRT at most n times. Therefore, the time complexity of the naive method to find all SRTs is O(n) × O(n^2) = O(n^3).

Now, we propose a new efficient method, called Counting Independent Nodes (CIN for short), whose time complexity is O(n). The CIN method calculates NO_SRT(cn) as the number of nodes, which we call independent nodes, that do not have incoming edges from any other nodes in cn (see Lemma 3). It first marks the nodes that are pointed to by an edge while traversing all edges in cn. When the traversal is complete, it finds all unmarked nodes, i.e., the independent nodes. Therefore, the time complexity of the CIN method is O(e + n), where e is the number of edges in cn. Since e = n − 1 in a tree, we obtain O(e + n) = O(n).

Lemma 3. The CIN method finds NO_SRT(cn) correctly.

Proof: Suppose Rroot is an independent node. We show that the rooted directed tree DT connecting all the nodes that are reachable by directed paths from Rroot is an SRT. DT is an SRT since it satisfies the following two conditions: 1) {Ri | Ri ∈ DT} has LJPbyFDs; 2) for a node Ra adjacent to DT, {Ri | Ri ∈ DT} ∪ {Ra} does not have LJPbyFDs. Condition 1) is satisfied by Lemma 2. Condition 2) is satisfied for the following reasons. If Ra is a node adjacent to DT and R = ⋈_{Ri ∈ DT} Ri, {Ri | Ri ∈ DT} ∪ {Ra} has LJPbyFDs only if one of the following two conditions is satisfied: (i) ((R ∩ Ra) → Ra) ∈ F+ or (ii) ((R ∩ Ra) → R) ∈ F+. Ra is not reachable by directed paths from Rroot since all nodes reachable by directed paths from Rroot have already been included in DT. Hence, Condition (i) is not satisfied. Besides, there is no directed path from Ra to Rroot since Rroot does not have incoming edges. Hence, Condition (ii) is not satisfied. Therefore, {Ri | Ri ∈ DT} ∪ {Ra} does not have LJPbyFDs.

5.4 Query Processing Algorithm
Fig. 11  The skeleton of the Sparse [7] algorithm.

We present a query processing algorithm that extends the existing top-k keyword query processing algorithms to utilize SRT-Rank. In both the syntactic and the IR-combined scoring methods, all steps are the same except for the step of computing the scores of JTTs. Since SRT-Rank can be applied to all the methods in the same way, we show how to apply it using a representative existing method, the Sparse algorithm proposed by EFFICIENT [7], which adopts the IR-combined scoring method. (Besides the Sparse algorithm, EFFICIENT [7] has proposed the SinglePipelined and the GlobalPipelined algorithms to improve the efficiency. Since these three algorithms use the same scoring function, there is no difference in terms of effectiveness. We use the Sparse algorithm, which is the simplest.)

Figure 11 shows the skeleton of the Sparse [7] algorithm. The algorithm takes the schema graph SG, the keyword query Q, and the number of results to retrieve, k, as the input. It returns the top-k JTTs, JTTtopk, as the output. In Step 1, it creates a set of non-free tuple sets, NFTSset, of Q. In Step 2, it creates a set of CNs, CNset, for Q by using NFTSset and SG. In Step 3, it evaluates each CN in CNset and generates the top-k JTTs by calculating the scores of the JTTs generated by each CN. In Step 4, it returns the top-k JTTs.

Fig. 12  The SparseSRT-Rank algorithm.

In this paper, we extend Step 3 to use SRT-Rank, proposing SparseSRT-Rank in Fig. 12. The other steps

are the same as those of the Sparse algorithm [7]. Figure 12 shows the SparseSRT-Rank algorithm. In this algorithm, we define a class, CombinedScore, to represent the scoring function that combines the SRT score with the existing scoring functions. The CombinedScore class consists of two elements ordered in the lexicographic order: 1) SRTScore, which represents the SRT score, and 2) EfficientScore, which represents the EFFICIENT score. The algorithm takes a set of CNs, CNset, a keyword query, Q, and the number of results to retrieve, k, as the input. It returns the set of top-k JTTs, JTTtopk, as the output. In Step 1, it initializes the priority queue CNqueue, which keeps the CNs in the order of the CombinedScore, and the priority queue JTTqueue, which keeps the JTTs in the order of the CombinedScore. In Step 2, for each cn in CNset, it calculates the CombinedScore from the SRTScore and the EfficientScore of cn. The latter is the maximum possible EFFICIENT score of the JTTs that can be generated by cn according to the EFFICIENT scoring function. Then, it stores cn and its CombinedScore in CNqueue. In Step 2.1.1, it calculates the number of SRTs in cn by using the CIN method proposed in Sect. 5.3 and then calculates the CombinedScore.SRTScore of cn. In Step 2.1.2, it calculates the CombinedScore.EfficientScore of cn by using the scoring function of EFFICIENT [7]. In Step 2.1.3, it inserts cn and its CombinedScore into CNqueue. In Step 3, the algorithm generates JTTs by evaluating each CN in CNqueue and calculates the CombinedScore of each JTT. First, it evaluates each cn in CNqueue in the order of the CombinedScore and obtains the JTTs for cn. Then, it assigns the CombinedScore.SRTScore of cn to the CombinedScore.SRTScore of each JTT. Next, the algorithm calculates the CombinedScore.EfficientScore of each JTT by using the scoring function of EFFICIENT [7].
Then, it inserts cn and its CombinedScore into CNqueue. Finally, the algorithm inserts each JTT and its CombinedScore into JTTqueue to maintain the JTTs in the order of the CombinedScore. It repeats this process until the CombinedScore of the k-th element in JTTqueue is higher than or equal to the CombinedScore of the top element in CNqueue. In Step 4, the algorithm returns the top-k JTTs in JTTqueue.

5.5 Comparison of the SRT with the Maximal Object

The SRT appears to be similar to the maximal object [12], which has been proposed to unambiguously interpret selection queries on the universal relation [5]. The universal relation is defined as the virtual relation that is obtained by conceptually joining all the relations in the database schema. However, joining all the relations incurs unexpected results, especially when there is a cycle in the schema. Thus, the notion of the maximal object was proposed to solve this problem; the universal relation can be represented as a set of maximal objects. In a maximal object, a selection query can be interpreted unambiguously. However, we observe some cases where a maximal object still includes a cycle. A representative example is a maximal object for the TPC-H database schema. The maximal object for the TPC-H database schema is the whole set of relations in the database schema, i.e., mo: {Region, Nation, Customer, Orders, LineItem, Partsupp, Supplier, Part}. Figure 13 represents the relationships among the relations in mo as a graph, which has a cycle. Here, Nation serves two different roles: 1) a nation that a customer belongs to and 2) a nation that a supplier belongs to. Therefore, we need to break the cycle to interpret a query correctly.

Fig. 13  A maximal object for the TPC-H database.
Fig. 14  SRTs in cn1.

Example 10. Let us consider a top-k keyword query Q: ({"USA", "Cruise", "England", "McKellen"}, 2) on the sample dataset of the TPC-H database shown in Fig. 2.
A user is likely to want the information that a customer Cruise who belongs to USA orders some parts from a supplier McKellen who belongs to England. A CN for Q is cn1: Nation^Q ⋈ Customer^Q ⋈ Orders^{} ⋈ LineItem^{} ⋈ Partsupp^{} ⋈ Supplier^Q ⋈ Nation^Q. As shown in Fig. 14, since the set of relations in cn1 has LJPbyFDs, there is an SRT in cn1. By using the SRT, we get a JTT T1: n1 − c2 − o2 − l2 − ps4 − s4 − n2 from cn1. In contrast, if we use the maximal object directly to interpret Q, we cannot interpret Q as what the user wants. Specifically, the maximal object mo for the TPC-H database schema includes a cycle, as shown in Fig. 13. Because of the cycle, when we interpret Q, an unexpected equality condition on the Nation relation is added, unlike when we interpret Q with the SRT in Fig. 14. The unexpected equality condition is that the nation of the customer should be the same as that of the supplier. This condition excludes T1 from the query results, which is contrary to the user's intention.

6. Performance Evaluation

6.1 Experimental Data and Environment

In this section, we show that SRT-Rank improves the ranking quality of the existing methods [6], [7], [11] that use syntactic or IR-combined scoring by performing experiments on synthetic and real datasets. We use TPC-H [18]

Table 5  The statistics of the experimental datasets.
  dataset   | size  | # relations | # tuples
  TPC-H     | 100MB | 8           | 866,602
  IMDB      | 516MB | 6           | 1,673,074
  Mondial   | 9MB   | 28          | 17,115
  Wikipedia | 550MB | 6           | 206,318

as a synthetic dataset and use the Internet Movie Database (IMDB) [8], Mondial [15], and Wikipedia [23] as real datasets. We generate the TPC-H dataset by using the TPC-H DBGen tool [18]. The three real datasets are the same ones as were used in Coffman et al. [4], where the effectiveness of existing works has been compared and analyzed. Table 5 shows the statistics of each dataset. The TPC-H dataset represents item order information and consists of eight relations. The IMDB dataset represents information on movies and consists of six relations. The Mondial dataset represents geographical information and consists of 28 relations. The size of the Mondial dataset is very small compared with the other datasets, but its schema is the most complex. The Wikipedia dataset is created by selecting 5,540 articles from Wikipedia [23]. It consists of six relations.

We use the following queries for the experiments. For the TPC-H dataset, we use 50 queries created by ten graduate students of the Computer Science Department at KAIST. To minimize bias in the queries, we choose students who are not involved in this work. To create the queries, they use the ER diagram of the TPC-H database as a reference and write down search intentions, which specify the information they want to retrieve, and keyword queries to represent those search intentions. This method is the same as that of Coffman et al. [4]. An example of the search intentions is "Find a supplier whose name is Smith and who belongs to the England nation." The keyword query representing this search intention is {"Smith", "England"}. In the appendix, we show the search intentions for the 50 queries created by the students. For the real datasets, we use the same queries as those used in Coffman et al. [4].
The number of queries for each dataset is 50, and the average number of query keywords for each dataset is designed to be similar to that of a search engine log [4]. We use DISCOVER [6] as a representative syntactic scoring method and use EFFICIENT [7] and SPARK [11] as representative IR-combined scoring methods for the comparison. We choose DISCOVER since it is known to be the most effective syntactic scoring method [6]. We choose EFFICIENT since it is the first IR-combined method, and SPARK since it is known to be the most effective IR-combined method [11]. We implement all the methods in the Java language. We also extend DISCOVER, EFFICIENT, and SPARK to use SRT-Rank and call them DISCOVER SRT-Rank, EFFICIENT SRT-Rank, and SPARK SRT-Rank, respectively. In addition, we extend DISCOVER, EFFICIENT, and SPARK to use MO-Rank, which is a ranking method using the maximal object as a semantic unit, and call them DISCOVER MO-Rank, EFFICIENT MO-Rank, and SPARK MO-Rank, respectively. Hereafter, we call the methods that use neither SRT-Rank nor MO-Rank the base methods, those that apply MO-Rank to the base methods the MO-Rank methods, and those that apply SRT-Rank to the base methods the SRT-Rank methods.

To measure the effectiveness of the methods, we define the relevant answers for a keyword query as the set of JTTs that satisfy the user's search intention for the keyword query [4]. To determine the relevant answers, we use the same approach as Coffman et al. [4] do. That is, for each keyword query, the relevant answers are the results obtained by evaluating the SQL queries corresponding to the user's search intention for the keyword query. We use the following quality measures: 1) the mean normalized discounted cumulative gain at k (MNDCG@k) [9], [22], 2) the number of queries whose top-1 result is relevant to the query (#top1rels) [4], [11], 3) the mean reciprocal rank (MRR) [20], and 4) the mean average precision (MAP) [13].
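For binary relevance, these measures (defined precisely in the following paragraph) can be sketched as below; the function names are ours, not from the paper:

```python
import math

def dcg_at_k(rels, k):
    # rels[i] is 1 if the (i+1)-th result is a relevant answer, else 0;
    # the i-th (1-based) result is discounted by log2(i + 1).
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    # DCG@k normalized by the ideal DCG@k (relevant results first).
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(rels):
    # Inverse rank of the first relevant result, or 0 if none.
    return next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)

def average_precision(rels):
    # Average of the precision values taken at each relevant result.
    hits, total = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0
```

Averaging these per-query values over a query set yields MNDCG@k, MRR, and MAP; #top1rels simply counts the queries whose first result is relevant.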
The cumulative gain at k (CG@k) is defined as the sum of the relevance scores of the results in the top-k results [9]. Here, if a result is a relevant answer, we set its relevance score to 1; otherwise, we set it to 0. The discounted cumulative gain at k (DCG@k) is defined as the sum of the discounted relevance scores of the results in the top-k results [9]. We define the discounted relevance score of the i-th result as its relevance score divided by log2(i + 1). The normalized discounted cumulative gain at k (NDCG@k) for a query is defined as the DCG@k divided by the ideal DCG@k of the top-k results [9]. The ideal DCG@k is defined as the DCG@k of the top-k results when the results are sorted in the descending order of their relevance scores [9]. MNDCG@k for a set of queries is the average of NDCG@k over all queries in the set. The #top1rels is defined as the number of queries whose first result is relevant. The reciprocal rank (RR) for a query is the inverse of the rank at which the first relevant result is retrieved, or 0 if no relevant result is retrieved. The mean reciprocal rank for a set of queries is the average of the reciprocal ranks over all queries in the set. The average precision (AP) for a query is the average of the precision values calculated after each relevant result is retrieved. The mean average precision for a set of queries is the average of the average precisions over all queries in the set.

We also need to consider how to handle JTTs whose scores are the same. Existing works break the ties among such JTTs arbitrarily [1], [7], [11] or do not explain any details [6], [10]. Thus, to calculate the DCG@k (RR, AP) more accurately, we use the following probabilistic approach. First, if there are n JTTs whose scores are the same starting from the l-th rank, we assign to each JTT the same probability 1/n of being ranked (l + i)-th, where 0 ≤ i ≤ n − 1.
Next, we calculate the probabilistic DCG@k (RR, AP) for each possible ranking of the JTTs by multiplying the DCG@k (RR, AP) by the probability of that ranking. Finally, we obtain the final DCG@k (RR, AP) of the JTTs by summing up the probabilistic DCG@k (RR, AP) over all possible rankings. (To keep the ideal DCG@k the same for every ranking method, we calculate the ideal DCG@k as the DCG@k of the top-k results that contain all correct answers.)

We have performed all experiments on a machine with an Intel Core i5 760 2.80GHz CPU and 4GB of RAM running Linux Fedora Core 14. We use MySQL 5.5.19.

6.2 Experimental Results

6.2.1 Effectiveness

Figures 15(a), (b), (c), and (d) show the experimental results of the base methods, the MO-Rank methods, and the SRT-Rank methods for each dataset where k = 10. The results show similar trends for all quality measures. For the TPC-H and Mondial datasets, whose database schemas have cycles, the SRT-Rank methods outperform the MO-Rank methods. For the Wikipedia and IMDB datasets, whose database schemas do not have cycles, the effectiveness of every SRT-Rank method is better than or equal to that of the corresponding MO-Rank method. For the TPC-H and IMDB datasets, the SRT-Rank methods significantly outperform the base methods. Specifically, for the TPC-H dataset, the SRT-Rank methods increase the MNDCG@k by 7.7%, 19.4%, and 7.2% compared with DISCOVER, EFFICIENT, and SPARK, respectively; the #top1rels by 14.8%, 20.0%, and 13.2%; the MRR by 10.4%, 20.6%, and 9.8%; and the MAP by 10.0%, 20.6%, and 9.0%. For the IMDB dataset, the SRT-Rank methods increase the MNDCG@k by 0.1%, 46.9%, and 38.2% compared with DISCOVER, EFFICIENT, and SPARK, respectively; the #top1rels by 0%, 160.0%, and 63.8%; the MRR by 0.1%, 61.7%, and 46.6%; and the MAP by 0%, 63.8%, and 47.2%. For the Mondial and Wikipedia datasets, the SRT-Rank methods show results similar to those of the base methods. Nevertheless, we note that the effectiveness of every SRT-Rank method is better than or equal to that of the corresponding base method.
Specifically, for the Mondial dataset, the SRT-Rank methods increase the MNDCG@k by 0%, 1.2%, and 2.9% compared with DISCOVER, EFFICIENT, and SPARK, respectively; the #top1rels by 0%, 0%, and 3.3%; the MRR by 0%, 0.7%, and 3.0%; and the MAP by 0%, 0.9%, and 3.5%. For the Wikipedia dataset, the SRT-Rank methods increase the MNDCG@k by 0%, 0%, and 12.1% compared with DISCOVER, EFFICIENT, and SPARK, respectively; the #top1rels by 0%, 0%, and 0%; the MRR by 0%, 0%, and 5.5%; and the MAP by 0%, 0%, and 9.5%. We explain below why this phenomenon occurs.

Fig. 15  Experimental results for each dataset where k = 10.

SRT-Rank improves the ranking quality of the query results since it takes the semantic relevancy among relations into account. There are two cases in which SRT-Rank is effective: (1) when the syntactic scoring methods give higher syntactic scores to JTTs whose CN consists of a large number of SRTs than to JTTs whose CN consists of a small number of SRTs; (2) when the IR-combined scoring methods give higher IR scores to JTTs whose CN consists of a large number of SRTs than to JTTs whose CN consists of a small number of SRTs.

The first case occurs when the dataset satisfies the following conditions: (1) There are cycles in the database schema graph. A cycle here is an undirected one, where an edge means that the relations on either side of the edge

can be joined. (2) Tuples of relations in the cycle contain the keywords of the query Q. Conditions (1) and (2) allow two different paths that connect the relations containing tuples in which the query keywords appear. (3) The set of the relations in the shorter path (say, P_short) does not have LJPbyFDs, while the set of the relations in the longer path (say, P_long) has LJPbyFDs. The TPC-H dataset satisfies these three conditions.

Fig. 16 An undirected cycle in the TPC-H database schema graph.
Fig. 17 A cycle in the Mondial database schema.
Fig. 18 The SRTs in CN(T1) and CN(T2) for Q: ({Gladiator, Russell, Crowe}, 2).

Figure 16 shows the TPC-H database schema graph, which has an undirected cycle. If tuples in the relations Customer and Supplier contain the query keywords, we have two different paths P_long and P_short that connect Customer and Supplier. The set of the relations in P_long has LJPbyFDs; the set of the relations in P_short does not have LJPbyFDs. Therefore, as shown in Example 1, the syntactic scoring method ranks JTTs for P_short higher than those for P_long based on the size (i.e., the number of relations) of the JTT. However, the SRT-Rank methods rank JTTs for P_long higher than those for P_short by counting the number of SRTs instead of the number of relations. This effect applies not only to the syntactic scoring method but also to the IR-combined scoring methods. For the Mondial dataset, the first case does not occur since the dataset does not satisfy Condition (3). Although the Mondial database schema graph has cycles, SRT-Rank does not affect the ranking quality of the syntactic scoring methods. Figure 17 shows a cycle in the Mondial database schema graph. Let us consider two paths that connect two different relations in the cycle that contain the query keywords.
We can find all the possible P_short paths in the cycle: (1) Country–City, (2) City–Organization, (3) Organization–IS Member, and (4) IS Member–Country. The set of relations in each of these paths has LJPbyFDs. Therefore, the dataset does not satisfy Condition (3). For the Wikipedia and IMDB datasets, the first case does not occur since there is no cycle in the database schema graph. Therefore, SRT-Rank does not affect the ranking quality of the syntactic scoring methods. The second case occurs when a JTT T2 contains many more query keywords than a JTT T1 and, at the same time, the set of the relations in CN(T2) does not have LJPbyFDs while that in CN(T1) does. Example 2 in Sect. 3 shows this case for the IMDB dataset. For the JTTs T1 and T2 in Example 2, Figures 18 (a) and (b) show the SRTs in CN(T1) and CN(T2), respectively. Since MovieInfo in T2 contains query keywords, T2 definitely contains many more query keywords than T1. However, T1 consists of one SRT while T2 consists of two SRTs. As a result, the IR-combined scoring methods rank T2 higher than T1, while the SRT-Rank methods rank T1 higher than T2, which is closer to the user's intention. The second case occurs for the TPC-H, Mondial, and Wikipedia datasets as well. Overall, the SRT-Rank methods prove to be more effective than the IR-combined scoring methods. Figure 19 shows the MNDCG@k of each method as k is varied from 1 to 100. Experimental results show that the trends among the methods are similar to one another. We present the experimental results of the MNDCG@k for the TPC-H dataset only since the trends are similar for the other measures and datasets. There may be cases where the search intention of the user does not map to the CN consisting of the smallest number of SRTs. In such a case, SRT-Rank ranks the JTTs whose CN consists of the smallest number of SRTs higher than those whose CN maps to the search intention of the user, and consequently, SRT-Rank degrades the ranking quality.
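The preference just described—placing T1 (one SRT) above T2 (two SRTs) even though T2's IR-combined score is higher—can be sketched as a simple re-ranking. This is a simplified lexicographic stand-in, not the paper's actual combined scoring functions, and the JTT summaries and scores are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class JTT:
    """A joined tuple tree, summarized by its CN's SRT count and a base score."""
    name: str
    num_srts: int      # number of SRTs in the JTT's candidate network
    base_score: float  # score from an existing method (syntactic or IR-combined)

def srt_rank(jtts):
    """Rank JTTs primarily by fewer SRTs, breaking ties with the base score.

    Fewer SRTs means fewer Cartesian products in the join path, i.e.,
    stronger semantic relevancy among the joined relations.
    """
    return sorted(jtts, key=lambda t: (t.num_srts, -t.base_score))

# As in Example 2: T2 contains more query keywords (higher IR score), but its
# CN consists of two SRTs, so the SRT-based ranking places T1 first.
t1 = JTT("T1", num_srts=1, base_score=0.4)   # hypothetical scores
t2 = JTT("T2", num_srts=2, base_score=0.9)
assert [t.name for t in srt_rank([t2, t1])] == ["T1", "T2"]
```

A lexicographic key is the simplest way to make the SRT count dominate; the paper instead folds an SRT score into each existing scoring function, which preserves finer-grained trade-offs.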
One query for the TPC-H dataset corresponds to this case. The query is {Smith, Miller} (the actual names are 000000484 and 000002020 since they are synthetic data in the TPC-H dataset), and the search intention of the query is "Find a nation that both a customer whose name is Miller and a supplier whose name is Smith belong to." For this query, the CN that maps to the search intention consists of two SRTs. However, there is a JTT meaning that a customer whose name is Miller orders some parts from a supplier whose name is Smith (which, in fact, could also be a valid user intention). The CN of this JTT consists of only one SRT. Therefore, SRT-Rank ranks the JTT whose CN does not map to the search intention higher than the one whose CN does. However, joining two different SRTs in a CN generates a Cartesian product among tuples having the same join attribute values, since no relation in one SRT can functionally determine all the relations in the other SRT. Since the relevancy among relations becomes weaker as the number of Cartesian products in the join path of the relations increases, it is rational to consider that the user wants query results including as small a number of Cartesian products as possible, i.e., consisting of the smallest number of SRTs. Thus, in most cases in our experiments, we observe that the search intention of the user maps to the CN consisting of the smallest number of SRTs.

Fig. 19 The MRR and MAP for the TPC-H dataset as k is varied.

6.2.2 Efficiency

To compare the efficiency of the SRT-Rank methods, the MO-Rank methods, and the base methods, we measure the running time of each method. We divide the overall running time into 1) the preprocessing time, which creates non-free tuple sets and CNs, and 2) the result generation time, which evaluates the CNs and generates the results. The preprocessing time is the same among the SRT-Rank methods, MO-Rank methods, and base methods. Therefore, we show only the result generation time. Figure 20 shows the result generation time of the SRT-Rank methods, MO-Rank methods, and base methods for the TPC-H, Mondial, Wikipedia, and IMDB datasets when k is 10.

Fig. 20 The result generation time for each dataset.

Experimental results show that the query performance of an SRT-Rank method is comparable to or better than that of the corresponding base method. Specifically, the SRT-Rank methods are 0.99–1.41 times more efficient than the base methods for the TPC-H dataset; 1.09–1.24 times for the Mondial dataset; 0.97–1.77 times for the Wikipedia dataset; and 1.04–1.29 times for the IMDB dataset. In addition, experimental results show that the query performance of an MO-Rank method is comparable to or better than that of the corresponding base method. Specifically, the MO-Rank methods are 1.14–1.38 times more efficient than the base methods for the TPC-H dataset; 1.03–2.08 times for the Mondial dataset; 0.77–1.41 times for the Wikipedia dataset; and 1.07–1.28 times for the IMDB dataset. This improvement in efficiency comes from the fact that the SRT-Rank (MO-Rank) methods can prune the search space more efficiently by using the SRT score (the MO score that maps to the SRT score).

7. Related Work

Existing works on top-k keyword query processing in relational databases can be classified into two categories: 1) the schema-based approach and 2) the schema-free approach [24]. The schema-based approach, such as DBXplorer [1] and DISCOVER [6], uses the syntactic score as a scoring function. DISCOVER [6] improves the effectiveness of DBXplorer by solving the problem that DBXplorer cannot create JTTs containing different tuples in the same relation. EFFICIENT [7], EFFECTIVE [10], and SPARK [11] use the IR-combined score as a scoring function. EFFICIENT [7] treats each tuple in a JTT as a document and calculates the IR score of the JTT by summing the IR scores of all the tuples in the JTT. EFFECTIVE [10] improves the effectiveness of EFFICIENT by normalizing the IR score according to its distribution over tuples. SPARK [11] improves the effectiveness of EFFICIENT by treating a JTT as a document and calculating the IR score of the JTT accordingly. There have been many research efforts based on the schema-free approach. The schema-free approach calculates the scores of the JTTs at the instance level by modeling the tuples and their relationships as a graph. Thus, it retrieves the subtrees or subgraphs that are the most relevant to

a keyword query, considering the weights of the nodes and edges. BANKS [2] is the first research work based on the schema-free approach. CI-Rank [25] observes the problem of BANKS that it considers only the root node and the leaf nodes (i.e., nodes containing the query keywords). It shows that intermediate nodes can also affect the effectiveness of the JTTs and proposes a new ranking method that takes them into account. Community [16] identifies the problem that a tree cannot express all relationships among the tuples that contain the query keywords. To solve this problem, it proposes a new method that finds subgraphs connecting tuples that contain the query keywords. These methods take the relationships among tuples into account based on the schema-free approach. However, they are different from our approach in the sense that the schema-free approach needs to model the relational data as a graph at the instance level. Applying the concept of the SRT to the schema-free approach is possible; we leave it as a further study. AutoJoin [14] uses the concept of the maximal object for efficient keyword query processing in relational databases. Using the maximal object, it efficiently creates SQL queries corresponding to the given keyword query. AutoJoin also breaks the cycles of the maximal object as the SRT does. However, the SRT has the following properties distinguishing it from AutoJoin. 1) The purpose of the SRT is different from that of AutoJoin: the SRT is proposed to improve the effectiveness of ranking methods, while AutoJoin is proposed to improve the efficiency of query processing. 2) The SRT defines a semantic structure on the CN, which can be mapped to a query graph, while AutoJoin defines the maximal object on the database schema graph. The query graph can distinguish relations having the same name but different roles, while the database schema graph cannot.
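This distinction corresponds to aliasing in the generated SQL: a schema graph has one node per relation, whereas a query graph has one node per relation occurrence, so two occurrences of the same relation keep their separate roles. A hedged illustration follows; the relation names are from TPC-H, but the `#`-suffixed occurrence notation and the aliasing scheme are hypothetical, not the paper's notation:

```python
def aliased_from_clause(query_graph_nodes):
    """Render each query-graph occurrence as its own aliased relation in SQL.

    Two occurrences of the same relation (e.g., Nation#1 and Nation#2)
    receive distinct aliases, so their different roles stay distinguishable.
    """
    parts = []
    seen = {}
    for node in query_graph_nodes:
        rel = node.split("#")[0]          # strip the occurrence marker
        seen[rel] = seen.get(rel, 0) + 1
        alias = f"{rel.lower()}{seen[rel]}"
        parts.append(f"{rel} AS {alias}")
    return "FROM " + ", ".join(parts)

# Like cn1 in Example 10: Customer and Supplier join to two Nation roles.
nodes = ["Customer", "Nation#1", "Supplier", "Nation#2"]
print(aliased_from_clause(nodes))
# FROM Customer AS customer1, Nation AS nation1, Supplier AS supplier1, Nation AS nation2
```

A schema-level structure such as the maximal object would collapse both occurrences into the single Nation node and lose the role distinction.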
Therefore, an SRT can distinguish relations having the same name but different roles, while a maximal object cannot. In Example 10, the first (i.e., left) Nation serves as the nation to which a customer belongs; the second (i.e., right) Nation as the nation to which a supplier belongs. Thus, we need to distinguish the two Nation occurrences as different relations in cn1.

8. Conclusions

We have proposed the SRT to improve the ranking quality of top-k keyword query results in relational databases. An SRT is a tree that maximally connects the strongly related relations in the CN. Next, we have proposed the SRT score, which reflects the concept of the SRT, as a scoring function. The SRT score is the first ranking method that applies the semantic relevancy among relations to ranking the results of keyword queries. Finally, we have proposed SRT-Rank, which ranks the JTTs by using new scoring functions that combine those of existing works with the SRT score. To show the effectiveness of SRT-Rank, we have performed experiments on one synthetic and three real datasets. Experimental results show that SRT-Rank significantly improves the ranking quality. Specifically, for the synthetic dataset, SRT-Rank improves the MNDCG@k by up to 19.4%, the #top1rels by up to 20.0%, the MRR by up to 20.6%, and the MAP by up to 20.6%; for the real datasets, SRT-Rank improves the MNDCG@k by up to 46.9%, the #top1rels by up to 160.0%, the MRR by up to 61.7%, and the MAP by up to 63.8%. We have also shown that the query performance of SRT-Rank is comparable to or better than those of existing methods. SRT-Rank is the first research work that utilizes the concept of the lossless join property to rank top-k keyword query results effectively. SRT-Rank is applicable to arbitrary existing scoring functions since the SRT score is a ranking factor independent of the existing ones.
As future work, based on a rigorous comparison between top-k keyword query processing in relational databases and selection query processing in universal relations, we plan to formally and theoretically analyze the relationship between the maximal object and the SRT.

Acknowledgements

We would like to thank the authors of SPARK [11], Yi Luo, Xuemin Lin, Wei Wang, and Xiaofang Zhou, for allowing us to use their SPARK and EFFICIENT source codes. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIP) (No. 2012R1A2A1A05026326).

References

[1] S. Agrawal, S. Chaudhuri, and G. Das, "DBXplorer: A system for keyword-based search over relational databases," Proc. IEEE Int'l Conf. on Data Engineering (ICDE), pp.5-16, April 2002.
[2] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, "Keyword searching and browsing in databases using BANKS," Proc. IEEE Int'l Conf. on Data Engineering (ICDE), pp.431-440, April 2002.
[3] Y. Chen, W. Wang, Z. Liu, and X. Lin, "Keyword search on structured and semi-structured data," Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp.1005-1010, June 2009.
[4] J. Coffman and A.C. Weaver, "A framework for evaluating database keyword search strategies," Proc. ACM Int'l Conf. on Information and Knowledge Management (CIKM), pp.729-738, Oct. 2010.
[5] R. Fagin, A. Mendelzon, and J.D. Ullman, "A simplified universal relation assumption and its properties," ACM Trans. Database Systems (TODS), vol.7, no.3, pp.343-360, Sept. 1982.
[6] V. Hristidis and Y. Papakonstantinou, "DISCOVER: Keyword search in relational databases," Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp.670-681, Aug. 2002.
[7] V. Hristidis, L. Gravano, and Y. Papakonstantinou, "Efficient IR-style keyword search over relational databases," Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp.850-861, Sept. 2003.
[8] http://www.imdb.com, accessed Feb. 5, 2014.
[9] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. Information Systems (TOIS), vol.20, no.4, pp.422-446, Oct. 2002.
[10] F. Liu, C. Yu, W. Meng, and A. Chowdhury, "Effective keyword search in relational databases," Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp.563-574, June 2006.
[11] Y. Luo, X. Lin, W. Wang, and X. Zhou, "SPARK: Top-k keyword query in relational databases," Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp.115-126, June 2007.
[12] D. Maier and J.D. Ullman, "Maximal objects and the semantics of universal relation databases," ACM Trans. Database Systems (TODS), vol.8, no.1, pp.1-14, March 1983.
[13] C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, 1st ed., Cambridge University Press, 2008.
[14] T. Mason, L. Wang, and R. Lawrence, "AutoJoin: Providing freedom from specifying joins," Proc. Int'l Conf. on Enterprise Information Systems (ICEIS), vol.5, pp.31-38, May 2005.
[15] http://www.dbis.informatik.uni-goettingen.de/mondial, accessed Feb. 5, 2014.
[16] L. Qin, J.X. Yu, L. Chang, and Y. Tao, "Querying communities in relational databases," Proc. IEEE Int'l Conf. on Data Engineering (ICDE), pp.724-735, April 2009.
[17] S. Raymond and D. O'Shea, Programming for Mathematicians, Springer-Verlag, 2000.
[18] http://www.tpc.org/tpch, accessed Feb. 5, 2014.
[19] J.D. Ullman, Principles of Database and Knowledge-Base Systems, vol.1, Computer Science Press, 1988.
[20] E.M. Voorhees, "TREC-8 question answering track report," Proc. Text Retrieval Conference (TREC), pp.77-82, Nov. 1999.
[21] G. Weikum, "DB&IR: Both sides now," Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp.25-30, June 2007.
[22] M. Weimer, A. Karatzoglou, Q.V. Le, and A. Smola, "CofiRank: Maximum margin matrix factorization for collaborative ranking," Advances in Neural Information Processing Systems 20 (NIPS 2007), pp.1593-1600, Dec. 2007.
[23] http://www.wikipedia.org, accessed Feb. 5, 2014.
[24] J.X. Yu, L. Qin, and L. Chang, "Keyword search in relational databases: A survey," IEEE Data Engineering Bulletin, vol.33, no.1, pp.67-78, March 2010.
[25] X. Yu and H. Shi, "CI-Rank: Ranking keyword search results based on collective importance," Proc. IEEE Int'l Conf.
on Data Engineering (ICDE), pp.78-89, April 2012.

Appendix: 50 search intentions for the TPC-H dataset

Q1: Find a supplier whose name is 000000069 and who supplies the navy drab cornsilk smoke violet part.
Q2: Find a supplier whose name is 000000813 and who supplies the powder firebrick white peach goldenrod part.
Q3: Find a supplier whose name is 000000806 and who supplies the moccasin tan chartreuse cornsilk drab part.
Q4: Find a supplier whose name is 000000681 and who supplies the seashell maroon lace burnished lavender part.
Q5: Find a supplier whose name is 000000997 and who supplies the orchid bisque antique ivory lavender part.
Q6: Find a customer whose name is 000003691 and who orders some parts from a supplier whose name is 000000067.
Q7: Find a customer whose name is 000007801 and who orders some parts from a supplier whose name is 000000138.
Q8: Find a nation whose name is INDIA and that is in the ASIA region.
Q9: Find a nation whose name is RUSSIA and that is in the EUROPE region.
Q10: Find a supplier whose name is 000000033 and who belongs to the EUROPE region.
Q11: Find a customer whose name is 000011675 and who belongs to the CANADA nation in the AMERICA region.
Q12: Find the lemon floral azure frosted lime part ordered by customers who belong to the ASIA region.
Q13: Find a customer whose name is 000000029 and who orders the grey green purple sienna peru part and belongs to the ALGERIA nation in the AFRICA region.
Q14: Find a supplier who supplies both the antique goldenrod floral forest slate and hot medium spring orange violet parts.
Q15: Find a customer whose name is 000003691 and who orders the powder wheat midnight mint salmon part and belongs to the IRAN nation.
Q16: Find a supplier whose name is 000000768 and who belongs to the RUSSIA nation.
Q17: Find the coral lavender seashell rosy burlywood part supplied by suppliers who belong to the ALGERIA nation.
Q18: Find the coral grey snow violet hot part supplied by suppliers who belong to the BRAZIL nation.
Q19: Find the rosy saddle dark peach drab part supplied by suppliers who belong to the MOROCCO nation.
Q20: Find a supplier whose name is 000000033 and who supplies the sandy wheat coral spring burnished part and belongs to the GERMANY nation.
Q21: Find a customer whose name is 000001369 and who belongs to the IRAN nation.
Q22: Find a customer whose name is 000009703 and who belongs to the ALGERIA nation.
Q23: Find a customer whose name is 000010439 and who belongs to the BRAZIL nation.
Q24: Find a customer whose name is 000012191 and who belongs to the INDONESIA nation.
Q25: Find a supplier whose name is 000000561 and who supplies parts to customers who belong to the KENYA nation.
Q26: Find a supplier whose name is 000000228 and who supplies parts to customers who belong to the GERMANY nation.
Q27: Find a customer whose name is 000000020 and who belongs to the RUSSIA nation and orders some parts from a supplier whose name is 000000883.
Q28: Find a customer whose name is 000000005 and who belongs to the CANADA nation and orders some parts from a supplier whose name is 000000708 and belongs to the SAUDI ARABIA

nation.
Q29: Find a customer whose name is 000008449 and who belongs to the JAPAN nation and orders some parts from a supplier whose name is 000000103 and belongs to the BRAZIL nation.
Q30: Find a customer whose name is 000008177 and who belongs to the CHINA nation and orders some parts from a supplier whose name is 000000983 and belongs to the ALGERIA nation.
Q31: Find a customer whose name is 000004450 and who belongs to the INDIA nation and orders some parts from a supplier who belongs to the INDONESIA nation.
Q32: Find a customer whose name is 000006101 and who belongs to the ARGENTINA nation and orders some parts from a supplier who belongs to the CHINA nation.
Q33: Find the almond hot lemon honeydew blush part that is ordered by customers who belong to the IRAQ nation.
Q34: Find the moccasin drab olive metallic papaya part that is ordered by customers who belong to the UNITED KINGDOM nation.
Q35: Find the red peach seashell honeydew burlywood part that is ordered by customers who belong to the FRANCE nation.
Q36: Find a customer whose name is 000000028 and who orders the blanched khaki azure misty orang part and belongs to the INDIA nation.
Q37: Find a customer whose name is 000000004 and who orders the tan blush blue almond navy part and belongs to the EGYPT nation.
Q38: Find a customer whose name is 000006101 and who orders some parts from a supplier whose name is 000000096 and belongs to the JAPAN nation.
Q39: Find a customer whose name is 000012332 and who orders some parts from a supplier whose name is 000000658 and belongs to the RUSSIA nation.
Q40: Find a customer whose name is 000003691 and who orders the cyan white honeydew blue hot part from a supplier whose name is 000000067.
Q41: Find a customer whose name is 000007801 and who orders the cornflower tomato linen powder cyan part from a supplier whose name is 000000138.
Q42: Find a customer whose name is 000000569 and who orders the yellow deep bisque mint orchid part.
Q43: Find a customer whose name is 000001570 and who orders the rose light peach grey lemon part.
Q44: Find a customer whose name is 000002569 and who orders the navajo moccasin orange papaya seashell part.
Q45: Find a nation that both a customer whose name is 000002020 and a supplier whose name is 000000484 belong to.
Q46: Find the CANADA nation that both a customer whose name is 000008415 and a customer whose name is 000011675 belong to.
Q47: Find a nation that both a customer whose name is 000004043 and a customer whose name is 000000365 belong to.
Q48: Find a part that both a supplier whose name is 000000313 and a supplier whose name is 000000062 supply.
Q49: Find a part that both a supplier whose name is 000000125 and a supplier whose name is 000000624 supply.
Q50: Find a part that both a supplier whose name is 000000866 and a supplier whose name is 000000615 supply.

In-Joong Kim received the B.S. in computer engineering from Hongik University in 2004 and the M.S. in computer science from Korea Advanced Institute of Science and Technology (KAIST) in 2006. Currently, he is a Ph.D. candidate in the Department of Computer Science at KAIST. His research interests include information retrieval, search engines, and big data analytics.

Kyu-Young Whang earned his Ph.D. from Stanford University in 1984. From 1983 to 1991, he was a Research Staff Member at the IBM T.J. Watson Research Center. In 1990, he joined KAIST, where he currently is a Distinguished Professor in the Department of Computer Science. His research interests encompass database systems/storage systems, search engines, object-oriented databases, geographic information systems, and data mining. He was the general chair of VLDB 2006. He served as an Editor-in-Chief of the VLDB Journal from 2003 to 2009. He is a Fellow of ACM and IEEE. He is currently the chair of IEEE TCDE.

Hyuk-Yoon Kwon received the B.S. in computer science and statistics from the University of Seoul (UOS) in 2005, the M.S.
in computer science from Korea Advanced Institute of Science and Technology (KAIST) in 2007, and the Ph.D. in computer science from KAIST in 2013. He is currently a postdoctoral fellow of computer science at KAIST. His research interests include GIS, information retrieval, and big data analytics.