Path Expression Processing in Korean Natural Language Query Interface for Object-Oriented Databases

Jinseok Chae and Sukho Lee
Department of Computer Engineering, Seoul National University, San 56-1, Shinrim-Dong, Kwanak-Ku, Seoul, 151-742, Korea
E-mail: {wahr, shlee}@ce2.snu.ac.kr

This paper was supported in part by NON DIRECTED RESEARCH FUND, Korea Research Foundation.

Abstract

A natural language query interface for databases offers user-friendliness by letting users retrieve the desired information with queries written in their native language. Many natural language query interfaces have been developed for conventional databases. Natural language query interfaces for object-oriented databases, which have recently started to emerge as the next-generation databases, are a new research area. This paper describes a technique for processing natural language representations of path expressions. Because the path expression is one of the key features of the object-oriented data model, a frame-based decomposition method is proposed for efficient processing.

1 Introduction

The objective of natural language interfaces is to take inputs in human language and extract from them something that is meaningful to a computer. A natural language query interface to a database system gives end users a way to formulate queries in their native language. This is particularly useful because computer-naive users frequently need to access database systems.

INTELLECT[1] from AI Corporation is a commercially available natural language information system. It makes the computer understand everyday English and is designed to be a domain-independent system for relational databases. KDA[2] integrates a natural language query system with a skeleton-based query guiding facility. When a
user works with the KDA natural language query system, the query guiding facility supplies several kinds of skeletons to guide the user in performing database retrieval tasks. It generates SQL database queries from English natural language queries.

The NHI[3] and K-NLQ[4] systems were developed as Korean natural language query interfaces for relational databases. NHI and K-NLQ accept Korean natural language queries and transform them into QUEL and SQL, respectively. Kim et al.[5] propose a Korean Natural Language Query System that also transforms Korean queries into SQL.

Recently, object-oriented databases (OODB) have started to emerge as the next-generation databases that can model the complicated real world. The field of natural language query interfaces for object-oriented databases has therefore become a new research area. For object-oriented databases, KID[6] has been proposed. This interface transforms Korean queries into the query graphs used in the object-oriented data model.

There are important differences between the object-oriented data model and the relational data model. The object-oriented data model includes the object-oriented concepts of encapsulation, inheritance, path expressions, and arbitrary data types; these concepts are not part of the conventional data model. Among these, the path expression is one of the key features of the object-oriented data model and is used to retrieve the desired data by navigating the class-attribute hierarchy.

This paper describes the path expression processing technique used in the Korean Interface for Databases (KID) system. KID employs a frame-based decomposition method to process the natural language representations of path expressions. In this paper, KID is upgraded to generate OQL (Object Query Language), proposed in ODMG-93[7], instead of query
graphs, the format of a query frame is modified, and more basic patterns are identified.

The remainder of this paper is organized as follows. The overview of KID, the schema of a sample database, the basic patterns, and an extended query frame are explained in Section 2. Section 3 describes the technique for processing natural language representations of path expressions. Section 4 shows experimental results consisting of a number of examples that generate OQL from Korean natural language queries. Finally, the conclusion is given in Section 5.

2 Overview of KID

2.1 System Architecture

KID consists of three modules: the natural language analyzer, the semantic interpreter and the OQL generator. The natural language analyzer accepts Korean queries and generates appropriate parse trees. The semantic interpreter decomposes the parsing results into query phrases by referring to the database dictionaries and builds a query frame for each query phrase. The OQL generator then integrates these query frames and produces OQL. The block structure of KID is shown in Figure 1. In the figure, rectangles indicate modules and arrows the flow of processing.

[Figure 1: System architecture. Korean natural language queries -> Natural Language Analyzer -> predicate argument structures -> Semantic Interpreter -> query frames -> OQL Generator -> OQL]

[Figure 2: Format of a query frame, with slots including Frame Name and Parents Frame]

Natural language analyzer: This module performs morphological analysis[8] and parsing to create internal representations, such as parse trees, from Korean queries. The parsing mechanism uses a variation of the CYK algorithm[9]. The KID system employs a general natural language analyzer used in Korean-English machine translation[10]. The natural language analyzer generates two structures: a tree structure and a predicate argument structure[11]. Of these, the semantic interpreter accepts the predicate argument structure.
Semantic interpreter: This module decomposes the predicate argument structures into query phrases (QPs) and builds query frames (QFs). It uses two database dictionaries: the schema dictionary and the domain dictionary [6]. The schema dictionary specifies schema-related information, and the domain dictionary is used to determine the domain of unknown terms that are semantically ambiguous.

OQL generator: This module integrates all query frames and generates OQL.

A query frame is designed to hold information about the class-attribute hierarchy, such as classes, attributes, relationships, values, and operators. The format of a query frame is shown in Figure 2. Compared with the format in [6], two slots are added. One specifies the indicated class (or one of its subclasses) used to go down the class-attribute hierarchy. The other applies an aggregation function to the corresponding attribute.

2.2 Class-Attribute Hierarchy

Figure 3 shows the sample class-attribute hierarchy used in this paper. It consists of six classes, and `*' indicates multi-valued attributes. In this class-attribute hierarchy, classes have reference attributes, which represent the attribute-domain relationship.

[Figure 3: Class-attribute hierarchy. Classes shown include Department, Address, University and Professor; attributes shown include dept, snum (Integer), height (Integer), residence, enrolls*, univ, location (String), zipcode (Integer), country (String), city (String), teacher, credit (Integer), president (String) and major (String)]

2.3 Basic Patterns

A Korean query can be decomposed into a number of QPs, and each QP is one of the identified basic patterns. By analyzing sample queries, we identify seven basic patterns: head phrase I (HP1), head phrase II (HP2), noun modifier phrase (NMP), verb modifier phrase (VMP), adverb modifier phrase (AMP), verb phrase (VP), and comparative phrase (CP). Among these, NMP, VP and CP were identified previously [6]; the other phrases are newly identified in this paper. The predicate argument structures of each basic pattern are as follows:

HP1: HEAD → [HT1 | HT2 | HT4]
       └ [SUB | OBJ] → NOUN Noun

HP2: HEAD → HT3
       └ (MOD → ADV …)
       └ MOD → NOUN …

NMP: QP-HEAD → Qp-head
       └ MOD → NOUN Noun

VMP: QP-HEAD → Qp-head
       └ MOD → VERB Verb

AMP: QP-HEAD → Qp-head
       └ MOD → ADV Adverb

VP:  QP-HEAD → Qp-head
       └ [MOD | VCON] → VERB Verb
           └ [MOD | OBJ | SUB | NCON] → NOUN Noun

CP:  QP-HEAD → Qp-head
       └ [MOD | VCON] → VERB Verb
           └ SUB → NOUN Noun1
           └ MOD → NOUN Noun2

HEAD represents the head word of a sentence. It is classified into four types: HT1, HT2, HT3 and HT4. QP-HEAD indicates the head word of a QP. MOD denotes modifiers, SUB subject, OBJ object, VCON verb conjunction, NCON noun conjunction and ADV adverb. The sign `|' denotes `OR'.

The Korean words corresponding to each head type are as follows (given here by their English glosses):

HT1: show, retrieve, output, list
HT2: who, what, how, where, when
HT3: what number of
HT4: is there, is not there

2.4 Definition of Korean Queries

A Korean query (KQ) consists of a head phrase (HP) and a main query (MQ). The MQ is classified into two kinds: simple query (SQ) and composite query (CQ). An SQ is a simple concatenation of several QPs without any conjunction (e.g., `and', `or' or `among'), whereas a CQ contains such conjunctions.
If a query contains either of the Korean words meaning `among', it is given an `AMONG' indicator. The definition of Korean queries is as follows:

KQ   ::= HP MQ
MQ   ::= SQ | CQ
SQ   ::= QP_1 QP_2 ... QP_n (n ≥ 1)
CQ   ::= SQ_1 conj SQ_2 conj ... conj SQ_m (m ≥ 2)
HP   ::= HP1 | HP2
QP   ::= NMP | VMP | AMP | VP | CP
conj ::= AND | OR | AMONG

3 Path Expression Processing

Path queries have been studied extensively by database researchers over the past decade. A path query is a query written against nested data that specifies search conditions on that nested data. Instead of just an attribute, a path query contains a sequence of attribute names called a path expression. For example, a type may have an attribute named `dept'; the domain of `dept' may be a Department type; and the Department type may in turn have an attribute holding the department's name. It should then be possible to issue a single query saying "find all students whose departments are named `Computer Eng.'." The WHERE clause of such a query may contain a predicate that follows the path through `dept' and compares the department's name with `Computer Eng.'. Formally, a path expression is of the form
$$\mathit{sel}.\mathit{AttrEx}_1.\;\cdots\;.\mathit{AttrEx}_m$$

where $\mathit{sel}$ is the target class and $\mathit{AttrEx}_i$ ($1 \le i \le m$) are the reference attributes. The above path expression can be decomposed into sub-path expressions, each of which indicates one reference relationship. The decomposed sub-path expressions are shown below.

$$\begin{array}{l}
\mathit{sel}.\mathit{AttrEx}_1 \;(\text{to } \mathit{Class}_2) \\
\mathit{Class}_2.\mathit{AttrEx}_2 \;(\text{to } \mathit{Class}_3) \\
\quad\vdots \\
\mathit{Class}_{m-1}.\mathit{AttrEx}_{m-1} \;(\text{to } \mathit{Class}_m)
\end{array}$$

Therefore, the semantic interpreter can decompose the natural language representation of a path expression into QPs, one per sub-path expression. For example, a path expression sel.dept.univ, where sel is the class that owns `dept', can be decomposed into QP_1 and QP_2:

  sel.dept          (QP_1)
  Department.univ   (QP_2)

4 Experiments

Q1 is an example of the transformation process from a Korean query to OQL.

Q1: (Retrieve students who are taller than 165 and enroll in "Database" which professor "G. D. Hong" teaches.)

Predicate argument structure:

HEAD → VERB retrieve
  └ OBJ → NOUN students
      └ MOD → VERB be taller
          └ SUB → NOUN height
          └ MOD → NOUN than 165
      └ VCON → VERB and enroll
          └ MOD → VERB teach
              └ MOD → NOUN professor
                  └ MOD → NOUN "G. D. Hong"

Decomposition process: The decomposition process employs the DFS (Depth First Search) algorithm. Figure 4 illustrates the process, in which QP_1 is a CP, QP_2 and QP_3 are VPs, and QP_4 is an NMP. The nodes of the tree structure in Figure 4 denote words in the question, and the numbers above the nodes give the order in which DFS visits them.

[Figure 4: Decomposition process. Tree of the words of Q1, numbered 1 to 10 in DFS visiting order and grouped into QP_1 to QP_4]

Decomposed QPs:

QP_1 (CP):  QP-HEAD → NOUN students
              └ MOD → VERB be taller
                  └ SUB → NOUN height
                  └ MOD → NOUN than 165

QP_2 (VP):  QP-HEAD → NOUN students
              └ VCON → VERB and enroll

QP_3 (VP):  QP-HEAD → NOUN "Database"
              └ MOD → VERB teach
                  └ MOD → NOUN professor

QP_4 (NMP): QP-HEAD → NOUN professor
              └ MOD → NOUN "G. D. Hong"

Query frames: Figure 5 shows the query frames for Q1.
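As an illustration, the decomposition of a path expression into sub-path expressions described in Section 3 might be sketched as follows. The sketch is written in Python rather than the C of the prototype, and it is not the KID implementation: the layout of the schema dictionary, the function name decompose, and the class names Student and Course (the names of the classes that own `dept' and `teacher' are not preserved in the text) are assumptions made only for illustration. The domain classes of `univ', `enrolls' and `teacher' are likewise read off the sample hierarchy as assumptions.

```python
# Hypothetical schema dictionary: (class, reference attribute) -> domain class.
# "Student" and "Course" are assumed names, not taken from the paper's text.
SCHEMA = {
    ("Student", "dept"): "Department",
    ("Student", "enrolls"): "Course",
    ("Department", "univ"): "University",
    ("Course", "teacher"): "Professor",
}


def decompose(target, attrs):
    """Split target.a1.a2...ak into (class, attribute, domain class) steps.

    Each step corresponds to one sub-path expression, i.e. one reference
    relationship that a single query phrase would describe.
    """
    steps = []
    current = target
    for attr in attrs:
        key = (current, attr)
        if key not in SCHEMA:
            raise KeyError(f"{current} has no reference attribute {attr!r}")
        domain = SCHEMA[key]
        steps.append((current, attr, domain))
        current = domain  # the next attribute is looked up in the domain class
    return steps


if __name__ == "__main__":
    # The dept.univ example of Section 3, rooted at the assumed Student class.
    for cls, attr, domain in decompose("Student", ["dept", "univ"]):
        print(f"{cls}.{attr} (to {domain})")
```

Under the same assumptions, decompose("Student", ["enrolls", "teacher"]) yields the linear chain of three classes that Q1 navigates.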
Three classes are involved in Q1, and these classes are linearly connected by attribute-domain links; i.e., the Professor class is referenced (through `teacher') by the class in the domain of `enrolls', and that class is referenced in turn (through `enrolls') by the class of students.

OQL:

  select x
  from x in , y in x.enrolls
  where x.height > 165 and y. "Database" and y.teacher. "G. D. Hong"

Q2 shows another example in which three classes are involved, but in a different way from Q1; i.e., the Department class is referenced (through `dept') by the class of students, and the class in the domain of `enrolls' is also referenced by the class of students.

Q2: (Retrieve students who belong to "Computer Eng." and enroll in "Database".)
[Figure 5: Query frames for Q1]

[Figure 6: Query frames for Q2]

Predicate argument structure:

HEAD → VERB retrieve
  └ OBJ → NOUN students
      └ MOD → VERB enrolls
      └ VCON → VERB and belong to
          └ MOD → NOUN "Computer Eng."

Decomposed QPs:

QP_1 (VP): QP-HEAD → NOUN students
             └ MOD → VERB enrolls

QP_2 (VP): QP-HEAD → NOUN students
             └ VCON → VERB and belong to
                 └ MOD → NOUN "Computer Eng."

Query frames: Figure 6 shows the query frames for Q2.

OQL:

  select x
  from x in , y in x.enrolls
  where x.dept. "Computer Eng." and y. "Database"

Q3 shows an example in which only one class is involved.

Q3: (What does professor "G. D. Hong" major in?)

OQL:

  select x.major
  from x in Professor
  where x. "G. D. Hong"

The experiments show that about 70% of the sample questions are interpreted correctly when the questions are written by people who have knowledge of databases and of the schema.

5 Conclusion

In this paper, we present a path expression processing technique for transforming Korean natural language queries into OQL. Because path expression processing is one of the important issues in object-oriented query processing, we propose a frame-based decomposition approach for handling the natural language representations of path expressions. Since a path expression can be decomposed into sub-path expressions, Korean queries can be decomposed into query phrases. The decomposed query phrases are transformed into query frames. Finally, all query frames are integrated and OQL is generated.

Improving the performance of a natural language interface requires collecting a large number of sample queries. In the future, we will concentrate on upgrading the processing capability of the system by collecting and experimenting with a large number of sample queries.
A prototype system of KID is implemented on a SUN SPARCstation using the C language.

References

[1] L. R. Harris, "Experience with INTELLECT: Artificial intelligence technology transfer," The AI Magazine, Vol. 5, No. 2, 1984, pp. 43-50.
[2] X. Wu and T. Ichikawa, "KDA: A Knowledge-Based Database Assistant with a Query Guiding Facility," Trans. on Knowledge and Data Engineering, Vol. 4, No. 5, 1992, pp. 443-453.
[3] S. Kim, "The Design and Implementation of Interface for Processing Natural Hangul Query," (in Korean) Journal of the Korean Information Science Society, Vol. 12, No. 1, 1985, pp. 31-44.
[4] J. Chae, S. Kim, and S. Lee, "Design and Implementation of a Natural Language DB Query System," (in Korean) Journal of the Korean Information Science Society, Vol. 20, No. 6, 1993, pp. 810-820.
[5] J. M. Kim, M. Y. Hyun, and S. J. Lee, "Korean Natural Language Query System for Searching Database," (in Korean) Proc. of the 21st KISS Fall Conference, Oct. 1994, pp. 637-640.
[6] J. Chae and S. Lee, "Natural Language Query Processing in Korean Interface for Object-Oriented Databases," Proc. of the First International Workshop on Applications of Natural Language to Data Bases, June 1995.
[7] R. G. G. Cattell, The Object Database Standard: ODMG-93, Morgan Kaufmann Publishers, 1993.
[8] S. S. Kang and Y. T. Kim, "Syllable-based Model for the Korean Morphology," Proc. of COLING 94, 1994, pp. 221-226.
[9] J. Yang and Y. T. Kim, "Korean Analysis using Multiple Knowledge Sources," (in Korean) Journal of the Korean Information Science Society, Vol. 21, No. 7, 1994, pp. 1324-1332.
[10] H. G. Lee and Y. T. Kim, "Korean-English Machine Translation based on Idiom Recognition," Proc. of IEEE Region 10 Conference (TENCON '93), 1993.
[11] J. Allen, Natural Language Understanding, Benjamin/Cummings Co. Ltd., 1988.