Construction of Knowledge Base for Automatic Indexing and Classification Based on Chinese Library Classification

Han-qing Hou, Chun-xiang Xue
School of Information Science & Technology, Nanjing Agricultural University, China
hqhou@njau.edu.cn

Abstract

Class numbers, descriptors and keywords are three kinds of subject concept identifiers, and conceptual mapping relationships (i.e. compatibility) exist among them. Following this principle, we construct a CLC Knowledge Base on the basis of the Chinese Library Classification for automatic indexing and classification. We compare it with the CLC system to show its advantages for automatic information processing and concept searching. We then describe at length some key technologies used in its construction and briefly describe their application to automatic indexing, automatic classification and concept searching.

Keywords: automatic indexing, automatic classification, knowledge base, knowledge organization system, Chinese Library Classification.

1. Introduction

Knowledge organization systems (KOS) are the semantic tools used to describe and interpret human knowledge and its relationships, such as library classifications, lists of subject headings, thesauri, semantic networks, maps of subject domains and ontologies. Library classifications, lists of subject headings and thesauri have played an important part in organizing traditional information resources, while semantic networks, subject maps and ontologies are designed for the Semantic Web. Some KOS in current use were constructed by improving traditional classifications or thesauri, inheriting and making use of an established knowledge system and an abundant vocabulary. These systems have some of the features and functions of semantic networks and ontologies, which can improve knowledge processing and the efficiency of information retrieval.
The knowledge base discussed in this article, called the CLC Knowledge Base, is also a KOS: an expert system for knowledge organization based on the Chinese Library Classification (hereinafter CLC). It uses statistical methods to reveal the concept mapping relationships among class numbers, descriptors and keywords in manually indexed records, and can therefore be used for automatic indexing and classification and for concept searching.

2. Principles of Construction of the CLC Knowledge Base

Classification schemes, thesauri and natural language are three different kinds of information language with different symbols and organizational approaches. In essence, however, they are the same: class numbers, descriptors and keywords can all be used to express a subject concept. Hidden conceptual mapping relationships, i.e. compatibility relationships, exist among them. Most libraries hold numerous manually indexed document records that contain class numbers together with descriptor strings or keyword strings. By processing these data, we can mine the concept mapping relationships among class numbers, descriptor strings and keyword strings in order to construct a knowledge base.

The CLC is a library classification based on scientific classification and conceptual relations, so we can regard it as a semantic network that can be used to organize all kinds of information. The reasons why we choose the CLC as the framework of the knowledge base are as follows.

Both classification schemes and thesauri, and indeed all KOS, use the methodology of classification. The former use an explicit classification system, while the latter use hidden ones, such as the cross-reference system, the categorical index and the hierarchical index. The classification scheme is the main part of the integrated vocabulary system, that is, the classification/thesaurus system, and is much easier to accept and understand.

The CLC is a large universal classification compiled by Chinese experts. It has been widely used to classify and search books, audio-visual materials and other kinds of information. The CLC has the most comprehensive influence domestically and has numerous users; it has therefore been regarded as the national standard, though it is not officially authorized as one.

Since it was first published in 1975, the CLC has been continuously revised to meet information processing and access needs. It is currently in its 4th edition, and its electronic edition is stored in MARC format. The new edition offers a more logical knowledge organization structure, more extensive coverage of knowledge, and faceted coordination.

The CLC is used in most collections of Chinese documents. If we want to use their indexing records to construct knowledge bases, choosing the CLC as the framework avoids having to convert the class numbers of other classification schemes into CLC class numbers.

Most experts have endorsed the feasibility of applying the CLC to organizing Internet resources. Meanwhile, digitization, faceted coordination, combination with natural language and hyperlinks have been added to the CLC, so it can be applied not only in the traditional library but also in web environments. Our automatic indexing and classification system is designed to organize both traditional documents and digital information.
The applicability of the CLC in both the traditional library and web environments meets our needs.

Given these advantages, we use the CLC as the framework when constructing the knowledge base to realize concept indexing and searching.

3. Comparison between the structure of the CLC Knowledge Base and the CLC system

The CLC includes schedules, tables and indexes as well as other classifications. Following the trend toward integrating classifications with thesauri, the CLC mapped its class numbers to the descriptors of the Chinese Thesaurus, as the DDC is mapped to LCSH, and an integrated vocabulary named the Classified Chinese Thesaurus (hereinafter CCT) was developed; its first edition was compiled from 1987 to 1993. The CLC, the Chinese Thesaurus and the CCT together made up a KOS, named the CLC system, shown in Figure 1.

Although the CLC system served the traditional library very well, its disadvantages become apparent when it is applied to the automatic processing of digital information on the Web. The disadvantages are as follows.

The CLC system, both classification scheme and thesaurus, is a controlled language and lacks the elasticity of natural language.

The CLC system has a long revision period, about eight to nine years, so many new words and subjects are not incorporated in a timely manner.

The present classifications and thesauri are small in scale because they are printed editions.

The CLC system cannot be directly applied to automatic information processing.

We choose the CLC schedule to organize the knowledge base and improve on it. Through statistics and computer technology, we can discover the compatibility relationships among the class numbers, descriptor strings and keyword strings in the knowledge base. Compared with the CLC system, the knowledge base adds new features and functions, i.e. an interface to natural language, a continuously increasing scale and timely updating, to adapt to the development of information organization on the Web.

The knowledge base comprises three parts: a knowledge base for classification, a knowledge base for subject indexing and a supplementary knowledge base. The concordance of class numbers and keyword strings is the main part of the knowledge base for classification. The go-list, stop-list, dictionary of synonyms and semantic dictionary compose the knowledge base for subject indexing. Tables of areas, periods and document types compose the supplementary knowledge base, which is used to extract subjects concerning area, period and document type from the documents. The structure and composition of the knowledge base are shown in Figure 2.

Figures 1 and 2 show the framework of the CLC system and the structure of the knowledge base respectively. Both are based on the CLC schedules and map class numbers to descriptors or keywords, so both can be used to integrate classification with subject indexing. However, compared with the CLC system, the knowledge base is better suited to automatic indexing and intelligent searching in its content, scale, structure and function. The reasons are as follows.

The CLC system reveals only the mapping relationships between CLC class numbers and the descriptors of the Chinese Thesaurus, while the knowledge base reveals the mapping relationships among class numbers, descriptor strings and keyword strings.

The CLC system comprises only the class numbers and descriptors included in the CLC schedule and the Chinese Thesaurus, whereas the data of the knowledge base come from manually indexed records, which include a great many built class numbers, keywords and new words. The scale of the knowledge base is therefore larger than that of the CLC system.

In the CLC system, one class number maps to at most 20 descriptors or strings, and to 2-3 on average. In the knowledge base, by contrast, one class number maps to 10-14 keyword strings on average, and in some cases to several hundred. The knowledge base can therefore reveal the hidden concepts in the classes.

The terms in the CLC system are updated very slowly because both the CLC and the CCT have long revision periods and are maintained by hand. The knowledge base, however, is compiled and maintained by machine and can incorporate newly proposed terms in real time. A larger vocabulary, especially one that includes new words, leads to higher indexing consistency and correctness.

Because of its limited scale and vocabulary, the CLC system is applied only to indexing and classifying literature by hand. The knowledge base can ensure higher quality and correctness because of its larger scale, richer vocabulary and flexibility. Moreover, the knowledge base is applied not only to automatic indexing and classification but also to more intelligent information searching.

The knowledge base can supply descriptors and keywords as indexing terms at the same time, index facets such as area, period and document type separately, and use its dictionary of synonyms to add entry words. All these advantages provide users with multi-aspect, intelligent searching.

In general, a KOS and the library collections are separate. In our system, we use database and hyperlink technology to connect the knowledge base with the collections of literature, like the directory of a search engine on the Internet.
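The three-part structure described in this section can be pictured as a small set of record types. The following is only a minimal sketch under stated assumptions: all class and field names are hypothetical, the concordance is stored as a mapping from keyword string to class number, the go-list is read as the set of terms accepted for indexing, and the sample entries (including the class number G254) are purely illustrative. None of this is the authors' actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationKB:
    """Knowledge base for classification (hypothetical layout)."""
    # Assumption: concordance stored as keyword string -> CLC class number.
    concordance: dict[str, str] = field(default_factory=dict)

    def strings_for(self, class_number: str) -> list[str]:
        """Return every keyword string mapped to the given class number."""
        return sorted(s for s, c in self.concordance.items() if c == class_number)


@dataclass
class SubjectIndexingKB:
    """Knowledge base for subject indexing (hypothetical layout)."""
    go_list: set[str] = field(default_factory=set)      # assumed: terms accepted for indexing
    stop_list: set[str] = field(default_factory=set)    # assumed: terms excluded from indexing
    synonyms: dict[str, str] = field(default_factory=dict)        # entry word -> descriptor
    semantic_codes: dict[str, str] = field(default_factory=dict)  # word -> semantic code


@dataclass
class SupplementaryKB:
    """Facet tables for area, period and document type (hypothetical layout)."""
    areas: dict[str, str] = field(default_factory=dict)
    periods: dict[str, str] = field(default_factory=dict)
    document_types: dict[str, str] = field(default_factory=dict)


@dataclass
class CLCKnowledgeBase:
    classification: ClassificationKB = field(default_factory=ClassificationKB)
    subject_indexing: SubjectIndexingKB = field(default_factory=SubjectIndexingKB)
    supplementary: SupplementaryKB = field(default_factory=SupplementaryKB)


# Illustrative entries only; they are not taken from the real knowledge base.
kb = CLCKnowledgeBase()
kb.classification.concordance["automatic indexing"] = "G254"
kb.classification.concordance["automatic classification"] = "G254"
kb.subject_indexing.synonyms["auto-indexing"] = "automatic indexing"
print(kb.classification.strings_for("G254"))
```

Keeping the three parts as separate objects mirrors the separation in the text: the concordance serves classification, the word lists and dictionaries serve subject indexing, and the facet tables are consulted independently.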

4. Key technologies in the construction of the CLC Knowledge Base

Several key technologies are involved in the construction of the CLC Knowledge Base. We introduce them below.

4.1. Collecting source data from manually indexed records and library classification

First, we collect source data to build the source database. There are four kinds of data source:
(1) the indexes of the CLC and the class number-descriptor string parallel list of the Classified Chinese Thesaurus;
(2) indexing records of large libraries, e.g. Beijing Library and Shanghai Library, which include CLC class numbers and descriptors of the Chinese Thesaurus;
(3) indexing records of periodical literature in bibliographic databases, which include CLC class numbers and keyword strings, such as the Database for Chinese Periodicals of Science & Technology (namely VIP) and the Database for Social Newspaper and Periodicals compiled by Shanghai Library;
(4) a database of titles, composed of CLC class numbers and titles taken from some well-known bibliographic databases.

Next, we filter out erroneous and duplicate records to form a source database, which contains the mapping relationships between class numbers and descriptor strings or keyword strings.

4.2. Constructing the knowledge base by statistical methods

After the data collection, we extract terms and class numbers from the source database, compute the frequency of the terms, and measure the co-occurrence frequency of class numbers and strings to construct the knowledge base. Of all the dictionaries in the knowledge base, the class number-keyword string parallel list is the most important to construct. Here we use statistical methods to mine the conceptual mapping relationships between class numbers and keyword strings.
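The filtering and counting steps described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the record layout `(record_id, class_number, keyword_strings)` is hypothetical, a duplicate is taken to be a record with a repeated id, an erroneous record is taken to be one with a missing class number or no keyword strings, and the sample records are invented. The counter names follow the freq_clc, freq_keyword and freq_gx notation of the paper's formulas.

```python
from collections import Counter


def count_statistics(records):
    """Count the three statistics from (record_id, class_number, keyword_strings) tuples."""
    freq_clc = Counter()      # class number -> number of records containing it
    freq_keyword = Counter()  # keyword string -> number of records containing it
    freq_gx = Counter()       # (class number, keyword string) -> co-occurrence count
    seen_ids = set()
    for record_id, clc, keywords in records:
        # Assumed filter: skip repeated ids and records with missing fields.
        if record_id in seen_ids or not clc or not keywords:
            continue
        seen_ids.add(record_id)
        freq_clc[clc] += 1
        for kw in set(keywords):  # count each string once per record
            freq_keyword[kw] += 1
            freq_gx[(clc, kw)] += 1
    return freq_clc, freq_keyword, freq_gx


# Invented sample records for illustration only.
records = [
    ("r1", "G254", ["automatic indexing", "thesaurus"]),
    ("r2", "G254", ["automatic indexing"]),
    ("r2", "G254", ["automatic indexing"]),   # duplicate id, skipped
    ("r3", "TP391", ["automatic indexing"]),
    ("r4", "", ["thesaurus"]),                # missing class number, skipped
]
freq_clc, freq_keyword, freq_gx = count_statistics(records)
print(freq_gx[("G254", "automatic indexing")])
```

Duplicates are removed by record id rather than by content, so that two different records carrying the same class number and string still count as two indexers agreeing.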
From three statistics, namely the frequency of the class number, the frequency of the keyword string, and the co-occurrence frequency of the class number and the keyword string, we derive two parameters often used in data mining, the degree of support and the degree of confidence, to discover the mapping relationships between class numbers and keyword strings. We can then generate the knowledge base for automatic classification.

The degree of support is the co-occurrence frequency of the class number and the keyword string in the source database. A higher co-occurrence frequency shows that more indexers agree on the conceptual mapping relationship between the two.

$$\mathrm{Support}(\mathit{keyword} \Rightarrow \mathit{clc}) = P(\mathit{clc}, \mathit{keyword}) = \mathit{freq\_gx}$$

P(clc, keyword): the probability that the class number and the keyword string co-exist in an indexing record of the source database; it can be measured by the co-occurrence frequency freq_gx.

Generally speaking, the conceptual mapping relationship can be considered correct if the support is >= 2. The greater the degree of support, the more reliable the mapping relationship.

The degree of confidence reflects the probability of the class number, given that the keyword string has appeared.

$$\mathrm{Conf}(\mathit{clc} \mid \mathit{keyword}) = \frac{P(\mathit{clc}, \mathit{keyword})}{P(\mathit{keyword})} = \frac{\mathit{freq\_gx}}{\mathit{freq\_keyword}}$$

P(clc, keyword): the probability that the class number and the keyword string co-exist in an indexing record of the source database; it can be measured by the co-occurrence frequency freq_gx.
P(keyword): the probability that the keyword string appears, i.e. the frequency freq_keyword of the string in the whole source data.

If the degree of support and the degree of confidence of a class number and keyword string each reach their threshold, the conceptual mapping relationship between them is accepted.

4.3. Measuring the similarity to resolve the many-to-many relationships between class numbers and strings

The relationship between class numbers and keyword strings in the source database is many-to-many. In our system, each string maps to exactly one class number, so each string must be assigned a unique class number in the knowledge base. There are many methods for measuring the similarity between class numbers and strings, such as mutual information (MI), log-likelihood (LogL) and Dice. Here we use the Dice measure to find the best class number for a string.

$$\mathrm{Dice} = \frac{P(\mathit{clc}, \mathit{keyword})}{\frac{1}{2}\,[P(\mathit{clc}) + P(\mathit{keyword})]} = \frac{2\,\mathit{freq\_gx}}{\mathit{freq\_clc} + \mathit{freq\_keyword}}$$

where:
Dice: the Dice coefficient, measuring how strongly the class number and the keyword string co-occur;
P(clc): the probability that the class number exists in the source database, viz. the frequency freq_clc of the class number;
P(keyword): the probability that the keyword string exists in the source database, viz. the frequency freq_keyword of the keyword string;
P(clc, keyword): the probability that the class number and the keyword string co-exist, viz. their co-occurrence frequency freq_gx.
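The support, confidence and Dice measures can be sketched in a few lines. This is a minimal sketch under assumptions rather than the authors' code: the function names are hypothetical, the confidence threshold of 0.3 is an arbitrary placeholder (the paper gives no value; only the support threshold of 2 comes from the text), and the counts in the example are invented. Candidates passing both thresholds are ranked by Dice, and the highest-scoring class number is kept for each string.

```python
def support(freq_gx: int) -> int:
    """Support: the co-occurrence frequency of class number and keyword string."""
    return freq_gx


def confidence(freq_gx: int, freq_keyword: int) -> float:
    """Confidence: freq_gx / freq_keyword."""
    return freq_gx / freq_keyword


def dice(freq_gx: int, freq_clc: int, freq_keyword: int) -> float:
    """Dice: 2 * freq_gx / (freq_clc + freq_keyword)."""
    return 2 * freq_gx / (freq_clc + freq_keyword)


def best_class_numbers(freq_clc, freq_keyword, freq_gx,
                       min_support=2, min_confidence=0.3):
    """Map each keyword string to the accepted class number with the highest Dice.

    min_confidence is a hypothetical placeholder value, not taken from the paper.
    """
    best = {}
    for (clc, kw), gx in freq_gx.items():
        if support(gx) < min_support:
            continue
        if confidence(gx, freq_keyword[kw]) < min_confidence:
            continue
        score = dice(gx, freq_clc[clc], freq_keyword[kw])
        if kw not in best or score > best[kw][1]:
            best[kw] = (clc, score)
    return best


# Invented counts for illustration only.
freq_clc = {"G254": 10, "TP391": 40}
freq_keyword = {"automatic indexing": 12}
freq_gx = {("G254", "automatic indexing"): 8, ("TP391", "automatic indexing"): 4}
best = best_class_numbers(freq_clc, freq_keyword, freq_gx)
print(best)
```

In this example both candidates pass the thresholds, but the string co-occurs with G254 in 8 of that class number's 10 records and with TP391 in only 4 of 40, so Dice favors G254.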

If one string maps to multiple class numbers, the best class number is the one with the maximum Dice value.

4.4. Using Cilin, a thesaurus of Chinese words, to create a semantic dictionary for recognizing synonyms

Turning keywords into descriptors in subject indexing, measuring the semantic similarity between indexing subjects and the terms in the knowledge base in automatic classification, and concept searching: none of these processes can be achieved without recognizing synonyms. It is therefore important to create a semantic dictionary for recognizing synonyms.

Cilin is a semantically classified dictionary, organized like a semantic tree. It divides Chinese words at three levels according to their semantic relationships, into 14 major classes, 94 secondary classes and 1428 small classes. The vocabulary of Cilin consists mostly of simple words, which serve as the morphemes of compounds. By using Cilin to create the semantic dictionary, we can, on the one hand, directly recognize synonyms in the form of morphemes and, on the other hand, mine the synonymous relationships among compounds.

The semantic code is structured as follows, where each (number) is a two-digit number, as the example below shows:

    [Semantic code] => (major class) (secondary class) (small class) (group)
    major class     => (capital letter)
    secondary class => (capital letter) (lowercase letter)
    small class     => (capital letter) (lowercase letter) (number) (number)
    group           => (capital letter) (lowercase letter) (number) (number) (number)

For example, the semantic code of the word hotel is [Dm040901]; the corresponding codes of its major class, secondary class, small class and group are (D), (Dm), (Dm0409) and (Dm040901). (D) represents the major class Thing, (Dm) the secondary class Organization, (Dm0409) the word group Hostel under the small class Shop, and (Dm040901) the group Hotel. We can then code all the morphemes in this way to create a semantic dictionary.
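The coding scheme above can be illustrated with a short parser. This is a minimal sketch under assumptions: the function names are hypothetical, each (number) is taken to be two digits as in the Dm040901 example, and the codes Dm040902 and Da010101 are invented for illustration. The `shared_levels` helper is only one simple way to compare two codes (counting the leading levels they have in common); it is not claimed to be the semantic distance measure used by the authors.

```python
import re
from typing import NamedTuple

# One capital letter, one lowercase letter, then three two-digit numbers.
CODE_PATTERN = re.compile(r"^([A-Z])([a-z])(\d{2})(\d{2})(\d{2})$")


class SemanticCode(NamedTuple):
    major: str      # e.g. "D"
    secondary: str  # e.g. "Dm"
    small: str      # e.g. "Dm0409"
    group: str      # e.g. "Dm040901"


def parse_semantic_code(code: str) -> SemanticCode:
    """Split a semantic code such as 'Dm040901' into its four level codes."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a valid semantic code: {code!r}")
    upper, lower, n1, n2, n3 = match.groups()
    return SemanticCode(
        major=upper,
        secondary=upper + lower,
        small=upper + lower + n1 + n2,
        group=upper + lower + n1 + n2 + n3,
    )


def shared_levels(code_a: str, code_b: str) -> int:
    """Count how many leading levels (0 to 4) two codes share. Illustrative only."""
    count = 0
    for a, b in zip(parse_semantic_code(code_a), parse_semantic_code(code_b)):
        if a != b:
            break
        count += 1
    return count


hotel = parse_semantic_code("Dm040901")
print(hotel)
print(shared_levels("Dm040901", "Dm040902"))  # hypothetical sibling code
```

Two words whose codes agree at all four levels fall in the same group and can be treated as synonym candidates; fewer shared levels indicate a looser semantic relationship.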
Through the semantic dictionary, we can analyze the semantics of terms to measure the semantic distance between two terms, and then turn keywords into descriptors and measure the semantic similarity between two strings to realize automatic classification and concept searching.

The above introduces the key technologies for constructing the class number-keyword string parallel list and the semantic dictionary, which are the main strengths of the knowledge base. Since these technologies are particular to the construction of this knowledge base, we have described them at length. Other technologies used in the construction are not detailed here.

5. Application of CLC Knowledge Base

The knowledge base has the framework of the CLC and is based on manual indexing. It has constructed mapping relationships among class numbers, descriptor strings and keyword strings, based on the compatibility principle of classification schemes, thesauri and natural language, and it includes an abundant vocabulary, synonyms, and mapping relationships between keyword strings and class numbers. The knowledge base can be broadly applied to automatic indexing and classification, and also to concept searching.

5.1. To realize automatic indexing by word segmentation aided by the go-list and stop-list and subject control aided by the dictionary of synonyms

We select the title, abstract, author-supplied keywords, references and so on as the indexing sources; segment their text using the maximum matching algorithm aided by the go-list and stop-list; calculate word frequency, word number and word position weight to produce ranked indexing terms; and then turn these terms into descriptors using the dictionary of synonyms.

5.2. To realize automatic classification aided by the class number-keyword string parallel list, the dictionary of synonyms and the tables of areas, periods and document types

The automatic classification discussed in this article classifies documents by keyword strings and concepts. First, it classifies documents by strings rather than single words, which improves correctness and precision. Second, it classifies documents by conceptual matching. When matching the indexing terms with the terms in the knowledge base, it first calculates word-form similarity; if there is no result, it calculates semantic similarity with the aid of the dictionary of synonyms and the semantic dictionary, working out the best CLC class number while balancing correctness and speed. Third, it is a case-based method, that is, one based on indexing experience. Every record in the knowledge base is an example; the indexing terms or strings are matched against these examples to work out the best classification result. Fourth, the facets of area, period and document type in the text are indexed separately using the subdivisions, which avoids some shortcomings of the CLC system when applied to automatic classification.

5.3. To realize concept searching and multiple-approach searching based on the dictionary of synonyms and the results of automatic indexing and classification

From the perspective of indexing, the results of subject indexing include two parts, i.e. keyword strings and descriptor strings, which help users search not only by keywords and descriptors but also by strings rather than by single words; furthermore, it can add retrieval
entries with the aid of the dictionary of synonyms and realize concept searching through the semantic dictionary, improving searching efficiency. From the perspective of classification, the results of classification include the main class number and the subdivision numbers for area, period and document type, so users can search for information by subject, area, period and document type.

6. Conclusion

The knowledge base, as a KOS built on the framework of the CLC, uses dual-indexed records in bibliographic databases, which contain class numbers together with descriptor strings or keyword strings, and therefore has both literary and user warrant. Professionals revise the data of the knowledge base after the statistical computation, which improves its accuracy. At the same time, the knowledge base is compiled with computer assistance from the statistics of a large corpus, so subjectivity in mapping class numbers to strings can be avoided. Although the knowledge base is based on the CLC, it has broader functions than the CLC system itself.

We think that a current KOS is the combination of indexing languages with modern computer and network technology. The CLC Knowledge Base we have constructed is such an example: it possesses an abundant vocabulary and semantic relationships, and it combines traditional indexing languages, such as the CLC and CCT, with modern technologies, such as databases, data mining, hyperlinks and computational linguistics. In a sense, it has some features of an ontology suited to today's automated information processing. However, the CLC Knowledge Base still has shortcomings in machine understanding and intelligent reasoning. Although the knowledge base has proved of practical use in the intelligent processing of information, it still needs further research and improvement.