Text Mining for Documents Annotation and Ontology Support


Jan Paralic and Peter Bednar
Department of Cybernetics and Artificial Intelligence, Technical University of Kosice, Letná 9, Kosice, Slovakia

Abstract

This paper surveys basic concepts of text data mining and some of the methods used to elicit useful knowledge from collections of textual data. Three text data mining techniques (clustering/visualisation, association rules and classification models) are analysed, and the possibilities of exploiting them within the Webocracy project [footnote 1] are shown. Clustering and association rule discovery are well suited as supporting tools for ontology management. Classification models are used for automatic document annotation.

1 Introduction

Text data mining methods can be applied to obtain new, useful information (or knowledge) from a possibly large collection of textual documents. As the concept of knowledge discovery in texts (KDT) is quite new, section 2 describes the basic KDT process and its particular steps. The process can be divided into two main phases. In the first phase, free-form text documents are transformed into an internal or intermediate form, which is structured data suitable for text data mining, i.e. the second phase of the whole process. Section 3 describes internal representation forms of a text document collection, together with some important pre-processing steps needed to obtain an efficient and useful internal representation. Section 4 analyses different text data mining approaches and the algorithms supporting them. First, clustering/visualisation and association rules are presented as unsupervised text mining approaches. Next, supervised approaches used for building classification models are described. Section 5 sketches the possible use of some of the described text mining methods within the WEBOCRAT system.
2 Knowledge Discovery in Texts

Knowledge discovery in texts (KDT), or text data mining, can be defined in the same way as knowledge discovery in databases (KDD), except that the data are textual. This implies a significant difference from KDD, which uses well-structured databases as its data source. In KDT, plain textual documents are usually used. There are also some minor attempts to use (partially or fully) structured textual documents such as HTML or XML documents, in order to exploit not only the plain textual parts but also the additional structural information.

Footnote 1: IST Webocracy: Web Technologies Supporting Direct Participation in Democratic Processes

Despite this simple adaptation of the KDD definition, there is considerable confusion about what KDT really is. For example, Marti Hearst [2] claims that it is important to distinguish between text data mining and information retrieval. The goal of information retrieval is to help users find documents that satisfy their information needs. Hearst describes information retrieval as a way to pull out the documents you are interested in and push away the others. In other words, information retrieval is the process of finding information that is already known and has been put into a document by its author. In text data mining, in contrast, a collection of documents is examined with the aim of discovering information (knowledge) that is not contained in any individual document in the collection.

Yves Kodratoff [1] distinguishes between inductive and deductive text mining. The better-known deductive text mining is called information extraction, and amounts to finding instances of a predefined pattern in a set of texts. Inductive text mining, on the other hand, looks for unknown patterns or rules to discover inside a set of texts. In the rest of this paper, text data mining always refers to inductive text mining.

2.1 Particular steps of the KDT process

Text data mining is a much more complex task than data mining [7], because it involves text data that are inherently unstructured and fuzzy. The KDT process can be divided into two main phases.

1. Transformation of (free-form) text documents into an internal or intermediate form (this is analogous to the data pre-processing techniques in the KDD process).
2. Text mining itself (A. H. Tan in [7] calls it knowledge distillation), which deduces patterns or knowledge from the intermediate form.

In greater detail, the KDT approach and its particular steps can be compared with the steps of the KDD process [8].

1. Understanding the application domain and the goals of the KDT process: the user must define which concepts are interesting.
2. Acquiring or selecting a target data set: texts must be gathered using information retrieval tools or manually.
3. Data cleaning, pre-processing and transformation: concepts must be described, and texts need to be analysed and stored in the internal representation form, usually after eliminating stop-words and possibly after stemming and the exclusion of too frequent terms.
4. Model development and hypothesis building: identifying concepts in the collection.
5. Choosing and executing suitable data mining algorithms: e.g. the application of statistical techniques (the text data mining task).
6. Result interpretation and visualisation: a human must interpret the results.

2.2 Text Data Mining Tasks

Mining the internal representation form of a document collection induces patterns and relationships across documents [7]. Some examples of unsupervised text mining tasks are:

- Clustering/visualisation of documents
- Association rules

A typical example of a supervised text mining task is:

- Predictive modelling (classification models)

3 Representation of Textual Documents

For the internal representation of textual documents we can use an information retrieval model (see the formal definition in [2]). The classic models in information retrieval assume that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document's main themes [2]. Different index terms clearly have varying relevance when used to describe document contents in a particular document collection. This effect is captured by assigning a numerical weight to each index term of a document. Let $t_j$ be an index term, $d_i$ be a document, and $w_{ij} \ge 0$ be a weight associated with the pair $(d_i, t_j)$. This weight quantifies the importance of the index term $t_j$ for describing the semantic contents of the document $d_i$. Depending on how these weights are calculated and treated, there are three classic information retrieval models, namely the Boolean, the vector [5], and the probabilistic models [2].

3.1 Classical information retrieval models

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. This model considers index terms to be either present or absent in a document. As a result, the index term weights are assumed to be binary, i.e. $w_{ij} \in \{0, 1\}$. A query is composed of index terms linked by three logical connectives: not, and, or. Thus a query is essentially a Boolean expression with precise semantics. As a result, this model is unable to recognise partial matches, which frequently leads to poor performance. Another weakness is that, by considering only the presence or absence of a term, the binary weighting scheme ignores the information inherent in term frequencies.
A related problem concerns document length. As a document gets longer, the number of distinct terms used will in general increase. Many of these term usages in very long documents will be unrelated to the core content of the document, but are treated as being of the same significance as similar occurrences in short documents.

The vector model removes these disadvantages by assigning non-binary weights to index terms in queries and in documents. The frequency of occurrence of a term in a document (term frequency, tf) [5] [6] is a common weighting scheme here and is generally used as the basis of the weighted document vector. The term frequency can be combined with the collection frequency factor, which is used to discriminate one document from another. Most schemes used for this factor, e.g. the inverse document frequency (idf), assume that the importance of a term is inversely related to the number of documents the term appears in. Combining these two factors gives the tfidf scheme, the most widely used weighting scheme, defined as:

$$w(i,j) = \mathrm{tfidf}(d_i, t_j) = N_{d_i,t_j} \cdot \log\frac{|C|}{N_{t_j}} \qquad (1)$$

where $N_{d_i,t_j}$ denotes the number of times the term $t_j$ occurs in the document $d_i$ (term frequency factor), $N_{t_j}$ denotes the number of documents in collection $C$ in which $t_j$ occurs at least once (document frequency of the term $t_j$), and $|C|$ denotes the number of documents in collection $C$. This weighting scheme embodies two intuitive presumptions: the more often a term occurs in a document, the more representative it is of the content of the document, and the more documents the term occurs in, the less discriminating it is.

In order to fit the weights into the interval [0, 1] and to represent documents by vectors of equal length, the document vectors resulting from tfidf weighting are often normalised to unit length, so that the final normalised term weight can be computed as:

$$w(i,j) = \mathrm{tfidf}_{\mathrm{norm}}(d_i, t_j) = \frac{\mathrm{tfidf}(d_i, t_j)}{\sqrt{\sum_{t_k \in T} \mathrm{tfidf}(d_i, t_k)^2}} \qquad (2)$$

where $T$ is the set of terms used for the vector representation of the document $d_i$.

The probabilistic model attempts to capture the IR problem within a probabilistic framework [2]. The index term weights are all binary and a query is a subset of index terms. Given a user query, there is a set of documents that contains exactly the relevant documents and no others (the so-called ideal answer set). The querying process can be seen as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess has to be made as to what they could be. This initial guess makes it possible to generate a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set.

Through several different measures, Salton and Buckley [6] showed that the vector space model is expected to outperform the probabilistic model on general collections.
This also seems to be the dominant view among researchers, practitioners, and the Web community, where the popularity of the vector model runs high [2].

3.2 Term selection/reduction

Documents can be described by thousands of terms, and this high dimensionality of the document space can cause efficiency problems. Terms that do not describe the content of documents introduce noise, which can degrade the performance of the resulting text mining model. For these reasons, the selection of relevant terms is a very important step in text processing. The appropriate method for term selection generally depends on the text mining algorithm used: whether it is a supervised text data mining algorithm (i.e. information about the classes of particular documents is available) or an unsupervised one (i.e. no such information is available). The main difference is that methods for supervised learning can use information about document categories, so the relevance of a term can be determined by how well it separates documents into categories. The classification accuracy of the generated model (classifier), estimated on the testing examples, can be used as a guide to finding an optimal set of terms. Note that unsupervised term selection methods can generally also be used for supervised learning.
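The tfidf weighting of equations (1) and (2) in section 3.1 can be illustrated with a short sketch. It is not taken from the paper: the function name `tfidf_vectors`, the toy documents and the choice of the natural logarithm are assumptions made only for illustration, and documents are assumed to be already tokenised.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one normalised tfidf weight dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        # Equation (1): term frequency times log(|C| / document frequency).
        raw = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        # Equation (2): divide by the Euclidean length of the document vector.
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / length if length > 0 else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical toy collection of three already tokenised documents.
docs = [
    ["council", "budget", "budget", "vote"],
    ["council", "park", "opening"],
    ["council", "budget", "park"],
]
vectors = tfidf_vectors(docs)
```

In this toy collection the term "council" occurs in every document, so its idf factor is log(1) = 0 and it receives zero weight, which matches the presumption that terms occurring in many documents discriminate poorly.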

Two unsupervised term selection methods can be mentioned. The document frequency threshold [7] is the simplest technique for term selection. In this method, the document frequency of every term in the training collection is computed, and terms whose document frequency is lower than a specified threshold are removed from the set of terms used for document representation. Over the years, alternative modelling paradigms have been proposed for each type of classic model. For the vector model, as a representative of algebraic models, a very interesting extension, latent semantic indexing (LSI), has been proposed in [13]. Among supervised term selection methods, information gain, for example, is frequently employed. Another approach uses the χ² statistic.

4 Text Mining Methods

4.1 Clustering/visualization

The self-organizing map (SOM) [9] is very often used for clustering textual documents in vector representation. SOM is an unsupervised neural network that provides a mapping from a high-dimensional feature space onto a two-dimensional space such that similar data are mapped close to each other. This allows a very intuitive cluster representation and analysis. A comparison of the SOM approach with a statistical one on one particular domain can be found in [12]. The comparison showed that the statistical approach was not powerful enough to deal with larger text collections, and that its results were quite difficult to interpret.

Of particular interest for text mining is the combination of the basic SOM algorithm with the LabelSOM method, which automatically extracts a classification from the trained SOM [10]. This method has been used, for example, within the SOMLib system [11]. The SOMLib Digital Library System provides methods for organizing large collections of electronic documents to allow topic-oriented browsing and orientation.

SOM provides only a flat, i.e. two-dimensional, representation of document clusters, which can be hard to interpret when the document collection is very large. Moreover, this representation of clusters is usually covered with documents very irregularly, due to the unbalanced distribution of topics. To overcome these limitations, the Growing Hierarchical SOM (GHSOM) [11] has been developed, which automatically creates a hierarchical organization of a set of documents. This allows the network architecture to determine the topical structure of the given document repository during the training process, creating a hierarchy of self-organizing maps, each of which provides a topologically sorted representation of a topical subset. Starting from a rather small high-level SOM, which provides a coarse overview of the various topics present in the collection, subsequent layers are added where necessary to display a finer subdivision of topics. Each map in turn grows in size until it represents its topic at a sufficient degree of granularity. Since not all topics are usually present equally strongly in a collection, this leads to an unbalanced hierarchy that assigns more map space to topics that are more prominent in the given collection. This allows the user to approach and intuitively browse a document collection in a way similar to conventional libraries.

Visualisation in this system is provided by means of a web interface. Each map is represented by an HTML page with links to lower-level (expanded) maps, or to the particular documents associated with a cell at the lowest (leaf) level (see Figure 1).

Figure 1: A partial view of the 2nd layer map for a text collection from Austrian newsletter Standard

4.2 Association Rules

This text data mining task discovers associations between concepts and expresses these findings as rules of the form B ⇒ H [support, confidence], where B as well as H may be a set of concepts or a single concept [8]. The rule means that if B is present in a text, then H is present with a certain confidence and a certain support. Following the usual definition (e.g. in [4]), confidence is the proportion of texts that contain both B and H among the texts that contain B, and support is the proportion of texts that contain both B and H among all texts in the collection. Such rules make it possible to predict the presence of one or more concepts from the presence of others. Moreover, complex rules may be discovered when combinations of concepts and/or words are allowed in the discovered rules, e.g. WORD_1 AND WORD_2 AND CONCEPT_1 AND CONCEPT_2 ⇒ CONCEPT_3. This kind of rule can be used to select sub-collections of text documents in which certain words are present.
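The definitions of support and confidence given above can be made concrete with a small sketch. The function names and the toy texts are hypothetical; each text is assumed to be already reduced to the set of concepts or words it contains.

```python
def support(texts, concepts):
    """Fraction of all texts that contain every concept in `concepts`."""
    concepts = set(concepts)
    return sum(1 for t in texts if concepts <= t) / len(texts)


def confidence(texts, body, head):
    """Fraction of the texts containing `body` that also contain `head`."""
    body, head = set(body), set(head)
    with_body = [t for t in texts if body <= t]
    if not with_body:
        return 0.0
    return sum(1 for t in with_body if head <= t) / len(with_body)


# Hypothetical collection: each text is the set of concepts found in it.
texts = [
    {"budget", "tax", "council"},
    {"budget", "tax"},
    {"budget", "park"},
    {"park", "council"},
]
# Rule "budget => tax": B = {budget}, H = {tax}.
rule_support = support(texts, {"budget", "tax"})
rule_confidence = confidence(texts, {"budget"}, {"tax"})
```

Here two of the four texts contain both concepts, and two of the three texts containing "budget" also contain "tax", so the rule would be reported with support 0.5 and confidence 2/3.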

The most commonly used approach to mining associations is the Apriori algorithm [4], which runs in two steps.

1. Find all frequent itemsets (i.e. tuples of concepts or terms). Each of these itemsets must occur at least as frequently as a pre-defined minimum support count.
2. Generate strong association rules from the frequent itemsets. These rules must satisfy the pre-defined minimum support and minimum confidence.

The name of the Apriori algorithm reflects the fact that the algorithm uses prior knowledge of frequent itemset properties in the following way. It employs an iterative approach known as level-wise search, in which k-itemsets are used to explore (k+1)-itemsets, starting with k=1 and continuing until no more frequent k-itemsets can be found. Each new k requires one full scan of the database. To improve the efficiency of the Apriori algorithm, the anti-monotone property is applied, which means that all subsets of a frequent itemset must themselves be frequent itemsets. This reduces the number of candidate itemsets generated for k+1 (from the already identified frequent itemsets of size k and less). Even so, the Apriori algorithm does not scale well to large databases, and document collections are such a case. An interesting method that generates the set of frequent itemsets without candidate generation is frequent-pattern growth, or simply FP-growth [4], which adopts a divide-and-conquer strategy.

4.3 Classification Models

The term classification can cover any procedure in which some decision or forecast is made on the basis of currently available information. In the context of data mining, a more restrictive interpretation is considered. First, we may aim at establishing the existence of classes (or clusters) in the data from a set of observations.
Or we may know for certain that there are some classes, and the aim is to establish a rule by which we can classify a new observation into one (or more) of the existing classes. The first case is known as unsupervised learning (clustering). In this chapter we use the term classification for the second case, supervised learning.

Rule induction algorithms. Any form of inference in which the premises do not deductively imply the conclusions can be thought of as induction. Text categorisation involves one special form of inductive inference: supervised concept learning, or classification learning. To learn a concept means to infer its general definition from a number of specific examples of it. A concept can be formally considered as a function from the set of all examples (in our case, from the set of all documents) to the Boolean set {true, false}, or equivalently to the set {document is assigned to category; document is not assigned to category}. For text categorisation, the concept is equivalent to some category, so the problem is to find a definition of this category from a number of documents. Rule induction algorithms are characterised by representing the target category as a set of "if-then" rules. These rules have the form "if <complex> then predict <category>", where <complex> is a conjunction of attribute tests (selectors). Rule induction algorithms have several advantages; the most noteworthy is that, of all the representations currently in use in concept learning, these rules are the easiest for humans to understand.
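The "if <complex> then predict <category>" form can be sketched as a small data structure. This only illustrates how such rules are represented and applied, not how CN2, RIPPER or any other algorithm induces them; the class names `Selector` and `Rule`, the restriction of selectors to term presence tests, and the example rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selector:
    """One attribute test: is `term` present (or absent) in the document?"""
    term: str
    present: bool = True

    def matches(self, doc_terms):
        return (self.term in doc_terms) == self.present


@dataclass(frozen=True)
class Rule:
    """if <complex> then predict <category>; the complex is a conjunction."""
    selectors: tuple
    category: str

    def covers(self, doc_terms):
        # A conjunction holds only if every selector matches the document.
        return all(sel.matches(doc_terms) for sel in self.selectors)


def classify(rules, doc_terms):
    """Return the categories predicted by every rule that covers the document."""
    return {rule.category for rule in rules if rule.covers(doc_terms)}


# Hypothetical hand-written rules; a rule learner would produce these from data.
rules = [
    Rule((Selector("budget"), Selector("tax")), "finance"),
    Rule((Selector("park"), Selector("budget", present=False)), "environment"),
]
predicted = classify(rules, {"budget", "tax", "council"})
```

Because each rule is a readable conjunction of tests, a person can inspect exactly why a document was assigned to a category, which is the advantage stressed above.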


Examples of rule induction algorithms are CN2 [15] (see Figure 2 for a visualisation of the results in the KDD Package* [12]), REP, IREP [16], and RIPPER [17]. Another possibility is to mine decision trees, e.g. using Quinlan's C4.5 algorithm.

Figure 2: Example of a visual representation of decision rules produced by the CN2 implementation within the KDD package.

* We have developed KDD Package within the EU funded INCO Copernicus project No GOAL - Geographic Information On-Line Analysis (GIS Data Warehouse Integration)

5 Exploitation of KDT in Webocracy

The Webocracy project responds to an urgent need to establish efficient systems that provide effective, secure and user-friendly tools, working methods and support mechanisms to ensure the efficient exchange of information between citizens and administrations. The project addresses the problem of providing new types of communication flows and services from public institutions to citizens, and improves the access of citizens to public administration services and information. The new types of services will increase the efficiency, transparency and accountability of public administration institutions and their policies toward citizens.

Within this project, the WEBOCRAT system is being designed and developed. The WEBOCRAT system is a Web-based system comprising Web publishing, computer-mediated discussion, virtual communities, discussion forums, organizational memories, text data mining, and knowledge modelling. The WEBOCRAT system will support communication and discussion, publication of documents on the Internet, browsing and navigation, opinion polling on questions of public interest, intelligent retrieval, analytical tools, alerting services, and convenient access to information based on individual needs.

5.1 Clustering/visualisation

We think that clustering/visualisation does not fit the functionality of the WEBOCRAT system as defined in [3], because documents in the WEBOCRAT system are primarily organized by their links to the knowledge model, so that the knowledge model is the primary means of document retrieval and topic-oriented browsing.

On the other hand, techniques like GHSOM, whose hierarchical structure is tailored to the actual text collection, could be useful as a supporting tool in the initial phase, when the knowledge model of a local authority is being constructed. This holds when the local authority has a representative set of text documents available in electronic form for this purpose. It is assumed that these documents will later be published using the WEBOCRAT system and linked to the knowledge model for intelligent retrieval purposes. Users must be aware, however, that GHSOM does not produce an ontology. It produces only a hierarchical structure in which documents about similar topics should be topologically close to each other, and documents about different topics should be topologically far from each other. Each node in this hierarchical structure is labelled by the (stemmed) terms that occur most often in the cluster of documents represented by the node. This list of terms can give some idea of the concept(s) that could be represented in the designed knowledge model. Finally, the particular documents form the leaf nodes of this hierarchical structure. In our opinion, it is necessary to look carefully through the whole structure, including the particular documents, in order to draw reasonable conclusions about the concepts proposed for the knowledge model and the relations among them.

5.2 Association rules

Association rules as presented in the previous section have been used, for example, in [8] in two different experiments.
The first experiment analysed political newspaper articles from two different periods in order to find out what they say about the mayor of a large Brazilian city (the first period was before a corruption scandal and the second after it). The second experiment analysed articles about various text mining tools as a form of competitive intelligence analysis. In both experiments, concepts were first defined by different approaches.

Identifying concept terms commonly requires natural language processing. In the WEBOCRAT system, the ontology provides this information instead, and we can represent documents directly as binary vectors. Such a vector has one element for each concept, equal to 1 if the document is linked to that concept. Another advantage is that a hierarchy of concepts from the particular domain (the local authority's area of work) is available within the WEBOCRAT system. We can make use of it to mine associations not only at the level of leaf concepts but also at higher levels of the hierarchy, obtaining so-called multi-level association rules [4].

Association rules can be exploited, for example, for automatic improvement of the knowledge model in the following way. When the only input attributes for the association rule mining algorithm are the concepts to which documents are linked, we are looking for frequently occurring linking patterns. These patterns can be compared with the actual ontology. When, for example, the algorithm finds an association between concepts X, Y and Z, and the ontology contains no relation between X, Y and Z, we can suspect a missing relation between them. This approach is suitable mainly for documents that were not linked automatically using a pre-defined template.
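The idea of comparing frequent linking patterns with the ontology can be sketched as follows. This is a simplified, unoptimised Apriori-style level-wise search written only for illustration, not the implementation used in the WEBOCRAT system; the concept names, the `known_relations` set and the minimum support count are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori-style) search for concept sets linked together often."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {c for t in transactions for c in t}
    level = {frozenset([c]) for c in items if count(frozenset([c])) >= min_count}
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        size = len(next(iter(level))) + 1
        # Join step: unions of two frequent k-itemsets that have k+1 elements.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every k-subset of a candidate must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
            and count(c) >= min_count
        }
    return frequent


# Hypothetical data: the set of concepts each document is linked to.
links = [
    frozenset({"waste", "fees", "housing"}),
    frozenset({"waste", "fees"}),
    frozenset({"waste", "fees", "transport"}),
    frozenset({"housing", "transport"}),
]
# Hypothetical relations already present in the ontology.
known_relations = {frozenset({"housing", "transport"})}

frequent = frequent_itemsets(links, min_count=3)
# Frequent concept pairs with no relation in the ontology are candidates
# for a missing relation, to be reviewed by a person.
missing = [set(s) for s in frequent if len(s) == 2 and s not in known_relations]
```

In this toy data the concepts "waste" and "fees" are linked together in three documents but are not related in the assumed ontology, so the pair is flagged for review. A real system would also generate rules with confidence values and would include ancestor concepts from the hierarchy.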

5.3 Classification models

In the WEBOCRAT system, an ontology is used as the knowledge model of the domain; it is composed of the concepts occurring in this domain and the relationships between them. Information is stored in the system in the form of text documents, which are annotated with the set of concepts relevant to the document content.

One strategy for document retrieval is based on concepts. The user selects concepts of interest and asks for information related to them. The decision about the relevance of a document to the user query is based on the similarity between the set of query concepts and the set of concepts with which the document is annotated. This document retrieval task can be viewed as a classification task in which the decision is whether or not the document is relevant for the user. With an appropriate ontology that models the domain well, the use of this knowledge model can yield better results than, for example, retrieval based on the vector representation of documents.

Retrieval accuracy depends on the quality of document annotation. Data mining methods can be very useful for guiding the user in annotating a new document. Annotating a new document is a classification task (a text categorisation task) in which we need to decide which concepts (each concept represents a category) are relevant to the content of the document. The system must propose relevant concepts for a new document in real time, so an important requirement on the algorithm used is execution time efficiency. The user can add or delete links between the new document and concepts, and these changes should be immediately integrated into the classifier. This requires the ability to learn incrementally. Weighting the relevance of concepts to the new document is better than a simple binary decision: concepts can be ordered by the weight of their relevance to the new document, and the user can search for additional relevant concepts according to this ordering.
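The three requirements named above (fast proposals, incremental updates and relevance weights rather than binary decisions) can be illustrated with a very simple centroid-style ranker. The paper does not say which classifier is used, so this choice, the class name `ConceptRanker`, and the toy terms and concepts are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


class ConceptRanker:
    """Incremental centroid-style ranker: one term-count profile per concept."""

    def __init__(self):
        self.profiles = defaultdict(Counter)

    def link(self, doc_terms, concept):
        """Add a document-concept link; only that concept's profile changes."""
        self.profiles[concept].update(doc_terms)

    def unlink(self, doc_terms, concept):
        """Remove a link that the user deleted."""
        self.profiles[concept].subtract(doc_terms)

    def rank(self, doc_terms):
        """Return (concept, weight) pairs ordered by cosine similarity."""
        doc = Counter(doc_terms)
        doc_len = math.sqrt(sum(v * v for v in doc.values()))
        scored = []
        for concept, profile in self.profiles.items():
            dot = sum(doc[t] * profile[t] for t in doc)
            prof_len = math.sqrt(sum(v * v for v in profile.values()))
            weight = dot / (doc_len * prof_len) if doc_len and prof_len else 0.0
            scored.append((concept, weight))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical annotated documents, given as lists of (stemmed) terms.
ranker = ConceptRanker()
ranker.link(["waste", "collection", "fees"], "waste management")
ranker.link(["bus", "timetable", "fees"], "public transport")
ranking = ranker.rank(["waste", "fees", "schedule"])
```

Adding or deleting a link touches a single profile, so user corrections take effect immediately without retraining, and the returned weights give the ordering in which concepts would be proposed to the user.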
Acknowledgments

This work is done within the Webocracy project, which is supported by European Commission DG INFSO under the IST program, contract No. IST ; within the TEXAS Project supported by the Austrian Institute for East- and Southeast Europe; and within the VEGA project 1/8131/01 Knowledge Technologies for Information Acquisition and Retrieval of the Scientific Grant Agency of the Ministry of Education of the Slovak Republic. The content of this publication is the sole responsibility of the authors, and in no way represents the view of the European Commission or its services.

References

[1] Kodratoff, Y. (2001) Rating the Interest of Rules Induced from Data and within Texts. In Proc. of the 12th IEEE International Conference on Database and Expert Systems Applications - DEXA 2001, Munich. (Long version of the paper, submitted for publication in Knowledge and Information Systems: An International Journal).
[2] Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison-Wesley Longman Publishing Company.
[3] Mach, M., Furdik, K. (2001) Webocrat system architecture and functionality. Webocracy deliverable R2.4, Technical University of Kosice, April 2001.

[4] Han, J., Kamber, M. (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[5] Salton, G., Wong, A., Yang, C. (1975) A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11).
[6] Salton, G., Buckley, C. (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5).
[7] Tan, A.-H. (1999) Text Mining: The state of the art and the challenges. In Proc. of the PAKDD'99 workshop on Knowledge Discovery from Advanced Databases, Beijing.
[8] Loh, S., Wives, L. K., and Palazzo, J. (2000) Concept-Based Knowledge Discovery in Texts Extracted from the Web. SIGKDD Explorations, Vol. 2, Issue 1, July 2000.
[9] Kohonen, T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, Springer Verlag, Berlin, Heidelberg, New York.
[10] Rauber, A. (1999) On the labeling of self-organizing maps. In Proc. of the International Joint Conference on Neural Networks, Washington, DC.
[11] Rauber, A., Dittenbach, M. and Merkl, D. (2000) Automatically Detecting and Organizing Documents into Topic Hierarchies: A Neural Network Based Approach to Bookshelf Creation and Arrangement. In Proc. of the 4th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL2000), Springer LNCS 1923, Lisboa, Portugal.
[12] Rauber, A., and Paralic, J. (2000) Cluster Analysis as a First Step in the Knowledge Discovery Process. In Journal of Advanced Computational Intelligence, Fuji Technology Press Ltd., Vol. 4, No. 4.
[13] Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., Harshman, R. A. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6).
[14] Lewis, D. D. (1998) Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning, 4-15, Chemnitz.
[15] Clark, P., Niblett, T. (1989) The CN2 induction algorithm. Machine Learning Journal, 3(4).
[16] Fürnkranz, J., Widmer, G. (1994) Incremental Reduced Error Pruning. Machine Learning: Proceedings of the 11th Annual Conference, New Brunswick, New Jersey.
[17] Cohen, W. W. (1995) Fast effective rule induction. Proceedings of the 12th International Conference (ML95), Morgan Kaufmann, San Mateo, California.

About the Authors

Jan Paralic received his Master's degree in technical cybernetics in 1992 and his PhD degree in artificial intelligence in 1998 from the Technical University of Kosice. His research currently focuses mainly on knowledge discovery (from databases as well as from texts) and knowledge management.

Peter Bednar received a Master's degree in cybernetics in 2001 from the Technical University of Kosice. His research interests include text categorisation, data mining and text mining.
