Concept Based Text Document Summarization Using Domain Ontology


Dr. S. Logeswari, ASP/CSE, Bannari Amman Institute of Technology, Sathyamangalam
Dr. R. Gomathi, AP (Sr.G)/CSE, Bannari Amman Institute of Technology, Sathyamangalam
Dr. B. Gomathy, AP (Sl.G)/CSE, Bannari Amman Institute of Technology, Sathyamangalam

ABSTRACT
In today's world, a web search engine often returns thousands of web pages, which makes it difficult for users to browse or to find relevant information. Clustering methods help to automatically group the retrieved documents into a list of meaningful sections. Summarization is the process of extracting and retaining the set of most important points from the original document. The important sentences are extracted from the source documents using the domain ontology for final summary generation. In this paper we propose a system for generating summaries of medical documents using the MeSH ontology.

KEYWORDS: Summarization, MeSH Ontology, Clustering, K-Means

1. INTRODUCTION
In recent years, with the massive expansion of the information society, the web has become a precious source of information for almost every domain of knowledge. This has led many researchers to consider the web as a legitimate repository for information retrieval and knowledge acquisition tasks. The web holds a massive amount of information for each possible domain and, given its high redundancy, can be a valid knowledge source for similarity computation. A text mining system therefore faces a large number of attributes. Knowledge discovery in databases techniques require input texts to be represented as a set of attributes in order to deal with them. This text-to-representation method is known as text or document indexing, and the attributes are called indexes. Indexing is a critical task in text mining because it has to represent the information in the text with minimum loss of semantics for its future usage.
1.1 DATA MINING
Data mining is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall objective of the data mining process is to extract information from a database and transform it into an understandable structure for further use. Besides the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

1.2 TEXT MINING
Text mining, also referred to as text data mining or text analytics, is the technique of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and statistical pattern learning. Text mining generally involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.

1.3 CLUSTERING
Clustering is an automatic learning approach aimed at grouping a set of objects into subsets, or clusters, based on their characteristics and according to their similarities. Clustering is an unsupervised classification. A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories. A good clustering method produces high-quality clusters in which (a) the intra-class similarity is high and (b) the inter-class similarity is low. In addition, (c) the quality of a clustering result depends on both the similarity measure used by the method and its implementation, and (d) the quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

1.4 DOCUMENT CLUSTERING
Document clustering supports automatic document organization, topic extraction and fast information retrieval or filtering. It is a fundamental operation in several applications such as document organization, corpus summarization, information retrieval and filtering, and automatic topic extraction. Document clustering has been widely applied to information retrieval systems for enhancing performance. The goal of a document clustering scheme is to minimize intra-cluster distances between documents while maximizing inter-cluster distances. Document clustering requires descriptors and descriptor extraction.
Descriptors are sets of words that describe the contents within the cluster. Document clustering is typically considered to be a centralized process. In traditional document clustering techniques, the terms of the documents are treated as features. An example of document clustering is web document clustering for search users.

1.5 SUMMARIZATION
Summarization is the process of extracting and retaining the set of most important points from the original document. The significant part of summarization is identifying the most informative part of the document. The information flow varies from document to document, and the importance of information is unpredictable. As the problem of information overload has grown and the quantity of data has increased, interest in automatic summarization has also increased. Text summarization or abstraction has always been a key activity in the information access context. Document summaries provide readers with condensed versions of the most relevant information found in documents. They can help readers assess the value of a document without having to read it, or they can be used as content repositories for extracting valuable facts or information.

1.6 SEMANTIC ANALYSIS
Semantic analysis figures out the meaning of linguistic input. It processes language to produce common-sense knowledge about the world, to extract data and to construct models of the world. Lexical semantics describes the meaning of component words and covers word sense disambiguation. Semantic analysis can begin with the relationships between individual words.

1.7 ONTOLOGY
An ontology is an engineering artifact that describes what exists in a particular domain. An ontology belongs to a specific domain of knowledge. It can be used as background knowledge that helps in finding the related meanings for the terms occurring in documents.
The scope of an ontology concentrates on the definitions of a certain domain, although sometimes the domain can be very broad. The domain can be an industry, an enterprise, a research field, or any other restricted set of knowledge, whether abstract, concrete or even imagined. An ontology is usually constructed with a certain task in mind. Present-day ontologies can be categorized into two general levels: those that form meta-language dictionaries and those that are derived from knowledge bases built for inference engines and expert systems. Text clustering and classification are two promising approaches to help users organize and contextualize textual information.

Medical Subject Headings (MeSH), published by the National Library of Medicine, mainly consists of a controlled vocabulary and a MeSH Tree. The controlled vocabulary contains several kinds of terms, such as descriptors, qualifiers, publication types, geographic terms and entry terms. Descriptor terms are the main concepts or main headings in the ontology. Entry terms are synonyms of, or terms related to, descriptors. MeSH descriptors are organized in the MeSH Tree, which can be seen as the MeSH concept hierarchy. In the MeSH Tree there are 15 categories (e.g. category A for anatomic terms), and each category is further divided into subcategories. In each subcategory, the descriptors are hierarchically arranged from most general to most specific. Because descriptors usually appear in more than one place in the tree, they are effectively represented as a graph rather than a tree.

2. EXISTING SYSTEM
There are two common approaches to automatic summarization: extraction and abstraction. The extraction method selects a subset of existing words, phrases, or sentences in the source document to form the summary. The abstraction method builds an internal semantic representation and uses natural language generation techniques to create a summary that is closer to a human-generated summary. A key-phrase extraction method is also used, which extracts the content from the source document using semantic relations. Lexical chains are used to represent the semantic relations.

2.1 SUMMARIZATION PROCESS
The steps involved in the general summarization system are:
(a) The input documents are analysed and the sentences are extracted one by one.
(b) A parse tree is generated for each sentence, and typed dependencies are also generated by the parser.
(c) Subject, predicate and object are extracted from the typed dependencies for each sentence.
(d) The extracted subject, predicate and object are represented as Resource Description Framework (RDF) triples.
(e) The semantic distance between each pair of triples is calculated using the Wu and Palmer metric, with values obtained from WordNet, and a semantic distance matrix is generated.
(f) Once the semantic distance matrix is generated, the mean of the semantic values is calculated for each pair of triples and provided as input to the clustering algorithm.
(g) The K-means clustering algorithm is applied to the mean values in order to group the triples that are semantically similar.
(h) After clustering, the points of each cluster are analysed for sentence extraction to generate the summary of the input document.

2.3 ISSUES IN EXISTING SYSTEM
The huge number of possible documents implies that many standard semantic models, such as the simple language model, the topic signature translation model and the context-sensitive semantic model, cannot be easily adapted. Whenever a parse tree is constructed, it occupies a lot of memory. Parse tree generation requires the tokens to be completely context free. When a parsing error occurs, it is hard to determine where in the source file the failure occurred.

3. PROPOSED SYSTEM
The major objective of the proposed work is to improve the quality of searching, based on a summarization model for MeSH labels. The approach is based on a concept-based weighting scheme that is used to index the words in the source document. It includes sentence extraction, Parts-of-Speech (POS) tagging, concept mapping and summarization.

POS tagging is applied to tag the source text automatically. Concept mapping uses semantic relations such as identity, synonymy, hypernymy and meronymy, which are identified from the MeSH ontology. The significance of the concepts in each document is represented as the concept weight. The concept weight is computed based on the semantic relations, using the frequency of each word and its importance.

3.1 MODULE DESCRIPTION
In this system, the MeSH ontology is used as the domain reference for concept extraction. A concept-based weighting scheme is used to index the words in the source document. The semantic weight of an individual concept is computed based on the semantic relationships identity, synonymy, hypernymy and meronymy. The system includes the following modules: (a) sentence extraction, (b) POS tagging, (c) concept mapping and (d) summarization.

3.1.1 SENTENCE EXTRACTION
Sentence extraction separates the paragraphs of the source document into individual sentences. It is a technique used for automatic summarization of a text, and works as a filter which allows only important sentences to pass. Sentences are extracted from a set of documents that contain similar content.

3.1.2 PARTS-OF-SPEECH TAGGING
The process of assigning one of the parts of speech to a given word is called POS tagging. POS tags include noun, verb, adverb, adjective, pronoun, conjunction and their sub-categories.

3.1.3 CONCEPT MAPPING
Concept mapping is a procedure used to find the concept behind a sentence using semantic relations such as identity, synonymy, hypernymy and meronymy. A hypernym is a word that denotes a general or parent category, whereas a meronym is a word that denotes a constituent part or a member of something.

3.1.4 SUMMARIZATION
Individual relations are assigned initial weights. In the proposed method, the concept weight is introduced to prioritize each sentence. Identity and synonym relations are considered to carry more weight, whereas hypernym and meronym relations carry less. Initially, identity and synonyms are assigned the weight 1. The term representing the concept is known as the root word and is assigned the weight 1. Hypernyms are assigned a weight reduced by 0.1, level by level, in the backward direction from the root. Meronyms are assigned a weight reduced by 0.01 from the weight of the root, level by level, in the forward direction until the end of the tree.

The procedure is as follows:
(a) Extract the sentences from the input document.
(b) Apply POS tagging.
(c) Extract the concept mapping from the ontology.
(d) Assign the weight for each noun based on the semantic relation.
(e) Compute the concept weight for each sentence using equation (3.1),
where
w - weight of the particular concept
freq_i - frequency of the particular relation
weight_i - semantic weight assumed for a particular relation
N - number of occurrences of all concepts in the document
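The weighting procedure above can be sketched in code. The body of equation (3.1) is not reproduced in the text, so this sketch assumes one plausible reading of the symbol definitions, namely w = (sum over relations i of freq_i x weight_i) / N. That form, the function names and the toy counts are illustrative assumptions, not the authors' implementation; only the relation weights (1 for identity and synonym, minus 0.1 per hypernym level, minus 0.01 per meronym level) come from the description.

```python
from collections import Counter

ROOT_WEIGHT = 1.0      # identity and synonym relations, as stated in the text
HYPERNYM_STEP = 0.1    # reduction per level, backward from the root
MERONYM_STEP = 0.01    # reduction per level, forward from the root


def relation_weight(relation, level=0):
    """Semantic weight of a relation found `level` steps away from the root term."""
    if relation in ("identity", "synonym"):
        return ROOT_WEIGHT
    if relation == "hypernym":
        return max(ROOT_WEIGHT - HYPERNYM_STEP * level, 0.0)
    if relation == "meronym":
        return max(ROOT_WEIGHT - MERONYM_STEP * level, 0.0)
    raise ValueError(f"unknown relation: {relation}")


def concept_weight(matches, total_occurrences):
    """Assumed form of Eq. (3.1): sum(freq_i * weight_i) / N.

    `matches` holds one (relation, level) pair per term occurrence that maps
    to the concept; `total_occurrences` is N, the number of occurrences of
    all concepts in the document.
    """
    if total_occurrences <= 0:
        raise ValueError("total_occurrences must be positive")
    freq = Counter(matches)  # freq_i for each distinct (relation, level)
    weighted = sum(
        count * relation_weight(relation, level)
        for (relation, level), count in freq.items()
    )
    return weighted / total_occurrences


# Hypothetical counts for one concept in one document.
matches = (
    [("identity", 0)] * 3
    + [("synonym", 0)] * 2
    + [("hypernym", 1)] * 2
    + [("meronym", 2)]
)
score = concept_weight(matches, total_occurrences=10)
print(round(score, 3))  # 0.778
```

Under this reading, sentences could then be ranked by the weights of the concepts they contain; how the per-concept weights are aggregated per sentence is not specified in the text and is left open here.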

3.1.5 IMPLEMENTATION
The documents are collected from the PubMed repository for experimentation, covering the diseases dengue, typhoid and jaundice. The documents are preprocessed by applying sentence extraction and POS tagging, and the concept weights are computed for all the sentences in the input documents. The efficiency of this concept-based summarization is assessed through clustering with the K-Means, Hierarchical-single and Hierarchical-complete algorithms. Similarity measures such as Euclidean distance and Pearson correlation are used to find the optimal number of clusters. The quality of the clusters is compared with the traditional tf-idf method, which is computed using equation (3.2). For the computation of the tf-idf weights of the documents, tokenization and stop-word removal are performed during the preprocessing stage.

w_{ij} = tf_{ij} \times \log(N / df_i)    (3.2)

where tf_{ij} is the number of occurrences of term i in document j, df_i is the frequency of term i in the collection of documents and N is the total number of documents in the collection.

PERFORMANCE EVALUATION

SILHOUETTE CLUSTERING: Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If many objects have a high value, then the clustering configuration is appropriate. Objects with a high silhouette value are considered well grouped; objects with a low value may be outliers. This index suits k-means clustering well, and is also used to determine the optimal number of clusters.

Davies-Bouldin Criterion: The Davies-Bouldin (DB) index is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, in which the validation of how well the clustering has been done uses quantities and features inherent to the dataset. The DB index depends on the ratio of within-cluster to between-cluster distances. For each cluster i, the maximum value of this ratio denotes the worst-case within-to-between cluster ratio. The optimal clustering solution has the minimum DB index value.

Table 3.1: Performance measures based on Euclidean distance (columns: clustering algorithm, number of clusters, FM, Silhouette, DB index; rows: K-means, Hierarchical single, Hierarchical complete)
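The Silhouette and Davies-Bouldin indices described above can be computed directly from their standard definitions. The following is a minimal pure-Python sketch on made-up 2-D points with Euclidean distance; the function names and the data are hypothetical, and it is not the evaluation code behind Table 3.1. It only illustrates that a well-separated clustering scores a higher silhouette and a lower DB index than a poorly separated one.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def silhouette(clusters):
    """Mean silhouette over all points; needs at least two non-empty clusters."""
    scores = []
    for i, cluster in enumerate(clusters):
        for idx, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(p, q) for j, q in enumerate(cluster) if j != idx)
            a /= len(cluster) - 1
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for k, other in enumerate(clusters)
                if k != i
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(clusters):
    """Mean over clusters of the worst within-to-between ratio (lower is better)."""
    cents = [centroid(c) for c in clusters]
    # scatter: mean distance of each cluster's points to its own centroid
    scatter = [
        sum(math.dist(p, cents[i]) for p in c) / len(c)
        for i, c in enumerate(clusters)
    ]
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(len(clusters))
            if j != i
        )
        for i in range(len(clusters))
    ]
    return sum(worst) / len(worst)


# Hypothetical data: the same six points grouped well and grouped badly.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

print(silhouette(good) > silhouette(bad))          # True
print(davies_bouldin(good) < davies_bouldin(bad))  # True
```

In practice the same comparison would be repeated for each candidate number of clusters k, keeping the k with the highest silhouette or the lowest DB index.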

In K-Means, the optimal clustering for the FM index is obtained when k=2. The highest value of the Silhouette index is attained at k=3, whereas the lowest value of the DB index is at k=3. For the Hierarchical single and Hierarchical complete algorithms, the best clustering for the FM index is obtained when k=10, while the DB index and the Silhouette index both attain their optimum at k=2. Here, k denotes the number of clusters. In Hierarchical single the optimal clusters are obtained at k=2. Hence, the Hierarchical single algorithm outperforms the K-Means and Hierarchical complete algorithms.

Figure: 3.1 Euclidean distance based FM index

When k=3, the Silhouette index attains its optimal value.

Figure: 3.2 Euclidean distance based Silhouette index

When k=7, the DB index attains its optimal value.

Figure: 3.3 Euclidean distance based DB index

The optimal clustering solution has the lowest FM index and Davies-Bouldin index value, whereas the Silhouette index has the maximum value.

CONCLUSION
Document clustering is mainly used for better document extraction and text mining. From the literature, it is observed that summarization can be applied before clustering so that the dimensionality of the documents is reduced and clustering quality is improved. Hence a concept-based text document summarization method is proposed in this work. The proposed method uses a concept-based weighting scheme which computes the importance of the underlying text by converting the documents into a bag of concepts. The method summarizes the text document using a statistical approach. The ontology is used as background knowledge which helps in finding the related meanings for the terms occurring in the source documents. The method is designed to improve the quality of searching, based on a summarization model for MeSH (Medical Subject Headings) labels.
