Text Mining: A Burgeoning technology for knowledge extraction

Anshika Singh (1), Dr. Udayan Ghosh (2)
(1) HCL Technologies Ltd., Noida
(2) University School of Information & Communication Technology, Dwarka, Delhi
anshika1807@gmail.com, g_udayan@lycos.com

ABSTRACT

With the dramatic growth of textual information on the Internet and in databases, there is an increasing need for systems that can automatically discover useful knowledge from text. Text mining is the process of applying automatic methods to analyze and structure textual data in order to create usable knowledge from previously unstructured information. This paper describes text mining and its techniques, as well as its role and applications in various areas.

Keywords: Text mining, Data mining, Classification, Clustering, Text analysis

1. INTRODUCTION

Text mining is the discovery by computer of new, previously unknown information, by automatically extracting it from a usually large number of different unstructured textual resources [1]. In short, it is the discovery of interesting knowledge in text documents. Text mining is a variation on the field of data mining [2], which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text.

Text mining [1] is similar to data mining, except that data mining tools [2] are designed to handle structured data from databases, whereas text mining can work with unstructured or semi-structured data sets such as e-mails, full-text documents and HTML files. As a result, text mining is a better fit for companies whose information is held mostly as text. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless, text is currently the most common medium for the formal exchange of information. The motivation for trying to extract information from it is compelling, even if success is only partial.

Miller [3] describes text mining as "the automated or partially automated processing of text". He characterizes text mining with a process model explaining the components and interacting steps specific to texts. Figure 1 depicts such a process model.

Fig. 1: A process model for text mining.

At the beginning there is the raw text input, denoted as the text corpus, which represents a collection of text documents such as memos, reports, or publications. Grammatical parsing and preprocessing steps transform the unstructured text corpus into a semi-structured format denoted as a text database. Subsequently a structured representation is created by computing a document-term matrix from either the text corpus or the text database. The document-term matrix is a bag-of-words representation that holds the frequency of each term in each document of the corpus. This common data structure forms the basis for further text mining analysis, such as text classification, syntax analysis, relationship identification, information extraction and retrieval, and document summarization.

2. TEXT MINING, ITS TECHNIQUES AND OTHER TECHNOLOGIES
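As a concrete starting point for the techniques in this section, the document-term matrix from the process model of Fig. 1 can be sketched in a few lines. This is a minimal illustration only, not the procedure used by Miller [3]: the tokenizer (lowercasing and keeping alphabetic runs), the function names `tokenize` and `document_term_matrix`, and the two-sentence corpus are all assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens
    (an assumed, deliberately naive pre-processing step)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(corpus):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in corpus[i]."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    # The vocabulary is the sorted union of all terms seen in the corpus.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical two-document corpus used only for illustration.
corpus = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in databases.",
]
vocabulary, matrix = document_term_matrix(corpus)
for doc, row in zip(corpus, matrix):
    print(row, doc)
```

Each row is one document and each column one term, so word order is discarded; that is exactly the bag-of-words simplification the matrix relies on.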

Text mining is about analysing unstructured information and extracting relevant patterns and characteristics. Information can be extracted to derive summaries of the words contained in the documents, or to compute summaries of the documents based on the words they contain. We can analyse words and clusters of words used in documents, or analyse the documents themselves and determine similarities between them. The primary goal of text mining is to identify useful information, without duplication, across documents that express the same ideas in different words. Text mining is an empirical tool capable of identifying new information that is not apparent from a document collection.

The text mining process includes the following steps:

1) Text pre-processing: syntactic and/or semantic analysis of the text.
2) Text transformation: attribute generation. The two main approaches to document representation are the bag of words and the vector space model.
3) Text representation: selecting a subset of features to represent a document, further reducing dimensionality and removing irrelevant features, e.g. by sampling or statistics.
4) Data mining: application of classical data mining techniques, e.g. classification and clustering. This stage is purely application dependent.
5) Interpretation/evaluation: analysing the results.

Text mining can be defined as a subfield of data mining if data mining techniques are used to discover patterns or information in textual data. It also inherently requires techniques from the fields of information retrieval, data mining and computational linguistics (Bolasco et al. 2002), as shown in Fig. 2. Text mining techniques are also aimed at providing business intelligence solutions that help companies remain competitive in the market.

Fig. 2: Text Mining as an Interdisciplinary Field (text mining at the intersection of data mining, computational linguistics and information retrieval).

Information Extraction

Information Extraction (IE) is the process of automatically obtaining structured data from an unstructured natural language document. Often this involves defining the general form of the information we are interested in as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems. An IE system performs three tasks: term analysis, which identifies the terms in a document, where a term may consist of one or more words; named-entity recognition, which identifies the names in a document, such as the names of people or organizations, dates, expressions of time, and quantities with their associated units; and fact extraction, which identifies and extracts complex facts from documents, such as relationships between entities or events. Since IE addresses the problem of transforming a corpus of textual documents into a more structured database, the database constructed by an IE module can be passed to the knowledge module for further mining of knowledge, as illustrated in Fig. 3. Hence IE plays an obvious role in text mining.

Fig. 3: Overview of IE-based text mining framework.

Information Retrieval

Information Retrieval (IR) is defined as the methods used for the representation, storage and access of information items (Joachims 1998), where the information handled is mostly in the form of textual documents, newspapers, research papers and books that are retrieved from databases according to user requests or queries. IR systems identify the documents in a collection that match a user's query and are therefore of interest to the user. A text mining process differs from information retrieval in that it yields knowledge that is new, potentially useful and ultimately understandable, obtained by applying data mining techniques. The best-known IR systems are search engines such as Google, which identify those documents on the World Wide Web that are relevant to a given set of words. IR systems allow us to narrow down the set of documents that are relevant to a particular problem. As text mining involves applying very computationally intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents to be analysed.

Computational Linguistics

Computational linguistics computes statistics over large text collections in order to discover useful patterns, which are used to inform algorithms for various subproblems within NLP, e.g. part-of-speech tagging and word sense disambiguation [Armstrong 1994]. Text mining shares methods with natural language processing in order to deal with the textual information hidden in natural-language text databases. The ability to handle information of this type and make it understandable to the computer lies at the core of text mining efforts. Work continues on making computers understand human natural language, but efficient methods for processing this kind of information and extracting useful knowledge patterns are not yet available. Text mining techniques can therefore offer benefits in processing natural language information with speed and accuracy (Gao et al. 2005). The patterns generated by these methods and techniques are analyzed by the computer to generate information that can be processed further by applying data mining algorithms.
These algorithms can help to discover useful patterns for part-of-speech tagging, word sense disambiguation or the creation of bilingual dictionaries.

Pattern Recognition

Pattern recognition is the process of searching for predefined sequences in text. In a text mining scenario it is treated as a process of matching patterns using words as well as morphological and syntactic properties. The two main methods of pattern recognition are word or term matching and relevancy signatures. Word and term matching methods are easier to implement but require manual effort, whereas relevancy signatures build on techniques for processing morphological and syntactic information.

Text Categorization

In text mining, categorization refers to the process of grouping related concepts, themes, or common threads. The process is data-driven and iterative. The main topic of a document is identified by placing the document into a pre-defined set of topics. Categorization treats the whole document as a set of words, or bag of words, from which information is extracted on the basis of word counts; relationships are identified by looking at broader and narrower terms and at synonyms. Documents are then ranked by how much of their content concerns a particular topic.

Text Clustering

Clustering [4] is a technique used to group similar documents, but it differs from categorization in that documents are clustered on the fly instead of through the use of predefined topics. The method groups documents so that those within a cluster are strongly similar to one another and dissimilar to the documents outside the cluster. A basic clustering algorithm creates a vector of topics for each document and measures the weights of how well the document fits into each cluster.
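The grouping step can be illustrated with a small sketch. It is not the algorithm of [4]: plain k-means over term-frequency vectors with cosine similarity is used here as a simple stand-in, each document is hard-assigned to its most similar cluster rather than given a weight per cluster, and the first k documents seed the centroids purely to keep the example deterministic. The function names and the four sample documents are hypothetical.

```python
from collections import Counter
import math
import re


def tf_vector(text, vocabulary):
    """Term-frequency vector of a text over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[term]) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iterations=10):
    """Cluster vectors by cosine similarity and return one label per vector.
    The first k vectors seed the centroids (an assumption of this sketch)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical documents: two about markets, two about football.
docs = [
    "stock market prices fall",
    "football match score goal",
    "market prices and stock trading",
    "goal scored in football match",
]
vocabulary = sorted({w for d in docs for w in re.findall(r"[a-z]+", d.lower())})
vectors = [tf_vector(d, vocabulary) for d in docs]
labels = kmeans(vectors, k=2)
print(labels)
```

No topic labels are supplied anywhere: the two groups emerge only from shared vocabulary, which is the sense in which documents are clustered on the fly.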
Clustering is useful for organising thousands of documents in industrial or organisational information management systems. Categorization, like clustering, rests on identifying similarities among documents. In the improved K-means clustering algorithm of [5], the similarity between text documents is computed not only from the feature vector (termed the eigenvector in [5]) built from term-frequency statistics, but also from the degree of association between words. Because the relationship between keywords is taken into account, the method is less sensitive to input order and term frequency and, to a certain extent, reflects semantic understanding. This raises the similarity accuracy for short texts and simple sentences, as well as the precision and recall of the clustering result.

Text Summarization

Text summarization reduces the content of a document and makes it readable to others whilst retaining the sense of the topic discussed in it. In practice, humans read through a text, understand its meaning, and mention or highlight the main topics or points discussed. Computers lack this capability of understanding text, so methods such as sentence extraction are used to find the useful information by statistical weighting. These methods identify the key information, in terms of phrases, that defines the main theme of the text.

Text Visualization

Trend analysis and association analysis are used to find trends or predict future patterns from time-dependent data and to associate these patterns with other extracted patterns. Text visualization represents the extracted features with respect to the key terms and helps to identify the main topics or concepts by their degree of importance in the representation. It also makes it easy to locate documents in a graphical representation.

3. BENEFITS OF TEXT MINING

A significant benefit of text mining is the efficient analysis of existing knowledge in vast collections of documents, which cuts down the time spent on ensuring that the whole material is covered. Text mining unlocks hidden information and extracts new knowledge. The result is an improved research process and quality, leading to cost savings and productivity gains.

4. APPLICATIONS OF TEXT MINING

In knowledge management and HR, text mining techniques are used for decision support and competitive intelligence: data about the company and its competitors is read automatically and only the relevant information is selected. They are also used to manage human resources strategically, mainly through applications that analyse staff's opinions, monitor the level of employee satisfaction, and read and store CVs for the selection of new personnel.

Extraction, Transformation and Loading (ETL) sorts unstructured textual material into categories and structured fields. The search engines usually associated with ETL offer conceptual browsing and querying in natural language. Applications are found in the editorial sector, in the juridical and political document field, and in medical healthcare.

In customer relationship management, text mining is used to analyse competitors and/or monitor customers' opinions in order to identify new potential customers, and to determine the company's image through the analysis of press reviews and other relevant sources.

In technology watch, it analyses the characteristics of existing technologies and identifies emerging ones in terms of their potential, their fields of application and their relationships with existing technology.

In Natural Language Processing (NLP), text mining is used in the construction of websites that support natural-language question answering. Text mining applications are also used to analyse web pages published in different languages.

5. CONCLUSION

The rapid growth of stored information in almost every area of life has created a great demand for new, powerful tools for turning data into useful knowledge. This problem of information overload is further aggravated by the fact that the majority of the data is unstructured text.
Text represents a vast, rich collection of information, but encodes this information in a form that is difficult to decipher automatically. Text mining, also known as Text Data Mining or KDT, refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. It is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics. As most information (over 80%) is stored as text, text mining is believed to have high commercial potential. Knowledge may be discovered from many sources of information, yet unstructured texts remain the largest readily available source of knowledge.

Some text mining techniques are as follows:

Information Extraction: identifies facts and relations in text. This technique relates the identified people, places and times to one another to provide the user with meaningful information.
Information Retrieval: provides users with the documents that are of interest to them.
Categorization: identifies the similarities between documents. The main topic of a document is identified by placing the document into a pre-defined set of topics. Categorization counts the words that appear and, from the counts, identifies the main topics that the document covers.
Clustering: finds groups of documents with similar content.
Summarization: reduces the length of a document while keeping its main points and overall meaning intact.

Text mining unlocks hidden information and extracts new knowledge. The result is an improved research process and quality, leading to cost savings and productivity gains.

6. REFERENCES

[1] Berry, Michael W. (2004), Survey of Text Mining: Clustering, Classification and Retrieval, Springer-Verlag New York, LLC.
[2] Navathe, Shamkant B. and Elmasri, Ramez (2000), Fundamentals of Database Systems, Pearson Education Pvt Inc., Singapore.
[3] Miller, Thomas W. (2005), Data and Text Mining: A Business Applications Approach, Pearson Edition.
[4] Liritano S. and Ruffolo M. (2001), Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining, IEEE, Italy.
[5] Ming Zhao, Jianli Wang and Guanjun Fan (2008), Research on Application of Improved Text Cluster Algorithm in Intelligent QA System, Proceedings of the Second International Conference on Genetic and Evolutionary Computing, China, IEEE Computer Society.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
[7] Zaïane, O.-R., Principles of Knowledge Discovery in Databases, University of Alberta.
