Text Mining: A Burgeoning technology for knowledge extraction

1 Anshika Singh, 2 Dr. Udayan Ghosh
1 HCL Technologies Ltd., Noida; 2 University School of Information & Communication Technology, Dwarka, Delhi
Email: anshika1807@gmail.com, g_udayan@lycos.com

ABSTRACT
With the dramatic growth of textual information on the Internet and in databases, there is an increasing need for systems that can automatically discover useful knowledge from text. Text mining is the process of applying automatic methods to analyze and structure textual data in order to create usable knowledge from previously unstructured information. This paper describes text mining and its techniques, as well as its role and applications in various areas.

Keywords: Text mining, Data mining, Classification, Clustering, Text analysis

1. INTRODUCTION
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources [1]. It is the discovery of interesting knowledge in text documents, and a variation on a field called data mining [2], which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text.

Text mining [1] is similar to data mining, except that data mining tools [2] are designed to handle structured data from databases, whereas text mining can work with unstructured or semi-structured data sets such as emails, full-text documents and HTML files. As a result, text mining is often the better fit for companies that hold much of their information in such documents. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless, text is currently the most common medium for the formal exchange of information. The motivation for trying to extract information from it is compelling, even if success is only partial.

Miller [3] describes text mining as "the automated or partially automated processing of text". He characterizes text mining with a process model explaining the components and interacting steps specific to texts. Figure 1 depicts such a process model.

Fig. 1: A process model for text mining.

At the beginning there is the raw text input, denoted as the text corpus, which represents a collection of text documents such as memos, reports, or publications. Grammatical parsing and preprocessing steps transform the unstructured text corpus into a semi-structured format denoted as a text database. Subsequently a structured representation is created by computing a document-term matrix from either the text corpus or the text database. The document-term matrix is a bag-of-words representation containing the term frequencies for all documents in the corpus. This common data structure forms the basis for further text mining analysis, such as text classification, syntax analysis, relationship identification, information extraction and retrieval, and document summarization.

2. TEXT MINING, ITS TECHNIQUES AND OTHER TECHNOLOGIES
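The document-term matrix introduced with Fig. 1, which is also the bag-of-words representation used by the techniques in this section, can be illustrated with a minimal sketch. The tokenizer (lowercasing and keeping alphabetic runs), the function names and the two-document corpus are assumptions made for illustration; the paper does not prescribe a particular implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(corpus):
    """Return (vocabulary, matrix), where matrix[i][j] is the count of vocabulary[j] in corpus[i]."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical two-document corpus.
corpus = [
    "Text mining extracts knowledge from text.",
    "Data mining finds patterns in databases.",
]
vocabulary, matrix = document_term_matrix(corpus)
for doc_id, row in enumerate(matrix):
    print(doc_id, dict(zip(vocabulary, row)))
```

Each row is one document and each column one term, so word order is discarded and only frequencies remain, which is what makes the representation a bag of words.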

Text mining is about analysing unstructured information and extracting relevant patterns and characteristics. Information can be extracted about the words contained in the documents, or summaries of the documents can be computed from the words they contain. We can analyse the words and clusters of words used in documents, and determine similarities between documents. The primary goal of text mining is to identify useful information, without duplication, from various documents that express the same meaning in different words. Text mining is an empirical tool capable of identifying new information that is not apparent from a document collection.

The text mining process includes the following steps:
1) Text pre-processing: syntactic/semantic analysis of the text.
2) Text transformation: attribute generation. The two main approaches to document representation are bag of words and vector space.
3) Text representation: selecting a subset of features to represent a document, with further reduction of dimensionality and removal of irrelevant features, e.g. by sampling or statistics.
4) Data mining: application of classical data mining techniques, e.g. classification and clustering. This stage is purely application dependent.
5) Interpretation/evaluation: analysing the results.

Fig. 2: Text mining as an interdisciplinary field (data mining, computational linguistics, information retrieval).

Text mining can be defined as a subfield of data mining if data mining techniques are used to discover patterns or information from textual data. It also inherently requires techniques from the fields of information retrieval, data mining and computational linguistics (Bolasco et al. 2002), as shown in Fig. 2. Text mining techniques are also aimed at providing business intelligence solutions that help companies remain competitive in the market.

Information Extraction
Information Extraction (IE) is the process of automatically obtaining structured data from an unstructured natural language document. Often this involves defining the general form of the information of interest as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems. An IE system performs:
- term analysis, which identifies the terms in a document, where a term may consist of one or more words;
- named-entity recognition, which identifies the names in a document, such as the names of people or organizations, dates, expressions of time, quantities and associated units;
- fact extraction, which identifies and extracts complex facts from documents, such as relationships between entities or events.
Since information extraction addresses the problem of transforming a corpus of textual documents into a more structured database, the database constructed by an IE module can be provided to the knowledge module for further mining of knowledge, as illustrated in Fig. 2. Hence information extraction plays an obvious role in text mining.

Fig. 3: Overview of IE-based text mining framework.

Information Retrieval
Information retrieval is defined as the methods used for the representation, storage and accessing of information items (Joachims 1998), where the information handled is mostly in the form of textual documents, newspapers, research papers and books, which are retrieved from

databases according to the user's request or queries. Information Retrieval (IR) systems identify the documents in a collection that match a user's query and are therefore of interest to the user. A text mining process differs from information retrieval in that it identifies knowledge, as a consequence of applying data mining techniques, that is new, potentially useful and ultimately understandable. The best-known IR systems are search engines such as Google, which identify those documents on the World Wide Web that are relevant to a set of given words. IR systems allow us to narrow down the set of documents that are relevant to a particular problem. As text mining involves applying very computationally intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents to be analysed.

Computational Linguistics
Computational linguistics computes statistics over large text collections in order to discover useful patterns, which are used to inform algorithms for various subproblems within NLP, e.g. part-of-speech tagging and word sense disambiguation [Armstrong 1994]. Text mining techniques share methods with natural language processing to deal with the textual information hidden in natural-language text databases. The ability to handle information of this type and make it understandable to the computer lies at the core of text mining efforts. Work is under way to make computers understand human natural language, but efficient methods for processing this type of information and extracting useful knowledge patterns are not yet available. Text mining techniques can therefore offer benefits for processing natural language information with speed and accuracy (Gao et al. 2005). The patterns generated by these methods and techniques are analyzed by the computer to generate information, which can be processed further by applying data mining algorithms.
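As a small illustration of computing statistics over a text collection, the sketch below counts adjacent word pairs (bigrams) across documents; frequently recurring pairs are one simple kind of pattern that later algorithms can use. The tokenization rule, the function name and the two sample sentences are assumptions for illustration and are not taken from the cited works.

```python
import re
from collections import Counter


def bigram_counts(corpus):
    """Count adjacent word pairs across all documents in the corpus."""
    counts = Counter()
    for doc in corpus:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # zip pairs each token with its successor within the same document.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical sample collection.
corpus = [
    "Text mining uses natural language processing.",
    "Natural language processing supports text mining.",
]
counts = bigram_counts(corpus)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Real systems typically normalise such counts (for example against the frequencies of the individual words) before treating a pair as a meaningful pattern; raw counts are used here only to keep the sketch short.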
These algorithms can help to discover useful patterns for part-of-speech tagging, word sense disambiguation or the creation of bilingual dictionaries.

Pattern Recognition
Pattern recognition is the process of searching for predefined sequences in text. In a text mining scenario it is treated as a process of matching patterns using words as well as morphological and syntactic properties. Two different methods for pattern recognition are word or term matching and relevancy signatures. Word and term matching methods are easier to implement but require manual effort, whereas relevancy signatures are based on techniques that process morphological and syntactic information.

Text Categorization
In text mining, categorization refers to the process of grouping related concepts, themes, or common threads. This process is data-driven and iterative. The main topic of a document is identified by placing the document into a pre-defined set of topics. Categorization treats the whole document as a set of words, or bag of words: information is extracted on the basis of word counts, and relationships are identified by considering broader and narrower terms and synonyms. Documents are ranked for a particular topic according to how much of their content concerns it.

Text Clustering
Clustering [4] is a technique used to group similar documents, but it differs from categorization in that documents are clustered on the fly instead of through the use of predefined topics. This method groups documents on the basis of strong similarities within each cluster and dissimilarities to the documents outside the cluster. A basic clustering algorithm creates a vector of topics for each document and measures the weights of how well the document fits into each cluster.
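The idea of representing each document as a vector and assigning it to the best-fitting cluster can be sketched with plain k-means over term-frequency vectors, using cosine similarity as the measure of fit. This is one common choice and is not claimed to be the method of [4]; the whitespace tokenizer, the deterministic initialisation (the first k documents serve as starting centroids) and the four sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_vector(text, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors are the initial centroids (assumed, for determinism)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins the most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical documents on two topics.
docs = [
    "stock market shares",
    "football match goal",
    "market shares trading",
    "goal football team",
]
vocabulary = sorted({word for d in docs for word in d.split()})
vectors = [tf_vector(d, vocabulary) for d in docs]
labels = kmeans(vectors, k=2)
print(labels)
```

No topic labels are supplied in advance: the grouping emerges from the similarity of the vectors alone, which is the difference from categorization noted above.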
This technique is useful for organising thousands of documents in industrial or organisational information management systems. Categorization is the process of identifying the similarities in the documents. In the improved K-means clustering algorithm of [5], the similarity between text documents is computed not only from the feature vector based on term-frequency statistics but also from the degree of association between words. Because the relationship between keywords is taken into account, the method is less sensitive to input order and frequency and, to a certain extent, considers semantics. This effectively raises the similarity accuracy for short texts and simple sentences, as well as the precision and recall of the clustering result.

Text Summarization
Text summarization reduces the content of documents and makes it readable to others while still retaining the sense of the topic discussed. In practice, humans read through the text, understand its meaning, and mention or highlight the main topics or points discussed in it. Computers lack this capability of understanding text, so certain methods or techniques (e.g. sentence extraction) are used to find the useful information by means of statistical weighting. These methods find the key information, in terms of phrases, that defines the main theme of the text.

Text Visualization
Trend analysis and association analysis are used to find trends or predict future patterns based on time-dependent data, and to associate these patterns with other extracted patterns. Text visualization is defined as representing the extracted features with respect to the key terms; it helps to identify the main topics or concepts by their degree of importance in the representation. It is further used to locate documents easily in a graphical representation.

3. BENEFITS OF TEXT MINING
A significant benefit of text mining is the efficient analysis of existing knowledge from a vast collection of documents, which cuts down the time spent on ensuring coverage of the whole material. Text mining unlocks hidden information and extracts new knowledge. The result is an improved research process and quality, leading to cost savings and productivity gains.

4. APPLICATIONS OF TEXT MINING
In the area of knowledge management and HR, text mining techniques are used in decision support and competitive intelligence, selecting only the relevant information by automatically reading data about the company as well as its competitors. They are also used to manage human resources strategically, mainly with applications that analyse staff opinions, monitor the level of employee satisfaction, and read and store CVs for the selection of new personnel.

Extraction, Transformation and Loading (ETL) sorts unstructured textual material into categories and structured fields. Search engines are usually associated with ETL and offer conceptual browsing and natural-language querying. Applications are found in the editorial sector, in the juridical and political document field, and in medical healthcare.

In customer relationship management, text mining is used to study competitors and/or monitor customers' opinions, to identify new potential customers, and to determine the company's image through the analysis of press reviews and other relevant sources.

In the area of technology watch, text mining analyses the characteristics of existing technologies and identifies emerging technologies through their potential, their fields of application and their relationships with existing technology.

In Natural Language Processing (NLP), text mining is used in the construction of websites that support question answering in natural language. Text mining applications are also used to analyse web pages published in different languages.

5. CONCLUSION
The rapid growth of stored information in almost every area of life has created a great demand for new, powerful tools for turning data into useful knowledge. This problem of information overload is further aggravated by the unstructured, textual form of the majority of the data.
Text represents a vast, rich collection of information, but encodes this information in a form that is difficult to decipher automatically. Text mining, also known as Text Data Mining or KDT, refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. As most information (over 80%) is stored as text, text mining is believed to have high commercial potential. Knowledge may be discovered from many sources of information; yet unstructured texts remain the largest readily available source of knowledge.

Some text mining techniques are as follows:
Information Extraction- It identifies facts and relations in text. This technique gives the relationships between all the identified people, places, and times to provide the user with meaningful information.
Information Retrieval- It provides users with the documents that match their interests.
Categorization- Categorization is the process of identifying the similarities in the documents. The main topic of a document is identified by placing the document into a pre-defined set of topics. Categorization counts the words that appear and, from the counts, identifies the main topics that the document covers.
Clustering- Clustering can be used to find groups of documents with similar content.
Summarization- It reduces the length of a document while keeping its main points and overall meaning intact.

Text mining unlocks hidden information and extracts new knowledge. The result is an improved research process and quality, leading to cost savings and productivity gains.

6. REFERENCES
[1] Berry, Michael W. (2004), Survey of Text Mining: Clustering, Classification and Retrieval, Springer Verlag, New York, LLC, 24-43.
[2] Navathe, Shamkant B. and Elmasri, Ramez (2000), Fundamentals of Database Systems, Pearson Education pvt Inc., Singapore.
[3] Miller, Thomas W. (2005), Data and Text Mining: a Business Applications Approach, Pearson Edition.
[4] Liritano, S. and Ruffolo, M. (2001), Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining, IEEE, 454-458, Italy.
[5] Zhao, Ming, Wang, Jianli and Fan, Guanjun (2008), Research on Application of Improved Text Cluster Algorithm in intelligent QA system, Proceedings of the Second International Conference on Genetic and Evolutionary Computing, China, IEEE Computer Society, 463-466.
[6] Han, J. and Kamber, M. (2000), Data Mining: Concepts and Techniques, Morgan Kaufmann.
[7] Zaïane, O.-R. (1999), Principles of Knowledge Discovery in Databases, University of Alberta.