Review on Text Mining

Similar documents
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Text Mining Research: A Survey

A Study on K-Means Clustering in Text Mining Using Python

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Comparative Study on Classification Meta Algorithms

Overview of Web Mining Techniques and its Application towards Web

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Text Mining: A Burgeoning technology for knowledge extraction

CS6220: DATA MINING TECHNIQUES

Correlation Based Feature Selection with Irrelevant Feature Removal

Introduction to Text Mining. Hongning Wang

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

A Survey On Data Mining Algorithm

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Introduction to Data Mining and Data Analytics

Iteration Reduction K Means Clustering Algorithm

A Survey On Different Text Clustering Techniques For Patent Analysis

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Machine Learning using MapReduce

Mining Social Media Users Interest

Filtering of Unstructured Text

Techniques for Mining Text Documents

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

String Vector based KNN for Text Categorization

Encoding Words into String Vectors for Word Categorization

International Journal of Advanced Research in Computer Science and Software Engineering

Domain-specific Concept-based Information Retrieval System

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

A THREE LAYERED MODEL TO PERFORM CHARACTER RECOGNITION FOR NOISY IMAGES

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

D B M G Data Base and Data Mining Group of Politecnico di Torino

Keyword Extraction by KNN considering Similarity among Features

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online):

Pre-Requisites: CS2510. NU Core Designations: AD

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Domestic electricity consumption analysis using data mining techniques

Analyzing Outlier Detection Techniques with Hybrid Method

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

Application of Text Mining in Effective Document Analysis: Advantages, Challenges, Techniques and Tools

ECG782: Multidimensional Digital Signal Processing

Parametric Comparisons of Classification Techniques in Data Mining Applications

Overview. Data Mining for Business Intelligence. Shmueli, Patel & Bruce

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique

Performance Analysis of Data Mining Classification Techniques

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

An Improved Document Clustering Approach Using Weighted K-Means Algorithm

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

Enhanced Bug Detection by Data Mining Techniques

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

On-line glossary compilation

International Journal Of Engineering And Computer Science ISSN: Volume 5 Issue 11 Nov. 2016, Page No.

Chapter 2 BACKGROUND OF WEB MINING

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Ontology Based Search Engine

A Performance Assessment on Various Data mining Tool Using Support Vector Machine

KEYWORD EXTRACTION FROM DESKTOP USING TEXT MINING TECHNIQUES

Image enhancement for face recognition using color segmentation and Edge detection algorithm

A study of classification algorithms using Rapidminer

TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran

Web Mining TEAM 8. Professor Anita Wasilewska CSE 634 Data Mining

Data Analytics Framework and Methodology for WhatsApp Chats

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Collective Intelligence in Action

Detection and Extraction of Events from s

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA)

A Survey Paper on Text Mining - Techniques, Applications And Issues

A NEW HYBRID APPROACH FOR NETWORK TRAFFIC CLASSIFICATION USING SVM AND NAÏVE BAYES ALGORITHM

Structured Learning. Jun Zhu

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

A SURVEY- WEB MINING TOOLS AND TECHNIQUE

Redefining and Enhancing K-means Algorithm

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

9. Conclusions. 9.1 Definition KDD

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

Contents. Preface to the Second Edition

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining in Bioinformatics: Study & Survey

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

Dynamic Clustering of Data with Modified K-Means Algorithm

Comparative Study of Clustering Algorithms using R

Knowledge Engineering with Semantic Web Technologies

Intrusion Detection Using Data Mining Technique (Classification)

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

Modelling Structures in Data Mining Techniques

Web Information Retrieval using WordNet

Data mining techniques for actuaries: an overview

CHAPTER 1 INTRODUCTION

Transcription:

Review on Text Mining Aarushi Rai #1, Aarush Gupta *2, Jabanjalin Hilda J. #3 #1 School of Computer Science and Engineering, VIT University, Tamil Nadu - India #2 School of Computer Science and Engineering, VIT University, Tamil Nadu - India. #3 Associate Professor, School of Computer Science and Engineering, VIT University, Tamil Nadu - India ABSTRACT: Text data mining is the process of deriving high quality information from text, it is a relatively novel area of computer science and the use of it has grown as the unstructured data available continues to increase exponentially. In the present day there has been a surge in the amount of textual information available on the internet. Text mining has applications in several areas such as customer care service, fraud detection, contextual advertisement, healthcare etc. KEYWORDS: data, text mining, classification, clustering INTRODUCTION: Data mining is the method to find meaningful answers or patterns from a large dataset.the primary goal of data mining is to fetch information from a database and transform it into a data set which is meaningful to the user. The user can then decide how he/she wishes to use the data to find answers to various questions. Data mining is still a very evolving technology. Using the techniques and concepts of data mining on a database consisting of text refers as text mining. Data mining can be used to solve a lot of business problems. Some of those business problems could be like answering questions such as, How should one shop place the items in order to increase sales or How should business advertise in order to target areas better. Using text mining in order to solve such problems is referred as text analytics. Text mining can allow organizations to obtain valuable business insights of the market from text-based sources such as word documents, emails, pdfs and various social media platforms. A recent study showed that almost 80% of a company s information is stored in the form of text. Since the amount of information on text is so high, text mining is believed to have a huge scope. Text mining is a complex task since most of the text data is unstructured data, which requires cleaning in order to get good results. Text mining involves a lot of steps within it such as information retrieval, clustering, visualization, machine learning, data mining etc. TEXT DATA The text data can be stored in various formats such as word documents, pdfs, social media etc. It can mainly be categorized into two types: (i) Structured Data (ii) Unstructured Data STRUCTURED DATA Structured data is the format of data that can be loaded spreadsheet. Structured text data can be stored in the form of rows and columns. Structured text data that is present in a spreadsheet may or may not have each cell filled. Such kind of data is consistent, uniform and data mining friendly i.e. after applying various text mining techniques on it, it is expected that more accurate and desired results will be generated. UNSTRUCTURED DATA Unstructured data is generally present in the form of word documents, html web pages, pdf documents. Unstructured data is inconsistent and isn t data mining friendly i.e. it won t produce expected results. Structured data is often converted into a semi-structured data first to perform data mining techniques. Semi-structure data is mostly in XML format. STAGES OF TEXT MINING INFORMATION RETRIEVAL Information retrieval (IR) is the primary step in the process of text mining. IR is the process of extracting relevant information from all the sources. It involves the use of a query written by the user to collect all the documents that have relevant information. One of the example of IR would be search engines such as Google. Search engines run the query to match all the relevant documents present on World Wide Web to the keywords entered by the user. NATURAL LANGUAGE PROCESSING Human language for a computer can be difficult to process due to the subtle nuances. For example, synonyms are words which are spelled same but have different meaning. The meaning of the sentence can only be inferred through the context in which it is used. Natural Language Processing (NLP) is a complex task and is a part of the artificial intelligence domain. It refers ISSN: 2231-2803 http://www.ijcttjournal.org Page 159

to the inference of the human language by computers. Shallow parsers categorize only the major grammatical components in a sentence like nouns, adjectives and verbs. On the other end, deep parsers build an extensive characterization of the grammatical structure of a sentence. NLP is the domain of text mining is responsible for transferring data in the form of linguistic data to the information extraction phase. INFORMATION EXTRACTION After linguistic data is produced by natural language processing it is transferred to the information extraction(ie) stage. IE stage is responsible for structuring the linguistic data. The data after being structured in this stage is transferred to the data mining stage where the text analysis occurs. DATA MINING Data mining is a method that helps to make sense of a huge amount of information. It is used to find meaningful patters and answers from large data sets. When used in text mining, the techniques or algorithms of data mining are applied on the structured data processed from the information extraction stage. The results obtained from the data mining stage are stored into another database which can be accessed by end-user queries. FRAMEWORK OF TEXT MINING Text mining can broadly be visualized in two phases: (i) Text Refining In the text refining stage, the input is data in the form of text and the output is an intermediate form. Intermediate form can be either semistructured/structured. (ii) Knowledge Distillation In the knowledge distillation phase the input is the intermediate form generated in the text refining phase and the output are the different patterns mined from the data. Figure i Framework of text mining TEXT MINING ALGORITHMS The text mining algorithms can be broadly classified into three categories: (i) classification (ii) categorization (iii) clustering K-MEANS CLUSTERING ALGORITHM Clustering is the method of dividing a group of data points into a small number of clusters. K-Means clustering is an unsupervised learning algorithm that is used to partition data set into k groups. Each group or cluster has a cluster centroid. The main goal of the k- means algorithm is to reduce the total sum of the squared distance of every point in the data set to its corresponding cluster centroid. Given a set of observations (x1, x2,, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k ( n) sets S = {S1, S2,, Sk} so as to minimize the within-cluster sum of squares where µi is the mean of points in Si. ISSN: 2231-2803 http://www.ijcttjournal.org Page 160

Advantages to Using this Technique K-Means clustering algorithm has an advantage over other clustering methods such as hierarchical clustering when there is a large number of variables. K-means will be faster in computation. Also, K-Means produce tighter clusters if the cluster are global. Disadvantages to Using this Technique The main disadvantage of using K-Means clustering is being unable to compare the quality of the clusters generated. K-means algorithm doesn t work well when it comes to non-globular clusters. Also, different iterations of initial partition will eventually end up in an output of different results. APPLICATIONS OF K-MEANS CLUSTERING Unsupervised learning of neural networks Machine Vision Pattern recognition Artificial intelligence Classification analysis Image Processing Figure ii Flow of K-Means NAIVES BAYES Naïve Bayes classifiers are linear classifiers. The term naïve refers to the assumption that the features in the database are mutually independent and Bayes comes from the theoretical use of Bayes theorem. In real time scenarios, the assumption of two independences are disobeyed. However, Naïve Bayes classifiers works better than most of the alternatives for a small sample size. The Naiver Bayes classifiers seem to find a lot of lot of application in various fields since they are relatively robust, easy to implements, fast and accurate. LINEAR SUPPORT VECTOR MACHINE LEARNING CLASSIFICATION Support vector machine learning (SVM) is a very modern algorithm used for regression and classification. ISSN: 2231-2803 http://www.ijcttjournal.org Page 161

SVM has a procedure for optimizing predictive accuracy. SVM forwards the input data into a kernel space and later builds a linear model within this kernel space that is automatically chosen by SVM. SVM executes well with real world applications such as text classification, recognition of hand written symbols, image classification etc. For the purpose of machine learning and data mining SVM is the most widely used tool. SVM works best with sparse data and produces wonderful results with high-dimensional challenging data. SVM makes use of active learning. Active learning refers to the fact that as the size of the SVM model increases so does the size of the training sets. TEXT MINING PROCESS FLOW Figure iii Process flow of Text Mining LIMITATIONS OF TEXT MINING The main difficulties faced during text mining is that the text data consists of language which is ambiguous, Misspellings, abbreviations and spelling variants can also cause difficulties and error in finding the correct result. Text mining is the process of finding patterns and answers from data. These data are already present and have these patterns hidden in them. We use the process of data mining or text mining to uncover these hidden patters. All in all, text mining doesn t generate any new facts. The limitations in text mining occurs due to unavailability of structured clean data. Apart from that since natural language processing is still an evolving domain, text mining is unable to generate more accurate results. This happens because the human language is difficult. The text data generated from it is ambiguous and contains some words can have the same spelling but different meaning(homographs) or different words that can have the same meaning (synonyms). ISSN: 2231-2803 http://www.ijcttjournal.org Page 162

ACKNOWLEDGEMENT: The authors are grateful to the authorities of School of Computer Science and Engineering of VIT University, Vellore. REFERENCES: 1. A Comparison Of Document Clustering Techniques: Michael Steinback, George Karypis and Vipin Kumar 2. A tutorial review on Text Mining Algorithms: Mrs. Sayantani Ghosh, Mr. Sudipta Roy and Prof. Samir K. Bandyopadhyay 3. A Survey of Text Mining Techniques and Applications: Vishal Gupta and Gurpreet S. Lehal 4. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, LS-8 Report 23, Thorsten Joachims 5. A review on various text mining techniques and algorithms: R.Balamurugan, Dr. S.Pushpa 6. Text Mining Process, Techniques and Tools: an Overview, Vidhya. K.A, G.Aghila 7. Text Mining: The state of the art and the challenges 8. https://www.experfy.com/blog/k-means-clustering-in-text-data 9. https://web.stanford.edu/class/cs124/lec/naivebayes.pdf 10. https://www.slideshare.net/treparel/support-vector-machinessvm-text-analytics-algorithm-introduction-2012 ISSN: 2231-2803 http://www.ijcttjournal.org Page 163