CHAPTER 7 CONCLUSION AND FUTURE WORK


7.1 Conclusion

Data pre-processing is a vital stage of the data mining process, and no single data cleaning technique is applicable to every kind of data. Deduplication and data linkage are important tasks in the pre-processing step of many data mining projects, and data quality must be improved before data is loaded into the data warehouse. Locating approximate duplicates in large databases is therefore a central part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data, improving data quality while supporting any subject-oriented data. Existing data cleaning techniques each implement only a few cleaning methods, although each is effective for some part of the cleaning process: duplicate-elimination tools, for example, are well suited to the elimination step, while similarity-based tools are well suited to computing field and record similarity.

The main contributions of the thesis are outlined below.

7.1.1 Framework design

To overcome these problems, a new framework is designed and implemented that combines all of the techniques into a single data cleaning tool. The framework consists of six elements: selection of attributes, formation of tokens, blocking of records with the best block-token key, token-based similarity computation for the selected attributes, rule-based elimination, and merge. Applying the existing data cleaning techniques in this sequential order makes it possible to build a powerful data cleaning tool.
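To make the sequential structure concrete, the sketch below chains toy stand-ins for the six elements over a small list of dictionary records. The concrete choices (lower-casing values, using the smallest token as the block key, a Jaccard threshold as the elimination rule) are assumptions for illustration, not the algorithms actually implemented in the framework.

```python
# Minimal, self-contained sketch of the six-step pipeline (toy stand-ins,
# not the thesis implementation).

def clean_pipeline(records, key_attrs, threshold=0.75):
    # 1. Selection of attributes: keep only the attributes chosen for cleaning.
    projected = [{a: str(r.get(a, "")).lower() for a in key_attrs} for r in records]

    # 2. Formation of tokens: split every selected field into word tokens.
    token_sets = [set(tok for v in r.values() for tok in v.split()) for r in projected]

    # 3. Blocking: group records sharing the same block-token key
    #    (here, simply the smallest token of the record).
    blocks = {}
    for i, toks in enumerate(token_sets):
        blocks.setdefault(min(toks) if toks else "", []).append(i)

    # 4. Token-based similarity computation and 5. rule-based elimination,
    #    applied only inside each block.
    duplicates = set()
    for members in blocks.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                union = token_sets[i] | token_sets[j]
                sim = len(token_sets[i] & token_sets[j]) / len(union) if union else 0.0
                if sim >= threshold:          # simple elimination rule
                    duplicates.add(j)

    # 6. Merge: keep one representative of every duplicate group.
    return [r for i, r in enumerate(records) if i not in duplicates]

if __name__ == "__main__":
    rows = [{"name": "John Smith", "city": "Chennai"},
            {"name": "Jon Smith",  "city": "Chennai"},
            {"name": "Mary Jones", "city": "Madurai"}]
    print(clean_pipeline(rows, ["name", "city"], threshold=0.5))
```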

7.1.2 Selection of Attributes

A new attribute selection algorithm is implemented and evaluated through extensive experiments. The number of attributes is reduced in order to handle noisy and duplicate data in the subsequent cleaning steps. The experimental results show that this considerable reduction of attributes makes the method a very efficient and simple approach to attribute selection, and that it remains efficient and effective at high dimensionality (thousands of attributes). The algorithm eliminates both irrelevant and redundant attributes, is applicable to any type of data (nominal, numeric, etc.), and handles mixed attribute types smoothly. The quality of its results is confirmed by applying a set of rules. The main purpose of attribute selection for data cleaning is to reduce, in an efficient way, the time taken by the later cleaning steps such as token formation, record similarity computation and elimination.
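As a schematic illustration of removing irrelevant and redundant attributes, the fragment below drops columns that are mostly empty, constant, or exact duplicates of a column already kept. These particular criteria and thresholds are assumptions for the example, not the rules of the proposed attribute selection algorithm.

```python
# Illustrative only: drop "irrelevant" attributes (mostly empty or constant)
# and "redundant" attributes (identical to an attribute already kept).

def select_attributes(records, max_null_ratio=0.5):
    attrs = sorted({a for r in records for a in r})
    kept, seen_columns = [], set()
    n = len(records)
    for a in attrs:
        column = tuple(str(r.get(a, "") or "").strip().lower() for r in records)
        null_ratio = sum(v == "" for v in column) / n
        if null_ratio > max_null_ratio or len(set(column)) <= 1:
            continue                      # irrelevant: mostly empty or constant
        if column in seen_columns:
            continue                      # redundant: duplicate of a kept attribute
        seen_columns.add(column)
        kept.append(a)
    return kept

rows = [{"id": 1, "name": "John Smith", "cc": "IN", "name2": "John Smith"},
        {"id": 2, "name": "Mary Jones", "cc": "IN", "name2": "Mary Jones"}]
print(select_attributes(rows))   # ['id', 'name'] -- 'cc' constant, 'name2' redundant
```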

7.1.3 Formation of Tokens

The token formation algorithm forms smart tokens for data cleaning and is suitable for numeric, alphanumeric and alphabetic data. Three different rules are described, one each for numeric, alphabetic, and alphanumeric tokens. Token-based cleaning removes duplicate data efficiently: comparing tokens takes less time than comparing entire strings, and the attribute selection step reduces the time further. The formed tokens are stored in the LOG table and are later used as the blocking key in the subsequent cleaning steps, so defining the best, smartest token is essential.
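The three rules are not restated in this chapter, so the sketch below is only one plausible reading of how numeric, alphabetic and alphanumeric values might be turned into smart tokens; the specific transformations are assumptions for illustration.

```python
# One plausible reading of the three token rules (an assumption; the thesis
# defines its own rules for numeric, alphabetic and alphanumeric values).
import re

def form_token(value: str) -> str:
    v = value.strip().lower()
    if v.isdigit():                                   # numeric rule: keep the digits
        return v
    if v.replace(" ", "").isalpha():                  # alphabetic rule: sort the words
        return " ".join(sorted(v.split()))
    # alphanumeric rule: separate digit runs and letter runs, then sort them
    parts = re.findall(r"\d+|[a-z]+", v)
    return " ".join(sorted(parts))

print(form_token("12345"))             # '12345'
print(form_token("Smith John"))        # 'john smith'
print(form_token("Flat 12B Anna St"))  # '12 anna b flat st'
```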

7.1.4 Blocking of Records

Using an unsuitable key that fails to group duplicates together has a damaging effect on the result: many false duplicates are detected relative to the true duplicates when, say, only the address key is used. Key creation and attribute selection are therefore important in the blocking method for grouping similar records together. This research work addresses the selection of the most suitable blocking key (parameter) for the blocking method; dynamically adjusting the blocking key at execution time is effective in record linkage algorithms. The blocking key is selected based on the type of the data and its usage in the data warehouse. Both the dynamically adjusted, token-based blocking key and the sorted neighborhood method (SNM) with a dynamic window size are used in this research work. An agent tunes the parameters, so everything is set dynamically for the blocking method without human intervention, yielding better performance. This matters because in most real-world problems, where expert knowledge is hard to obtain, methods that can automatically choose reasonable parameters are especially helpful.
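As a rough illustration of the sorted-neighborhood idea, the sketch below sorts records by a blocking key and compares only records inside a sliding window. The rule used to widen the window is an assumption for the example, not the dynamic window policy actually used in this work.

```python
# Sorted Neighborhood Method sketch: sort by the blocking key, slide a window,
# and emit candidate pairs. The "dynamic" window rule is assumed: the window
# widens while the neighboring key still shares a short prefix with the current key.

def snm_candidate_pairs(records, key_func, min_window=3, max_window=8):
    ordered = sorted(records, key=key_func)
    pairs = []
    for i, rec in enumerate(ordered):
        window = min_window
        while (window < max_window and i + window < len(ordered)
               and key_func(ordered[i + window])[:3] == key_func(rec)[:3]):
            window += 1                                   # widen while keys look similar
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((rec, ordered[j]))
    return pairs

people = [{"name": "john smith"}, {"name": "jon smith"},
          {"name": "mary jones"}, {"name": "maria jones"}]
for a, b in snm_candidate_pairs(people, key_func=lambda r: r["name"]):
    print(a["name"], "<->", b["name"])
```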

7.1.5 Maintenance of LOG Table

Duplicates are easily detected from the LOG table, which records the similarity of token field values obtained by comparing neighboring records for a match. The LOG table is also maintained for incremental record identification when new data is added to the data warehouse (a toy sketch of this incremental use is given after Section 7.1.6).

7.1.6 Performance based on time and data quality

Time is critical when cleansing a large database. In this research work, an efficient token-based blocking method and similarity computation method are used to reduce the time taken by each comparison, and an efficient duplicate detection and elimination approach is developed that obtains good detection and elimination results by reducing false positives. The performance results show a significant time saving and better duplicate results than the existing approach.
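To make the incremental use concrete, the toy sketch below stands in for the LOG table with an in-memory dictionary keyed by token key, so that a newly arriving record is checked only against previously stored keys; the real LOG table is a database table whose schema and matching rule are not reproduced here.

```python
# Simplified stand-in for the LOG table of Section 7.1.5: it stores the token
# key of every record already loaded, so an incoming record is compared against
# stored keys instead of rescanning the whole warehouse. The structure and the
# matching rule are assumptions for illustration.

log_table = {}          # token key -> ids of records already in the warehouse

def register(record_id, token_key):
    log_table.setdefault(token_key, []).append(record_id)

def incremental_duplicates(new_record_id, token_key):
    """Return ids of previously loaded records sharing the same token key."""
    matches = log_table.get(token_key, [])
    register(new_record_id, token_key)
    return matches

register("r1", "john smith chennai")
register("r2", "mary jones madurai")
print(incremental_duplicates("r3", "john smith chennai"))   # ['r1']
```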

Compared with previous approaches, the new framework includes the token concept to speed up the data cleaning process and reduce its complexity. Each step of the framework is specified clearly, in sequential order, through the six cleaning processes offered: attribute selection, token formation, clustering, similarity computation, elimination, and merge. An agent is used in the framework to reduce the effort required of the user; it works according to the type and size of the data set, and the framework is flexible enough for all kinds of data in relational databases. The framework is developed mainly to increase the speed of duplicate detection and elimination and to increase data quality by identifying true duplicates while remaining strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategy are improved by introducing the concept of a certainty factor for a rule (a small illustrative sketch follows the list below). Data cleansing is a complex and challenging problem; this rule-based strategy helps to manage that complexity but does not remove it. The approach can be applied to any subject-oriented database in any domain, and LOG files of the whole cleaning process are maintained for incremental data cleaning. The main benefits of the system are as follows:

a. A framework is developed to handle duplicate data.
b. Attributes are selected with the attribute selection algorithm to reduce the complexity of the data cleaning process.
c. The speed of the data cleaning process is improved (results are given in Figure 5f) by using the token-based blocking and similarity computation methods.
d. Potentially resembling records are grouped together into clusters using the token-based blocking key, reducing the computational cost.
e. Rule-based duplicate detection and elimination is developed to reduce false mismatches and increase accuracy.
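As flagged above, the sketch below shows, with assumed rules and weights, how certainty factors attached to individual rules could be combined into a single confidence for a candidate duplicate pair; it is illustrative only and does not reproduce the rule base used in this work.

```python
# Illustration only: rules with certainty factors vote on whether a candidate
# pair is a duplicate. The rules, weights and cut-off are assumed values.

RULES = [
    # (description, predicate over a pair of records, certainty factor)
    ("names match exactly", lambda a, b: a["name"] == b["name"],       0.9),
    ("same postal code",    lambda a, b: a["zip"] == b["zip"],         0.6),
    ("same year of birth",  lambda a, b: a["dob"][:4] == b["dob"][:4], 0.4),
]

def duplicate_certainty(a, b):
    """Combine the fired rules with the classic certainty-factor update."""
    cf = 0.0
    for _, predicate, factor in RULES:
        if predicate(a, b):
            cf = cf + factor * (1.0 - cf)     # CF combination rule
    return cf

r1 = {"name": "john smith", "zip": "600001", "dob": "1980-05-01"}
r2 = {"name": "john smith", "zip": "600001", "dob": "1980-07-21"}
print(duplicate_certainty(r1, r2))            # 0.976 -> above, e.g., a 0.8 cut-off
```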

7.2 Future work

7.2.1 Domain Independent

Future work will consist of applying this framework to unstructured data (such as complete text files) and semi-structured data (such as XML files). The proposed framework is currently applicable only to relational data, and the algorithms proposed in this thesis for selecting attributes, forming tokens, blocking records, matching records and eliminating duplicates are used on relational data warehouses. Further improvements to a domain-independent attribute selection algorithm, token formation algorithm, record blocking algorithm, record matching algorithm, and rule-based duplicate detection and elimination approach will also be explored.

7.2.2 Accuracy of duplicate detection

The experiments show that token-based similarity computation has an enormous impact on the final duplicate decision. This calls for further work on better token-based similarity methods to improve the performance of the duplicate detection process. In this context it is also interesting to examine which attributes need to be compared at all, i.e., which attributes really contribute to the duplicate decision in token-based similarity measures. In future, the accuracy of detecting approximate duplicate records should be improved by performing more comparisons among records, and additional, less expensive techniques have to be explored.

7.2.3 Association Rules for Duplicate Detection

Duplicate detection in a representative biological dataset using the Apriori method for association rule mining was carried out at an earlier stage [JMA+, 04]. Future work is to improve this association-rule-based duplicate detection for large-scale datasets: the existing association rules are applicable only to the biological dataset, and duplicate detection for any kind of dataset using association rule mining will be developed.

7.2.4 Normalization Dictionaries

In this research work, a dictionary is used to structure data by abbreviating record values and standardizing the format of the data. This dictionary maintains a limited number of values and supports only specific subject-oriented data. To overcome this limitation, a normalized dictionary will be developed that supports any kind of subject-oriented data, can be customized to meet all data correction needs, and will be the foundation for structuring data in its most useful and consistent form.

7.2.5 Machine Learning Techniques for Blocking Key Generation

Future work includes incorporating machine learning techniques into the automatic extraction of blocking keys for blocking methods. A fast and efficient machine learning based approach would classify the data before generating the blocking key in order to achieve optimal performance; such techniques will be useful for deriving the most suitable blocking key in record linkage.

7.2.6 Improving Proposed Algorithms

Future work also includes more extensive run-time evaluation, the design of better blocking indices, and aiding users in designing good token-based similarity functions to improve efficiency on large datasets. The algorithms proposed in this thesis can be improved by choosing the different parameter values automatically and adjusting them to the size and type of the data warehouse.

7.2.7 Incremental Data Cleaning

Finally, further investigation will consider how to calculate data quality and identification power for data newly added to already exported data. The main next step is to design a sequential framework for incremental computation methods; much work still needs to be done in this area. Relations spanning several databases give rise to opportunities for anomaly and duplicate detection that are not possible with a single database. Within incremental data cleaning, extending the knowledge-based framework to de-duplicate results returned by web search engines will also be explored.

7.3 Final Thoughts

This thesis has presented ways to detect and eliminate duplicates in the data warehouse. Overall, the work contributes methods achieving state-of-the-art performance in the efficient detection and elimination of duplicates, and it provides a number of useful algorithms for practitioners in attribute selection, token-based processing, record blocking, and rule-based duplicate detection and elimination. The research demonstrates the power of token-based cleaning for increasing the speed of the data cleaning process, and it should motivate further research in duplicate detection and elimination as well as encourage the use of token-based cleaning in other applications where distance estimates between instances are required.