CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 Conclusion

Data pre-processing is a vital stage of the data mining process, and no single data cleaning technique is applicable to every kind of data. Deduplication and data linkage are important pre-processing tasks in many data mining projects, and data quality must be improved before data is loaded into a data warehouse. Locating approximate duplicates in large databases is therefore an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data, improving data quality while supporting any subject-oriented data. Existing data cleaning tools each implement only a few cleaning methods, although each performs well on its own part of the cleaning process: duplicate elimination tools, for example, are suited to the elimination step, while similarity-based tools are well suited to computing field and record similarity.
The main contributions of the thesis are outlined below.

7.1.1 Framework Design

To overcome these problems, a new framework is designed and implemented that combines all of these techniques in a single data cleaning tool. The framework consists of six elements: selection of attributes, formation of tokens, blocking of records with the best block-token key, token-based similarity computation over the selected attributes, rule-based elimination, and merging. It allows a powerful data cleaning tool to be built by applying the existing data cleaning techniques in sequential order.

7.1.2 Selection of Attributes

A new attribute selection algorithm is implemented and evaluated through extensive experiments. The number of attributes is reduced so that noisy and duplicate data can be handled in the subsequent cleaning steps. The experimental results show that, with a considerable reduction of attributes, the method is simple and efficient, and that it deals effectively with high dimensionality (thousands of attributes) in the data cleaning process. The algorithm eliminates both irrelevant and redundant attributes, is applicable to any type of data (nominal, numeric, etc.), and handles mixed attribute types smoothly. The quality of its results is confirmed by applying a set of rules. The main purpose of attribute selection for data cleaning is to reduce the time required by the later cleaning steps, such as token formation, record similarity computation, and elimination.

7.1.3 Formation of Tokens

The token formation algorithm forms smart tokens for data cleaning and is suitable for numeric, alphanumeric, and alphabetic data; three different rules are defined, one for each of these token types. Token-based cleaning removes duplicate data efficiently: comparing short tokens takes less time than comparing entire strings, so attribute selection and the token-based approach together reduce the overall running time. The formed tokens are stored in the LOG table and are later used as blocking keys in the subsequent cleaning steps, so defining the best, smartest token is essential.

7.1.4 Blocking of Records

Using an unsuitable key that fails to group duplicates together harms the result: many false duplicates are detected relative to the true duplicates when, say, the address key is used. Hence key creation and attribute selection are important if the blocking method is to group similar records together. This research work addresses the selection of the most suitable blocking key (parameter) for the blocking method. The blocking key is adjusted dynamically during execution, which is effective in record linkage algorithms; it is selected based on the type of the data and its usage in the data warehouse. The dynamically adjusted blocking key, the token-based blocking key, and the sorted neighborhood method (SNM) with a dynamic window size are all used in this research work. An agent tunes the parameters, so everything is set dynamically for the blocking method without human intervention, yielding better performance. This matters because in most real-world problems expert knowledge is hard to obtain, and methods that automatically choose reasonable parameters are therefore valuable.

7.1.5 Maintenance of the LOG Table

Duplicates are easily detected from the LOG table, which records the similarity of token field values obtained by comparing neighboring records for a match. The LOG table is also maintained for incremental record identification as new data is added to the data warehouse.

7.1.6 Performance Based on Time and Data Quality

Time is critical when cleansing a large database. In this research work, an efficient token-based blocking method and similarity computation method are used to reduce the time taken for each comparison, and an efficient duplicate detection and elimination approach is developed that improves detection results by reducing false positives. The performance results show significant time savings and better duplicate results than the existing approaches.
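The type-based token rules summarized in Section 7.1.3 can be illustrated with a small sketch. The specific transformations below (canonicalizing numbers, sorting alphabetic words, splitting alphanumeric values into sorted digit and letter runs) are assumptions chosen for illustration, not the thesis's exact rule definitions:

```python
import re

def form_token(value: str) -> str:
    """Form a compact token from a field value using one of three
    type-based rules (illustrative assumptions, not the thesis's
    exact definitions):
      - numeric:       canonicalize the number (drop leading zeros)
      - alphabetic:    sort the words so word order does not matter
      - alphanumeric:  split into digit runs and letter runs, sort them
    """
    v = value.strip().lower()
    if v.replace(".", "", 1).isdigit():           # numeric rule
        return str(int(float(v)))
    if v.replace(" ", "").isalpha():              # alphabetic rule
        return " ".join(sorted(v.split()))
    runs = re.findall(r"\d+|[a-z]+", v)           # alphanumeric rule
    return " ".join(sorted(runs))

# Differently ordered name fields yield the same token,
# so they fall into the same block later on.
print(form_token("Kumar Ravi"))   # -> "kumar ravi"
print(form_token("Ravi Kumar"))   # -> "kumar ravi"
print(form_token("007"))          # -> "7"
```

Because the rules normalize word order and leading zeros, records that differ only in such surface details produce identical tokens and are grouped together by the blocking step.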
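The blocking and similarity steps of Sections 7.1.4 and 7.1.6 follow the sorted neighborhood idea: sort the records on a blocking key and compare only records that fall inside a sliding window. The sketch below uses Jaccard similarity over token sets; the window size, the key choice, and the threshold are illustrative assumptions, not the tuned parameters of this research work:

```python
def jaccard(a: set, b: set) -> float:
    """Token-set similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def snm_candidates(records, key, window=3):
    """Sorted neighborhood method: sort on the blocking key, then
    compare each record only with its window-1 successors."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            yield rec, other

# Toy records: (id, token string). The token doubles as blocking key.
records = [
    (1, "kumar ravi"),
    (2, "kumar ravi"),
    (3, "anita rao"),
    (4, "anita r rao"),
]

THRESHOLD = 0.6   # illustrative cut-off, not a tuned value
duplicates = [
    (r1[0], r2[0])
    for r1, r2 in snm_candidates(records, key=lambda r: r[1])
    if jaccard(set(r1[1].split()), set(r2[1].split())) >= THRESHOLD
]
print(duplicates)   # -> [(4, 3), (1, 2)]
```

Sorting on the token key brings likely duplicates into adjacent positions, so only a linear number of window comparisons is needed instead of comparing every pair of records.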
Compared with previous approaches, the new framework includes the token concept to speed up the data cleaning process and reduce its complexity. Each step of the framework is specified clearly and applied in sequential order through the six data cleaning processes offered: attribute selection, token formation, clustering, similarity computation, elimination, and merge. An agent is used in the framework to reduce the effort required of the user; it works according to the type and size of the data set. The framework is flexible for all kinds of data in relational databases. It is developed mainly to increase the speed of duplicate detection and elimination and to increase data quality by identifying true duplicates while remaining strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule. Data cleansing is a complex and challenging problem; the rule-based strategy helps to manage that complexity, but does not remove it. The approach can be applied to subject-oriented databases in any domain, and LOG files of all the cleaning processes are maintained for incremental data cleaning. The main benefits of the system are as follows:

a. A framework is developed to handle duplicate data.

b. The attribute selection algorithm selects attributes to reduce the complexity of the data cleaning process.
c. The speed of the data cleaning process is improved (the result is given in Figure 5f) by the token-based blocking and similarity computation methods.

d. Potentially resembling records are grouped together into clusters using the token-based blocking key, reducing the computational cost.

e. Rule-based duplicate detection and elimination is developed to reduce false mismatches and increase accuracy.

7.2 Future Work

7.2.1 Domain Independence

Future work will apply this framework to unstructured (e.g. complete text files) and semi-structured (e.g. XML) data, since the proposed framework is applicable only to relational data: the algorithms proposed in this thesis for selecting attributes, forming tokens, blocking records, matching records, and eliminating duplicates are used for relational data warehouses. Further improvements to the domain-independent attribute selection, token formation, record blocking, and record matching algorithms, and to the rule-based duplicate detection and elimination approach, will also be explored.

7.2.2 Accuracy of Duplicate Detection

The experiments using token-based similarity computation have shown its enormous impact on the final duplicate decision. This calls for further work on better token-based similarity methods to improve the performance of the duplicate detection process. In this context it is also interesting to examine which attributes need to be compared at all, i.e., which attributes really contribute to the duplicate decision under token-based similarity measures. In future, the accuracy of detecting approximate duplicate records should be improved by performing more comparisons among records, and additional, less expensive techniques have to be explored.

7.2.3 Association Rules for Duplicate Detection

Duplicate detection in a representative biological dataset using the Apriori method for association rule mining was done in an earlier stage [JMA+04]. Our future work is to improve this method for large-scale datasets: the association rules developed in the existing method apply only to biological data, and in future, duplicate detection using association rule mining will be developed for any kind of dataset.

7.2.4 Normalization Dictionaries

In this research work, a dictionary is used to structure data by abbreviating record values and standardizing the format of the data. The dictionary maintains a limited number of values and supports only specific subject-oriented data. To avoid this limitation, a normalized dictionary will be developed that supports any kind of subject-oriented data and can be customized to meet all data correction needs. This dictionary will be the foundation for structuring the data in its most useful and consistent form.
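Dictionary-based standardization of this kind maps abbreviated or inconsistent field values onto one canonical form before comparison. A minimal sketch follows; the dictionary entries below are invented examples, not the actual dictionary used in this research work:

```python
# Hypothetical normalization dictionary: each variant maps to one
# canonical form. Real entries would be built per subject area.
NORMALIZE = {
    "st":   "street",
    "str":  "street",
    "rd":   "road",
    "dr":   "doctor",
    "univ": "university",
}

def standardize(value: str) -> str:
    """Lower-case a field value and replace each word with its
    canonical dictionary form, leaving unknown words unchanged."""
    words = value.lower().replace(".", " ").split()
    return " ".join(NORMALIZE.get(w, w) for w in words)

print(standardize("12 Main St."))    # -> "12 main street"
print(standardize("Dr. R. Kumar"))   # -> "doctor r kumar"
```

A fixed table like this only covers the variants it lists (and cannot resolve ambiguous abbreviations such as "dr" for both "doctor" and "drive"), which is exactly the limitation the proposed normalized dictionary aims to remove.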
7.2.5 Machine Learning Techniques for Blocking Key Generation

Future work includes incorporating machine learning techniques into the automatic extraction of blocking keys for blocking methods. A fast and efficient machine-learning-based approach would classify the data before generating the blocking key in order to achieve optimal performance. These techniques will be useful for developing the most suitable blocking key in record linkage.

7.2.6 Improving the Proposed Algorithms

Future work includes more extensive run-time evaluation, the design of better blocking indices, and aiding users in designing good token-based similarity functions to improve efficiency on large datasets. The algorithms proposed in this thesis can be improved by choosing the different parameter values automatically and adjusting them to the size and type of the data warehouse.

7.2.7 Incremental Data Cleaning

Finally, further investigation will consider how to calculate the data quality and the identification power of data newly added to previously exported data. The main piece of future work here is to design a sequential framework for incremental computation methods; much work still needs to be done in this area. Relations spanning several databases give rise to opportunities for anomaly and duplicate detection that are not possible with a single database. Within incremental data cleaning, extending the knowledge-based framework to de-duplicate results returned by web search engines will also be explored.

7.3 Final Thoughts

This thesis has presented ways to detect and eliminate duplicates in the data warehouse. Overall, the work contributes methods that achieve state-of-the-art performance in the efficient detection and elimination of duplicates, and it provides practitioners with a number of useful algorithms for attribute selection, the token-based approach, record blocking, and rule-based duplicate detection and elimination. The research demonstrates the power of token-based cleaning in increasing the speed of the data cleaning process. It should motivate further research in duplicate detection and elimination, and encourage the use of token-based cleaning in applications where distance estimates between instances are required.