Applying Phonetic Hash Functions to Improve Record Linking in Student Enrollment Data


Int'l Conf. Information and Knowledge Engineering | IKE'15

(Research in progress)

Pei Wang 1, Daniel Pullen 2, John Talburt 2 and Ningning Wu 2
1 Information Science Department, University of Arkansas at Little Rock, Little Rock, AR, USA

Abstract: Entity resolution and record linking processes are often required to process input records of poor data quality. However, the matching errors caused by poor quality data can often be overcome by categorizing the quality problems, then applying a cyclic process that continuously refines the match rules to overcome these problems. This paper extends a previous case study of this process for student enrollment data. It describes the unique data quality issues that were identified throughout this cyclic process and how different phonetic hashing functions were used to overcome them.

Keywords: Entity Resolution, Record Linkage, Phonetic Hash Code, Data Quality (DQ), Boolean matching rules

1. Introduction

Previous work in this area used similarity functions such as Levenshtein Edit Distance and Q-gram Tetrahedral Ratio [11]. This research takes a different approach by applying phonetic hash code functions to mitigate the quality issues present in student enrollment data. This approach can help to overcome variations that stem from phonetic-to-text conversion performed by humans, as well as common typographical variations.

2. Background

Entity Resolution (ER) is the process of determining whether two references to real world objects in an information system refer to the same object or to different objects [1]. The references are made up of attributes, and the values of the attributes describe the real world entity to which they refer. The ER processes discussed in this paper use Boolean match rules to make their decisions.
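As a rough illustration of the Boolean match rules this paper relies on, the sketch below treats a rule as a list of terms that must all be true, links a pair of references when at least one rule is true, and then groups linked references by transitive closure. It is a minimal sketch and not the implementation used in the study: the attribute names (`first`, `last`, `dob`, `sid`), the helper names, and the sample records are all hypothetical, and every term here is a plain exact-match comparison.

```python
from itertools import combinations


def exact(attr):
    """Build a term: TRUE when both references hold the same non-empty value."""
    def term(a, b):
        return bool(a.get(attr)) and a.get(attr) == b.get(attr)
    return term


def matches(rules, a, b):
    # OR across rules, AND across the terms inside each rule.
    return any(all(term(a, b) for term in rule) for rule in rules)


def resolve(refs, rules):
    """Link matching pairs, then apply transitive closure with union-find."""
    parent = list(range(len(refs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Naive all-pairs comparison; fine for a toy example only.
    for i, j in combinations(range(len(refs)), 2):
        if matches(rules, refs[i], refs[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(refs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical references (not taken from the study data).
refs = [
    {"first": "Ann", "last": "Lee", "dob": "2001-01-01", "sid": "S1"},
    {"first": "Ann", "last": "Lee", "dob": "2001-01-01", "sid": ""},
    {"first": "Anne", "last": "Lee", "dob": "2001-01-01", "sid": "S1"},
    {"first": "Bob", "last": "Ray", "dob": "2002-02-02", "sid": "S9"},
]

# Two hypothetical rules: (first AND last AND dob) OR (sid AND last).
rules = [
    [exact("first"), exact("last"), exact("dob")],
    [exact("sid"), exact("last")],
]

clusters = resolve(refs, rules)
print(clusters)
```

References 1 and 2 do not satisfy either rule directly, yet they end up in the same cluster because each is linked to reference 0. This is the effect of transitive closure.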
Boolean match rules do not produce a score or weight when comparing a pair of references, only a True/False decision. If two references satisfy a Boolean match rule, i.e. the rule is true, the references are linked together. After the application of transitive closure, all of the references that can be linked together form an entity identity structure (EIS) [9].

3. Boolean Match Rules and ER Outcomes

Boolean match rules are used to classify pairs of references as "link" pairs or "non-link" pairs. The basic unit of a Boolean rule is a term. A term is a comparison between the values of an attribute in the pair of records. The term is considered to be TRUE if the degree of similarity required by the comparison is met. The rule itself is made up of a series of terms connected by AND logic, i.e. every term must be true in order for the rule to be true. Finally, the ER process may use several Boolean rules that are connected by OR logic, i.e. the pair of references should be linked if at least one of the Boolean rules is true [3].

In evaluating the outcome of an ER process, the results of the matches between all pairs of references can be placed into four mutually exclusive categories: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). TPs are correctly labeled "link" pairs. TNs are correctly labeled "non-link" pairs. Contrasting these correct results are two types of incorrect linking results. FPs are pairs of records that have been identified as matches or link pairs by the ER process but actually refer to two different real world entities. FNs are pairs of records that have been identified as non-matches or non-link pairs by an ER process but actually refer to the same real world entity [8]. The goal of an ER process is to minimize the number of FPs and FNs.

4. The OYSTER ER System

The ER processes in this paper were performed with OYSTER (Open system for Entity Resolution). OYSTER is an open source ER system developed by the Center for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock (UALR). OYSTER was specifically designed to support entity identity information management (EIIM) [9]. Although OYSTER can be run in several different configurations to support the various phases of the entity identity information life cycle, only the identity capture configuration was used for the results given in this paper [10].

ER Impact of Data Quality Issues

The data set used throughout this testing is a collection of student enrollment data spanning two academic years. The total record and cluster counts are listed in Table 1.

TABLE I. DATA SETS
                  Set A        Set B
Total Clusters    526,         ,934
Total Records     3,234,292    3,255,513

Only the student identity information was used. Any results discussed in this paper have been made anonymous to allow the sharing and description of the unique cases identified. In the data available, a few strong identifying attributes are of particular interest. These are first name, middle name, last name, date of birth, and student identifier. Some of the data quality (DQ) issues identified with these attributes and their rates are summarized below in Tables 2 and 3.

TABLE II. DATA QUALITY ISSUES IN DATA SET A
Data Quality Issue        Data Set A    %
Number in First Name
Number in Middle Name
Number in Last Name
Virgule in First Name
Asterisk in First Name
Total Problems
Total Records             3,234,292

TABLE III. DATA QUALITY ISSUES IN DATA SET B
Data Quality Issue        Data Set B    %
Number in First Name
Number in Middle Name
Number in Last Name
Virgule in First Name
Asterisk in First Name
Total Problems
Total Records             3,255,513

These tables point out some of the obvious and easily quantifiable data quality issues present in these two data sets. Several other data quality issues also occur in these attributes. The student name fields have some particularly interesting and challenging problems, and these problems occur frequently enough throughout the data set to increase the number of errors made by the ER process.

The first name field has many records where the field holds not only the student's first name but also a nickname. This creates values that look like Joseph (Joey) or Joseph Joey. In other cases the last name is affected. Many Hispanic students have a hyphenated last name in which one name comes from the father and the other comes from the mother. Upon data entry, the first of the two names is sometimes placed in the middle name field. This has a detrimental impact on matching using the middle and last name fields. In addition, the presence of numbers or special characters in any of the three name fields can cause problems.

Some problems affect multiple attributes. Some of these unique cases can be summarized briefly. In some cases one attribute is placed in the incorrect field. Cases involving the phone number, student identifier, and address fields have been identified where these values actually appear in one of the student name fields.

The data also shows a trend in naming twins. Parents often give twins very similar names, such as Terrell and Jerrell. Occasionally, this is extended to a similarity in the middle names as well. With the date of birth and last name fields already identical, differentiating twins in the match rules is problematic. In some cases, combining this with erroneous or sequential student identifiers can create FP outcomes.

5. Methodology

How can managers of entity data overcome data quality problems when performing ER? Appropriate similarity functions and comparator functions can make a notable improvement.

IBM Alpha Code - IBM Alpha Code is a name encoding algorithm. Its coding rules produce a 14-digit phonetic key for a name [11]. Based on these phonetic keys, names that are spelled differently but pronounced the same can be matched to each other. For example: value 1 = "Rodgers" and value 2 = "Rogers".

The New York State Identification and Intelligence System (NYSIIS) - NYSIIS is a phonetic algorithm. Much like the previous algorithm, it allows names with different spellings to produce a match. For example: value 1 = "Carry" and value 2 = "Carrie".

Soundex - Soundex can be used to find values that have similar pronunciations but different spellings. This function can tolerate some misspellings and even some transposed characters. For example: value 1 = "Damieva" and value 2 = "Dameiva". These two values produce the same Soundex hash value, creating a match.

Scan - To overcome special characters in names, the similarity function Scan can be used.
It is often applied in preprocessing, before ER is performed, and can filter out all special characters so that only letters or alphanumeric characters remain. For example, value 1 = "JAMES\\" and value 2 = "JAMES". Scan can also reorder strings, or read them from right to left instead of left to right, and it can transform the case of alphabetic characters: it can force all characters to lower case or upper case, or keep the original case present in the string. For example, "Eric" becomes "ERIC" after applying Scan with upper casing.

Sometimes, these similarity functions will create FPs and FNs. For example, suppose two different rules are used to produce two different ER results from the same data set. The first rule requires an exact match on student first name, student last name and date of birth. The second rule requires a match on the Soundex code of the student first name, together with an exact match on last name and date of birth. After performing a split comparison of the two results as in previous research [7], the FPs and FNs created by the second rule can be identified and their rates can be calculated. The calculation for the approximate FP percentage rate is shown in equation (1). The results are shown in Table 3. Since these FPs were identified using split analysis, they are considered to be worst case FP rates. Split analysis is a methodology used to analyze splits in the clusters between two different link identifiers. How this process works has been discussed in detail in recent research [7].

(1)

The FP rate indicates one side of how well the rules are performing. For this reason, the user should attempt to drive both FP and FN rates as low as possible when creating and testing rules. These results focus on the FN rates in particular.

This research focuses on three similarity functions. These are Soundex, NYSIIS and IBM Alpha Code.
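The two rules compared above differ only in how the first name is tested, which can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions: `soundex` follows the common American Soundex convention (first letter plus three digits), which may differ in detail from the Soundex used in the study, and the record layout, the last name, and the date of birth are made up for the example.

```python
# Letter groups of the common American Soundex convention.
_GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return a 4-character Soundex key, or '' if the name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]


def rule_exact(a, b):
    """Rule 1: exact first name AND exact last name AND exact date of birth."""
    return (a["first"] == b["first"]
            and a["last"] == b["last"]
            and a["dob"] == b["dob"])


def rule_soundex_first(a, b):
    """Rule 2: Soundex of first name AND exact last name AND exact date of birth."""
    return (soundex(a["first"]) == soundex(b["first"])
            and a["last"] == b["last"]
            and a["dob"] == b["dob"])


# Hypothetical pair using the first-name example from the text.
a = {"first": "Damieva", "last": "Smith", "dob": "2001-04-17"}
b = {"first": "Dameiva", "last": "Smith", "dob": "2001-04-17"}

print(soundex(a["first"]), soundex(b["first"]))  # D510 D510
print(rule_exact(a, b), rule_soundex_first(a, b))  # False True
```

The looser second rule links this pair while the exact rule does not. The same looseness is what allows a phonetic rule to introduce false positives, which is why the two results are compared with split analysis.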
These three similarity functions can also be used in indexing, which can speed up the process, especially for large data sets. After testing these three functions on the same student enrollment data, the percentages of TP and FP outcomes are shown in the table below (Table 4):

TABLE IV. THE PERCENTAGES OF TP AND FP
            TP       FP       Not Sure
Soundex     34.6%    62.5%    2.9%
NYSIIS      36.5%    56.9%    6.7%
IBMAlpha    27.4%    68.0%    4.6%

Fig. 1. Bar Graph of TP and FP Percentages

Comparing the results of these three similarity functions against the benchmark, NYSIIS clearly performs best, with the highest percentage of TPs (36.5%) and the lowest percentage of FPs (56.9%).

6. Conclusions

Data quality problems often present a formidable obstacle to obtaining an accurate and effective ER result. The approaches described in this paper for overcoming data quality issues in student enrollment data during ER have been successfully implemented in OYSTER. The success of any ER process is often directly related to the time spent profiling the data and identifying these types of data quality problems. Effectively identifying and categorizing these types of problems directly affects the quality of the ER results at the end of such processes.

The approaches above include similarity functions such as Soundex and IBMAlphaCode, which can overcome issues such as a nickname and given name contained together in one field, transposed characters, and other typographical or spelling errors. Additionally, other similarity functions such as Scan can overcome issues such as special characters, numbers, and misspellings. While these approaches contribute greatly to improving the ER results, there is a limit to how far they can reduce the FP rate. The hash code functions tested in this paper cannot overcome all of the issues listed earlier in this paper. For example, they cannot directly overcome variations produced by the inclusion of a nickname in some references but a given name in other references.
However, applying these hash code functions along with similarity functions such as q-gram tetrahedral ratio, Levenshtein edit distance, and nickname matching could further mitigate the issues encountered in this particular student enrollment data.

7. Acknowledgment

The research described in this paper has been supported in part through grants from the Arkansas Department of Education and Black Oak Analytics.

8. References

[1] Talburt, John R. Entity Resolution and Information Quality. San Francisco, CA: Morgan Kaufmann/Elsevier.
[2] Melody Penning and John Talburt. "Information Quality Assessment and Improvement of Student Information in the University Environment". Information and Knowledge Engineering.
[3] Yinle Zhou, John Talburt, Fumiko Kobayashi and Eric D. Nelson. "Implementing Boolean Matching Rules in an Entity Resolution System using XML Scripts". Information and Knowledge Engineering.
[4] Holland, G. and Talburt, J. (2010). "q-gram Tetrahedral Ratio (qTR) for approximate pattern matching". Conference on Applied Research in Information Technology, University of Central Arkansas, Conway, AR.
[5] Ivan Fellegi and Alan Sunter. "A Theory for Record Linkage". Journal of the American Statistical Association, Vol. 64, No. 328, 1969.
[6] Steven Whang and Hector Garcia-Molina. "Entity Resolution with Evolving Rules". Proceedings of the VLDB Endowment, Vol. 3, Issue 1-2, September 2010.
[7] Huzaifa Syed, Fan Lui, Daniel Pullen, Ningning Wu, John Talburt. "Developing and Refining Matching Rules for Entity Resolution". Information and Knowledge Engineering, 2012.
[8] Christen, Peter. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer.
[9] Zhou, Y. and Talburt, J. (2011). "Entity Identity Information Management (EIIM)". International Conference on Information Quality (ICIQ-11), Adelaide, Australia, November 18-20, 2011.
[10] Zhou, Y. and Talburt, J. (2011). "The Role of Asserted Resolution in Entity Identity Management". The 2011 International Conference on Information and Knowledge Engineering (IKE'11), Las Vegas, Nevada, July 18-20, 2011.
[11] Wang, Pei, Pullen, Daniel, Wu, Ningning, and Talburt, John. (2013). "Mitigating Data Quality Impairment on Entity Resolution Errors in Student Enrollment Data". Information and Knowledge Engineering Conference, 2013.


More information

Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem

Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem Bidyut Gupta, Nick Rahimi, Henry Hexmoor, and Koushik Maddali Department of Computer Science Southern Illinois

More information

Hybrid Clustering Approach for Software Module Clustering

Hybrid Clustering Approach for Software Module Clustering Hybrid Clustering Approach for Software Module Clustering 1 K Kishore C, 2 Dr. K. Ramani, 3 Anoosha G 1,3 Assistant Professor, 2 Professor 1,2 Dept. of IT, Sree Vidyanikethan Engineering College, Tirupati

More information

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of

More information

Math Search with Equivalence Detection Using Parse-tree Normalization

Math Search with Equivalence Detection Using Parse-tree Normalization Math Search with Equivalence Detection Using Parse-tree Normalization Abdou Youssef Department of Computer Science The George Washington University Washington, DC 20052 Phone: +1(202)994.6569 ayoussef@gwu.edu

More information

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

An Overview of various methodologies used in Data set Preparation for Data mining Analysis An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of

More information

Advanced Global Name Recognition Technology

Advanced Global Name Recognition Technology 1 IBM Information Management software Advanced Global Name Dr. John C. Hermansen IBM Distinguished Engineer Chief Technology Officer IBM Global Name Recognition 2 Contents 2 Introduction 4 Elements of

More information

Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery

Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Annie Chen ANNIEC@CSE.UNSW.EDU.AU Gary Donovan GARYD@CSE.UNSW.EDU.AU

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Investigation of Golay Code (24, 12, 8) Structure in Improving Search Techniques

Investigation of Golay Code (24, 12, 8) Structure in Improving Search Techniques The International Arab Journal of Information Technology, Vol. 8, No. 3, July 2011 265 Investigation of Golay Code (24, 12, 8) Structure in Improving Search Techniques Eyas El-Qawasmeh 1, Maytham Safar

More information

Getting More from Segmentation Evaluation

Getting More from Segmentation Evaluation Getting More from Segmentation Evaluation Martin Scaiano University of Ottawa Ottawa, ON, K1N 6N5, Canada mscai056@uottawa.ca Diana Inkpen University of Ottawa Ottawa, ON, K1N 6N5, Canada diana@eecs.uottawa.com

More information

Test Cases Generation from UML Activity Diagrams

Test Cases Generation from UML Activity Diagrams Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing Test Cases Generation from UML Activity Diagrams Hyungchoul Kim, Sungwon

More information

Query Modifications Patterns During Web Searching

Query Modifications Patterns During Web Searching Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Task Taxonomy for Graph Visualization

Task Taxonomy for Graph Visualization Task Taxonomy for Graph Visualization Bongshin Lee, Catherine Plaisant, Cynthia Sims Parr Human-Computer Interaction Lab University of Maryland, College Park, MD 20742, USA +1-301-405-7445 {bongshin, plaisant,

More information

Volume 2, Issue 9, September 2014 ISSN

Volume 2, Issue 9, September 2014 ISSN Fingerprint Verification of the Digital Images by Using the Discrete Cosine Transformation, Run length Encoding, Fourier transformation and Correlation. Palvee Sharma 1, Dr. Rajeev Mahajan 2 1M.Tech Student

More information

Development of Search Engines using Lucene: An Experience

Development of Search Engines using Lucene: An Experience Available online at www.sciencedirect.com Procedia Social and Behavioral Sciences 18 (2011) 282 286 Kongres Pengajaran dan Pembelajaran UKM, 2010 Development of Search Engines using Lucene: An Experience

More information

Metaheuristic Optimization with Evolver, Genocop and OptQuest

Metaheuristic Optimization with Evolver, Genocop and OptQuest Metaheuristic Optimization with Evolver, Genocop and OptQuest MANUEL LAGUNA Graduate School of Business Administration University of Colorado, Boulder, CO 80309-0419 Manuel.Laguna@Colorado.EDU Last revision:

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016 + Databases and Information Retrieval Integration TIETS42 Autumn 2016 Kostas Stefanidis kostas.stefanidis@uta.fi http://www.uta.fi/sis/tie/dbir/index.html http://people.uta.fi/~kostas.stefanidis/dbir16/dbir16-main.html

More information

C-NBC: Neighborhood-Based Clustering with Constraints

C-NBC: Neighborhood-Based Clustering with Constraints C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

More information

Package phonics. February 13, Type Package Title Phonetic Spelling Algorithms Version Date Encoding UTF-8

Package phonics. February 13, Type Package Title Phonetic Spelling Algorithms Version Date Encoding UTF-8 Package phonics February 13, 2018 Type Package Title Phonetic Spelling Algorithms Version 1.0.0 Date 2018-02-13 Encoding UTF-8 URL https://jameshoward.us/software/phonics/, https://github.com/howardjp/phonics

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Keywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile.

Keywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile. Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Blobs and Cracks

More information

The Effect of Changing Grid Size in the Creation of Laser Scanner Digital Surface Models

The Effect of Changing Grid Size in the Creation of Laser Scanner Digital Surface Models The Effect of Changing Grid Size in the Creation of Laser Scanner Digital Surface Models Smith, S.L 1, Holland, D.A 1, and Longley, P.A 2 1 Research & Innovation, Ordnance Survey, Romsey Road, Southampton,

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

Full file at Chapter 2: Foundation Concepts

Full file at   Chapter 2: Foundation Concepts Chapter 2: Foundation Concepts TRUE/FALSE 1. The input source for the conceptual modeling phase is the business rules culled out from the requirements specification supplied by the user community. T PTS:

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases. A Thesis Presented to

Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases. A Thesis Presented to Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases A Thesis Presented to The Faculty of the Computer Science Program California State University Channel Islands In (Partial)

More information

Scan-Based BIST Diagnosis Using an Embedded Processor

Scan-Based BIST Diagnosis Using an Embedded Processor Scan-Based BIST Diagnosis Using an Embedded Processor Kedarnath J. Balakrishnan and Nur A. Touba Computer Engineering Research Center Department of Electrical and Computer Engineering University of Texas

More information

Comparison of Online Record Linkage Techniques

Comparison of Online Record Linkage Techniques International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI.

More information

GLEIF Global LEI Data Quality Report Dictionary

GLEIF Global LEI Data Quality Report Dictionary GLEIF Global LEI Data Quality Report Dictionary Global LEI Data Quality Report Dictionary 2 19 Contents Data Quality Report Glossary... 3 1. Chapter 1: Preface... 4 1.1. Purpose of the Data Quality Report...

More information

2. On classification and related tasks

2. On classification and related tasks 2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.

More information

OCR correction based on document level knowledge

OCR correction based on document level knowledge OCR correction based on document level knowledge T. Nartker, K. Taghva, R. Young, J. Borsack, and A. Condit UNLV/Information Science Research Institute, Box 4021 4505 Maryland Pkwy, Las Vegas, NV USA 89154-4021

More information

JMP Clinical. Release Notes. Version 5.0

JMP Clinical. Release Notes. Version 5.0 JMP Clinical Version 5.0 Release Notes Creativity involves breaking out of established patterns in order to look at things in a different way. Edward de Bono JMP, A Business Unit of SAS SAS Campus Drive

More information

An Algorithm for Frequent Pattern Mining Based On Apriori

An Algorithm for Frequent Pattern Mining Based On Apriori An Algorithm for Frequent Pattern Mining Based On Goswami D.N.*, Chaturvedi Anshu. ** Raghuvanshi C.S.*** *SOS In Computer Science Jiwaji University Gwalior ** Computer Application Department MITS Gwalior

More information

Get Better Genealogical Results from

Get Better Genealogical Results from S.C. Computer / Genealogy Special Interest Group Karen Ristic Get Better Genealogical Results from Part 1: Basic Search Strategies March 14, 2013 2013 Karen Ristic A. What is Google? 1. Meaning 1. Googol

More information

Empirical Study of Automatic Dataset Labelling

Empirical Study of Automatic Dataset Labelling Empirical Study of Automatic Dataset Labelling Francisco J. Aparicio-Navarro, Konstantinos G. Kyriakopoulos, David J. Parish School of Electronic, Electrical and System Engineering Loughborough University

More information

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Int. J. Advance Soft Compu. Appl, Vol. 9, No. 1, March 2017 ISSN 2074-8523 The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Loc Tran 1 and Linh Tran

More information

Transitivity and Triads

Transitivity and Triads 1 / 32 Tom A.B. Snijders University of Oxford May 14, 2012 2 / 32 Outline 1 Local Structure Transitivity 2 3 / 32 Local Structure in Social Networks From the standpoint of structural individualism, one

More information

Similarity Joins in MapReduce

Similarity Joins in MapReduce Similarity Joins in MapReduce Benjamin Coors, Kristian Hunt, and Alain Kaeslin KTH Royal Institute of Technology {coors,khunt,kaeslin}@kth.se Abstract. This paper studies how similarity joins can be implemented

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

ISSN: [Keswani* et al., 7(1): January, 2018] Impact Factor: 4.116

ISSN: [Keswani* et al., 7(1): January, 2018] Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AUTOMATIC TEST CASE GENERATION FOR PERFORMANCE ENHANCEMENT OF SOFTWARE THROUGH GENETIC ALGORITHM AND RANDOM TESTING Bright Keswani,

More information

Spatial Topology of Equitemporal Points on Signatures for Retrieval

Spatial Topology of Equitemporal Points on Signatures for Retrieval Spatial Topology of Equitemporal Points on Signatures for Retrieval D.S. Guru, H.N. Prakash, and T.N. Vikram Dept of Studies in Computer Science,University of Mysore, Mysore - 570 006, India dsg@compsci.uni-mysore.ac.in,

More information

ENTITY RESOLUTION MODELS

ENTITY RESOLUTION MODELS 3 ENTITY RESOLUTION MODELS Overview This chapter presents three models of ER. The models are complementary in that they address different levels and aspects of the ER process. The first and earliest model

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

Statewide Student Identifiers (SSIDs or SIDs)

Statewide Student Identifiers (SSIDs or SIDs) Statewide Student Identifiers (SSIDs or SIDs) Overview SSIDs, which are unique to the state of Ohio, are used for funding and tracking longitudinal student data. The SSID system provides a way for ODE

More information

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report

More information

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE In work educational data mining has been used on qualitative data of students and analysis their performance using C4.5 decision tree algorithm.

More information