Naming Disambiguation Based on Approximate String Matching for Co-Authorship Networks


Dr. V. Akila, Dept. of Computer Science & Engg., akila@pec.edu
Dr. V. Govindasamy, Dept. of Information Technology, vgopu@pec.edu
R. Kowsalya, Dept. of Computer Science & Engg., kowsiamu@pec.edu

Abstract

A co-authorship network models the co-authorship of scientific publications as a network. Name disambiguation is an important aspect of co-authorship networks. Finding a unique author in a co-authorship network is challenging: multiple persons may share the same name, and names appear with abbreviations, misspellings and other variations. Further, human error leads to multiple persons being grouped under a single reference. Such mistakes degrade the performance of finding a unique author. Author name disambiguation assigns a unique identifier to every reference to the same author. This paper proposes a naming disambiguation model based on two approximate string matching algorithms, Jaro-Winkler and Levenshtein similarity. The proposed system assigns a unique ID to each unique author.

Keywords: Disambiguation; Co-Authorship Network; Approximate String Matching

1. Introduction

In a co-authorship network, multiple authors collaboratively publish their research work. Identifying unique authors in this scenario is difficult. Name disambiguation is an important step in reducing redundant data. After disambiguation, each unique author is associated with a unique ID, which is useful for further processing of the co-authorship network.

Consider some examples. In jean francois vs jeanfrancois, the only difference is the presence or absence of a space. Zak vs zakaria or dave vs david shows a partial name or nickname that still refers to the same person. Bjoern vs bjorn is a spelling variation that belongs to the same person. Permuted tokens such as kim. jim vs kim, jim also refer to the same person. If such a data set is used to compute collaboration efficiency, the performance of the system suffers. Preprocessing the author names is therefore the primary step in extracting the underlying information from the co-authorship network.
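As an illustration of how approximate string matching scores the name variants listed above, the following minimal Python sketch computes a Levenshtein similarity for those pairs. The function names and the rescaling of the edit distance to the range 0 to 1 (one minus the distance divided by the longer length) are assumptions made for this example; the paper does not state how the distance is normalized.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


# Name pairs taken from the examples in the Introduction.
pairs = [
    ("jean francois", "jeanfrancois"),
    ("zak", "zakaria"),
    ("dave", "david"),
    ("bjoern", "bjorn"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {levenshtein_similarity(left, right):.3f}")
```

Under this normalization, bjoern vs bjorn scores about 0.83 and jean francois vs jeanfrancois about 0.92, while nickname pairs such as zak vs zakaria score about 0.43, so an edit-distance measure with a threshold of 0.8 would not merge them on its own.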
Assigning unique indexes to authors is necessary. The first step is to collect a dataset containing the following information: first name, last name, middle name, year of publication, journals, co-authors' names, affiliation and address. First, take the full name string, for example smith, john hoy -> (smith, J.L). Then count the tokens, which are separated by characters such as . , & # $. Next, compare the last name (surname) against the dataset and separate out the unique names. The remaining names are compared on the middle name and the surname, and the unique names are again separated. The names still remaining are compared on all combinations of (first name, middle name, surname). Any remaining records are compared on the other information, based on the citation, the address and the year of publication. This yields a dataset of unique authors, each with a unique ID. The proposed system is a naming disambiguation method based on approximate string matching. Naming disambiguation is usually performed using the full and actual name.

2. Related Work

Author Name Disambiguation [1] identifies authors using MEDLINE and establishes a unique registry of unique identifiers. It also uses manual calculations through supervised and unsupervised approaches. On co-authorship for author disambiguation [2] uses unsupervised learning (spectral clustering) for name disambiguation. Name Disambiguation in Author Citations using a K-way Spectral Clustering Method [3] uses K-way spectral clustering based on eigenvalues and eigenvectors; disambiguation is done using these values. Efficient Topic-based Unsupervised Name Disambiguation [4] uses Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA) and topic-based LDA to find content matches in web pages; it uses 11,000,000 pages and the Yahoo database for disambiguation. Name disambiguation is needed to resolve multiple persons having the same information.

Name Disambiguation from Link Data in a Collaboration Graph [5] uses cluster-based entity disambiguation: records for the same entity that are present in different clusters are collected into metadata, from which the information is easy to extract. The DBLP and ArnetMiner datasets are used in this method. Author Name Disambiguation in MEDLINE [6] proposes disambiguation using MEDLINE. The dataset contains first, middle and last initials; the middle initial is treated as distinguishing data, and name disambiguation is done through it. Person Name Disambiguation by Relevance Weighting of Extended Feature Sets [7] uses feature sets and feature weighting (support vector based) to measure the relevance of features to the query. Clusters of words and named entities are the most commonly used techniques in existing web entity disambiguation.

Personal Name Disambiguation Based on Reference Entity Tables Mined from the Web [8] describes a web querying method to mine tables and link them with entity-based methods. Person names are disambiguated by linking person entities with the mined tables through categorization. Author name disambiguation has also been described in terms of pairwise similarity, computed by supervised and unsupervised methods. It relies on matching synonymous names and resolving homonymous names. Pairwise author similarity is the degree to which two name instances that appear in two different articles belong to the same person.
Unsupervised Personal Name Disambiguation [9] is based on an unsupervised clustering technique. Disambiguation and co-authorship networks of the U.S. patent inventor database [10] uses a Bayesian supervised learning approach, evaluated with precision, recall and F-measure. The survey is summarized in Table 1.

Table 1: Summary of the Survey

S. No. | Paper | Key features | Metrics
1 | Author name disambiguation | MEDLINE; establishes a unique registry of unique identifiers; manual calculations | Supervised and unsupervised approaches
2 | On co-authorship for author disambiguation | Feature selection | -
3 | Name disambiguation in author citations using a K-way spectral clustering method | Spectral clustering | -
4 | Efficient topic-based unsupervised name disambiguation | Probabilistic Latent Semantic Analysis (PLSA) | Probabilistic latent semantic analysis
5 | Name disambiguation from link data in a collaboration graph | Cluster-based entity disambiguation | -
6 | Author name disambiguation in MEDLINE | MEDLINE | -

Domain: data mining, co-authorship networks.

Table 1 (continued)

S. No. | Paper | Key features | Metrics
7 | Person name disambiguation by relevance weighting of extended feature sets | Measures the relevance of features to the query and to the text content | Clusters of words and named entities, the most commonly used techniques in existing web entity disambiguation
8 | Personal name disambiguation based on reference entity tables mined from the web | Web querying method to mine tables (linked with entity-based methods) | -
9 | Unsupervised personal name disambiguation | Pairwise similarity by supervised and unsupervised methods | Unsupervised clustering technique
10 | Disambiguation and co-authorship networks of the U.S. patent inventor database | Bayesian supervised learning approach | Precision, recall and F-measure

3. Architecture Diagram

Figure 1: Naming Disambiguation

Compare the extracted surname with the surname in the database against a similarity threshold of 0.8. If the condition holds, compare the extracted middle name with the middle name in the database, again against a similarity threshold of 0.8. If that condition holds, compare the first name with the first name in the database. If the names are considered dissimilar, a new unique ID is assigned. If all three are considered similar, the other fields in the database are checked and then the unique ID is assigned.

Rule 1:
If Similarity(SURNAME, SURNAME DB) < 0.8
   or Similarity(MIDDLE NAME, MIDDLE NAME DB) < 0.8
   or Similarity(FIRST NAME, FIRST NAME DB) < 0.8 then
      Dissimilar: assign a new unique ID

Rule 2:
If Similarity(SURNAME, SURNAME DB) >= 0.8 then
   If Similarity(MIDDLE NAME, MIDDLE NAME DB) >= 0.8 then
      If Similarity(FIRST NAME, FIRST NAME DB) >= 0.8 then
         Similar: compare the other fields in the dataset, then assign the unique ID

4. Implementation Details

The proposed system uses the DBLP-ACM and DBLP datasets as benchmark datasets. The dataset contains information on 3000 authors and is redundant: author information is replicated, and several authors share the same information. To remove the redundancy, two algorithms are used: Jaro-Winkler and Levenshtein. The names of the authors are compared at fixed similarity thresholds of 0.9, 0.8, 0.7 and 0.6. The obtained values show that the proposed system is more effective than the existing system for the evaluated algorithms. The results indicate that Jaro-Winkler performs better than Levenshtein.

For this experiment, the software environment is the Windows 10 operating system, with NetBeans IDE 8.0 as the front end, Microsoft Access as the back end, and Graphviz as the tool. The hardware used is an Intel Core i3-2330M processor, a 320 GB hard disk and 2 GB of RAM.

The results of the experiments are shown in the figures. Figures 2 to 6 show naming disambiguation for 3500, 2500, 2000, 1500 and 1000 authors, respectively.
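The threshold rules of Section 3 and the Jaro-Winkler comparison named in Section 4 can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which used NetBeans and Microsoft Access): the `Author` record, the `same_name` and `assign_ids` helpers, the prefix scale of 0.1 and the prefix cap of 4 are all choices made for this example, and the follow-up comparison of the other fields (affiliation, address, year) is omitted.

```python
from dataclasses import dataclass


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # how far apart matches may be
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


@dataclass
class Author:  # hypothetical record; field names are assumptions
    first: str
    middle: str
    surname: str


def same_name(candidate: Author, known: Author, threshold: float = 0.8) -> bool:
    """Surname, then middle name, then first name must each reach the threshold."""
    for field in ("surname", "middle", "first"):
        left = getattr(candidate, field).strip().lower()
        right = getattr(known, field).strip().lower()
        if jaro_winkler(left, right) < threshold:
            return False  # Rule 1: dissimilar
    return True  # Rule 2: similar (other fields would be checked here)


def assign_ids(authors: list[Author], threshold: float = 0.8) -> list[int]:
    """Give each record the ID of the first similar known author, else a new ID."""
    registry: list[tuple[int, Author]] = []
    ids: list[int] = []
    for author in authors:
        for uid, known in registry:
            if same_name(author, known, threshold):
                ids.append(uid)
                break
        else:
            uid = len(registry) + 1
            registry.append((uid, author))
            ids.append(uid)
    return ids
```

Passing a different `threshold` (0.9, 0.7 or 0.6) to `assign_ids` mirrors the threshold levels compared in the experiments. In this sketch two empty middle names count as identical, while an empty middle name compared with a non-empty one scores 0 and is treated as dissimilar.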

Figure 7 shows naming disambiguation for 750 authors.

5. Conclusion

A co-authorship network models the co-authorship of scientific publications as a network. In such a network, name ambiguity is an issue because it prevents deriving a unique set of authors. This work therefore focuses on the use of the Jaro-Winkler and Levenshtein algorithms. Comparing the results of the two algorithms, Jaro-Winkler is found to be more efficient than Levenshtein at reducing the name ambiguity problem.

6. References

[1] Neil R. Smalheiser, Vetle I. Torvik, Author Name Disambiguation, Annual Review of Information Science and Technology (ARIST), Volume 43.
[2] Hui Han, Hongyuan Zha, C. Lee Giles, Name Disambiguation in Author Citations using a K-way Spectral Clustering Method, International Conference on Digital Libraries.
[3] Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles, Efficient Topic-based Unsupervised Name Disambiguation, Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries.
[4] Baichuan Zhang, Tanay Kumar Saha, Mohammad Al Hasan, Name Disambiguation from Link Data in a Collaboration Graph, ASONAM.
[5] Vetle I. Torvik, Neil R. Smalheiser, Author Name Disambiguation in MEDLINE, ACM Transactions on Knowledge Discovery from Data.
[6] Chong Long, Lei Shi, Person Name Disambiguation by Relevance Weighting of Extended Feature Sets, CLEF (Notebook Papers/LABs/Workshops).
[7] Xianpei Han, Jun Zhao, Personal Name Disambiguation Based on Reference Entity Tables Mined from the Web, Proceedings of the Eleventh International Workshop on Web Information and Data Management.
[8] In-Su Kang, Seung-Hoon Na, Seungwoo Lee, Hanmin Jung, Pyung Kim, Won-Kyung Sung, Jong-Hyeok Lee, On co-authorship for author disambiguation, Information Processing and Management: an International Journal, Volume 45, Issue 1, January.
[9] Gideon S. Mann, David Yarowsky, Unsupervised Personal Name Disambiguation, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Volume 4.
[10] Duncan M. McRae-Spencer, Nigel R. Shadbolt, Also By The Same Author: AKTiveAuthor, a Citation Graph Approach to Name Disambiguation, Proceedings of the ACM/IEEE Joint Conference on Digital Libraries.
[11] Li Tang, John P. Walsh, Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps, Scientometrics, Volume 84, Issue 3.
[12] Ronald Lai, Alexander D'Amour, Amy Yu, Ye Sun, Vetle Torvik, Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database, Research Policy, Volume 43, Issue 6, July.
[13] Jiashen Sun, Tianmin Wang, Li Li, Xing Wu, Person Name Disambiguation based on Topic Model, Joint Conference on Chinese Language Processing.
