Naming Disambiguation Based on Approximate String Matching for Co-Authorship Networks

Dr. V. Akila, Dept. of Computer Science & Engg., akila@pec.edu
Dr. V. Govindasamy, Dept. of Information Technology, vgopu@pec.edu
R. Kowsalya, Dept. of Computer Science & Engg., kowsiamu@pec.edu

Abstract

A co-authorship network models the co-authorship of scientific publications as a network. Name disambiguation is an important aspect of co-authorship analysis. Finding a unique author in a co-authorship network is challenging: multiple persons may share the same name, and names appear with abbreviations, misspellings, and other variations. Further, human error leads to multiple persons being treated under a single reference. Such mistakes degrade the performance of finding a unique author. Author disambiguation assigns a unique identifier to each distinct author. This paper proposes a name disambiguation model based on the approximate string matching algorithms Jaro-Winkler and Levenshtein similarity. The proposed name disambiguation system assigns a unique ID to each unique author.

Keywords: Disambiguation; Co-Authorship Network; Approximate String Matching

1. Introduction

In a co-authorship network, multiple authors collaboratively publish their research work. Identifying unique authors in this setting is difficult. Name disambiguation is an important step in reducing redundant data. After name disambiguation, each author's information is unique and carries a unique ID, which is more useful for further processing of the co-authorship network. Consider some examples. In "jean francois" vs "jeanfrancois", the names differ only by a space. "zak" vs "zakaria" and "dave" vs "david" show a partial name or a nickname that still refers to the same author. "bjoern" vs "bjorn" is a spelling variation that belongs to the same person. Permuted tokens, as in "kim. jim" vs "kim, jim", also refer to the same person. If a data set with such variants is used to compute collaboration efficiency, the performance of the system drops. Preprocessing the author names is therefore the primary step in extracting the underlying information from the co-authorship network.
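The name variants listed in this introduction can be scored with the two measures named in the abstract. The following is a minimal Python sketch, not the authors' implementation. The function names, the normalization of the Levenshtein distance by the length of the longer string, and the Winkler prefix scale of 0.1 with a four-character prefix cap are assumptions chosen for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalized into [0, 1] (assumed normalization); 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity from matched characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    a_chars = [a[i] for i in range(len(a)) if a_matched[i]]
    b_chars = [b[j] for j in range(len(b)) if b_matched[j]]
    transpositions = sum(x != y for x, y in zip(a_chars, b_chars)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_similarity(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)


if __name__ == "__main__":
    # Name pairs taken from the introduction.
    pairs = [("jean francois", "jeanfrancois"), ("zak", "zakaria"),
             ("dave", "david"), ("bjoern", "bjorn")]
    for left, right in pairs:
        print(left, right,
              round(levenshtein_similarity(left, right), 3),
              round(jaro_winkler_similarity(left, right), 3))
```

Jaro-Winkler rewards a shared prefix, so it tends to score abbreviations such as "zak" vs "zakaria" higher than a length-normalized edit distance does. This is one reason the two measures can behave differently at the same threshold.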
Assigning unique indexes to authors is necessary. The first step is to collect a dataset containing the following information: first name, last name, middle name, year of publication, journal, co-authors' names, affiliation, and address. The full name string is taken first, for example smith, john hoy -> (smith, J.L). The tokens, which are separated by characters such as . , & # $, are then counted. The last name (surname) is compared against the dataset first, and the unique names are separated. The remaining names are compared on the middle name and the surname, and the unique names are again separated. The names that still remain are compared on all combinations of first name, middle name, and surname. Any remaining records are compared on the other information: citation, address, and year of publication. This yields a dataset in which each unique author has a unique ID. The proposed system is a name disambiguation method based on approximate string matching. Name disambiguation is usually performed using the full name and the actual name.

2. Related Work

Author name disambiguation [1] identifies authors using MEDLINE and establishes a unique registry of unique identifiers. It also uses manual calculations through supervised and unsupervised approaches. On co-authorship for author disambiguation [2] uses unsupervised learning with spectral clustering for name disambiguation. Name disambiguation in author citations using a k-way spectral clustering method [3] applies k-way spectral clustering based on eigenvalues and eigenvectors, and disambiguation is performed from these values. Efficient topic-based unsupervised name disambiguation [4] uses Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and topic-based LDA to match the content of web pages, using 11,000,000 pages and the Yahoo database for disambiguation. Name disambiguation is needed to resolve multiple persons having the same information.

Name disambiguation from link data in a collaboration graph [5] uses cluster-based entity disambiguation: identical data, and the information about that data spread across different clusters, are collected into metadata, from which the information is easy to extract. The DBLP and Arnetminer datasets are used in this method. Author name disambiguation in MEDLINE [6] proposes disambiguation using MEDLINE. The dataset contains first, middle, and last initials; the middle name is treated as the distinguishing field, and name disambiguation is performed through it. Person name disambiguation by relevance weighting of extended feature sets [7] uses extended feature sets and support-vector-based feature weighting to measure the relevance of features to the query. Clusters of words and named entities are the most commonly used techniques in existing web entity disambiguation. Name disambiguation based on personal reference entity tables mined from the web [8] describes a web querying method that mines links with entity-based methods. Person names are disambiguated by linking person entities with the mined tables through categorization. The authors describe pairwise similarity computed by supervised and unsupervised methods, which covers matching synonymous names and resolving homonymous names. Pairwise author similarity is the degree to which two name instances appearing in two different articles belong to the same person.
Unsupervised personal name disambiguation [9] is based on an unsupervised clustering technique. Disambiguation and co-authorship networks of the U.S. patent inventor database [10] uses a Bayesian supervised learning approach, and the metrics used are precision, recall, and F-measure. The survey is summarized in Table 1.

Table 1: Summary of the Survey

S. No. | Paper | Key features | Metrics
1 | Author name disambiguation | MEDLINE; establishes a unique registry of unique identifiers; manual calculations | Supervised and unsupervised approaches
2 | On co-authorship for author disambiguation | Feature selection |
3 | Name disambiguation in author citations using a k-way spectral clustering method | Spectral clustering |
4 | Efficient topic-based unsupervised name disambiguation | Probabilistic latent semantic analysis (PLSA) | Probabilistic latent semantic analysis
5 | Name disambiguation from link data in a collaboration graph | Cluster-based entity disambiguation |
6 | Author name disambiguation in MEDLINE | MEDLINE |
7 | Person name disambiguation by relevance weighting of extended feature sets | Measures the relevance of features to the query and to the text content | Clusters of words and named entities, the most commonly used techniques in existing web entity disambiguation
8 | Personal name disambiguation based on reference entity tables mined from the web | Querying method to mine links with entity-based methods |
9 | Unsupervised personal name disambiguation | Pairwise similarity by supervised and unsupervised methods | Unsupervised clustering technique
10 | Disambiguation and co-authorship networks of the U.S. patent inventor database | Bayesian supervised learning approach | Precision, recall, and F-measure

Domain (row alignment not recoverable from the source): Data mining; Co-authorship network.

3. Architecture Diagram

Figure 1: Naming Disambiguation

The extracted surname is compared with the surname in the database against a similarity threshold of 0.8. If the condition holds, the extracted middle name is compared with the middle name in the database against the same threshold of 0.8. If that condition holds, the first name is compared with the first name in the database. If the comparisons find the names dissimilar, a new unique ID is assigned. If all three are considered similar, the other fields in the database are checked, and then the unique ID is assigned.

Rule 1:
If similarity(SURNAME, SURNAME_DB) >= 0.8, similarity(MIDDLE NAME, MIDDLE NAME_DB) >= 0.8, and similarity(FIRST NAME, FIRST NAME_DB) >= 0.8 do not all hold, then
    Dissimilar: assign a new unique ID

Rule 2:
If similarity(SURNAME, SURNAME_DB) >= 0.8 then
    If similarity(MIDDLE NAME, MIDDLE NAME_DB) >= 0.8 then
        If similarity(FIRST NAME, FIRST NAME_DB) >= 0.8 then
            Similar: compare the other fields in the dataset, and then assign the unique ID

4. Implementation Details

The proposed system uses the DBLP-ACM and DBLP datasets as the benchmark. The dataset contains information on 3000 authors and is redundant: author information is replicated, and many authors share the same information. To remove the redundancy, two algorithms are used: Jaro-Winkler and Levenshtein. The author names are compared at fixed similarity thresholds of 0.9, 0.8, 0.7, and 0.6. The values obtained show that the proposed system is more effective than the existing system for the evaluated algorithms. The results also show that Jaro-Winkler performs better than Levenshtein.

For this experiment, the software environment is the Windows 10 operating system, with NetBeans IDE 8.0 as the front end, Microsoft Access as the back end, and Graphviz as the visualization tool. The hardware is an Intel Core i3-2330M processor, a 320 GB hard disk, and 2 GB of RAM. The results of the experiments are shown in Figures 2 to 7.

Figure 2: Naming disambiguation for 3500 authors
Figure 3: Naming disambiguation for 2500 authors
Figure 4: Naming disambiguation for 2000 authors
Figure 5: Naming disambiguation for 1500 authors
Figure 6: Naming disambiguation for 1000 authors
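The cascaded comparison of Section 3 and the thresholds tried in Section 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code. The record layout, the `other_fields_agree` tie-breaker (same affiliation or a shared co-author), the sample records, and the choice to reuse the ID of the first matching earlier record are all hypothetical. A length-normalized Levenshtein similarity stands in for whichever measure is being evaluated.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] (assumed normalization); 1.0 means identical."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def names_similar(record, candidate, threshold=0.8,
                  similarity=levenshtein_similarity):
    """Cascade: surname first, then middle name, then first name."""
    for part in ("surname", "middle", "first"):
        if similarity(record[part], candidate[part]) < threshold:
            return False  # a failed step is treated as "dissimilar" (assumption)
    return True  # all three steps passed: the caller checks the other fields


def other_fields_agree(record, candidate):
    """Hypothetical check of the remaining fields: affiliation or co-authors."""
    return (record["affiliation"] == candidate["affiliation"]
            or bool(set(record["coauthors"]) & set(candidate["coauthors"])))


def assign_ids(records, threshold=0.8):
    """Return one integer ID per record, reusing an earlier record's ID on a match."""
    ids = []
    for index, record in enumerate(records):
        assigned = None
        for earlier_index in range(index):
            earlier = records[earlier_index]
            if (names_similar(record, earlier, threshold)
                    and other_fields_agree(record, earlier)):
                assigned = ids[earlier_index]
                break
        if assigned is None:
            assigned = max(ids, default=0) + 1  # new unique ID
        ids.append(assigned)
    return ids


# Invented sample records, for illustration only.
sample_records = [
    {"surname": "hansen", "middle": "", "first": "bjoern",
     "affiliation": "univ a", "coauthors": ["kim"]},
    {"surname": "hansen", "middle": "", "first": "bjorn",
     "affiliation": "univ a", "coauthors": ["lee"]},
    {"surname": "hansen", "middle": "", "first": "bjorn",
     "affiliation": "univ b", "coauthors": ["park"]},
    {"surname": "kumar", "middle": "", "first": "ravi",
     "affiliation": "univ a", "coauthors": ["kim"]},
]

if __name__ == "__main__":
    # Sweep the thresholds reported in Section 4 and count distinct authors.
    for threshold in (0.9, 0.8, 0.7, 0.6):
        ids = assign_ids(sample_records, threshold)
        print(threshold, ids, len(set(ids)))
```

On these sample records, a threshold of 0.9 keeps "bjoern" and "bjorn" apart, while 0.8 merges them when the affiliation also matches. This shows how the threshold trades missed merges against false merges.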

Figure 7: Naming disambiguation for 750 authors

5. Conclusion

A co-authorship network models the co-authorship of scientific publications as a network. Name ambiguity is an issue in such networks, because without resolving it a unique set of authors cannot be derived. This work therefore focuses on the use of the Jaro-Winkler and Levenshtein algorithms. Comparing the results of the two algorithms shows that Jaro-Winkler is more efficient than Levenshtein at reducing the name ambiguity problem.

6. References

[1] Neil R. Smalheiser, Vetle I. Torvik, Author Name Disambiguation, Annual Review of Information Science and Technology (ARIST), Volume 43.
[2] Hui Han, Hongyuan Zha, C. Lee Giles, Name Disambiguation in Author Citations using a K-way Spectral Clustering Method, International Conference on Digital Libraries.
[3] Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles, Efficient Topic-based Unsupervised Name Disambiguation, Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries.
[4] Baichuan Zhang, Tanay Kumar Saha, Mohammad Al Hasan, Name Disambiguation from Link Data in a Collaboration Graph, ASONAM.
[5] Vetle I. Torvik, Neil R. Smalheiser, Author Name Disambiguation in MEDLINE, ACM Transactions on Knowledge Discovery from Data.
[6] Chong Long, Lei Shi, Person Name Disambiguation by Relevance Weighting of Extended Feature Sets, CLEF (Notebook Papers/LABs/Workshops).
[7] Xianpei Han, Jun Zhao, Personal Name Disambiguation Based on Reference Entity Tables Mined from the Web, Proceedings of the Eleventh International Workshop on Web Information and Data Management.
[8] In-Su Kang, Seung-Hoon Na, Seungwoo Lee, Hanmin Jung, Pyung Kim, Won-Kyung Sung, Jong-Hyeok Lee, On Co-authorship for Author Disambiguation, Information Processing and Management: an International Journal, Volume 45, Issue 1, January.
[9] Gideon S. Mann, David Yarowsky, Unsupervised Personal Name Disambiguation, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Volume 4.
[10] Duncan M. McRae-Spencer, Nigel R. Shadbolt, Also By The Same Author: AKTiveAuthor, a Citation Graph Approach to Name Disambiguation, Proceedings of the ACM/IEEE Joint Conference on Digital Libraries.
[11] Li Tang, John P. Walsh, Bibliometric Fingerprints: Name Disambiguation Based on Approximate Structure Equivalence of Cognitive Maps, Scientometrics, Volume 84, Issue 3.
[12] Ronald Lai, Alexander D'Amour, Amy Yu, Ye Sun, Vetle Torvik, Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database, Research Policy, Volume 43, Issue 6, July.
[13] Jiashen Sun, Tianmin Wang, Li Li, Xing Wu, Person Name Disambiguation Based on Topic Model, Joint Conference on Chinese Language Processing.


More information

Scholarly Big Data: Leverage for Science

Scholarly Big Data: Leverage for Science Scholarly Big Data: Leverage for Science C. Lee Giles The Pennsylvania State University University Park, PA, USA giles@ist.psu.edu http://clgiles.ist.psu.edu Funded in part by NSF, Allen Institute for

More information

A FUZZY NAIVE BAYESCLASSIFICATION USING CLASS SPECIFIC FEATURES FOR TEXT CATEGORIZATION

A FUZZY NAIVE BAYESCLASSIFICATION USING CLASS SPECIFIC FEATURES FOR TEXT CATEGORIZATION A FUZZY NAIVE BAYESCLASSIFICATION USING CLASS SPECIFIC FEATURES FOR TEXT CATEGORIZATION V.Bharathi 1, P.K.Jayanivetha 2, K.Kanniga 3, D.Sharmilarani 4 1 (Dept of CSE, UG scholar, Sri Krishna College of

More information

Visualization and text mining of patent and non-patent data

Visualization and text mining of patent and non-patent data of patent and non-patent data Anton Heijs Information Solutions Delft, The Netherlands http://www.treparel.com/ ICIC conference, Nice, France, 2008 Outline Introduction Applications on patent and non-patent

More information

arxiv: v1 [cs.ir] 10 Aug 2018

arxiv: v1 [cs.ir] 10 Aug 2018 Effective Unsupervised Author Disambiguation with Relative Frequencies Tobias Backes GESIS - Leibniz-Institute for the Social Sciences tobias.backes@gesis.org arxiv:188.4216v1 [cs.ir] 1 Aug 218 ABSTRACT

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22 Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task Junjun Wang 2013/4/22 Outline Introduction Related Word System Overview Subtopic Candidate Mining Subtopic Ranking Results and Discussion

More information

Tag Based Image Search by Social Re-ranking

Tag Based Image Search by Social Re-ranking Tag Based Image Search by Social Re-ranking Vilas Dilip Mane, Prof.Nilesh P. Sable Student, Department of Computer Engineering, Imperial College of Engineering & Research, Wagholi, Pune, Savitribai Phule

More information

K-indicators Method for Community Detection in Social Networks

K-indicators Method for Community Detection in Social Networks Int. J. Advance Soft Compu. Appl, Vol. 8, No. 3, December 2016 ISSN 2074-8523 K-indicators Method for Community Detection in Social Networks Mohammad H. Nadimi-Shahraki 1, Mehrafarin Adami-Dehkordi 1 1

More information

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 02, February -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Survey

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University

More information

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction International Journal of Engineering Science Invention Volume 2 Issue 1 January. 2013 An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction Janakiramaiah Bonam 1, Dr.RamaMohan

More information

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Sheetal K. Labade Computer Engineering Dept., JSCOE, Hadapsar Pune, India Srinivasa Narasimha

More information

Metadata Extraction with Cue Model

Metadata Extraction with Cue Model Metadata Extraction with Cue Model Wan Malini Wan Isa 2, Jamaliah Abdul Hamid 1, Hamidah Ibrahim 2, Rusli Abdullah 2, Mohd. Hasan Selamat 2, Muhamad Taufik Abdullah 2 and Nurul Amelina Nasharuddin 2 1

More information

Enriching an Authority File of Scientific Conferences with Information Extracted from the Web

Enriching an Authority File of Scientific Conferences with Information Extracted from the Web Journal of Computer Sciences Original Research Paper Enriching an Authority File of Scientific Conferences with Information Extracted from the Web Heider Alvarenga de Jesus and Denilson Alves Pereira Department

More information

Community Detection Using Node Attributes and Structural Patterns in Online Social Networks

Community Detection Using Node Attributes and Structural Patterns in Online Social Networks Computer and Information Science; Vol. 10, No. 4; 2017 ISSN 1913-8989 E-ISSN 1913-8997 Published by Canadian Center of Science and Education Community Detection Using Node Attributes and Structural Patterns

More information

Learning to find transliteration on the Web

Learning to find transliteration on the Web Learning to find transliteration on the Web Chien-Cheng Wu Department of Computer Science National Tsing Hua University 101 Kuang Fu Road, Hsin chu, Taiwan d9283228@cs.nthu.edu.tw Jason S. Chang Department

More information

Efficient Name Disambiguation for Large-Scale Databases

Efficient Name Disambiguation for Large-Scale Databases Efficient Name Disambiguation for Large-Scale Databases Jian Huang 1,SeydaErtekin 2, and C. Lee Giles 1,2 1 College of Information Sciences and Technology The Pennsylvania State University, University

More information

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

A Bayesian Approach to Hybrid Image Retrieval

A Bayesian Approach to Hybrid Image Retrieval A Bayesian Approach to Hybrid Image Retrieval Pradhee Tandon and C. V. Jawahar Center for Visual Information Technology International Institute of Information Technology Hyderabad - 500032, INDIA {pradhee@research.,jawahar@}iiit.ac.in

More information

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques Doaa M. Alebiary Department of computer Science, Faculty of computers and informatics Benha University

More information

Medical Records Clustering Based on the Text Fetched from Records

Medical Records Clustering Based on the Text Fetched from Records Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Context Based Web Indexing For Semantic Web

Context Based Web Indexing For Semantic Web IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 4 (Jul. - Aug. 2013), PP 89-93 Anchal Jain 1 Nidhi Tyagi 2 Lecturer(JPIEAS) Asst. Professor(SHOBHIT

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia Semantic Annotation of Web Resources Using IdentityRank and Wikipedia Norberto Fernández, José M.Blázquez, Luis Sánchez, and Vicente Luque Telematic Engineering Department. Carlos III University of Madrid

More information

Object Distinction: Distinguishing Objects with Identical Names

Object Distinction: Distinguishing Objects with Identical Names Object Distinction: Distinguishing Objects with Identical Names Xiaoxin Yin Univ. of Illinois xyin1@uiuc.edu Jiawei Han Univ. of Illinois hanj@cs.uiuc.edu Philip S. Yu IBM T. J. Watson Research Center

More information

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important

More information

A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo

A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo Olatz Arbelaitz, Ibai Gurrutxaga, Aizea Lojo, Javier Muguerza, Jesús M. Pérez and Iñigo

More information

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL Shwetha S P 1 and Alok Ranjan 2 Visvesvaraya Technological University, Belgaum, Dept. of Computer Science and Engineering, Canara

More information

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 3, March -2017 A Facebook Profile Based TV Shows and Movies Recommendation

More information