Relational Clustering for Multi-type Entity Resolution
1 Relational Clustering for Multi-type Entity Resolution. Indrajit Bhattacharya and Lise Getoor, Department of Computer Science, University of Maryland. Presented by Martin Leginus, 13th of March, 2013.
2 Agenda: Motivation, Related work, Use case scenarios, Problem formulation, Relational clustering, Similarity measures, Results, Discussion.
3 Why is there a need for entity resolution? The correspondence problem: deciding whether two pictures refer to the same entity. Natural language processing: recognizing which noun phrases refer to the same entity. Data preprocessing: detecting duplicates.
4 Why is there a need for relational entity resolution? Traditional approaches utilize textual similarity measures. Relational evidence might improve the accuracy of the resolution. [Figure, from "Collective Entity Resolution in Relational Data", Fig. 1: (a) a reference graph for a simple example over name references such as Jim Doe, J Doe, James Doe, Jon Doe, Jonathan Doe, Jason Doe, Jackie Doe, Jeanette Doe and Jean Doe, and (b) the resolved entity graph.]
5 Related work. Textual similarity calculated for the descriptions of two entities. Supervised algorithms that learn string similarity measures from labelled data. Performance is improved with a blocking approach. Relational features have been considered for data integration problems.
6 Use case example. Two citations of the same paper: (1) "Fast algorithms for mining association rules in large databases. Agrawal, Rakesh and Srikant, Ramakrishnan. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994." (2) "Fast algorithms for mining association rules. Agrawal, R., Srikant, R. in VLDB-94, 1994." String edit distance alone does not work here. This is a multiple entity resolution problem, i.e., author, paper and venue entities all have to be resolved.
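A small sketch can make the remark about string edit distance concrete. The code below is an illustration only and is not taken from the slides: it implements plain Levenshtein distance, normalizes it by the longer string (the normalization is an assumption), and applies it to the two venue strings from the citations above. The function names `edit_distance` and `normalized_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / max length (assumed normalization); 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# The two venue strings from the citations of the same paper.
venue_1 = "In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994"
venue_2 = "in VLDB-94, 1994"
venue_similarity = normalized_similarity(venue_1.lower(), venue_2.lower())
print(f"venue similarity: {venue_similarity:.2f}")
```

The two strings name the same venue, yet the score stays low because one string is several times longer than the other. This is the kind of case where relational evidence (the shared authors and paper) can help.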
7 Joint resolution using entity relations. [Figure: the two citations drawn as hyper-edges h1 and h2 over references r1 to r8 (the author names "R Agrawal", "R Srikant", "Rakesh Agrawal" and "Ramakrishnan Srikant", the paper titles "Fast Algorithms for Mining Association Rules" and "Fast Algorithms for Mining Association Rules in Large Databases", and the venue strings "VLDB 94" and "Proc of the 20th Intl. Conference..."), linked to the entities e1 to e4.] Local and global resolution. Positive and negative relational evidence.
8 Problem formulation. Entities and references are denoted by $e$ and $r$. Their attributes are denoted by $e.a$ and $r.a$. References are typed, and the type $r.t$ is observed. Each reference $r$ corresponds to a hidden entity, so each $r$ has an entity label $r.e$. The problem is to discover the hidden set of entities $E = \{e_i\}$ and the entity label $r.e$ of each reference. References are observed as members of hyper-edges. The membership of a reference is stored in the hyper-edge label $r.h = h$ (if reference $r \in h$).
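The notation above can be written down as a small data model. This is a sketch under assumptions rather than the authors' implementation: the `Reference` class, the `rid` field and the assignment of the identifiers r1 to r8 to particular strings are hypothetical readings of the running citation example. The entity labels, which are hidden in the real problem, appear here only as the target grouping.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    rid: str  # hypothetical identifier, e.g. "r1"
    t: str    # observed type r.t
    a: str    # observed attribute r.a
    h: str    # hyper-edge label r.h


# Assumed mapping of identifiers to the strings of the two citations.
refs = [
    Reference("r1", "author", "R Agrawal", "h1"),
    Reference("r2", "author", "R Srikant", "h1"),
    Reference("r3", "paper", "Fast Algorithms for Mining Association Rules", "h1"),
    Reference("r4", "venue", "VLDB 94", "h1"),
    Reference("r5", "author", "Rakesh Agrawal", "h2"),
    Reference("r6", "author", "Ramakrishnan Srikant", "h2"),
    Reference("r7", "paper",
              "Fast Algorithms for Mining Association Rules in Large Databases", "h2"),
    Reference("r8", "venue", "Proc of the 20th Intl. Conference...", "h2"),
]

# Observed structure: which references co-occur in which hyper-edge.
hyper_edges = defaultdict(list)
for r in refs:
    hyper_edges[r.h].append(r.rid)

# Hidden structure to be discovered: the entity label r.e of each reference.
entity_of = {"r1": "e1", "r5": "e1", "r2": "e2", "r6": "e2",
             "r3": "e3", "r7": "e3", "r4": "e4", "r8": "e4"}
entities = defaultdict(set)
for rid, e in entity_of.items():
    entities[e].add(rid)

print(dict(hyper_edges))
print(dict(entities))
```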
9 Problem formulation (example). [Figure: the two citations as hyper-edges h1 and h2 over references r1 to r8.] The set of hidden entities is $E = \{e_1, e_2, e_3, e_4\}$, where $r_1.e = r_5.e = e_1$, $r_2.e = r_6.e = e_2$, $r_3.e = r_7.e = e_3$ and $r_4.e = r_8.e = e_4$.
10 Resolution by clustering. The goal is to group all references corresponding to the same entity into one cluster. The cluster membership of a reference is represented by $r.c$. All references in a cluster are of the same type. (1) At the beginning, each reference belongs to its own cluster. (2) At each step, the pair of clusters with the highest similarity, i.e., the pair most likely to be the same entity, is merged. The general similarity is defined as $sim(c_i, c_j) = (1 - \alpha)\, sim_{attr}(c_i, c_j) + \alpha\, sim_{rel}(c_i, c_j)$, where $0 \le \alpha \le 1$.
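The weighted combination can be sketched in a few lines. The function name `combined_similarity` and the example scores are assumptions chosen for illustration. The formula itself is the convex combination of attribute and relational similarity with weight alpha.

```python
def combined_similarity(sim_attr: float, sim_rel: float, alpha: float) -> float:
    """Blend attribute and relational similarity; alpha weights the relational part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * sim_attr + alpha * sim_rel


# Hypothetical scores: weak textual match, strong relational match.
blended = combined_similarity(sim_attr=0.4, sim_rel=1.0, alpha=0.5)
print(blended)
```

With alpha = 0 the measure reduces to attribute similarity only, and with alpha = 1 it uses relational evidence only.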
11 Attribute and relational similarity. Attribute similarity: any basic similarity measure for two reference attributes. The similarity of two clusters is calculated between the most representative attributes of those clusters. Relational similarity: a measure between two clusters that considers the clusters they link to via observed edges. Two variants are used: edge detail similarity and neighborhood similarity.
12 Edge detail similarity. Each cluster is associated with a set of hyper-edges: $c.h = \{h \mid r.h = h \wedge r.c = c\}$. The similarity between two hyper-edges is defined as $sim(h_i, h_j) = \sum_t sim_t(h_i, h_j)$, where $sim_t(h_i, h_j) = \mathrm{Jaccard}(\pi_t(h_i), \pi_t(h_j))$ and $\pi_t(h) = \{c \mid r.c = c \wedge c.t = t \wedge r.h = h\}$. The final similarity is defined as $sim_{rel}(c_i, c_j) = \max_{(h_i, h_j)} \{sim(h_i, h_j)\}$, where $h_i \in c_i.h$ and $h_j \in c_j.h$.
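The edge detail measure can be sketched as follows. This is a sketch under stated assumptions: the per-type Jaccard scores are summed over types, the dictionaries `edge_refs`, `cluster_of` and `type_of` are a hypothetical encoding of the observed hyper-edges and the current clustering, and the toy data assumes the two Srikant references have already been merged into one cluster `cS`.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def project(h, t, edge_refs, cluster_of, type_of):
    """pi_t(h): clusters of type t that have a reference in hyper-edge h."""
    return {cluster_of[r] for r in edge_refs[h] if type_of[cluster_of[r]] == t}


def edge_similarity(hi, hj, types, edge_refs, cluster_of, type_of):
    """Per-type Jaccard of the projected hyper-edges, summed over types (assumed)."""
    return sum(
        jaccard(project(hi, t, edge_refs, cluster_of, type_of),
                project(hj, t, edge_refs, cluster_of, type_of))
        for t in types)


def edges_of(c, edge_refs, cluster_of):
    """c.h: hyper-edges containing at least one reference of cluster c."""
    return {h for h, refs in edge_refs.items()
            if any(cluster_of[r] == c for r in refs)}


def edge_detail_similarity(ci, cj, types, edge_refs, cluster_of, type_of):
    """Best-matching pair of hyper-edges, one from each cluster."""
    return max(
        (edge_similarity(hi, hj, types, edge_refs, cluster_of, type_of)
         for hi in edges_of(ci, edge_refs, cluster_of)
         for hj in edges_of(cj, edge_refs, cluster_of)),
        default=0.0)


# Toy state: r2 and r6 (the Srikant references) already share cluster "cS".
edge_refs = {"h1": ["r1", "r2", "r3", "r4"], "h2": ["r5", "r6", "r7", "r8"]}
cluster_of = {"r1": "c1", "r2": "cS", "r3": "c3", "r4": "c4",
              "r5": "c5", "r6": "cS", "r7": "c7", "r8": "c8"}
type_of = {"c1": "author", "c5": "author", "cS": "author",
           "c3": "paper", "c7": "paper", "c4": "venue", "c8": "venue"}
types = ["author", "paper", "venue"]

score = edge_detail_similarity("c1", "c5", types, edge_refs, cluster_of, type_of)
print(score)
```

In this toy state the two Agrawal clusters `c1` and `c5` receive a non-zero relational score only because their hyper-edges share the already merged co-author cluster. Merging the paper or venue clusters would raise the score further.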
13 Neighborhood similarity. The similarity between two clusters is defined as $sim_{rel}(c_i, c_j) = \mathrm{Jaccard}(N_t(c_i), N_t(c_j))$, where $N_t(c) = \bigcup_m \pi_t(h),\ h \in c.h$, and $\bigcup_m$ denotes multiset union. The obtained neighborhoods are multisets.
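Because the neighborhoods are multisets, the Jaccard coefficient needs a multiset form. The sketch below assumes the common generalization (sum of element-wise minima over sum of element-wise maxima), which the slides do not spell out. The helper names and the toy data are hypothetical.

```python
from collections import Counter


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Assumed multiset Jaccard: sum of min counts over sum of max counts."""
    intersection = sum((a & b).values())  # Counter & keeps the minimum count
    union = sum((a | b).values())         # Counter | keeps the maximum count
    return intersection / union if union else 0.0


def neighborhood(c, t, edge_refs, cluster_of, type_of) -> Counter:
    """N_t(c): multiset union of pi_t(h) over all hyper-edges h touching c."""
    n = Counter()
    for refs in edge_refs.values():
        clusters = {cluster_of[r] for r in refs}
        if c in clusters:
            n.update(x for x in clusters if type_of[x] == t)
    return n


# Toy state: cluster "A" appears in two hyper-edges, "A2" in one; "S" is in all three.
edge_refs = {"h1": ["a1", "s1"], "h2": ["a2", "s2"], "h3": ["a3", "s3"]}
cluster_of = {"a1": "A", "a2": "A", "a3": "A2", "s1": "S", "s2": "S", "s3": "S"}
type_of = {"A": "author", "A2": "author", "S": "author"}

n_a = neighborhood("A", "author", edge_refs, cluster_of, type_of)
n_a2 = neighborhood("A2", "author", edge_refs, cluster_of, type_of)
nbr_score = multiset_jaccard(n_a, n_a2)
print(n_a, n_a2, nbr_score)
```

Unlike the edge detail measure, this variant pools all hyper-edges of a cluster first, so the information about which neighbors co-occurred in the same hyper-edge is discarded.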
14 Implementation. Greedy agglomerative clustering merges the closest cluster pair at each step. Candidate pairs are kept in a priority queue ordered by similarity, and a blocking approach limits which pairs become candidates. During the initial (bootstrapping) phase, references are merged if their attributes are identical ($v_1 = v_2$) or if one is the initialed form of the other.
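The greedy loop with a priority queue can be sketched as below. It is a minimal illustration, not the authors' implementation: blocking and bootstrapping are left out, stale queue entries are skipped lazily (a design choice of this sketch), and the `threshold` parameter, `toy_ref_sim` and `single_link` are hypothetical stand-ins for the real stopping rule and similarity measure.

```python
import heapq
from itertools import combinations, count


def greedy_cluster(items, sim, threshold):
    """Repeatedly merge the most similar pair of clusters until none reaches threshold."""
    clusters = {frozenset([x]) for x in items}
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    heap = [(-sim(a, b), next(tie), a, b) for a, b in combinations(clusters, 2)]
    heapq.heapify(heap)
    while heap:
        neg_sim, _, a, b = heapq.heappop(heap)
        if -neg_sim < threshold:
            break  # every remaining pair is even less similar
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was merged earlier
        clusters -= {a, b}
        merged = a | b
        for other in clusters:
            heapq.heappush(heap, (-sim(merged, other), next(tie), merged, other))
        clusters.add(merged)
    return clusters


def toy_ref_sim(x: str, y: str) -> float:
    """Hypothetical attribute match: same last name and same first initial."""
    return 1.0 if x.split()[-1] == y.split()[-1] and x[0] == y[0] else 0.0


def single_link(a, b) -> float:
    """Cluster similarity as the best match between any two members."""
    return max(toy_ref_sim(x, y) for x in a for y in b)


names = ["R Agrawal", "Rakesh Agrawal", "R Srikant", "Ramakrishnan Srikant"]
result = greedy_cluster(names, single_link, threshold=0.5)
print(result)
```

Pushing fresh entries for each merged cluster and discarding outdated ones on pop avoids having to delete from the middle of the heap.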
15 Datasets and baseline methods. The CiteSeer dataset contains 2892 references with 1165 authors, contained in 1504 documents. The arXiv dataset contains references with 9200 authors, contained in papers. The baseline method ATTR is based on SoftTF-IDF, where the secondary distance measure can be Jaro-Winkler, Jaro or Scaled Levenshtein distance.
16 Accuracy results with different similarity measures
17 Precision, recall and F1 results for both datasets
18 Performance. [Figure: execution time. CPU time (secs) against number of references (in thousands) for ATTR*, RC (Nbr) with bootstrap, RC (Nbr) w/o bootstrap, RC (Edge) with bootstrap and RC (Edge) w/o bootstrap.]
19 Attribute vs relational similarity effects on accuracy. [Figure: best F against alpha for RC-ER (Nbr), RC-ER (Edge), ATTR and ATTR*. Panels: varying alpha with Jaro, Jaro-Winkler and Scaled Levenshtein for CiteSeer, and the same three measures for HEP.]
20 Conclusions. Two relational similarity measures were introduced. Relational similarity combined with attribute similarity outperforms the non-relational approaches. Bootstrapping and blocking were used successfully to improve performance.
Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park
Link Mining & Entity Resolution Lise Getoor University of Maryland, College Park Learning in Structured Domains Traditional machine learning and data mining approaches assume: A random sample of homogeneous
More informationCollective Entity Resolution in Relational Data
Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution
More informationQuery-Time Entity Resolution
Query-Time Entity Resolution Indrajit Bhattacharya University of Maryland, College Park MD, USA 20742 indrajit@cs.umd.edu Lise Getoor University of Maryland, College Park MD, USA 20742 getoor@cs.umd.edu
More informationQuery-time Entity Resolution
Journal of Artificial Intelligence Research 30 (2007) 621-657 Submitted 03/07; published 12/07 Query-time Entity Resolution Indrajit Bhattacharya IBM India Research Laboratory Vasant Kunj, New Delhi 110
More informationEntity Resolution over Graphs
Entity Resolution over Graphs Bingxin Li Supervisor: Dr. Qing Wang Australian National University Semester 1, 2014 Acknowledgements I would take this opportunity to thank my supervisor, Dr. Qing Wang,
More informationSQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE)
SQL Based Association Rule Mining using Commercial RDBMS (IBM DB2 UDB EEE) Takeshi Yoshizawa, Iko Pramudiono, Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 7-22-1 Roppongi,
More informationNovel Hybrid k-d-apriori Algorithm for Web Usage Mining
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 4, Ver. VI (Jul.-Aug. 2016), PP 01-10 www.iosrjournals.org Novel Hybrid k-d-apriori Algorithm for Web
More informationA framework of identity resolution: evaluating identity attributes and matching algorithms
Li and Wang Security Informatics (2015) 4:6 DOI 10.1186/s13388-015-0021-0 RESEARCH A framework of identity resolution: evaluating identity attributes and matching algorithms Jiexun Li 1 and Alan G. Wang
More informationEfficient Remining of Generalized Multi-supported Association Rules under Support Update
Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou
More informationUnderstanding Rule Behavior through Apriori Algorithm over Social Network Data
Global Journal of Computer Science and Technology Volume 12 Issue 10 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172
More informationOutline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect:
Outline Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries Dongwon Lee, Byung-Won On Penn State University, USA Jaewoo Kang North Carolina State University, USA
More informationPrivacy. University of Maryland, College Park
Graph Identification & Privacy Lise Getoor University of Maryland, College Park Stanford InfoSeminar January 16, 2009 Graphs and Networks everywhere The Web, social networks, communication networks, financial
More informationDeduplication of Hospital Data using Genetic Programming
Deduplication of Hospital Data using Genetic Programming P. Gujar Department of computer engineering Thakur college of engineering and Technology, Kandiwali, Maharashtra, India Priyanka Desai Department
More informationWhere we are. Exploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min)
Where we are Background (15 min) Graph models, subgraph isomorphism, subgraph mining, graph clustering Eploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min)
More informationImproving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique
Improving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique P.Nithya 1, V.Karpagam 2 PG Scholar, Department of Software Engineering, Sri Ramakrishna Engineering College,
More informationA Parallel Evolutionary Algorithm for Discovery of Decision Rules
A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl
More informationInformation Integration of Partially Labeled Data
Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de
More informationDatabase system development lifecycles
Database system development lifecycles 2009 Yunmook Nah Department of Electronics and Computer Engineering School of Computer Science & Engineering Dankook University 이석호 ä ± Á Ç ºÐ ¼ ¼³ è ± Çö î µ ½Ã
More informationCOFI Approach for Mining Frequent Itemsets Revisited
COFI Approach for Mining Frequent Itemsets Revisited Mohammad El-Hajj Department of Computing Science University of Alberta,Edmonton AB, Canada mohammad@cs.ualberta.ca Osmar R. Zaïane Department of Computing
More informationInternational Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015
International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Privacy Preservation Data Mining Using GSlicing Approach Mr. Ghanshyam P. Dhomse
More informationGenerating Cross level Rules: An automated approach
Generating Cross level Rules: An automated approach Ashok 1, Sonika Dhingra 1 1HOD, Dept of Software Engg.,Bhiwani Institute of Technology, Bhiwani, India 1M.Tech Student, Dept of Software Engg.,Bhiwani
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationLeveraging Data and Structure in Ontology Integration
Leveraging Data and Structure in Ontology Integration O. Udrea L. Getoor R.J. Miller Group 15 Enrico Savioli Andrea Reale Andrea Sorbini DEIS University of Bologna Searching Information in Large Spaces
More informationFuzzy Cognitive Maps application for Webmining
Fuzzy Cognitive Maps application for Webmining Andreas Kakolyris Dept. Computer Science, University of Ioannina Greece, csst9942@otenet.gr George Stylios Dept. of Communications, Informatics and Management,
More informationRule-Based Method for Entity Resolution Using Optimized Root Discovery (ORD)
American-Eurasian Journal of Scientific Research 12 (5): 255-259, 2017 ISSN 1818-6785 IDOSI Publications, 2017 DOI: 10.5829/idosi.aejsr.2017.255.259 Rule-Based Method for Entity Resolution Using Optimized
More informationToday s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan
Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics
More informationObject Distinction: Distinguishing Objects with Identical Names
Object Distinction: Distinguishing Objects with Identical Names Xiaoxin Yin Univ. of Illinois xyin1@uiuc.edu Jiawei Han Univ. of Illinois hanj@cs.uiuc.edu Philip S. Yu IBM T. J. Watson Research Center
More informationKnowledge Graph Completion. Mayank Kejriwal (USC/ISI)
Knowledge Graph Completion Mayank Kejriwal (USC/ISI) What is knowledge graph completion? An intelligent way of doing data cleaning Deduplicating entity nodes (entity resolution) Collective reasoning (probabilistic
More information6. Dicretization methods 6.1 The purpose of discretization
6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many
More informationUSING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS
INFORMATION SYSTEMS IN MANAGEMENT Information Systems in Management (2017) Vol. 6 (3) 213 222 USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS PIOTR OŻDŻYŃSKI, DANUTA ZAKRZEWSKA Institute of Information
More informationIntroduction Entity Match Service. Step-by-Step Description
Introduction Entity Match Service In order to incorporate as much institutional data into our central alumni and donor database (hereafter referred to as CADS ), we ve developed a comprehensive suite of
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationBeing Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet DRIM LIRIS, Lyon
Being Prepared In A Sparse World: The Case of KNN Graph Construction Antoine Boutet DRIM LIRIS, Lyon Co-authors Joint work with François Taiani Nupur Mittal Anne-Marie Kermarrec Published at ICDE 2016
More informationConcept-Based Document Similarity Based on Suffix Tree Document
Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri
More informationMining of Web Server Logs using Extended Apriori Algorithm
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational
More informationCOLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA
COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA Vincent W. Zheng, Yu Zheng, Xing Xie, Qiang Yang Hong Kong University of Science and Technology Microsoft Research Asia WWW 2010
More informationFrequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management
Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES
More informationA Versatile Record Linkage Method by Term Matching Model Using CRF
A Versatile Record Linkage Method by Term Matching Model Using CRF Quang Minh Vu, Atsuhiro Takasu, and Jun Adachi National Insitute of Informatics, Tokyo 101-8430, Japan {vuminh,takasu,adachi}@nii.ac.jp
More informationDATABASES often contain uncertain and imprecise references
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 14, NO. 5, SEPTEMBER/OCTOBER 2008 999 Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation Hyunmo Kang,
More informationIntroduction to Mobile Robotics
Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,
More informationData Structure for Association Rule Mining: T-Trees and P-Trees
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new
More informationLecture 8 May 7, Prabhakar Raghavan
Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of
More informationr v i e w o f s o m e r e c e n t d e v e l o p m
O A D O 4 7 8 O - O O A D OA 4 7 8 / D O O 3 A 4 7 8 / S P O 3 A A S P - * A S P - S - P - A S P - - - - L S UM 5 8 - - 4 3 8 -F 69 - V - F U 98F L 69V S U L S UM58 P L- SA L 43 ˆ UéL;S;UéL;SAL; - - -
More informationEffective Sequential Pattern Mining Algorithms for Dense Database
DEWS2006 3A-o4 Abstract Effective Sequential Pattern Mining Algorithms for Dense Database Zhenglu YANG, Yitong WANG, and Masaru KITSUREGAWA Institute of Industrial Science, The Univeristy of Tokyo Komaba
More informationA Novel Method of Optimizing Website Structure
A Novel Method of Optimizing Website Structure Mingjun Li 1, Mingxin Zhang 2, Jinlong Zheng 2 1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin, 150028, China 2 School
More informationA FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM
A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM Akshay S. Agrawal 1, Prof. Sachin Bojewar 2 1 P.G. Scholar, Department of Computer Engg., ARMIET, Sapgaon, (India) 2 Associate Professor, VIT,
More informationA Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects
A Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects Borg, Markus; Runeson, Per; Johansson, Jens; Mäntylä, Mika Published in: [Host publication title missing]
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationCOMS 4771 Clustering. Nakul Verma
COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find
More informationMining Vague Association Rules
Mining Vague Association Rules An Lu, Yiping Ke, James Cheng, and Wilfred Ng Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong, China {anlu,keyiping,csjames,wilfred}@cse.ust.hk
More informationKeywords Data alignment, Data annotation, Web database, Search Result Record
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web
More informationOutlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data
Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University
More informationA Mixed Fragmentation Algorithm for Distributed Object Oriented Databases 1
A Mixed Fragmentation Algorithm for Distributed Object Oriented Databases 1 Fernanda Baião Department of Computer Science - COPPE/UFRJ Abstract Federal University of Rio de Janeiro - Brazil baiao@cos.ufrj.br
More informationInfrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset
Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,
More informationOutline. Data Integration. Entity Matching/Identification. Duplicate Detection. More Resources. Duplicates Detection in Database Integration
Outline Duplicates Detection in Database Integration Background HumMer Automatic Data Fusion System Duplicate Detection methods An efficient method using priority queue Approach based on Extended key Approach
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationPattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42
Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationA DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES
A DISTRIBUTED ALGORITHM FOR MINING ASSOCIATION RULES Pham Nguyen Anh Huy *, Ho Tu Bao ** * Department of Information Technology, Natural Sciences University of HoChiMinh city 227 Nguyen Van Cu Street,
More informationEntity Resolution with Heavy Indexing
Entity Resolution with Heavy Indexing Csaba István Sidló Data Mining and Web Search Group, Informatics Laboratory Institute for Computer Science and Control, Hungarian Academy of Sciences sidlo@ilab.sztaki.hu
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationA Systems View of Large- Scale 3D Reconstruction
Lecture 23: A Systems View of Large- Scale 3D Reconstruction Visual Computing Systems Goals and motivation Construct a detailed 3D model of the world from unstructured photographs (e.g., Flickr, Facebook)
More informationTo provide state and district level PARCC assessment data for the administration of Grades 3-8 Math and English Language Arts.
200 West Baltimore Street Baltimore, MD 21201 410-767-0100 410-333-6442 TTY/TDD msde.maryland.gov TO: FROM: Members of the Maryland State Board of Education Jack R. Smith, Ph.D. DATE: December 8, 2015
More informationTowards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search.
Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Dmitri V. Kalashnikov Rabia Nuray-Turan Sharad Mehrotra Dept of Computer Science University of California, Irvine
More informationCPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2017
CPSC 340: Machine Learning and Data Mining Hierarchical Clustering Fall 2017 Assignment 1 is due Friday. Admin Follow the assignment guidelines naming convention (a1.zip/a1.pdf). Assignment 0 grades posted
More informationEfficient Incremental Mining of Top-K Frequent Closed Itemsets
Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,
More informationComparison of Online Record Linkage Techniques
International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI.
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationHolistic Query Evaluation over Information Extraction Pipelines
Holistic Query Evaluation over Information Extraction Pipelines ABSTRACT Ekaterini Ioannou Open University of Cyprus ekaterini.ioannou@ouc.ac.cy We introduce holistic in-database query processing over
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationPerformance Analysis of Apriori Algorithm with Progressive Approach for Mining Data
Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data Shilpa Department of Computer Science & Engineering Haryana College of Technology & Management, Kaithal, Haryana, India
More informationAssume we are given a tissue sample =, and a feature vector
MA 751 Part 6 Support Vector Machines 3. An example: Gene expression arrays Assume we are given a tissue sample =, and a feature vector x œ F Ð=Ñ $!ß!!! consisting of 30,000 gene expression levels as read
More informationCPSC 425: Computer Vision
1 / 31 CPSC 425: Computer Vision Instructor: Jim Little little@cs.ubc.ca Department of Computer Science University of British Columbia Lecture Notes 2016/2017 Term 2 2 / 31 Menu March 16, 2017 Topics:
More informationSEQUENTIAL PATTERN MINING FROM WEB LOG DATA
SEQUENTIAL PATTERN MINING FROM WEB LOG DATA Rajashree Shettar 1 1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in Abstract
More informationTransforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm
Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm Expert Systems: Final (Research Paper) Project Daniel Josiah-Akintonde December
More informationConcurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm
Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Marek Wojciechowski, Krzysztof Galecki, Krzysztof Gawronek Poznan University of Technology Institute of Computing Science ul.
More informationRECORD DEDUPLICATION USING GENETIC PROGRAMMING APPROACH
Int. J. Engg. Res. & Sci. & Tech. 2013 V Karthika et al., 2013 Research Paper ISSN 2319-5991 www.ijerst.com Vol. 2, No. 2, May 2013 2013 IJERST. All Rights Reserved RECORD DEDUPLICATION USING GENETIC PROGRAMMING
A Modified Apriori Algorithm for Fast and Accurate Generation of Frequent Item Sets
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 6, ISSUE 08, AUGUST 2017 ISSN 2277-8616 A Modified Apriori Algorithm for Fast and Accurate Generation of Frequent Item Sets K.A.Baffour,
Papers for comprehensive viva-voce
Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India
ALIN Results for OAEI 2016
ALIN Results for OAEI 2016 Jomar da Silva, Fernanda Araujo Baião and Kate Revoredo Department of Applied Informatics Federal University of the State of Rio de Janeiro (UNIRIO), Rio de Janeiro, Brazil {jomar.silva,fernanda.baiao,katerevoredo}@uniriotec.br
Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. {adafu,
Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong {adafu, wwkwong}@cse.cuhk.edu.hk
Privacy Preserving in Knowledge Discovery and Data Publishing
B.Lakshmana Rao, G.V Konda Reddy and G.Yedukondalu 33 Privacy Preserving in Knowledge Discovery and Data Publishing B.Lakshmana Rao 1, G.V Konda Reddy 2, G.Yedukondalu 3 Abstract Knowledge Discovery is
Multi-component Similarity Method for Web Product Duplicate Detection
Multi-component Similarity Method for Web Product Duplicate Detection Ronald van Bezu ronaldvanbezu@gmail.com Jim Verhagen j.m.verhagen@gmail.com Sjoerd Borst s.v.borst@gmail.com Damir Vandic vandic@ese.eur.nl
Visual Analysis of Lagrangian Particle Data from Combustion Simulations
Visual Analysis of Lagrangian Particle Data from Combustion Simulations Hongfeng Yu Sandia National Laboratories, CA Ultrascale Visualization Workshop, SC11 Nov 13 2011, Seattle, WA Joint work with Jishang
Comprehensive and Progressive Duplicate Entities Detection
Comprehensive and Progressive Duplicate Entities Detection Veerisetty Ravi Kumar Dept of CSE, Benaiah Institute of Technology and Science. Nagaraju Medida Assistant Professor, Benaiah Institute of Technology
APPLESHARE PC UPDATE INTERNATIONAL SUPPORT IN APPLESHARE PC
APPLESHARE PC UPDATE INTERNATIONAL SUPPORT IN APPLESHARE PC This update to the AppleShare PC User's Guide discusses AppleShare PC support for the use of international character sets, paper sizes, and date
CPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Hierarchical Clustering and Outlier Detection Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. Admin Assignment 2 is due
Unsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
Web page recommendation using a stochastic process model
Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
AROMA results for OAEI 2009
AROMA results for OAEI 2009 Jérôme David 1 Université Pierre-Mendès-France, Grenoble Laboratoire d Informatique de Grenoble INRIA Rhône-Alpes, Montbonnot Saint-Martin, France Jerome.David-at-inrialpes.fr
Chapter 1, Introduction
CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from
Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao
Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna
Network Based Hard/Soft Information Fusion Data Association Process Gregory Tauer, Kedar Sambhoos, Rakesh Nagi (co-pi), Moises Sudit (co-pi)
Network Based Hard/Soft Information Fusion Data Association Process Gregory Tauer, Kedar Sambhoos, Rakesh Nagi (co-pi), Moises Sudit (co-pi) Objectives: Formulate and implement a workable, quantitatively based
Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
Fast Contextual Preference Scoring of Database Tuples
Fast Contextual Preference Scoring of Database Tuples Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece Joint work with Evaggelia Pitoura http://dmod.cs.uoi.gr 2 Motivation
A New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Feature Subset Selection using Clusters & Informed Search. Team 3
Feature Subset Selection using Clusters & Informed Search Team 3 THE PROBLEM [This text box to be deleted before presentation Here I will be discussing exactly what the prob Is (classification based on
A Modified Apriori Algorithm
A Modified Apriori Algorithm K.A.Baffour, C.Osei-Bonsu, A.F. Adekoya Abstract: The Classical Apriori Algorithm (CAA), which is used for finding frequent itemsets in Association Rule Mining consists of