Robust and Efficient Fuzzy Match for Online Data Cleaning

Transcription:

Robust and Efficient Fuzzy Match for Online Data Cleaning
S. Chaudhuri, K. Ganjam, V. Ganti, R. Motwani
Presented by Aaditeshwar Seth

Motivation
- Data warehouse: many input tuples
- Tuples can be erroneous: spelling mistakes, different syntactic representations
- How to clean them automatically?
- Assumptions:
  - Tuples are structured, e.g. schema matching has already been done
  - Some tuple fields are supposed to contain the same values

Methodology
- Clean tuples are stored in a reference table
- Fuzzy matching is done to find the best matching clean tuples
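A minimal sketch of this overall flow, assuming an in-memory reference table and a hypothetical fuzzy_match similarity function (the function itself is developed in the later slides; the real system runs on top of a relational DBMS, not a Python list):

```python
def clean(input_tuples, reference_table, fuzzy_match, threshold=0.8):
    """Hypothetical driver loop: map each incoming tuple to its closest
    clean reference tuple if the similarity clears a threshold."""
    cleaned = []
    for u in input_tuples:
        # Naive scan over the reference table; the paper replaces this
        # with the error-tolerant index described later.
        best_v, best_sim = None, 0.0
        for v in reference_table:
            sim = fuzzy_match(u, v)
            if sim > best_sim:
                best_v, best_sim = v, sim
        cleaned.append(best_v if best_sim >= threshold else u)
    return cleaned
```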

Challenges
- Scalability: the reference table can be very large; the volume of input tuples can be very large
- Domain-specific enhancements should be possible to add
- Should be able to build upon an existing relational DBMS: no complex data structures for persistence

Solution outline
- Transform the input tuple into a reference tuple
- Similarity metric = 1 - (transformation cost)
- Flat edit distance is not good: within a field, it cannot distinguish between more and less informative tokens
  - Intuitively we know "Boeing" is more informative than "Corporation"
  - Hence, a match on "Boeing" should mean high similarity
  - This implies it should be more expensive to transform an input token into "Boeing" than into "Corporation"
- Attach weights to the transformation costs for each token

Solution outline: Cont.
- Use inverse token frequency for weight assignment
  - Tokens like "Boeing" occur less often than tokens like "Corporation"
  - Just a heuristic; pathological cases can exist
- Optimizations:
  - Do not compute the exact transformation cost; match on sets of substrings instead
  - Design an efficient index on the reference table
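A minimal sketch of the inverse-token-frequency weighting idea, assuming a simple per-column frequency count built from the reference table (the function and variable names here are illustrative, not the paper's API):

```python
import math
from collections import Counter

def build_token_weights(reference_column):
    """Weight each token by log(|reference set| / frequency), so rare,
    informative tokens like 'boeing' get a higher weight than common
    tokens like 'corporation'."""
    freq = Counter()
    for value in reference_column:
        freq.update(value.lower().split())
    n = len(reference_column)
    return {tok: math.log(n / f) for tok, f in freq.items()}

# Example: in a company-name column, 'boeing' appears far less often than
# 'corporation', so weights['boeing'] >> weights['corporation'].
```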

Fuzzy matching similarity function
- Input tuple u, reference tuple v; split into tokens, each token has a weight:
  w_token = log(|Ref Set| / freq(token, col))
- How is the minimum edit distance calculated? Many transformation paths are possible.
- Find tc = edit distance for each token
- tc(u, v) = sum of tc over all tokens
- fms(u, v) = 1 - min(tc(u, v) / sum(w_token), 1.0)
- Not a problem with fms_apx

Approximate FMS
- Match on q-grams
- Example: the q-grams of "Boeing" hashed under H1 (3, 2, 1, 0) and H2 (4, 5, 6, 7) give min-hash = {ing, boe}; the q-grams of "Bueing" hashed under H1 (8, 2, 1, 0) and H2 (9, 5, 6, 7) give min-hash = {ing, uei}
- Hash functions avoid string comparison

Approximate FMS: Cont.
1. Test for similarity of token t with all tokens r in the same column, and select the max
2. Multiply by the weight of the chosen max token r
3. Repeat over all tokens t in the same column
4. Repeat over all columns, and divide by the total weight
- d_q = 1 - 1/q. Factor of 2 implies sim_mh(QG(t), QG(t)) = 0.5?
- Cannot compute w(t) from the input u if there is a spelling mistake or an ordering difference
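A minimal sketch of the q-gram min-hash signatures used by the approximate FMS, assuming simple salted hash() calls in place of the paper's fixed family of independent hash functions (illustrative only, not the paper's exact similarity definition):

```python
def qgrams(token, q=3):
    """All contiguous substrings of length q, e.g. 'boeing' -> {'boe','oei','ein','ing'}."""
    return {token[i:i + q] for i in range(len(token) - q + 1)}

def min_hash_signature(token, num_hashes=2, q=3):
    """For each hash function, keep the q-gram with the smallest hash value.
    Similar tokens ('boeing' vs 'bueing') share many q-grams, so their
    signatures tend to agree without any direct string comparison."""
    grams = qgrams(token, q) or {token}   # fall back for very short tokens
    return [min(grams, key=lambda g: hash((i, g))) for i in range(num_hashes)]

def sim_mh(sig_a, sig_b):
    """Fraction of coordinates that agree: an estimate of the Jaccard
    similarity of the two q-gram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```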

Error Tolerant Index (ETI)
- Each q-gram belongs to a token; each token has a weight; each token belongs to a column
- Coordinate indicates the hash index: an extra level of indirection
- What if the same q-gram belongs to multiple tokens? Overwriting!
- Index the reference set on tid
- Index the ETI on {q-gram, coordinate, column}

Query processing
- Find all tokens of the input tuple u
- Find the min-hash signature of all tokens
- For all q-grams in the min-hash signature:
  - Find ETI(q-gram, coordinate, column)
  - Find the token t to which the q-gram belongs, and the weight of this token
  - Increment the similarity metric of the matching tid by w(t)/|mh(t)|
- Fetch the best K matching tids with similarity > c · threshold

Query processing: Cont.
- Optimization: when incrementing similarity metrics with matching tids, only consider new tids if the maximum score still possible from the remaining q-grams is > c · threshold
- Optimistic short circuiting:
  - Order q-grams according to their weights
  - Process only the first i q-grams and fetch the matching tids (but only fetch new tids under the condition above)
  - Stop when the FMS of all K tids is > c · threshold
  - If we don't stop, increment i and repeat
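A minimal sketch of the query-time scoring loop, assuming an in-memory dict stand-in for the ETI that maps (q-gram, coordinate, column) to (tid, reference token, weight) entries, and a signature() function like the one above; the short-circuiting optimizations are deliberately left out:

```python
from collections import defaultdict

def candidate_scores(u_tokens_by_column, eti, signature):
    """Accumulate, per reference tid, the weight evidence contributed by
    matching min-hash q-grams, following the query-processing slide."""
    scores = defaultdict(float)
    for column, tokens in u_tokens_by_column.items():
        for t in tokens:
            sig = signature(t)                      # min-hash q-grams of token t
            for coord, qgram in enumerate(sig):
                # ETI lookup on {q-gram, coordinate, column}
                for tid, ref_token, w in eti.get((qgram, coord, column), []):
                    scores[tid] += w / len(sig)     # contribute w(t)/|mh(t)|
    return scores

def top_k(scores, k, c_threshold):
    """Return the best K tids whose accumulated score clears c · threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(tid, s) for tid, s in ranked[:k] if s > c_threshold]
```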

Extensions
- Consider the token itself as another q-gram: split its importance equally among itself and its min-hash signature
- Assign weights to columns (domain dependent)
- Token transposition

Experiments
- Clean reference set; error injection methods produce unclean input tuples
- Accuracy:
  - FMS is better than edit distance
  - Min-hash signatures are better than token-only matching
  - Accuracy improves with more hash functions
  - Including tokens does not negatively impact accuracy

Experiments: Cont.
- Efficiency: processing time
  - Much faster than the naïve algorithm
  - Query processing time decreases with signature size
  - Use of tokens improves processing time
- Efficiency: ETI construction
  - About 7 times the time taken to process one tuple using the naïve algorithm
  - But the cost is amortized over repeated queries

Experiments: Cont.
- Average number of tids fetched per input tuple: more q-grams decrease the set sizes by better distinguishing similarity scores
- Average number of tids processed per input tuple: more q-grams increase the number of tids processed, but this is compensated by the decrease in the number of tids fetched

Discussion topics
- Relevance: in what scenarios does IDF work and distinguish between more and less informative tokens? Cluster tokens together?
- Does weight calculation become an issue with the optimizations and optimistic short circuiting?
- How to update the ETI with new tuples or outdated tuples?
- What is the role of the factor 2 in fms_apx?
- What if the same q-gram belongs to multiple tokens in the same column?