Approximate String Joins

Similar documents
Similarity Joins of Text with Incomplete Information Formats

Approximate String Joins in a Database (Almost) for Free

Supporting Fuzzy Keyword Search in Databases

Estimating Quantiles from the Union of Historical and Streaming Data

Similarity Joins in MapReduce

IMPROVING DATABASE QUALITY THROUGH ELIMINATING DUPLICATE RECORDS

Towards a Domain Independent Platform for Data Cleaning

Object Matching for Information Integration: A Profiler-Based Approach

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

Bachelor Thesis: Approximate String Joins

ABSTRACT I. INTRODUCTION

Robust and Efficient Fuzzy Match for Online Data Cleaning. Motivation. Methodology

A Learning Method for Entity Matching

Craig Knoblock University of Southern California. These slides are based in part on slides from Sheila Tejada and Misha Bilenko

RECORD DEDUPLICATION USING GENETIC PROGRAMMING APPROACH

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Database Applications (15-415)

LinkedMDB. The first linked data source dedicated to movies

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Deduplication of Hospital Data using Genetic Programming

Unsupervised Detection of Duplicates in User Query Results using Blocking

Fast Algorithms for Top-k Approximate String Matching

A fast approach for parallel deduplication on multicore processors

Comparison of Online Record Linkage Techniques

Rule-Based Method for Entity Resolution Using Optimized Root Discovery (ORD)

15-415/615 Faloutsos 1

COMPUSOFT, An international journal of advanced computer technology, 4 (6), June-2015 (Volume-IV, Issue-VI)

Introduction to Wireless Sensor Network. Peter Scheuermann and Goce Trajcevski Dept. of EECS Northwestern University

SimDataMapper: An Architectural Pattern to Integrate Declarative Similarity Matching into Database Applications

Parallelizing String Similarity Join Algorithms

A KNOWLEDGE BASED ONLINE RECORD MATCHING OVER QUERY RESULTS FROM MULTIPLE WEB DATABASE

Outline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect:

Novel Materialized View Selection in a Multidimensional Database

Probabilistic/Uncertain Data Management

Overview of Data Exploration Techniques. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

(Big Data Integration) : :

A Fast Linkage Detection Scheme for Multi-Source Information Integration

Deccansoft Software Services. SSIS Syllabus

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Toward Entity Retrieval over Structured and Text Data

CSE 344 APRIL 20 TH RDBMS INTERNALS

Leveraging Set Relations in Exact Set Similarity Join

A Survey on Removal of Duplicate Records in Database

Mining Heterogeneous Transformations for Record Linkage

Identifying Value Mappings for Data Integration: An Unsupervised Approach

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Evaluation of Relational Operations

Z-KNN Join for the Swiss Feed Database: a feasibility study

Compression of the Stream Array Data Structure

Indexing: Overview & Hashing. CS 377: Database Systems

Performance Tuning for the BI Professional. Jonathan Stewart

MICROSOFT BUSINESS INTELLIGENCE

Database Applications (15-415)

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

CompSci 516 Data Intensive Computing Systems

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Performance and scalability of fast blocking techniques for deduplication and data linkage

Increasing Database Performance through Optimizing Structure Query Language Join Statement

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Oracle Essbase XOLAP and Teradata

What s a database system? Review of Basic Database Concepts. Entity-relationship (E/R) diagram. Two important questions. Physical data independence

Comprehensive and Progressive Duplicate Entities Detection

Joins that Generalize: Text Classification Using WHIRL

Detecting Duplicate Objects in XML Documents

Microsoft. [MS20762]: Developing SQL Databases

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Interview Questions on DBMS and SQL [Compiled by M V Kamal, Associate Professor, CSE Dept]

Join Operation Under Streaming Inputs For optimizing Output results

Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

Analysis of Basic Data Reordering Techniques

Developing SQL Databases

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Duplicate Identification in Deep Web Data Integration

#mstrworld. Analyzing Multiple Data Sources with Multisource Data Federation and In-Memory Data Blending. Presented by: Trishla Maru.

Record Linkage using Probabilistic Methods and Data Mining Techniques

Joint Entity Resolution

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Learning High Accuracy Rules for Object Identification

Accelerating XML Structural Matching Using Suffix Bitmaps

Collective Entity Resolution in Relational Data

Advanced Databases: Parallel Databases A.Poulovassilis

Venezuela: Teléfonos: / Colombia: Teléfonos:

Huge market -- essentially all high performance databases work this way

6 SSIS Expressions SSIS Parameters Usage Control Flow Breakpoints Data Flow Data Viewers

Parallel, In Situ Indexing for Data-intensive Computing. Introduction

Accelerate your SAS analytics to take the gold

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

CHAPTER 7 CONCLUSION AND FUTURE WORK

A Case for Merge Joins in Mediator Systems

HANA Performance. Efficient Speed and Scale-out for Real-time BI

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag.

Tuning Large Scale Deduplication with Reduced Effort

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Transformation-based Framework for Record Matching

Using Similarity-based Operations for Resolving Data-level Conflicts

Transcription:

Approximate String Joins Divesh Srivastava AT&T Labs-Research

The Need for String Joins Substantial amounts of data in existing RDBMSs are strings There is a need to correlate data stored in different tables Applications: data cleaning, data integration Example: Find common customers across different services Service A Divesh Srivastava John Paul McDougal KIA International Euroaft Corp. Hatronic Inc. Comp. Sci. Dept. Service B Eurodraft Corp. Hatronic Corp. Euroaft Divesh Shrivastava John P. McDougal Dept. of Comp. Sci. KIA 10/16/2003 AT&T Labs-Research 2

Problems with Exact String Joins Typing mistakes, abbreviations, different conventions Standard equality joins do not forgive such mistakes Service A Service B Divesh Srivastava Eurodraft Corp. John Paul McDougal Hatronic Corp. KIA International Euroaft Corp. Euroaft Divesh Shrivastava Hatronic Inc. John P. McDougal Comp. Sci. Dept. Dept. of Comp. Sci. KIA 10/16/2003 AT&T Labs-Research 3

Matching String Attributes Need for a similarity metric! Match entries with typing mistakes Divesh Srivastava vs. Divesh Shrivastava Match entries with abbreviations Euroaft Corporation vs. Euroaft Corp. Match entries with different conventions Comp. Sci. Dept. vs. Dept. of Comp. Sci. 10/16/2003 AT&T Labs-Research 4

Using String Edit Distance String Edit Distance: Number of single character insertions, deletions, and modifications to transform one string to the other Divesh Srivastava - Divesh Shrivastava 1 Dept. Comp. Sci. - Dept. of Comp. Sci. 3 Euroaft Corporation - Euroaft 12 Good for spelling errors, short word inserts/deletes Problems with word order variations, long word inserts/deletes 10/16/2003 AT&T Labs-Research 5

Using Cosine Similarity Infrequent token (high weight) Similar entries should share infrequent tokens (tokens can be words, groups of characters, ) EUROAFT CORPORATION EUROAFT EUROAFT CORPORATION HATRONIC CORPORATION Similarity = weight(token, t 1 ) * weight(token, t 2 ) Common token (low weight) Good for common word inserts/deletes, word order variations Problems with rare word inserts/deletes, rare misspellings 10/16/2003 AT&T Labs-Research 6

Approximate String Joins No native support for approximate string joins in RDBMSs Two existing (straightforward) solutions: Join data outside of DBMS Join data via user-defined functions (UDFs) inside DBMS 10/16/2003 AT&T Labs-Research 7

AS Joins Outside DBMS 1. Export data 2. Join outside of DBMS 3. Import the result Main advantage: Can exploit specialized tools (e.g., address matching) and business rules, without restrictions from DBMS functionality Disadvantages: Substantial amounts of data to be exported/imported Cannot be easily integrated with other DBMS processing steps 10/16/2003 AT&T Labs-Research 8

AS Joins Using UDFs 1. Write a UDF to check if two strings match within distance K 2. Write an SQL statement that applies the UDF to the string pairs SELECT R.sA, S.sA FROM R, S WHERE approx_string_match(r.sa, S.sA, K) Main advantage: Easily integrated with other DBMS processing steps Main disadvantage: Inefficient: UDF applied to entire cross-product of relations 10/16/2003 AT&T Labs-Research 9

Our Approach: Use Q-grams Intuition Similar strings have many common substrings (q-grams) Preprocess string data, generate auxiliary tables of substrings Perform approximate string join, exploiting RDBMS capabilities of exact joins and aggregations Advantages No modification of underlying RDBMS needed. Can leverage the RDBMS query optimizer. Much more efficient than the approach based on naive UDFs 10/16/2003 AT&T Labs-Research 10

What is a Q-gram? Q-gram: A sequence of q characters of the original string Split each string into all overlapping q-grams string with length L L + q - 1 q-grams Example: q=3 srivastava = ##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$ 10/16/2003 AT&T Labs-Research 11

Q-grams and String Edit Distance Each edit distance operation affects at most q q-grams Two strings S1 and S2 with string edit distance K have at least [max(s1.len, S2.len) + q - 1] Kq q-grams in common Example: srivastava = ##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$ shrivastava = ##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$ Useful filter: eliminate all string pairs without "enough" common q- grams (no false dismissals!) 10/16/2003 AT&T Labs-Research 12

Words and Cosine Similarity Using words as tokens: Infrequent token (high weight) Split each entry into words EUROAFT CORP EUROAFT Similar entries share infrequent words EUROAFT CORP HATRONIC CORP Common token (low weight) Problems with misspellings, abbreviations Euroaft Corporation Eurodraft Corp. WHIRL W.Cohen, SIGMOD 98 10/16/2003 AT&T Labs-Research 13

Q-grams and Cosine Similarity Use q-grams as tokens: Similar entries share many, infrequent q-grams Euroaft Corporation ##E, #Eu, Eur, uro, roa, oaf, aft,, Cor, orp, rpo, por, ##E, #Eu, Eur, uro, rod, odr, dra, raf, aft,, Cor, orp, rp., Eurodraft Corp. Good for common misspellings, abbreviations 10/16/2003 AT&T Labs-Research 14

Matching Strings Efficiently in SQL Problem 1: Find all pairs of strings t 1, t 2 with string edit distance K Problem 2: Find all pairs of strings t 1, t 2 with cosine similarity where cosine similarity = weight(token, t 1 ) * weight(token, t 2 ) token SQL-only solution desirable: Scalability Robustness Ease of deployment 10/16/2003 AT&T Labs-Research 15

Problem 1: Edit Distance in DBMS LENGTH FILTER: two strings S1 and S2 with edit distance K cannot differ in length by > K COUNT FILTER: two strings S1 and S2 with edit distance K have [max(s1.len, S2.len) + q - 1] Kq q-grams in common Create auxiliary DBMS tables with tuples of the form: <sid, qgram>, join and aggregate these tables POSITION FILTER: corresponding q-grams of S1 and S2 cannot differ in their positions by more than K Create auxiliary DBMS tables with tuples of the form: <sid, qgram, pos>, join and aggregate these tables 10/16/2003 AT&T Labs-Research 16

Problem 1: The SQL Statement SELECT R1.sid, R2.sid FROM R1, R1Q, R2, R2Q WHERE R1.sid = R1Q.sid AND R2.sid = R2Q.sid AND R1Q.qgram = R2Q.qgram AND abs(r1q.pos R2Q.pos) <= k AND abs(len(r1.str) LEN(R2.str)) <= k AND (LEN(R1.str)+q-1 > k*q OR LEN(R2.str)+q-1 > k*q) GROUP BY R1.sid, R2.sid, R1.str, R2.str HAVING COUNT(*) >= max(len(r1.str), LEN(R2.str))+q-1 - k*q AND edit_distance(r1.str, R2.str, k) UNION ALL SELECT FROM WHERE R1.sid, R2.sid R1, R2 LEN(R1.str)+q-1 <= k*q AND LEN(R2.str)+q-1 <= k*q AND abs(len(r1.str) - LEN(R2.str)) <= k AND edit_distance(r1.str, R2.str, k) 10/16/2003 AT&T Labs-Research 17

Problem 1: Experimental Data Three customer data sets from AT&T Worldnet service (a) set1 with about 40K strings (b) set2 and (c) set3 with about 30K strings each 12000 1200 600 10000 1000 500 Number of Strings 8000 6000 4000 Number of Strings 800 600 400 Number of Strings 400 300 200 2000 200 100 0 1 6 11 16 21 26 31 0 1 9 17 25 33 41 49 57 65 0 1 7 13 19 25 31 37 43 49 55 61 67 String Length String Length String Length (a) (b) (c) 10/16/2003 AT&T Labs-Research 18

Problem 1: DBMS Setup Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server Materialized the q-gram tables with entries <sid, qgram, pos> (less then 2 minutes per table) Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index) The generation time for the auxiliary q-gram tables and indexes is small: even on-the-fly materialization is feasible 10/16/2003 AT&T Labs-Research 19

Problem 1: RDBMS Query Plans Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets) Q-gram approach: usually sort-merge joins In our prototype implementation, sort-merge joins is the fastest too 10/16/2003 AT&T Labs-Research 20

Problem 1: Naïve UDFs vs. Q-grams 10 000 Q1 (UDF only) Q4 (Filtering) 10 00 100 10 k=1 k=2 k=3 Q1 (UDF o nly) 19 54 2028 2 0 44 Q4 (Filtering ) 4 8 6 8 91 For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs 10/16/2003 AT&T Labs-Research 21

Problem 1: Effect of Filters CP=Cross Product, L=Length Filtering, LP=Length and Position Filtering, LC=Length and Count Filtering, LPC=Length, Position, and Count Filtering, Real=Number of Real matches 1.E+10 CP L LP LC LPC Real 1.E+10 CP L LP LC LPC Real 1.E+10 CP L LP LC LPC Real 1.E+09 1.E+09 1.E+09 Candidate Set Size 1.E+08 1.E+07 Candidate Set Size 1.E+08 1.E+07 Candidate Set Size 1.E+08 1.E+07 1.E+06 1.E+06 1.E+06 1.E+05 1.E+05 1.E+05 k=1 k=2 k=3 k=1 k=2 k=3 k=1 k=2 k=3 (a) LENGTH FILTER: 40-70% reduction for set1 (small length deviation) (b) 90-98% reductions for set2, set3 (big length deviation) +COUNT FILTER: > 99% reduction +POSITION FILTER: ~ additional 50% reduction (c) 10/16/2003 AT&T Labs-Research 22

Problem 1: Effect of Q-gram Size 1.E+07 q=1 q=2 q=3 q=4 q=5 1.E+08 q=1 q=2 q=3 q=4 q=5 Candidate Set Size 1.E+06 Candidate Set Size 1.E+07 1.E+05 set1 set2 set3 1.E+06 set1 set2 set3 (a) (b) For the given data sets, q=2 worked best q=2 is small enough to avoid, as much as possible, the space overhead for the auxiliary tables 10/16/2003 AT&T Labs-Research 23

Problem 2: Cosine Similarity in DBMS Create in SQL relations R i Weights (token weights from R i ) Compute similarity of each tuple pair R 1 Name 1 EUROAFT CORP 2 HATRONIC INC 1 1 2 2 R 1 Weights Token EUROAFT CORP HATRONIC INC W 0.98 0.20 0.99 0.14 1 1 2 2 3 3 R 2 Weights Token HATRONIC CORP EUROAFT INC EUROAFT CORP W 0.98 0.20 0.95 0.30 0.97 0.25 1 2 3 R 2 Name HATRONIC CORP EUROAFT INC EUROAFT CORP R 1 EUROAFT CORP EUROAFT CORP EUROAFT CORP R 2 EUROAFT INC EUROAFT CORP HATRONIC CORP Similarity 0.98 1.00 0.05 10/16/2003 HATRONIC INC AT&T HATRONIC Labs-Research CORP 0.98 24 HATRONIC INC EUROAFT INC 0.04

Problem 2: Naïve SQL Computes similarity for many useless pairs SELECT r1w.tid AS tid1, r2w.tid AS tid2 FROM R1Weights r1w, R2Weights r2w WHERE r1w.token = r2w.token GROUP BY r1w.tid, r2w.tid HAVING SUM(r1w.weight*r2w.weight) Expensive operation! 10/16/2003 AT&T Labs-Research 25

Problem 2: Sampling Step Cosine similarity = weight(token, t 1 ) * weight(token, t 2 ) Products cannot be high when weight is small Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution in an RDBMS) R i Weights Token EUROAFT HATRONIC CORP INC W 0.9144 0.8419 0.01247 0.00504 Token EUROAFT HATRONIC R i Sample #TIMES SAMPLED 18 (18/20=0.90) 17 (17/20=0.85) Eliminates low similarity pairs (e.g., EUROAFT INC with HATRONIC INC ) 10/16/2003 AT&T Labs-Research 26

Problem 2: Sampling-Based Joins R 1 Weights R 2 Sample R 1 Token W Token W Name 1 EUROAFT 0.98 1 HATRONIC 0.98 1 EUROAFT CORP 1 CORP 0.02 1 CORP 0.02 2 HATRONIC INC 2 2 HATRONIC INC 0.98 0.01 2 2 EUROAFT INC 0.95 0.05 3 EUROAFT 0.97 3 CORP 0.03 R1 EUROAFT CORP EUROAFT CORP HATRONIC INC R2 EUROAFT INC EUROAFT CORP HATRONIC CORP Similarity 0.98 0.9 0.98 10/16/2003 AT&T Labs-Research 27

Problem 2: Sampling-Based SQL SELECT r1w.tid AS tid1, r2s.tid AS tid2 FROM WHERE R1Weights r1w, R2Sample r2s, R2sum r2sum r1w.token = r2s.token AND r1w.token = r2sum.token GROUP BY r1w.tid, r2s.tid HAVING SUM(r1w.weight*r2sum.total*r2s.c) S Fully implemented in pure SQL! 10/16/2003 AT&T Labs-Research 28

Problem 2: Experimental Setup 40,000 entries from AT&T customer database, split into R 1 (26,000 entries) and R 2 (14,000 entries) Tokenizations: Words Q-grams, q=2 & q=3 Methods compared: 100,000,000 10,000,000 number of tuple pairs 1,000,000 100,000 10,000 1,000 Variations of sample-based joins Baseline in SQL 100 10 1 Q-grams, q=2 Q-grams, q=3 Words 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 similarity WHIRL [SIGMOD98], adapted for handling q-grams 10/16/2003 AT&T Labs-Research 29

Problem 2: Metrics Execute the (approximate) join for similarity Precision: (measures accuracy) Fraction of the pairs in the answer with real similarity Recall: (measures completeness) Fraction of the pairs with real similarity that are also in the answer Execution time 10/16/2003 AT&T Labs-Research 30

Problem 2: Comparing with WHIRL recall 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 R1R2 sr1r2 R1sR2 sr1sr2 WHIRL 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 similarity precision 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 3-grams, S=128 0 R1R2 sr1r2 R1sR2 sr1sr2 WHIRL 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 similarity Sample-based Joins: Good recall across similarity thresholds. Retrieves many interesting matches (many good matches have similarity ~0.5) WHIRL: Low recall (memory problems) Misses many good matches WHIRL: Perfect precision result of post-join step 10/16/2003 AT&T Labs-Research 31

Problem 2: Changing Sample Size recall 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 S=2 S=32 S=64 S=128 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 similarity precision 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 S=2 S=32 S=64 S=128 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 similarity Increased sample size More accurate weight estimation Better recall, precision 3-grams, S=128 Drawback: Increased execution time (more pairs found) 10/16/2003 AT&T Labs-Research 32

Problem 2: Execution Time 10000 1000 execution time (secs) 100 10 1 WHIRL sr1r2 0.1 S=1 S=2 S=4 S=8 S=16 S=32 S=64 S=128 S=256 sample size 3-grams WHIRL and sample-based joins break-even at S ~ 64, 128 Baseline in SQL >24hrs, never finished (out of disk space) 10/16/2003 AT&T Labs-Research 33

Problem 2: Evaluation Summary Sample-based joins preferable when recall is important. WHIRL good for very high thresholds ( a few good matches ) Cosine similarity with q-grams give better results (semantically). Higher execution time compared to word-based cosine similarity Sample-based joins more adaptable and flexible: easier to tune more scalable more robust easy to deploy in any environment 10/16/2003 AT&T Labs-Research 34

Related Work Fellegi & Sunter. A theory for record linkage. JASA 1969. Winkler. Matching and record linkage. Business Survey Methods. Wiley, 1995. Hernandez & Stolfo. The merge/purge problem for large databases. SIGMOD 1995. Galhardas et al. Declarative data cleaning: Language, model, VLDB 2001. Sarawagi & Bhamidipaty. Interactive deduplication using active learning. KDD 2002. Ananthakrishna et al. Eliminating fuzzy duplicates in data warehouses. VLDB 2002. Dasu & Johnson: Exploratory data mining and data cleaning, Wiley, 2003 Tejada, Knoblock & Minton. Learning domain-independent string KDD 2002. Navarro. A guided tour to approximate string matching. ACM Comp. Surveys, 2001. 10/16/2003 AT&T Labs-Research 35

Conclusions We introduced techniques for mapping approximate string joins into vanilla SQL expressions, each with its own advantages: String edit distance: no false positives Cosine similarity: tradeoff between recall and efficiency Our techniques do not require modifying the underlying RDBMS Significantly outperform existing approaches Other applications? 10/16/2003 AT&T Labs-Research 36

Acknowledgements Joint work with: Luis Gravano (Columbia University) Panagiotis G. Ipeirotis (Columbia University) H. V. Jagadish (University of Michigan) Nick Koudas (AT&T Labs-Research) S. Muthukrishnan (Rutgers University, AT&T Labs-Research) Based on papers in VLDB 01 and WWW 03, and poster in ICDE 03 10/16/2003 AT&T Labs-Research 37