Peer-to-Peer Similarity Search in Metric Spaces

Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis
Department of Informatics, Athens University of Economics and Business (AUEB), Athens, Greece
http://www.db-net.aueb.gr/cdoulk/
cdoulk@aueb.gr

Motivation
- Similarity search in metric spaces
  - Objects are represented in a high-dimensional feature space
  - Complex distance functions (e.g. text, multimedia)
- Goal: share the computational load over a set of computers
  - Peer-to-Peer
  - DBISP2P '07 session on P2P similarity search
- Existing work
  - Centralized settings
  - Structured P2P systems (not preserving peer autonomy)

Outline
1. Preliminaries
   a. Metric spaces
   b. iDistance
2. SIMPEER
   a. Construction
   b. Range query processing
   c. k-NN query processing
3. Experimental results
4. Conclusions & further work

Metric Space
- Metric space $M = (D, d)$
  - $d(p,q) = d(q,p)$ (symmetry)
  - $d(p,q) > 0$ for $q \neq p$, and $d(p,p) = 0$ (non-negativity)
  - $d(p,q) \le d(p,o) + d(o,q)$ (triangle inequality)
- Similarity queries
  - Range queries: $R(q,r) = \{ u \in D \mid d(q,u) < r \}$
  - k-NN queries: $NN_k(q)$
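The two query types can be illustrated with a brute-force scan. This is only a minimal sketch, assuming Euclidean distance as the metric and a small in-memory list as the data set; the function names are hypothetical and not taken from the slides.

```python
import heapq
import math

def euclidean(p, q):
    """One possible metric d; any function satisfying the metric axioms works."""
    return math.dist(p, q)

def range_query(data, q, r, d=euclidean):
    """R(q, r): all objects u with d(q, u) < r, found by a linear scan."""
    return [u for u in data if d(q, u) < r]

def knn_query(data, q, k, d=euclidean):
    """NN_k(q): the k objects closest to q, found by a linear scan."""
    return heapq.nsmallest(k, data, key=lambda u: d(q, u))

# Illustrative data only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 3)
```

Every query here costs one distance computation per object, which is the cost that index structures and load sharing across peers aim to reduce.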

iDistance (Indexing the Distance)
- Space partitioning into n clusters
  - Reference points $K_i$
- Each cluster mapped to an interval
- Each object x mapped to a 1-d value
- Values indexed in a B+-tree
- Query R(q,r)
  - If the query intersects a cluster, scan the corresponding interval
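A rough sketch of this mapping follows. The slide does not give the key formula, so the code assumes the commonly described iDistance key `i * c + d(K_i, x)`, where `c` is a constant larger than any cluster radius so that intervals do not overlap. A sorted Python list with `bisect` stands in for the B+-tree, and all names are hypothetical.

```python
import bisect
import math

def build_index(data, refs, c):
    """Assign each object to its nearest reference point K_i and map it to
    the 1-d key i * c + d(K_i, x) (assumed key formula, see lead-in)."""
    entries = []
    radii = [0.0] * len(refs)  # distance of the furthest object per cluster
    for x in data:
        i = min(range(len(refs)), key=lambda j: math.dist(refs[j], x))
        dist = math.dist(refs[i], x)
        radii[i] = max(radii[i], dist)
        entries.append((i * c + dist, x))
    entries.sort()  # sorted list used in place of a B+-tree
    keys = [key for key, _ in entries]
    return keys, entries, radii

def idistance_range_query(index, refs, c, q, r):
    """Scan only the intervals of clusters that the query sphere intersects."""
    keys, entries, radii = index
    result = []
    for i, ref in enumerate(refs):
        dq = math.dist(ref, q)
        if dq - r > radii[i]:
            continue  # query sphere does not reach cluster i
        lo = i * c + max(0.0, dq - r)
        hi = i * c + min(radii[i], dq + r)
        start = bisect.bisect_left(keys, lo)
        end = bisect.bisect_right(keys, hi)
        for _, x in entries[start:end]:
            if math.dist(q, x) < r:  # filter false positives
                result.append(x)
    return result

# Illustrative data only.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
refs = [(1.0, 0.0), (10.0, 11.0)]
index = build_index(points, refs, 100.0)
hits = idistance_range_query(index, refs, 100.0, (1.0, 0.5), 1.2)
```

The interval bounds follow from the triangle inequality: an object within distance r of q must lie between `dq - r` and `dq + r` from the reference point.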

SIMPEER: 3-level Clustering Scheme
1. Each peer
   - clusters its own data
   - indexes local points using iDistance
2. Each super-peer
   - receives its peers' cluster descriptions
   - computes the hyper-clusters using our extension of iDistance
3. Super-peers
   - exchange hyper-clusters
   - build a set of routing clusters
Super-peer architecture

iDistance Extension
- Map clusters, not points
- Index the furthest point of each cluster only!
- Each cluster $C_j$ mapped to a 1-d value
- Values indexed in a B+-tree
- Query R(q,r)
  - Search region $[\, d(O_i,q) - r,\ r_i \,]$
[Figure: hyper-cluster with reference point $O_i$ and radius $r_i$ containing clusters $C_1$, $C_2$, $C_3$, intersected by the query R(q,r)]

Peer Query Processing
- Data organization
  - Clustering / space partitioning
  - iDistance
  - $LC_p$ sent to super-peer: $LC_p = \{ C_i : (K_i, r_i) \}$
- Range query processing
  - Scan intervals of the B+-tree
[Figure: peer data space with $LC_p = \{C_1(K_1,r_1), C_2(K_2,r_2), C_3(K_3,r_3)\}$, the query R(q,r), and the leaf nodes of the B+-tree]

Super-Peer Query Processing
- Super-peer
  - Creates hyper-clusters based on peer clusters
  - Indexes the furthest point of each peer's cluster
  - $LHC_{sp} = \{HC_1(O_1,r_1), HC_2(O_2,r_2), HC_3(O_3,r_3)\}$
- Range query processing
  - Find peers to forward the R(q,r) query
  - Peer selection mechanism
[Figure: super-peer data space with hyper-clusters $(O_1,r_1)$, $(O_2,r_2)$, $(O_3,r_3)$]
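Peer selection can be pictured as a sphere-intersection test over the cluster descriptions $(K_i, r_i)$ each peer has sent. The sketch below is a simplification that checks every cluster directly instead of going through the B+-tree over hyper-clusters; the dictionary layout and function names are assumptions made for illustration.

```python
import math

def intersects(center, radius, q, r):
    """True if the query sphere R(q, r) can overlap the cluster sphere."""
    return math.dist(center, q) <= radius + r

def select_peers(peer_clusters, q, r):
    """Return the peers owning at least one cluster that R(q, r) may intersect.

    peer_clusters maps a peer id to its list of (K_i, r_i) descriptions.
    """
    return [
        peer
        for peer, clusters in peer_clusters.items()
        if any(intersects(K, rad, q, r) for K, rad in clusters)
    ]

# Illustrative cluster descriptions only.
peer_clusters = {
    "peerA": [((0.0, 0.0), 1.0), ((5.0, 5.0), 0.5)],
    "peerB": [((10.0, 10.0), 2.0)],
}
selected = select_peers(peer_clusters, (1.5, 0.0), 0.6)
```

The test is conservative: a selected peer may still return no results, which is the quantity the success-ratio experiment measures.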

Routing Indices
- Super-peers broadcast hyper-clusters
- Recipient super-peers
  - Treat hyper-clusters similarly to peer clusters
  - Build routing clusters $RC_i$
  - Used to determine the neighbouring super-peer to forward the query
  - Super-peer selection mechanism

k-NN Query Processing
- Convert the k-NN query to a range query R(q,r)
- Use an estimated range r
  - Local estimation: based on (peer) cluster information at a super-peer
  - Global estimation: based on hyper-cluster information at a super-peer
- No communication required for estimation!
- Maximum 2 round-trips required!
  - If fewer than k objects are retrieved, a second round-trip cannot be avoided
  - Super-peer computes an upper bound for r, based on its peers' data
- Goal: make a good estimation, such that
  - the first round-trip is enough (overestimate r)
  - r is sufficient, but not too large (do not overestimate r too much)
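The two-round-trip scheme can be sketched as follows, assuming the estimated radius and the upper bound are already available. The 1-d data, the callable `range_query`, and all names are hypothetical stand-ins for the distributed range query.

```python
def knn_via_range(q, k, r_est, r_upper, range_query, d):
    """Answer a k-NN query with at most two range queries.

    r_est is the estimated radius; r_upper is an upper bound assumed to
    guarantee at least k results.
    """
    candidates = range_query(q, r_est)  # first round-trip
    round_trips = 1
    if len(candidates) < k:
        candidates = range_query(q, r_upper)  # second round-trip
        round_trips = 2
    nearest = sorted(candidates, key=lambda u: d(q, u))[:k]
    return nearest, round_trips

# Illustrative 1-d data with absolute difference as the metric.
data = [1.0, 2.0, 4.0, 7.0, 11.0]
d = lambda a, b: abs(a - b)
rq = lambda q, r: [u for u in data if d(q, u) < r]

good_estimate = knn_via_range(3.0, 2, 1.5, 10.0, rq, d)
poor_estimate = knn_via_range(3.0, 2, 0.5, 10.0, rq, d)
```

An estimate that is too small costs a second round-trip, while one that is too large retrieves objects that are later discarded, which is why the slides aim for a slight overestimate.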

Histogram Construction
- Distribution of distances: $F(r) = \Pr\{ d(q,p) \le r \}$
- Expected number of objects retrieved by R(q,r): $\#\mathrm{objs}(R(q,r)) = n \times F(r)$
- Assumption: high homogeneity of viewpoints inside a cluster [Ciaccia, PODS 98]
  - Approximate $F_q$ with a sampled distance distribution $F$
[Figure: histogram of cluster i, frequency of distances over bins $r_B, 2r_B, \dots, s\,r_B$ with values $F_i(r_B), F_i(2r_B), \dots, F_i(s\,r_B)$]
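One way to picture the histogram is as a cumulative distribution sampled at the bin edges $r_B, 2r_B, \dots, s\,r_B$. The sketch below assumes the distances come from a sample of object pairs inside the cluster and that lookups round r up to the next bin edge; both choices, and the function names, are assumptions rather than details from the slides.

```python
import bisect
import math

def build_histogram(distances, r_b, s):
    """Cumulative histogram: entry j-1 approximates F(j * r_b)."""
    ordered = sorted(distances)
    n = len(ordered)
    return [bisect.bisect_right(ordered, j * r_b) / n for j in range(1, s + 1)]

def estimate_objects(hist, r_b, n, r):
    """Expected number of objects n * F(r), reading F from the histogram."""
    if r <= 0:
        return 0.0
    j = min(len(hist), math.ceil(r / r_b))  # round r up to a bin edge
    return n * hist[j - 1]

# Illustrative sampled distances only.
sampled = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5, 0.2, 1.8]
hist = build_histogram(sampled, 1.0, 4)
expected_hits = estimate_objects(hist, 1.0, 1000, 2.0)
```

Rounding r up to a bin edge biases the estimate upward, which matches the stated preference for overestimating rather than underestimating.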

Local Estimation (LE)
- Two cases for the query R(q,r) and a cluster $C_i$ with reference point $K_i$ and radius $r_i$:

  Condition             $d(K_i,q) + r \le r_i$    $d(K_i,q) + r > r_i$
  Estimated #objects    $n_i \times F_i(r)$       $n_i \times F_i(r')$

  where $r' = (r_i + r - d(K_i,q))/2$
- Binary search on $[0, s\,r_B]$ to find the smallest r for which the estimated number of objects is $\ge k$
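The two cases and the binary search can be combined in a short sketch. It assumes each cluster is described by a tuple `(K_i, r_i, n_i, histogram)`, reads $r'$ as $(r_i + r - d(K_i,q))/2$, and searches a continuous interval with a tolerance; these representation choices and all names are hypothetical.

```python
import math

def cdf(hist, r_b, r):
    """F_i(r) read from a cumulative histogram with bin width r_b."""
    if r <= 0:
        return 0.0
    return hist[min(len(hist), math.ceil(r / r_b)) - 1]

def estimated_objects(clusters, r_b, q, r):
    """Sum the per-cluster estimates for the range query R(q, r)."""
    total = 0.0
    for K, r_i, n_i, hist in clusters:
        dist = math.dist(K, q)
        if dist - r > r_i:
            continue  # query does not intersect this cluster
        if dist + r <= r_i:
            total += n_i * cdf(hist, r_b, r)  # query lies inside the cluster
        else:
            total += n_i * cdf(hist, r_b, (r_i + r - dist) / 2)  # partial overlap
    return total

def estimate_radius(clusters, r_b, s, q, k, tol=1e-3):
    """Smallest r in [0, s * r_b] whose estimate reaches k, up to tol."""
    lo, hi = 0.0, s * r_b
    if estimated_objects(clusters, r_b, q, hi) < k:
        return hi  # even the largest radius is estimated to be too small
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if estimated_objects(clusters, r_b, q, mid) >= k:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative cluster: centre, radius, cardinality, cumulative histogram.
clusters = [((0.0, 0.0), 2.0, 100, [0.25, 0.5, 0.75, 1.0])]
r_est = estimate_radius(clusters, 1.0, 4, (0.0, 0.0), 40)
```

Binary search is valid here because the estimate never decreases as r grows: both $F_i$ and $r'$ are non-decreasing in r.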

Global Estimation (GE)
- Hyper-clusters enhanced with 2 histograms:
  - $hc_i$: distance distribution of clusters within a hyper-cluster, giving the number of clusters intersecting the query ($nc_i$)
  - $hd_i$: number of data objects contained in the intersection ($nd_i$)
    - Superimpose cluster histograms, by keeping the minimum value of each bin
    - Also keep the minimum cardinality of all clusters
- Estimated #objects: $nc_i(r) \times nd_i(r)$
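The superimposition step can be sketched as a bin-wise minimum. How $nd_i(r)$ is derived from the superimposed histogram is not spelled out on the slide, so the code assumes it is the minimum cardinality times the superimposed distribution value, and it takes $nc_i(r)$ as a given input. All names are hypothetical.

```python
import math

def superimpose(cluster_hists):
    """Keep the minimum value of each bin across all cluster histograms."""
    return [min(bins) for bins in zip(*cluster_hists)]

def objects_per_cluster(hd, r_b, n_min, r):
    """nd(r): assumed to be the minimum cardinality times the superimposed F(r)."""
    if r <= 0:
        return 0.0
    return n_min * hd[min(len(hd), math.ceil(r / r_b)) - 1]

def global_estimate(nc, hd, r_b, n_min, r):
    """Estimated #objects = nc(r) * nd(r), with nc(r) supplied by the caller."""
    return nc * objects_per_cluster(hd, r_b, n_min, r)

# Illustrative cumulative histograms and cardinalities of two clusters.
hists = [[0.2, 0.5, 1.0], [0.3, 0.4, 1.0]]
cardinalities = [100, 80]
hd = superimpose(hists)
n_min = min(cardinalities)
estimate = global_estimate(3, hd, 1.0, n_min, 2.0)
```

Taking minima over bins and cardinalities makes the per-cluster count a pessimistic summary of the clusters inside a hyper-cluster.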

Experimental Setup
- GT-ITM topology generator (4K-16K peers)
  - #Super-peers = {200, 400}
  - $DEG_{sp}$ = 4-7
  - $DEG_p$ = 20-60
  - $k_p$ = 10
- Synthetic {uniform, clustered} datasets
  - 8-32d, 3M-12M objects
- Real datasets
  - VEC: 1M 45-dim vectors of color image features
  - CovType: 581K 54-dim instances of forest Covertype data

Construction Cost
- Mainly depends on super-peer topology
- One-time cost!
- Approx. 1.5MB per super-peer
[Figure: total construction cost (MB) versus $DEG_{sp}$ (4 to 7), for $N_{sp}$ = 200 and $N_{sp}$ = 400]

Range Queries: Response Time ($N_{sp}$ = 200, $N_p$ = 2000, n = 1M, d = 16)
- Increases only slightly with cardinality
- Higher response time in clustered dataset
  - Most results come from the same network paths, causing delays
[Figure: response time (sec) versus cardinality (x10^6), network transfer rate 4KB/sec; series: Uniform k=120, Uniform k=60, Clustered k=120, Clustered k=60]

Range Queries: Success Ratio
- Clustered dataset ($N_{sp}$ = 200, $N_p$ = 2000, n = 1M)
- Success ratio = how many of the contacted peers (super-peers) returned results
[Figure: success ratio versus query selectivity (x10^-5, from 2 down to 0.33); series: SP d=8, SP d=32, P d=8, P d=32]

k-NN Queries: Overestimation (%)
- VEC dataset ($N_{sp}$ = 200, $N_p$ = 2000)
- LE better (initially)
- GE becomes better with increasing $k_{sp}$
[Figure: overestimation (%) for LE/RE and GE/RE, with k in {50, 100} and $k_{sp}$ in {5, 10, 15}]

Conclusions & Further Work
- SIMPEER
  - A metric-based framework for P2P similarity search
  - Utilizes a three-level clustering scheme
  - Support for range and k-NN query processing
  - Distributed statistics
- Further work
  - Extension for non-vector-based data representations
  - Devise an approach that deals with uniform data distributions in a better way

Thank you for your attention!
More info: http://www.db-net.aueb.gr/cdoulk/
cdoulk@aueb.gr