Integrating rankings: Problem statement

Similar documents
Combining Fuzzy Information - Top-k Query Algorithms. Sanjay Kulhari

modern database systems lecture 5 : top-k retrieval

Optimal algorithms for middleware

Combining Fuzzy Information: an Overview

The interaction of theory and practice in database research

Comparison of of parallel and random approach to

Optimal Aggregation Algorithms for Middleware

IO-Top-k at TREC 2006: Terabyte Track

Information Retrieval Rank aggregation. Luca Bondi

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

Combining Fuzzy Information from Multiple Systems*

Incomplete Information: Null Values

CS264: Homework #1. Due by midnight on Thursday, January 19, 2017

Peter Gurský. Institute of Computer Science, Faculty of Science.

Direct Rendering of Trimmed NURBS Surfaces

Lecture 23: Priority Queues, Part 2 10:00 AM, Mar 19, 2018

Functions. Def. Let A and B be sets. A function f from A to B is an assignment of exactly one element of B to each element of A.

A Survey of Techniques for Answering Top-k Queries

Representing and Querying Imprecise Data

Secure Multiparty Computation Multi-round protocols

The complement of PATH is in NL

Analyze the obvious algorithm, 5 points Here is the most obvious algorithm for this problem: (LastLargerElement[A[1..n]:

The Expected Performance Curve: a New Assessment Measure for Person Authentication

SDC Platinum includes several utilities to help in creating and reusing searches and reports.

Finding k-dominant Skylines in High Dimensional Space

Homework #3 (Search) Out: 1/24/11 Due: 2/1/11 (at noon)

The Expected Performance Curve: a New Assessment Measure for Person Authentication

Sorting. Divide-and-Conquer 1

Rank Aggregation for Automatic Schema Matching

Missing Value Imputation in Time Series using Top-k Case Matching

Efficient Top-k Algorithms for Fuzzy Search in String Collections

Chapter S:V. V. Formal Properties of A*

Multi-objective Query Processing for Database Systems

. A quick enumeration leads to five possible upper bounds and we are interested in the smallest of them: h(x 1, x 2, x 3) min{x 1

Evaluating Top-k Queries Over Web-Accessible Databases

Module 7 VIDEO CODING AND MOTION ESTIMATION

Arnab Bhattacharya, Shrikant Awate. Dept. of Computer Science and Engineering, Indian Institute of Technology, Kanpur

Exact and Approximate Generic Multi-criteria Top-k Query Processing

Estimating the Quality of Databases

Cost-aware top-k join algorithms

Predictive Indexing for Fast Search

A Detailed Evaluation of Threshold Algorithms for Answering Top-k queries in Peer-to-Peer Networks

Line Arrangements. Applications

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY

Civil Engineering Systems Analysis Lecture XIV. Instructor: Prof. Naveen Eluru Department of Civil Engineering and Applied Mechanics

Test 2 Version A. On my honor, I have neither given nor received inappropriate or unauthorized information at any time before or during this test.

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

1 Overview. 2 Applications of submodular maximization. AM 221: Advanced Optimization Spring 2016

COT 6936: Topics in Algorithms! Giri Narasimhan. ECS 254A / EC 2443; Phone: x3748

Heuristic (Informed) Search

6. Dicretization methods 6.1 The purpose of discretization

COP 4531 Complexity & Analysis of Data Structures & Algorithms

Heuris'c Search. Reading note: Chapter 4 covers heuristic search.

J. Parallel Distrib. Comput.

Virtual views. Incremental View Maintenance. View maintenance. Materialized views. Review of bag algebra. Bag algebra operators (slide 1)

Lecture 10: Clocks and Time

Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-based Approach

Top of Minds Report series Removing duplicates using Informatica PowerCenter

REGION BASED SEGEMENTATION

COMP Analysis of Algorithms & Data Structures

1 Linear programming relaxation

3. Relational Data Model 3.5 The Tuple Relational Calculus

Process Document Creating a Query with Runtime Prompts. Creating a Query with Runtime Prompts. Concept

Computing Classic Closeness Centrality, at Scale

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Best-First Search Minimizing Space or Time. IDA* Save space, take more time

Answers to specimen paper questions. Most of the answers below go into rather more detail than is really needed. Please let me know of any mistakes.

Rank Aggregation for Automatic Schema Matching

Rank-aware Query Optimization

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Efficient Top-k Aggregation of Ranked Inputs

Databases -Normalization I. (GF Royle, N Spadaccini ) Databases - Normalization I 1 / 24

Finite Model Theory and Its Applications

Anytime Measures for Top-k Algorithms

My Query Builder Function

Here is a recursive algorithm that solves this problem, given a pointer to the root of T : MaxWtSubtree [r]

Using the Log Viewer. Accessing the Log Viewer Window CHAPTER

CS3DB3/SE4DB3/SE6M03 TUTORIAL

Introduction to Algorithms

Consensus Answers for Queries over Probabilistic Databases. Jian Li and Amol Deshpande University of Maryland, College Park, USA

Database Applications (15-415)

Unsupervised Sentiment Analysis Using Item Response Theory Models

CSE 421 Applications of DFS(?) Topological sort

Copyright 2000, Kevin Wayne 1

On the Max Coloring Problem

Data Structures for Range Minimum Queries in Multidimensional Arrays

Math 5593 Linear Programming Lecture Notes

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Multi-objective Optimization

IS 709/809: Computational Methods in IS Research. Algorithm Analysis (Sorting)

HOMEWORK 4 SOLUTIONS. Solution: The Petersen graph contains a cycle of odd length as a subgraph. Hence,

Towards Robust Indexing for Ranked Queries

(Refer Slide Time: 00:02:00)

3. Cluster analysis Overview

Optimal Routing and Scheduling in Multihop Wireless Renewable Energy Networks

Quick-Sort fi fi fi 7 9. Quick-Sort Goodrich, Tamassia

Algorithms and applications for answering ranked queries using ranked views

Evaluating Top-N Queries in n-dimensional Normed Spaces

Advanced ARC Reporting

Approximation Algorithms for Clustering Uncertain Data

Transcription:

Integrating rankings: Problem statement Each object has m grades, oneforeachofm criteria. The grade of an object for field i is x i. Normally assume 0 x i 1. Typically evaluations based on different criteria The higher the value of x i, the better the object is according to the ith criterion The objects are given in m sorted lists the ith list is sorted by x i value These lists correspond to different sources or to different criteria. Goal: find the top k objects. L. Libkin 1 Data Integration and Exchange

Example Grade 1 (17, 0.9936) (1352,0.9916) (702,0.9826) (12, 0.3256) Grade 2 2 (235, 0.9996) (12, 0.9966) (8201, 0.9926) (17, 0.406) L. Libkin 2 Data Integration and Exchange

Aggregation Functions Have an aggregation function F. Let x 1,,x m be the grades of object R under the m criteria. Then F (x 1,,x m ) is the overall grade of object R. Common choices for F : min average or sum An aggregation function F is monotone if F (x 1,,x m ) F (x 1,,x m) whenever x i x i for all i. L. Libkin 3 Data Integration and Exchange

Other Applications Information retrieval Objects R are documents. The m criteria are search terms s 1,,s m. The grade x i : how relevant document R is for search term s i. Common to take the aggregation function F to be the sum F (x 1,,x m )=x 1 + + x m. L. Libkin 4 Data Integration and Exchange

Modes of Access Sorted access Can obtain the next object with its grade in list L i cost c S. Random access Can obtain the grade of object R in list L i cost c R. Middleware cost: c S (# of sorted accesses) + c R (# of random accesses). L. Libkin 5 Data Integration and Exchange

Algorithms Want an algorithm for finding the top k objects. Naive algorithm: compute the overall grade of every object; return the top k answers. Too expensive. L. Libkin 6 Data Integration and Exchange

Fagin s Algorithm (FA) 1. Do sorted access in parallel to each of the m sorted lists L i. Stop when there are at least k objects, each of which have been seen in all the lists. 2. For each object R that has been seen: Retrieve all of its fields x 1,,x m by random access. Compute F (R) =F (x 1,,x m ). 3. Return the top k answers. L. Libkin 7 Data Integration and Exchange

Fagin s algorithm is correct Assume object R was not seen its grades are x 1,,x m. Assume object S is one of the answers returned by FA its grades are y 1,,y m. Then x i y i for each i Hence F (R) =F (x 1,,x m ) F (y 1,,y m )=F(S). L. Libkin 8 Data Integration and Exchange

Fagin s algorithm: performance guarantees Typically probabilistic guarantees Orderings are independent Then with high probability the middleware cost is ( k ) O N m N i.e., sublinear But may perform poorly e.g., if F is constant: ( still takes O N m k/n ) instead of a constant time algorithm L. Libkin 9 Data Integration and Exchange

Optimal algorithm: The Threshold Algorithm 1. Do sorted access in parallel to each of the m sorted lists L i.aseach object R is seen under sorted access: Retrieve all of its fields x 1,,x m by random access. Compute F (R) =F (x 1,,x m ). If this is one of the top k answers so far, remember it. Note: buffer of bounded size. 2. For each list L i,letˆx i be the grade of the last object seen under sorted access. 3. Define the threshold value t to be F (ˆx 1,,ˆx m ). 4. When k objects have been seen whose grade is at least t, then stop. 5. Return the top k answers. L. Libkin 10 Data Integration and Exchange

Threshold Algorithm: correctness and optimality The Threshold Algorithm is correct for every monotone aggregate function F. Optimal in a very strong sense: it is as good as any other algorithm on every instance any other algorithm means: except pathological algorithms as good means: within a constant factor pathological means: making wild guesses. L. Libkin 11 Data Integration and Exchange

Wild guesses can help An algorithm makes a wild guess if it performs random access on an object not previously encountered by sorted access. Neither FA nor TA make wild guesses, nor does any natural algorithm. Example: The aggregation function is min; k =1. LIST L 1 (1, 1) (2, 1) (3, 1) (n+1, 1) (n+2, 0) (n+3, 0) (2n+1, 0) LIST L 2 (2n+1, 1) (2n, 1) (2n-1, 1) (n+1, 1) (n, 0) (n-1, 0) (1, 0) L. Libkin 12 Data Integration and Exchange

Threshold Algorithm as an approximation algorithm Approximately finding top k answers. For ε>0, anε-approximation of top k answers is a collection of k objects R 1,,R k so that for each R not among them, (1 + ε) F (R i ) F (R) Turning TA into an approximation algorithm: Simply change the stopping rule into: When k objects have been seen whose grade is at least t 1+ε, then stop. L. Libkin 13 Data Integration and Exchange