Addendum to the proof of log n approximation ratio for the greedy set cover algorithm

Size: px

Start display at page:

Download "Addendum to the proof of log n approximation ratio for the greedy set cover algorithm"

Maximilian McLaughlin
6 years ago
Views:

1 Addendum to the proof of log n approximation ratio for the greedy set cover algorithm (From Vazirani s very nice book Approximation algorithms ) Let x, x 2,...,x n be the order in which the elements are covered (break ties arbitrarily) Lemma: c(x i )<=C*/(n-i+) Proof. Suppose we are selecting a set that will cover x i. The remaining elements can be covered with C* sets. Thus there largest set in C*, the optimal solution, will cover at least (n-i+)/c* elements. I.e., The cost per element is at most C*/(n-i+) Thus Theorem. The approximation cost is at most H(n) Proof. The cost is at most the sum of the costs c(x i ) n i= c( x ) i C * C * n i + n i= n C * H ( n) Proving the bound H(s) is more tedious.

2 Finding fragments of orders, partial orders, and total orders from - data Themes of the chapter Given a / a matrix Rows: observations, columns variables Can one find ordering information for the observations? Without additional assumptions, no; with some assumptions, yes Paleontological application: find orders for subsets of fossil sites a good ordering for (a subset of) the rows is one where the s are consecutive Also other applications 2

3 Themes of the chapter Finding small total orders (fragments) from - data Local models/patterns Finding partial orders from - data A global model Find total orders for - data A global model Finding small total orders (fragments) from - data Model: a subset of observations and a total order on the subset Task: find all such models fulfilling certain criteria Algorithm: a pattern discovery algorithm (levelwise search) 3

4 Finding partial orders from - data Model: a partial order over all observations Loglikelihood: proportional to the number of cases the observed occurrence patterns violate the continuity of species Prior: prefer partial orders that are as specific as possible Task: find a model with high likelihood * prior Algorithm: Find fragments and use heuristic search to build a good partial order Find total orders for - data Model: a total order Loglikelihood: how many cases the observed occurrence patterns violate the continuity of species Task: find the best total order for the observations Algorithm: spectral method 4

5 Type of data - data, large number of variables Examples: Occurrences of words in documents Occurrences of species in paleontological sites Occurrence of a particular motif in a promoter region of a gene Typically the data is sparse: only a few s Asymmetry between s and s A means that there really was something A has less information (in a way) Example Paleontological data from the NOW (Neogene Mammal Database) Fossil sites (one location, one layer) Each site contains fossils that are about the same age (+- Ma) Variables: species/genera A is reasonably certain A might be due to several reasons The species was not extant at that time The remains did not fossilize The tooth was overlooked 5

6 Site-genus -matrix site genus Background knowledge Species do not vanish and return An ordering of the sites with a between s is improbable time 6

7 Given data about the occurrences of genera in fossil sites Want to find an ordering in which occurrences of a genus are consecutive Lazarus count: how many s are between s Example: seriation in paleontological data Site Genus A better ordering A smaller Lazarus count 7

8 Find small total orders (fragments) from - occurrence data Fragment: a total ordering of a subset of observations E.g., c<a<d<f Intuitive interpretation: For most variables the sequence of observations has no pattern of the form c a d f Fragments of order / data set Fragment of order f is a sequence of observations t < t 2 < t 3 < < t k An variable A disagrees with fragment f, if for some i<j<h we have t i (A)=t h (A)=, but t j (A)=. Otherwise t agrees with f: Then the column for A has the form for the observations in f 8

9 Example A B C D E F a<b<c<d: dis ag dis dis b<d<f<a: ag dis ag ag What is a good fragment of order? A sequence f of rows, say, u<v<w<t Da(f): the number of variables disagreeing with the ordering Fr(f): the number of variables having at least 2 ones in the rows of f A good fragment has high Fr(f) and low Da(f) 9

10 Problem statement Given thresholds σ and γ Find all fragments of order f such that in the data Fr(f) > σ Da(f) < γ and all subfragments of f satisfy these and the fragment has smaller Da value than its peers Any other fragments from the same set of objects Algorithm How to find fragments with the specific properties? Start from fragments of length 2 No disagreements are possible Only the bound Fr(f)>σ needs to be tested Iteration: Assume fragments of length k- are known Then we can build candidate fragments of length k Continue until no new patterns are found A complete algorithm: all fragments will be found

11 Monotonicity property Fragment t < t 2 < t 3 < < t k can satisfy the requirements only if all subfragments of length k- satisfy them All these have to be in the collection of fragments of size k- The levelwise algorithm Algorithm Find F2, fragments of size 2 C = all triples A<B<C such that A<B, A<C, and B<C are in F2 k 3 While C is not empty compute Da(f) for all f in C Fk {f in C Fr(f)> σ and Da(f)< γ } k k+ C all fragments of length k such that all the subfragments of length k- are in Fk

12 Complexity of the algorithm Potentially exponential in the number of variables F+C = the size of the answer + all the candidates Proportional to F+C n m for a matrix with n rows and m columns Too low values of σ or too high values of γ will lead to huge outputs Experimental results Data about students and courses Columns: students Rows: courses D(s,c)= if student s has taken course c Here we know the true ordering Or actually two: official ordering Real order in which the student took the courses 2

13 Part of the recommendations Discovered fragment f Fr(f)=36, Da(f)=3.2% Results 3

14 Results (paleontological data) Fragments for sites Or transpose the matrix: fragments for species Sequences of sites such that there are very few Lazarus events Provide ways of looking at projections of the data Can be used to find partial orders Example: words in documents Represent collections of documents as term vectors Which words occur () in the document or not () Very large dimensionality, lots of observations 4

15 Example from Citeseer (in 25) What does this tell us about these terms? Databases and selectivity estimation together do not occur without queries Databases < queries < selectivity estimation Old (25) example from Google Scholar prior distribution MCMC 5, documents prior distribution MCMC 295 documents prior distribution MCMC 5 documents prior distribution MCMC 65 documents prior < distribution < MCMC 5

16 Example from Google Scholar, Nov. 24, 27 prior distribution MCMC 2,22, documents prior distribution MCMC 6,3 documents prior distribution MCMC 6,3 documents prior distribution MCMC,23 documents prior < distribution < MCMC An aside: have the ratios of the frequencies changed? Query Ratio p d m p d m p d m p d m

17 Next theme Find small total orders from - data Finding partial orders from - data Find total orders for - data Finding partial orders from - data Model: a partial order over all observations Loglikelihood: proportional to the number of cases the observed occurrence patterns violate the continuity of species Prior: prefer partial orders that are as specific as possible Task: find a model with high likelihood * prior Algorithm: Find fragments and use heuristic search to build a good partial order 7

18 Why partial orders? Determining the ages of sites is difficult Radioisotope methods apply only to few sites In paleontology the so-called MN system: 8 classes for the last 25 Ma Classes are assigned by ad hoc methods Searching for a total order might not be a good idea The MN system is a partial order Finding partial orders from data How to find a partial order that fits well with the data? What does this mean? 8

19 What is a good partial order? The Lazarus count of a species with respect to a partial order P: For how many sites the species was extinct at the site, but extant before and after it (as determined by P) The same definition as for total orders A good partial order has small Lazarus count Can be formulated as a likelihood (a Lazarus event is a false positive) Laz No Laz No Laz 9

20 What is a good partial order? Find a partial order that has a low Lazarus count The trivial partial order has Lazarus count Want to find a partial order that is specific (close to a total order) and agrees with the data Measures of specificity: the number of linear extensions of P (hard to compute) number of edges in P Find a partial order that has high specificity * likelihood Algorithm for finding partial orders Compute fragments from the unordered data E.g., a < d < b < e < f and b < e < c and b < a < c < f and Form a precedence matrix: in what fraction of the fragments does a precede b Form a partial order that approximates the precedence matrix (heuristic search) 2

21 Fragments and reverse fragments The fragment generation will produce for each fragment f also its reverse f R The pairwise precedence matrix would be useless Divide the fragments into two classes (graph cutting) Discard one class Build the precedence matrix From precedence matrix to partial order Heuristic search Add edges to the partial order so that the match with the precedence matrix improves Keep track of transitivity Difficult (and interesting) algorithmic problem Empirical results look good Very recent theoretical results 2

22 22

23 Early Miocene Transfer to late Miocene 23

24 Themes of the talk Find small total orders from - data Finding partial orders from - data Find total orders for - data Finding good total orders for a matrix Given a site-genus matrix What is a good total ordering for the rows? One in which there are as few Lazarus events as possible Model class: total orders Loglikelihood proportional to the number of Lazarus events 24

25 How to find such an ordering of the rows? If there is an ordering that has no Lazarus events, it can be found in linear time (Booth & Lueker) consecutive ones property But normally there are (lots of) Lazarus events Finding good total orders for a matrix The problem of finding the best ordering of the matrix is NP-hard Finding whether there is a submatrix of size k that has no Lazarus events is NP-hard The fragment method finds such submatrices Local search, traveling salesperson approaches Spectral methods 25

26 Spectral ordering for finding good total orders for a matrix Spectral ordering Compute a similarity measure s(i,j) between sites (e.g., dot product) Laplacian L(i,j) The eigenvector v corresponding to the second smallest eigenvalue of L satisfies Maps the points to -d, keeping similar points close to each other The values v i can be used to order the points 26

27 Empirical observation The eigenvector seems to minimize also Lazarus events Even better than some combinatorial algorithms Why? No really good intuitive theoretical understanding Related to mixing time of Markov chains etc. Site-genus -matrix 27

28 After spectral ordering Fortelius, Jernvall, Gionis, Mannila, Paleobiology 32 (26) 28

29 29

30 Questions Computational Why does it work so well? How well does it actually work (what is the smallest number of Lazarus events for this data?) How to interpret the coefficients? Paleontological Fully based on the occurrence matrix (excellent and bad) Site-species data is only one type of data; how to use other types of data for the ordering? Rough estimates of the sizes of the model classes N observations Fragments of size at most k N individual fragments 2 N k k sets of fragments Partial orders Total orders ( ) 2 O N 2 N! 3

31 Concluding remarks General task: finding order from unordered data Here using species continuity as the additional information Other applications are possible Model classes Fragments Partial orders Total orders Lots of open questions The unreasonable effectiveness of spectral methods on discrete optimization task Approximation guarantees Fragments from other applications MDL description of sequences via partial orders Etc. 3

32 References A. Gionis, T. Kujala and H. Mannila: Fragments of order. ACM SIGKDD 23, p A. Ukkonen, M. Fortelius, H. Mannila: Finding partial orders from unordered - data. ACM SIGKDD 25, p M. Fortelius, A. Gionis, J. Jernvall, H. Mannila, Spectral Ordering and Biochronology of European Fossil Mammals. Paleobiology 32, 2, (26). K. Puolamäki, M. Fortelius, H. Mannila: Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods. PLoS Comput Biol 2(2): e6 A. Gionis, H. Mannila, K. Puolamaki, and A. Ukkonen, Algorithms for Discovering Bucket Orders from Data, 2th International Conference on Knowledge Discovery and Data Mining (KDD) 26, p

From Data to Knowledge Research Unit

Chapter 17 From Data to Knowledge Research Unit Heikki Mannila, Jaakko Hollmén, Kai Puolamäki, Gemma Garriga, Jouni Seppänen, Robert Gwadera, Sami Hanhijärvi, Hannes Heikinheimo, Samuel Myllykangas, Antti