Ranking Clustered Data with Pairwise Comparisons

Kevin Kowalski

1. INTRODUCTION

Background. Machine learning often relies heavily on being able to rank instances in a large data set by some measure of relative fitness, but in many settings the only reasonable way to acquire information about the ranking is to make pairwise comparisons between instances. Directly comparing instances often requires human intervention, which can be expensive in terms of time and other resources. Work by Jamieson and Nowak [JN11] mitigates this problem by providing an algorithm that minimizes the number of pairwise comparisons necessary to rank a data set, assuming that the set is structured in a certain way. In particular, [JN11] assumes that the set can be embedded in a Euclidean space and that each point's fitness decreases with its distance from an unknown reference point. The algorithm presented in that paper achieves a full ranking of the samples with Θ(d log n) comparisons in the average case, where d is the dimensionality of the Euclidean space and n is the number of samples. This is a nontrivial improvement over general-purpose comparison-based sorting algorithms, which require Ω(n log n) pairwise comparisons even in the average case. The work also presents a version of the algorithm for a noisy setting, where each query to the comparison oracle has some probability of returning an incorrect result (and keeps returning the same result on further queries). This version returns a full ranking of some subset of the samples that is consistent with all query responses and uses Θ(d log^2 n) queries in the average case, assuming a constant probability of error.

Contribution. We generalize the algorithm of [JN11] to data sets whose rankings follow a more general form.
Specifically, we expect the data to be partitioned into k clusters, where each cluster has its own unknown reference point and the distance from each data point to its corresponding reference point determines its fitness. We show that this extended algorithm uses O(kd log n + n log k) comparisons in expectation in the noiseless case (Proposition 1), and additionally that it performs robustly in both the noiseless and noisy cases, on points drawn uniformly from a hypercube and on real-world product review data, respectively.

2. METHODS

The algorithm we present in this report relies on the algorithms of [JN11] as subroutines in both the noiseless and noisy settings, so we begin with a brief characterization of each. For the remainder of this section, we assume that S = {s_1, ..., s_n} is the set of data points, s_i ≺ s_j denotes the event that s_i is ranked below s_j, S = S_1 ∪ ... ∪ S_k is a partitioning of S into k clusters, and d is the dimensionality of the points in S.

Noiseless active ranking with one reference point. In the noiseless case, the procedure works by running any standard comparison-based sorting algorithm¹ on the set of points to be ranked, with the following exception. Whenever a comparison is to be made, the procedure first checks whether the outcome of the comparison can be imputed from the outcomes of the comparisons it has already made. If the outcome cannot be imputed, the comparison oracle is queried; otherwise the imputed value is used instead. More specifically, the outcome of a comparison s_k ≺ s_l is ambiguous if there exists a ranking consistent with the extant comparisons that ranks s_k below s_l, and another that ranks s_l below s_k. Equivalently, a comparison is ambiguous if there exist candidate reference points that produce each of these rankings. For any reference point r = (r_1, ..., r_d) and any points s_i, s_j ∈ S, let H_ij be the hyperplane normal to and bisecting the line segment between s_i and s_j.
Then, s_i is ranked above s_j in the ranking induced by r if and only if r lies on the side of H_ij closer to s_i (since H_ij divides the space into a half that is closer to s_i and a half that is closer to s_j). Hence, a candidate reference point r is consistent with the extant comparisons if, for every comparison s_i ≺ s_j whose outcome is known, r lies on the correct side of H_ij. This defines a set of linear constraints on r, and given a comparison s_k ≺ s_l the goal is then to determine whether these constraints force the comparison to take a particular value. We can determine whether it is possible for s_k to be ranked below s_l by adding the linear constraint encoding s_k ≺ s_l to the rest and invoking a standard linear programming algorithm to determine whether a feasible r exists. We can determine whether it is possible to rank s_k above s_l in exactly the same way, so together these two checks tell us whether s_k ≺ s_l is ambiguous, as desired.

¹ For our experiment, we used TimSort, a combination of merge sort and insertion sort that is the default sorting algorithm in Python.
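The ambiguity test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the report: the function names (`closer_constraint`, `satisfies`, `feasible_1d`, `is_ambiguous_1d`) are hypothetical, and the feasibility check is written for d = 1, where the linear program reduces to intersecting half-lines. In higher dimensions the same constraints would be handed to a linear programming solver instead. The constraint itself follows from expanding |r - near|^2 < |r - far|^2, which gives 2(far - near) . r < |far|^2 - |near|^2.

```python
def closer_constraint(near, far):
    """Return (a, b) such that a . r < b exactly when r is closer to `near` than to `far`."""
    a = [2.0 * (f - n) for n, f in zip(near, far)]
    b = sum(f * f for f in far) - sum(n * n for n in near)
    return a, b


def satisfies(r, constraint):
    """Check whether a candidate reference point r lies on the required side of the hyperplane."""
    a, b = constraint
    return sum(ai * ri for ai, ri in zip(a, r)) < b


def feasible_1d(constraints):
    """Exact feasibility test for d = 1 (stand-in for an LP solver): intersect half-lines a*r < b."""
    lo, hi = float("-inf"), float("inf")
    for (a,), b in constraints:
        if a > 0:
            hi = min(hi, b / a)
        elif a < 0:
            lo = max(lo, b / a)
        elif not 0 < b:  # a == 0: the constraint degenerates to 0 < b
            return False
    return lo < hi


def is_ambiguous_1d(known, s_k, s_l):
    """A comparison is ambiguous if both outcomes admit a feasible reference point."""
    below = feasible_1d(known + [closer_constraint(s_l, s_k)])  # s_k ranked below s_l
    above = feasible_1d(known + [closer_constraint(s_k, s_l)])  # s_k ranked above s_l
    return below and above


# Hypothetical 1-D data: four points on a line.
s1, s2, s3, s4 = [0.0], [1.0], [4.0], [5.0]
known = [closer_constraint(s2, s3)]       # known outcome: s2 is ranked above s3, so r < 2.5
print(is_ambiguous_1d(known, s1, s2))     # True: r may lie on either side of 0.5
print(is_ambiguous_1d(known, s3, s4))     # False: r < 2.5 forces s3 above s4
```

In the second query the outcome is imputable, so a sorter built on this check would skip the oracle call for that pair.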

Noisy active ranking with one reference point. In the noisy setting, queries to the comparison oracle have some probability of returning an erroneous result, and this error is persistent across multiple queries regarding the same comparison. Hence, it is impossible in the general case to fully rank a list of samples, so our goal instead becomes to return a full ranking on a reasonably large subset of the original list that is consistent with all the query responses we receive from the oracle. Working in this setting necessitates two major changes to the noiseless procedure:

1. The underlying comparison-based sorting algorithm must be insertion sort. In order to create the ranking on a subset, the algorithm must build the subset one element at a time while ensuring that query results are consistent with all elements already in the subset. The structure of insertion sort is the only natural one for this purpose.

2. Suppose that we already have a partial ranking on some subset of the first l - 1 samples, and we want to add s_l to the ranking. If the outcome of a query s_k ≺ s_l is imputable, then we take the imputed value as truth. If the outcome is ambiguous, then instead of querying the oracle and trusting its response, we create a voting set of exactly R samples s_j, where R is a parameter of the algorithm. Each sample votes for the outcome s_k ≺ s_l if, after querying the oracle on s_k ≺ s_j and s_j ≺ s_l (or imputing the values of these queries if they are imputable), we get that s_k ≺ s_j ≺ s_l, or votes for the opposite outcome if s_l ≺ s_j ≺ s_k. The sample abstains if s_l, s_k ≺ s_j or vice versa. The plurality vote determines the outcome we accept as truth, and in the case of a tie we directly query the oracle on s_k ≺ s_l. In order for a sample s_j to be considered for the voting set, it must be possible for s_j to lie between s_k and s_l (or else the sample would simply abstain).
Another way of stating this condition is that the outcome of either s_k ≺ s_j or s_j ≺ s_l must be ambiguous with respect to all the query results we have accepted as true². If there are fewer than R samples that meet this criterion, then we likely will not be able to obtain enough data to rank s_l accurately, so we delete s_l from the ranking.

² The definition of ambiguous in this context is actually underspecified in [JN11], but this interpretation makes the algorithm perform about as well as the results in that paper lead us to expect.

Active ranking with multiple reference points. In the setting of multiple reference points, we are given S = S_1 ∪ ... ∪ S_k as input, where each data point in S_i is ranked according to its distance to the i-th reference point. A high-level description of our algorithm is as follows. Given S, we pass each cluster S_i to the single-reference-point ranking algorithm in sequence, obtaining a ranking (or a partial ranking in the noisy case) on the elements of each cluster. Then, we merge the k rankings together with k - 1 invocations of our pairwise merge procedure. These pairwise merges are organized into a binary tree: if σ_1, ..., σ_k are the rankings of the clusters, then first we merge σ_1 and σ_2 to make σ_{1,2}, then σ_3 and σ_4 to make σ_{3,4}, and so on³ until σ_{k-1} and σ_k make σ_{k-1,k}. After that we merge σ_{1,2} and σ_{3,4} to make σ_{1,2,3,4}, and this process continues until we get the final ranking σ_{1,...,k}.

³ If k is odd, then σ_k gets merged with an arbitrary other ranking.

In the noiseless case, the pairwise merges work exactly like the standard merge procedure from merge sort: in each iteration, we compare the least element of the first list to the least element of the second, then remove the lesser of the two from the corresponding list and add it to the sorted list we are building. In this case, we can prove an upper bound on the expected number of pairwise comparisons the algorithm makes.

Proposition 1.
Let S = S_1 ∪ ... ∪ S_k be such that |S_i| = n_i for i ∈ [k], let σ be a ranking on S that is inducible by some k reference points, and let M(σ) denote the number of pairwise comparisons the algorithm makes to produce a full ranking on S. Then, E[M(σ)] = O(kd log n + n log k), where the expectation is over σ drawn from the uniform distribution over rankings inducible by k reference points, d is the dimensionality of each point in S, and n = Σ_i n_i.

Proof. By [JN11], the expected number of comparisons the noiseless single-reference-point algorithm makes on an input of size n_i is at most cd log n_i for some constant c. The expected number of comparisons that the multiple-reference-point algorithm makes is then Σ_i cd log n_i plus the expected number of comparisons for the merging procedure. By the concavity of the logarithm (Jensen's inequality),

Σ_i cd log n_i ≤ ckd log(n/k) = O(kd log n).

Any pairwise merge of lists of sizes a and b uses at most a + b comparisons, since it takes at most one comparison to add each element to the fully sorted list, and since the pattern of merges forms a binary tree, each sample undergoes at most ⌈log_2 k⌉ merges. Hence, the entire merge procedure uses at most n ⌈log_2 k⌉ = O(n log k) comparisons. Adding this to the expected number of comparisons used to sort each cluster gives a total of O(kd log n + n log k), as desired.

In the noisy case, the algorithm takes an additional parameter R' that serves a purpose analogous to that of R in the single-reference-point algorithm. Since we cannot trust the result of the comparison between the least element of the first list and the least element of the second, after getting the result we create a voting set of size R'. Let A and B be the two lists we are merging, and assume that they are sorted from least to greatest (so in particular, A[0] and B[0] are the least elements of the two lists). Without loss of generality, assume that the initial query returns that A[0] ≺ B[0].
Then, the voting set for the query consists of the elements {B[0], B[1], ..., B[R' - 1]}, or all the elements of B if B has fewer than R' elements. For each element B[j] in the voting set, we query the oracle on whether A[0] ≺ B[j]; each positive result counts as a vote for the outcome A[0] ≺ B[0], while each negative result counts as a vote for B[0] ≺ A[0]. The plurality vote determines the outcome we accept as truth, and in case of a tie the outcome is chosen randomly. The intuition behind this voting procedure is that if B[0] ≻ A[0], then the remaining elements of the voting set will very likely vote for the correct outcome, and if A[0] ≻ B[0], then one of two things will happen. If there are many (say, more than R') elements of B ranked below A[0], then the voting set will likely vote overwhelmingly for the correct outcome, but if there are few (say, fewer than R'/2), then the voting set will likely vote for the incorrect outcome. However, in this latter case, accepting the incorrect outcome is unlikely to create many inversions between B[0] and the elements of A, since B[0] and A[0] are likely to be close in the true ranking.

3. RESULTS AND INTERPRETATION

We ran experiments to evaluate the quality of our multiple-reference-point sorter in both the noiseless and noisy cases.

Noiseless experiment. For this experiment we adapted the noiseless experiment of [JN11] to the case of two reference points. In each trial, S was initialized to contain n = 100 points drawn uniformly at random from the hypercube [0, 1]^d, and two reference points were drawn uniformly at random from the same distribution. The partition S_1 was defined to be the subset of S consisting of those points closer to the first reference point than to the second, and S_2 was defined to be the remaining points. For each value of d = 10, 20, ..., 100, we ran 20 trials, and the mean numbers of queries the algorithm used are plotted in Figure 1. As in [JN11], the number of queries used approaches an asymptote as the dimensionality increases. In the k = 1 case, the algorithm is exactly TimSort except that the values of certain comparisons are imputed wherever possible, so it is impossible for the algorithm to make more queries than TimSort on any given input. As dimensionality increases, values become harder to impute until imputation is completely impossible in the case of d = 100 (since in this case d = n, and by cleverly choosing the reference point one can make any ordering possible). For k = 2 and k = 4, the algorithm still outperforms the TimSort baseline for small values of d, but does worse when the dimensionality is high.
This worsening is due to the extra overhead incurred by the merge procedure. The overall numbers of queries are well within the bounds predicted by Proposition 1.

Noisy experiment. We would have liked to use the same data set as in [JN11] to evaluate our noisy algorithm, but the Aural Sonar data set appears to be no longer available. Instead, critical reviews for both A Game of Thrones and The Fault in Our Stars were scraped from Amazon and ranked according to helpfulness. For each review, Amazon gives users the option to mark the review as either helpful or unhelpful, so the helpfulness score is defined as the ratio of helpful votes to total votes. All reviews with fewer votes than a certain threshold value were excluded, giving us a total of 33 reviews for each book. After scraping, each review was mapped onto a bag-of-words representation with nltk. To reduce the dimensionality of this representation, the samples were further mapped into [0, 1]^10 and [0, 1]^15 using non-metric multidimensional scaling, a dimensionality reduction technique that preserves the relative distances (in this case, Euclidean distance in the bag-of-words space) between points. A good reference for this technique can be found in [CC08]. Implicitly, we are assuming that helpful reviews for the same book are more similar to each other in word choice than to unhelpful reviews, which means that we might be able to induce something similar to the helpfulness ranking with a reference point in the low-dimensional Euclidean space. This is not obviously a safe assumption to make, but it is borne out by the strength of our results. For each of the d = 10 and d = 15 representations, we ran our multiple-reference-point sorter with R = 11 and R' = 5 for 20 iterations, where in each iteration the sorter received the samples in a random order.
The number of queries used, the number of inversions between the partial ranking output by the algorithm and the correct ranking on those elements, and the number of elements in each partial order are all plotted in Figure 2, relative to the maximum possible values for each of these. Though the performance of our algorithm on this noisy data set is not directly comparable to the performance of the single-point algorithm on the Aural Sonar data set, the proportion of inversions present is similar (approximately 40% for d = 10 and 35% for d = 15), while the proportion of queries used is much higher (approximately 30% for d = 10 and 40% for d = 15, compared to 15% in [JN11] for d = 2). There are a number of factors that influence the numbers we obtain.

1. Noise in the data set. The Aural Sonar data set used in [JN11] was likely much more amenable to being embedded in a Euclidean space than our ad hoc solution. Previous work in [PPA06] suggests that a faithful 2-dimensional embedding of the data set exists, which allows the algorithm to work with a very small number of comparisons. Our data set, on the other hand, has only intuition backing up its suitability.

2. The difficulty of the problem. The two books we chose have reviews with very similar average helpfulness ratings, so the true ranking on the merged list requires a great deal of interleaving. In this case, there are many more possible orderings for the samples than in the case where all the samples have a single reference point, so many more comparisons would be necessary to tease apart the ranking to a similar degree of accuracy.

There are no values for the proportion of ranked elements in [JN11], so it is difficult to say how well our algorithm does in that respect, but we seem to rank close to all of the elements in both the d = 10 and d = 15 cases.
Between the d = 10 and d = 15 cases, the d = 15 case sorts a greater proportion of the elements with a smaller proportion of inversions, though it requires a substantially greater number of queries to do so. This is unsurprising: embedding the points into a higher-dimensional space allows for a more faithful representation of distances in the original space, which translates to a lower rate of error. On the flip side, it is more difficult to impute values in a higher-dimensional space, so we more often need to make queries to determine the value of a comparison.
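The inversion and coverage proportions discussed in this section can be made concrete with a short Python sketch. The function names and the toy data are hypothetical, and the normalization assumed here (dividing by the m(m - 1)/2 pairs among the m ranked elements, and by the total number of samples n) is one reading of "relative to the maximum possible values", not a detail confirmed by the report.

```python
from itertools import combinations


def inversion_proportion(output_ranking, true_position):
    """Fraction of pairs in the output ranking that are ordered differently in the true ranking.

    output_ranking: items in the order produced by the sorter (possibly a subset of all items).
    true_position: dict mapping each item to its position in the correct ranking.
    """
    pairs = list(combinations(output_ranking, 2))
    if not pairs:
        return 0.0
    inverted = sum(1 for a, b in pairs if true_position[a] > true_position[b])
    return inverted / len(pairs)


def ranked_proportion(output_ranking, n_total):
    """Fraction of the original samples that survive into the partial ranking."""
    return len(output_ranking) / n_total


# Hypothetical example: four reviews, one of which ("b") was dropped by the sorter.
truth = {"a": 0, "b": 1, "c": 2, "d": 3}
partial = ["c", "a", "d"]
print(inversion_proportion(partial, truth))  # 1 of 3 pairs is inverted
print(ranked_proportion(partial, len(truth)))  # 0.75
```

Counting pairs directly is quadratic in the number of ranked elements, which is harmless at the scale of 66 reviews; a merge-sort-based count would be the usual choice for larger inputs.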

Figure 1: Mean and standard deviation of the number of queries are plotted against the dimensionality of the points with n = 100. The dashed line represents the mean number of queries that TimSort uses on the same data sets.

Figure 2: Mean values for various proportions are plotted here. The bars represent the maximum and minimum values of the proportions over 20 trials.

4. REFERENCES

[CC08] M.A.A. Cox and T.F. Cox. Multidimensional scaling. In Handbook of data visualization, pages Springer-Verlag,

[JN11] Kevin G. Jamieson and Robert D. Nowak. Active ranking using pairwise comparisons. CoRR, abs/ ,

[PPA06] S. Philips, J. Pitton, and L. Atlas. Perceptual feature identification for active sonar echoes. In OCEANS 2006, pages 1-6, Sept 2006.


More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Overview of Clustering

Overview of Clustering based on Loïc Cerfs slides (UFMG) April 2017 UCBL LIRIS DM2L Example of applicative problem Student profiles Given the marks received by students for different courses, how to group the students so that

More information

Non-Bayesian Classifiers Part I: k-nearest Neighbor Classifier and Distance Functions

Non-Bayesian Classifiers Part I: k-nearest Neighbor Classifier and Distance Functions Non-Bayesian Classifiers Part I: k-nearest Neighbor Classifier and Distance Functions Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551,

More information

Linear Models. Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines

Linear Models. Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines Linear Models Lecture Outline: Numeric Prediction: Linear Regression Linear Classification The Perceptron Support Vector Machines Reading: Chapter 4.6 Witten and Frank, 2nd ed. Chapter 4 of Mitchell Solving

More information

Algorithms, Games, and Networks February 21, Lecture 12

Algorithms, Games, and Networks February 21, Lecture 12 Algorithms, Games, and Networks February, 03 Lecturer: Ariel Procaccia Lecture Scribe: Sercan Yıldız Overview In this lecture, we introduce the axiomatic approach to social choice theory. In particular,

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information

1 (15 points) LexicoSort

1 (15 points) LexicoSort CS161 Homework 2 Due: 22 April 2016, 12 noon Submit on Gradescope Handed out: 15 April 2016 Instructions: Please answer the following questions to the best of your ability. If you are asked to show your

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

Multi-Cluster Interleaving on Paths and Cycles

Multi-Cluster Interleaving on Paths and Cycles Multi-Cluster Interleaving on Paths and Cycles Anxiao (Andrew) Jiang, Member, IEEE, Jehoshua Bruck, Fellow, IEEE Abstract Interleaving codewords is an important method not only for combatting burst-errors,

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

Spatial Information Based Image Classification Using Support Vector Machine

Spatial Information Based Image Classification Using Support Vector Machine Spatial Information Based Image Classification Using Support Vector Machine P.Jeevitha, Dr. P. Ganesh Kumar PG Scholar, Dept of IT, Regional Centre of Anna University, Coimbatore, India. Assistant Professor,

More information

Efficient Pairwise Classification

Efficient Pairwise Classification Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany Abstract. Pairwise classification is a class binarization

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 24: Online Algorithms

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 24: Online Algorithms princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 24: Online Algorithms Lecturer: Matt Weinberg Scribe:Matt Weinberg Lecture notes sourced from Avrim Blum s lecture notes here: http://www.cs.cmu.edu/

More information

CSC 447: Parallel Programming for Multi- Core and Cluster Systems

CSC 447: Parallel Programming for Multi- Core and Cluster Systems CSC 447: Parallel Programming for Multi- Core and Cluster Systems Parallel Sorting Algorithms Instructor: Haidar M. Harmanani Spring 2016 Topic Overview Issues in Sorting on Parallel Computers Sorting

More information

Matching and Alignment: What is the Cost of User Post-match Effort?

Matching and Alignment: What is the Cost of User Post-match Effort? Matching and Alignment: What is the Cost of User Post-match Effort? (Short paper) Fabien Duchateau 1 and Zohra Bellahsene 2 and Remi Coletta 2 1 Norwegian University of Science and Technology NO-7491 Trondheim,

More information

Consensus, impossibility results and Paxos. Ken Birman

Consensus, impossibility results and Paxos. Ken Birman Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

COMP Data Structures

COMP Data Structures COMP 2140 - Data Structures Shahin Kamali Topic 5 - Sorting University of Manitoba Based on notes by S. Durocher. COMP 2140 - Data Structures 1 / 55 Overview Review: Insertion Sort Merge Sort Quicksort

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Efficient Optimal Linear Boosting of A Pair of Classifiers

Efficient Optimal Linear Boosting of A Pair of Classifiers Efficient Optimal Linear Boosting of A Pair of Classifiers Victor Boyarshinov Dept Computer Science Rensselaer Poly. Institute boyarv@cs.rpi.edu Malik Magdon-Ismail Dept Computer Science Rensselaer Poly.

More information

MATH3016: OPTIMIZATION

MATH3016: OPTIMIZATION MATH3016: OPTIMIZATION Lecturer: Dr Huifu Xu School of Mathematics University of Southampton Highfield SO17 1BJ Southampton Email: h.xu@soton.ac.uk 1 Introduction What is optimization? Optimization is

More information

Applying the Q n Estimator Online

Applying the Q n Estimator Online Applying the Q n Estimator Online Robin Nunkesser 1, Karen Schettlinger 2, and Roland Fried 2 1 Department of Computer Science, Univ. Dortmund, 44221 Dortmund Robin.Nunkesser@udo.edu 2 Department of Statistics,

More information

Lecture 8 Parallel Algorithms II

Lecture 8 Parallel Algorithms II Lecture 8 Parallel Algorithms II Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Original slides from Introduction to Parallel

More information

Detecting Clusters and Outliers for Multidimensional

Detecting Clusters and Outliers for Multidimensional Kennesaw State University DigitalCommons@Kennesaw State University Faculty Publications 2008 Detecting Clusters and Outliers for Multidimensional Data Yong Shi Kennesaw State University, yshi5@kennesaw.edu

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set - solutions Thursday, October What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove. Do not

More information

Sorting Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Sorting Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar Sorting Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Issues in Sorting on Parallel

More information

CS675: Convex and Combinatorial Optimization Spring 2018 Consequences of the Ellipsoid Algorithm. Instructor: Shaddin Dughmi

CS675: Convex and Combinatorial Optimization Spring 2018 Consequences of the Ellipsoid Algorithm. Instructor: Shaddin Dughmi CS675: Convex and Combinatorial Optimization Spring 2018 Consequences of the Ellipsoid Algorithm Instructor: Shaddin Dughmi Outline 1 Recapping the Ellipsoid Method 2 Complexity of Convex Optimization

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

CS264: Beyond Worst-Case Analysis Lecture #19: Self-Improving Algorithms

CS264: Beyond Worst-Case Analysis Lecture #19: Self-Improving Algorithms CS264: Beyond Worst-Case Analysis Lecture #19: Self-Improving Algorithms Tim Roughgarden March 14, 2017 1 Preliminaries The last few lectures discussed several interpolations between worst-case and average-case

More information

Representation Learning for Clustering: A Statistical Framework

Representation Learning for Clustering: A Statistical Framework Representation Learning for Clustering: A Statistical Framework Hassan Ashtiani School of Computer Science University of Waterloo mhzokaei@uwaterloo.ca Shai Ben-David School of Computer Science University

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Exploiting a database to predict the in-flight stability of the F-16

Exploiting a database to predict the in-flight stability of the F-16 Exploiting a database to predict the in-flight stability of the F-16 David Amsallem and Julien Cortial December 12, 2008 1 Introduction Among the critical phenomena that have to be taken into account when

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Approximate Nearest Line Search in High Dimensions

Approximate Nearest Line Search in High Dimensions Approximate Nearest Line Search in High Dimensions Sepideh Mahabadi MIT mahabadi@mit.edu Abstract We consider the Approximate Nearest Line Search (NLS) problem. Given a set L of N lines in the high dimensional

More information

Training Digital Circuits with Hamming Clustering

Training Digital Circuits with Hamming Clustering IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 47, NO. 4, APRIL 2000 513 Training Digital Circuits with Hamming Clustering Marco Muselli, Member, IEEE, and Diego

More information

HMMT February 2018 February 10, 2018

HMMT February 2018 February 10, 2018 HMMT February 2018 February 10, 2018 Combinatorics 1. Consider a 2 3 grid where each entry is one of 0, 1, and 2. For how many such grids is the sum of the numbers in every row and in every column a multiple

More information

A New Pool Control Method for Boolean Compressed Sensing Based Adaptive Group Testing

A New Pool Control Method for Boolean Compressed Sensing Based Adaptive Group Testing Proceedings of APSIPA Annual Summit and Conference 27 2-5 December 27, Malaysia A New Pool Control Method for Boolean Compressed Sensing Based Adaptive roup Testing Yujia Lu and Kazunori Hayashi raduate

More information

Evaluating Robot Systems

Evaluating Robot Systems Evaluating Robot Systems November 6, 2008 There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 5.1 Introduction You should all know a few ways of sorting in O(n log n)

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Notes in Computational Geometry Voronoi Diagrams

Notes in Computational Geometry Voronoi Diagrams Notes in Computational Geometry Voronoi Diagrams Prof. Sandeep Sen and Prof. Amit Kumar Indian Institute of Technology, Delhi Voronoi Diagrams In this lecture, we study Voronoi Diagrams, also known as

More information

Smoothing Dissimilarities for Cluster Analysis: Binary Data and Functional Data

Smoothing Dissimilarities for Cluster Analysis: Binary Data and Functional Data Smoothing Dissimilarities for Cluster Analysis: Binary Data and unctional Data David B. University of South Carolina Department of Statistics Joint work with Zhimin Chen University of South Carolina Current

More information

UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES

UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES Golnoosh Elhami, Adam Scholefield, Benjamín Béjar Haro and Martin Vetterli School of Computer and Communication Sciences École Polytechnique

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

Fundamental Properties of Graphs

Fundamental Properties of Graphs Chapter three In many real-life situations we need to know how robust a graph that represents a certain network is, how edges or vertices can be removed without completely destroying the overall connectivity,

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

The Fibonacci hypercube

The Fibonacci hypercube AUSTRALASIAN JOURNAL OF COMBINATORICS Volume 40 (2008), Pages 187 196 The Fibonacci hypercube Fred J. Rispoli Department of Mathematics and Computer Science Dowling College, Oakdale, NY 11769 U.S.A. Steven

More information

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs Graphs and Network Flows IE411 Lecture 21 Dr. Ted Ralphs IE411 Lecture 21 1 Combinatorial Optimization and Network Flows In general, most combinatorial optimization and integer programming problems are

More information

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech Foundations of Machine Learning CentraleSupélec Paris Fall 2016 7. Nearest neighbors Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning

More information